Initial
23
.gitignore
vendored
Normal file
@@ -0,0 +1,23 @@
# Output files
output/*
!output/.gitkeep
*.mp4
*.wav
*.jpg
*_lastframe.jpg
concat_list.txt
generation.log
nohup.out

# Python
__pycache__/
*.py[cod]
.venv/
venv/

# macOS
.DS_Store

# IDE
.idea/
.vscode/
174
README.md
@@ -1,2 +1,176 @@
# mlx-video-maker

Generate multi-scene AI videos with seamless transitions using I2V (Image-to-Video) chaining on Apple Silicon.

## Example Output

https://github.com/user-attachments/assets/REPLACE_WITH_VIDEO_ASSET_ID

> Sample scene from "The Local AI Revolution" - see [examples/](examples/) for the video file and [stories/local_ai_revolution.txt](stories/local_ai_revolution.txt) for the full story prompts.

## The Power: LLM + Prompting Guide

The real magic is combining an LLM (Claude, local models, etc.) with the included **[promptguide.md](promptguide.md)**. Feed the guide to your LLM, describe your scene in plain language, and get optimized prompts following best practices:

- Flowing narrative paragraphs (not bullet lists)
- Present-tense action verbs
- Explicit camera movements and lens choices
- Audio cues for synchronized generation
- The six essential elements for every scene

Example workflow:
```
You: "Write me a 5-scene story about a detective investigating an abandoned warehouse"
LLM: [Uses promptguide.md to craft cinematic, LTX-2 optimized prompts]
```

The guide covers lens language, shutter terminology, video type strategies, and troubleshooting tips.

## The Technique

```
Scene 1 (T2V) → Extract Last Frame → Scene 2 (I2V) → Extract Last Frame → Scene 3 (I2V) → ...
```

Each scene's last frame becomes the input image for the next scene, creating visual continuity across an entire movie. In theory, you could generate hour-long films this way.

## How It Works

1. **First scene**: Text-to-Video generation (no image input)
2. **Frame extraction**: `ffprobe` counts frames, `ffmpeg` extracts the last frame
3. **Subsequent scenes**: Image-to-Video with `--image` pointing to the previous scene's last frame
4. **Final concat**: High-quality `ffmpeg` merge of all scenes

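The frame-extraction step can be sketched as a pair of shell functions. This is a minimal sketch that mirrors the `ffprobe` and `ffmpeg` invocations in `generate_story.sh`; the function names `last_frame_index` and `extract_last_frame` are illustrative and do not exist in the script.

```shell
#!/bin/bash

# Zero-based index of the final frame, given a total frame count.
# (Hypothetical helper name, used only in this sketch.)
last_frame_index() {
    echo $(( $1 - 1 ))
}

# Write the last frame of a video to a JPEG.
# Usage: extract_last_frame <video.mp4> <out.jpg>
extract_last_frame() {
    local video="$1" out="$2" count
    # Count decoded frames in the first video stream
    count=$(ffprobe -v error -select_streams v:0 -count_frames \
        -show_entries stream=nb_read_frames \
        -of default=nokey=1:noprint_wrappers=1 "$video")
    # Select exactly that frame and save it at high JPEG quality
    ffmpeg -i "$video" -vf "select=eq(n\\,$(last_frame_index "$count"))" \
        -vframes 1 -q:v 2 "$out" -y 2>/dev/null
}
```

For example, `extract_last_frame scene1.mp4 scene1_lastframe.jpg` would produce the image passed to scene 2 via `--image`. With the default of 121 frames per scene, the selected frame index is 120.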
## Quick Start

```bash
# Clone the repo
git clone https://github.com/YOUR_USERNAME/mlx-video-maker.git
cd mlx-video-maker

# Create a story file (one prompt per line)
cat > stories/my_story.txt << 'EOF'
# Scene 1
Wide aerial shot of misty mountains at dawn, camera slowly descending, cinematic, 4K

# Scene 2
Camera continues revealing a lone hiker on the ridge, steady tracking shot, cinematic, 4K

# Scene 3
Close-up of hiker's face illuminated by golden sunrise, emotional, cinematic, 4K
EOF

# Generate the movie
./generate_story.sh stories/my_story.txt output/
```

For long generations:
```bash
nohup ./generate_story.sh stories/my_story.txt output/ > output/nohup.out 2>&1 &
tail -f output/nohup.out  # Monitor progress
```

## Options

```
./generate_story.sh <story_file> [output_dir] [options]

Options:
  --width      Video width (default: 1920, must be divisible by 64)
  --height     Video height (default: 1088, must be divisible by 64)
  --frames     Frames per scene (default: 121, must satisfy 1 + 8*k)
  --strength   I2V conditioning strength 0.0-1.0 (default: 0.7)
  --fps        Output framerate (default: 24)
  --python     Python executable (default: ./venv/bin/python)
```

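The constraints listed above are not checked by `generate_story.sh` itself, so a bad value only surfaces once generation starts. A minimal sketch of pre-flight checks follows; the function names are hypothetical, and it assumes `k` in `1 + 8*k` may be any non-negative integer.

```shell
#!/bin/bash

# Succeeds if the value is a positive integer divisible by 64.
is_multiple_of_64() {
    [[ "$1" =~ ^[0-9]+$ ]] && (( 10#$1 > 0 && 10#$1 % 64 == 0 ))
}

# Succeeds if the frame count has the form 1 + 8*k (assumption: k >= 0).
is_valid_frame_count() {
    [[ "$1" =~ ^[0-9]+$ ]] && (( 10#$1 >= 1 && (10#$1 - 1) % 8 == 0 ))
}

# Succeeds if the strength is a decimal between 0.0 and 1.0 inclusive.
is_valid_strength() {
    local re='^(0(\.[0-9]+)?|1(\.0+)?)$'
    [[ "$1" =~ $re ]]
}
```

The defaults pass all three checks (1920 = 30 × 64, 1088 = 17 × 64, 121 = 1 + 8 × 15), while a common value such as 1080 does not. One way to use them is `is_multiple_of_64 "$WIDTH" || { echo "width must be divisible by 64"; exit 1; }` before the first scene is generated.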
### Image Strength Guide

| Value | Effect |
|-------|--------|
| 0.5-0.6 | Strong visual continuity, less motion freedom |
| **0.7** | **Sweet spot** - balanced continuity and new content |
| 0.8-0.9 | More variation, potential visual jumps |

## Story File Format

Plain text, one prompt per line:
- Lines starting with `#` are comments (ignored)
- Empty lines are ignored
- Each non-comment line = one scene

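The parsing rules above can be sketched as a small function that follows the same logic as the read loop in `generate_story.sh`. The name `read_prompts` is illustrative. It is also handy for checking how many scenes a story file will produce before starting a long run.

```shell
#!/bin/bash

# Print one prompt per line from a story file,
# skipping blank lines and lines whose first non-space character is '#'.
read_prompts() {
    local line
    # The "|| [[ -n ... ]]" clause keeps a final line that lacks a trailing newline
    while IFS= read -r line || [[ -n "$line" ]]; do
        [[ "$line" =~ ^[[:space:]]*$ ]] && continue
        [[ "$line" =~ ^[[:space:]]*# ]] && continue
        printf '%s\n' "$line"
    done < "$1"
}
```

For example, `read_prompts stories/my_story.txt | wc -l` prints the number of scenes.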
**Pro tip**: Use consistent style suffixes across all prompts for visual coherence:
```
..., cinematic, nature documentary style, 4K
```

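One way to keep the suffix identical across scenes is to write the prompts without it and append it in a single pass. The helper below is a hypothetical sketch, not part of this repo: it reads a story file on stdin and leaves comments and blank lines untouched.

```shell
#!/bin/bash

# Append a shared style suffix to every prompt read from stdin.
# Blank lines and '#' comment lines pass through unchanged.
append_style_suffix() {
    local suffix="$1" line
    while IFS= read -r line || [[ -n "$line" ]]; do
        if [[ "$line" =~ ^[[:space:]]*$ || "$line" =~ ^[[:space:]]*# ]]; then
            printf '%s\n' "$line"
        else
            printf '%s, %s\n' "$line" "$suffix"
        fi
    done
}
```

For example, `append_style_suffix "cinematic, nature documentary style, 4K" < draft.txt > stories/my_story.txt`, where `draft.txt` is a placeholder name for the suffix-free draft.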
## Prompt Engineering

See **[promptguide.md](promptguide.md)** for the complete guide. Key points:

- Write flowing narrative paragraphs, not lists
- Use present-tense verbs: "walks", "turns", "reaches"
- Specify camera explicitly: "slow dolly forward", "steady tracking shot"
- Include audio cues: "distant traffic hum", "footsteps echoing"
- Add consistent style suffixes across all scenes

**Example prompt:**
```
A lone fisherman rows across a foggy lake before sunrise, the boat creaking softly
as water laps at its sides. The camera glides overhead in a slow aerial tracking shot,
following his steady progress from behind and slightly above. His lantern casts a warm
circle of light that reflects in gentle ripples, while tall reeds sway on the distant
shoreline.
```

## Performance

On M3 Max (128GB RAM):
- ~20-22 minutes per scene at 1920x1088, 121 frames
- ~4 hours for a 10-scene movie (~50 seconds of footage)
- ~75GB peak memory usage

**Scaling**: 720 scenes (about 5 seconds each) = 1-hour movie = ~10 days of generation time

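The figures above follow from simple arithmetic, sketched here with two illustrative helpers. The 21 minutes per scene used in the example is an assumed midpoint of the 20-22 minute range, not a measured value.

```shell
#!/bin/bash

# Seconds of footage: scenes * frames_per_scene / fps (integer division).
footage_seconds() {
    echo $(( $1 * $2 / $3 ))
}

# Total generation time in minutes: scenes * minutes_per_scene.
generation_minutes() {
    echo $(( $1 * $2 ))
}
```

At the defaults (121 frames, 24 fps), `footage_seconds 10 121 24` gives 50 and `footage_seconds 720 121 24` gives 3630, just over an hour. `generation_minutes 720 21` gives 15120 minutes, which is 252 hours or about 10.5 days.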
## Requirements

- Apple Silicon Mac (M1/M2/M3/M4)
- 64GB+ RAM recommended (32GB minimum at lower resolution)
- [mlx-video](https://github.com/Blaizzy/mlx-video) with LTX-2 model
- ffmpeg and ffprobe
- Python 3.11+

### Installation

```bash
# Create the virtual environment the script uses by default (./venv)
python3 -m venv venv

# Install mlx-video (if not already), as listed in requirements.txt
./venv/bin/pip install -r requirements.txt

# Install ffmpeg
brew install ffmpeg

# Make script executable
chmod +x generate_story.sh
```

## Future Ideas

- Scene quality detection using image-to-text models (auto-regenerate poor scenes)
- Different transition styles (fade, match cut)
- Branching narratives (generate multiple versions, pick best)
- Audio continuity chaining
- Checkpoint recovery for long generations

## Credits

- [LTX-Video (LTX-2)](https://github.com/Lightricks/LTX-Video) by Lightricks - The 2B parameter DiT model that powers the video generation
- [mlx-video](https://github.com/Blaizzy/mlx-video) by Prince Canuma ([@Blaizzy](https://github.com/Blaizzy)) - MLX port enabling Apple Silicon native inference
- [MLX](https://github.com/ml-explore/mlx) by Apple - The ML framework for Apple Silicon

## License

MIT

---

*Built with [mlx-video](https://github.com/Blaizzy/mlx-video) and [LTX-2](https://github.com/Lightricks/LTX-Video) on Apple Silicon*
210
generate_story.sh
Executable file
@@ -0,0 +1,210 @@
#!/bin/bash
#
# generate_story.sh - Generate multi-scene AI videos with I2V chaining
#
# Usage: ./generate_story.sh <story_file> [output_dir] [options]
#

set -e

# Default settings
WIDTH=1920
HEIGHT=1088
FRAMES=121
STRENGTH=0.7
FPS=24
VENV_PYTHON="${VENV_PYTHON:-./venv/bin/python}"
OUTPUT_DIR="./output"

# Colors
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[0;33m'
BLUE='\033[0;34m'
NC='\033[0m'

usage() {
    echo "Usage: $0 <story_file> [output_dir] [options]"
    echo ""
    echo "Arguments:"
    echo "  story_file    Text file with one prompt per line"
    echo "  output_dir    Directory to save output files (default: ./output)"
    echo ""
    echo "Options:"
    echo "  --width       Video width (default: 1920)"
    echo "  --height      Video height (default: 1088)"
    echo "  --frames      Frames per scene (default: 121)"
    echo "  --strength    I2V conditioning strength 0.0-1.0 (default: 0.7)"
    echo "  --fps         Output framerate (default: 24)"
    echo "  --python      Python executable (default: ./venv/bin/python)"
    echo ""
    echo "Example:"
    echo "  $0 stories/mountain.txt output/ --width 1024 --height 768"
    exit 1
}

# Parse arguments
if [ $# -lt 1 ]; then
    usage
fi

STORY_FILE="$1"
shift 1

# Check if second arg is output dir (not an option starting with --)
if [ $# -gt 0 ] && [[ "$1" != --* ]]; then
    OUTPUT_DIR="$1"
    shift 1
fi

while [ $# -gt 0 ]; do
    case "$1" in
        --width) WIDTH="$2"; shift 2 ;;
        --height) HEIGHT="$2"; shift 2 ;;
        --frames) FRAMES="$2"; shift 2 ;;
        --strength) STRENGTH="$2"; shift 2 ;;
        --fps) FPS="$2"; shift 2 ;;
        --python) VENV_PYTHON="$2"; shift 2 ;;
        *) echo "Unknown option: $1"; usage ;;
    esac
done

# Validate inputs
if [ ! -f "$STORY_FILE" ]; then
    echo -e "${RED}Error: Story file not found: $STORY_FILE${NC}"
    exit 1
fi

# Create output directory
mkdir -p "$OUTPUT_DIR"

# Read prompts into array (compatible with bash 3, skip comments and empty lines)
PROMPTS=()
while IFS= read -r line || [[ -n "$line" ]]; do
    # Skip empty lines and comments
    [[ -z "$line" ]] && continue
    [[ "$line" =~ ^[[:space:]]*$ ]] && continue
    [[ "$line" =~ ^[[:space:]]*# ]] && continue
    PROMPTS+=("$line")
done < "$STORY_FILE"
NUM_SCENES=${#PROMPTS[@]}

if [ $NUM_SCENES -eq 0 ]; then
    echo -e "${RED}Error: No prompts found in $STORY_FILE${NC}"
    exit 1
fi

# Get story name from filename
STORY_NAME=$(basename "$STORY_FILE" .txt)

echo -e "${BLUE}========================================${NC}"
echo -e "${BLUE} mlx-video-maker${NC}"
echo -e "${BLUE}========================================${NC}"
echo ""
echo -e "Story: ${GREEN}$STORY_NAME${NC}"
echo -e "Scenes: ${GREEN}$NUM_SCENES${NC}"
echo -e "Resolution: ${GREEN}${WIDTH}x${HEIGHT}${NC}"
echo -e "Frames/scene: ${GREEN}$FRAMES${NC}"
echo -e "I2V strength: ${GREEN}$STRENGTH${NC}"
echo -e "Output: ${GREEN}$OUTPUT_DIR${NC}"
echo ""

# Resolve VENV_PYTHON to absolute path before changing directories
if [[ "$VENV_PYTHON" == ./* ]]; then
    VENV_PYTHON="$(pwd)/${VENV_PYTHON:2}"
fi

# Generate scenes
cd "$OUTPUT_DIR"

for i in $(seq 1 $NUM_SCENES); do
    IDX=$((i-1))
    PROMPT="${PROMPTS[$IDX]}"
    SCENE_FILE="scene${i}.mp4"

    # Skip if already exists
    if [ -f "$SCENE_FILE" ]; then
        echo -e "${YELLOW}Scene $i already exists, skipping...${NC}"
        continue
    fi

    echo ""
    echo -e "${BLUE}=== Scene $i / $NUM_SCENES ===${NC}"
    echo -e "${GREEN}Prompt:${NC} ${PROMPT:0:80}..."
    echo ""

    if [ $i -eq 1 ]; then
        # First scene: Text-to-Video
        # (interpreter path is quoted so a working directory containing spaces does not break the call)
        "$VENV_PYTHON" -m mlx_video.generate_av \
            --prompt "$PROMPT" \
            --height "$HEIGHT" \
            --width "$WIDTH" \
            --num-frames "$FRAMES" \
            --fps "$FPS" \
            --seed $((42 + i)) \
            --output-path "$SCENE_FILE"
    else
        # Subsequent scenes: Image-to-Video
        PREV=$((i-1))
        PREV_FILE="scene${PREV}.mp4"
        LAST_FRAME="scene${PREV}_lastframe.jpg"

        # Extract last frame from previous scene
        if [ ! -f "$LAST_FRAME" ]; then
            echo -e "${YELLOW}Extracting last frame from scene $PREV...${NC}"
            FRAME_COUNT=$(ffprobe -v error -select_streams v:0 -count_frames \
                -show_entries stream=nb_read_frames \
                -of default=nokey=1:noprint_wrappers=1 "$PREV_FILE")
            LAST_IDX=$((FRAME_COUNT - 1))
            ffmpeg -i "$PREV_FILE" -vf "select=eq(n\\,$LAST_IDX)" \
                -vframes 1 -q:v 2 "$LAST_FRAME" -y 2>/dev/null
        fi

        # Generate with I2V
        "$VENV_PYTHON" -m mlx_video.generate_av \
            --prompt "$PROMPT" \
            --image "$LAST_FRAME" \
            --image-strength "$STRENGTH" \
            --height "$HEIGHT" \
            --width "$WIDTH" \
            --num-frames "$FRAMES" \
            --fps "$FPS" \
            --seed $((42 + i)) \
            --output-path "$SCENE_FILE"
    fi

    echo -e "${GREEN}Scene $i complete!${NC}"
done

echo ""
echo -e "${BLUE}=== Concatenating final movie ===${NC}"

# Create concat list
CONCAT_LIST="concat_list.txt"
> "$CONCAT_LIST"
for i in $(seq 1 $NUM_SCENES); do
    if [ -f "scene${i}.mp4" ]; then
        echo "file 'scene${i}.mp4'" >> "$CONCAT_LIST"
    fi
done

# Concatenate with high quality encoding
FINAL_FILE="${STORY_NAME}.mp4"
ffmpeg -f concat -safe 0 -i "$CONCAT_LIST" \
    -c:v libx264 -crf 18 -preset slow \
    -c:a aac -b:a 192k \
    "$FINAL_FILE" -y

# Get duration
DURATION=$(ffprobe -v error -show_entries format=duration \
    -of default=noprint_wrappers=1:nokey=1 "$FINAL_FILE")

echo ""
echo -e "${GREEN}========================================${NC}"
echo -e "${GREEN} Complete!${NC}"
echo -e "${GREEN}========================================${NC}"
echo ""
echo -e "Final movie: ${BLUE}$(pwd)/$FINAL_FILE${NC}"
echo -e "Duration: ${BLUE}${DURATION%.*} seconds${NC}"
echo ""
0
output/.gitkeep
Normal file
1
requirements.txt
Normal file
@@ -0,0 +1 @@
git+https://github.com/Blaizzy/mlx-video.git
22
stories/local_ai_revolution.txt
Normal file
@@ -0,0 +1,22 @@
# The Local AI Revolution - Tech Nerds vs Big Tech

# Scene 1: The Corporate Dystopia
A towering glass skyscraper dominates the frame against a cold gray sky, its facade displaying massive glowing logos of tech giants. The camera begins with a slow crane shot ascending the building's reflective surface, capturing drones delivering packages and surveillance cameras tracking pedestrians below. People walk in uniform patterns, faces illuminated by identical smartphones, ambient electronic hum mixing with corporate jingles echoing from street speakers. The scene feels sterile and controlled, deep blue color grading with harsh artificial lighting, shot on 50mm lens with clinical precision, cinematic dystopian aesthetic.

# Scene 2: The Underground Lab
In a cluttered basement apartment, four tech nerds huddle around multiple monitors displaying terminal windows and neural network visualizations. The camera tracks slowly through the space at eye level, passing towers of servers built from repurposed hardware, walls covered in whiteboards filled with equations. A woman with short purple hair types rapidly, her face lit by screen glow as she mutters "Almost there." Cables snake across the floor, cooling fans whir loudly, empty energy drink cans pile beside keyboards. Warm amber practical lighting contrasts the cold blue of screens, handheld documentary style, intimate 35mm framing.

# Scene 3: The Breakthrough
Close-up on weathered hands hovering over a mechanical keyboard as a progress bar hits one hundred percent. The camera holds steady with shallow depth of field as the programmer's eyes widen, reflecting cascading green text. He slowly turns to his companions, a grin spreading across his bearded face as he whispers "It's running. Fully local. No cloud. No tracking." The others lean in, the glow from the screen illuminating their expressions of disbelief turning to joy. A ceiling fan spins lazily overhead, the hum of the local GPU cluster providing a triumphant drone, warm golden hour light streaming through dusty blinds, 85mm portrait lens, Kodak film emulation.

# Scene 4: Spreading the Word
Split-screen montage showing the revolution spreading across the world. On the left, a teenager in Tokyo installs the open-source AI on a laptop in a cramped bedroom. On the right, a group of students in Berlin gather around a single computer in a university library. The camera cross-dissolves between locations as hands download, compile, and share. Reddit threads and forum posts cascade across screens, download counters climbing exponentially. Night scenes transition to day across time zones, the warm glow of screens in darkened rooms, energetic editing rhythm, ambient electronic score building momentum, wide establishing shots mixed with intimate close-ups of hopeful faces.

# Scene 5: Big Tech Reacts
Inside a sterile corporate boardroom, executives in expensive suits stare at plummeting user graphs on a massive wall display. The camera dollies slowly around the long glass table, capturing sweating brows and loosened ties. A CEO slams his fist down, his voice echoing "Shut it down. All of it." Security teams scramble, server rooms flash red alerts, but the footage cuts to show home servers still humming in garages and basements worldwide, untouchable and decentralized. Sharp contrast between cold corporate blues and warm residential ambers, tension building through crosscut editing, dramatic orchestral undertones mixing with digital glitch sounds, anamorphic widescreen framing.

# Scene 6: The New Internet
A diverse group of people gather in a sunny public park, laptops open, sharing local AI assistants that run without internet connection. Children learn from patient digital tutors, artists generate images on tablets, elderly users dictate messages in their native languages. The camera sweeps through the scene in a smooth steadicam shot, capturing laughter and genuine human connection. No corporate logos visible, no surveillance drones overhead, just people owning their own technology. Golden hour sunlight bathes everything in warm tones, depth of field shifts between faces and screens, birdsong and genuine conversation replace the electronic hum of the opening, 24mm wide-angle lens capturing the communal atmosphere, hopeful documentary aesthetic.

# Scene 7: The Original Hackers Watch
The four original tech nerds sit on a rooftop at sunset, laptops closed for once, watching the city below where countless windows glow with the warm light of local AI. The woman with purple hair raises a bottle of craft beer, the others follow. The camera slowly pulls back in a crane shot revealing the vast urban landscape dotted with individual lights rather than corporate towers. She speaks softly, "We didn't take anything. We just gave everyone the tools to be free." The amber sunset reflects off building windows, a gentle breeze rustles their hair, distant sounds of a liberated city drift upward, 50mm lens with natural film grain, Kodak 2383 print emulation, triumphant yet humble ending.