- Switch from mlx_video.generate_av to mlx_video.models.ltx_2.generate - Use prince-canuma/LTX-2.3-distilled model with google/gemma-3-12b-it text encoder - Add --audio flag for joint audio-video generation - Add auto-background execution with nohup logging - Add CLAUDE.md and test stories Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
mlx-video-maker
Generate multi-scene AI videos with seamless transitions using I2V (Image-to-Video) chaining on Apple Silicon.
Example Output
https://github.com/user-attachments/assets/REPLACE_WITH_VIDEO_ASSET_ID
Sample scene from "The Local AI Revolution" - see examples/ for the video file and stories/local_ai_revolution.txt for the full story prompts.
The Power: LLM + Prompting Guide
The real magic is combining an LLM (Claude, local models, etc.) with the included promptguide.md. Feed the guide to your LLM, describe your scene in plain language, and get optimized prompts following best practices:
- Flowing narrative paragraphs (not bullet lists)
- Present-tense action verbs
- Explicit camera movements and lens choices
- Audio cues for synchronized generation
- The six essential elements for every scene
Example workflow:
You: "Write me a 5-scene story about a detective investigating an abandoned warehouse"
LLM: [Uses promptguide.md to craft cinematic, LTX-2 optimized prompts]
The guide covers lens language, shutter terminology, video type strategies, and troubleshooting tips.
The Technique
Scene 1 (T2V) → Extract Last Frame → Scene 2 (I2V) → Extract Last Frame → Scene 3 (I2V) → ...
Each scene's last frame becomes the input image for the next scene, creating visual continuity across an entire movie. In theory, you could generate hour-long films this way.
How It Works
- First scene: Text-to-Video generation (no image input)
- Frame extraction:
ffprobecounts frames,ffmpegextracts the last frame - Subsequent scenes: Image-to-Video with
--imagepointing to the previous scene's last frame - Final concat: High-quality
ffmpegmerge of all scenes
Quick Start
# Clone the repo
git clone https://github.com/YOUR_USERNAME/mlx-video-maker.git
cd mlx-video-maker
# Create a story file (one prompt per line)
cat > stories/my_story.txt << 'EOF'
# Scene 1
Wide aerial shot of misty mountains at dawn, camera slowly descending, cinematic, 4K
# Scene 2
Camera continues revealing a lone hiker on the ridge, steady tracking shot, cinematic, 4K
# Scene 3
Close-up of hiker's face illuminated by golden sunrise, emotional, cinematic, 4K
EOF
# Generate the movie
./generate_story.sh stories/my_story.txt output/
For long generations:
nohup ./generate_story.sh stories/my_story.txt output/ > output/nohup.out 2>&1 &
tail -f output/nohup.out # Monitor progress
Options
./generate_story.sh <story_file> [output_dir] [options]
Options:
--width Video width (default: 1920, must be divisible by 64)
--height Video height (default: 1088, must be divisible by 64)
--frames Frames per scene (default: 121, must satisfy 1 + 8*k)
--strength I2V conditioning strength 0.0-1.0 (default: 0.7)
--fps Output framerate (default: 24)
--python Python executable (default: ./venv/bin/python)
Image Strength Guide
| Value | Effect |
|---|---|
| 0.5-0.6 | Strong visual continuity, less motion freedom |
| 0.7 | Sweet spot - balanced continuity and new content |
| 0.8-0.9 | More variation, potential visual jumps |
Story File Format
Plain text, one prompt per line:
- Lines starting with
#are comments (ignored) - Empty lines are ignored
- Each non-comment line = one scene
Pro tip: Use consistent style suffixes across all prompts for visual coherence:
..., cinematic, nature documentary style, 4K
Prompt Engineering
See promptguide.md for the complete guide. Key points:
- Write flowing narrative paragraphs, not lists
- Use present-tense verbs: "walks", "turns", "reaches"
- Specify camera explicitly: "slow dolly forward", "steady tracking shot"
- Include audio cues: "distant traffic hum", "footsteps echoing"
- Add consistent style suffixes across all scenes
Example prompt:
A lone fisherman rows across a foggy lake before sunrise, the boat creaking softly
as water laps at its sides. The camera glides overhead in a slow aerial tracking shot,
following his steady progress from behind and slightly above. His lantern casts a warm
circle of light that reflects in gentle ripples, while tall reeds sway on the distant
shoreline.
Performance
On M3 Max (128GB RAM):
- ~20-22 minutes per scene at 1920x1088, 121 frames
- ~4 hours for a 10-scene movie (~50 seconds)
- ~75GB peak memory usage
Scaling: 720 scenes = 1 hour movie = ~10 days generation time
Requirements
- Apple Silicon Mac (M1/M2/M3/M4)
- 64GB+ RAM recommended (32GB minimum at lower resolution)
- mlx-video with LTX-2 model
- ffmpeg and ffprobe
- Python 3.11+
Installation
# Install mlx-video (if not already)
pip install mlx-video
# Install ffmpeg
brew install ffmpeg
# Make script executable
chmod +x generate_story.sh
Future Ideas
- Scene quality detection using image-to-text models (auto-regenerate poor scenes)
- Different transition styles (fade, match cut)
- Branching narratives (generate multiple versions, pick best)
- Audio continuity chaining
- Checkpoint recovery for long generations
Credits
- LTX-Video (LTX-2) by Lightricks - The 2B parameter DiT model that powers the video generation
- mlx-video by Prince Canuma (@Blaizzy) - MLX port enabling Apple Silicon native inference
- MLX by Apple - The ML framework for Apple Silicon
License
MIT