Norbert Schmidt 02b8c27835 Upgrade to LTX-2.3 with audio generation
- Switch from mlx_video.generate_av to mlx_video.models.ltx_2.generate
- Use prince-canuma/LTX-2.3-distilled model with google/gemma-3-12b-it text encoder
- Add --audio flag for joint audio-video generation
- Add auto-background execution with nohup logging
- Add CLAUDE.md and test stories

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-31 13:55:39 +02:00

mlx-video-maker

Generate multi-scene AI videos with seamless transitions using I2V (Image-to-Video) chaining on Apple Silicon.

Example Output

https://github.com/user-attachments/assets/REPLACE_WITH_VIDEO_ASSET_ID

Sample scene from "The Local AI Revolution" - see examples/ for the video file and stories/local_ai_revolution.txt for the full story prompts.

The Power: LLM + Prompting Guide

The real magic is combining an LLM (Claude, local models, etc.) with the included promptguide.md. Feed the guide to your LLM, describe your scene in plain language, and get optimized prompts following best practices:

  • Flowing narrative paragraphs (not bullet lists)
  • Present-tense action verbs
  • Explicit camera movements and lens choices
  • Audio cues for synchronized generation
  • The six essential elements for every scene

Example workflow:

You: "Write me a 5-scene story about a detective investigating an abandoned warehouse"
LLM: [Uses promptguide.md to craft cinematic, LTX-2 optimized prompts]

The guide covers lens language, shutter terminology, video type strategies, and troubleshooting tips.

The Technique

Scene 1 (T2V) → Extract Last Frame → Scene 2 (I2V) → Extract Last Frame → Scene 3 (I2V) → ...

Each scene's last frame becomes the input image for the next scene, creating visual continuity across an entire movie. In theory, you could generate hour-long films this way.
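As a rough illustration of the chain, the dry-run sketch below prints how each scene would be generated without invoking any model. It is not the logic of generate_story.sh; the function name plan_chain and the scene_NNN_last.png naming are assumptions made for this example.

```shell
# Dry-run sketch of I2V chaining: print how each scene would be produced.
# plan_chain and the scene_NNN_last.png naming are hypothetical; they are
# not taken from generate_story.sh.
plan_chain() {
  local total out_dir i prev
  total=$1
  out_dir=$2
  i=1
  while [ "$i" -le "$total" ]; do
    if [ "$i" -eq 1 ]; then
      # The first scene has no predecessor, so it is text-to-video.
      printf 'scene %d: T2V (no input image)\n' "$i"
    else
      # Later scenes are conditioned on the previous scene's last frame.
      prev=$(printf '%s/scene_%03d_last.png' "$out_dir" "$((i - 1))")
      printf 'scene %d: I2V from %s\n' "$i" "$prev"
    fi
    i=$((i + 1))
  done
}

plan_chain 3 output
# scene 1: T2V (no input image)
# scene 2: I2V from output/scene_001_last.png
# scene 3: I2V from output/scene_002_last.png
```

Because each scene depends on the one before it, the scenes have to be generated strictly in order.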

How It Works

  1. First scene: Text-to-Video generation (no image input)
  2. Frame extraction: ffprobe counts frames, ffmpeg extracts the last frame
  3. Subsequent scenes: Image-to-Video with --image pointing to the previous scene's last frame
  4. Final concat: High-quality ffmpeg merge of all scenes
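The frame extraction and merge steps above could be sketched with standard ffprobe and ffmpeg invocations. This is a minimal sketch, not the code in generate_story.sh: the helper names (count_frames, last_frame_index, extract_last_frame, concat_list) are hypothetical, and using the concat demuxer with a stream copy is an assumption about what a "high-quality" merge means here.

```shell
# Hypothetical helpers; names and file layout are assumptions for illustration.

# Count the decoded video frames of the first video stream.
count_frames() {
  ffprobe -v error -select_streams v:0 -count_frames \
    -show_entries stream=nb_read_frames \
    -of default=nokey=1:noprint_wrappers=1 "$1"
}

# Frames are numbered from 0, so the last index is count - 1.
last_frame_index() {
  echo $(( $1 - 1 ))
}

# Save the last frame of a video as an image: extract_last_frame in.mp4 out.png
extract_last_frame() {
  local video png idx
  video=$1
  png=$2
  idx=$(last_frame_index "$(count_frames "$video")")
  ffmpeg -y -v error -i "$video" -vf "select=eq(n\,${idx})" -frames:v 1 "$png"
}

# Print a concat-demuxer list (one "file '...'" line per scene) to stdout.
# Assumes the paths contain no single quotes.
concat_list() {
  local f
  for f in "$@"; do
    printf "file '%s'\n" "$f"
  done
}
```

A merge could then look like `concat_list "$PWD"/output/scene_*.mp4 > list.txt` followed by `ffmpeg -f concat -safe 0 -i list.txt -c copy movie.mp4`. The stream copy avoids re-encoding, which only works when all scenes share the same codec, resolution, and framerate.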

Quick Start

# Clone the repo
git clone https://github.com/YOUR_USERNAME/mlx-video-maker.git
cd mlx-video-maker

# Create a story file (one prompt per line)
cat > stories/my_story.txt << 'EOF'
# Scene 1
Wide aerial shot of misty mountains at dawn, camera slowly descending, cinematic, 4K

# Scene 2
Camera continues revealing a lone hiker on the ridge, steady tracking shot, cinematic, 4K

# Scene 3
Close-up of hiker's face illuminated by golden sunrise, emotional, cinematic, 4K
EOF

# Generate the movie
./generate_story.sh stories/my_story.txt output/

For long generations:

mkdir -p output  # the log redirect fails if output/ does not exist yet
nohup ./generate_story.sh stories/my_story.txt output/ > output/nohup.out 2>&1 &
tail -f output/nohup.out  # Monitor progress

Options

./generate_story.sh <story_file> [output_dir] [options]

Options:
  --width       Video width (default: 1920, must be divisible by 64)
  --height      Video height (default: 1088, must be divisible by 64)
  --frames      Frames per scene (default: 121, must equal 1 + 8*k for an integer k, e.g. 121 = 1 + 8*15)
  --strength    I2V conditioning strength 0.0-1.0 (default: 0.7)
  --fps         Output framerate (default: 24)
  --python      Python executable (default: ./venv/bin/python)
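
The size and frame-count constraints listed above can be checked before starting a long run. The sketch below uses plain shell arithmetic; the function names are hypothetical, and whether generate_story.sh performs the same checks is not stated here.

```shell
# Hypothetical validation helpers for the documented constraints.

# Return 0 if the dimension is a positive multiple of 64.
valid_dimension() {
  [ "$1" -gt 0 ] && [ $(( $1 % 64 )) -eq 0 ]
}

# Return 0 if the frame count has the form 1 + 8*k for an integer k >= 0.
valid_frames() {
  [ "$1" -ge 1 ] && [ $(( ($1 - 1) % 8 )) -eq 0 ]
}

# Round a frame count (>= 1) down to the nearest valid value.
snap_frames() {
  echo $(( ($1 - 1) / 8 * 8 + 1 ))
}

snap_frames 120   # prints 113
```

For example, `valid_dimension 1080` fails because 1080 = 16*64 + 56, which is consistent with the default height being 1088 rather than 1080.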

Image Strength Guide

Value     Effect
0.5-0.6   Strong visual continuity, less motion freedom
0.7       Sweet spot - balanced continuity and new content
0.8-0.9   More variation, potential visual jumps

Story File Format

Plain text, one prompt per line:

  • Lines starting with # are comments (ignored)
  • Empty lines are ignored
  • Each non-comment line = one scene
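
To preview how a story file would be split into scenes, the rules above can be approximated with grep. This is a sketch under assumptions: the function names are hypothetical, and treating indented comments and whitespace-only lines as ignorable is a guess about how generate_story.sh behaves.

```shell
# Hypothetical helpers that mirror the documented story file rules.

# Print one prompt per line, skipping '#' comments and blank lines.
# Assumption: leading whitespace before '#' still counts as a comment.
read_prompts() {
  grep -v -e '^[[:space:]]*#' -e '^[[:space:]]*$' "$1"
}

# Number of scenes the story file would produce.
count_scenes() {
  read_prompts "$1" | wc -l | tr -d ' '
}
```

Running `count_scenes stories/my_story.txt` on the Quick Start example would report 3 scenes, since the three "# Scene N" lines and the blank lines are skipped.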

Pro tip: Use consistent style suffixes across all prompts for visual coherence:

..., cinematic, nature documentary style, 4K
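
One way to keep the suffix identical across scenes is to append it mechanically instead of typing it on every line. The filter below is a sketch; add_suffix is a hypothetical helper, not part of this repo.

```shell
# Hypothetical filter: append one style suffix to every prompt on stdin.
# Comment lines and empty lines pass through unchanged.
add_suffix() {
  local suffix line
  suffix=$1
  while IFS= read -r line || [ -n "$line" ]; do
    case $line in
      ''|'#'*) printf '%s\n' "$line" ;;
      *) printf '%s, %s\n' "$line" "$suffix" ;;
    esac
  done
}

printf '# Scene 1\nA hiker walks along the ridge\n' | add_suffix 'cinematic, 4K'
# # Scene 1
# A hiker walks along the ridge, cinematic, 4K
```

Usage would be along the lines of `add_suffix 'cinematic, nature documentary style, 4K' < draft.txt > stories/my_story.txt`, where draft.txt is an assumed file holding the prompts without a suffix.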

Prompt Engineering

See promptguide.md for the complete guide. Key points:

  • Write flowing narrative paragraphs, not lists
  • Use present-tense verbs: "walks", "turns", "reaches"
  • Specify camera explicitly: "slow dolly forward", "steady tracking shot"
  • Include audio cues: "distant traffic hum", "footsteps echoing"
  • Add consistent style suffixes across all scenes

Example prompt:

A lone fisherman rows across a foggy lake before sunrise, the boat creaking softly
as water laps at its sides. The camera glides overhead in a slow aerial tracking shot,
following his steady progress from behind and slightly above. His lantern casts a warm
circle of light that reflects in gentle ripples, while tall reeds sway on the distant
shoreline.

Performance

On M3 Max (128GB RAM):

  • ~20-22 minutes per scene at 1920x1088, 121 frames
  • ~4 hours for a 10-scene movie (~50 seconds of footage at 24 fps)
  • ~75GB peak memory usage

Scaling: a 1-hour movie needs about 720 scenes (~5 seconds each at 121 frames and 24 fps), which works out to roughly 10 days of generation time.
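
The figures above follow from simple arithmetic, sketched here with integer shell math. The 21 minutes per scene is an assumed midpoint of the measured 20-22 minute range, and the function names are hypothetical.

```shell
# Hypothetical estimators using integer arithmetic (results are rounded down).

# Seconds of footage: scenes * frames_per_scene / fps
movie_seconds() {
  echo $(( $1 * $2 / $3 ))
}

# Generation time in whole hours: scenes * minutes_per_scene / 60
generation_hours() {
  echo $(( $1 * $2 / 60 ))
}

movie_seconds 10 121 24     # prints 50
movie_seconds 720 121 24    # prints 3630 (about one hour)
generation_hours 720 21     # prints 252 (about 10.5 days)
```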

Requirements

  • Apple Silicon Mac (M1/M2/M3/M4)
  • 64GB+ RAM recommended (32GB minimum at lower resolution)
  • mlx-video with LTX-2 model
  • ffmpeg and ffprobe
  • Python 3.11+
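
A quick preflight check can confirm that the command-line tools are on PATH before a multi-hour run. This is a sketch; require_tools is a hypothetical helper and is not part of generate_story.sh.

```shell
# Hypothetical preflight helper.
# Return 0 if every named tool is on PATH; report missing ones on stderr.
require_tools() {
  local tool missing
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      printf 'missing: %s\n' "$tool" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

A typical call would be `require_tools ffmpeg ffprobe || exit 1` at the top of a wrapper script.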

Installation

# Install mlx-video (if not already installed)
pip install mlx-video

# Install ffmpeg
brew install ffmpeg

# Make script executable
chmod +x generate_story.sh

Future Ideas

  • Scene quality detection using image-to-text models (auto-regenerate poor scenes)
  • Different transition styles (fade, match cut)
  • Branching narratives (generate multiple versions, pick best)
  • Audio continuity chaining
  • Checkpoint recovery for long generations

Credits

  • LTX-Video (LTX-2) by Lightricks - The DiT (diffusion transformer) model that powers the video generation
  • mlx-video by Prince Canuma (@Blaizzy) - MLX port enabling Apple Silicon native inference
  • MLX by Apple - The ML framework for Apple Silicon

License

MIT


Built with mlx-video and LTX-2 on Apple Silicon
