Norbert Schmidt
2026-01-27 19:26:12 +01:00
parent a9d8be7441
commit 53559c06cb
6 changed files with 430 additions and 0 deletions

23
.gitignore vendored Normal file

@@ -0,0 +1,23 @@
# Output files
output/*
!output/.gitkeep
*.mp4
*.wav
*.jpg
*_lastframe.jpg
concat_list.txt
generation.log
nohup.out
# Python
__pycache__/
*.py[cod]
.venv/
venv/
# macOS
.DS_Store
# IDE
.idea/
.vscode/

174
README.md

@@ -1,2 +1,176 @@
# mlx-video-maker
Generate multi-scene AI videos with seamless transitions using I2V (Image-to-Video) chaining on Apple Silicon.
## Example Output
https://github.com/user-attachments/assets/REPLACE_WITH_VIDEO_ASSET_ID
> Sample scene from "The Local AI Revolution" - see [examples/](examples/) for the video file and [stories/local_ai_revolution.txt](stories/local_ai_revolution.txt) for the full story prompts.
## The Power: LLM + Prompting Guide
The real magic is combining an LLM (Claude, local models, etc.) with the included **[promptguide.md](promptguide.md)**. Feed the guide to your LLM, describe your scene in plain language, and get optimized prompts following best practices:
- Flowing narrative paragraphs (not bullet lists)
- Present-tense action verbs
- Explicit camera movements and lens choices
- Audio cues for synchronized generation
- The six essential elements for every scene
Example workflow:
```
You: "Write me a 5-scene story about a detective investigating an abandoned warehouse"
LLM: [Uses promptguide.md to craft cinematic, LTX-2 optimized prompts]
```
The guide covers lens language, shutter terminology, video type strategies, and troubleshooting tips.
## The Technique
```
Scene 1 (T2V) → Extract Last Frame → Scene 2 (I2V) → Extract Last Frame → Scene 3 (I2V) → ...
```
Each scene's last frame becomes the input image for the next scene, creating visual continuity across an entire movie. In theory, you could generate hour-long films this way.
## How It Works
1. **First scene**: Text-to-Video generation (no image input)
2. **Frame extraction**: `ffprobe` counts frames, `ffmpeg` extracts the last frame
3. **Subsequent scenes**: Image-to-Video with `--image` pointing to the previous scene's last frame
4. **Final concat**: High-quality `ffmpeg` merge of all scenes
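The chaining order in the steps above can be sketched as a dry run that only prints which input each scene would be generated from. `plan_chain` is a hypothetical helper and not part of this repository; the file names assume the `sceneN.mp4` / `sceneN_lastframe.jpg` pattern that `generate_story.sh` uses.

```bash
# Dry-run sketch: print the input each scene would be generated from.
# plan_chain is a hypothetical helper, not part of generate_story.sh.
plan_chain() {
    num_scenes="$1"
    i=1
    while [ "$i" -le "$num_scenes" ]; do
        if [ "$i" -eq 1 ]; then
            # Scene 1 has no predecessor, so it is text-to-video.
            echo "scene${i}.mp4 <- T2V (prompt only)"
        else
            # Later scenes are conditioned on the previous scene's last frame.
            prev=$((i - 1))
            echo "scene${i}.mp4 <- I2V from scene${prev}_lastframe.jpg"
        fi
        i=$((i + 1))
    done
}

plan_chain 3
```

Because each scene depends only on the previous scene's last frame, an interrupted run can resume from the last finished scene without redoing earlier ones.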
## Quick Start
```bash
# Clone the repo
git clone https://github.com/YOUR_USERNAME/mlx-video-maker.git
cd mlx-video-maker
# Create a story file (one prompt per line)
cat > stories/my_story.txt << 'EOF'
# Scene 1
Wide aerial shot of misty mountains at dawn, camera slowly descending, cinematic, 4K
# Scene 2
Camera continues revealing a lone hiker on the ridge, steady tracking shot, cinematic, 4K
# Scene 3
Close-up of hiker's face illuminated by golden sunrise, emotional, cinematic, 4K
EOF
# Generate the movie
./generate_story.sh stories/my_story.txt output/
```
For long generations:
```bash
nohup ./generate_story.sh stories/my_story.txt output/ > output/nohup.out 2>&1 &
tail -f output/nohup.out # Monitor progress
```
## Options
```
./generate_story.sh <story_file> [output_dir] [options]

Options:
  --width      Video width (default: 1920, must be divisible by 64)
  --height     Video height (default: 1088, must be divisible by 64)
  --frames     Frames per scene (default: 121, must satisfy 1 + 8*k)
  --strength   I2V conditioning strength 0.0-1.0 (default: 0.7)
  --fps        Output framerate (default: 24)
  --python     Python executable (default: ./venv/bin/python)
```
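`generate_story.sh` as written does not check these constraints itself, so it can be worth validating values before starting a long run. A minimal sketch, assuming integer inputs; the helper names `is_multiple_of_64` and `is_valid_frame_count` are hypothetical and not part of the script:

```bash
# Hypothetical helpers (not part of generate_story.sh) for the constraints above.
# Both assume their argument is an integer.

# Succeeds when the value is a positive multiple of 64.
is_multiple_of_64() {
    [ "$1" -gt 0 ] && [ $(( $1 % 64 )) -eq 0 ]
}

# Succeeds when the frame count has the form 1 + 8*k for some k >= 0.
is_valid_frame_count() {
    [ "$1" -ge 1 ] && [ $(( ($1 - 1) % 8 )) -eq 0 ]
}

if is_multiple_of_64 1920 && is_multiple_of_64 1088 && is_valid_frame_count 121; then
    echo "defaults OK"
fi
```

Note that 1080 is not a multiple of 64 (1080 = 16 * 64 + 56), which is presumably why the default height is 1088.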
### Image Strength Guide
| Value | Effect |
|-------|--------|
| 0.5-0.6 | Strong visual continuity, less motion freedom |
| **0.7** | **Sweet spot** - balanced continuity and new content |
| 0.8-0.9 | More variation, potential visual jumps |
## Story File Format
Plain text, one prompt per line:
- Lines starting with `#` are comments (ignored)
- Empty lines are ignored
- Each non-comment line = one scene
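To check how many scenes a story file will produce before starting a run, the rules above can be applied with a single `grep`. This is a sketch, not part of the repository; `count_scenes` is a hypothetical helper, and it also treats whitespace-only lines and indented `#` lines as non-scenes, which matches how `generate_story.sh` reads the file.

```bash
# Hypothetical helper: count the scenes in a story file, assuming the rules
# above (comment lines and blank lines do not count).
count_scenes() {
    # -v inverts the match, -c counts lines, -E enables extended regex.
    # The pattern matches lines that are blank, whitespace-only, or comments.
    grep -cvE '^[[:space:]]*(#|$)' "$1"
}

# Example: a throwaway story file with two scenes.
story="$(mktemp)"
printf '%s\n' \
    '# Scene 1' \
    'Wide aerial shot of misty mountains at dawn' \
    '' \
    '# Scene 2' \
    'Close-up of a hiker on the ridge' > "$story"
count_scenes "$story"
rm -f "$story"
```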
**Pro tip**: Use consistent style suffixes across all prompts for visual coherence:
```
..., cinematic, nature documentary style, 4K
```
## Prompt Engineering
See **[promptguide.md](promptguide.md)** for the complete guide. Key points:
- Write flowing narrative paragraphs, not lists
- Use present-tense verbs: "walks", "turns", "reaches"
- Specify camera explicitly: "slow dolly forward", "steady tracking shot"
- Include audio cues: "distant traffic hum", "footsteps echoing"
- Add consistent style suffixes across all scenes
**Example prompt:**
```
A lone fisherman rows across a foggy lake before sunrise, the boat creaking softly
as water laps at its sides. The camera glides overhead in a slow aerial tracking shot,
following his steady progress from behind and slightly above. His lantern casts a warm
circle of light that reflects in gentle ripples, while tall reeds sway on the distant
shoreline.
```
## Performance
On M3 Max (128GB RAM):
- ~20-22 minutes per scene at 1920x1088, 121 frames (about 5 seconds of video at 24 fps)
- ~4 hours for a 10-scene movie (~50 seconds of video)
- ~75GB peak memory usage
**Scaling**: 720 scenes = roughly a 1-hour movie = ~10 days of generation time
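The scaling figures follow from simple arithmetic, sketched here with integer math. The helper names are hypothetical, and the 21 minutes per scene is an assumed midpoint of the 20-22 minute range above:

```bash
# Hypothetical helpers for back-of-the-envelope estimates (integer arithmetic).

# Total movie length in seconds: scenes * frames_per_scene / fps.
movie_seconds() {
    echo $(( $1 * $2 / $3 ))
}

# Total generation time in hours: scenes * minutes_per_scene / 60.
generation_hours() {
    echo $(( $1 * $2 / 60 ))
}

movie_seconds 720 121 24     # 3630 seconds, about one hour
generation_hours 720 21      # 252 hours, about 10.5 days
```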
## Requirements
- Apple Silicon Mac (M1/M2/M3/M4)
- 64GB+ RAM recommended (32GB minimum at lower resolution)
- [mlx-video](https://github.com/Blaizzy/mlx-video) with LTX-2 model
- ffmpeg and ffprobe
- Python 3.11+
### Installation
```bash
# Install mlx-video (if not already installed)
pip install mlx-video
# Install ffmpeg
brew install ffmpeg
# Make script executable
chmod +x generate_story.sh
```
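Since a missing `ffmpeg` or `ffprobe` would only surface after the first scene has been generated, a quick pre-flight check can save time. A minimal sketch; `missing_tools` is a hypothetical helper, not part of the repository:

```bash
# Hypothetical pre-flight check: print every required command that is missing.
missing_tools() {
    for tool in "$@"; do
        # command -v succeeds only if the shell can resolve the name.
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

missing="$(missing_tools ffmpeg ffprobe)"
if [ -n "$missing" ]; then
    # Unquoted on purpose so the names are joined onto one line.
    echo "Missing tools:" $missing
else
    echo "ffmpeg and ffprobe found"
fi
```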
## Future Ideas
- Scene quality detection using image-to-text models (auto-regenerate poor scenes)
- Different transition styles (fade, match cut)
- Branching narratives (generate multiple versions, pick best)
- Audio continuity chaining
- Checkpoint recovery for long generations
## Credits
- [LTX-Video (LTX-2)](https://github.com/Lightricks/LTX-Video) by Lightricks - The 2B parameter DiT model that powers the video generation
- [mlx-video](https://github.com/Blaizzy/mlx-video) by Prince Canuma ([@Blaizzy](https://github.com/Blaizzy)) - MLX port enabling Apple Silicon native inference
- [MLX](https://github.com/ml-explore/mlx) by Apple - The ML framework for Apple Silicon
## License
MIT
---
*Built with [mlx-video](https://github.com/Blaizzy/mlx-video) and [LTX-2](https://github.com/Lightricks/LTX-Video) on Apple Silicon*

210
generate_story.sh Executable file

@@ -0,0 +1,210 @@
#!/bin/bash
#
# generate_story.sh - Generate multi-scene AI videos with I2V chaining
#
# Usage: ./generate_story.sh <story_file> [output_dir] [options]
#
set -e
# Default settings
WIDTH=1920
HEIGHT=1088
FRAMES=121
STRENGTH=0.7
FPS=24
VENV_PYTHON="${VENV_PYTHON:-./venv/bin/python}"
OUTPUT_DIR="./output"
# Colors
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[0;33m'
BLUE='\033[0;34m'
NC='\033[0m'
usage() {
    echo "Usage: $0 <story_file> [output_dir] [options]"
    echo ""
    echo "Arguments:"
    echo "  story_file    Text file with one prompt per line"
    echo "  output_dir    Directory to save output files (default: ./output)"
    echo ""
    echo "Options:"
    echo "  --width       Video width (default: 1920)"
    echo "  --height      Video height (default: 1088)"
    echo "  --frames      Frames per scene (default: 121)"
    echo "  --strength    I2V conditioning strength 0.0-1.0 (default: 0.7)"
    echo "  --fps         Output framerate (default: 24)"
    echo "  --python      Python executable (default: ./venv/bin/python)"
    echo ""
    echo "Example:"
    echo "  $0 stories/mountain.txt output/ --width 1024 --height 768"
    exit 1
}
# Parse arguments
if [ $# -lt 1 ]; then
    usage
fi
STORY_FILE="$1"
shift 1
# Check if second arg is output dir (not an option starting with --)
if [ $# -gt 0 ] && [[ "$1" != --* ]]; then
    OUTPUT_DIR="$1"
    shift 1
fi
while [ $# -gt 0 ]; do
    case "$1" in
        --width) WIDTH="$2"; shift 2 ;;
        --height) HEIGHT="$2"; shift 2 ;;
        --frames) FRAMES="$2"; shift 2 ;;
        --strength) STRENGTH="$2"; shift 2 ;;
        --fps) FPS="$2"; shift 2 ;;
        --python) VENV_PYTHON="$2"; shift 2 ;;
        *) echo "Unknown option: $1"; usage ;;
    esac
done
# Validate inputs
if [ ! -f "$STORY_FILE" ]; then
    echo -e "${RED}Error: Story file not found: $STORY_FILE${NC}"
    exit 1
fi
# Create output directory
mkdir -p "$OUTPUT_DIR"
# Read prompts into array (compatible with bash 3, skip comments and empty lines)
PROMPTS=()
while IFS= read -r line || [[ -n "$line" ]]; do
    # Skip empty lines and comments
    [[ -z "$line" ]] && continue
    [[ "$line" =~ ^[[:space:]]*$ ]] && continue
    [[ "$line" =~ ^[[:space:]]*# ]] && continue
    PROMPTS+=("$line")
done < "$STORY_FILE"
NUM_SCENES=${#PROMPTS[@]}
if [ $NUM_SCENES -eq 0 ]; then
    echo -e "${RED}Error: No prompts found in $STORY_FILE${NC}"
    exit 1
fi
# Get story name from filename
STORY_NAME=$(basename "$STORY_FILE" .txt)
echo -e "${BLUE}========================================${NC}"
echo -e "${BLUE} mlx-video-maker${NC}"
echo -e "${BLUE}========================================${NC}"
echo ""
echo -e "Story: ${GREEN}$STORY_NAME${NC}"
echo -e "Scenes: ${GREEN}$NUM_SCENES${NC}"
echo -e "Resolution: ${GREEN}${WIDTH}x${HEIGHT}${NC}"
echo -e "Frames/scene: ${GREEN}$FRAMES${NC}"
echo -e "I2V strength: ${GREEN}$STRENGTH${NC}"
echo -e "Output: ${GREEN}$OUTPUT_DIR${NC}"
echo ""
# Resolve VENV_PYTHON to absolute path before changing directories
if [[ "$VENV_PYTHON" == ./* ]]; then
    VENV_PYTHON="$(pwd)/${VENV_PYTHON:2}"
fi
# Generate scenes
cd "$OUTPUT_DIR"
for i in $(seq 1 $NUM_SCENES); do
    IDX=$((i-1))
    PROMPT="${PROMPTS[$IDX]}"
    SCENE_FILE="scene${i}.mp4"
    # Skip if already exists
    if [ -f "$SCENE_FILE" ]; then
        echo -e "${YELLOW}Scene $i already exists, skipping...${NC}"
        continue
    fi
    echo ""
    echo -e "${BLUE}=== Scene $i / $NUM_SCENES ===${NC}"
    echo -e "${GREEN}Prompt:${NC} ${PROMPT:0:80}..."
    echo ""
    if [ $i -eq 1 ]; then
        # First scene: Text-to-Video
        # (interpreter path is quoted so paths containing spaces work)
        "$VENV_PYTHON" -m mlx_video.generate_av \
            --prompt "$PROMPT" \
            --height "$HEIGHT" \
            --width "$WIDTH" \
            --num-frames "$FRAMES" \
            --fps "$FPS" \
            --seed $((42 + i)) \
            --output-path "$SCENE_FILE"
    else
        # Subsequent scenes: Image-to-Video
        PREV=$((i-1))
        PREV_FILE="scene${PREV}.mp4"
        LAST_FRAME="scene${PREV}_lastframe.jpg"
        # Extract last frame from previous scene
        if [ ! -f "$LAST_FRAME" ]; then
            echo -e "${YELLOW}Extracting last frame from scene $PREV...${NC}"
            FRAME_COUNT=$(ffprobe -v error -select_streams v:0 -count_frames \
                -show_entries stream=nb_read_frames \
                -of default=nokey=1:noprint_wrappers=1 "$PREV_FILE")
            LAST_IDX=$((FRAME_COUNT - 1))
            ffmpeg -i "$PREV_FILE" -vf "select=eq(n\\,$LAST_IDX)" \
                -vframes 1 -q:v 2 "$LAST_FRAME" -y 2>/dev/null
            # ffmpeg can exit 0 without writing a frame, so check the result
            if [ ! -s "$LAST_FRAME" ]; then
                echo -e "${RED}Error: Could not extract last frame from $PREV_FILE${NC}"
                exit 1
            fi
        fi
        # Generate with I2V
        "$VENV_PYTHON" -m mlx_video.generate_av \
            --prompt "$PROMPT" \
            --image "$LAST_FRAME" \
            --image-strength "$STRENGTH" \
            --height "$HEIGHT" \
            --width "$WIDTH" \
            --num-frames "$FRAMES" \
            --fps "$FPS" \
            --seed $((42 + i)) \
            --output-path "$SCENE_FILE"
    fi
    echo -e "${GREEN}Scene $i complete!${NC}"
done
echo ""
echo -e "${BLUE}=== Concatenating final movie ===${NC}"
# Create concat list
CONCAT_LIST="concat_list.txt"
> "$CONCAT_LIST"
for i in $(seq 1 $NUM_SCENES); do
    if [ -f "scene${i}.mp4" ]; then
        echo "file 'scene${i}.mp4'" >> "$CONCAT_LIST"
    fi
done
# Concatenate with high quality encoding
FINAL_FILE="${STORY_NAME}.mp4"
ffmpeg -f concat -safe 0 -i "$CONCAT_LIST" \
    -c:v libx264 -crf 18 -preset slow \
    -c:a aac -b:a 192k \
    "$FINAL_FILE" -y
# Get duration
DURATION=$(ffprobe -v error -show_entries format=duration \
    -of default=noprint_wrappers=1:nokey=1 "$FINAL_FILE")
echo ""
echo -e "${GREEN}========================================${NC}"
echo -e "${GREEN} Complete!${NC}"
echo -e "${GREEN}========================================${NC}"
echo ""
echo -e "Final movie: ${BLUE}$(pwd)/$FINAL_FILE${NC}"
echo -e "Duration: ${BLUE}${DURATION%.*} seconds${NC}"
echo ""

0
output/.gitkeep Normal file

1
requirements.txt Normal file

@@ -0,0 +1 @@
git+https://github.com/Blaizzy/mlx-video.git

View File

@@ -0,0 +1,22 @@
# The Local AI Revolution - Tech Nerds vs Big Tech
# Scene 1: The Corporate Dystopia
A towering glass skyscraper dominates the frame against a cold gray sky, its facade displaying massive glowing logos of tech giants. The camera begins with a slow crane shot ascending the building's reflective surface, capturing drones delivering packages and surveillance cameras tracking pedestrians below. People walk in uniform patterns, faces illuminated by identical smartphones, ambient electronic hum mixing with corporate jingles echoing from street speakers. The scene feels sterile and controlled, deep blue color grading with harsh artificial lighting, shot on 50mm lens with clinical precision, cinematic dystopian aesthetic.
# Scene 2: The Underground Lab
In a cluttered basement apartment, four tech nerds huddle around multiple monitors displaying terminal windows and neural network visualizations. The camera tracks slowly through the space at eye level, passing towers of servers built from repurposed hardware, walls covered in whiteboards filled with equations. A woman with short purple hair types rapidly, her face lit by screen glow as she mutters "Almost there." Cables snake across the floor, cooling fans whir loudly, empty energy drink cans pile beside keyboards. Warm amber practical lighting contrasts the cold blue of screens, handheld documentary style, intimate 35mm framing.
# Scene 3: The Breakthrough
Close-up on weathered hands hovering over a mechanical keyboard as a progress bar hits one hundred percent. The camera holds steady with shallow depth of field as the programmer's eyes widen, reflecting cascading green text. He slowly turns to his companions, a grin spreading across his bearded face as he whispers "It's running. Fully local. No cloud. No tracking." The others lean in, the glow from the screen illuminating their expressions of disbelief turning to joy. A ceiling fan spins lazily overhead, the hum of the local GPU cluster providing a triumphant drone, warm golden hour light streaming through dusty blinds, 85mm portrait lens, Kodak film emulation.
# Scene 4: Spreading the Word
Split-screen montage showing the revolution spreading across the world. On the left, a teenager in Tokyo installs the open-source AI on a laptop in a cramped bedroom. On the right, a group of students in Berlin gather around a single computer in a university library. The camera cross-dissolves between locations as hands download, compile, and share. Reddit threads and forum posts cascade across screens, download counters climbing exponentially. Night scenes transition to day across time zones, the warm glow of screens in darkened rooms, energetic editing rhythm, ambient electronic score building momentum, wide establishing shots mixed with intimate close-ups of hopeful faces.
# Scene 5: Big Tech Reacts
Inside a sterile corporate boardroom, executives in expensive suits stare at plummeting user graphs on a massive wall display. The camera dollies slowly around the long glass table, capturing sweating brows and loosened ties. A CEO slams his fist down, his voice echoing "Shut it down. All of it." Security teams scramble, server rooms flash red alerts, but the footage cuts to show home servers still humming in garages and basements worldwide, untouchable and decentralized. Sharp contrast between cold corporate blues and warm residential ambers, tension building through crosscut editing, dramatic orchestral undertones mixing with digital glitch sounds, anamorphic widescreen framing.
# Scene 6: The New Internet
A diverse group of people gather in a sunny public park, laptops open, sharing local AI assistants that run without internet connection. Children learn from patient digital tutors, artists generate images on tablets, elderly users dictate messages in their native languages. The camera sweeps through the scene in a smooth steadicam shot, capturing laughter and genuine human connection. No corporate logos visible, no surveillance drones overhead, just people owning their own technology. Golden hour sunlight bathes everything in warm tones, depth of field shifts between faces and screens, birdsong and genuine conversation replace the electronic hum of the opening, 24mm wide-angle lens capturing the communal atmosphere, hopeful documentary aesthetic.
# Scene 7: The Original Hackers Watch
The four original tech nerds sit on a rooftop at sunset, laptops closed for once, watching the city below where countless windows glow with the warm light of local AI. The woman with purple hair raises a bottle of craft beer, the others follow. The camera slowly pulls back in a crane shot revealing the vast urban landscape dotted with individual lights rather than corporate towers. She speaks softly, "We didn't take anything. We just gave everyone the tools to be free." The amber sunset reflects off building windows, a gentle breeze rustles their hair, distant sounds of a liberated city drift upward, 50mm lens with natural film grain, Kodak 2383 print emulation, triumphant yet humble ending.