Initial

2026-01-27 19:26:12 +01:00
parent a9d8be7441
commit 53559c06cb
6 changed files with 430 additions and 0 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -0,0 +1,23 @@
 # Output files
 output/*
 !output/.gitkeep
 *.mp4
 *.wav
 *.jpg
 *_lastframe.jpg
 concat_list.txt
 generation.log
 nohup.out
 # Python
 __pycache__/
 *.py[cod]
 .venv/
 venv/
 # macOS
 .DS_Store
 # IDE
 .idea/
 .vscode/
--- a/README.md
+++ b/README.md
@@ -1,2 +1,176 @@
 # mlx-video-maker
 Generate multi-scene AI videos with seamless transitions using I2V (Image-to-Video) chaining on Apple Silicon.
 ## Example Output
 https://github.com/user-attachments/assets/REPLACE_WITH_VIDEO_ASSET_ID
 > Sample scene from "The Local AI Revolution" - see [examples/](examples/) for the video file and [stories/local_ai_revolution.txt](stories/local_ai_revolution.txt) for the full story prompts.
 ## The Power: LLM + Prompting Guide
 The real magic is combining an LLM (Claude, local models, etc.) with the included **[promptguide.md](promptguide.md)**. Feed the guide to your LLM, describe your scene in plain language, and get optimized prompts following best practices:
 - Flowing narrative paragraphs (not bullet lists)
 - Present-tense action verbs
 - Explicit camera movements and lens choices
 - Audio cues for synchronized generation
 - The six essential elements for every scene
 Example workflow:
 ```
 You: "Write me a 5-scene story about a detective investigating an abandoned warehouse"
 LLM: [Uses promptguide.md to craft cinematic, LTX-2 optimized prompts]
 ```
 The guide covers lens language, shutter terminology, video type strategies, and troubleshooting tips.
 ## The Technique
 ```
 Scene 1 (T2V) → Extract Last Frame → Scene 2 (I2V) → Extract Last Frame → Scene 3 (I2V) → ...
 ```
 Each scene's last frame becomes the input image for the next scene, creating visual continuity across an entire movie. In theory, you could generate hour-long films this way.
 ## How It Works
 1. **First scene**: Text-to-Video generation (no image input)
 2. **Frame extraction**: `ffprobe` counts frames, `ffmpeg` extracts the last frame
 3. **Subsequent scenes**: Image-to-Video with `--image` pointing to the previous scene's last frame
 4. **Final concat**: High-quality `ffmpeg` merge of all scenes
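The chaining order described above can be sketched as a dry run that only prints which mode and which input image each scene would use. This is a minimal illustration, not the script itself: `plan_chain` is a hypothetical helper, while the `sceneN.mp4` and `sceneN_lastframe.jpg` names follow the naming used by `generate_story.sh`.

```shell
# Hypothetical helper: print the generation plan for a given number of scenes.
plan_chain() {
    local num_scenes="$1"
    local i=1
    while [ "$i" -le "$num_scenes" ]; do
        if [ "$i" -eq 1 ]; then
            # The first scene has no previous frame to condition on
            echo "scene${i}.mp4: T2V from prompt only"
        else
            # Every later scene is conditioned on the previous scene's last frame
            echo "scene${i}.mp4: I2V from scene$((i - 1))_lastframe.jpg"
        fi
        i=$((i + 1))
    done
}

plan_chain 3
```

Because each scene depends on the output of the one before it, scenes have to be generated strictly in order.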
 ## Quick Start
 ```bash
 # Clone the repo
 git clone https://github.com/YOUR_USERNAME/mlx-video-maker.git
 cd mlx-video-maker
 # Create a story file (one prompt per line)
 cat > stories/my_story.txt << 'EOF'
 # Scene 1
 Wide aerial shot of misty mountains at dawn, camera slowly descending, cinematic, 4K
 # Scene 2
 Camera continues revealing a lone hiker on the ridge, steady tracking shot, cinematic, 4K
 # Scene 3
 Close-up of hiker's face illuminated by golden sunrise, emotional, cinematic, 4K
 EOF
 # Generate the movie
 ./generate_story.sh stories/my_story.txt output/
 ```
 For long generations:
 ```bash
 nohup ./generate_story.sh stories/my_story.txt output/ > output/nohup.out 2>&1 &
 tail -f output/nohup.out  # Monitor progress
 ```
 ## Options
 ```
 ./generate_story.sh <story_file> [output_dir] [options]
 Options:
  --width       Video width (default: 1920, must be divisible by 64)
  --height      Video height (default: 1088, must be divisible by 64)
  --frames      Frames per scene (default: 121, must satisfy 1 + 8*k)
  --strength    I2V conditioning strength 0.0-1.0 (default: 0.7)
  --fps         Output framerate (default: 24)
  --python      Python executable (default: ./venv/bin/python)
 ```
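The size constraints in the option list can be checked before starting a long run. The sketch below assumes only the rules stated above (width and height divisible by 64, frame count of the form 1 + 8*k); `validate_dims` is a hypothetical helper and is not part of `generate_story.sh`.

```shell
# Hypothetical helper: check width, height and frame count against the documented rules.
validate_dims() {
    local width="$1"
    local height="$2"
    local frames="$3"
    if [ $(( width % 64 )) -ne 0 ]; then
        echo "width must be divisible by 64"
        return 1
    fi
    if [ $(( height % 64 )) -ne 0 ]; then
        echo "height must be divisible by 64"
        return 1
    fi
    # Valid frame counts are 1, 9, 17, ..., i.e. 1 + 8*k
    if [ "$frames" -lt 1 ] || [ $(( (frames - 1) % 8 )) -ne 0 ]; then
        echo "frames must be 1 + 8*k"
        return 1
    fi
    echo "ok"
}

validate_dims 1920 1088 121
validate_dims 1920 1080 121 || true
```

The second call fails on the height check, which is why the default height is 1088 rather than 1080.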
 ### Image Strength Guide
 | Value | Effect |
 |-------|--------|
 | 0.5-0.6 | Strong visual continuity, less motion freedom |
 | **0.7** | **Sweet spot** - balanced continuity and new content |
 | 0.8-0.9 | More variation, potential visual jumps |
 ## Story File Format
 Plain text, one prompt per line:
 - Lines starting with `#` are comments (ignored)
 - Empty lines are ignored
 - Each non-comment line = one scene
 **Pro tip**: Use consistent style suffixes across all prompts for visual coherence:
 ```
 ..., cinematic, nature documentary style, 4K
 ```
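To check how many scenes a story file will produce before committing to a long run, the comment and blank-line rules above can be applied with `grep`. This is a sketch under the assumption that the rules are exactly those listed in this section; `count_scenes` is a hypothetical helper name.

```shell
# Hypothetical helper: count the lines that would be treated as scene prompts.
count_scenes() {
    # -v inverts the match and -c counts lines; the pattern matches
    # comment lines (optional whitespace, then #) and blank lines
    grep -cvE '^[[:space:]]*(#|$)' "$1"
}

if [ -f stories/my_story.txt ]; then
    count_scenes stories/my_story.txt
fi
```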
 ## Prompt Engineering
 See **[promptguide.md](promptguide.md)** for the complete guide. Key points:
 - Write flowing narrative paragraphs, not lists
 - Use present-tense verbs: "walks", "turns", "reaches"
 - Specify camera explicitly: "slow dolly forward", "steady tracking shot"
 - Include audio cues: "distant traffic hum", "footsteps echoing"
 - Add consistent style suffixes across all scenes
 **Example prompt:**
 ```
 A lone fisherman rows across a foggy lake before sunrise, the boat creaking softly
 as water laps at its sides. The camera glides overhead in a slow aerial tracking shot,
 following his steady progress from behind and slightly above. His lantern casts a warm
 circle of light that reflects in gentle ripples, while tall reeds sway on the distant
 shoreline.
 ```
 ## Performance
 On M3 Max (128GB RAM):
 - ~20-22 minutes per scene at 1920x1088, 121 frames
 - ~4 hours for a 10-scene movie (~50 seconds of video)
 - ~75GB peak memory usage
 **Scaling**: 720 scenes ≈ 1 hour of video ≈ 10 days of generation time
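The scaling figures follow from simple arithmetic, sketched below with integer math. The defaults of 121 frames and 24 fps come from the options above; 21 minutes per scene is an assumed midpoint of the 20-22 minute range, and `estimate` is a hypothetical helper.

```shell
# Hypothetical helper: rough video length and generation time for a scene count.
estimate() {
    local scenes="$1"
    local frames="${2:-121}"
    local fps="${3:-24}"
    local min_per_scene="${4:-21}"
    # Integer division rounds down, which is fine for a rough estimate
    local video_s=$(( scenes * frames / fps ))
    local gen_min=$(( scenes * min_per_scene ))
    echo "${scenes} scenes: ~${video_s}s of video, ~${gen_min} min to generate"
}

estimate 10
estimate 720
```

For 720 scenes this gives roughly 3630 seconds of video and 15120 minutes of generation, which is about 10.5 days.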
 ## Requirements
 - Apple Silicon Mac (M1/M2/M3/M4)
 - 64GB+ RAM recommended (32GB minimum at lower resolution)
 - [mlx-video](https://github.com/Blaizzy/mlx-video) with LTX-2 model
 - ffmpeg and ffprobe
 - Python 3.11+
 ### Installation
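 ```bash
 # Create the virtual environment the script expects (default --python is ./venv/bin/python)
 python3 -m venv venv
 # Install mlx-video from the Git source listed in requirements.txt
 ./venv/bin/pip install -r requirements.txt
 # Install ffmpeg (also provides ffprobe)
 brew install ffmpeg
 # Make script executable
 chmod +x generate_story.sh
 ```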
 ## Future Ideas
 - Scene quality detection using image-to-text models (auto-regenerate poor scenes)
 - Different transition styles (fade, match cut)
 - Branching narratives (generate multiple versions, pick best)
 - Audio continuity chaining
 - Checkpoint recovery for long generations
 ## Credits
 - [LTX-Video (LTX-2)](https://github.com/Lightricks/LTX-Video) by Lightricks - The DiT model that powers the video generation
 - [mlx-video](https://github.com/Blaizzy/mlx-video) by Prince Canuma ([@Blaizzy](https://github.com/Blaizzy)) - MLX port enabling Apple Silicon native inference
 - [MLX](https://github.com/ml-explore/mlx) by Apple - The ML framework for Apple Silicon
 ## License
 MIT
 ---
 *Built with [mlx-video](https://github.com/Blaizzy/mlx-video) and [LTX-2](https://github.com/Lightricks/LTX-Video) on Apple Silicon*
--- a/generate_story.sh
+++ b/generate_story.sh
@@ -0,0 +1,210 @@
 #!/bin/bash
 #
 # generate_story.sh - Generate multi-scene AI videos with I2V chaining
 #
 # Usage: ./generate_story.sh <story_file> [output_dir] [options]
 #
 set -e
 # Default settings
 WIDTH=1920
 HEIGHT=1088
 FRAMES=121
 STRENGTH=0.7
 FPS=24
 VENV_PYTHON="${VENV_PYTHON:-./venv/bin/python}"
 OUTPUT_DIR="./output"
 # Colors
 RED='\033[0;31m'
 GREEN='\033[0;32m'
 YELLOW='\033[0;33m'
 BLUE='\033[0;34m'
 NC='\033[0m'
 usage() {
    echo "Usage: $0 <story_file> [output_dir] [options]"
    echo ""
    echo "Arguments:"
    echo "  story_file    Text file with one prompt per line"
    echo "  output_dir    Directory to save output files (default: ./output)"
    echo ""
    echo "Options:"
    echo "  --width       Video width (default: 1920)"
    echo "  --height      Video height (default: 1088)"
    echo "  --frames      Frames per scene (default: 121)"
    echo "  --strength    I2V conditioning strength 0.0-1.0 (default: 0.7)"
    echo "  --fps         Output framerate (default: 24)"
    echo "  --python      Python executable (default: ./venv/bin/python)"
    echo ""
    echo "Example:"
    echo "  $0 stories/mountain.txt output/ --width 1024 --height 768"
    exit 1
 }
 # Parse arguments
 if [ $# -lt 1 ]; then
    usage
 fi
 STORY_FILE="$1"
 shift 1
 # Check if second arg is output dir (not an option starting with --)
 if [ $# -gt 0 ] && [[ "$1" != --* ]]; then
    OUTPUT_DIR="$1"
    shift 1
 fi
 while [ $# -gt 0 ]; do
    case "$1" in
        --width) WIDTH="$2"; shift 2 ;;
        --height) HEIGHT="$2"; shift 2 ;;
        --frames) FRAMES="$2"; shift 2 ;;
        --strength) STRENGTH="$2"; shift 2 ;;
        --fps) FPS="$2"; shift 2 ;;
        --python) VENV_PYTHON="$2"; shift 2 ;;
        *) echo "Unknown option: $1"; usage ;;
    esac
 done
 # Validate inputs
 if [ ! -f "$STORY_FILE" ]; then
    echo -e "${RED}Error: Story file not found: $STORY_FILE${NC}"
    exit 1
 fi
 # Create output directory
 mkdir -p "$OUTPUT_DIR"
 # Read prompts into array (compatible with bash 3, skip comments and empty lines)
 PROMPTS=()
 while IFS= read -r line || [[ -n "$line" ]]; do
    # Skip empty lines and comments
    [[ -z "$line" ]] && continue
    [[ "$line" =~ ^[[:space:]]*$ ]] && continue
    [[ "$line" =~ ^[[:space:]]*# ]] && continue
    PROMPTS+=("$line")
 done < "$STORY_FILE"
 NUM_SCENES=${#PROMPTS[@]}
 if [ $NUM_SCENES -eq 0 ]; then
    echo -e "${RED}Error: No prompts found in $STORY_FILE${NC}"
    exit 1
 fi
 # Get story name from filename
 STORY_NAME=$(basename "$STORY_FILE" .txt)
 echo -e "${BLUE}========================================${NC}"
 echo -e "${BLUE}  mlx-video-maker${NC}"
 echo -e "${BLUE}========================================${NC}"
 echo ""
 echo -e "Story: ${GREEN}$STORY_NAME${NC}"
 echo -e "Scenes: ${GREEN}$NUM_SCENES${NC}"
 echo -e "Resolution: ${GREEN}${WIDTH}x${HEIGHT}${NC}"
 echo -e "Frames/scene: ${GREEN}$FRAMES${NC}"
 echo -e "I2V strength: ${GREEN}$STRENGTH${NC}"
 echo -e "Output: ${GREEN}$OUTPUT_DIR${NC}"
 echo ""
 # Resolve a relative VENV_PYTHON path to absolute before changing directories
 # (a bare command name such as "python3" is left for PATH lookup)
 case "$VENV_PYTHON" in
    /*) ;;
    */*) VENV_PYTHON="$(pwd)/${VENV_PYTHON#./}" ;;
 esac
 # Generate scenes
 cd "$OUTPUT_DIR"
 for i in $(seq 1 $NUM_SCENES); do
    IDX=$((i-1))
    PROMPT="${PROMPTS[$IDX]}"
    SCENE_FILE="scene${i}.mp4"
    # Skip if already exists
    if [ -f "$SCENE_FILE" ]; then
        echo -e "${YELLOW}Scene $i already exists, skipping...${NC}"
        continue
    fi
    echo ""
    echo -e "${BLUE}=== Scene $i / $NUM_SCENES ===${NC}"
    echo -e "${GREEN}Prompt:${NC} ${PROMPT:0:80}..."
    echo ""
    if [ $i -eq 1 ]; then
        # First scene: Text-to-Video
        "$VENV_PYTHON" -m mlx_video.generate_av \
            --prompt "$PROMPT" \
            --height $HEIGHT \
            --width $WIDTH \
            --num-frames $FRAMES \
            --fps $FPS \
            --seed $((42 + i)) \
            --output-path "$SCENE_FILE"
    else
        # Subsequent scenes: Image-to-Video
        PREV=$((i-1))
        PREV_FILE="scene${PREV}.mp4"
        LAST_FRAME="scene${PREV}_lastframe.jpg"
        # Extract last frame from previous scene
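        if [ ! -f "$LAST_FRAME" ]; then
            echo -e "${YELLOW}Extracting last frame from scene $PREV...${NC}"
            # -count_frames decodes the whole stream, so nb_read_frames is exact
            FRAME_COUNT=$(ffprobe -v error -select_streams v:0 -count_frames \
                -show_entries stream=nb_read_frames \
                -of default=nokey=1:noprint_wrappers=1 "$PREV_FILE")
            if ! [[ "$FRAME_COUNT" =~ ^[0-9]+$ ]] || [ "$FRAME_COUNT" -lt 1 ]; then
                echo -e "${RED}Error: could not count frames in $PREV_FILE${NC}"
                exit 1
            fi
            LAST_IDX=$((FRAME_COUNT - 1))
            # Keep only the frame whose zero-based index is LAST_IDX; errors stay visible
            ffmpeg -v error -i "$PREV_FILE" -vf "select=eq(n\\,$LAST_IDX)" \
                -vframes 1 -q:v 2 "$LAST_FRAME" -y
        fi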
        # Generate with I2V
        "$VENV_PYTHON" -m mlx_video.generate_av \
            --prompt "$PROMPT" \
            --image "$LAST_FRAME" \
            --image-strength $STRENGTH \
            --height $HEIGHT \
            --width $WIDTH \
            --num-frames $FRAMES \
            --fps $FPS \
            --seed $((42 + i)) \
            --output-path "$SCENE_FILE"
    fi
    echo -e "${GREEN}Scene $i complete!${NC}"
 done
 echo ""
 echo -e "${BLUE}=== Concatenating final movie ===${NC}"
 # Create concat list
 CONCAT_LIST="concat_list.txt"
 > "$CONCAT_LIST"
 for i in $(seq 1 $NUM_SCENES); do
    if [ -f "scene${i}.mp4" ]; then
        echo "file 'scene${i}.mp4'" >> "$CONCAT_LIST"
    fi
 done
 # Concatenate with high quality encoding
 FINAL_FILE="${STORY_NAME}.mp4"
 ffmpeg -f concat -safe 0 -i "$CONCAT_LIST" \
    -c:v libx264 -crf 18 -preset slow \
    -c:a aac -b:a 192k \
    "$FINAL_FILE" -y
 # Get duration
 DURATION=$(ffprobe -v error -show_entries format=duration \
    -of default=noprint_wrappers=1:nokey=1 "$FINAL_FILE")
 echo ""
 echo -e "${GREEN}========================================${NC}"
 echo -e "${GREEN}  Complete!${NC}"
 echo -e "${GREEN}========================================${NC}"
 echo ""
 echo -e "Final movie: ${BLUE}$(pwd)/$FINAL_FILE${NC}"
 echo -e "Duration: ${BLUE}${DURATION%.*} seconds${NC}"
 echo ""
--- a/output/.gitkeep
+++ b/output/.gitkeep
--- a/requirements.txt
+++ b/requirements.txt
@@ -0,0 +1 @@
 git+https://github.com/Blaizzy/mlx-video.git
--- a/stories/local_ai_revolution.txt
+++ b/stories/local_ai_revolution.txt
@@ -0,0 +1,22 @@
 # The Local AI Revolution - Tech Nerds vs Big Tech
 # Scene 1: The Corporate Dystopia
 A towering glass skyscraper dominates the frame against a cold gray sky, its facade displaying massive glowing logos of tech giants. The camera begins with a slow crane shot ascending the building's reflective surface, capturing drones delivering packages and surveillance cameras tracking pedestrians below. People walk in uniform patterns, faces illuminated by identical smartphones, ambient electronic hum mixing with corporate jingles echoing from street speakers. The scene feels sterile and controlled, deep blue color grading with harsh artificial lighting, shot on 50mm lens with clinical precision, cinematic dystopian aesthetic.
 # Scene 2: The Underground Lab
 In a cluttered basement apartment, four tech nerds huddle around multiple monitors displaying terminal windows and neural network visualizations. The camera tracks slowly through the space at eye level, passing towers of servers built from repurposed hardware, walls covered in whiteboards filled with equations. A woman with short purple hair types rapidly, her face lit by screen glow as she mutters "Almost there." Cables snake across the floor, cooling fans whir loudly, empty energy drink cans pile beside keyboards. Warm amber practical lighting contrasts the cold blue of screens, handheld documentary style, intimate 35mm framing.
 # Scene 3: The Breakthrough
 Close-up on weathered hands hovering over a mechanical keyboard as a progress bar hits one hundred percent. The camera holds steady with shallow depth of field as the programmer's eyes widen, reflecting cascading green text. He slowly turns to his companions, a grin spreading across his bearded face as he whispers "It's running. Fully local. No cloud. No tracking." The others lean in, the glow from the screen illuminating their expressions of disbelief turning to joy. A ceiling fan spins lazily overhead, the hum of the local GPU cluster providing a triumphant drone, warm golden hour light streaming through dusty blinds, 85mm portrait lens, Kodak film emulation.
 # Scene 4: Spreading the Word
 Split-screen montage showing the revolution spreading across the world. On the left, a teenager in Tokyo installs the open-source AI on a laptop in a cramped bedroom. On the right, a group of students in Berlin gather around a single computer in a university library. The camera cross-dissolves between locations as hands download, compile, and share. Reddit threads and forum posts cascade across screens, download counters climbing exponentially. Night scenes transition to day across time zones, the warm glow of screens in darkened rooms, energetic editing rhythm, ambient electronic score building momentum, wide establishing shots mixed with intimate close-ups of hopeful faces.
 # Scene 5: Big Tech Reacts
 Inside a sterile corporate boardroom, executives in expensive suits stare at plummeting user graphs on a massive wall display. The camera dollies slowly around the long glass table, capturing sweating brows and loosened ties. A CEO slams his fist down, his voice echoing "Shut it down. All of it." Security teams scramble, server rooms flash red alerts, but the footage cuts to show home servers still humming in garages and basements worldwide, untouchable and decentralized. Sharp contrast between cold corporate blues and warm residential ambers, tension building through crosscut editing, dramatic orchestral undertones mixing with digital glitch sounds, anamorphic widescreen framing.
 # Scene 6: The New Internet
 A diverse group of people gather in a sunny public park, laptops open, sharing local AI assistants that run without internet connection. Children learn from patient digital tutors, artists generate images on tablets, elderly users dictate messages in their native languages. The camera sweeps through the scene in a smooth steadicam shot, capturing laughter and genuine human connection. No corporate logos visible, no surveillance drones overhead, just people owning their own technology. Golden hour sunlight bathes everything in warm tones, depth of field shifts between faces and screens, birdsong and genuine conversation replace the electronic hum of the opening, 24mm wide-angle lens capturing the communal atmosphere, hopeful documentary aesthetic.
 # Scene 7: The Original Hackers Watch
 The four original tech nerds sit on a rooftop at sunset, laptops closed for once, watching the city below where countless windows glow with the warm light of local AI. The woman with purple hair raises a bottle of craft beer, the others follow. The camera slowly pulls back in a crane shot revealing the vast urban landscape dotted with individual lights rather than corporate towers. She speaks softly, "We didn't take anything. We just gave everyone the tools to be free." The amber sunset reflects off building windows, a gentle breeze rustles their hair, distant sounds of a liberated city drift upward, 50mm lens with natural film grain, Kodak 2383 print emulation, triumphant yet humble ending.