# mlx-video

MLX-Video is a package for inference and fine-tuning of image, video, and audio generation models on your Mac using MLX.

## Installation

### Option 1: Install with pip (requires git):

```bash
pip install git+https://github.com/Blaizzy/mlx-video.git
```

### Option 2: Install with uv (ultra-fast package manager, optional):

```bash
uv pip install git+https://github.com/Blaizzy/mlx-video.git
```

## Supported Models

### LTX-2

[LTX-2](https://huggingface.co/Lightricks/LTX-Video) is a 19B-parameter video generation model from Lightricks.

## Features

- Text-to-video generation with the LTX-2 19B DiT model
- Two-stage generation pipeline for high-quality output
- 2x spatial upscaling for images and videos
- Optimized for Apple Silicon using MLX

## Usage

> **ℹ️ Info:** Currently, only the distilled variant is supported. Full LTX-2 feature support is coming soon.

### Text-to-Video Generation

```bash
# Text-to-Video (distilled, fastest)
uv run mlx_video.generate --prompt "Two dogs wearing sunglasses, cinematic, sunset" -n 97 --width 768

# Image-to-Video
uv run mlx_video.generate --prompt "A person dancing" --image photo.jpg

# Audio-to-Video
uv run mlx_video.generate --audio-file music.wav --prompt "A band playing music"

# Dev pipeline with CFG (higher quality)
uv run mlx_video.generate --pipeline dev --prompt "A cinematic scene" --cfg-scale 3.0

# Dev two-stage HQ (highest quality)
uv run mlx_video.generate --pipeline dev-two-stage-hq \
    --prompt "A cinematic scene of ocean waves at golden hour" \
    --model-repo prince-canuma/LTX-2-dev
```

**Converting weights:** Pre-converted weights are available on HuggingFace ([LTX-2-distilled](https://huggingface.co/prince-canuma/LTX-2-distilled), [LTX-2-dev](https://huggingface.co/prince-canuma/LTX-2-dev), [LTX-2.3-distilled](https://huggingface.co/prince-canuma/LTX-2.3-distilled), [LTX-2.3-dev](https://huggingface.co/prince-canuma/LTX-2.3-dev)), or you can convert the original Lightricks checkpoint yourself with the conversion script (`convert.py`).

To generate with an explicit resolution, frame count, seed, and output path:

```bash
python -m mlx_video.generate \
    --prompt "Ocean waves crashing on a beach at sunset" \
    --height 768 \
    --width 768 \
    --num-frames 65 \
    --seed 123 \
    --output my_video.mp4
```

### CLI Options

| Option | Default | Description |
|--------|---------|-------------|
| `--prompt`, `-p` | (required) | Text description of the video |
| `--height`, `-H` | 512 | Output height in pixels (must be divisible by 64) |
| `--width`, `-W` | 512 | Output width in pixels (must be divisible by 64) |
| `--num-frames`, `-n` | 100 | Number of frames |
| `--seed`, `-s` | 42 | Random seed for reproducibility |
| `--fps` | 24 | Frames per second |
| `--output`, `-o` | output.mp4 | Output video path |
| `--save-frames` | false | Save individual frames as images |
| `--model-repo` | Lightricks/LTX-2 | HuggingFace model repository |

## How It Works

The pipeline uses a two-stage generation process:

1. **Stage 1**: Generate at half resolution (e.g., 384x384) with 8 denoising steps
2. **Upsample**: 2x spatial upsampling via LatentUpsampler
3. **Stage 2**: Refine at full resolution (e.g., 768x768) with 3 denoising steps
4. **Decode**: VAE decoder converts latents to RGB video

## Requirements

- macOS with Apple Silicon
- Python >= 3.11
- MLX >= 0.22.0

## Model Specifications

- **Transformer**: 48 layers, 32 attention heads, 128 dim per head
- **Latent channels**: 128
- **Text encoder**: Gemma 3 with 3840-dim output
- **RoPE**: Split mode with double precision

## Project Structure

```
mlx_video/
├── generate.py              # Video generation pipeline
├── convert.py               # Weight conversion (PyTorch -> MLX)
├── postprocess.py           # Video post-processing utilities
├── utils.py                 # Helper functions
└── models/
    └── ltx/
        ├── ltx.py           # Main LTXModel (DiT transformer)
        ├── config.py        # Model configuration
        ├── transformer.py   # Transformer blocks
        ├── attention.py     # Multi-head attention with RoPE
        ├── text_encoder.py  # Text encoder
        ├── upsampler.py     # 2x spatial upsampler
        └── video_vae/       # VAE encoder/decoder
```

## License

MIT
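As a rough illustration of the two-stage process described under How It Works, the sketch below checks a requested output size against the divisible-by-64 rule from the CLI options and derives the per-stage resolutions and step counts. `StagePlan` and `plan_two_stage` are hypothetical names made up for this example; they are not part of the `mlx_video` API, and the real pipeline may organize this differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StagePlan:
    """Resolution and denoising step count for one stage (illustrative only)."""

    name: str
    height: int
    width: int
    steps: int


def plan_two_stage(height: int, width: int) -> list[StagePlan]:
    """Return stage 1 and stage 2 plans for a requested output size.

    Hypothetical helper: mirrors the README's description (half resolution
    with 8 steps, then full resolution with 3 steps), not the library's code.
    """
    for label, value in (("height", height), ("width", width)):
        if value <= 0 or value % 64 != 0:
            raise ValueError(
                f"{label} must be a positive multiple of 64, got {value}"
            )
    return [
        StagePlan("stage 1", height // 2, width // 2, steps=8),
        StagePlan("stage 2", height, width, steps=3),
    ]


if __name__ == "__main__":
    for stage in plan_two_stage(768, 768):
        print(f"{stage.name}: {stage.height}x{stage.width}, "
              f"{stage.steps} denoising steps")
```

For a 768x768 request this yields a 384x384 first stage and a 768x768 second stage, matching the example figures in the list above. Requiring a multiple of 64 also guarantees that the half-resolution size is a whole number.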