# LTX-2 for MLX MLX port of [LTX-2](https://huggingface.co/Lightricks/LTX-2), a 19B parameter video generation model from Lightricks with synchronized audio-video support. ## Pipelines Four pipeline types are available via the `--pipeline` flag: | Pipeline | Description | CFG | Stages | Speed | |----------|-------------|-----|--------|-------| | `distilled` (default) | Fixed sigma schedule, no CFG | No | 2 (8+3 steps) | Fastest | | `dev` | Dynamic sigmas, constant CFG | Yes | 1 (30 steps) | Medium | | `dev-two-stage` | Dev + LoRA refinement | Yes (stage 1) | 2 (30+3 steps) | Slow | | `dev-two-stage-hq` | res_2s sampler + LoRA both stages | Yes (stage 1) | 2 (15+3 steps) | Slow, highest quality | ## Usage ### Text-to-Video (T2V) ```bash # Distilled (default) - fast, two-stage uv run mlx_video.generate --prompt "Two dogs wearing sunglasses, cinematic, sunset" -n 97 --width 768 # Dev - single-stage with CFG uv run mlx_video.generate --pipeline dev --prompt "A cinematic scene" --cfg-scale 3.0 # Dev two-stage - dev + LoRA refinement uv run mlx_video.generate --pipeline dev-two-stage \ --prompt "Two dogs of the poodle breed wearing sunglasses, close up, cinematic, sunset" \ -n 145 --width 1024 --height 768 \ --model-repo prince-canuma/LTX-2-dev \ --cfg-scale 3.0 --lora-strength 0.8 \ --enhance-prompt # Dev two-stage HQ - res_2s sampler, LoRA both stages (highest quality) uv run mlx_video.generate --pipeline dev-two-stage-hq \ --prompt "A cinematic scene of ocean waves at golden hour" \ --model-repo prince-canuma/LTX-2-dev # HQ with custom LoRA strengths uv run mlx_video.generate --pipeline dev-two-stage-hq \ --prompt "A sunset over mountains" \ --model-repo prince-canuma/LTX-2-dev \ --lora-strength-stage-1 0.3 --lora-strength-stage-2 0.6 ``` ### Image-to-Video (I2V) ```bash # Distilled I2V uv run mlx_video.generate --prompt "A person dancing" --image photo.jpg # Dev I2V uv run mlx_video.generate --pipeline dev --prompt "Waves crashing" --image beach.png --cfg-scale 3.5 ``` ### Audio-to-Video (A2V) Generate video conditioned on an input audio file. Works with all four pipelines. The audio is encoded to latent space and frozen during denoising -- the transformer's cross-attention reads the audio signal to guide video generation. ```bash # A2V - distilled (default, fastest) uv run mlx_video.generate --audio-file music.wav --prompt "A band playing music" # A2V - dev (single-stage with CFG) uv run mlx_video.generate --pipeline dev --audio-file ocean.wav --prompt "Ocean waves" # A2V - dev-two-stage (dev + LoRA refinement) uv run mlx_video.generate --pipeline dev-two-stage --audio-file music.wav \ --prompt "A band playing music" --model-repo prince-canuma/LTX-2-dev # A2V - dev-two-stage-hq (highest quality) uv run mlx_video.generate --pipeline dev-two-stage-hq --audio-file music.wav \ --prompt "A band playing music" --model-repo prince-canuma/LTX-2-dev # A2V + I2V (audio + image conditioning) uv run mlx_video.generate --audio-file rain.wav --image forest.jpg --prompt "Rain in forest" # A2V with custom start time uv run mlx_video.generate --audio-file song.mp3 --audio-start-time 30.0 --prompt "Concert" ``` > **Note:** `--audio-file` (A2V) and `--audio` (generate audio) are mutually exclusive. Supported formats: WAV, FLAC, MP3, OGG, and video files with audio tracks. ### Audio-Video Generation (experimental) Generate synchronized audio alongside video from scratch: ```bash uv run mlx_video.generate --prompt "Ocean waves crashing" --audio uv run mlx_video.generate --pipeline dev --prompt "A jazz band playing" --audio --enhance-prompt # With full guidance (STG + modality_scale, matches PyTorch defaults) uv run mlx_video.generate --pipeline dev --prompt "Ocean waves crashing" --audio \ --stg-scale 1.0 --stg-blocks 29 --modality-scale 3.0 ``` ### LoRA LoRA weights can be loaded from a file, directory, or HuggingFace repo: ```bash # From HuggingFace repo uv run mlx_video.generate --pipeline dev-two-stage \ --prompt "Camera dolly out of a forest" \ --lora-path Lightricks/LTX-2-19b-LoRA-Camera-Control-Dolly-Out \ --lora-strength 1.0 # From local file uv run mlx_video.generate --pipeline dev-two-stage \ --prompt "A scene" \ --lora-path ./my-lora/weights.safetensors # From local directory (auto-detects .safetensors file) uv run mlx_video.generate --pipeline dev-two-stage \ --prompt "A scene" \ --lora-path ./LTX-2-distilled/lora ``` ### Upscaling ```bash # Upscale an image 2x uv run mlx_video.upscale --input photo.png --output upscaled.png # Upscale a video 2x uv run mlx_video.upscale --input video.mp4 --output upscaled.mp4 # Upscale with refinement (higher quality, requires text prompt) uv run mlx_video.upscale --input video.mp4 --output upscaled.mp4 --refine --prompt "A cinematic scene" ``` ## CLI Options ### General | Option | Default | Description | |--------|---------|-------------| | `--prompt`, `-p` | (required) | Text description of the video | | `--pipeline` | `distilled` | Pipeline type: `distilled`, `dev`, `dev-two-stage`, or `dev-two-stage-hq` | | `--height`, `-H` | 512 | Output height (divisible by 64 for two-stage, 32 for dev) | | `--width`, `-W` | 512 | Output width (divisible by 64 for two-stage, 32 for dev) | | `--num-frames`, `-n` | 33 | Number of frames (must be 1 + 8*k) | | `--seed`, `-s` | 42 | Random seed for reproducibility | | `--fps` | 24 | Frames per second | | `--output-path`, `-o` | output.mp4 | Output video path | | `--model-repo` | Lightricks/LTX-2 | HuggingFace model repository | | `--text-encoder-repo` | None | Separate text encoder repo (if not in model repo) | | `--save-frames` | false | Save individual frames as images | | `--enhance-prompt` | false | Enhance prompt using Gemma | | `--image`, `-i` | None | Conditioning image for I2V | | `--image-strength` | 1.0 | Conditioning strength for I2V | | `--audio`, `-a` | false | Enable synchronized audio generation | | `--audio-file` | None | Path to audio file for A2V conditioning | | `--audio-start-time` | 0.0 | Start time in seconds for audio file | | `--tiling` | `auto` | VAE tiling mode: `auto`, `none`, `aggressive`, `conservative` | | `--stream` | false | Stream frames as they decode | | `--spatial-upscaler` | auto (x2) | Spatial upscaler file for two-stage pipelines (see below) | ### Spatial Upscalers (LTX-2.3) LTX-2.3 ships with multiple spatial upscaler variants. Use `--spatial-upscaler` to select one: | Variant | Scale | Output (from 256x256) | Architecture | |---------|-------|-----------------------|--------------| | `ltx-2.3-spatial-upscaler-x2-1.0.safetensors` (default) | 2.0x | 512x512 | Conv2d + PixelShuffle(2) | | `ltx-2.3-spatial-upscaler-x2-1.1.safetensors` | 2.0x | 512x512 | Same arch, newer weights | | `ltx-2.3-spatial-upscaler-x1.5-1.0.safetensors` | 1.5x | 384x384 | Conv2d + PixelShuffle(3) + BlurDownsample | ```bash # Default (x2-1.0, auto-detected) uv run mlx_video.generate --prompt "A sunset" --model-repo ./LTX-2.3-distilled # x2-1.1 (newer weights) uv run mlx_video.generate --prompt "A sunset" --model-repo ./LTX-2.3-distilled \ --spatial-upscaler ltx-2.3-spatial-upscaler-x2-1.1.safetensors # x1.5 (smaller output, faster) uv run mlx_video.generate --prompt "A sunset" --model-repo ./LTX-2.3-distilled \ --spatial-upscaler ltx-2.3-spatial-upscaler-x1.5-1.0.safetensors ``` > **Note:** Stage 1 always runs at half the target resolution. With x1.5, the final output is 75% of `--width`/`--height` (e.g., 512 target -> 256 stage 1 -> 384 output). With x2, the output matches the target exactly. ### Dev / Dev-Two-Stage | Option | Default | Description | |--------|---------|-------------| | `--steps` | 30 | Number of denoising steps | | `--cfg-scale` | 3.0 | CFG guidance scale | | `--cfg-rescale` | 0.7 | CFG rescale factor (reduces over-saturation) | | `--negative-prompt` | (default) | Negative prompt for CFG | | `--apg` | false | Use Adaptive Projected Guidance (more stable for I2V) | | `--stg-scale` | 0.0 | STG scale (PyTorch default: 1.0, requires `--audio`) | | `--stg-blocks` | None | Transformer blocks for STG ([29] for LTX-2, [28] for LTX-2.3) | | `--modality-scale` | 1.0 | Cross-modal guidance scale (PyTorch default: 3.0, requires `--audio`) | ### Dev-Two-Stage LoRA | Option | Default | Description | |--------|---------|-------------| | `--lora-path` | auto-detect | Path to LoRA file, directory, or HuggingFace repo | | `--lora-strength` | 1.0 | LoRA merge strength | ### Dev-Two-Stage HQ | Option | Default | Description | |--------|---------|-------------| | `--lora-strength-stage-1` | 0.25 | LoRA strength for stage 1 | | `--lora-strength-stage-2` | 0.5 | LoRA strength for stage 2 | HQ defaults: 15 steps (vs 30), `cfg-rescale` 0.45 (vs 0.7), STG disabled. Uses the res_2s second-order sampler (2 model evals per step) for better quality at the same compute budget. ## How It Works ### Distilled Pipeline (default) 1. **Stage 1**: Generate at half resolution with 8 denoising steps (fixed sigmas) 2. **Upsample**: Spatial upsampling via LatentUpsampler (x2 or x1.5, selectable via `--spatial-upscaler`) 3. **Stage 2**: Refine at upsampled resolution with 3 denoising steps 4. **Decode**: VAE decoder converts latents to RGB video ### Dev Pipeline 1. **Generate**: Full resolution with configurable steps and constant CFG 2. **Decode**: VAE decoder converts latents to RGB video ### Dev Two-Stage Pipeline 1. **Stage 1**: Dev denoising at half resolution with CFG 2. **Upsample**: Spatial upsampling via LatentUpsampler (x2 or x1.5) 3. **Stage 2**: Distilled refinement at upsampled resolution with LoRA weights (3 steps, no CFG) 4. **Decode**: VAE decoder converts latents to RGB video ### Dev Two-Stage HQ Pipeline 1. **Stage 1**: res_2s denoising at half resolution with CFG + LoRA@0.25 (15 steps, 2 evals/step) 2. **Upsample**: Spatial upsampling via LatentUpsampler (x2 or x1.5) 3. **Stage 2**: res_2s refinement at upsampled resolution with LoRA@0.5 (3 steps, no CFG) 4. **Decode**: VAE decoder converts latents to RGB video The res_2s sampler uses an exponential Rosenbrock-type Runge-Kutta integrator with SDE noise injection, producing higher quality results than Euler at the same compute budget (~30 total model evaluations). ### Audio-to-Video (A2V) Conditioning A2V works by encoding input audio into the same latent space as generated audio, then **freezing** those latents during denoising: 1. Load audio file, resample to 16kHz, compute mel-spectrogram 2. `AudioEncoder(mel_spec)` produces audio latents `(B, 8, T, 16)` 3. Normalize via `PerChannelStatistics` 4. Freeze during denoising: `timesteps=0`, `sigma=0`, skip Euler/RK updates 5. Transformer's A2V cross-attention reads frozen audio to guide video generation 6. Output: denoised video + original input audio waveform (skip audio VAE decode) ## Converting Models Convert original Lightricks/LTX-2 weights to the modular mlx-video format: ```bash # Convert distilled model uv run python -m mlx_video.models.ltx_2.convert \ --source Lightricks/LTX-2 --output ./LTX-2-distilled --variant distilled # Convert dev model uv run python -m mlx_video.models.ltx_2.convert \ --source Lightricks/LTX-2 --output ./LTX-2-dev --variant dev ``` This extracts 7 components from the monolithic checkpoint: ``` LTX-2-distilled/ ├── transformer/ # DiT transformer (19B params) ├── vae/ │ ├── decoder/ # Video VAE decoder │ └── encoder/ # Video VAE encoder ├── audio_vae/ │ ├── decoder/ # Audio VAE decoder │ └── encoder/ # Audio VAE encoder ├── vocoder/ # Mel-spectrogram to waveform └── text_projections/ # Text embedding projections ``` Pre-converted weights are available on HuggingFace: - [prince-canuma/LTX-2-distilled](https://huggingface.co/prince-canuma/LTX-2-distilled) - [prince-canuma/LTX-2-dev](https://huggingface.co/prince-canuma/LTX-2-dev) - [prince-canuma/LTX-2.3-distilled](https://huggingface.co/prince-canuma/LTX-2.3-distilled) - [prince-canuma/LTX-2.3-dev](https://huggingface.co/prince-canuma/LTX-2.3-dev) ## Model Specifications - **Transformer**: 48 layers, 32 attention heads, 128 dim per head (19B parameters) - **Latent channels**: 128 - **Patch size**: 4 (for VAE patchify/unpatchify) - **Text encoder**: Gemma 3 with 3840-dim output - **RoPE**: Split mode with double precision (LTX-2.3) or standard (LTX-2) - **Audio VAE**: Encoder (~35M), Decoder (~50M), Vocoder (~13M) ### Audio VAE Architecture ``` Audio Encoder: mel-spectrogram -> latents (B, 8, T, 16) - Channel multipliers: (1, 2, 4) - ResNet blocks with optional attention - GroupNorm or PixelNorm normalization - Optional causal convolutions Audio Decoder: latents -> mel-spectrogram - Mirrors encoder with upsampling path - Per-channel statistics for latent normalization Vocoder: mel-spectrogram -> waveform (~13M params) - HiFi-GAN style architecture - Upsample rates: [6, 5, 2, 2, 2] - ResBlock1 with dilations [1, 3, 5] ``` ## Project Structure ``` mlx_video/models/ltx_2/ ├── __init__.py ├── config.py # LTXModelConfig, AudioEncoderModelConfig, AudioDecoderModelConfig ├── convert.py # Weight conversion from Lightricks/LTX-2 ├── generate.py # Unified generation pipeline (T2V, I2V, A2V, +Audio) ├── postprocess.py # Video post-processing ├── samplers.py # Euler and res_2s samplers ├── utils.py # Shared utilities (get_model_path, load_safetensors, etc.) ├── ltx.py # Main LTXModel (DiT transformer with AV support) ├── transformer.py # Transformer blocks, Modality dataclass ├── attention.py # Multi-head attention with RoPE ├── feed_forward.py # Feed-forward layers ├── adaln.py # Adaptive Layer Normalization ├── rope.py # Rotary Position Embeddings (split/combined) ├── text_projection.py # Text embedding projection ├── text_encoder.py # Text encoder with AV embeddings support ├── upsampler.py # LatentUpsampler for 2-stage generation ├── conditioning/ │ ├── keyframe.py # Image-to-video keyframe conditioning │ └── latent.py # Video-to-video latent conditioning ├── video_vae/ │ ├── decoder.py # VAE decoder with timestep conditioning │ ├── encoder.py # VAE encoder for image/video encoding │ ├── convolution.py # CausalConv3d, CausalConv2d │ ├── ops.py # patchify, unpatchify, PerChannelStatistics │ ├── resnet.py # ResBlock3D, ResBlockGroup │ ├── sampling.py # DepthToSpaceUpsample, SpaceToDepthDownsample │ └── video_vae.py # Full VAE (encoder + decoder) └── audio_vae/ ├── audio_vae.py # Audio encoder and decoder ├── audio_processor.py # Mel-spectrogram computation (librosa) ├── vocoder.py # Mel-spectrogram to waveform synthesis ├── ops.py # AudioPatchifier, PerChannelStatistics ├── resnet.py # ResNet blocks for audio ├── attention.py # Attention blocks for audio VAE ├── normalization.py # Normalization layers ├── causal_conv_2d.py # Causal 2D convolutions ├── downsample.py # Downsampling layers └── upsample.py # Upsampling layers ``` ## LTX-2 vs LTX-2.3 LTX-2.3 introduces prompt-conditioned adaptive layer normalization (adaln): | Feature | LTX-2 | LTX-2.3 | |---------|--------|---------| | AdaLN | Standard | Prompt-conditioned (`has_prompt_adaln=True`) | | Attention gate | None | `2.0 * sigmoid(gate_logits)` | | Scale-shift table | 6 params | 9 params (+ cross-attn Q) | | Text encoder connectors | 2 blocks | 8 blocks with gate_logits | | Feature extractor | V1 (batch-level) | V2 (per-token RMSNorm) | | RoPE | Standard | Double precision | | STG blocks | [29] | [28] | | Text encoder repo | Included | Separate (`--text-encoder-repo`) |