From 643f250195fc581bd5addc570544962c80a37680 Mon Sep 17 00:00:00 2001
From: Prince Canuma <prince.gdt@gmail.com>
Date: Mon, 16 Mar 2026 23:03:05 +0100
Subject: [PATCH] Update README.md with installation instructions, supported
 models, and usage examples; add new LTX-2 model documentation for pipelines
 and features.

---
 README.md                        | 230 ++-------------------
 mlx_video/models/ltx_2/README.md | 345 +++++++++++++++++++++++++++++++
 2 files changed, 366 insertions(+), 209 deletions(-)
 create mode 100644 mlx_video/models/ltx_2/README.md

diff --git a/README.md b/README.md
index 80c87ef..d4ce9dd 100644
--- a/README.md
+++ b/README.md
@@ -4,8 +4,6 @@ MLX-Video is the best package for inference and finetuning of Image-Video-Audio
 
 ## Installation
 
-Install from source:
-
 ### Option 1: Install with pip (requires git):
 ```bash
 pip install git+https://github.com/Blaizzy/mlx-video.git
@@ -16,244 +14,58 @@ pip install git+https://github.com/Blaizzy/mlx-video.git
 uv pip install git+https://github.com/Blaizzy/mlx-video.git
 ```
 
-Supported models:
+## Supported Models
 
 ### LTX-2
-[LTX-2](https://huggingface.co/Lightricks/LTX-2) is a 19B parameter video generation model from Lightricks.
 
-## Features
+[LTX-2](https://huggingface.co/Lightricks/LTX-2) is a 19B-parameter video generation model from Lightricks. See the full [LTX-2 model card](mlx_video/models/ltx_2/README.md) for detailed usage, CLI options, pipeline descriptions, and architecture.
 
-- Text-to-video (T2V) and Image-to-video (I2V) generation
-- Audio-to-video (A2V) conditioning — generate video from input audio
-- Four pipeline modes: Distilled, Dev, Dev Two-Stage, and Dev Two-Stage HQ
+**Features:**
+- Text-to-Video (T2V), Image-to-Video (I2V), and Audio-to-Video (A2V)
+- Four pipelines: Distilled (fast), Dev (CFG), Dev Two-Stage (LoRA), Dev Two-Stage HQ (highest quality)
 - Synchronized audio-video generation (experimental)
-- LoRA support (including HuggingFace repos)
+- LoRA support (local files or HuggingFace repos)
 - Prompt enhancement via Gemma
 - 2x spatial upscaling for images and videos
-- Optimized for Apple Silicon using MLX
 
-## Usage
-
-### Pipelines
-
-mlx-video supports four pipeline types via the `--pipeline` flag:
-
-| Pipeline | Description | CFG | Stages | Speed |
-|----------|-------------|-----|--------|-------|
-| `distilled` (default) | Fixed sigma schedule, no CFG | No | 2 (8+3 steps) | Fastest |
-| `dev` | Dynamic sigmas, constant CFG | Yes | 1 (30 steps) | Medium |
-| `dev-two-stage` | Dev + LoRA refinement | Yes (stage 1) | 2 (30+3 steps) | Slow |
-| `dev-two-stage-hq` | res_2s sampler + LoRA both stages | Yes (stage 1) | 2 (15+3 steps) | Slow, highest quality |
-
-### Text-to-Video
+**Quick start:**
 
 ```bash
-# Distilled (default) - fast, two-stage
+# Text-to-Video (distilled, fastest)
 uv run mlx_video.generate --prompt "Two dogs wearing sunglasses, cinematic, sunset" -n 97 --width 768
 
-# Dev - single-stage with CFG
+# Image-to-Video
+uv run mlx_video.generate --prompt "A person dancing" --image photo.jpg
+
+# Audio-to-Video
+uv run mlx_video.generate --audio-file music.wav --prompt "A band playing music"
+
+# Dev pipeline with CFG (higher quality)
 uv run mlx_video.generate --pipeline dev --prompt "A cinematic scene" --cfg-scale 3.0
 
-# Dev two-stage - dev + LoRA refinement
-uv run mlx_video.generate --pipeline dev-two-stage \
-    --prompt "Two dogs of the poodle breed wearing sunglasses, close up, cinematic, sunset" \
-    -n 145 --width 1024 --height 768 \
-    --model-repo prince-canuma/LTX-2-dev \
-    --cfg-scale 3.0 --lora-strength 0.8 \
-    --enhance-prompt
-
-# Dev two-stage HQ - res_2s sampler, LoRA both stages (highest quality)
+# Dev two-stage HQ (highest quality)
 uv run mlx_video.generate --pipeline dev-two-stage-hq \
     --prompt "A cinematic scene of ocean waves at golden hour" \
     --model-repo prince-canuma/LTX-2-dev
-
-# HQ with custom LoRA strengths
-uv run mlx_video.generate --pipeline dev-two-stage-hq \
-    --prompt "A sunset over mountains" \
-    --model-repo prince-canuma/LTX-2-dev \
-    --lora-strength-stage-1 0.3 --lora-strength-stage-2 0.6
 ```
 
 <img src="https://github.com/Blaizzy/mlx-video/raw/main/examples/poodles.gif" width="512" alt="Poodles demo">
 
-### Image-to-Video
+**Converting weights:**
+
+Pre-converted weights are available on HuggingFace ([LTX-2-distilled](https://huggingface.co/prince-canuma/LTX-2-distilled), [LTX-2-dev](https://huggingface.co/prince-canuma/LTX-2-dev), [LTX-2.3-distilled](https://huggingface.co/prince-canuma/LTX-2.3-distilled), [LTX-2.3-dev](https://huggingface.co/prince-canuma/LTX-2.3-dev)). Alternatively, convert the original Lightricks checkpoint yourself:
 
 ```bash
-# Distilled I2V
-uv run mlx_video.generate --prompt "A person dancing" --image photo.jpg
-
-# Dev I2V
-uv run mlx_video.generate --pipeline dev --prompt "Waves crashing" --image beach.png --cfg-scale 3.5
+uv run python -m mlx_video.models.ltx_2.convert \
+    --source Lightricks/LTX-2 --output ./LTX-2-distilled --variant distilled
 ```
 
-### Audio-to-Video (A2V)
-
-Generate video conditioned on an input audio file. Works with all four pipelines. The audio is encoded to latent space and frozen during denoising — the transformer's cross-attention reads the audio signal to guide video generation.
-
-```bash
-# A2V - distilled (default, fastest)
-uv run mlx_video.generate --audio-file music.wav --prompt "A band playing music"
-
-# A2V - dev (single-stage with CFG)
-uv run mlx_video.generate --pipeline dev --audio-file ocean.wav --prompt "Ocean waves"
-
-# A2V - dev-two-stage (dev + LoRA refinement)
-uv run mlx_video.generate --pipeline dev-two-stage --audio-file music.wav \
-    --prompt "A band playing music" --model-repo prince-canuma/LTX-2-dev
-
-# A2V - dev-two-stage-hq (highest quality)
-uv run mlx_video.generate --pipeline dev-two-stage-hq --audio-file music.wav \
-    --prompt "A band playing music" --model-repo prince-canuma/LTX-2-dev
-
-# A2V + I2V (audio + image conditioning)
-uv run mlx_video.generate --audio-file rain.wav --image forest.jpg --prompt "Rain in forest"
-
-# A2V with custom start time
-uv run mlx_video.generate --audio-file song.mp3 --audio-start-time 30.0 --prompt "Concert"
-```
-
-> **Note:** `--audio-file` (A2V) and `--audio` (generate audio) are mutually exclusive. Supported formats: WAV, FLAC, MP3, OGG, and video files with audio tracks.
-
-### Audio-Video Generation (experimental)
-
-Generate synchronized audio alongside video from scratch:
-
-```bash
-uv run mlx_video.generate --prompt "Ocean waves crashing" --audio
-uv run mlx_video.generate --pipeline dev --prompt "A jazz band playing" --audio --enhance-prompt
-
-# With full guidance (STG + modality_scale, matches PyTorch defaults)
-uv run mlx_video.generate --pipeline dev --prompt "Ocean waves crashing" --audio \
-    --stg-scale 1.0 --stg-blocks 29 --modality-scale 3.0
-```
-
-### LoRA
-
-LoRA weights can be loaded from a file, directory, or HuggingFace repo:
-
-```bash
-# From HuggingFace repo
-uv run mlx_video.generate --pipeline dev-two-stage \
-    --prompt "Camera dolly out of a forest" \
-    --lora-path Lightricks/LTX-2-19b-LoRA-Camera-Control-Dolly-Out \
-    --lora-strength 1.0
-
-# From local file
-uv run mlx_video.generate --pipeline dev-two-stage \
-    --prompt "A scene" \
-    --lora-path ./my-lora/weights.safetensors
-
-# From local directory (auto-detects .safetensors file)
-uv run mlx_video.generate --pipeline dev-two-stage \
-    --prompt "A scene" \
-    --lora-path ./LTX-2-distilled/lora
-```
-
-### Upscaling
-
-```bash
-# Upscale an image 2x
-uv run mlx_video.upscale --input photo.png --output upscaled.png
-
-# Upscale a video 2x
-uv run mlx_video.upscale --input video.mp4 --output upscaled.mp4
-
-# Upscale with refinement (higher quality, requires text prompt)
-uv run mlx_video.upscale --input video.mp4 --output upscaled.mp4 --refine --prompt "A cinematic scene"
-```
-
-### CLI Options
-
-| Option | Default | Description |
-|--------|---------|-------------|
-| `--prompt`, `-p` | (required) | Text description of the video |
-| `--pipeline` | `distilled` | Pipeline type: `distilled`, `dev`, `dev-two-stage`, or `dev-two-stage-hq` |
-| `--height`, `-H` | 512 | Output height (divisible by 64 for two-stage, 32 for dev) |
-| `--width`, `-W` | 512 | Output width (divisible by 64 for two-stage, 32 for dev) |
-| `--num-frames`, `-n` | 33 | Number of frames (must be 1 + 8*k) |
-| `--seed`, `-s` | 42 | Random seed for reproducibility |
-| `--fps` | 24 | Frames per second |
-| `--output-path`, `-o` | output.mp4 | Output video path |
-| `--model-repo` | Lightricks/LTX-2 | HuggingFace model repository |
-| `--text-encoder-repo` | None | Separate text encoder repo (if not in model repo) |
-| `--save-frames` | false | Save individual frames as images |
-| `--enhance-prompt` | false | Enhance prompt using Gemma |
-| `--image`, `-i` | None | Conditioning image for I2V |
-| `--image-strength` | 1.0 | Conditioning strength for I2V |
-| `--audio`, `-a` | false | Enable synchronized audio generation |
-| `--audio-file` | None | Path to audio file for A2V conditioning |
-| `--audio-start-time` | 0.0 | Start time in seconds for audio file |
-| `--tiling` | `auto` | VAE tiling mode: `auto`, `none`, `aggressive`, `conservative` |
-| `--stream` | false | Stream frames as they decode |
-
-**Dev/Dev-Two-Stage options:**
-
-| Option | Default | Description |
-|--------|---------|-------------|
-| `--steps` | 30 | Number of denoising steps |
-| `--cfg-scale` | 3.0 | CFG guidance scale |
-| `--cfg-rescale` | 0.7 | CFG rescale factor (reduces over-saturation) |
-| `--negative-prompt` | (default) | Negative prompt for CFG |
-| `--apg` | false | Use Adaptive Projected Guidance (more stable for I2V) |
-| `--stg-scale` | 0.0 | STG scale (PyTorch default: 1.0, requires `--audio`) |
-| `--stg-blocks` | None | Transformer blocks for STG ([29] for LTX-2, [28] for LTX-2.3) |
-| `--modality-scale` | 1.0 | Cross-modal guidance scale (PyTorch default: 3.0, requires `--audio`) |
-
-**Dev-Two-Stage LoRA options:**
-
-| Option | Default | Description |
-|--------|---------|-------------|
-| `--lora-path` | auto-detect | Path to LoRA file, directory, or HuggingFace repo |
-| `--lora-strength` | 1.0 | LoRA merge strength |
-
-**Dev-Two-Stage HQ options:**
-
-| Option | Default | Description |
-|--------|---------|-------------|
-| `--lora-strength-stage-1` | 0.25 | LoRA strength for stage 1 |
-| `--lora-strength-stage-2` | 0.5 | LoRA strength for stage 2 |
-
-HQ defaults: 15 steps (vs 30), `cfg-rescale` 0.45 (vs 0.7), STG disabled. Uses the res_2s second-order sampler (2 model evals per step) for better quality at the same compute budget.
-
-## How It Works
-
-### Distilled Pipeline (default)
-1. **Stage 1**: Generate at half resolution with 8 denoising steps (fixed sigmas)
-2. **Upsample**: 2x spatial upsampling via LatentUpsampler
-3. **Stage 2**: Refine at full resolution with 3 denoising steps
-4. **Decode**: VAE decoder converts latents to RGB video
-
-### Dev Pipeline
-1. **Generate**: Full resolution with configurable steps and constant CFG
-2. **Decode**: VAE decoder converts latents to RGB video
-
-### Dev Two-Stage Pipeline
-1. **Stage 1**: Dev denoising at half resolution with CFG
-2. **Upsample**: 2x spatial upsampling via LatentUpsampler
-3. **Stage 2**: Distilled refinement at full resolution with LoRA weights (3 steps, no CFG)
-4. **Decode**: VAE decoder converts latents to RGB video
-
-### Dev Two-Stage HQ Pipeline
-1. **Stage 1**: res_2s denoising at half resolution with CFG + LoRA@0.25 (15 steps, 2 evals/step)
-2. **Upsample**: 2x spatial upsampling via LatentUpsampler
-3. **Stage 2**: res_2s refinement at full resolution with LoRA@0.5 (3 steps, no CFG)
-4. **Decode**: VAE decoder converts latents to RGB video
-
-The res_2s sampler uses an exponential Rosenbrock-type Runge-Kutta integrator with SDE noise injection, producing higher quality results than Euler at the same compute budget (~30 total model evaluations).
-
 ## Requirements
 
 - macOS with Apple Silicon
 - Python >= 3.11
 - MLX >= 0.22.0
 
-## Model Specifications
-
-- **Transformer**: 48 layers, 32 attention heads, 128 dim per head (19B parameters)
-- **Latent channels**: 128
-- **Text encoder**: Gemma 3 with 3840-dim output
-- **Audio**: Synchronized audio-video with separate audio VAE and vocoder
-
 ## License
 
 MIT
diff --git a/mlx_video/models/ltx_2/README.md b/mlx_video/models/ltx_2/README.md
new file mode 100644
index 0000000..f84400e
--- /dev/null
+++ b/mlx_video/models/ltx_2/README.md
@@ -0,0 +1,345 @@
+# LTX-2 for MLX
+
+MLX port of [LTX-2](https://huggingface.co/Lightricks/LTX-2), a 19B-parameter video generation model from Lightricks with synchronized audio-video support.
+
+## Pipelines
+
+Four pipeline types are available via the `--pipeline` flag:
+
+| Pipeline | Description | CFG | Stages | Speed |
+|----------|-------------|-----|--------|-------|
+| `distilled` (default) | Fixed sigma schedule, no CFG | No | 2 (8+3 steps) | Fastest |
+| `dev` | Dynamic sigmas, constant CFG | Yes | 1 (30 steps) | Medium |
+| `dev-two-stage` | Dev + LoRA refinement | Yes (stage 1) | 2 (30+3 steps) | Slow |
+| `dev-two-stage-hq` | res_2s sampler + LoRA both stages | Yes (stage 1) | 2 (15+3 steps) | Slow, highest quality |
+
+## Usage
+
+### Text-to-Video (T2V)
+
+```bash
+# Distilled (default) - fast, two-stage
+uv run mlx_video.generate --prompt "Two dogs wearing sunglasses, cinematic, sunset" -n 97 --width 768
+
+# Dev - single-stage with CFG
+uv run mlx_video.generate --pipeline dev --prompt "A cinematic scene" --cfg-scale 3.0
+
+# Dev two-stage - dev + LoRA refinement
+uv run mlx_video.generate --pipeline dev-two-stage \
+    --prompt "Two dogs of the poodle breed wearing sunglasses, close up, cinematic, sunset" \
+    -n 145 --width 1024 --height 768 \
+    --model-repo prince-canuma/LTX-2-dev \
+    --cfg-scale 3.0 --lora-strength 0.8 \
+    --enhance-prompt
+
+# Dev two-stage HQ - res_2s sampler, LoRA both stages (highest quality)
+uv run mlx_video.generate --pipeline dev-two-stage-hq \
+    --prompt "A cinematic scene of ocean waves at golden hour" \
+    --model-repo prince-canuma/LTX-2-dev
+
+# HQ with custom LoRA strengths
+uv run mlx_video.generate --pipeline dev-two-stage-hq \
+    --prompt "A sunset over mountains" \
+    --model-repo prince-canuma/LTX-2-dev \
+    --lora-strength-stage-1 0.3 --lora-strength-stage-2 0.6
+```
+
+### Image-to-Video (I2V)
+
+```bash
+# Distilled I2V
+uv run mlx_video.generate --prompt "A person dancing" --image photo.jpg
+
+# Dev I2V
+uv run mlx_video.generate --pipeline dev --prompt "Waves crashing" --image beach.png --cfg-scale 3.5
+```
+
+### Audio-to-Video (A2V)
+
+Generate video conditioned on an input audio file. Works with all four pipelines. The audio is encoded to latent space and kept frozen during denoising, so the transformer's cross-attention can read the audio signal to guide video generation.
+
+```bash
+# A2V - distilled (default, fastest)
+uv run mlx_video.generate --audio-file music.wav --prompt "A band playing music"
+
+# A2V - dev (single-stage with CFG)
+uv run mlx_video.generate --pipeline dev --audio-file ocean.wav --prompt "Ocean waves"
+
+# A2V - dev-two-stage (dev + LoRA refinement)
+uv run mlx_video.generate --pipeline dev-two-stage --audio-file music.wav \
+    --prompt "A band playing music" --model-repo prince-canuma/LTX-2-dev
+
+# A2V - dev-two-stage-hq (highest quality)
+uv run mlx_video.generate --pipeline dev-two-stage-hq --audio-file music.wav \
+    --prompt "A band playing music" --model-repo prince-canuma/LTX-2-dev
+
+# A2V + I2V (audio + image conditioning)
+uv run mlx_video.generate --audio-file rain.wav --image forest.jpg --prompt "Rain in forest"
+
+# A2V with custom start time
+uv run mlx_video.generate --audio-file song.mp3 --audio-start-time 30.0 --prompt "Concert"
+```
+
+> **Note:** `--audio-file` (A2V) and `--audio` (generate audio) are mutually exclusive. Supported formats: WAV, FLAC, MP3, OGG, and video files with audio tracks.
+
+### Audio-Video Generation (experimental)
+
+Generate synchronized audio alongside video from scratch:
+
+```bash
+uv run mlx_video.generate --prompt "Ocean waves crashing" --audio
+uv run mlx_video.generate --pipeline dev --prompt "A jazz band playing" --audio --enhance-prompt
+
+# With full guidance (STG + modality_scale, matches PyTorch defaults)
+uv run mlx_video.generate --pipeline dev --prompt "Ocean waves crashing" --audio \
+    --stg-scale 1.0 --stg-blocks 29 --modality-scale 3.0
+```
+
+### LoRA
+
+LoRA weights can be loaded from a file, directory, or HuggingFace repo:
+
+```bash
+# From HuggingFace repo
+uv run mlx_video.generate --pipeline dev-two-stage \
+    --prompt "Camera dolly out of a forest" \
+    --lora-path Lightricks/LTX-2-19b-LoRA-Camera-Control-Dolly-Out \
+    --lora-strength 1.0
+
+# From local file
+uv run mlx_video.generate --pipeline dev-two-stage \
+    --prompt "A scene" \
+    --lora-path ./my-lora/weights.safetensors
+
+# From local directory (auto-detects .safetensors file)
+uv run mlx_video.generate --pipeline dev-two-stage \
+    --prompt "A scene" \
+    --lora-path ./LTX-2-distilled/lora
+```
+
+### Upscaling
+
+```bash
+# Upscale an image 2x
+uv run mlx_video.upscale --input photo.png --output upscaled.png
+
+# Upscale a video 2x
+uv run mlx_video.upscale --input video.mp4 --output upscaled.mp4
+
+# Upscale with refinement (higher quality, requires text prompt)
+uv run mlx_video.upscale --input video.mp4 --output upscaled.mp4 --refine --prompt "A cinematic scene"
+```
+
+## CLI Options
+
+### General
+
+| Option | Default | Description |
+|--------|---------|-------------|
+| `--prompt`, `-p` | (required) | Text description of the video |
+| `--pipeline` | `distilled` | Pipeline type: `distilled`, `dev`, `dev-two-stage`, or `dev-two-stage-hq` |
+| `--height`, `-H` | 512 | Output height (divisible by 64 for two-stage, 32 for dev) |
+| `--width`, `-W` | 512 | Output width (divisible by 64 for two-stage, 32 for dev) |
+| `--num-frames`, `-n` | 33 | Number of frames (must be 1 + 8*k, e.g. 33, 97, 145) |
+| `--seed`, `-s` | 42 | Random seed for reproducibility |
+| `--fps` | 24 | Frames per second |
+| `--output-path`, `-o` | output.mp4 | Output video path |
+| `--model-repo` | Lightricks/LTX-2 | HuggingFace model repository |
+| `--text-encoder-repo` | None | Separate text encoder repo (if not in model repo) |
+| `--save-frames` | false | Save individual frames as images |
+| `--enhance-prompt` | false | Enhance prompt using Gemma |
+| `--image`, `-i` | None | Conditioning image for I2V |
+| `--image-strength` | 1.0 | Conditioning strength for I2V |
+| `--audio`, `-a` | false | Enable synchronized audio generation |
+| `--audio-file` | None | Path to audio file for A2V conditioning |
+| `--audio-start-time` | 0.0 | Start time in seconds for audio file |
+| `--tiling` | `auto` | VAE tiling mode: `auto`, `none`, `aggressive`, `conservative` |
+| `--stream` | false | Stream frames as they decode |
+
+### Dev / Dev-Two-Stage
+
+| Option | Default | Description |
+|--------|---------|-------------|
+| `--steps` | 30 | Number of denoising steps |
+| `--cfg-scale` | 3.0 | CFG guidance scale |
+| `--cfg-rescale` | 0.7 | CFG rescale factor (reduces over-saturation) |
+| `--negative-prompt` | (default) | Negative prompt for CFG |
+| `--apg` | false | Use Adaptive Projected Guidance (more stable for I2V) |
+| `--stg-scale` | 0.0 | STG scale (PyTorch default: 1.0, requires `--audio`) |
+| `--stg-blocks` | None | Transformer blocks for STG ([29] for LTX-2, [28] for LTX-2.3) |
+| `--modality-scale` | 1.0 | Cross-modal guidance scale (PyTorch default: 3.0, requires `--audio`) |
+
+### Dev-Two-Stage LoRA
+
+| Option | Default | Description |
+|--------|---------|-------------|
+| `--lora-path` | auto-detect | Path to LoRA file, directory, or HuggingFace repo |
+| `--lora-strength` | 1.0 | LoRA merge strength |
+
+### Dev-Two-Stage HQ
+
+| Option | Default | Description |
+|--------|---------|-------------|
+| `--lora-strength-stage-1` | 0.25 | LoRA strength for stage 1 |
+| `--lora-strength-stage-2` | 0.5 | LoRA strength for stage 2 |
+
+HQ defaults: 15 steps (vs. 30), `--cfg-rescale` 0.45 (vs. 0.7), STG disabled. The pipeline uses the res_2s second-order sampler (2 model evaluations per step), so 15 steps cost about as many model evaluations as 30 single-evaluation steps while giving better quality.
+
+## How It Works
+
+### Distilled Pipeline (default)
+1. **Stage 1**: Generate at half resolution with 8 denoising steps (fixed sigmas)
+2. **Upsample**: 2x spatial upsampling via LatentUpsampler
+3. **Stage 2**: Refine at full resolution with 3 denoising steps
+4. **Decode**: VAE decoder converts latents to RGB video
+
+### Dev Pipeline
+1. **Generate**: Full resolution with configurable steps and constant CFG
+2. **Decode**: VAE decoder converts latents to RGB video
+
+### Dev Two-Stage Pipeline
+1. **Stage 1**: Dev denoising at half resolution with CFG
+2. **Upsample**: 2x spatial upsampling via LatentUpsampler
+3. **Stage 2**: Distilled refinement at full resolution with LoRA weights (3 steps, no CFG)
+4. **Decode**: VAE decoder converts latents to RGB video
+
+### Dev Two-Stage HQ Pipeline
+1. **Stage 1**: res_2s denoising at half resolution with CFG + LoRA@0.25 (15 steps, 2 evals/step)
+2. **Upsample**: 2x spatial upsampling via LatentUpsampler
+3. **Stage 2**: res_2s refinement at full resolution with LoRA@0.5 (3 steps, no CFG)
+4. **Decode**: VAE decoder converts latents to RGB video
+
+The res_2s sampler uses an exponential Rosenbrock-type Runge-Kutta integrator with SDE noise injection, producing higher-quality results than Euler at the same compute budget (~30 model evaluations in stage 1: 15 steps × 2 evaluations per step).
+
+### Audio-to-Video (A2V) Conditioning
+
+A2V works by encoding input audio into the same latent space as generated audio, then **freezing** those latents during denoising:
+
+1. Load audio file, resample to 16 kHz, compute mel-spectrogram
+2. `AudioEncoder(mel_spec)` produces audio latents `(B, 8, T, 16)`
+3. Normalize via `PerChannelStatistics`
+4. Freeze during denoising: `timesteps=0`, `sigma=0`, skip Euler/RK updates
+5. Transformer's A2V cross-attention reads frozen audio to guide video generation
+6. Output: denoised video + original input audio waveform (skip audio VAE decode)
+
+## Converting Models
+
+Convert the original Lightricks/LTX-2 weights to the modular mlx-video format:
+
+```bash
+# Convert distilled model
+uv run python -m mlx_video.models.ltx_2.convert \
+    --source Lightricks/LTX-2 --output ./LTX-2-distilled --variant distilled
+
+# Convert dev model
+uv run python -m mlx_video.models.ltx_2.convert \
+    --source Lightricks/LTX-2 --output ./LTX-2-dev --variant dev
+```
+
+This extracts 7 components from the monolithic checkpoint:
+
+```
+LTX-2-distilled/
+├── transformer/          # DiT transformer (19B params)
+├── vae/
+│   ├── decoder/          # Video VAE decoder
+│   └── encoder/          # Video VAE encoder
+├── audio_vae/
+│   ├── decoder/          # Audio VAE decoder
+│   └── encoder/          # Audio VAE encoder
+├── vocoder/              # Mel-spectrogram to waveform
+└── text_projections/     # Text embedding projections
+```
+
+Pre-converted weights are available on HuggingFace:
+- [prince-canuma/LTX-2-distilled](https://huggingface.co/prince-canuma/LTX-2-distilled)
+- [prince-canuma/LTX-2-dev](https://huggingface.co/prince-canuma/LTX-2-dev)
+- [prince-canuma/LTX-2.3-distilled](https://huggingface.co/prince-canuma/LTX-2.3-distilled)
+- [prince-canuma/LTX-2.3-dev](https://huggingface.co/prince-canuma/LTX-2.3-dev)
+
+## Model Specifications
+
+- **Transformer**: 48 layers, 32 attention heads, 128 dim per head (19B parameters)
+- **Latent channels**: 128
+- **Patch size**: 4 (for VAE patchify/unpatchify)
+- **Text encoder**: Gemma 3 with 3840-dim output
+- **RoPE**: Split mode with double precision (LTX-2.3) or standard (LTX-2)
+- **Audio VAE**: Encoder (~35M), Decoder (~50M), Vocoder (~13M)
+
+### Audio VAE Architecture
+
+```
+Audio Encoder: mel-spectrogram -> latents (B, 8, T, 16)
+  - Channel multipliers: (1, 2, 4)
+  - ResNet blocks with optional attention
+  - GroupNorm or PixelNorm normalization
+  - Optional causal convolutions
+
+Audio Decoder: latents -> mel-spectrogram
+  - Mirrors encoder with upsampling path
+  - Per-channel statistics for latent normalization
+
+Vocoder: mel-spectrogram -> waveform (~13M params)
+  - HiFi-GAN style architecture
+  - Upsample rates: [6, 5, 2, 2, 2]
+  - ResBlock1 with dilations [1, 3, 5]
+```
+
+## Project Structure
+
+```
+mlx_video/models/ltx_2/
+├── __init__.py
+├── config.py             # LTXModelConfig, AudioEncoderModelConfig, AudioDecoderModelConfig
+├── convert.py            # Weight conversion from Lightricks/LTX-2
+├── generate.py           # Unified generation pipeline (T2V, I2V, A2V, +Audio)
+├── postprocess.py        # Video post-processing
+├── samplers.py           # Euler and res_2s samplers
+├── utils.py              # Shared utilities (get_model_path, load_safetensors, etc.)
+├── ltx.py                # Main LTXModel (DiT transformer with AV support)
+├── transformer.py        # Transformer blocks, Modality dataclass
+├── attention.py          # Multi-head attention with RoPE
+├── feed_forward.py       # Feed-forward layers
+├── adaln.py              # Adaptive Layer Normalization
+├── rope.py               # Rotary Position Embeddings (split/combined)
+├── text_projection.py    # Text embedding projection
+├── text_encoder.py       # Text encoder with AV embeddings support
+├── upsampler.py          # LatentUpsampler for 2-stage generation
+├── conditioning/
+│   ├── keyframe.py       # Image-to-video keyframe conditioning
+│   └── latent.py         # Video-to-video latent conditioning
+├── video_vae/
+│   ├── decoder.py        # VAE decoder with timestep conditioning
+│   ├── encoder.py        # VAE encoder for image/video encoding
+│   ├── convolution.py    # CausalConv3d, CausalConv2d
+│   ├── ops.py            # patchify, unpatchify, PerChannelStatistics
+│   ├── resnet.py         # ResBlock3D, ResBlockGroup
+│   ├── sampling.py       # DepthToSpaceUpsample, SpaceToDepthDownsample
+│   └── video_vae.py      # Full VAE (encoder + decoder)
+└── audio_vae/
+    ├── audio_vae.py      # Audio encoder and decoder
+    ├── audio_processor.py # Mel-spectrogram computation (librosa)
+    ├── vocoder.py        # Mel-spectrogram to waveform synthesis
+    ├── ops.py            # AudioPatchifier, PerChannelStatistics
+    ├── resnet.py         # ResNet blocks for audio
+    ├── attention.py      # Attention blocks for audio VAE
+    ├── normalization.py  # Normalization layers
+    ├── causal_conv_2d.py # Causal 2D convolutions
+    ├── downsample.py     # Downsampling layers
+    └── upsample.py       # Upsampling layers
+```
+
+## LTX-2 vs LTX-2.3
+
+LTX-2.3 introduces prompt-conditioned adaptive layer normalization (AdaLN), along with the other architectural changes listed below:
+
+| Feature | LTX-2 | LTX-2.3 |
+|---------|--------|---------|
+| AdaLN | Standard | Prompt-conditioned (`has_prompt_adaln=True`) |
+| Attention gate | None | `2.0 * sigmoid(gate_logits)` |
+| Scale-shift table | 6 params | 9 params (+ cross-attn Q) |
+| Text encoder connectors | 2 blocks | 8 blocks with gate_logits |
+| Feature extractor | V1 (batch-level) | V2 (per-token RMSNorm) |
+| RoPE | Standard | Double precision |
+| STG blocks | [29] | [28] |
+| Text encoder repo | Included | Separate (`--text-encoder-repo`) |
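
As a reading aid for the attention-gate row in the table above: `2.0 * sigmoid(gate_logits)` maps any logit into the range (0, 2) and equals exactly 1.0 at a logit of zero, so a logit of zero leaves its input unscaled. The sketch below is plain Python, not the MLX implementation. `attention_gate` and `apply_gate` are hypothetical names, and applying the gate as an elementwise scale on an attention output is an assumption made for illustration only.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0.0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)  # x < 0, so exp(x) cannot overflow
    return e / (1.0 + e)


def attention_gate(gate_logit: float) -> float:
    """Gate value from the table: 2.0 * sigmoid(gate_logit), in (0, 2)."""
    return 2.0 * sigmoid(gate_logit)


def apply_gate(attn_output: list[float], gate_logit: float) -> list[float]:
    """Assumed usage (hypothetical): scale an attention output by the gate."""
    gate = attention_gate(gate_logit)
    return [gate * value for value in attn_output]


if __name__ == "__main__":
    # Negative logits damp the output, zero keeps it, positive logits amplify it.
    for logit in (-4.0, 0.0, 4.0):
        print(f"logit={logit:+.1f} -> gate={attention_gate(logit):.3f}")
```

Because the gate is centered on 1.0 rather than 0.5, it can both attenuate and amplify, which a plain sigmoid gate cannot do.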