From 643f250195fc581bd5addc570544962c80a37680 Mon Sep 17 00:00:00 2001
From: Prince Canuma
Date: Mon, 16 Mar 2026 23:03:05 +0100
Subject: [PATCH] Update README.md with installation instructions, supported models, and usage examples; add new LTX-2 model documentation for pipelines and features.

---
 README.md                        | 230 ++-------------------
 mlx_video/models/ltx_2/README.md | 345 +++++++++++++++++++++++++++++++
 2 files changed, 366 insertions(+), 209 deletions(-)
 create mode 100644 mlx_video/models/ltx_2/README.md

diff --git a/README.md b/README.md
index 80c87ef..d4ce9dd 100644
--- a/README.md
+++ b/README.md
@@ -4,8 +4,6 @@ MLX-Video is the best package for inference and finetuning of Image-Video-Audio
 
 ## Installation
 
-Install from source:
-
 ### Option 1: Install with pip (requires git):
 ```bash
 pip install git+https://github.com/Blaizzy/mlx-video.git
@@ -16,244 +14,58 @@ pip install git+https://github.com/Blaizzy/mlx-video.git
 uv pip install git+https://github.com/Blaizzy/mlx-video.git
 ```
 
-Supported models:
+## Supported Models
 
 ### LTX-2
-[LTX-2](https://huggingface.co/Lightricks/LTX-2) is a 19B parameter video generation model from Lightricks.
 
-## Features
+[LTX-2](https://huggingface.co/Lightricks/LTX-2) is a 19B parameter video generation model from Lightricks. See the full [LTX-2 model card](mlx_video/models/ltx_2/README.md) for detailed usage, CLI options, pipeline descriptions, and architecture.
 
-- Text-to-video (T2V) and Image-to-video (I2V) generation
-- Audio-to-video (A2V) conditioning — generate video from input audio
-- Four pipeline modes: Distilled, Dev, Dev Two-Stage, and Dev Two-Stage HQ
+**Features:**
+- Text-to-Video (T2V), Image-to-Video (I2V), and Audio-to-Video (A2V)
+- Four pipelines: Distilled (fast), Dev (CFG), Dev Two-Stage (LoRA), Dev Two-Stage HQ (highest quality)
 - Synchronized audio-video generation (experimental)
-- LoRA support (including HuggingFace repos)
+- LoRA support (local files or HuggingFace repos)
 - Prompt enhancement via Gemma
 - 2x spatial upscaling for images and videos
-- Optimized for Apple Silicon using MLX
 
-## Usage
-
-### Pipelines
-
-mlx-video supports four pipeline types via the `--pipeline` flag:
-
-| Pipeline | Description | CFG | Stages | Speed |
-|----------|-------------|-----|--------|-------|
-| `distilled` (default) | Fixed sigma schedule, no CFG | No | 2 (8+3 steps) | Fastest |
-| `dev` | Dynamic sigmas, constant CFG | Yes | 1 (30 steps) | Medium |
-| `dev-two-stage` | Dev + LoRA refinement | Yes (stage 1) | 2 (30+3 steps) | Slow |
-| `dev-two-stage-hq` | res_2s sampler + LoRA both stages | Yes (stage 1) | 2 (15+3 steps) | Slow, highest quality |
-
-### Text-to-Video
+**Quick start:**
 
 ```bash
-# Distilled (default) - fast, two-stage
+# Text-to-Video (distilled, fastest)
 uv run mlx_video.generate --prompt "Two dogs wearing sunglasses, cinematic, sunset" -n 97 --width 768
 
-# Dev - single-stage with CFG
+# Image-to-Video
+uv run mlx_video.generate --prompt "A person dancing" --image photo.jpg
+
+# Audio-to-Video
+uv run mlx_video.generate --audio-file music.wav --prompt "A band playing music"
+
+# Dev pipeline with CFG (higher quality)
 uv run mlx_video.generate --pipeline dev --prompt "A cinematic scene" --cfg-scale 3.0
 
-# Dev two-stage - dev + LoRA refinement
-uv run mlx_video.generate --pipeline dev-two-stage \
-    --prompt "Two dogs of the poodle breed wearing sunglasses, close up, cinematic, sunset" \
-    -n 145 --width 1024 --height 768 \
-    --model-repo prince-canuma/LTX-2-dev \
-    --cfg-scale 3.0 --lora-strength 0.8 \
-    --enhance-prompt
-
-# Dev two-stage HQ - res_2s sampler, LoRA both stages (highest quality)
+# Dev two-stage HQ (highest quality)
 uv run mlx_video.generate --pipeline dev-two-stage-hq \
     --prompt "A cinematic scene of ocean waves at golden hour" \
     --model-repo prince-canuma/LTX-2-dev
-
-# HQ with custom LoRA strengths
-uv run mlx_video.generate --pipeline dev-two-stage-hq \
-    --prompt "A sunset over mountains" \
-    --model-repo prince-canuma/LTX-2-dev \
-    --lora-strength-stage-1 0.3 --lora-strength-stage-2 0.6
 ```
 
 Poodles demo
 
-### Image-to-Video
+**Converting weights:**
+
+Pre-converted weights are available on HuggingFace ([LTX-2-distilled](https://huggingface.co/prince-canuma/LTX-2-distilled), [LTX-2-dev](https://huggingface.co/prince-canuma/LTX-2-dev), [LTX-2.3-distilled](https://huggingface.co/prince-canuma/LTX-2.3-distilled), [LTX-2.3-dev](https://huggingface.co/prince-canuma/LTX-2.3-dev)), or convert from the original Lightricks checkpoint:
 
 ```bash
-# Distilled I2V
-uv run mlx_video.generate --prompt "A person dancing" --image photo.jpg
-
-# Dev I2V
-uv run mlx_video.generate --pipeline dev --prompt "Waves crashing" --image beach.png --cfg-scale 3.5
+uv run python -m mlx_video.models.ltx_2.convert \
+    --source Lightricks/LTX-2 --output ./LTX-2-distilled --variant distilled
 ```
 
-### Audio-to-Video (A2V)
-
-Generate video conditioned on an input audio file. Works with all four pipelines. The audio is encoded to latent space and frozen during denoising — the transformer's cross-attention reads the audio signal to guide video generation.
-
-```bash
-# A2V - distilled (default, fastest)
-uv run mlx_video.generate --audio-file music.wav --prompt "A band playing music"
-
-# A2V - dev (single-stage with CFG)
-uv run mlx_video.generate --pipeline dev --audio-file ocean.wav --prompt "Ocean waves"
-
-# A2V - dev-two-stage (dev + LoRA refinement)
-uv run mlx_video.generate --pipeline dev-two-stage --audio-file music.wav \
-    --prompt "A band playing music" --model-repo prince-canuma/LTX-2-dev
-
-# A2V - dev-two-stage-hq (highest quality)
-uv run mlx_video.generate --pipeline dev-two-stage-hq --audio-file music.wav \
-    --prompt "A band playing music" --model-repo prince-canuma/LTX-2-dev
-
-# A2V + I2V (audio + image conditioning)
-uv run mlx_video.generate --audio-file rain.wav --image forest.jpg --prompt "Rain in forest"
-
-# A2V with custom start time
-uv run mlx_video.generate --audio-file song.mp3 --audio-start-time 30.0 --prompt "Concert"
-```
-
-> **Note:** `--audio-file` (A2V) and `--audio` (generate audio) are mutually exclusive. Supported formats: WAV, FLAC, MP3, OGG, and video files with audio tracks.
-
-### Audio-Video Generation (experimental)
-
-Generate synchronized audio alongside video from scratch:
-
-```bash
-uv run mlx_video.generate --prompt "Ocean waves crashing" --audio
-uv run mlx_video.generate --pipeline dev --prompt "A jazz band playing" --audio --enhance-prompt
-
-# With full guidance (STG + modality_scale, matches PyTorch defaults)
-uv run mlx_video.generate --pipeline dev --prompt "Ocean waves crashing" --audio \
-    --stg-scale 1.0 --stg-blocks 29 --modality-scale 3.0
-```
-
-### LoRA
-
-LoRA weights can be loaded from a file, directory, or HuggingFace repo:
-
-```bash
-# From HuggingFace repo
-uv run mlx_video.generate --pipeline dev-two-stage \
-    --prompt "Camera dolly out of a forest" \
-    --lora-path Lightricks/LTX-2-19b-LoRA-Camera-Control-Dolly-Out \
-    --lora-strength 1.0
-
-# From local file
-uv run mlx_video.generate --pipeline dev-two-stage \
-    --prompt "A scene" \
-    --lora-path ./my-lora/weights.safetensors
-
-# From local directory (auto-detects .safetensors file)
-uv run mlx_video.generate --pipeline dev-two-stage \
-    --prompt "A scene" \
-    --lora-path ./LTX-2-distilled/lora
-```
-
-### Upscaling
-
-```bash
-# Upscale an image 2x
-uv run mlx_video.upscale --input photo.png --output upscaled.png
-
-# Upscale a video 2x
-uv run mlx_video.upscale --input video.mp4 --output upscaled.mp4
-
-# Upscale with refinement (higher quality, requires text prompt)
-uv run mlx_video.upscale --input video.mp4 --output upscaled.mp4 --refine --prompt "A cinematic scene"
-```
-
-### CLI Options
-
-| Option | Default | Description |
-|--------|---------|-------------|
-| `--prompt`, `-p` | (required) | Text description of the video |
-| `--pipeline` | `distilled` | Pipeline type: `distilled`, `dev`, `dev-two-stage`, or `dev-two-stage-hq` |
-| `--height`, `-H` | 512 | Output height (divisible by 64 for two-stage, 32 for dev) |
-| `--width`, `-W` | 512 | Output width (divisible by 64 for two-stage, 32 for dev) |
-| `--num-frames`, `-n` | 33 | Number of frames (must be 1 + 8*k) |
-| `--seed`, `-s` | 42 | Random seed for reproducibility |
-| `--fps` | 24 | Frames per second |
-| `--output-path`, `-o` | output.mp4 | Output video path |
-| `--model-repo` | Lightricks/LTX-2 | HuggingFace model repository |
-| `--text-encoder-repo` | None | Separate text encoder repo (if not in model repo) |
-| `--save-frames` | false | Save individual frames as images |
-| `--enhance-prompt` | false | Enhance prompt using Gemma |
-| `--image`, `-i` | None | Conditioning image for I2V |
-| `--image-strength` | 1.0 | Conditioning strength for I2V |
-| `--audio`, `-a` | false | Enable synchronized audio generation |
-| `--audio-file` | None | Path to audio file for A2V conditioning |
-| `--audio-start-time` | 0.0 | Start time in seconds for audio file |
-| `--tiling` | `auto` | VAE tiling mode: `auto`, `none`, `aggressive`, `conservative` |
-| `--stream` | false | Stream frames as they decode |
-
-**Dev/Dev-Two-Stage options:**
-
-| Option | Default | Description |
-|--------|---------|-------------|
-| `--steps` | 30 | Number of denoising steps |
-| `--cfg-scale` | 3.0 | CFG guidance scale |
-| `--cfg-rescale` | 0.7 | CFG rescale factor (reduces over-saturation) |
-| `--negative-prompt` | (default) | Negative prompt for CFG |
-| `--apg` | false | Use Adaptive Projected Guidance (more stable for I2V) |
-| `--stg-scale` | 0.0 | STG scale (PyTorch default: 1.0, requires `--audio`) |
-| `--stg-blocks` | None | Transformer blocks for STG ([29] for LTX-2, [28] for LTX-2.3) |
-| `--modality-scale` | 1.0 | Cross-modal guidance scale (PyTorch default: 3.0, requires `--audio`) |
-
-**Dev-Two-Stage LoRA options:**
-
-| Option | Default | Description |
-|--------|---------|-------------|
-| `--lora-path` | auto-detect | Path to LoRA file, directory, or HuggingFace repo |
-| `--lora-strength` | 1.0 | LoRA merge strength |
-
-**Dev-Two-Stage HQ options:**
-
-| Option | Default | Description |
-|--------|---------|-------------|
-| `--lora-strength-stage-1` | 0.25 | LoRA strength for stage 1 |
-| `--lora-strength-stage-2` | 0.5 | LoRA strength for stage 2 |
-
-HQ defaults: 15 steps (vs 30), `cfg-rescale` 0.45 (vs 0.7), STG disabled. Uses the res_2s second-order sampler (2 model evals per step) for better quality at the same compute budget.
-
-## How It Works
-
-### Distilled Pipeline (default)
-1. **Stage 1**: Generate at half resolution with 8 denoising steps (fixed sigmas)
-2. **Upsample**: 2x spatial upsampling via LatentUpsampler
-3. **Stage 2**: Refine at full resolution with 3 denoising steps
-4. **Decode**: VAE decoder converts latents to RGB video
-
-### Dev Pipeline
-1. **Generate**: Full resolution with configurable steps and constant CFG
-2. **Decode**: VAE decoder converts latents to RGB video
-
-### Dev Two-Stage Pipeline
-1. **Stage 1**: Dev denoising at half resolution with CFG
-2. **Upsample**: 2x spatial upsampling via LatentUpsampler
-3. **Stage 2**: Distilled refinement at full resolution with LoRA weights (3 steps, no CFG)
-4. **Decode**: VAE decoder converts latents to RGB video
-
-### Dev Two-Stage HQ Pipeline
-1. **Stage 1**: res_2s denoising at half resolution with CFG + LoRA@0.25 (15 steps, 2 evals/step)
-2. **Upsample**: 2x spatial upsampling via LatentUpsampler
-3. **Stage 2**: res_2s refinement at full resolution with LoRA@0.5 (3 steps, no CFG)
-4. **Decode**: VAE decoder converts latents to RGB video
-
-The res_2s sampler uses an exponential Rosenbrock-type Runge-Kutta integrator with SDE noise injection, producing higher quality results than Euler at the same compute budget (~30 total model evaluations).
-
 ## Requirements
 
 - macOS with Apple Silicon
 - Python >= 3.11
 - MLX >= 0.22.0
 
-## Model Specifications
-
-- **Transformer**: 48 layers, 32 attention heads, 128 dim per head (19B parameters)
-- **Latent channels**: 128
-- **Text encoder**: Gemma 3 with 3840-dim output
-- **Audio**: Synchronized audio-video with separate audio VAE and vocoder
-
 ## License
 
 MIT
diff --git a/mlx_video/models/ltx_2/README.md b/mlx_video/models/ltx_2/README.md
new file mode 100644
index 0000000..f84400e
--- /dev/null
+++ b/mlx_video/models/ltx_2/README.md
@@ -0,0 +1,345 @@
+# LTX-2 for MLX
+
+MLX port of [LTX-2](https://huggingface.co/Lightricks/LTX-2), a 19B parameter video generation model from Lightricks with synchronized audio-video support.
+
+## Pipelines
+
+Four pipeline types are available via the `--pipeline` flag:
+
+| Pipeline | Description | CFG | Stages | Speed |
+|----------|-------------|-----|--------|-------|
+| `distilled` (default) | Fixed sigma schedule, no CFG | No | 2 (8+3 steps) | Fastest |
+| `dev` | Dynamic sigmas, constant CFG | Yes | 1 (30 steps) | Medium |
+| `dev-two-stage` | Dev + LoRA refinement | Yes (stage 1) | 2 (30+3 steps) | Slow |
+| `dev-two-stage-hq` | res_2s sampler + LoRA both stages | Yes (stage 1) | 2 (15+3 steps) | Slow, highest quality |
+
+## Usage
+
+### Text-to-Video (T2V)
+
+```bash
+# Distilled (default) - fast, two-stage
+uv run mlx_video.generate --prompt "Two dogs wearing sunglasses, cinematic, sunset" -n 97 --width 768
+
+# Dev - single-stage with CFG
+uv run mlx_video.generate --pipeline dev --prompt "A cinematic scene" --cfg-scale 3.0
+
+# Dev two-stage - dev + LoRA refinement
+uv run mlx_video.generate --pipeline dev-two-stage \
+    --prompt "Two dogs of the poodle breed wearing sunglasses, close up, cinematic, sunset" \
+    -n 145 --width 1024 --height 768 \
+    --model-repo prince-canuma/LTX-2-dev \
+    --cfg-scale 3.0 --lora-strength 0.8 \
+    --enhance-prompt
+
+# Dev two-stage HQ - res_2s sampler, LoRA both stages (highest quality)
+uv run mlx_video.generate --pipeline dev-two-stage-hq \
+    --prompt "A cinematic scene of ocean waves at golden hour" \
+    --model-repo prince-canuma/LTX-2-dev
+
+# HQ with custom LoRA strengths
+uv run mlx_video.generate --pipeline dev-two-stage-hq \
+    --prompt "A sunset over mountains" \
+    --model-repo prince-canuma/LTX-2-dev \
+    --lora-strength-stage-1 0.3 --lora-strength-stage-2 0.6
+```
+
+### Image-to-Video (I2V)
+
+```bash
+# Distilled I2V
+uv run mlx_video.generate --prompt "A person dancing" --image photo.jpg
+
+# Dev I2V
+uv run mlx_video.generate --pipeline dev --prompt "Waves crashing" --image beach.png --cfg-scale 3.5
+```
+
+### Audio-to-Video (A2V)
+
+Generate video conditioned on an input audio file. Works with all four pipelines. The audio is encoded to latent space and frozen during denoising -- the transformer's cross-attention reads the audio signal to guide video generation.
+
+```bash
+# A2V - distilled (default, fastest)
+uv run mlx_video.generate --audio-file music.wav --prompt "A band playing music"
+
+# A2V - dev (single-stage with CFG)
+uv run mlx_video.generate --pipeline dev --audio-file ocean.wav --prompt "Ocean waves"
+
+# A2V - dev-two-stage (dev + LoRA refinement)
+uv run mlx_video.generate --pipeline dev-two-stage --audio-file music.wav \
+    --prompt "A band playing music" --model-repo prince-canuma/LTX-2-dev
+
+# A2V - dev-two-stage-hq (highest quality)
+uv run mlx_video.generate --pipeline dev-two-stage-hq --audio-file music.wav \
+    --prompt "A band playing music" --model-repo prince-canuma/LTX-2-dev
+
+# A2V + I2V (audio + image conditioning)
+uv run mlx_video.generate --audio-file rain.wav --image forest.jpg --prompt "Rain in forest"
+
+# A2V with custom start time
+uv run mlx_video.generate --audio-file song.mp3 --audio-start-time 30.0 --prompt "Concert"
+```
+
+> **Note:** `--audio-file` (A2V) and `--audio` (generate audio) are mutually exclusive. Supported formats: WAV, FLAC, MP3, OGG, and video files with audio tracks.
+
+### Audio-Video Generation (experimental)
+
+Generate synchronized audio alongside video from scratch:
+
+```bash
+uv run mlx_video.generate --prompt "Ocean waves crashing" --audio
+uv run mlx_video.generate --pipeline dev --prompt "A jazz band playing" --audio --enhance-prompt
+
+# With full guidance (STG + modality_scale, matches PyTorch defaults)
+uv run mlx_video.generate --pipeline dev --prompt "Ocean waves crashing" --audio \
+    --stg-scale 1.0 --stg-blocks 29 --modality-scale 3.0
+```
+
+### LoRA
+
+LoRA weights can be loaded from a file, directory, or HuggingFace repo:
+
+```bash
+# From HuggingFace repo
+uv run mlx_video.generate --pipeline dev-two-stage \
+    --prompt "Camera dolly out of a forest" \
+    --lora-path Lightricks/LTX-2-19b-LoRA-Camera-Control-Dolly-Out \
+    --lora-strength 1.0
+
+# From local file
+uv run mlx_video.generate --pipeline dev-two-stage \
+    --prompt "A scene" \
+    --lora-path ./my-lora/weights.safetensors
+
+# From local directory (auto-detects .safetensors file)
+uv run mlx_video.generate --pipeline dev-two-stage \
+    --prompt "A scene" \
+    --lora-path ./LTX-2-distilled/lora
+```
+
+### Upscaling
+
+```bash
+# Upscale an image 2x
+uv run mlx_video.upscale --input photo.png --output upscaled.png
+
+# Upscale a video 2x
+uv run mlx_video.upscale --input video.mp4 --output upscaled.mp4
+
+# Upscale with refinement (higher quality, requires text prompt)
+uv run mlx_video.upscale --input video.mp4 --output upscaled.mp4 --refine --prompt "A cinematic scene"
+```
+
+## CLI Options
+
+### General
+
+| Option | Default | Description |
+|--------|---------|-------------|
+| `--prompt`, `-p` | (required) | Text description of the video |
+| `--pipeline` | `distilled` | Pipeline type: `distilled`, `dev`, `dev-two-stage`, or `dev-two-stage-hq` |
+| `--height`, `-H` | 512 | Output height (divisible by 64 for two-stage, 32 for dev) |
+| `--width`, `-W` | 512 | Output width (divisible by 64 for two-stage, 32 for dev) |
+| `--num-frames`, `-n` | 33 | Number of frames (must be 1 + 8*k) |
+| `--seed`, `-s` | 42 | Random seed for reproducibility |
+| `--fps` | 24 | Frames per second |
+| `--output-path`, `-o` | output.mp4 | Output video path |
+| `--model-repo` | Lightricks/LTX-2 | HuggingFace model repository |
+| `--text-encoder-repo` | None | Separate text encoder repo (if not in model repo) |
+| `--save-frames` | false | Save individual frames as images |
+| `--enhance-prompt` | false | Enhance prompt using Gemma |
+| `--image`, `-i` | None | Conditioning image for I2V |
+| `--image-strength` | 1.0 | Conditioning strength for I2V |
+| `--audio`, `-a` | false | Enable synchronized audio generation |
+| `--audio-file` | None | Path to audio file for A2V conditioning |
+| `--audio-start-time` | 0.0 | Start time in seconds for audio file |
+| `--tiling` | `auto` | VAE tiling mode: `auto`, `none`, `aggressive`, `conservative` |
+| `--stream` | false | Stream frames as they decode |
+
+### Dev / Dev-Two-Stage
+
+| Option | Default | Description |
+|--------|---------|-------------|
+| `--steps` | 30 | Number of denoising steps |
+| `--cfg-scale` | 3.0 | CFG guidance scale |
+| `--cfg-rescale` | 0.7 | CFG rescale factor (reduces over-saturation) |
+| `--negative-prompt` | (default) | Negative prompt for CFG |
+| `--apg` | false | Use Adaptive Projected Guidance (more stable for I2V) |
+| `--stg-scale` | 0.0 | STG scale (PyTorch default: 1.0, requires `--audio`) |
+| `--stg-blocks` | None | Transformer blocks for STG ([29] for LTX-2, [28] for LTX-2.3) |
+| `--modality-scale` | 1.0 | Cross-modal guidance scale (PyTorch default: 3.0, requires `--audio`) |
+
+### Dev-Two-Stage LoRA
+
+| Option | Default | Description |
+|--------|---------|-------------|
+| `--lora-path` | auto-detect | Path to LoRA file, directory, or HuggingFace repo |
+| `--lora-strength` | 1.0 | LoRA merge strength |
+
+### Dev-Two-Stage HQ
+
+| Option | Default | Description |
+|--------|---------|-------------|
+| `--lora-strength-stage-1` | 0.25 | LoRA strength for stage 1 |
+| `--lora-strength-stage-2` | 0.5 | LoRA strength for stage 2 |
+
+HQ defaults: 15 steps (vs 30), `cfg-rescale` 0.45 (vs 0.7), STG disabled. Uses the res_2s second-order sampler (2 model evals per step) for better quality at the same compute budget.
+
+## How It Works
+
+### Distilled Pipeline (default)
+1. **Stage 1**: Generate at half resolution with 8 denoising steps (fixed sigmas)
+2. **Upsample**: 2x spatial upsampling via LatentUpsampler
+3. **Stage 2**: Refine at full resolution with 3 denoising steps
+4. **Decode**: VAE decoder converts latents to RGB video
+
+### Dev Pipeline
+1. **Generate**: Full resolution with configurable steps and constant CFG
+2. **Decode**: VAE decoder converts latents to RGB video
+
+### Dev Two-Stage Pipeline
+1. **Stage 1**: Dev denoising at half resolution with CFG
+2. **Upsample**: 2x spatial upsampling via LatentUpsampler
+3. **Stage 2**: Distilled refinement at full resolution with LoRA weights (3 steps, no CFG)
+4. **Decode**: VAE decoder converts latents to RGB video
+
+### Dev Two-Stage HQ Pipeline
+1. **Stage 1**: res_2s denoising at half resolution with CFG + LoRA@0.25 (15 steps, 2 evals/step)
+2. **Upsample**: 2x spatial upsampling via LatentUpsampler
+3. **Stage 2**: res_2s refinement at full resolution with LoRA@0.5 (3 steps, no CFG)
+4. **Decode**: VAE decoder converts latents to RGB video
+
+The res_2s sampler uses an exponential Rosenbrock-type Runge-Kutta integrator with SDE noise injection, producing higher quality results than Euler at the same compute budget (~30 total model evaluations).
+
+### Audio-to-Video (A2V) Conditioning
+
+A2V works by encoding input audio into the same latent space as generated audio, then **freezing** those latents during denoising:
+
+1. Load audio file, resample to 16kHz, compute mel-spectrogram
+2. `AudioEncoder(mel_spec)` produces audio latents `(B, 8, T, 16)`
+3. Normalize via `PerChannelStatistics`
+4. Freeze during denoising: `timesteps=0`, `sigma=0`, skip Euler/RK updates
+5. Transformer's A2V cross-attention reads frozen audio to guide video generation
+6. Output: denoised video + original input audio waveform (skip audio VAE decode)
+
+## Converting Models
+
+Convert original Lightricks/LTX-2 weights to the modular mlx-video format:
+
+```bash
+# Convert distilled model
+uv run python -m mlx_video.models.ltx_2.convert \
+    --source Lightricks/LTX-2 --output ./LTX-2-distilled --variant distilled
+
+# Convert dev model
+uv run python -m mlx_video.models.ltx_2.convert \
+    --source Lightricks/LTX-2 --output ./LTX-2-dev --variant dev
+```
+
+This extracts 7 components from the monolithic checkpoint:
+
+```
+LTX-2-distilled/
+├── transformer/          # DiT transformer (19B params)
+├── vae/
+│   ├── decoder/          # Video VAE decoder
+│   └── encoder/          # Video VAE encoder
+├── audio_vae/
+│   ├── decoder/          # Audio VAE decoder
+│   └── encoder/          # Audio VAE encoder
+├── vocoder/              # Mel-spectrogram to waveform
+└── text_projections/     # Text embedding projections
+```
+
+Pre-converted weights are available on HuggingFace:
+- [prince-canuma/LTX-2-distilled](https://huggingface.co/prince-canuma/LTX-2-distilled)
+- [prince-canuma/LTX-2-dev](https://huggingface.co/prince-canuma/LTX-2-dev)
+- [prince-canuma/LTX-2.3-distilled](https://huggingface.co/prince-canuma/LTX-2.3-distilled)
+- [prince-canuma/LTX-2.3-dev](https://huggingface.co/prince-canuma/LTX-2.3-dev)
+
+## Model Specifications
+
+- **Transformer**: 48 layers, 32 attention heads, 128 dim per head (19B parameters)
+- **Latent channels**: 128
+- **Patch size**: 4 (for VAE patchify/unpatchify)
+- **Text encoder**: Gemma 3 with 3840-dim output
+- **RoPE**: Split mode with double precision (LTX-2.3) or standard (LTX-2)
+- **Audio VAE**: Encoder (~35M), Decoder (~50M), Vocoder (~13M)
+
+### Audio VAE Architecture
+
+```
+Audio Encoder: mel-spectrogram -> latents (B, 8, T, 16)
+  - Channel multipliers: (1, 2, 4)
+  - ResNet blocks with optional attention
+  - GroupNorm or PixelNorm normalization
+  - Optional causal convolutions
+
+Audio Decoder: latents -> mel-spectrogram
+  - Mirrors encoder with upsampling path
+  - Per-channel statistics for latent normalization
+
+Vocoder: mel-spectrogram -> waveform (~13M params)
+  - HiFi-GAN style architecture
+  - Upsample rates: [6, 5, 2, 2, 2]
+  - ResBlock1 with dilations [1, 3, 5]
+```
+
+## Project Structure
+
+```
+mlx_video/models/ltx_2/
+├── __init__.py
+├── config.py             # LTXModelConfig, AudioEncoderModelConfig, AudioDecoderModelConfig
+├── convert.py            # Weight conversion from Lightricks/LTX-2
+├── generate.py           # Unified generation pipeline (T2V, I2V, A2V, +Audio)
+├── postprocess.py        # Video post-processing
+├── samplers.py           # Euler and res_2s samplers
+├── utils.py              # Shared utilities (get_model_path, load_safetensors, etc.)
+├── ltx.py                # Main LTXModel (DiT transformer with AV support)
+├── transformer.py        # Transformer blocks, Modality dataclass
+├── attention.py          # Multi-head attention with RoPE
+├── feed_forward.py       # Feed-forward layers
+├── adaln.py              # Adaptive Layer Normalization
+├── rope.py               # Rotary Position Embeddings (split/combined)
+├── text_projection.py    # Text embedding projection
+├── text_encoder.py       # Text encoder with AV embeddings support
+├── upsampler.py          # LatentUpsampler for 2-stage generation
+├── conditioning/
+│   ├── keyframe.py       # Image-to-video keyframe conditioning
+│   └── latent.py         # Video-to-video latent conditioning
+├── video_vae/
+│   ├── decoder.py        # VAE decoder with timestep conditioning
+│   ├── encoder.py        # VAE encoder for image/video encoding
+│   ├── convolution.py    # CausalConv3d, CausalConv2d
+│   ├── ops.py            # patchify, unpatchify, PerChannelStatistics
+│   ├── resnet.py         # ResBlock3D, ResBlockGroup
+│   ├── sampling.py       # DepthToSpaceUpsample, SpaceToDepthDownsample
+│   └── video_vae.py      # Full VAE (encoder + decoder)
+└── audio_vae/
+    ├── audio_vae.py      # Audio encoder and decoder
+    ├── audio_processor.py # Mel-spectrogram computation (librosa)
+    ├── vocoder.py        # Mel-spectrogram to waveform synthesis
+    ├── ops.py            # AudioPatchifier, PerChannelStatistics
+    ├── resnet.py         # ResNet blocks for audio
+    ├── attention.py      # Attention blocks for audio VAE
+    ├── normalization.py  # Normalization layers
+    ├── causal_conv_2d.py # Causal 2D convolutions
+    ├── downsample.py     # Downsampling layers
+    └── upsample.py       # Upsampling layers
+```
+
+## LTX-2 vs LTX-2.3
+
+LTX-2.3 introduces prompt-conditioned adaptive layer normalization (adaln):
+
+| Feature | LTX-2 | LTX-2.3 |
+|---------|--------|---------|
+| AdaLN | Standard | Prompt-conditioned (`has_prompt_adaln=True`) |
+| Attention gate | None | `2.0 * sigmoid(gate_logits)` |
+| Scale-shift table | 6 params | 9 params (+ cross-attn Q) |
+| Text encoder connectors | 2 blocks | 8 blocks with gate_logits |
+| Feature extractor | V1 (batch-level) | V2 (per-token RMSNorm) |
+| RoPE | Standard | Double precision |
+| STG blocks | [29] | [28] |
+| Text encoder repo | Included | Separate (`--text-encoder-repo`) |
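
The attention-gate entry in the LTX-2 vs LTX-2.3 table, `2.0 * sigmoid(gate_logits)`, keeps each gate inside the open interval (0, 2) and equals exactly 1.0 when the logit is zero, so a zero logit would leave the gated value unchanged. The pure-Python sketch below only illustrates that formula and is not taken from the mlx-video code: the helper names `attention_gate` and `apply_gate` are hypothetical, and applying the gate as an elementwise multiply on the attention output is an assumption.

```python
import math


def attention_gate(gate_logits):
    """Map raw gate logits to multiplicative gates: 2.0 * sigmoid(logit)."""
    return [2.0 / (1.0 + math.exp(-x)) for x in gate_logits]


def apply_gate(attn_out, gate_logits):
    """Scale each attention output value by its gate.

    Assumption: the gate is applied as an elementwise multiply; the
    patch itself only states the gate formula, not where it is applied.
    """
    gates = attention_gate(gate_logits)
    return [g * a for g, a in zip(gates, attn_out)]


# A zero logit gives a gate of exactly 1.0 (identity scaling).
print(attention_gate([0.0]))

# Negative logits damp the output, positive logits amplify it (below 2x).
print(apply_gate([1.0, 1.0], [-4.0, 4.0]))
```

One consequence of the factor of 2.0, compared with a plain sigmoid gate, is that the neutral point sits at a logit of zero rather than at a large positive logit.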