Files

Daniel e64483a66a feat(wan): Add Wan2.1/2.2 T2V with quantization support

2026-03-11 09:08:10 +01:00

5.8 KiB

Raw Blame History

MLX-Video Copilot Instructions

Overview

MLX-Video is a video/audio generation package using Apple MLX framework. It implements the LTX-2 model (19B parameter DiT) for text-to-video, image-to-video, and audio-video generation, optimized for Apple Silicon.

Build, Test, and Lint

Testing

# Install test dependencies first (pytest not in main deps)
pip install pytest

# Run all tests
python -m pytest tests/

# Run specific test file
python -m pytest tests/test_generate_dev.py

# Run specific test
python -m pytest tests/test_generate_dev.py::TestLTX2Scheduler::test_scheduler_output_shape

Linting

Pre-commit hooks configured with:

black: Code formatting
isort: Import sorting (profile: black)
autoflake: Remove unused imports

# Run pre-commit manually
pre-commit run --all-files

Running Generation

# Quick test - distilled model (two-stage pipeline)
python -m mlx_video.generate --prompt "test video" --num-frames 33

# Dev model with CFG (single-stage, higher quality)
python -m mlx_video.generate_dev --prompt "test video" --steps 40 --cfg-scale 4.0

# Audio-video generation
python -m mlx_video.generate_av --prompt "test video" --output-path out.mp4 --output-audio out.wav

Architecture

Two-Stage Pipeline (Distilled Model)

The distilled model (generate.py) uses a two-stage approach for efficiency:

Stage 1: Generate at half resolution with 8 denoising steps using STAGE_1_SIGMAS
Upsampler: 2x spatial upsampling via LatentUpsampler
Stage 2: Refine at full resolution with 3 steps using STAGE_2_SIGMAS
VAE Decoder: Convert latents to RGB video (tiled decoding for memory efficiency)

Single-Stage Pipeline (Dev Model)

The dev model (generate_dev.py) uses classifier-free guidance (CFG):

Full resolution generation with configurable steps (typically 40)
CFG guidance scale controls prompt adherence vs. diversity
More flexible but slower than distilled model

Core Components

DiT Transformer (models/ltx/ltx.py):

48 layers, 32 attention heads, 128 dim per head
Dual modality support: video (3840-dim) and audio (2048-dim) embeddings
Uses RoPE (Rotary Position Embeddings) in SPLIT mode with double precision
AdaLN-Zero conditioning blocks inject timestep/text embeddings

VAE Architecture:

Video VAE: 128 latent channels, 8x temporal + 32x spatial compression
- Encoder: models/ltx/video_vae/encoder.py
- Decoder: models/ltx/video_vae/decoder.py (supports tiled decoding)
Audio VAE: 8 latent channels, mel-spectrogram intermediate
- Decoder: models/ltx/audio_vae/decoder.py
- HiFi-GAN vocoder: models/ltx/audio_vae/vocoder.py

Text Encoder (models/ltx/text_encoder.py):

Based on Gemma 3 model
Returns separate embeddings for video (3840-dim) and audio (2048-dim)
Supports prompt enhancement via enhance_t2v() method

Tiling System (models/ltx/video_vae/tiling.py):

Memory-efficient decoding for large videos
Modes: auto, default (512px/64f), aggressive (256px/32f), conservative (768px/96f)
Supports streaming via on_frames_ready callback

Key Patterns

Position Grids:

Created in pixel space, then converted to latent space internally
Video: (B, 3, num_patches, 2) with [start, end) bounds for temporal/spatial dims
Audio: (B, 1, num_patches, 2) for temporal dimension only
See create_position_grid() in generate modules

Latent Conditioning (conditioning/latent.py):

LatentState tracks clean latents, noise, and sigma values
VideoConditionByLatentIndex enables I2V by conditioning specific frames
apply_denoise_mask() protects conditioned regions during denoising

Weight Loading:

convert.py: Downloads from HuggingFace, converts PyTorch → MLX format
Sanitization functions (sanitize_transformer_weights, sanitize_vae_encoder_weights) adapt keys
Uses safetensors for efficient loading

Key Conventions

Model Configuration

Always use LTXModelConfig to instantiate models
model_type determines modality: VideoOnly, AudioOnly, or AudioVideo
rope_type=LTXRopeType.SPLIT and double_precision_rope=True are standard

Frame Count Requirements

Distilled model: num_frames = 1 + 8*k format (e.g., 33, 65, 97)
Dev model: No strict requirement, but odd numbers work better
Audio frames auto-computed from video duration via AUDIO_LATENTS_PER_SECOND

Dimension Constraints

Video height/width must be divisible by 64 (VAE spatial compression)
Latent dimensions are pixel dimensions divided by 32

Audio Constants

AUDIO_SAMPLE_RATE = 24000          # Output sample rate
AUDIO_LATENT_SAMPLE_RATE = 16000   # VAE internal rate
AUDIO_HOP_LENGTH = 160             # Mel hop length
AUDIO_LATENT_CHANNELS = 8          # Audio latent channels
AUDIO_MEL_BINS = 16                # Mel frequency bins

Sigma Schedules

Distilled model uses predefined schedules (no scheduler class):

STAGE_1_SIGMAS = [1.0, 0.99375, 0.9875, 0.98125, 0.975, 0.909375, 0.725, 0.421875, 0.0]
STAGE_2_SIGMAS = [0.909375, 0.725, 0.421875, 0.0]

Dev model computes schedules via ltx2_scheduler(steps) function.

Code Style

Follow black formatting (configured in pre-commit)
Import sorting: isort with black profile
Remove unused imports (autoflake)
Type hints encouraged but not enforced

Modality Enum

Use Modality.VIDEO and Modality.AUDIO from models/ltx/transformer.py for multi-modal operations.

Video Post-Processing

postprocess.py: Contains utilities for frame normalization and video saving
Always denormalize latents from [-1, 1] to [0, 255] before saving
Use opencv-python for video I/O

Python Requirements

Python >= 3.11
MLX >= 0.22.0
Primary dependencies: numpy, safetensors, transformers, opencv-python, Pillow, mlx-vlm, scipy, librosa
Package manager: uv recommended for faster installs, pip also supported

5.8 KiB Raw Blame History