5.8 KiB
MLX-Video Copilot Instructions
Overview
MLX-Video is a video/audio generation package using Apple MLX framework. It implements the LTX-2 model (19B parameter DiT) for text-to-video, image-to-video, and audio-video generation, optimized for Apple Silicon.
Build, Test, and Lint
Testing
# Install test dependencies first (pytest not in main deps)
pip install pytest
# Run all tests
python -m pytest tests/
# Run specific test file
python -m pytest tests/test_generate_dev.py
# Run specific test
python -m pytest tests/test_generate_dev.py::TestLTX2Scheduler::test_scheduler_output_shape
Linting
Pre-commit hooks configured with:
- black: Code formatting
- isort: Import sorting (profile: black)
- autoflake: Remove unused imports
# Run pre-commit manually
pre-commit run --all-files
Running Generation
# Quick test - distilled model (two-stage pipeline)
python -m mlx_video.generate --prompt "test video" --num-frames 33
# Dev model with CFG (single-stage, higher quality)
python -m mlx_video.generate_dev --prompt "test video" --steps 40 --cfg-scale 4.0
# Audio-video generation
python -m mlx_video.generate_av --prompt "test video" --output-path out.mp4 --output-audio out.wav
Architecture
Two-Stage Pipeline (Distilled Model)
The distilled model (generate.py) uses a two-stage approach for efficiency:
- Stage 1: Generate at half resolution with 8 denoising steps using STAGE_1_SIGMAS
- Upsampler: 2x spatial upsampling via LatentUpsampler
- Stage 2: Refine at full resolution with 3 steps using STAGE_2_SIGMAS
- VAE Decoder: Convert latents to RGB video (tiled decoding for memory efficiency)
Single-Stage Pipeline (Dev Model)
The dev model (generate_dev.py) uses classifier-free guidance (CFG):
- Full resolution generation with configurable steps (typically 40)
- CFG guidance scale controls prompt adherence vs. diversity
- More flexible but slower than distilled model
Core Components
DiT Transformer (models/ltx/ltx.py):
- 48 layers, 32 attention heads, 128 dim per head
- Dual modality support: video (3840-dim) and audio (2048-dim) embeddings
- Uses RoPE (Rotary Position Embeddings) in SPLIT mode with double precision
- AdaLN-Zero conditioning blocks inject timestep/text embeddings
VAE Architecture:
- Video VAE: 128 latent channels, 8x temporal + 32x spatial compression
- Encoder:
models/ltx/video_vae/encoder.py - Decoder:
models/ltx/video_vae/decoder.py(supports tiled decoding)
- Encoder:
- Audio VAE: 8 latent channels, mel-spectrogram intermediate
- Decoder:
models/ltx/audio_vae/decoder.py - HiFi-GAN vocoder:
models/ltx/audio_vae/vocoder.py
- Decoder:
Text Encoder (models/ltx/text_encoder.py):
- Based on Gemma 3 model
- Returns separate embeddings for video (3840-dim) and audio (2048-dim)
- Supports prompt enhancement via
enhance_t2v()method
Tiling System (models/ltx/video_vae/tiling.py):
- Memory-efficient decoding for large videos
- Modes: auto, default (512px/64f), aggressive (256px/32f), conservative (768px/96f)
- Supports streaming via
on_frames_readycallback
Key Patterns
Position Grids:
- Created in pixel space, then converted to latent space internally
- Video: (B, 3, num_patches, 2) with [start, end) bounds for temporal/spatial dims
- Audio: (B, 1, num_patches, 2) for temporal dimension only
- See
create_position_grid()in generate modules
Latent Conditioning (conditioning/latent.py):
LatentStatetracks clean latents, noise, and sigma valuesVideoConditionByLatentIndexenables I2V by conditioning specific framesapply_denoise_mask()protects conditioned regions during denoising
Weight Loading:
convert.py: Downloads from HuggingFace, converts PyTorch → MLX format- Sanitization functions (
sanitize_transformer_weights,sanitize_vae_encoder_weights) adapt keys - Uses safetensors for efficient loading
Key Conventions
Model Configuration
- Always use
LTXModelConfigto instantiate models model_typedetermines modality:VideoOnly,AudioOnly, orAudioVideorope_type=LTXRopeType.SPLITanddouble_precision_rope=Trueare standard
Frame Count Requirements
- Distilled model:
num_frames = 1 + 8*kformat (e.g., 33, 65, 97) - Dev model: No strict requirement, but odd numbers work better
- Audio frames auto-computed from video duration via
AUDIO_LATENTS_PER_SECOND
Dimension Constraints
- Video height/width must be divisible by 64 (VAE spatial compression)
- Latent dimensions are pixel dimensions divided by 32
Audio Constants
AUDIO_SAMPLE_RATE = 24000 # Output sample rate
AUDIO_LATENT_SAMPLE_RATE = 16000 # VAE internal rate
AUDIO_HOP_LENGTH = 160 # Mel hop length
AUDIO_LATENT_CHANNELS = 8 # Audio latent channels
AUDIO_MEL_BINS = 16 # Mel frequency bins
Sigma Schedules
Distilled model uses predefined schedules (no scheduler class):
STAGE_1_SIGMAS = [1.0, 0.99375, 0.9875, 0.98125, 0.975, 0.909375, 0.725, 0.421875, 0.0]
STAGE_2_SIGMAS = [0.909375, 0.725, 0.421875, 0.0]
Dev model computes schedules via ltx2_scheduler(steps) function.
Code Style
- Follow black formatting (configured in pre-commit)
- Import sorting: isort with black profile
- Remove unused imports (autoflake)
- Type hints encouraged but not enforced
Modality Enum
Use Modality.VIDEO and Modality.AUDIO from models/ltx/transformer.py for multi-modal operations.
Video Post-Processing
postprocess.py: Contains utilities for frame normalization and video saving- Always denormalize latents from [-1, 1] to [0, 255] before saving
- Use opencv-python for video I/O
Python Requirements
- Python >= 3.11
- MLX >= 0.22.0
- Primary dependencies: numpy, safetensors, transformers, opencv-python, Pillow, mlx-vlm, scipy, librosa
- Package manager: uv recommended for faster installs, pip also supported