# MLX-Video Copilot Instructions

## Overview

MLX-Video is a video/audio generation package built on Apple's MLX framework. It implements the LTX-2 model (a 19B-parameter DiT) for text-to-video, image-to-video, and audio-video generation, optimized for Apple Silicon.

## Build, Test, and Lint

### Testing
```bash
# Install test dependencies first (pytest not in main deps)
pip install pytest

# Run all tests
python -m pytest tests/

# Run specific test file
python -m pytest tests/test_generate_dev.py

# Run specific test
python -m pytest tests/test_generate_dev.py::TestLTX2Scheduler::test_scheduler_output_shape
```

### Linting
Pre-commit hooks configured with:
- **black**: Code formatting
- **isort**: Import sorting (profile: black)
- **autoflake**: Remove unused imports

```bash
# Run pre-commit manually
pre-commit run --all-files
```

### Running Generation
```bash
# Quick test - distilled model (two-stage pipeline)
python -m mlx_video.generate --prompt "test video" --num-frames 33

# Dev model with CFG (single-stage, higher quality)
python -m mlx_video.generate_dev --prompt "test video" --steps 40 --cfg-scale 4.0

# Audio-video generation
python -m mlx_video.generate_av --prompt "test video" --output-path out.mp4 --output-audio out.wav
```

## Architecture

### Two-Stage Pipeline (Distilled Model)
The distilled model (`generate.py`) uses a two-stage approach for efficiency:
1. **Stage 1**: Generate at half resolution with 8 denoising steps using `STAGE_1_SIGMAS`
2. **Upsampler**: 2x spatial upsampling via `LatentUpsampler`
3. **Stage 2**: Refine at full resolution with 3 steps using `STAGE_2_SIGMAS`
4. **VAE Decoder**: Convert latents to RGB video (tiled decoding for memory efficiency)

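The stage order above can be sketched as a small planning helper. This is only an illustration: `plan_two_stage` and its dictionary keys are hypothetical names, not part of the package. It assumes the 32x spatial compression stated for the Video VAE and that a schedule of N + 1 sigma values corresponds to N denoising steps.

```python
# Hypothetical planning helper; the names here are illustrative, not the package API.
STAGE_1_SIGMAS = [1.0, 0.99375, 0.9875, 0.98125, 0.975, 0.909375, 0.725, 0.421875, 0.0]
STAGE_2_SIGMAS = [0.909375, 0.725, 0.421875, 0.0]
SPATIAL_COMPRESSION = 32  # assumed from the Video VAE description


def plan_two_stage(height: int, width: int) -> list[dict]:
    """Return per-stage pixel size, latent size, and denoising step count."""
    stages = []
    for name, scale, sigmas in (
        ("stage_1", 2, STAGE_1_SIGMAS),  # half resolution
        ("stage_2", 1, STAGE_2_SIGMAS),  # full resolution, after 2x latent upsampling
    ):
        h, w = height // scale, width // scale
        stages.append(
            {
                "name": name,
                "pixel_size": (h, w),
                "latent_size": (h // SPATIAL_COMPRESSION, w // SPATIAL_COMPRESSION),
                "steps": len(sigmas) - 1,  # N + 1 sigma values define N steps
            }
        )
    return stages


plan = plan_two_stage(height=512, width=768)
for stage in plan:
    print(stage)
```

For a 512x768 request this gives an 8x12 latent grid with 8 steps in stage 1 and a 16x24 latent grid with 3 steps in stage 2, which matches the step counts listed above.
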
### Single-Stage Pipeline (Dev Model)
The dev model (`generate_dev.py`) uses classifier-free guidance (CFG):
- Full resolution generation with configurable steps (typically 40)
- The guidance scale (`--cfg-scale`) controls prompt adherence vs. diversity
- More flexible but slower than the distilled model

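For reference, the usual CFG combination step looks like the sketch below. It shows the standard formulation, not necessarily the exact code in `generate_dev.py`; `apply_cfg` is a hypothetical name, and NumPy stands in for MLX arrays.

```python
import numpy as np


def apply_cfg(cond: np.ndarray, uncond: np.ndarray, cfg_scale: float) -> np.ndarray:
    """Standard CFG: move from the unconditional prediction toward the conditional one."""
    return uncond + cfg_scale * (cond - uncond)


cond = np.array([1.0, 2.0])  # prediction with the text prompt
uncond = np.array([0.5, 1.0])  # prediction with an empty/negative prompt

same = apply_cfg(cond, uncond, 1.0)  # scale 1.0 reproduces the conditional prediction
guided = apply_cfg(cond, uncond, 4.0)  # larger scales extrapolate past it
print(same, guided)
```

CFG typically needs both a conditional and an unconditional prediction at every step, which is one reason guided sampling costs more per step than the distilled pipeline.
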
### Core Components

**DiT Transformer** (`models/ltx/ltx.py`):
- 48 layers, 32 attention heads, 128 dim per head
- Dual modality support: video (3840-dim) and audio (2048-dim) embeddings
- Uses RoPE (Rotary Position Embeddings) in SPLIT mode with double precision
- AdaLN-Zero conditioning blocks inject timestep/text embeddings

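As a rough illustration of the AdaLN-Zero conditioning mentioned for the DiT blocks, the sketch below follows the generic DiT-style recipe: a conditioning vector is projected to a shift, scale, and gate, and the projection starts at zero so each block begins as an identity. The function names, shapes, and the use of NumPy are assumptions for illustration; how the package builds its conditioning vector is not shown here.

```python
import numpy as np


def layer_norm(x: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """LayerNorm over the last axis without learned affine parameters."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def adaln_zero_block(x, cond, w_mod, b_mod, sublayer):
    """x: (B, T, D) tokens; cond: (B, D_c) conditioning vector."""
    # Project the conditioning vector to per-sample shift, scale, and gate.
    shift, scale, gate = np.split(cond @ w_mod + b_mod, 3, axis=-1)
    h = layer_norm(x) * (1.0 + scale[:, None, :]) + shift[:, None, :]
    # Gated residual: with gate == 0 the sublayer output is ignored.
    return x + gate[:, None, :] * sublayer(h)


B, T, D, D_c = 2, 4, 8, 6
rng = np.random.default_rng(0)
x = rng.normal(size=(B, T, D))
cond = rng.normal(size=(B, D_c))

# "Zero" init: the modulation projection starts at zero, so the block is an identity.
w_mod = np.zeros((D_c, 3 * D))
b_mod = np.zeros(3 * D)
out = adaln_zero_block(x, cond, w_mod, b_mod, sublayer=np.tanh)
print(np.allclose(out, x))
```
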
**VAE Architecture**:
- **Video VAE**: 128 latent channels, 8x temporal + 32x spatial compression
  - Encoder: `models/ltx/video_vae/encoder.py`
  - Decoder: `models/ltx/video_vae/decoder.py` (supports tiled decoding)
- **Audio VAE**: 8 latent channels, mel-spectrogram intermediate
  - Decoder: `models/ltx/audio_vae/decoder.py`
  - HiFi-GAN vocoder: `models/ltx/audio_vae/vocoder.py`

**Text Encoder** (`models/ltx/text_encoder.py`):
- Based on the Gemma 3 model
- Returns separate embeddings for video (3840-dim) and audio (2048-dim)
- Supports prompt enhancement via the `enhance_t2v()` method

**Tiling System** (`models/ltx/video_vae/tiling.py`):
- Memory-efficient decoding for large videos
- Modes: auto, default (512px/64f), aggressive (256px/32f), conservative (768px/96f)
- Supports streaming via `on_frames_ready` callback

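To get a feel for how the tiling modes trade memory for tile count, here is a minimal sketch. `TILING_PRESETS` and `count_tiles` are hypothetical names; the sizes come from the mode list above. The `auto` mode is left out because its selection rule is not described here, and tile overlap or blending, which a real tiler likely uses, is ignored.

```python
import math

# Presets from the mode list above: (spatial tile size in pixels, temporal tile size in frames).
TILING_PRESETS = {
    "default": (512, 64),
    "aggressive": (256, 32),
    "conservative": (768, 96),
}


def count_tiles(height: int, width: int, num_frames: int, mode: str = "default"):
    """Count (temporal, vertical, horizontal) tiles needed to cover a video.

    Overlap between neighbouring tiles is ignored in this sketch.
    """
    tile_px, tile_frames = TILING_PRESETS[mode]
    return (
        math.ceil(num_frames / tile_frames),
        math.ceil(height / tile_px),
        math.ceil(width / tile_px),
    )


for mode in TILING_PRESETS:
    print(mode, count_tiles(1024, 1536, 97, mode))
```

Smaller tiles (aggressive) lower the peak memory per decode call at the cost of more tiles; larger tiles (conservative) do the opposite.
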
### Key Patterns

**Position Grids**:
- Created in pixel space, then converted to latent space internally
- Video: (B, 3, num_patches, 2) with [start, end) bounds for temporal/spatial dims
- Audio: (B, 1, num_patches, 2) for temporal dimension only
- See `create_position_grid()` in generate modules

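The video grid layout described above can be illustrated with a small NumPy sketch. This is not the package's `create_position_grid()`: `make_video_position_grid` is a hypothetical helper, the `(8, 32, 32)` pixel scale is assumed from the VAE compression factors, and any special handling of the first frame is ignored.

```python
import numpy as np


def make_video_position_grid(batch, latent_frames, latent_h, latent_w, scale=(8, 32, 32)):
    """Return a (B, 3, num_patches, 2) array of [start, end) bounds per patch.

    Axis 1 indexes (time, height, width). `scale` maps latent indices to pixel
    space and is an assumption, not a value read from the package.
    """
    # Latent index of every patch along each axis, flattened to (3, num_patches).
    idx = np.stack(
        np.meshgrid(
            np.arange(latent_frames),
            np.arange(latent_h),
            np.arange(latent_w),
            indexing="ij",
        )
    ).reshape(3, -1)
    scale = np.asarray(scale).reshape(3, 1)
    start = idx * scale
    end = (idx + 1) * scale
    grid = np.stack([start, end], axis=-1)  # (3, num_patches, 2)
    return np.broadcast_to(grid, (batch, *grid.shape)).copy()


grid = make_video_position_grid(batch=1, latent_frames=2, latent_h=2, latent_w=3)
print(grid.shape)
print(grid[0, :, -1].tolist())  # bounds of the last patch along (time, height, width)
```

An audio grid would follow the same pattern with a single temporal axis, giving the (B, 1, num_patches, 2) shape.
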
**Latent Conditioning** (`conditioning/latent.py`):
- `LatentState` tracks clean latents, noise, and sigma values
- `VideoConditionByLatentIndex` enables I2V by conditioning specific frames
- `apply_denoise_mask()` protects conditioned regions during denoising

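The idea behind protecting conditioned regions can be sketched as a masked blend. The helper below is hypothetical and does not reproduce the signature of `apply_denoise_mask()`; it assumes a mask convention where 1 means "denoise" and 0 means "keep the clean latent", and uses NumPy in place of MLX arrays.

```python
import numpy as np


def blend_with_mask(denoised, clean, denoise_mask):
    """Keep `clean` where mask == 0 and take `denoised` where mask == 1."""
    return denoise_mask * denoised + (1.0 - denoise_mask) * clean


# Latents shaped (B, C, F, H, W); condition on the first latent frame (I2V-style).
clean = np.ones((1, 4, 3, 2, 2))
denoised = np.zeros_like(clean)
mask = np.ones((1, 1, 3, 1, 1))
mask[:, :, 0] = 0.0  # frame 0 is conditioned and must not change

out = blend_with_mask(denoised, clean, mask)
print(out[0, 0, :, 0, 0].tolist())  # [1.0, 0.0, 0.0]
```
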
**Weight Loading**:
- `convert.py`: Downloads from HuggingFace, converts PyTorch → MLX format
- Sanitization functions (`sanitize_transformer_weights`, `sanitize_vae_encoder_weights`) adapt keys
- Uses safetensors for efficient loading

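A sanitization pass of this kind usually renames keys and fixes tensor layouts. The sketch below is a generic illustration, not the package's `sanitize_*` functions: the `model.` prefix and the key names are hypothetical. The layout change reflects that PyTorch stores 2D convolution weights as (out, in, kH, kW) while MLX expects (out, kH, kW, in); 3D convolutions would need an analogous 5-D transpose.

```python
import numpy as np


def sanitize_weights(weights: dict, prefix: str = "model.") -> dict:
    """Strip a key prefix and move conv weights to channels-last layout."""
    sanitized = {}
    for key, value in weights.items():
        new_key = key[len(prefix):] if key.startswith(prefix) else key
        # Simplification: treat every 4-D tensor as a Conv2d weight.
        if value.ndim == 4:
            # PyTorch Conv2d: (out, in, kH, kW) -> MLX Conv2d: (out, kH, kW, in)
            value = value.transpose(0, 2, 3, 1)
        sanitized[new_key] = value
    return sanitized


torch_style = {
    "model.conv.weight": np.zeros((16, 3, 5, 7)),
    "model.proj.weight": np.zeros((32, 16)),
}
mlx_style = sanitize_weights(torch_style)
for name, array in mlx_style.items():
    print(name, array.shape)
```
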
## Key Conventions

### Model Configuration
- Always use `LTXModelConfig` to instantiate models
- `model_type` determines modality: `VideoOnly`, `AudioOnly`, or `AudioVideo`
- `rope_type=LTXRopeType.SPLIT` and `double_precision_rope=True` are standard

### Frame Count Requirements
- **Distilled model**: `num_frames = 1 + 8*k` format (e.g., 33, 65, 97)
- **Dev model**: No strict requirement, but odd numbers work better
- Audio frames auto-computed from video duration via `AUDIO_LATENTS_PER_SECOND`

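The `1 + 8*k` rule for the distilled model is easy to check or enforce before launching a run. Both helpers below are hypothetical conveniences, not functions from the package.

```python
def is_valid_num_frames(num_frames: int) -> bool:
    """True when num_frames has the form 1 + 8*k for some integer k >= 0."""
    return num_frames >= 1 and (num_frames - 1) % 8 == 0


def snap_num_frames_down(num_frames: int) -> int:
    """Largest value of the form 1 + 8*k that does not exceed num_frames."""
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    return 1 + 8 * ((num_frames - 1) // 8)


for requested in (33, 60, 100):
    print(requested, is_valid_num_frames(requested), snap_num_frames_down(requested))
```

Snapping down rather than to the nearest value keeps the result from exceeding the requested duration; 60 becomes 57 and 100 becomes 97.
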
### Dimension Constraints
- Video height/width must be divisible by 64 (VAE spatial compression)
- Latent dimensions are pixel dimensions divided by 32

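Both constraints can be combined into one up-front check. `check_video_size` is a hypothetical helper written for illustration; it applies only the two rules stated above.

```python
def check_video_size(height: int, width: int) -> tuple[int, int]:
    """Validate pixel dimensions and return the latent (height, width)."""
    for name, value in (("height", height), ("width", width)):
        if value <= 0 or value % 64 != 0:
            raise ValueError(f"{name}={value} must be a positive multiple of 64")
    return height // 32, width // 32


print(check_video_size(512, 768))  # (16, 24)
```
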
### Audio Constants
```python
AUDIO_SAMPLE_RATE = 24000  # Output sample rate
AUDIO_LATENT_SAMPLE_RATE = 16000  # VAE internal rate
AUDIO_HOP_LENGTH = 160  # Mel hop length
AUDIO_LATENT_CHANNELS = 8  # Audio latent channels
AUDIO_MEL_BINS = 16  # Mel frequency bins
```

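As a quick sanity check on how these constants relate, the sketch below derives sample and mel-frame counts for a clip duration. It assumes the mel hop is applied at the VAE's internal 16 kHz rate, which is an interpretation of the comments above rather than something confirmed here; the helper names are hypothetical. The value of `AUDIO_LATENTS_PER_SECOND` is not listed in this section, so latent frame counts are not derived.

```python
AUDIO_SAMPLE_RATE = 24000  # Output sample rate
AUDIO_LATENT_SAMPLE_RATE = 16000  # VAE internal rate
AUDIO_HOP_LENGTH = 160  # Mel hop length


def output_samples(duration_s: float) -> int:
    """Number of waveform samples in the final audio track."""
    return round(duration_s * AUDIO_SAMPLE_RATE)


def mel_frames(duration_s: float) -> int:
    """Mel frames for the clip, assuming the hop applies at the internal rate."""
    return round(duration_s * AUDIO_LATENT_SAMPLE_RATE / AUDIO_HOP_LENGTH)


print(output_samples(4.0))  # 96000
print(mel_frames(4.0))  # 400, i.e. 100 mel frames per second
```
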
### Sigma Schedules
The distilled model uses predefined schedules (no scheduler class):
```python
STAGE_1_SIGMAS = [1.0, 0.99375, 0.9875, 0.98125, 0.975, 0.909375, 0.725, 0.421875, 0.0]
STAGE_2_SIGMAS = [0.909375, 0.725, 0.421875, 0.0]
```

The dev model computes its schedule via the `ltx2_scheduler(steps)` function.

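To show how a sigma list like the ones above drives sampling, here is a minimal Euler loop. It assumes a flow-matching style update in which the model's clean-sample estimate defines a velocity; the actual sampler in the generate modules may differ. `euler_sample` and the toy predictor are hypothetical.

```python
STAGE_1_SIGMAS = [1.0, 0.99375, 0.9875, 0.98125, 0.975, 0.909375, 0.725, 0.421875, 0.0]
STAGE_2_SIGMAS = [0.909375, 0.725, 0.421875, 0.0]


def euler_sample(x, sigmas, predict_clean):
    """Run len(sigmas) - 1 Euler steps from sigmas[0] down to sigmas[-1]."""
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        denoised = predict_clean(x, sigma)
        velocity = (x - denoised) / sigma  # direction from clean estimate to current sample
        x = x + (sigma_next - sigma) * velocity
    return x


# Toy "model" that always predicts the same clean value.
target = 3.0
result = euler_sample(x=10.0, sigmas=STAGE_1_SIGMAS, predict_clean=lambda x, s: target)
print(result)

# The stage 2 schedule is the tail of the stage 1 schedule.
print(STAGE_1_SIGMAS[-4:] == STAGE_2_SIGMAS)
```

Because the final sigma is 0.0, the last step lands on the model's clean estimate. The division by `sigma` is safe since the final 0.0 only ever appears as `sigma_next`.
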
### Code Style
- Follow black formatting (configured in pre-commit)
- Import sorting: isort with black profile
- Remove unused imports (autoflake)
- Type hints encouraged but not enforced

### Modality Enum
Use `Modality.VIDEO` and `Modality.AUDIO` from `models/ltx/transformer.py` for multi-modal operations.

### Video Post-Processing
- `postprocess.py`: Contains utilities for frame normalization and video saving
- Always denormalize decoded frames from [-1, 1] to [0, 255] before saving
- Use opencv-python for video I/O

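The denormalization step amounts to a clip, an affine rescale, and a cast. The sketch below is a generic version using NumPy; `to_uint8_frames` is a hypothetical name and may differ from the utilities in `postprocess.py`.

```python
import numpy as np


def to_uint8_frames(frames: np.ndarray) -> np.ndarray:
    """Map float frames in [-1, 1] to uint8 in [0, 255], clipping out-of-range values."""
    scaled = (np.clip(frames, -1.0, 1.0) + 1.0) * 127.5
    return np.round(scaled).astype(np.uint8)


frames = np.array([-1.2, -1.0, 0.0, 1.0])
pixels = to_uint8_frames(frames)
print(pixels.tolist())  # [0, 0, 128, 255]
```

Clipping before the cast matters because out-of-range floats would otherwise wrap around when converted to `uint8`. When writing with OpenCV, keep in mind that it expects BGR channel order.
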
## Python Requirements
- Python >= 3.11
- MLX >= 0.22.0
- Primary dependencies: numpy, safetensors, transformers, opencv-python, Pillow, mlx-vlm, scipy, librosa
- Package manager: uv recommended for faster installs, pip also supported