mlx-video/.github/copilot-instructions.md

# MLX-Video Copilot Instructions

## Overview

MLX-Video is a video/audio generation package using Apple MLX framework. It implements the LTX-2 model (19B parameter DiT) for text-to-video, image-to-video, and audio-video generation, optimized for Apple Silicon.

## Build, Test, and Lint

### Testing
```bash
# Install test dependencies first (pytest not in main deps)
pip install pytest

# Run all tests
python -m pytest tests/

# Run specific test file
python -m pytest tests/test_generate_dev.py

# Run specific test
python -m pytest tests/test_generate_dev.py::TestLTX2Scheduler::test_scheduler_output_shape
```

### Linting
Pre-commit hooks configured with:
- **black**: Code formatting
- **isort**: Import sorting (profile: black)
- **autoflake**: Remove unused imports

```bash
# Run pre-commit manually
pre-commit run --all-files
```

### Running Generation
```bash
# Quick test - distilled model (two-stage pipeline)
python -m mlx_video.generate --prompt "test video" --num-frames 33

# Dev model with CFG (single-stage, higher quality)
python -m mlx_video.generate_dev --prompt "test video" --steps 40 --cfg-scale 4.0

# Audio-video generation
python -m mlx_video.generate_av --prompt "test video" --output-path out.mp4 --output-audio out.wav
```

## Architecture

### Two-Stage Pipeline (Distilled Model)
The distilled model (`generate.py`) uses a two-stage approach for efficiency:
1. **Stage 1**: Generate at half resolution with 8 denoising steps using STAGE_1_SIGMAS
2. **Upsampler**: 2x spatial upsampling via LatentUpsampler
3. **Stage 2**: Refine at full resolution with 3 steps using STAGE_2_SIGMAS
4. **VAE Decoder**: Convert latents to RGB video (tiled decoding for memory efficiency)

### Single-Stage Pipeline (Dev Model)
The dev model (`generate_dev.py`) uses classifier-free guidance (CFG):
- Full resolution generation with configurable steps (typically 40)
- CFG guidance scale controls prompt adherence vs. diversity
- More flexible but slower than distilled model

### Core Components

**DiT Transformer** (`models/ltx/ltx.py`):
- 48 layers, 32 attention heads, 128 dim per head
- Dual modality support: video (3840-dim) and audio (2048-dim) embeddings
- Uses RoPE (Rotary Position Embeddings) in SPLIT mode with double precision
- AdaLN-Zero conditioning blocks inject timestep/text embeddings

**VAE Architecture**:
- **Video VAE**: 128 latent channels, 8x temporal + 32x spatial compression
  - Encoder: `models/ltx/video_vae/encoder.py`
  - Decoder: `models/ltx/video_vae/decoder.py` (supports tiled decoding)
- **Audio VAE**: 8 latent channels, mel-spectrogram intermediate
  - Decoder: `models/ltx/audio_vae/decoder.py`
  - HiFi-GAN vocoder: `models/ltx/audio_vae/vocoder.py`

**Text Encoder** (`models/ltx/text_encoder.py`):
- Based on Gemma 3 model
- Returns separate embeddings for video (3840-dim) and audio (2048-dim)
- Supports prompt enhancement via `enhance_t2v()` method

**Tiling System** (`models/ltx/video_vae/tiling.py`):
- Memory-efficient decoding for large videos
- Modes: auto, default (512px/64f), aggressive (256px/32f), conservative (768px/96f)
- Supports streaming via `on_frames_ready` callback

### Key Patterns

**Position Grids**:
- Created in pixel space, then converted to latent space internally
- Video: (B, 3, num_patches, 2) with [start, end) bounds for temporal/spatial dims
- Audio: (B, 1, num_patches, 2) for temporal dimension only
- See `create_position_grid()` in generate modules

**Latent Conditioning** (`conditioning/latent.py`):
- `LatentState` tracks clean latents, noise, and sigma values
- `VideoConditionByLatentIndex` enables I2V by conditioning specific frames
- `apply_denoise_mask()` protects conditioned regions during denoising

**Weight Loading**:
- `convert.py`: Downloads from HuggingFace, converts PyTorch → MLX format
- Sanitization functions (`sanitize_transformer_weights`, `sanitize_vae_encoder_weights`) adapt keys
- Uses safetensors for efficient loading

## Key Conventions

### Model Configuration
- Always use `LTXModelConfig` to instantiate models
- `model_type` determines modality: `VideoOnly`, `AudioOnly`, or `AudioVideo`
- `rope_type=LTXRopeType.SPLIT` and `double_precision_rope=True` are standard

### Frame Count Requirements
- **Distilled model**: `num_frames = 1 + 8*k` format (e.g., 33, 65, 97)
- **Dev model**: No strict requirement, but odd numbers work better
- Audio frames auto-computed from video duration via `AUDIO_LATENTS_PER_SECOND`

### Dimension Constraints
- Video height/width must be divisible by 64 (VAE spatial compression)
- Latent dimensions are pixel dimensions divided by 32

### Audio Constants
```python
AUDIO_SAMPLE_RATE = 24000          # Output sample rate
AUDIO_LATENT_SAMPLE_RATE = 16000   # VAE internal rate
AUDIO_HOP_LENGTH = 160             # Mel hop length
AUDIO_LATENT_CHANNELS = 8          # Audio latent channels
AUDIO_MEL_BINS = 16                # Mel frequency bins
```

### Sigma Schedules
Distilled model uses predefined schedules (no scheduler class):
```python
STAGE_1_SIGMAS = [1.0, 0.99375, 0.9875, 0.98125, 0.975, 0.909375, 0.725, 0.421875, 0.0]
STAGE_2_SIGMAS = [0.909375, 0.725, 0.421875, 0.0]
```

Dev model computes schedules via `ltx2_scheduler(steps)` function.

### Code Style
- Follow black formatting (configured in pre-commit)
- Import sorting: isort with black profile
- Remove unused imports (autoflake)
- Type hints encouraged but not enforced

### Modality Enum
Use `Modality.VIDEO` and `Modality.AUDIO` from `models/ltx/transformer.py` for multi-modal operations.

### Video Post-Processing
- `postprocess.py`: Contains utilities for frame normalization and video saving
- Always denormalize latents from [-1, 1] to [0, 255] before saving
- Use opencv-python for video I/O

## Python Requirements
- Python >= 3.11
- MLX >= 0.22.0
- Primary dependencies: numpy, safetensors, transformers, opencv-python, Pillow, mlx-vlm, scipy, librosa
- Package manager: uv recommended for faster installs, pip also supported