MLX-Video Copilot Instructions

Overview

MLX-Video is a video and audio generation package built on Apple's MLX framework. It implements the LTX-2 model (a 19B-parameter DiT) for text-to-video, image-to-video, and audio-video generation, and is optimized for Apple Silicon.

Build, Test, and Lint

Testing

# Install test dependencies first (pytest not in main deps)
pip install pytest

# Run all tests
python -m pytest tests/

# Run specific test file
python -m pytest tests/test_generate_dev.py

# Run specific test
python -m pytest tests/test_generate_dev.py::TestLTX2Scheduler::test_scheduler_output_shape

Linting

Pre-commit hooks are configured with:

  • black: Code formatting
  • isort: Import sorting (profile: black)
  • autoflake: Remove unused imports

# Run pre-commit manually
pre-commit run --all-files

Running Generation

# Quick test - distilled model (two-stage pipeline)
python -m mlx_video.generate --prompt "test video" --num-frames 33

# Dev model with CFG (single-stage, higher quality)
python -m mlx_video.generate_dev --prompt "test video" --steps 40 --cfg-scale 4.0

# Audio-video generation
python -m mlx_video.generate_av --prompt "test video" --output-path out.mp4 --output-audio out.wav

Architecture

Two-Stage Pipeline (Distilled Model)

The distilled model (generate.py) uses a two-stage approach for efficiency:

  1. Stage 1: Generate at half resolution with 8 denoising steps using STAGE_1_SIGMAS
  2. Upsampler: 2x spatial upsampling via LatentUpsampler
  3. Stage 2: Refine at full resolution with 3 steps using STAGE_2_SIGMAS
  4. VAE Decoder: Convert latents to RGB video (tiled decoding for memory efficiency)

Single-Stage Pipeline (Dev Model)

The dev model (generate_dev.py) uses classifier-free guidance (CFG):

  • Full resolution generation with configurable steps (typically 40)
  • CFG guidance scale controls prompt adherence vs. diversity
  • More flexible but slower than the distilled model
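The guidance scale mentioned above is usually applied with the standard classifier-free guidance blend of a conditional and an unconditional prediction. The following is a minimal sketch assuming that formulation; `cfg_combine` is a hypothetical helper name for illustration, not a function taken from generate_dev.py.

```python
def cfg_combine(cond, uncond, cfg_scale):
    """Standard classifier-free guidance blend (assumed formulation).

    Works elementwise on floats or on any array type that supports
    arithmetic, such as mlx.core.array or numpy.ndarray.
    cfg_scale = 1.0 returns the conditional prediction unchanged;
    larger values push the result further away from `uncond`.
    """
    return uncond + cfg_scale * (cond - uncond)


# Scalar example: the guided value overshoots `cond`, away from `uncond`.
guided = cfg_combine(cond=1.0, uncond=0.5, cfg_scale=4.0)  # 2.5
```

With `--cfg-scale 4.0` the difference between the two predictions is amplified four times, which is why higher scales trade diversity for prompt adherence.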

Core Components

DiT Transformer (models/ltx/ltx.py):

  • 48 layers, 32 attention heads, 128 dim per head
  • Dual modality support: video (3840-dim) and audio (2048-dim) embeddings
  • Uses RoPE (Rotary Position Embeddings) in SPLIT mode with double precision
  • AdaLN-Zero conditioning blocks inject timestep/text embeddings

VAE Architecture:

  • Video VAE: 128 latent channels, 8x temporal and 32x spatial compression
    • Encoder: models/ltx/video_vae/encoder.py
    • Decoder: models/ltx/video_vae/decoder.py (supports tiled decoding)
  • Audio VAE: 8 latent channels, mel-spectrogram intermediate
    • Decoder: models/ltx/audio_vae/decoder.py
    • HiFi-GAN vocoder: models/ltx/audio_vae/vocoder.py

Text Encoder (models/ltx/text_encoder.py):

  • Based on the Gemma 3 model
  • Returns separate embeddings for video (3840-dim) and audio (2048-dim)
  • Supports prompt enhancement via enhance_t2v() method

Tiling System (models/ltx/video_vae/tiling.py):

  • Memory-efficient decoding for large videos
  • Modes: auto, default (512px/64f), aggressive (256px/32f), conservative (768px/96f)
  • Supports streaming via on_frames_ready callback
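To make the tiling modes concrete, here is a rough sketch of how overlapping tile offsets could be computed along one axis. The tile sizes come from the list above; the `tile_starts` helper and the overlap of 64 are assumptions for illustration, and the actual logic in tiling.py may differ (for example in how overlaps are blended).

```python
# Tile sizes per mode, from the list above: (spatial pixels, frames).
TILE_SIZES = {
    "default": (512, 64),
    "aggressive": (256, 32),
    "conservative": (768, 96),
}


def tile_starts(total, tile, overlap):
    """Start offsets of overlapping tiles that cover [0, total).

    Hypothetical helper: `overlap` must be smaller than `tile`.
    """
    if total <= tile:
        return [0]  # a single tile already covers the whole axis
    stride = tile - overlap
    starts = list(range(0, total - tile, stride))
    starts.append(total - tile)  # final tile sits flush with the end
    return starts


tile_px, tile_frames = TILE_SIZES["default"]
x_starts = tile_starts(1024, tile_px, overlap=64)  # [0, 448, 512]
```

Smaller tiles (aggressive mode) lower peak memory at the cost of more decoder calls and more seams to blend.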

Key Patterns

Position Grids:

  • Created in pixel space, then converted to latent space internally
  • Video: (B, 3, num_patches, 2) with [start, end) bounds for temporal/spatial dims
  • Audio: (B, 1, num_patches, 2) for temporal dimension only
  • See create_position_grid() in generate modules
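The video grid layout described above can be illustrated with a small NumPy sketch. This is not the repository's create_position_grid(); `sketch_position_grid` is a hypothetical stand-in, and the pixel-space scale factors (8, 32, 32) are assumed from the VAE compression ratios. Any special handling of the first frame is ignored here.

```python
import numpy as np


def sketch_position_grid(batch, frames, height, width, scale=(8, 32, 32)):
    """Build a (B, 3, num_patches, 2) grid of [start, end) bounds.

    frames/height/width are latent-grid sizes. `scale` maps one latent
    cell to its pixel-space extent per axis (assumed values).
    """
    axes = [np.arange(n) for n in (frames, height, width)]
    # (3, F, H, W): the (t, y, x) index of every latent cell.
    starts = np.stack(np.meshgrid(*axes, indexing="ij"), axis=0)
    starts = starts.reshape(3, -1)  # (3, num_patches)
    factors = np.asarray(scale).reshape(3, 1)
    # Last axis holds [start, end) for each patch along each dimension.
    bounds = np.stack([starts * factors, (starts + 1) * factors], axis=-1)
    return np.broadcast_to(bounds, (batch, *bounds.shape)).copy()


grid = sketch_position_grid(batch=2, frames=2, height=3, width=4)
```

Here `grid.shape` is `(2, 3, 24, 2)`: 24 patches, each with a half-open interval along the temporal, vertical, and horizontal axes.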

Latent Conditioning (conditioning/latent.py):

  • LatentState tracks clean latents, noise, and sigma values
  • VideoConditionByLatentIndex enables I2V by conditioning specific frames
  • apply_denoise_mask() protects conditioned regions during denoising
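The idea of protecting conditioned regions can be shown with a simple blend. This is a sketch only: `blend_with_mask` is a hypothetical name, and the mask convention (1 = denoise freely, 0 = keep the conditioning latent) is an assumption rather than the documented behavior of apply_denoise_mask().

```python
def blend_with_mask(denoised, clean, denoise_mask):
    """Keep conditioned regions fixed while the rest is denoised.

    Assumed convention: denoise_mask is 1 where the model output should
    be used and 0 where the clean conditioning latent must be preserved.
    Works on floats or on arrays that support elementwise arithmetic.
    """
    return denoised * denoise_mask + clean * (1 - denoise_mask)


# A fully conditioned position keeps the clean value.
kept = blend_with_mask(denoised=0.8, clean=0.2, denoise_mask=0.0)
# An unconditioned position takes the model output.
free = blend_with_mask(denoised=0.8, clean=0.2, denoise_mask=1.0)
```

For image-to-video, the mask would be 0 on the latent frames supplied by the input image and 1 everywhere else.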

Weight Loading:

  • convert.py: Downloads from HuggingFace, converts PyTorch → MLX format
  • Sanitization functions (sanitize_transformer_weights, sanitize_vae_encoder_weights) adapt keys
  • Uses safetensors for efficient loading

Key Conventions

Model Configuration

  • Always use LTXModelConfig to instantiate models
  • model_type determines modality: VideoOnly, AudioOnly, or AudioVideo
  • rope_type=LTXRopeType.SPLIT and double_precision_rope=True are standard

Frame Count Requirements

  • Distilled model: num_frames must have the form 1 + 8*k (e.g., 33, 65, 97)
  • Dev model: No strict requirement, but odd numbers work better
  • Audio frames auto-computed from video duration via AUDIO_LATENTS_PER_SECOND
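The 1 + 8*k rule is easy to check before launching a long generation. The helpers below are hypothetical utilities written for illustration; they are not part of the package.

```python
def is_valid_num_frames(num_frames):
    """True if num_frames has the form 1 + 8*k with k >= 0."""
    return num_frames >= 1 and (num_frames - 1) % 8 == 0


def nearest_valid_num_frames(num_frames):
    """Snap a requested frame count to a nearby valid 1 + 8*k value."""
    k = max(0, round((num_frames - 1) / 8))
    return 1 + 8 * k


requested = 60
snapped = nearest_valid_num_frames(requested)  # 57
```

Validating early avoids loading the model weights only to fail on a frame count such as 32 or 60.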

Dimension Constraints

  • Video height/width must be divisible by 64 (32x VAE spatial compression, doubled so the half-resolution first stage also maps to whole latents)
  • Latent dimensions are pixel dimensions divided by 32
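As a worked example of these constraints, the sketch below computes latent sizes for a given resolution. `latent_hw` is a hypothetical helper; the half-resolution figure assumes the distilled pipeline's first stage halves the pixel dimensions before the 32x compression.

```python
def latent_hw(height, width):
    """Return (full_res_latent, half_res_latent) spatial sizes.

    Raises ValueError when the pixel dimensions break the 64-pixel rule.
    """
    if height % 64 or width % 64:
        raise ValueError("height and width must be divisible by 64")
    full = (height // 32, width // 32)
    # Assumption: stage 1 works on half the pixel resolution.
    half = (height // 64, width // 64)
    return full, half


full, half = latent_hw(512, 768)  # (16, 24) and (8, 12)
```

A height of 480 would pass a 32-pixel check but fail here, because its half-resolution version (240) is not a multiple of 32.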

Audio Constants

AUDIO_SAMPLE_RATE = 24000          # Output sample rate
AUDIO_LATENT_SAMPLE_RATE = 16000   # VAE internal rate
AUDIO_HOP_LENGTH = 160             # Mel hop length
AUDIO_LATENT_CHANNELS = 8          # Audio latent channels
AUDIO_MEL_BINS = 16                # Mel frequency bins
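
These constants imply a mel frame rate that can be derived directly. The sketch assumes the hop length is applied at the VAE internal rate (16000 Hz); it does not attempt to derive AUDIO_LATENTS_PER_SECOND, since any further temporal downsampling inside the audio VAE is not stated here. `mel_frames_for` is a hypothetical helper.

```python
AUDIO_LATENT_SAMPLE_RATE = 16000  # VAE internal rate (from the constants above)
AUDIO_HOP_LENGTH = 160            # Mel hop length (from the constants above)

# One mel frame is produced every AUDIO_HOP_LENGTH samples.
mel_frames_per_second = AUDIO_LATENT_SAMPLE_RATE // AUDIO_HOP_LENGTH  # 100
hop_ms = 1000 * AUDIO_HOP_LENGTH / AUDIO_LATENT_SAMPLE_RATE           # 10.0


def mel_frames_for(duration_seconds):
    """Approximate number of mel frames for a clip of the given length."""
    return round(duration_seconds * mel_frames_per_second)
```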

Sigma Schedules

The distilled model uses predefined schedules (no scheduler class):

STAGE_1_SIGMAS = [1.0, 0.99375, 0.9875, 0.98125, 0.975, 0.909375, 0.725, 0.421875, 0.0]
STAGE_2_SIGMAS = [0.909375, 0.725, 0.421875, 0.0]

The dev model computes its schedule via the ltx2_scheduler(steps) function.
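
A schedule with N + 1 sigma values drives N denoising steps, which is how the predefined lists match the 8-step and 3-step stages of the distilled pipeline. The short sketch below checks this; `sigma_pairs` is a hypothetical helper, and the loop structure of the real sampler is not shown.

```python
STAGE_1_SIGMAS = [1.0, 0.99375, 0.9875, 0.98125, 0.975, 0.909375, 0.725, 0.421875, 0.0]
STAGE_2_SIGMAS = [0.909375, 0.725, 0.421875, 0.0]


def sigma_pairs(sigmas):
    """Consecutive (sigma, sigma_next) pairs, one per denoising step."""
    return list(zip(sigmas[:-1], sigmas[1:]))


stage_1_steps = len(sigma_pairs(STAGE_1_SIGMAS))  # 8
stage_2_steps = len(sigma_pairs(STAGE_2_SIGMAS))  # 3

# Stage 2 reuses the tail of the stage 1 schedule.
stage_2_is_tail = STAGE_1_SIGMAS[-len(STAGE_2_SIGMAS):] == STAGE_2_SIGMAS
```

Because stage 2 starts at sigma 0.909375 rather than 1.0, the upsampled latents are only partially re-noised before refinement.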

Code Style

  • Follow black formatting (configured in pre-commit)
  • Import sorting: isort with black profile
  • Remove unused imports (autoflake)
  • Type hints encouraged but not enforced

Modality Enum

Use Modality.VIDEO and Modality.AUDIO from models/ltx/transformer.py for multi-modal operations.

Video Post-Processing

  • postprocess.py: Contains utilities for frame normalization and video saving
  • Always denormalize decoded frames from [-1, 1] to [0, 255] (uint8) before saving
  • Use opencv-python for video I/O
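
The [-1, 1] to [0, 255] mapping above can be sketched with NumPy as follows. `to_uint8` is a hypothetical helper, not the function exposed by postprocess.py, and it assumes the input values are nominally in [-1, 1].

```python
import numpy as np


def to_uint8(frames):
    """Map values from [-1, 1] to uint8 [0, 255], clipping any overshoot."""
    frames = np.asarray(frames, dtype=np.float32)
    scaled = (frames + 1.0) * 127.5
    return np.clip(scaled, 0, 255).round().astype(np.uint8)


pixels = to_uint8([-1.0, 1.0, 2.0])  # values 0, 255, 255
```

Clipping before the cast matters: casting out-of-range floats straight to uint8 is not guaranteed to saturate and can produce wrapped or otherwise wrong pixel values.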

Python Requirements

  • Python >= 3.11
  • MLX >= 0.22.0
  • Primary dependencies: numpy, safetensors, transformers, opencv-python, Pillow, mlx-vlm, scipy, librosa
  • Package manager: uv recommended for faster installs, pip also supported