Prince Canuma f346e09de4 Refactor audio handling in generate_video function to preserve stage 1 audio latents during stage 2 processing. Remove redundant audio re-denoising steps, ensuring audio integrity while refining video output. Update comments for clarity on audio processing logic.

2026-03-13 16:09:07 +01:00

mlx-video

MLX-Video is a package for inference and fine-tuning of image, video, and audio generation models on your Mac using MLX.

Installation

Install from source:

Option 1: Install with pip (requires git):

pip install git+https://github.com/Blaizzy/mlx-video.git

Option 2: Install with uv (ultra-fast package manager, optional):

uv pip install git+https://github.com/Blaizzy/mlx-video.git

Supported models:

LTX-2

LTX-2 is a 19B-parameter video generation model from Lightricks.

Features

Text-to-video (T2V) and Image-to-video (I2V) generation
Three pipeline modes: Distilled, Dev, and Dev Two-Stage
Synchronized audio-video generation (experimental)
LoRA support (including HuggingFace repos)
Prompt enhancement via Gemma
2x spatial upscaling for images and videos
Optimized for Apple Silicon using MLX

Usage

Pipelines

mlx-video supports three pipeline types via the --pipeline flag:

Pipeline	Description	CFG	Stages	Speed
`distilled` (default)	Fixed sigma schedule, no CFG	No	2 (8+3 steps)	Fastest
`dev`	Dynamic sigmas, constant CFG	Yes	1 (30 steps)	Medium
`dev-two-stage`	Dev + LoRA refinement	Yes (stage 1)	2 (30+3 steps)	Slowest, highest quality
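
The table above can also be read as a small lookup structure. The following is a minimal sketch that restates it in Python; `PipelinePreset` and `PRESETS` are hypothetical names used for illustration and are not part of the mlx-video API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelinePreset:
    """One row of the pipeline table (hypothetical helper, not mlx-video code)."""

    name: str
    uses_cfg: bool       # for dev-two-stage, CFG applies to stage 1 only
    stage_steps: tuple   # denoising steps per stage, in order

    @property
    def total_steps(self) -> int:
        return sum(self.stage_steps)


PRESETS = {
    "distilled": PipelinePreset("distilled", uses_cfg=False, stage_steps=(8, 3)),
    "dev": PipelinePreset("dev", uses_cfg=True, stage_steps=(30,)),
    "dev-two-stage": PipelinePreset("dev-two-stage", uses_cfg=True, stage_steps=(30, 3)),
}

for preset in PRESETS.values():
    print(f"{preset.name}: {len(preset.stage_steps)} stage(s), {preset.total_steps} steps")
# distilled: 2 stage(s), 11 steps
# dev: 1 stage(s), 30 steps
# dev-two-stage: 2 stage(s), 33 steps
```

The step counts for `dev` and `dev-two-stage` assume the default `--steps` value of 30.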

Text-to-Video

# Distilled (default) - fast, two-stage
uv run mlx_video.generate --prompt "Two dogs wearing sunglasses, cinematic, sunset" -n 97 --width 768

# Dev - single-stage with CFG
uv run mlx_video.generate --pipeline dev --prompt "A cinematic scene" --cfg-scale 3.0

# Dev two-stage - dev + LoRA refinement (highest quality)
uv run mlx_video.generate --pipeline dev-two-stage \
    --prompt "Two dogs of the poodle breed wearing sunglasses, close up, cinematic, sunset" \
    -n 145 --width 1024 --height 768 \
    --model-repo prince-canuma/LTX-2-dev \
    --cfg-scale 3.0 --lora-strength 0.8 \
    --enhance-prompt

Image-to-Video

# Distilled I2V
uv run mlx_video.generate --prompt "A person dancing" --image photo.jpg

# Dev I2V
uv run mlx_video.generate --pipeline dev --prompt "Waves crashing" --image beach.png --cfg-scale 3.5

Audio-Video (experimental)

uv run mlx_video.generate --prompt "Ocean waves crashing" --audio
uv run mlx_video.generate --pipeline dev --prompt "A jazz band playing" --audio --enhance-prompt

LoRA

LoRA weights can be loaded from a file, directory, or HuggingFace repo:

# From HuggingFace repo
uv run mlx_video.generate --pipeline dev-two-stage \
    --prompt "Camera dolly out of a forest" \
    --lora-path Lightricks/LTX-2-19b-LoRA-Camera-Control-Dolly-Out \
    --lora-strength 1.0

# From local file
uv run mlx_video.generate --pipeline dev-two-stage \
    --prompt "A scene" \
    --lora-path ./my-lora/weights.safetensors

# From local directory (auto-detects .safetensors file)
uv run mlx_video.generate --pipeline dev-two-stage \
    --prompt "A scene" \
    --lora-path ./LTX-2-distilled/lora
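
To illustrate how one `--lora-path` value can cover all three cases, here is a rough sketch of a resolution order: existing file first, then directory, then repo id. `classify_lora_path` is a hypothetical helper; the actual logic in `generate.py` may differ, and this sketch does not download anything.

```python
from pathlib import Path


def classify_lora_path(lora_path: str) -> tuple[str, str]:
    """Return (kind, location) for a --lora-path value.

    Hypothetical helper for illustration; not mlx-video's own code.
    """
    path = Path(lora_path).expanduser()
    if path.is_file():
        return "file", str(path)
    if path.is_dir():
        # Mirror the "auto-detects .safetensors file" behaviour described above.
        candidates = sorted(path.glob("*.safetensors"))
        if not candidates:
            raise FileNotFoundError(f"No .safetensors file in {path}")
        return "file", str(candidates[0])
    # Nothing on disk matches, so assume an "owner/name" Hugging Face repo id.
    if lora_path.count("/") == 1 and not lora_path.startswith((".", "/", "~")):
        return "hf_repo", lora_path
    raise FileNotFoundError(f"{lora_path} is neither a local path nor a repo id")


print(classify_lora_path("Lightricks/LTX-2-19b-LoRA-Camera-Control-Dolly-Out"))
```

Checking the filesystem before falling back to a repo id means a local directory always wins over a repo with the same name, which is an assumption of this sketch rather than documented behaviour.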

Upscaling

# Upscale an image 2x
uv run mlx_video.upscale --input photo.png --output upscaled.png

# Upscale a video 2x
uv run mlx_video.upscale --input video.mp4 --output upscaled.mp4

# Upscale with refinement (higher quality, requires a text prompt)
uv run mlx_video.upscale --input video.mp4 --output upscaled.mp4 --refine --prompt "A cinematic scene"

CLI Options

Option	Default	Description
`--prompt`, `-p`	(required)	Text description of the video
`--pipeline`	`distilled`	Pipeline type: `distilled`, `dev`, or `dev-two-stage`
`--height`, `-H`	512	Output height (must be divisible by 64 for the two-stage pipelines, 32 for `dev`)
`--width`, `-W`	512	Output width (must be divisible by 64 for the two-stage pipelines, 32 for `dev`)
`--num-frames`, `-n`	33	Number of frames (must be 1 + 8*k for an integer k >= 0, e.g. 33, 97, 145)
`--seed`, `-s`	42	Random seed for reproducibility
`--fps`	24	Frames per second
`--output-path`, `-o`	output.mp4	Output video path
`--model-repo`	Lightricks/LTX-2	HuggingFace model repository
`--text-encoder-repo`	None	Separate text encoder repo (if not in model repo)
`--save-frames`	false	Save individual frames as images
`--enhance-prompt`	false	Enhance prompt using Gemma
`--image`, `-i`	None	Conditioning image for I2V
`--image-strength`	1.0	Conditioning strength for I2V
`--audio`, `-a`	false	Enable synchronized audio generation
`--tiling`	`auto`	VAE tiling mode: `auto`, `none`, `aggressive`, `conservative`
`--stream`	false	Stream frames as they decode
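
The frame-count and size constraints in the table above are easy to check before launching a long generation. The helpers below are a minimal sketch with hypothetical names; they only encode the rules stated in the table and are not part of mlx-video.

```python
def is_valid_num_frames(num_frames: int) -> bool:
    """True when num_frames has the form 1 + 8*k for some integer k >= 0."""
    return num_frames >= 1 and (num_frames - 1) % 8 == 0


def nearest_valid_num_frames(num_frames: int) -> int:
    """Round to the closest valid frame count (ties round up)."""
    k = max(0, (num_frames - 1 + 4) // 8)
    return 1 + 8 * k


def required_multiple(pipeline: str) -> int:
    """Divisor for --height/--width: 32 for dev, 64 for the two-stage pipelines."""
    return 32 if pipeline == "dev" else 64


def clip_seconds(num_frames: int, fps: int = 24) -> float:
    """Clip duration implied by the frame count and frame rate."""
    return num_frames / fps


print(is_valid_num_frames(97))        # True
print(nearest_valid_num_frames(100))  # 97
print(768 % required_multiple("dev-two-stage") == 0)  # True
print(clip_seconds(33))               # 1.375
```

With the defaults (33 frames at 24 fps), the output clip is a little under 1.4 seconds long.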

Dev/Dev-Two-Stage options:

Option	Default	Description
`--steps`	30	Number of denoising steps
`--cfg-scale`	3.0	CFG guidance scale
`--cfg-rescale`	0.7	CFG rescale factor (reduces over-saturation)
`--negative-prompt`	(default)	Negative prompt for CFG
`--apg`	false	Use Adaptive Projected Guidance (more stable for I2V)
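
For readers unfamiliar with `--cfg-scale` and `--cfg-rescale`, the sketch below shows the formulation commonly used in diffusion pipelines: classifier-free guidance followed by a standard-deviation rescale. It operates on flat Python lists for clarity and is an assumption about the general technique, not a copy of mlx-video's implementation, which works on tensors. `guided_prediction` is a hypothetical name.

```python
from statistics import pstdev


def guided_prediction(cond, uncond, cfg_scale=3.0, cfg_rescale=0.7):
    """Combine conditional and unconditional predictions (illustrative only)."""
    # Standard CFG: push the prediction away from the unconditional branch.
    cfg = [u + cfg_scale * (c - u) for c, u in zip(cond, uncond)]
    std_cfg = pstdev(cfg)
    if cfg_rescale == 0.0 or std_cfg == 0.0:
        return cfg
    # Rescale so the guided output's standard deviation matches the conditional one.
    factor = pstdev(cond) / std_cfg
    # Blend the rescaled and plain CFG outputs.
    return [cfg_rescale * (x * factor) + (1.0 - cfg_rescale) * x for x in cfg]


cond = [0.2, -0.4, 0.9, 0.1]
uncond = [0.1, -0.1, 0.3, 0.0]
print(guided_prediction(cond, uncond))
```

A rescale factor of 0 leaves plain CFG untouched, while 1 fully matches the conditional branch's spread; the default of 0.7 sits between the two.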

Dev-Two-Stage LoRA options:

Option	Default	Description
`--lora-path`	auto-detect	Path to LoRA file, directory, or HuggingFace repo
`--lora-strength`	1.0	LoRA merge strength
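
As a rough guide to what "merge strength" means, a LoRA update is usually a low-rank product added to a base weight matrix and scaled by the strength. The sketch below shows that idea with nested lists; `merge_lora` is a hypothetical helper, and any extra scaling (such as alpha/rank) that a particular LoRA format applies is deliberately left out.

```python
def merge_lora(weight, lora_a, lora_b, strength=1.0):
    """Return weight + strength * (lora_b @ lora_a) using nested lists.

    Shapes: weight is out x in, lora_b is out x rank, lora_a is rank x in.
    Illustrative only; not mlx-video's merge code.
    """
    rank = len(lora_a)
    merged = []
    for i, row in enumerate(weight):
        merged.append([
            w + strength * sum(lora_b[i][r] * lora_a[r][j] for r in range(rank))
            for j, w in enumerate(row)
        ])
    return merged


base = [[1.0, 0.0], [0.0, 1.0]]
lora_b = [[1.0], [2.0]]   # out x rank
lora_a = [[0.5, 0.0]]     # rank x in
print(merge_lora(base, lora_a, lora_b, strength=0.8))
```

A strength of 0 returns the base weights unchanged, and values below 1.0 apply only part of the LoRA update.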

How It Works

Distilled Pipeline (default)

Stage 1: Generate at half resolution with 8 denoising steps (fixed sigmas)
Upsample: 2x spatial upsampling via LatentUpsampler
Stage 2: Refine at full resolution with 3 denoising steps
Decode: VAE decoder converts latents to RGB video

Dev Pipeline

Generate: Full resolution with configurable steps and constant CFG
Decode: VAE decoder converts latents to RGB video

Dev Two-Stage Pipeline

Stage 1: Dev denoising at half resolution with CFG
Upsample: 2x spatial upsampling via LatentUpsampler
Stage 2: Distilled refinement at full resolution with LoRA weights (3 steps, no CFG)
Decode: VAE decoder converts latents to RGB video
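
The stage layouts above also suggest one plausible reading of the size constraints in the CLI options: if stage 1 runs at half resolution, a size divisible by 64 halves to a size divisible by 32, which matches the single-stage requirement. The sketch below encodes that reading; `stage_resolutions` is a hypothetical helper and the reasoning is an inference from this README, not a statement about the implementation.

```python
def stage_resolutions(height: int, width: int, two_stage: bool = True):
    """Return the (height, width) each denoising stage runs at (illustrative)."""
    multiple = 64 if two_stage else 32
    if height % multiple or width % multiple:
        raise ValueError(f"height and width must be divisible by {multiple}")
    if not two_stage:
        return [(height, width)]
    # Stage 1 runs at half resolution; the 2x upsample restores the full size.
    return [(height // 2, width // 2), (height, width)]


print(stage_resolutions(768, 1024))                  # [(384, 512), (768, 1024)]
print(stage_resolutions(512, 512, two_stage=False))  # [(512, 512)]
```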

Requirements

macOS with Apple Silicon
Python >= 3.11
MLX >= 0.22.0
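
A quick way to check these requirements before installing models is sketched below. `environment_issues` is a hypothetical helper; it assumes the MLX distribution is published under the name `mlx` and that its version string starts with numeric `major.minor` components.

```python
import platform
import sys
from importlib import metadata


def environment_issues():
    """Return a list of unmet requirements (empty when everything checks out)."""
    issues = []
    if platform.system() != "Darwin" or platform.machine() != "arm64":
        issues.append("macOS on Apple Silicon is required")
    if sys.version_info < (3, 11):
        issues.append("Python >= 3.11 is required")
    try:
        mlx_version = metadata.version("mlx")  # assumed distribution name
    except metadata.PackageNotFoundError:
        issues.append("MLX is not installed")
    else:
        major, minor = (int(part) for part in mlx_version.split(".")[:2])
        if (major, minor) < (0, 22):
            issues.append(f"MLX >= 0.22.0 is required, found {mlx_version}")
    return issues


for issue in environment_issues():
    print("warning:", issue)
```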

Model Specifications

Transformer: 48 layers, 32 attention heads, 128 dim per head (19B parameters)
Latent channels: 128
Text encoder: Gemma 3 with 3840-dim output
Audio: Synchronized audio-video with separate audio VAE and vocoder

License

MIT