Add audio to video conditioning

Prince Canuma
2026-03-16 01:42:11 +01:00
parent f53b9e0807
commit 6f6105b715
7 changed files with 623 additions and 62 deletions

@@ -24,6 +24,7 @@ Supported models:
## Features
- Text-to-video (T2V) and Image-to-video (I2V) generation
- Audio-to-video (A2V) conditioning — generate video from input audio
- Four pipeline modes: Distilled, Dev, Dev Two-Stage, and Dev Two-Stage HQ
- Synchronized audio-video generation (experimental)
- LoRA support (including HuggingFace repos)
@@ -85,7 +86,27 @@ uv run mlx_video.generate --prompt "A person dancing" --image photo.jpg
uv run mlx_video.generate --pipeline dev --prompt "Waves crashing" --image beach.png --cfg-scale 3.5
```
### Audio-Video (experimental)
### Audio-to-Video (A2V)
Generate video conditioned on an input audio file. The audio is encoded into latent space and held fixed during denoising; the transformer's cross-attention reads these audio latents to guide video generation.
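As a rough illustration of what keeping the audio fixed during denoising means, here is a toy NumPy sketch. It is not the `mlx_video` implementation: `toy_denoise`, `toy_model`, the latent shapes, and the Euler-style update are all assumptions made for the example. The only point is that the sampler updates the video latents at every step while the audio latents are passed in unchanged as conditioning.

```python
import numpy as np


def toy_denoise(video_latent, audio_latent, predict_velocity, sigmas):
    """Euler-style loop (hypothetical): only the video latents are updated."""
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        # The model sees both modalities, but its output is applied to video only.
        velocity = predict_velocity(video_latent, audio_latent, sigma)
        video_latent = video_latent + (sigma_next - sigma) * velocity
        # audio_latent is never reassigned: it stays fixed for the whole loop.
    return video_latent, audio_latent


def toy_model(video_latent, audio_latent, sigma):
    """Stand-in for the transformer: pulls video toward the audio mean."""
    return video_latent - audio_latent.mean()


rng = np.random.default_rng(0)
video = rng.standard_normal((4, 8))   # hypothetical video latent shape
audio = rng.standard_normal((2, 8))   # hypothetical audio latent shape
audio_before = audio.copy()
sigmas = np.linspace(1.0, 0.0, 9)     # hypothetical noise schedule, 8 steps

video_out, audio_out = toy_denoise(video, audio, toy_model, sigmas)
```

After the loop, `audio_out` is identical to the input audio latents, while `video_out` has moved toward the value the toy model derives from them.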
```bash
# A2V - generate video from audio
uv run mlx_video.generate --audio-file music.wav --prompt "A band playing music"
# A2V with dev pipeline
uv run mlx_video.generate --pipeline dev --audio-file ocean.wav --prompt "Ocean waves"
# A2V + I2V (audio + image conditioning)
uv run mlx_video.generate --audio-file rain.wav --image forest.jpg --prompt "Rain in forest"
# A2V with custom start time
uv run mlx_video.generate --audio-file song.mp3 --audio-start-time 30.0 --prompt "Concert"
```
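One plausible reading of `--audio-start-time` is that it selects where in the audio file the conditioning window begins, with the window length matching the clip duration. The sketch below works through that arithmetic. It is an assumption, not the tool's documented behaviour: `audio_window` is a hypothetical helper, and the frame count, frame rate, and sample rate are example values.

```python
def audio_window(start_time_s, num_frames, fps, sample_rate):
    """Return (first_sample, end_sample_exclusive) for the clip's audio span.

    Hypothetical helper: assumes the window starts at start_time_s and lasts
    exactly as long as the generated video (num_frames / fps seconds).
    """
    if start_time_s < 0:
        raise ValueError("start time must be non-negative")
    duration_s = num_frames / fps
    first = round(start_time_s * sample_rate)
    end = first + round(duration_s * sample_rate)
    return first, end


# Example values (assumed): a 96-frame clip at 24 fps, 16 kHz audio, 30 s offset.
first, end = audio_window(30.0, num_frames=96, fps=24.0, sample_rate=16000)
```

Under these assumptions a 4-second clip starting at 30 s covers samples 480000 up to, but not including, 544000.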
### Audio-Video Generation (experimental)
Generate synchronized audio alongside video from scratch:
```bash
uv run mlx_video.generate --prompt "Ocean waves crashing" --audio
@@ -150,6 +171,8 @@ uv run mlx_video.upscale --input video.mp4 --output upscaled.mp4 --refine --prom
| `--image`, `-i` | None | Conditioning image for I2V |
| `--image-strength` | 1.0 | Conditioning strength for I2V |
| `--audio`, `-a` | false | Enable synchronized audio generation |
| `--audio-file` | None | Path to audio file for A2V conditioning |
| `--audio-start-time` | 0.0 | Start time, in seconds, within the audio file |
| `--tiling` | `auto` | VAE tiling mode: `auto`, `none`, `aggressive`, `conservative` |
| `--stream` | false | Stream frames as they decode |