From ae410f3121045a518af532b0f2a676b4a4ef0290 Mon Sep 17 00:00:00 2001 From: Daniel Date: Wed, 11 Mar 2026 12:24:59 +0100 Subject: [PATCH] Update Wan2.1/Wan2.2 README.md --- mlx_video/models/wan/README.md | 272 ++++++++++++++++++++++++++++----- 1 file changed, 230 insertions(+), 42 deletions(-) diff --git a/mlx_video/models/wan/README.md b/mlx_video/models/wan/README.md index b5830a4..3d45e2c 100644 --- a/mlx_video/models/wan/README.md +++ b/mlx_video/models/wan/README.md @@ -10,6 +10,7 @@ They share the same model architecture — the difference is in the inference pi | **Task** | Text-to-Video | Text-to-Video | Image-to-Video | Text+Image-to-Video | | **Pipeline** | Single model | Dual model | Dual model | Single model | | **Sizes** | 1.3B, 14B | 14B | 14B | 5B | +| **Resolution** | 480P (1.3B), 720P (14B) | 720P | 720P | 720P | | **Steps** | 50 | 40 | 40 | 40 | | **Guidance** | 5.0 (fixed) | 3.0 / 4.0 | 3.5 / 3.5 | 5.0 (fixed) | | **Shift** | 5.0 | 12.0 | 5.0 | 5.0 | @@ -17,55 +18,103 @@ They share the same model architecture — the difference is in the inference pi ### Step 1: Download Weights -Download the original PyTorch checkpoints: +Download the original PyTorch checkpoints from HuggingFace using the `huggingface-cli` tool (install with `pip install huggingface_hub`): -**Wan2.1 (14B)** +**Wan2.1** ```bash -# From https://github.com/Wan-Video/Wan2.1 or HuggingFace -# Expected directory structure: -# wan21_checkpoint/ -# ├── models_t5_umt5-xxl-enc-bf16.pth -# ├── Wan2.1_VAE.pth -# └── diffusion_pytorch_model*.safetensors # single model +# Text-to-Video 1.3B (fast, fits in ~4 GB) +huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B --local-dir ./Wan2.1-T2V-1.3B + +# Text-to-Video 14B +huggingface-cli download Wan-AI/Wan2.1-T2V-14B --local-dir ./Wan2.1-T2V-14B ``` -**Wan2.1 (1.3B)** — same structure, smaller transformer weights. - -**Wan2.2 (14B)** +**Wan2.2** ```bash -# From https://github.com/Wan-Video/Wan2.2 or HuggingFace -# Expected directory structure: -# wan22_checkpoint/ -# ├── models_t5_umt5-xxl-enc-bf16.pth -# ├── Wan2.1_VAE.pth -# ├── low_noise_model/ # safetensors -# └── high_noise_model/ # safetensors +# Text-to-Video 14B +huggingface-cli download Wan-AI/Wan2.2-T2V-A14B --local-dir ./Wan2.2-T2V-A14B + +# Image-to-Video 14B +huggingface-cli download Wan-AI/Wan2.2-I2V-A14B --local-dir ./Wan2.2-I2V-A14B + +# Text+Image-to-Video 5B (uses a different VAE — z_dim=48) +huggingface-cli download Wan-AI/Wan2.2-TI2V-5B --local-dir ./Wan2.2-TI2V-5B ``` -**Wan2.2 I2V-14B** — same directory structure as Wan2.2 T2V. The conversion script auto-detects I2V-14B from the model's `config.json` (`model_type: "i2v"`, `in_dim: 36`). +Each downloaded directory will have this structure: + +``` +Wan2.1-T2V-*/ +├── models_t5_umt5-xxl-enc-bf16.pth # T5 text encoder +├── Wan2.1_VAE.pth # 3D VAE +└── diffusion_pytorch_model*.safetensors # transformer (single) + +Wan2.2-T2V-A14B/ or Wan2.2-I2V-A14B/ +├── models_t5_umt5-xxl-enc-bf16.pth +├── Wan2.1_VAE.pth +├── low_noise_model/ # dual-model low-noise transformer +└── high_noise_model/ # dual-model high-noise transformer + +Wan2.2-TI2V-5B/ +├── models_t5_umt5-xxl-enc-bf16.pth +├── Wan2.2_VAE.pth # different VAE (z_dim=48) +└── diffusion_pytorch_model*.safetensors # transformer (single) +``` + +> **Wan2.2 I2V-14B** shares the same directory structure as Wan2.2 T2V. The conversion script auto-detects I2V from the model's `config.json` (`model_type: "i2v"`, `in_dim: 36`). ### Step 2: Convert to MLX Format -The conversion script auto-detects the model version based on the directory structure (presence of `low_noise_model/` subdirectory) and model type (`model_type` in source config.json for I2V vs T2V). +The conversion script auto-detects the model version from the directory structure (presence of `low_noise_model/` → Wan2.2 dual model) and the model type from `config.json` (I2V vs T2V). + +#### Wan2.1 T2V 1.3B ```bash -# Auto-detect version python -m mlx_video.convert_wan \ - --checkpoint-dir /path/to/wan_checkpoint \ - --output-dir wan_mlx - -# Explicit version -python -m mlx_video.convert_wan \ - --checkpoint-dir /path/to/wan21_checkpoint \ - --output-dir wan21_mlx \ - --model-version 2.1 - -python -m mlx_video.convert_wan \ - --checkpoint-dir /path/to/wan22_checkpoint \ - --output-dir wan22_mlx \ - --model-version 2.2 + --checkpoint-dir ./Wan2.1-T2V-1.3B \ + --output-dir ./Wan2.1-T2V-1.3B-MLX ``` +#### Wan2.1 T2V 14B + +```bash +python -m mlx_video.convert_wan \ + --checkpoint-dir ./Wan2.1-T2V-14B \ + --output-dir ./Wan2.1-T2V-14B-MLX +``` + +#### Wan2.2 T2V 14B + +```bash +python -m mlx_video.convert_wan \ + --checkpoint-dir ./Wan2.2-T2V-A14B \ + --output-dir ./Wan2.2-T2V-A14B-MLX +``` + +#### Wan2.2 I2V 14B + +```bash +python -m mlx_video.convert_wan \ + --checkpoint-dir ./Wan2.2-I2V-A14B \ + --output-dir ./Wan2.2-I2V-A14B-MLX +``` + +The I2V model is auto-detected from `config.json`; the output will include a `vae_encoder.safetensors` used to encode the conditioning image. + +#### Wan2.2 TI2V 5B + +```bash +python -m mlx_video.convert_wan \ + --checkpoint-dir ./Wan2.2-TI2V-5B \ + --output-dir ./Wan2.2-TI2V-5B-MLX +``` + +The TI2V model uses a different VAE (`z_dim=48`, `vae_stride=(4,16,16)`) and is auto-detected during conversion. + +--- + +You can also pass `--model-version 2.1` or `--model-version 2.2` to force the version instead of relying on auto-detection. + #### Conversion Options | Option | Default | Description | @@ -90,20 +139,158 @@ wan_mlx/ └── high_noise_model.safetensors # (Wan2.2) High-noise transformer ``` -### Quantization (Reduced Memory) +### Step 3: Generate Video -Quantize the transformer weights to reduce memory usage by ~3.4x. This is especially useful for the 14B model or memory-constrained devices: +#### Wan2.1 T2V 1.3B ```bash -# Convert with 4-bit quantization +python -m mlx_video.generate_wan \ + --model-dir ./Wan2.1-T2V-1.3B-MLX \ + --prompt "A cat playing piano in a cozy living room, cinematic lighting" \ + --width 832 --height 480 --num-frames 81 \ + --steps 50 --guide-scale 5.0 \ + --seed 42 \ + --output-path wan21_1b.mp4 +``` + +#### Wan2.1 T2V 14B + +```bash +python -m mlx_video.generate_wan \ + --model-dir ./Wan2.1-T2V-14B-MLX \ + --prompt "A woman walks through a misty forest at dawn, slow motion, cinematic" \ + --width 1280 --height 704 --num-frames 81 \ + --steps 50 --guide-scale 5.0 \ + --seed 42 \ + --output-path wan21_14b.mp4 +``` + +> **Tip**: If the first few frames look washed out or have color artifacts, add `--trim-first-frames 1` to generate 4 extra frames at the start and discard them. With the `unipc` scheduler (default), **10 steps** often gives satisfying results — useful for quick iteration. + +#### Wan2.2 T2V 14B + +Wan2.2 uses a dual-model pipeline (separate high-noise and low-noise transformers) and takes guidance as a `high,low` pair: + +```bash +python -m mlx_video.generate_wan \ + --model-dir ./Wan2.2-T2V-A14B-MLX \ + --prompt "Two astronauts playing chess on the surface of the moon, dramatic lighting, 8K" \ + --negative-prompt "low quality, blurry, distorted" \ + --width 1280 --height 704 --num-frames 81 \ + --steps 40 --guide-scale "3.0,4.0" \ + --seed 42 \ + --output-path wan22_t2v.mp4 +``` + +> **Tip**: With the `unipc` scheduler (default), **10 steps** often produces satisfying results for 14B models — a significant speed-up with minimal quality loss. Try `--steps 10` for quick iterations. + +#### Wan2.2 I2V 14B + +Image-to-video: animates a starting image guided by a text prompt. Pass the image with `--image`: + +```bash +python -m mlx_video.generate_wan \ + --model-dir ./Wan2.2-I2V-A14B-MLX \ + --image ./my_photo.png \ + --prompt "The person slowly turns their head and smiles, cinematic, natural lighting" \ + --negative-prompt "low quality, blurry, distorted" \ + --width 1280 --height 704 --num-frames 81 \ + --steps 40 --guide-scale "3.5,3.5" \ + --seed 42 \ + --output-path wan22_i2v.mp4 +``` + +> **Tip**: As with T2V, `--steps 10` with the `unipc` scheduler is often sufficient for fast prototyping. + +#### Wan2.2 TI2V 5B + +Text+image-to-video: a single-model variant with a larger VAE (`z_dim=48`). Resolution must be divisible by **32** (not 16 as with other models): + +```bash +python -m mlx_video.generate_wan \ + --model-dir ./Wan2.2-TI2V-5B-MLX \ + --image ./my_photo.png \ + --prompt "The subject waves hello, warm sunlight, film grain" \ + --width 1280 --height 704 --num-frames 41 \ + --steps 40 --guide-scale 5.0 \ + --seed 42 \ + --output-path wan22_ti2v.mp4 +``` + +> **Note**: The 5B model is fast — 40 steps run quickly and are recommended for best quality. + +> **Frame count**: `--num-frames` must satisfy `4n+1` for all models (e.g. 5, 9, 13, 21, 41, 81, 101 …). + +> **Resolution**: Always use the model's native resolution. While generation will succeed at other sizes, mismatched resolutions or aspect ratios are likely to produce visual artifacts. Preferred resolutions are: +> - **480P** — 832×480 (landscape) or 480×832 (portrait) — for Wan2.1 1.3B +> - **720P** — 1280×704 (landscape) or 704×1280 (portrait) — for Wan2.1 14B, Wan2.2 T2V/I2V/TI2V + +#### Generation Options + +| Option | Default | Description | +|--------|---------|-------------| +| `--model-dir` | (required) | Path to converted MLX model directory | +| `--prompt` | (required) | Text prompt | +| `--image` | — | Input image path (I2V and TI2V modes) | +| `--negative-prompt` | config default | Negative guidance prompt | +| `--width` | `1280` | Output width in pixels | +| `--height` | `704` | Output height in pixels | +| `--num-frames` | `81` | Number of frames (must be `4n+1`) | +| `--steps` | config default | Diffusion steps | +| `--guide-scale` | config default | Guidance scale; use `"high,low"` pair for Wan2.2 dual models | +| `--shift` | config default | Noise schedule shift | +| `--seed` | `-1` (random) | Random seed for reproducibility | +| `--output-path` | `output.mp4` | Output video file path | +| `--scheduler` | `unipc` | Solver: `euler`, `dpm++`, or `unipc` | +| `--trim-first-frames` | `0` | Drop N leading frames (fixes first-frame artifacts on 14B models) | +| `--tiling` | `auto` | VAE tiling: `auto`, `none`, `spatial`, `temporal` | + +### Quantization (Reduced Memory) + +Quantize the transformer weights to reduce memory usage by ~3.4×. Quantization is supported for all model variants and is especially important for running 14B models on devices with limited unified memory: + +```bash +# Convert with 4-bit quantization (works for any variant) python -m mlx_video.convert_wan \ - --checkpoint-dir /path/to/Wan2.1-T2V-1.3B \ - --output-dir wan21_mlx_q4 \ + --checkpoint-dir ./Wan2.1-T2V-1.3B \ + --output-dir ./Wan2.1-T2V-1.3B-MLX-Q4 \ --quantize --bits 4 --group-size 64 -# Generate with quantized model (auto-detected from config.json) +python -m mlx_video.convert_wan \ + --checkpoint-dir ./Wan2.1-T2V-14B \ + --output-dir ./Wan2.1-T2V-14B-MLX-Q4 \ + --quantize --bits 4 --group-size 64 + +python -m mlx_video.convert_wan \ + --checkpoint-dir ./Wan2.2-T2V-A14B \ + --output-dir ./Wan2.2-T2V-A14B-MLX-Q4 \ + --quantize --bits 4 --group-size 64 + +python -m mlx_video.convert_wan \ + --checkpoint-dir ./Wan2.2-I2V-A14B \ + --output-dir ./Wan2.2-I2V-A14B-MLX-Q4 \ + --quantize --bits 4 --group-size 64 + +python -m mlx_video.convert_wan \ + --checkpoint-dir ./Wan2.2-TI2V-5B \ + --output-dir ./Wan2.2-TI2V-5B-MLX-Q4 \ + --quantize --bits 4 --group-size 64 +``` + +You can also quantize an already-converted MLX model without re-converting from PyTorch: + +```bash +python -m mlx_video.convert_wan \ + --checkpoint-dir ./Wan2.2-T2V-A14B-MLX \ + --output-dir ./Wan2.2-T2V-A14B-MLX-Q4 \ + --quantize-only --bits 4 +``` + +Quantized models are used exactly the same way — the quantization is auto-detected from `config.json`: + +```bash python -m mlx_video.generate_wan \ - --model-dir wan21_mlx_q4 \ + --model-dir ./Wan2.2-T2V-A14B-MLX-Q4 \ --prompt "A cat playing piano" ``` @@ -157,5 +344,6 @@ python -m mlx_video.generate_wan \ --lora-low /Volumes/SSD/Wan-AI/lightx2v/Wan2.2-Lightning/Wan2.2-T2V-A14B-4steps-lora-rank64-Seko-V2.0/low_noise_model.safetensors 1 ``` -Which results in +## Enjoy + ![Poodles](../../../examples/poodles-wan.gif) \ No newline at end of file