Update Wan2.1/Wan2.2 README.md

2026-03-11 12:24:59 +01:00
parent c144c8817c
commit ae410f3121
1 changed files with 230 additions and 42 deletions
--- a/mlx_video/models/wan/README.md
+++ b/mlx_video/models/wan/README.md
@@ -10,6 +10,7 @@ They share the same model architecture — the difference is in the inference pi
 | **Task** | Text-to-Video | Text-to-Video | Image-to-Video | Text+Image-to-Video |
 | **Pipeline** | Single model | Dual model | Dual model | Single model |
 | **Sizes** | 1.3B, 14B | 14B | 14B | 5B |
+| **Resolution** | 480P (1.3B), 720P (14B) | 720P | 720P | 720P |
 | **Steps** | 50 | 40 | 40 | 40 |
 | **Guidance** | 5.0 (fixed) | 3.0 / 4.0 | 3.5 / 3.5 | 5.0 (fixed) |
 | **Shift** | 5.0 | 12.0 | 5.0 | 5.0 |
@@ -17,55 +18,103 @@ They share the same model architecture — the difference is in the inference pi

 ### Step 1: Download Weights

-Download the original PyTorch checkpoints:
+Download the original PyTorch checkpoints from HuggingFace using the `huggingface-cli` tool (install with `pip install huggingface_hub`):

-**Wan2.1 (14B)**
+**Wan2.1**
 ```bash
-# From https://github.com/Wan-Video/Wan2.1 or HuggingFace
-# Expected directory structure:
-# wan21_checkpoint/
-#   ├── models_t5_umt5-xxl-enc-bf16.pth
-#   ├── Wan2.1_VAE.pth
-#   └── diffusion_pytorch_model*.safetensors   # single model
+# Text-to-Video 1.3B (fast, fits in ~4 GB)
+huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B --local-dir ./Wan2.1-T2V-1.3B
+
+# Text-to-Video 14B
+huggingface-cli download Wan-AI/Wan2.1-T2V-14B --local-dir ./Wan2.1-T2V-14B
 ```

-**Wan2.1 (1.3B)** — same structure, smaller transformer weights.
-
-**Wan2.2 (14B)**
+**Wan2.2**
 ```bash
-# From https://github.com/Wan-Video/Wan2.2 or HuggingFace
-# Expected directory structure:
-# wan22_checkpoint/
-#   ├── models_t5_umt5-xxl-enc-bf16.pth
-#   ├── Wan2.1_VAE.pth
-#   ├── low_noise_model/   # safetensors
-#   └── high_noise_model/  # safetensors
+# Text-to-Video 14B
+huggingface-cli download Wan-AI/Wan2.2-T2V-A14B --local-dir ./Wan2.2-T2V-A14B
+
+# Image-to-Video 14B
+huggingface-cli download Wan-AI/Wan2.2-I2V-A14B --local-dir ./Wan2.2-I2V-A14B
+
+# Text+Image-to-Video 5B (uses a different VAE — z_dim=48)
+huggingface-cli download Wan-AI/Wan2.2-TI2V-5B --local-dir ./Wan2.2-TI2V-5B
 ```

-**Wan2.2 I2V-14B** — same directory structure as Wan2.2 T2V. The conversion script auto-detects I2V-14B from the model's `config.json` (`model_type: "i2v"`, `in_dim: 36`).
+Each downloaded directory will have this structure:
+
+```
+Wan2.1-T2V-*/
+├── models_t5_umt5-xxl-enc-bf16.pth       # T5 text encoder
+├── Wan2.1_VAE.pth                         # 3D VAE
+└── diffusion_pytorch_model*.safetensors   # transformer (single)
+
+Wan2.2-T2V-A14B/ or Wan2.2-I2V-A14B/
+├── models_t5_umt5-xxl-enc-bf16.pth
+├── Wan2.1_VAE.pth
+├── low_noise_model/                       # dual-model low-noise transformer
+└── high_noise_model/                      # dual-model high-noise transformer
+
+Wan2.2-TI2V-5B/
+├── models_t5_umt5-xxl-enc-bf16.pth
+├── Wan2.2_VAE.pth                         # different VAE (z_dim=48)
+└── diffusion_pytorch_model*.safetensors   # transformer (single)
+```
+
+> **Wan2.2 I2V-14B** shares the same directory structure as Wan2.2 T2V. The conversion script auto-detects I2V from the model's `config.json` (`model_type: "i2v"`, `in_dim: 36`).

 ### Step 2: Convert to MLX Format

-The conversion script auto-detects the model version based on the directory structure (presence of `low_noise_model/` subdirectory) and model type (`model_type` in source config.json for I2V vs T2V).
+The conversion script auto-detects the model version from the directory structure (presence of `low_noise_model/` → Wan2.2 dual model) and the model type from `config.json` (I2V vs T2V).
+
+#### Wan2.1 T2V 1.3B

 ```bash
-# Auto-detect version
 python -m mlx_video.convert_wan \
-    --checkpoint-dir /path/to/wan_checkpoint \
-    --output-dir wan_mlx
-
-# Explicit version
-python -m mlx_video.convert_wan \
-    --checkpoint-dir /path/to/wan21_checkpoint \
-    --output-dir wan21_mlx \
-    --model-version 2.1
-
-python -m mlx_video.convert_wan \
-    --checkpoint-dir /path/to/wan22_checkpoint \
-    --output-dir wan22_mlx \
-    --model-version 2.2
+    --checkpoint-dir ./Wan2.1-T2V-1.3B \
+    --output-dir ./Wan2.1-T2V-1.3B-MLX
 ```

+#### Wan2.1 T2V 14B
+
+```bash
+python -m mlx_video.convert_wan \
+    --checkpoint-dir ./Wan2.1-T2V-14B \
+    --output-dir ./Wan2.1-T2V-14B-MLX
+```
+
+#### Wan2.2 T2V 14B
+
+```bash
+python -m mlx_video.convert_wan \
+    --checkpoint-dir ./Wan2.2-T2V-A14B \
+    --output-dir ./Wan2.2-T2V-A14B-MLX
+```
+
+#### Wan2.2 I2V 14B
+
+```bash
+python -m mlx_video.convert_wan \
+    --checkpoint-dir ./Wan2.2-I2V-A14B \
+    --output-dir ./Wan2.2-I2V-A14B-MLX
+```
+
+The I2V model is auto-detected from `config.json`; the output will include a `vae_encoder.safetensors` used to encode the conditioning image.
+
+#### Wan2.2 TI2V 5B
+
+```bash
+python -m mlx_video.convert_wan \
+    --checkpoint-dir ./Wan2.2-TI2V-5B \
+    --output-dir ./Wan2.2-TI2V-5B-MLX
+```
+
+The TI2V model uses a different VAE (`z_dim=48`, `vae_stride=(4,16,16)`) and is auto-detected during conversion.
+
+---
+
+You can also pass `--model-version 2.1` or `--model-version 2.2` to force the version instead of relying on auto-detection.
+
 #### Conversion Options

 | Option | Default | Description |
@@ -90,20 +139,158 @@ wan_mlx/
 └── high_noise_model.safetensors   # (Wan2.2) High-noise transformer
 ```

-### Quantization (Reduced Memory)
+### Step 3: Generate Video

-Quantize the transformer weights to reduce memory usage by ~3.4x. This is especially useful for the 14B model or memory-constrained devices:
+#### Wan2.1 T2V 1.3B

 ```bash
-# Convert with 4-bit quantization
+python -m mlx_video.generate_wan \
+    --model-dir ./Wan2.1-T2V-1.3B-MLX \
+    --prompt "A cat playing piano in a cozy living room, cinematic lighting" \
+    --width 832 --height 480 --num-frames 81 \
+    --steps 50 --guide-scale 5.0 \
+    --seed 42 \
+    --output-path wan21_1b.mp4
+```
+
+#### Wan2.1 T2V 14B
+
+```bash
+python -m mlx_video.generate_wan \
+    --model-dir ./Wan2.1-T2V-14B-MLX \
+    --prompt "A woman walks through a misty forest at dawn, slow motion, cinematic" \
+    --width 1280 --height 704 --num-frames 81 \
+    --steps 50 --guide-scale 5.0 \
+    --seed 42 \
+    --output-path wan21_14b.mp4
+```
+
+> **Tip**: If the first few frames look washed out or have color artifacts, add `--trim-first-frames 1` to generate 4 extra frames at the start and discard them. With the `unipc` scheduler (default), **10 steps** often gives satisfying results — useful for quick iteration.
+
+#### Wan2.2 T2V 14B
+
+Wan2.2 uses a dual-model pipeline (separate high-noise and low-noise transformers) and takes guidance as a `high,low` pair:
+
+```bash
+python -m mlx_video.generate_wan \
+    --model-dir ./Wan2.2-T2V-A14B-MLX \
+    --prompt "Two astronauts playing chess on the surface of the moon, dramatic lighting, 8K" \
+    --negative-prompt "low quality, blurry, distorted" \
+    --width 1280 --height 704 --num-frames 81 \
+    --steps 40 --guide-scale "3.0,4.0" \
+    --seed 42 \
+    --output-path wan22_t2v.mp4
+```
+
+> **Tip**: With the `unipc` scheduler (default), **10 steps** often produces satisfying results for 14B models — a significant speed-up with minimal quality loss. Try `--steps 10` for quick iterations.
+
+#### Wan2.2 I2V 14B
+
+Image-to-video: animates a starting image guided by a text prompt. Pass the image with `--image`:
+
+```bash
+python -m mlx_video.generate_wan \
+    --model-dir ./Wan2.2-I2V-A14B-MLX \
+    --image ./my_photo.png \
+    --prompt "The person slowly turns their head and smiles, cinematic, natural lighting" \
+    --negative-prompt "low quality, blurry, distorted" \
+    --width 1280 --height 704 --num-frames 81 \
+    --steps 40 --guide-scale "3.5,3.5" \
+    --seed 42 \
+    --output-path wan22_i2v.mp4
+```
+
+> **Tip**: As with T2V, `--steps 10` with the `unipc` scheduler is often sufficient for fast prototyping.
+
+#### Wan2.2 TI2V 5B
+
+Text+image-to-video: a single-model variant with a larger VAE (`z_dim=48`). Resolution must be divisible by **32** (not 16 as with other models):
+
+```bash
+python -m mlx_video.generate_wan \
+    --model-dir ./Wan2.2-TI2V-5B-MLX \
+    --image ./my_photo.png \
+    --prompt "The subject waves hello, warm sunlight, film grain" \
+    --width 1280 --height 704 --num-frames 41 \
+    --steps 40 --guide-scale 5.0 \
+    --seed 42 \
+    --output-path wan22_ti2v.mp4
+```
+
+> **Note**: The 5B model is fast — 40 steps run quickly and are recommended for best quality.
+
+> **Frame count**: `--num-frames` must satisfy `4n+1` for all models (e.g. 5, 9, 13, 21, 41, 81, 101 …).
+
+> **Resolution**: Always use the model's native resolution. While generation will succeed at other sizes, mismatched resolutions or aspect ratios are likely to produce visual artifacts. Preferred resolutions are:
+> - **480P** — 832×480 (landscape) or 480×832 (portrait) — for Wan2.1 1.3B
+> - **720P** — 1280×704 (landscape) or 704×1280 (portrait) — for Wan2.1 14B, Wan2.2 T2V/I2V/TI2V
+
+#### Generation Options
+
+| Option | Default | Description |
+|--------|---------|-------------|
+| `--model-dir` | (required) | Path to converted MLX model directory |
+| `--prompt` | (required) | Text prompt |
+| `--image` | — | Input image path (I2V and TI2V modes) |
+| `--negative-prompt` | config default | Negative guidance prompt |
+| `--width` | `1280` | Output width in pixels |
+| `--height` | `704` | Output height in pixels |
+| `--num-frames` | `81` | Number of frames (must be `4n+1`) |
+| `--steps` | config default | Diffusion steps |
+| `--guide-scale` | config default | Guidance scale; use `"high,low"` pair for Wan2.2 dual models |
+| `--shift` | config default | Noise schedule shift |
+| `--seed` | `-1` (random) | Random seed for reproducibility |
+| `--output-path` | `output.mp4` | Output video file path |
+| `--scheduler` | `unipc` | Solver: `euler`, `dpm++`, or `unipc` |
+| `--trim-first-frames` | `0` | Drop N leading frames (fixes first-frame artifacts on 14B models) |
+| `--tiling` | `auto` | VAE tiling: `auto`, `none`, `spatial`, `temporal` |
+
+### Quantization (Reduced Memory)
+
+Quantize the transformer weights to reduce memory usage by ~3.4×. Quantization is supported for all model variants and is especially important for running 14B models on devices with limited unified memory:
+
+```bash
+# Convert with 4-bit quantization (works for any variant)
 python -m mlx_video.convert_wan \
-    --checkpoint-dir /path/to/Wan2.1-T2V-1.3B \
-    --output-dir wan21_mlx_q4 \
+    --checkpoint-dir ./Wan2.1-T2V-1.3B \
+    --output-dir ./Wan2.1-T2V-1.3B-MLX-Q4 \
    --quantize --bits 4 --group-size 64

-# Generate with quantized model (auto-detected from config.json)
+python -m mlx_video.convert_wan \
+    --checkpoint-dir ./Wan2.1-T2V-14B \
+    --output-dir ./Wan2.1-T2V-14B-MLX-Q4 \
+    --quantize --bits 4 --group-size 64
+
+python -m mlx_video.convert_wan \
+    --checkpoint-dir ./Wan2.2-T2V-A14B \
+    --output-dir ./Wan2.2-T2V-A14B-MLX-Q4 \
+    --quantize --bits 4 --group-size 64
+
+python -m mlx_video.convert_wan \
+    --checkpoint-dir ./Wan2.2-I2V-A14B \
+    --output-dir ./Wan2.2-I2V-A14B-MLX-Q4 \
+    --quantize --bits 4 --group-size 64
+
+python -m mlx_video.convert_wan \
+    --checkpoint-dir ./Wan2.2-TI2V-5B \
+    --output-dir ./Wan2.2-TI2V-5B-MLX-Q4 \
+    --quantize --bits 4 --group-size 64
+```
+
+You can also quantize an already-converted MLX model without re-converting from PyTorch:
+
+```bash
+python -m mlx_video.convert_wan \
+    --checkpoint-dir ./Wan2.2-T2V-A14B-MLX \
+    --output-dir ./Wan2.2-T2V-A14B-MLX-Q4 \
+    --quantize-only --bits 4
+```
+
+Quantized models are used exactly the same way — the quantization is auto-detected from `config.json`:
+
+```bash
 python -m mlx_video.generate_wan \
-    --model-dir wan21_mlx_q4 \
+    --model-dir ./Wan2.2-T2V-A14B-MLX-Q4 \
    --prompt "A cat playing piano"
 ```

@@ -157,5 +344,6 @@ python -m mlx_video.generate_wan \
    --lora-low /Volumes/SSD/Wan-AI/lightx2v/Wan2.2-Lightning/Wan2.2-T2V-A14B-4steps-lora-rank64-Seko-V2.0/low_noise_model.safetensors 1
 ```

-Which results in 
+## Enjoy
+
 ![Poodles](../../../examples/poodles-wan.gif)