Update Wan2.1/Wan2.2 README.md

2026-03-11 12:24:59 +01:00
parent c144c8817c
commit ae410f3121
1 changed file with 230 additions and 42 deletions
--- a/mlx_video/models/wan/README.md
+++ b/mlx_video/models/wan/README.md
@@ -10,6 +10,7 @@ They share the same model architecture — the difference is in the inference pi
 | **Task** | Text-to-Video | Text-to-Video | Image-to-Video | Text+Image-to-Video |
 | **Pipeline** | Single model | Dual model | Dual model | Single model |
 | **Sizes** | 1.3B, 14B | 14B | 14B | 5B |
 | **Resolution** | 480P (1.3B), 720P (14B) | 720P | 720P | 720P |
 | **Steps** | 50 | 40 | 40 | 40 |
 | **Guidance** | 5.0 (fixed) | 3.0 / 4.0 | 3.5 / 3.5 | 5.0 (fixed) |
 | **Shift** | 5.0 | 12.0 | 5.0 | 5.0 |
@@ -17,55 +18,103 @@ They share the same model architecture — the difference is in the inference pi
 ### Step 1: Download Weights
-Download the original PyTorch checkpoints:
+Download the original PyTorch checkpoints from Hugging Face using the `huggingface-cli` tool (install with `pip install huggingface_hub`):
-**Wan2.1 (14B)**
+**Wan2.1**
 ```bash
-# From https://github.com/Wan-Video/Wan2.1 or HuggingFace
+# Text-to-Video 1.3B (fast, fits in ~4 GB)
-# Expected directory structure:
+huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B --local-dir ./Wan2.1-T2V-1.3B
-# wan21_checkpoint/
+
-#   ├── models_t5_umt5-xxl-enc-bf16.pth
+# Text-to-Video 14B
-#   ├── Wan2.1_VAE.pth
+huggingface-cli download Wan-AI/Wan2.1-T2V-14B --local-dir ./Wan2.1-T2V-14B
 #   └── diffusion_pytorch_model*.safetensors   # single model
 ```
-**Wan2.1 (1.3B)** — same structure, smaller transformer weights.
+**Wan2.2**
 **Wan2.2 (14B)**
 ```bash
-# From https://github.com/Wan-Video/Wan2.2 or HuggingFace
+# Text-to-Video 14B
-# Expected directory structure:
+huggingface-cli download Wan-AI/Wan2.2-T2V-A14B --local-dir ./Wan2.2-T2V-A14B
-# wan22_checkpoint/
+
-#   ├── models_t5_umt5-xxl-enc-bf16.pth
+# Image-to-Video 14B
-#   ├── Wan2.1_VAE.pth
+huggingface-cli download Wan-AI/Wan2.2-I2V-A14B --local-dir ./Wan2.2-I2V-A14B
-#   ├── low_noise_model/   # safetensors
+
-#   └── high_noise_model/  # safetensors
+# Text+Image-to-Video 5B (uses a different VAE — z_dim=48)
 huggingface-cli download Wan-AI/Wan2.2-TI2V-5B --local-dir ./Wan2.2-TI2V-5B
 ```
-**Wan2.2 I2V-14B** — same directory structure as Wan2.2 T2V. The conversion script auto-detects I2V-14B from the model's `config.json` (`model_type: "i2v"`, `in_dim: 36`).
+Each downloaded directory will have this structure:
 ```
 Wan2.1-T2V-*/
 ├── models_t5_umt5-xxl-enc-bf16.pth       # T5 text encoder
 ├── Wan2.1_VAE.pth                         # 3D VAE
 └── diffusion_pytorch_model*.safetensors   # transformer (single)
 Wan2.2-T2V-A14B/ or Wan2.2-I2V-A14B/
 ├── models_t5_umt5-xxl-enc-bf16.pth
 ├── Wan2.1_VAE.pth
 ├── low_noise_model/                       # dual-model low-noise transformer
 └── high_noise_model/                      # dual-model high-noise transformer
 Wan2.2-TI2V-5B/
 ├── models_t5_umt5-xxl-enc-bf16.pth
 ├── Wan2.2_VAE.pth                         # different VAE (z_dim=48)
 └── diffusion_pytorch_model*.safetensors   # transformer (single)
 ```
 > **Wan2.2 I2V-14B** shares the same directory structure as Wan2.2 T2V. The conversion script auto-detects I2V from the model's `config.json` (`model_type: "i2v"`, `in_dim: 36`).
 ### Step 2: Convert to MLX Format
-The conversion script auto-detects the model version based on the directory structure (presence of `low_noise_model/` subdirectory) and model type (`model_type` in source config.json for I2V vs T2V).
+The conversion script auto-detects the model version from the directory structure (presence of `low_noise_model/` → Wan2.2 dual model) and the model type from `config.json` (I2V vs T2V).
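As a rough illustration of the directory-structure check described above, the sketch below classifies a checkpoint directory by whether it contains both `low_noise_model/` and `high_noise_model/`. The function name `wan_layout` is hypothetical, and this is not the conversion script's actual logic (which also reads `config.json`); it is only a quick way to see which layout a download has.

```bash
# Hypothetical helper, not part of mlx_video: guess the pipeline layout
# of a downloaded checkpoint directory from its subdirectories.
# Prints "dual" when both per-noise-level subdirectories exist, else "single".
wan_layout() {
    local dir="$1"
    # Refuse to guess for a path that is not a directory.
    [ -d "$dir" ] || { echo "not a directory: $dir" >&2; return 1; }
    if [ -d "$dir/low_noise_model" ] && [ -d "$dir/high_noise_model" ]; then
        echo "dual"      # layout shown above for Wan2.2 T2V/I2V A14B
    else
        echo "single"    # layout shown above for Wan2.1 and Wan2.2 TI2V-5B
    fi
}
```

For example, `wan_layout ./Wan2.2-T2V-A14B` should print `dual` for a complete download, while `wan_layout ./Wan2.1-T2V-1.3B` should print `single`.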
 #### Wan2.1 T2V 1.3B
 ```bash
 # Auto-detect version
 python -m mlx_video.convert_wan \
-    --checkpoint-dir /path/to/wan_checkpoint \
+    --checkpoint-dir ./Wan2.1-T2V-1.3B \
-    --output-dir wan_mlx
+    --output-dir ./Wan2.1-T2V-1.3B-MLX
 # Explicit version
 python -m mlx_video.convert_wan \
    --checkpoint-dir /path/to/wan21_checkpoint \
    --output-dir wan21_mlx \
    --model-version 2.1
 python -m mlx_video.convert_wan \
    --checkpoint-dir /path/to/wan22_checkpoint \
    --output-dir wan22_mlx \
    --model-version 2.2
 ```
 #### Wan2.1 T2V 14B
 ```bash
 python -m mlx_video.convert_wan \
    --checkpoint-dir ./Wan2.1-T2V-14B \
    --output-dir ./Wan2.1-T2V-14B-MLX
 ```
 #### Wan2.2 T2V 14B
 ```bash
 python -m mlx_video.convert_wan \
    --checkpoint-dir ./Wan2.2-T2V-A14B \
    --output-dir ./Wan2.2-T2V-A14B-MLX
 ```
 #### Wan2.2 I2V 14B
 ```bash
 python -m mlx_video.convert_wan \
    --checkpoint-dir ./Wan2.2-I2V-A14B \
    --output-dir ./Wan2.2-I2V-A14B-MLX
 ```
 The I2V model is auto-detected from `config.json`; the output will include a `vae_encoder.safetensors` file, which is used to encode the conditioning image.
 #### Wan2.2 TI2V 5B
 ```bash
 python -m mlx_video.convert_wan \
    --checkpoint-dir ./Wan2.2-TI2V-5B \
    --output-dir ./Wan2.2-TI2V-5B-MLX
 ```
 The TI2V model uses a different VAE (`z_dim=48`, `vae_stride=(4,16,16)`) and is auto-detected during conversion.
 ---
 You can also pass `--model-version 2.1` or `--model-version 2.2` to force the version instead of relying on auto-detection.
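Because every variant is converted with the same two flags, the per-model commands above can be folded into a loop. The following is a minimal sketch: the function name `convert_all` and the `DRY_RUN` variable are hypothetical, and the `-MLX` output suffix simply mirrors the naming used in the examples above.

```bash
# Hypothetical batch wrapper, not part of mlx_video: converts every
# checkpoint directory passed in, writing each result next to it with
# an -MLX suffix. Set DRY_RUN=1 to print the commands instead of running them.
convert_all() {
    local dir
    for dir in "$@"; do
        dir="${dir%/}"   # drop a trailing slash so the suffix attaches cleanly
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo python -m mlx_video.convert_wan \
                --checkpoint-dir "$dir" --output-dir "${dir}-MLX"
        else
            # Stop at the first failed conversion.
            python -m mlx_video.convert_wan \
                --checkpoint-dir "$dir" --output-dir "${dir}-MLX" || return 1
        fi
    done
}
```

Running `DRY_RUN=1 convert_all ./Wan2.1-T2V-1.3B ./Wan2.2-TI2V-5B` prints the two commands without executing them, which is a cheap way to check the paths before starting a long conversion.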
 #### Conversion Options
 | Option | Default | Description |
@@ -90,20 +139,158 @@ wan_mlx/
 └── high_noise_model.safetensors   # (Wan2.2) High-noise transformer
 ```
-### Quantization (Reduced Memory)
+### Step 3: Generate Video
-Quantize the transformer weights to reduce memory usage by ~3.4x. This is especially useful for the 14B model or memory-constrained devices:
+#### Wan2.1 T2V 1.3B
 ```bash
-# Convert with 4-bit quantization
+python -m mlx_video.generate_wan \
    --model-dir ./Wan2.1-T2V-1.3B-MLX \
    --prompt "A cat playing piano in a cozy living room, cinematic lighting" \
    --width 832 --height 480 --num-frames 81 \
    --steps 50 --guide-scale 5.0 \
    --seed 42 \
    --output-path wan21_1b.mp4
 ```
 #### Wan2.1 T2V 14B
 ```bash
 python -m mlx_video.generate_wan \
    --model-dir ./Wan2.1-T2V-14B-MLX \
    --prompt "A woman walks through a misty forest at dawn, slow motion, cinematic" \
    --width 1280 --height 704 --num-frames 81 \
    --steps 50 --guide-scale 5.0 \
    --seed 42 \
    --output-path wan21_14b.mp4
 ```
 > **Tip**: If the first few frames look washed out or show color artifacts, add `--trim-first-frames 1` to generate 4 extra frames at the start and discard them. With the default `unipc` scheduler, **10 steps** often give satisfying results, which is useful for quick iteration.
 #### Wan2.2 T2V 14B
 Wan2.2 T2V 14B uses a dual-model pipeline (separate high-noise and low-noise transformers) and takes the guidance scale as a `high,low` pair:
 ```bash
 python -m mlx_video.generate_wan \
    --model-dir ./Wan2.2-T2V-A14B-MLX \
    --prompt "Two astronauts playing chess on the surface of the moon, dramatic lighting, 8K" \
    --negative-prompt "low quality, blurry, distorted" \
    --width 1280 --height 704 --num-frames 81 \
    --steps 40 --guide-scale "3.0,4.0" \
    --seed 42 \
    --output-path wan22_t2v.mp4
 ```
 > **Tip**: With the default `unipc` scheduler, **10 steps** often produce satisfying results for 14B models, a significant speed-up over the 40-step default with minimal quality loss. Try `--steps 10` for quick iterations.
 #### Wan2.2 I2V 14B
 Image-to-video: animates a starting image guided by a text prompt. Pass the image with `--image`:
 ```bash
 python -m mlx_video.generate_wan \
    --model-dir ./Wan2.2-I2V-A14B-MLX \
    --image ./my_photo.png \
    --prompt "The person slowly turns their head and smiles, cinematic, natural lighting" \
    --negative-prompt "low quality, blurry, distorted" \
    --width 1280 --height 704 --num-frames 81 \
    --steps 40 --guide-scale "3.5,3.5" \
    --seed 42 \
    --output-path wan22_i2v.mp4
 ```
 > **Tip**: As with T2V, `--steps 10` with the `unipc` scheduler is often sufficient for fast prototyping.
 #### Wan2.2 TI2V 5B
 Text+image-to-video: a single-model variant with a larger VAE (`z_dim=48`). Width and height must each be divisible by **32** (rather than 16, as with the other models):
 ```bash
 python -m mlx_video.generate_wan \
    --model-dir ./Wan2.2-TI2V-5B-MLX \
    --image ./my_photo.png \
    --prompt "The subject waves hello, warm sunlight, film grain" \
    --width 1280 --height 704 --num-frames 41 \
    --steps 40 --guide-scale 5.0 \
    --seed 42 \
    --output-path wan22_ti2v.mp4
 ```
 > **Note**: The 5B model is fast, so the full 40 steps run quickly and are recommended for best quality.
 > **Frame count**: `--num-frames` must be of the form `4n+1` for all models (e.g. 5, 9, 13, 21, 41, 81, 101 …).
 > **Resolution**: Always use the model's native resolution. While generation will succeed at other sizes, mismatched resolutions or aspect ratios are likely to produce visual artifacts. Preferred resolutions are:
 > - **480P** — 832×480 (landscape) or 480×832 (portrait) — for Wan2.1 1.3B
 > - **720P** — 1280×704 (landscape) or 704×1280 (portrait) — for Wan2.1 14B, Wan2.2 T2V/I2V/TI2V
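The frame-count and divisibility constraints are easy to check before launching a long run. The helpers below are a hypothetical sketch (none of these function names exist in `mlx_video`): one tests the `4n+1` rule, one rounds a frame count down to the nearest valid value, and one tests that width and height are multiples of a given factor (32 for TI2V 5B, 16 for the other variants, as stated above).

```bash
# Hypothetical pre-flight checks, not part of mlx_video.

# Succeeds when the frame count is a positive integer of the form 4n+1.
valid_num_frames() {
    [ "$1" -ge 1 ] && [ $(( ($1 - 1) % 4 )) -eq 0 ]
}

# Rounds a frame count down to the nearest value of the form 4n+1.
round_num_frames() {
    echo $(( ($1 - 1) / 4 * 4 + 1 ))
}

# Succeeds when width and height are both multiples of the given factor.
valid_resolution() {
    local width="$1" height="$2" multiple="$3"
    [ $(( width % multiple )) -eq 0 ] && [ $(( height % multiple )) -eq 0 ]
}
```

For instance, `round_num_frames 80` prints `77`, and `valid_resolution 1280 720 32` fails because 720 is not a multiple of 32, whereas the preferred 1280×704 passes for both factors.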
 #### Generation Options
 | Option | Default | Description |
 |--------|---------|-------------|
 | `--model-dir` | (required) | Path to converted MLX model directory |
 | `--prompt` | (required) | Text prompt |
 | `--image` | — | Input image path (I2V and TI2V modes) |
 | `--negative-prompt` | config default | Negative guidance prompt |
 | `--width` | `1280` | Output width in pixels |
 | `--height` | `704` | Output height in pixels |
 | `--num-frames` | `81` | Number of frames (must be `4n+1`) |
 | `--steps` | config default | Diffusion steps |
 | `--guide-scale` | config default | Guidance scale; use `"high,low"` pair for Wan2.2 dual models |
 | `--shift` | config default | Noise schedule shift |
 | `--seed` | `-1` (random) | Random seed for reproducibility |
 | `--output-path` | `output.mp4` | Output video file path |
 | `--scheduler` | `unipc` | Solver: `euler`, `dpm++`, or `unipc` |
 | `--trim-first-frames` | `0` | Drop N leading frames (fixes first-frame artifacts on 14B models) |
 | `--tiling` | `auto` | VAE tiling: `auto`, `none`, `spatial`, `temporal` |
 ### Quantization (Reduced Memory)
 Quantize the transformer weights to reduce memory usage by ~3.4×. Quantization is supported for all model variants and is especially important for running 14B models on devices with limited unified memory:
 ```bash
 # Convert with 4-bit quantization (works for any variant)
 python -m mlx_video.convert_wan \
-    --checkpoint-dir /path/to/Wan2.1-T2V-1.3B \
+    --checkpoint-dir ./Wan2.1-T2V-1.3B \
-    --output-dir wan21_mlx_q4 \
+    --output-dir ./Wan2.1-T2V-1.3B-MLX-Q4 \
    --quantize --bits 4 --group-size 64
-# Generate with quantized model (auto-detected from config.json)
+python -m mlx_video.convert_wan \
    --checkpoint-dir ./Wan2.1-T2V-14B \
    --output-dir ./Wan2.1-T2V-14B-MLX-Q4 \
    --quantize --bits 4 --group-size 64
 python -m mlx_video.convert_wan \
    --checkpoint-dir ./Wan2.2-T2V-A14B \
    --output-dir ./Wan2.2-T2V-A14B-MLX-Q4 \
    --quantize --bits 4 --group-size 64
 python -m mlx_video.convert_wan \
    --checkpoint-dir ./Wan2.2-I2V-A14B \
    --output-dir ./Wan2.2-I2V-A14B-MLX-Q4 \
    --quantize --bits 4 --group-size 64
 python -m mlx_video.convert_wan \
    --checkpoint-dir ./Wan2.2-TI2V-5B \
    --output-dir ./Wan2.2-TI2V-5B-MLX-Q4 \
    --quantize --bits 4 --group-size 64
 ```
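The ~3.4× figure can be sanity-checked with a little arithmetic. The sketch below assumes that the unquantized weights are 16-bit and that each group of quantized weights also stores one 16-bit scale and one 16-bit bias; both are assumptions about the storage format rather than something this README states, and the helper name `quant_ratio` is hypothetical.

```bash
# Back-of-the-envelope estimate, not part of mlx_video.
# ASSUMPTIONS: 16-bit source weights, plus one 16-bit scale and one
# 16-bit bias stored per group of quantized weights.
quant_ratio() {
    local bits="$1" group_size="$2"
    awk -v b="$bits" -v g="$group_size" 'BEGIN {
        per_weight = b + 32 / g            # packed bits + amortized scale/bias
        printf "%.2f\n", 16 / per_weight   # size ratio versus 16-bit weights
    }'
}
```

Under these assumptions, `quant_ratio 4 64` prints `3.56`. That is an upper bound for the quantized layers alone; any weights kept at full precision pull the overall reduction below it.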
 You can also quantize an already-converted MLX model without re-converting from PyTorch:
 ```bash
 python -m mlx_video.convert_wan \
    --checkpoint-dir ./Wan2.2-T2V-A14B-MLX \
    --output-dir ./Wan2.2-T2V-A14B-MLX-Q4 \
    --quantize-only --bits 4
 ```
 Quantized models are used exactly the same way — the quantization is auto-detected from `config.json`:
 ```bash
 python -m mlx_video.generate_wan \
-    --model-dir wan21_mlx_q4 \
+    --model-dir ./Wan2.2-T2V-A14B-MLX-Q4 \
    --prompt "A cat playing piano"
 ```
@@ -157,5 +344,6 @@ python -m mlx_video.generate_wan \
    --lora-low /Volumes/SSD/Wan-AI/lightx2v/Wan2.2-Lightning/Wan2.2-T2V-A14B-4steps-lora-rank64-Seko-V2.0/low_noise_model.safetensors 1
 ```
-Which results in 
+## Enjoy
 ![Poodles](../../../examples/poodles-wan.gif)