Update Wan2.1/Wan2.2 README.md
This commit is contained in:
@@ -10,6 +10,7 @@ They share the same model architecture — the difference is in the inference pi
|
|||||||
| **Task** | Text-to-Video | Text-to-Video | Image-to-Video | Text+Image-to-Video |
|
| **Task** | Text-to-Video | Text-to-Video | Image-to-Video | Text+Image-to-Video |
|
||||||
| **Pipeline** | Single model | Dual model | Dual model | Single model |
|
| **Pipeline** | Single model | Dual model | Dual model | Single model |
|
||||||
| **Sizes** | 1.3B, 14B | 14B | 14B | 5B |
|
| **Sizes** | 1.3B, 14B | 14B | 14B | 5B |
|
||||||
|
| **Resolution** | 480P (1.3B), 720P (14B) | 720P | 720P | 720P |
|
||||||
| **Steps** | 50 | 40 | 40 | 40 |
|
| **Steps** | 50 | 40 | 40 | 40 |
|
||||||
| **Guidance** | 5.0 (fixed) | 3.0 / 4.0 | 3.5 / 3.5 | 5.0 (fixed) |
|
| **Guidance** | 5.0 (fixed) | 3.0 / 4.0 | 3.5 / 3.5 | 5.0 (fixed) |
|
||||||
| **Shift** | 5.0 | 12.0 | 5.0 | 5.0 |
|
| **Shift** | 5.0 | 12.0 | 5.0 | 5.0 |
|
||||||
@@ -17,55 +18,103 @@ They share the same model architecture — the difference is in the inference pi
|
|||||||
|
|
||||||
### Step 1: Download Weights
|
### Step 1: Download Weights
|
||||||
|
|
||||||
Download the original PyTorch checkpoints:
|
Download the original PyTorch checkpoints from HuggingFace using the `huggingface-cli` tool (install with `pip install huggingface_hub`):
|
||||||
|
|
||||||
**Wan2.1 (14B)**
|
**Wan2.1**
|
||||||
```bash
|
```bash
|
||||||
# From https://github.com/Wan-Video/Wan2.1 or HuggingFace
|
# Text-to-Video 1.3B (fast, fits in ~4 GB)
|
||||||
# Expected directory structure:
|
huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B --local-dir ./Wan2.1-T2V-1.3B
|
||||||
# wan21_checkpoint/
|
|
||||||
# ├── models_t5_umt5-xxl-enc-bf16.pth
|
# Text-to-Video 14B
|
||||||
# ├── Wan2.1_VAE.pth
|
huggingface-cli download Wan-AI/Wan2.1-T2V-14B --local-dir ./Wan2.1-T2V-14B
|
||||||
# └── diffusion_pytorch_model*.safetensors # single model
|
|
||||||
```
|
```
|
||||||
|
|
||||||
**Wan2.1 (1.3B)** — same structure, smaller transformer weights.
|
**Wan2.2**
|
||||||
|
|
||||||
**Wan2.2 (14B)**
|
|
||||||
```bash
|
```bash
|
||||||
# From https://github.com/Wan-Video/Wan2.2 or HuggingFace
|
# Text-to-Video 14B
|
||||||
# Expected directory structure:
|
huggingface-cli download Wan-AI/Wan2.2-T2V-A14B --local-dir ./Wan2.2-T2V-A14B
|
||||||
# wan22_checkpoint/
|
|
||||||
# ├── models_t5_umt5-xxl-enc-bf16.pth
|
# Image-to-Video 14B
|
||||||
# ├── Wan2.1_VAE.pth
|
huggingface-cli download Wan-AI/Wan2.2-I2V-A14B --local-dir ./Wan2.2-I2V-A14B
|
||||||
# ├── low_noise_model/ # safetensors
|
|
||||||
# └── high_noise_model/ # safetensors
|
# Text+Image-to-Video 5B (uses a different VAE — z_dim=48)
|
||||||
|
huggingface-cli download Wan-AI/Wan2.2-TI2V-5B --local-dir ./Wan2.2-TI2V-5B
|
||||||
```
|
```
|
||||||
|
|
||||||
**Wan2.2 I2V-14B** — same directory structure as Wan2.2 T2V. The conversion script auto-detects I2V-14B from the model's `config.json` (`model_type: "i2v"`, `in_dim: 36`).
|
Each downloaded directory will have this structure:
|
||||||
|
|
||||||
|
```
|
||||||
|
Wan2.1-T2V-*/
|
||||||
|
├── models_t5_umt5-xxl-enc-bf16.pth # T5 text encoder
|
||||||
|
├── Wan2.1_VAE.pth # 3D VAE
|
||||||
|
└── diffusion_pytorch_model*.safetensors # transformer (single)
|
||||||
|
|
||||||
|
Wan2.2-T2V-A14B/ or Wan2.2-I2V-A14B/
|
||||||
|
├── models_t5_umt5-xxl-enc-bf16.pth
|
||||||
|
├── Wan2.1_VAE.pth
|
||||||
|
├── low_noise_model/ # dual-model low-noise transformer
|
||||||
|
└── high_noise_model/ # dual-model high-noise transformer
|
||||||
|
|
||||||
|
Wan2.2-TI2V-5B/
|
||||||
|
├── models_t5_umt5-xxl-enc-bf16.pth
|
||||||
|
├── Wan2.2_VAE.pth # different VAE (z_dim=48)
|
||||||
|
└── diffusion_pytorch_model*.safetensors # transformer (single)
|
||||||
|
```
|
||||||
|
|
||||||
|
> **Wan2.2 I2V-14B** shares the same directory structure as Wan2.2 T2V. The conversion script auto-detects I2V from the model's `config.json` (`model_type: "i2v"`, `in_dim: 36`).
|
||||||
|
|
||||||
### Step 2: Convert to MLX Format
|
### Step 2: Convert to MLX Format
|
||||||
|
|
||||||
The conversion script auto-detects the model version based on the directory structure (presence of `low_noise_model/` subdirectory) and model type (`model_type` in source config.json for I2V vs T2V).
|
The conversion script auto-detects the model version from the directory structure (presence of `low_noise_model/` → Wan2.2 dual model) and the model type from `config.json` (I2V vs T2V).
|
||||||
|
|
||||||
|
#### Wan2.1 T2V 1.3B
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
# Auto-detect version
|
|
||||||
python -m mlx_video.convert_wan \
|
python -m mlx_video.convert_wan \
|
||||||
--checkpoint-dir /path/to/wan_checkpoint \
|
--checkpoint-dir ./Wan2.1-T2V-1.3B \
|
||||||
--output-dir wan_mlx
|
--output-dir ./Wan2.1-T2V-1.3B-MLX
|
||||||
|
|
||||||
# Explicit version
|
|
||||||
python -m mlx_video.convert_wan \
|
|
||||||
--checkpoint-dir /path/to/wan21_checkpoint \
|
|
||||||
--output-dir wan21_mlx \
|
|
||||||
--model-version 2.1
|
|
||||||
|
|
||||||
python -m mlx_video.convert_wan \
|
|
||||||
--checkpoint-dir /path/to/wan22_checkpoint \
|
|
||||||
--output-dir wan22_mlx \
|
|
||||||
--model-version 2.2
|
|
||||||
```
|
```
|
||||||
|
|
||||||
|
#### Wan2.1 T2V 14B
|
||||||
|
|
||||||
|
```bash
|
||||||
|
python -m mlx_video.convert_wan \
|
||||||
|
--checkpoint-dir ./Wan2.1-T2V-14B \
|
||||||
|
--output-dir ./Wan2.1-T2V-14B-MLX
|
||||||
|
```
|
||||||
|
|
||||||
|
#### Wan2.2 T2V 14B
|
||||||
|
|
||||||
|
```bash
|
||||||
|
python -m mlx_video.convert_wan \
|
||||||
|
--checkpoint-dir ./Wan2.2-T2V-A14B \
|
||||||
|
--output-dir ./Wan2.2-T2V-A14B-MLX
|
||||||
|
```
|
||||||
|
|
||||||
|
#### Wan2.2 I2V 14B
|
||||||
|
|
||||||
|
```bash
|
||||||
|
python -m mlx_video.convert_wan \
|
||||||
|
--checkpoint-dir ./Wan2.2-I2V-A14B \
|
||||||
|
--output-dir ./Wan2.2-I2V-A14B-MLX
|
||||||
|
```
|
||||||
|
|
||||||
|
The I2V model is auto-detected from `config.json`; the output will include a `vae_encoder.safetensors` used to encode the conditioning image.
|
||||||
|
|
||||||
|
#### Wan2.2 TI2V 5B
|
||||||
|
|
||||||
|
```bash
|
||||||
|
python -m mlx_video.convert_wan \
|
||||||
|
--checkpoint-dir ./Wan2.2-TI2V-5B \
|
||||||
|
--output-dir ./Wan2.2-TI2V-5B-MLX
|
||||||
|
```
|
||||||
|
|
||||||
|
The TI2V model uses a different VAE (`z_dim=48`, `vae_stride=(4,16,16)`) and is auto-detected during conversion.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
You can also pass `--model-version 2.1` or `--model-version 2.2` to force the version instead of relying on auto-detection.
|
||||||
|
|
||||||
#### Conversion Options
|
#### Conversion Options
|
||||||
|
|
||||||
| Option | Default | Description |
|
| Option | Default | Description |
|
||||||
@@ -90,20 +139,158 @@ wan_mlx/
|
|||||||
└── high_noise_model.safetensors # (Wan2.2) High-noise transformer
|
└── high_noise_model.safetensors # (Wan2.2) High-noise transformer
|
||||||
```
|
```
|
||||||
|
|
||||||
### Quantization (Reduced Memory)
|
### Step 3: Generate Video
|
||||||
|
|
||||||
Quantize the transformer weights to reduce memory usage by ~3.4x. This is especially useful for the 14B model or memory-constrained devices:
|
#### Wan2.1 T2V 1.3B
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
# Convert with 4-bit quantization
|
python -m mlx_video.generate_wan \
|
||||||
|
--model-dir ./Wan2.1-T2V-1.3B-MLX \
|
||||||
|
--prompt "A cat playing piano in a cozy living room, cinematic lighting" \
|
||||||
|
--width 832 --height 480 --num-frames 81 \
|
||||||
|
--steps 50 --guide-scale 5.0 \
|
||||||
|
--seed 42 \
|
||||||
|
--output-path wan21_1b.mp4
|
||||||
|
```
|
||||||
|
|
||||||
|
#### Wan2.1 T2V 14B
|
||||||
|
|
||||||
|
```bash
|
||||||
|
python -m mlx_video.generate_wan \
|
||||||
|
--model-dir ./Wan2.1-T2V-14B-MLX \
|
||||||
|
--prompt "A woman walks through a misty forest at dawn, slow motion, cinematic" \
|
||||||
|
--width 1280 --height 704 --num-frames 81 \
|
||||||
|
--steps 50 --guide-scale 5.0 \
|
||||||
|
--seed 42 \
|
||||||
|
--output-path wan21_14b.mp4
|
||||||
|
```
|
||||||
|
|
||||||
|
> **Tip**: If the first few frames look washed out or have color artifacts, add `--trim-first-frames 1` to generate 4 extra frames at the start and discard them. With the `unipc` scheduler (default), **10 steps** often gives satisfying results — useful for quick iteration.
|
||||||
|
|
||||||
|
#### Wan2.2 T2V 14B
|
||||||
|
|
||||||
|
Wan2.2 uses a dual-model pipeline (separate high-noise and low-noise transformers) and takes guidance as a `high,low` pair:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
python -m mlx_video.generate_wan \
|
||||||
|
--model-dir ./Wan2.2-T2V-A14B-MLX \
|
||||||
|
--prompt "Two astronauts playing chess on the surface of the moon, dramatic lighting, 8K" \
|
||||||
|
--negative-prompt "low quality, blurry, distorted" \
|
||||||
|
--width 1280 --height 704 --num-frames 81 \
|
||||||
|
--steps 40 --guide-scale "3.0,4.0" \
|
||||||
|
--seed 42 \
|
||||||
|
--output-path wan22_t2v.mp4
|
||||||
|
```
|
||||||
|
|
||||||
|
> **Tip**: With the `unipc` scheduler (default), **10 steps** often produces satisfying results for 14B models — a significant speed-up with minimal quality loss. Try `--steps 10` for quick iterations.
|
||||||
|
|
||||||
|
#### Wan2.2 I2V 14B
|
||||||
|
|
||||||
|
Image-to-video: animates a starting image guided by a text prompt. Pass the image with `--image`:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
python -m mlx_video.generate_wan \
|
||||||
|
--model-dir ./Wan2.2-I2V-A14B-MLX \
|
||||||
|
--image ./my_photo.png \
|
||||||
|
--prompt "The person slowly turns their head and smiles, cinematic, natural lighting" \
|
||||||
|
--negative-prompt "low quality, blurry, distorted" \
|
||||||
|
--width 1280 --height 704 --num-frames 81 \
|
||||||
|
--steps 40 --guide-scale "3.5,3.5" \
|
||||||
|
--seed 42 \
|
||||||
|
--output-path wan22_i2v.mp4
|
||||||
|
```
|
||||||
|
|
||||||
|
> **Tip**: As with T2V, `--steps 10` with the `unipc` scheduler is often sufficient for fast prototyping.
|
||||||
|
|
||||||
|
#### Wan2.2 TI2V 5B
|
||||||
|
|
||||||
|
Text+image-to-video: a single-model variant with a larger VAE (`z_dim=48`). Resolution must be divisible by **32** (not 16 as with other models):
|
||||||
|
|
||||||
|
```bash
|
||||||
|
python -m mlx_video.generate_wan \
|
||||||
|
--model-dir ./Wan2.2-TI2V-5B-MLX \
|
||||||
|
--image ./my_photo.png \
|
||||||
|
--prompt "The subject waves hello, warm sunlight, film grain" \
|
||||||
|
--width 1280 --height 704 --num-frames 41 \
|
||||||
|
--steps 40 --guide-scale 5.0 \
|
||||||
|
--seed 42 \
|
||||||
|
--output-path wan22_ti2v.mp4
|
||||||
|
```
|
||||||
|
|
||||||
|
> **Note**: The 5B model is fast — 40 steps run quickly and are recommended for best quality.
|
||||||
|
|
||||||
|
> **Frame count**: `--num-frames` must satisfy `4n+1` for all models (e.g. 5, 9, 13, 21, 41, 81, 101 …).
|
||||||
|
|
||||||
|
> **Resolution**: Always use the model's native resolution. While generation will succeed at other sizes, mismatched resolutions or aspect ratios are likely to produce visual artifacts. Preferred resolutions are:
|
||||||
|
> - **480P** — 832×480 (landscape) or 480×832 (portrait) — for Wan2.1 1.3B
|
||||||
|
> - **720P** — 1280×704 (landscape) or 704×1280 (portrait) — for Wan2.1 14B, Wan2.2 T2V/I2V/TI2V
|
||||||
|
|
||||||
|
#### Generation Options
|
||||||
|
|
||||||
|
| Option | Default | Description |
|
||||||
|
|--------|---------|-------------|
|
||||||
|
| `--model-dir` | (required) | Path to converted MLX model directory |
|
||||||
|
| `--prompt` | (required) | Text prompt |
|
||||||
|
| `--image` | — | Input image path (I2V and TI2V modes) |
|
||||||
|
| `--negative-prompt` | config default | Negative guidance prompt |
|
||||||
|
| `--width` | `1280` | Output width in pixels |
|
||||||
|
| `--height` | `704` | Output height in pixels |
|
||||||
|
| `--num-frames` | `81` | Number of frames (must be `4n+1`) |
|
||||||
|
| `--steps` | config default | Diffusion steps |
|
||||||
|
| `--guide-scale` | config default | Guidance scale; use `"high,low"` pair for Wan2.2 dual models |
|
||||||
|
| `--shift` | config default | Noise schedule shift |
|
||||||
|
| `--seed` | `-1` (random) | Random seed for reproducibility |
|
||||||
|
| `--output-path` | `output.mp4` | Output video file path |
|
||||||
|
| `--scheduler` | `unipc` | Solver: `euler`, `dpm++`, or `unipc` |
|
||||||
|
| `--trim-first-frames` | `0` | Drop N leading frames (fixes first-frame artifacts on 14B models) |
|
||||||
|
| `--tiling` | `auto` | VAE tiling: `auto`, `none`, `spatial`, `temporal` |
|
||||||
|
|
||||||
|
### Quantization (Reduced Memory)
|
||||||
|
|
||||||
|
Quantize the transformer weights to reduce memory usage by ~3.4×. Quantization is supported for all model variants and is especially important for running 14B models on devices with limited unified memory:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Convert with 4-bit quantization (works for any variant)
|
||||||
python -m mlx_video.convert_wan \
|
python -m mlx_video.convert_wan \
|
||||||
--checkpoint-dir /path/to/Wan2.1-T2V-1.3B \
|
--checkpoint-dir ./Wan2.1-T2V-1.3B \
|
||||||
--output-dir wan21_mlx_q4 \
|
--output-dir ./Wan2.1-T2V-1.3B-MLX-Q4 \
|
||||||
--quantize --bits 4 --group-size 64
|
--quantize --bits 4 --group-size 64
|
||||||
|
|
||||||
# Generate with quantized model (auto-detected from config.json)
|
python -m mlx_video.convert_wan \
|
||||||
|
--checkpoint-dir ./Wan2.1-T2V-14B \
|
||||||
|
--output-dir ./Wan2.1-T2V-14B-MLX-Q4 \
|
||||||
|
--quantize --bits 4 --group-size 64
|
||||||
|
|
||||||
|
python -m mlx_video.convert_wan \
|
||||||
|
--checkpoint-dir ./Wan2.2-T2V-A14B \
|
||||||
|
--output-dir ./Wan2.2-T2V-A14B-MLX-Q4 \
|
||||||
|
--quantize --bits 4 --group-size 64
|
||||||
|
|
||||||
|
python -m mlx_video.convert_wan \
|
||||||
|
--checkpoint-dir ./Wan2.2-I2V-A14B \
|
||||||
|
--output-dir ./Wan2.2-I2V-A14B-MLX-Q4 \
|
||||||
|
--quantize --bits 4 --group-size 64
|
||||||
|
|
||||||
|
python -m mlx_video.convert_wan \
|
||||||
|
--checkpoint-dir ./Wan2.2-TI2V-5B \
|
||||||
|
--output-dir ./Wan2.2-TI2V-5B-MLX-Q4 \
|
||||||
|
--quantize --bits 4 --group-size 64
|
||||||
|
```
|
||||||
|
|
||||||
|
You can also quantize an already-converted MLX model without re-converting from PyTorch:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
python -m mlx_video.convert_wan \
|
||||||
|
--checkpoint-dir ./Wan2.2-T2V-A14B-MLX \
|
||||||
|
--output-dir ./Wan2.2-T2V-A14B-MLX-Q4 \
|
||||||
|
--quantize-only --bits 4
|
||||||
|
```
|
||||||
|
|
||||||
|
Quantized models are used exactly the same way — the quantization is auto-detected from `config.json`:
|
||||||
|
|
||||||
|
```bash
|
||||||
python -m mlx_video.generate_wan \
|
python -m mlx_video.generate_wan \
|
||||||
--model-dir wan21_mlx_q4 \
|
--model-dir ./Wan2.2-T2V-A14B-MLX-Q4 \
|
||||||
--prompt "A cat playing piano"
|
--prompt "A cat playing piano"
|
||||||
```
|
```
|
||||||
|
|
||||||
@@ -157,5 +344,6 @@ python -m mlx_video.generate_wan \
|
|||||||
--lora-low /Volumes/SSD/Wan-AI/lightx2v/Wan2.2-Lightning/Wan2.2-T2V-A14B-4steps-lora-rank64-Seko-V2.0/low_noise_model.safetensors 1
|
--lora-low /Volumes/SSD/Wan-AI/lightx2v/Wan2.2-Lightning/Wan2.2-T2V-A14B-4steps-lora-rank64-Seko-V2.0/low_noise_model.safetensors 1
|
||||||
```
|
```
|
||||||
|
|
||||||
Which results in
|
## Enjoy
|
||||||
|
|
||||||

|

|
||||||
Reference in New Issue
Block a user