Add README.md usage examples for the STG and modality-scale parameters in video generation. Update generate.py to apply STG and modality guidance during denoising when audio is generated alongside video. Refactor the transformer attention mechanisms so that self-attention can be skipped, which is what enables STG perturbation and modality isolation. Thread the new parameters through LTXModel and the transformer block processing.
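
As a rough illustration of how the guidance terms described above could combine at each denoising step, here is a minimal sketch. It is not taken from generate.py: the function name, argument names, and the exact formula are assumptions, chosen only so that the documented "off" defaults (`--stg-scale 0.0`, `--modality-scale 1.0`) leave the conditional prediction unchanged.

```python
from typing import List, Sequence


def combine_guidance(
    cond: Sequence[float],
    uncond: Sequence[float],
    stg_perturbed: Sequence[float],
    modality_isolated: Sequence[float],
    cfg_scale: float = 1.0,
    stg_scale: float = 0.0,
    modality_scale: float = 1.0,
) -> List[float]:
    """Hypothetical combination of CFG, STG, and modality guidance.

    cond              -- prediction with the full prompt and full attention
    uncond            -- prediction with the negative prompt (CFG branch)
    stg_perturbed     -- prediction with self-attention skipped in the STG blocks
    modality_isolated -- prediction with the audio/video cross-modal path disabled

    Assumed formula: each term pushes the result away from a degraded
    prediction and vanishes at its "off" value (cfg_scale=1, stg_scale=0,
    modality_scale=1).
    """
    return [
        c
        + (cfg_scale - 1.0) * (c - u)
        + stg_scale * (c - p)
        + (modality_scale - 1.0) * (c - m)
        for c, u, p, m in zip(cond, uncond, stg_perturbed, modality_isolated)
    ]


if __name__ == "__main__":
    cond = [1.0, 2.0]
    uncond = [0.0, 0.0]
    perturbed = [0.5, 1.0]
    isolated = [1.0, 1.0]
    # Scales mirror the README example: --stg-scale 1.0 --modality-scale 3.0
    print(combine_guidance(cond, uncond, perturbed, isolated,
                           stg_scale=1.0, modality_scale=3.0))
```

In a real pipeline these would be MLX arrays rather than Python lists, and each extra guidance term costs one additional forward pass per step, which is presumably why both scales default to their "off" values.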

2026-03-14 10:26:12 +01:00
parent f346e09de4
commit 9cba2ea7cd
5 changed files with 200 additions and 78 deletions
--- a/README.md
+++ b/README.md
@@ -78,6 +78,10 @@ uv run mlx_video.generate --pipeline dev --prompt "Waves crashing" --image beach
 ```bash
 uv run mlx_video.generate --prompt "Ocean waves crashing" --audio
 uv run mlx_video.generate --pipeline dev --prompt "A jazz band playing" --audio --enhance-prompt
+
+# With full guidance (STG + modality_scale, matches PyTorch defaults)
+uv run mlx_video.generate --pipeline dev --prompt "Ocean waves crashing" --audio \
+    --stg-scale 1.0 --stg-blocks 29 --modality-scale 3.0
 ```

 ### LoRA
@@ -146,6 +150,9 @@ uv run mlx_video.upscale --input video.mp4 --output upscaled.mp4 --refine --prom
 | `--cfg-rescale` | 0.7 | CFG rescale factor (reduces over-saturation) |
 | `--negative-prompt` | (default) | Negative prompt for CFG |
 | `--apg` | false | Use Adaptive Projected Guidance (more stable for I2V) |
+| `--stg-scale` | 0.0 | STG scale (PyTorch default: 1.0, requires `--audio`) |
+| `--stg-blocks` | None | Transformer blocks for STG ([29] for LTX-2, [28] for LTX-2.3) |
+| `--modality-scale` | 1.0 | Cross-modal guidance scale (PyTorch default: 3.0, requires `--audio`) |

 **Dev-Two-Stage LoRA options:**