WAN VACE is a powerful unified framework for video creation and editing, supporting reference-to-video generation, video-to-video editing, masked video-to-video editing, and free composition of these tasks within a single model.
Get Started with WAN VACE →
All videos generated using the WAN VACE framework.
Try WAN VACE Now
Experience WAN VACE's capabilities through this interactive demo. Generate, edit, and transform videos with powerful multimodal inputs.
Note: This demo runs on shared resources. For best performance, install WAN VACE locally.
WAN VACE integrates multiple video tasks into a single model, reducing deployment complexity and providing a streamlined user experience.
Combine different video tasks (reference, editing, masking) in flexible ways to create complex creative scenarios previously impossible with single-task models.
Advanced Video Condition Unit (VCU) technology maintains consistency across both temporal and spatial dimensions for high-quality outputs.
Generate diverse video content from reference images, edit videos with precise controls, and seamlessly blend different elements.
Build upon the unified architecture to create new video synthesis models and applications with the flexible Diffusion Transformer framework.
Create and edit promotional videos, transform existing footage, and generate new content with unprecedented control and efficiency.
Unified interface that integrates diverse input conditions including text, frames, and masks for flexible video generation (see the sketch below).
Separates visual modalities in editing and reference tasks, allowing the model to understand what to retain and what to modify.
Pluggable architecture that injects different task concepts into the model through collaborative spatiotemporal representation.
Handles reference-to-video generation, video-to-video editing, masked editing, and all their combinations in a single framework.
Built on Diffusion Transformer architecture for better scalability and handling of long video sequences.
Achieves results comparable to task-specific models while providing the flexibility of a unified framework.
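To make the VCU and concept-decoupling ideas above concrete, here is a minimal, illustrative Python sketch. The class name, fields, and tensor shapes are assumptions for explanation only, not the repository's actual API; the mask-based split into "reactive" (regions to regenerate) and "inactive" (regions to preserve) content mirrors the decoupling described above.

# Minimal, illustrative sketch of a Video Condition Unit (VCU) style input bundle.
# Names and shapes are hypothetical; see the VACE repository for the real interfaces.
from dataclasses import dataclass, field
from typing import List
import torch

@dataclass
class VideoConditionUnit:
    prompt: str                      # text condition
    frames: torch.Tensor             # (T, C, H, W) source/context video frames
    masks: torch.Tensor              # (T, 1, H, W), 1 = regenerate, 0 = keep
    ref_images: List[torch.Tensor] = field(default_factory=list)  # optional reference images

    def decouple(self):
        """Concept decoupling: split frames into content the model should
        regenerate (reactive) and content it should preserve (inactive)."""
        reactive = self.frames * self.masks        # regions to modify
        inactive = self.frames * (1 - self.masks)  # regions to retain
        return reactive, inactive

# Example: a masked-editing request that also carries one reference image.
T, C, H, W = 16, 3, 480, 832
vcu = VideoConditionUnit(
    prompt="replace the car with a red vintage convertible",
    frames=torch.rand(T, C, H, W),
    masks=torch.zeros(T, 1, H, W),
    ref_images=[torch.rand(C, H, W)],
)
vcu.masks[:, :, 100:300, 200:500] = 1.0  # mark the region to edit
reactive, inactive = vcu.decouple()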
WAN VACE provides multiple model variants to balance performance, speed, and resource requirements. The LTX-Video-0.9-based variant is lighter on resources, while the Wan2.1-based variant targets higher-resolution, higher-quality outputs. Model weights for each variant are available on HuggingFace.
git clone https://github.com/ali-vilab/VACE.git && cd VACE
pip install torch==2.5.1 torchvision==0.20.1 --index-url https://download.pytorch.org/whl/cu124 # If PyTorch is not installed.
pip install -r requirements.txt
pip install wan@git+https://github.com/Wan-Video/Wan2.1 # If you want to use Wan2.1-based VACE.
pip install ltx-video@git+https://github.com/Lightricks/LTX-Video@ltx-video-0.9.1 sentencepiece --no-deps # If you want to use LTX-Video-0.9-based VACE. It may conflict with Wan.
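After installation, model weights can be fetched from HuggingFace. The snippet below is a small sketch using huggingface_hub's snapshot_download; the repo_id shown is an illustrative assumption, so check the project's HuggingFace page for the exact model identifiers.

# Download VACE model weights locally with huggingface_hub.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="ali-vilab/VACE-Wan2.1-1.3B-Preview",  # hypothetical example; use the variant you need
    local_dir="models/VACE-Wan2.1-1.3B",
)
print(f"Model weights downloaded to {local_path}")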
The WAN VACE framework is also available through a range of third-party integrations.
How does WAN VACE differ from other video generation models?
WAN VACE uniquely integrates multiple video tasks into a single unified framework, allowing task composition and flexible creative workflows that are not possible with single-task models.
What hardware do I need to run WAN VACE?
For the LTX-Video-based version, a modern GPU with at least 8GB of VRAM is sufficient. The Wan-T2V-based version performs best with 16GB+ of VRAM for high-resolution outputs.
Can I use WAN VACE for commercial projects?
Yes, WAN VACE is built for both research and commercial applications. Check the specific license terms for the model variant you're using.
Can WAN VACE handle long videos?
Yes, WAN VACE can handle longer video sequences thanks to its efficient spatiotemporal representation, which maintains consistency throughout.
Can I fine-tune WAN VACE on my own data?
Yes, WAN VACE supports both full fine-tuning and the more efficient Context Adapter Tuning method for customizing it to specific domains.
How do I combine multiple tasks in a single generation?
The Video Condition Unit (VCU) unifies inputs from different tasks, allowing you to combine reference images, editing instructions, and masks in the same generation process.
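As a rough illustration of how task composition works through the VCU, the sketch below maps task types to the inputs they populate; the task names and field names are descriptive assumptions rather than the repository's configuration keys.

# Illustrative mapping from VACE task types to the VCU inputs they use.
TASK_INPUTS = {
    "text-to-video":      {"prompt"},
    "reference-to-video": {"prompt", "ref_images"},
    "video-to-video":     {"prompt", "frames"},
    "masked-editing":     {"prompt", "frames", "masks"},
}

def required_inputs(tasks):
    """Free composition: the union of VCU fields needed by the combined tasks."""
    needed = set()
    for task in tasks:
        needed |= TASK_INPUTS[task]
    return needed

# Example: keep a reference subject consistent while editing a masked region.
print(sorted(required_inputs(["reference-to-video", "masked-editing"])))
# ['frames', 'masks', 'prompt', 'ref_images']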