Alibaba Cloud officially launched its WAN 2.6 series video generation models on January 3, 2026, introducing China’s first reference-to-video (R2V) capability that lets users insert themselves into AI-generated scenes. The WAN 2.6 video generation suite includes text-to-video (T2V), image-to-video (I2V), and the breakthrough reference-to-video model, all supporting up to 15-second outputs with automatic voiceover and 1080p resolution. This marks a significant upgrade from WAN 2.5, which offered preview-level features with shorter generation limits.
What’s New in WAN 2.6
The WAN 2.6 series delivers three major capabilities previously unavailable in open-source Chinese AI video models. The reference-to-video model (WAN 2.6-R2V) analyzes input videos to extract character appearance, motion patterns, and voice characteristics, then generates new scenes while maintaining consistency. Multi-shot narrative generation structures 15-second videos into distinct scenes with smooth transitions, enabling creators to build story arcs instead of single continuous clips. Native audio-visual synchronization automatically generates voiceovers that match lip movements and scene context, with support for custom audio file imports.
The models are available through Alibaba Cloud Model Studio with three variants:
- WAN 2.6-T2V: Text-to-video at $0.10/second (720p)
- WAN 2.6-I2V: Image-to-video at $0.10/second (720p), $0.15/second (1080p)
- WAN 2.6-R2V: Reference-to-video at $0.10/second (720p), $0.15/second (1080p)
WAN 2.5 remains available as a preview version priced at $0.05/second (480p), offering automatic dubbing but limited to 50-second maximum outputs.
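The per-second rates above translate directly into per-clip costs. As a minimal sketch (the rate table and helper function are illustrative only, not part of any official SDK), the cost of a clip is simply rate × duration:

```python
# Illustrative cost estimator built from the per-second rates listed above.
# The rate table and function names are not part of any official SDK.
RATES_PER_SECOND = {
    ("wan2.6-t2v", "720p"): 0.10,
    ("wan2.6-i2v", "720p"): 0.10,
    ("wan2.6-i2v", "1080p"): 0.15,
    ("wan2.6-r2v", "720p"): 0.10,
    ("wan2.6-r2v", "1080p"): 0.15,
    ("wan2.5-preview", "480p"): 0.05,
}

def estimate_cost(model: str, resolution: str, seconds: int) -> float:
    """Return the estimated USD cost for one generated clip."""
    return RATES_PER_SECOND[(model, resolution)] * seconds

# A full 15-second WAN 2.6 R2V clip at 1080p: 15 * $0.15 = $2.25
print(estimate_cost("wan2.6-r2v", "1080p", 15))
```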
Why It Matters
WAN 2.6’s reference-based generation solves a critical pain point for short-form content creators who need character consistency across multiple videos. Traditional AI video models generate random characters each time, forcing creators to manually edit or reshoot scenes. WAN 2.6 allows users to upload a reference video once and generate unlimited scenes featuring the same person, cartoon character, or object while maintaining visual and audio consistency.
The 15-second output limit positions WAN 2.6 competitively against most open-source models that cap at 2-5 seconds, giving creators enough time to develop complete story arcs, product showcases, or ad concepts without stitching multiple clips. For developers and production teams working on short-form drama or social media content, this streamlines workflows that previously required multiple tools and manual editing.
How Multi-Shot Narrative Works
WAN 2.6 uses structured prompting to create scene-based videos with temporal control. Creators define shots using time brackets within a single prompt:
Prompt structure:
- Global style description (lighting, quality, cinematic tone)
- Shot-by-shot breakdown with timing markers
- Character labels (character1, character2) for consistency
Example prompt:
```text
A cinematic tech demo, 4K, film grain.
Shot 1 [0-5s] character1 walks through a server room.
Shot 2 [5-10s] Close-up of character1 examining holographic data.
Shot 3 [10-15s] Wide shot as character1 exits the facility.
```
The model maintains character appearance and voice across all three shots while handling scene transitions automatically. Up to two characters can be included per video when using reference inputs.
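For programmatic workflows, the same structure can be assembled from data rather than written by hand. The sketch below only assumes the prompt pattern shown above; the `Shot` class and `build_prompt` helper are illustrative names, not part of any WAN SDK, and the 800-character cap reflects the prompt limit noted in the next section:

```python
from dataclasses import dataclass

# Illustrative helper for assembling a multi-shot prompt in the
# "global style + timed shots + characterN labels" pattern shown above.
@dataclass
class Shot:
    start: int        # shot start time in seconds
    end: int          # shot end time in seconds
    description: str

def build_prompt(global_style: str, shots: list[Shot], max_chars: int = 800) -> str:
    lines = [global_style]
    for i, shot in enumerate(shots, start=1):
        lines.append(f"Shot {i} [{shot.start}-{shot.end}s] {shot.description}")
    prompt = "\n".join(lines)
    if len(prompt) > max_chars:  # WAN prompts are capped at 800 characters
        raise ValueError(f"Prompt is {len(prompt)} characters; limit is {max_chars}")
    return prompt

prompt = build_prompt(
    "A cinematic tech demo, 4K, film grain.",
    [
        Shot(0, 5, "character1 walks through a server room."),
        Shot(5, 10, "Close-up of character1 examining holographic data."),
        Shot(10, 15, "Wide shot as character1 exits the facility."),
    ],
)
print(prompt)
```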
What’s Next
Alibaba Cloud has made WAN 2.6 available immediately through its Model Studio API and web interface, with a 90-day free trial offering 50 seconds of 720p generation. The company has not announced specific roadmap details for WAN 2.7 or extended video lengths beyond 15 seconds. Current limitations include a maximum video duration of 50 seconds across all WAN models and an 800-character prompt limit, though built-in prompt expansion helps optimize shorter inputs.
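Generation jobs are submitted through the Model Studio API. The snippet below is a hedged sketch only: the endpoint URL, header, and payload field names are placeholders and the model identifier is assumed, so the actual request format should be taken from Alibaba Cloud's Model Studio documentation.

```python
import os
import requests

# Hedged sketch of submitting a WAN 2.6 text-to-video job over HTTP.
# The endpoint URL, header, and payload field names are placeholders
# (assumptions), not the documented Model Studio request format.
ENDPOINT = "https://example-modelstudio-endpoint/api/v1/video/generations"  # placeholder

prompt_text = (
    "A cinematic tech demo, 4K, film grain. "
    "Shot 1 [0-5s] character1 walks through a server room."
)
assert len(prompt_text) <= 800  # WAN prompts are capped at 800 characters

payload = {
    "model": "wan2.6-t2v",   # assumed model identifier
    "resolution": "720p",
    "duration": 15,          # seconds, up to the 15-second limit
    "prompt": prompt_text,
}

response = requests.post(
    ENDPOINT,
    headers={"Authorization": f"Bearer {os.environ['MODEL_STUDIO_API_KEY']}"},
    json=payload,
    timeout=60,
)
response.raise_for_status()
print(response.json())
```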
Third-party platforms including AKOOL, WaveSpeedAI, and fal.ai have begun integrating WAN 2.6 models, expanding access beyond Alibaba’s ecosystem. Pricing remains consistent at $0.10/second for 720p across both Singapore and Beijing regions, making it competitive with existing text-to-video services.
Frequently Asked Questions
What is WAN 2.6 video generation?
WAN 2.6 is Alibaba Cloud’s latest AI video generation model series that creates up to 15-second videos from text, images, or reference videos with multi-shot narratives and automatic audio synchronization. It includes text-to-video, image-to-video, and reference-to-video capabilities.
How does WAN 2.6 reference-to-video work?
The WAN 2.6-R2V model analyzes an input video to extract character appearance, motion style, and voice characteristics, then generates new scenes maintaining those traits. Users can include up to two characters per video and specify actions through text prompts.
What’s the difference between WAN 2.6 and WAN 2.5?
WAN 2.6 extends video generation to 15 seconds with multi-shot narratives and reference-based character consistency, while WAN 2.5 is a preview version limited to 50 seconds total output with basic automatic dubbing. WAN 2.6 also offers 1080p resolution versus WAN 2.5’s maximum 720p.
How much does WAN 2.6 cost?
WAN 2.6 pricing is $0.10 per second for 720p and $0.15 per second for 1080p video generation. Alibaba Cloud provides a 90-day free trial with 50 seconds of 720p generation quota upon activating Model Studio.

