Kling AI, the video-generation lab of Chinese short-video giant Kuaishou, has launched “Video O1,” a unified model that merges generation and editing into a single architecture.
Challenging the fragmented workflows of incumbents like Runway and Google, the system introduces “Multimodal Visual Language” (MVL) to enable pixel-level manipulation via natural language prompts.
Arriving amid a “Super Sunday” of releases from Tencent, ByteDance, and Runway, the launch signals a shift from pure generation to precise, agentic video control.
The Unified Architecture: Merging Generation and Editing
Moving beyond the industry standard of separate models for generation and post-production, Kling AI has consolidated these functions into a single architecture. By integrating generation, editing, and extension into one pipeline, the company aims to eliminate the friction of switching between specialized tools.
The official announcement details how this consolidation impacts the creative workflow.
“We’ve redefined the video creation process by combining multiple tasks… into a single, all-powerful engine.”
Dubbed “Video O1,” the system natively handles text-to-video, image-to-video, and complex video extension tasks without requiring model switching.
Central to this integration is the “Multimodal Visual Language” (MVL), a new interaction layer designed to interpret complex user intents. Addressing the limitations of traditional text encoders in handling spatial instructions, the architecture introduces a novel method for signal processing. According to the release notes:
“The VIDEO O1 model innovatively introduces MVL as an interactive medium. Through the Transformer, it deeply merges text semantics with multimodal signals, strengthening the model’s understanding capabilities at its core. It supports the flexible invocation and seamless integration of multiple tasks within a single input box.”
By processing text semantics alongside visual signals, the model can understand instruction-based edits rather than just generating pixels from scratch. This capability allows for precise modifications, such as altering specific objects while preserving the surrounding scene.
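Kuaishou has not published Video O1’s internals, so the following is a minimal conceptual sketch of what fusing text semantics with visual signals in a single Transformer context could look like. Every class name, dimension, and projection choice here is an assumption for illustration, not Kling’s actual architecture.

```python
import torch
import torch.nn as nn

class UnifiedEditSketch(nn.Module):
    """Toy illustration (hypothetical, not Kling's design): instruction
    tokens and video latent tokens share one attention context, so an
    edit like "remove bystanders" can attend directly to the regions
    it should change while leaving the rest of the scene untouched."""

    def __init__(self, d_model=512, n_heads=8, n_layers=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.text_proj = nn.Linear(768, d_model)    # assumed text-encoder width
        self.video_proj = nn.Linear(1024, d_model)  # assumed video-tokenizer width

    def forward(self, text_emb, video_tokens):
        # One concatenated sequence: no hand-off between separate
        # generation and editing models, hence no pipeline error buildup.
        seq = torch.cat([self.text_proj(text_emb),
                         self.video_proj(video_tokens)], dim=1)
        fused = self.backbone(seq)
        # Only the video positions are decoded back into edited frames.
        return fused[:, text_emb.shape[1]:, :]
```

The design point is that editing becomes conditional generation inside the same weights, rather than a second model re-interpreting a finished render.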
Addressing a core technical bottleneck in generative video, this unified approach mitigates the “pipeline problem,” in which errors accumulate as assets move between specialized models: if each of three chained stages preserved, say, 90% of visual fidelity, only about 73% would survive the full pipeline.
By handling all tasks within a single transformer context, the system maintains semantic consistency throughout the editing process. The Kling AI team explains the scope of this integration:
“The Kling VIDEO O1 Model is the first in the video generation field to integrate a wide range of tasks – including Reference to Video, Text-to-Video, Start & End Frames generation, video content editing, modifications, transformations, restyling, and camera extension – all into one unified model. No need to switch between different models and tools; with VIDEO O1, you can seamlessly go from ideation to generation, and from generation to modification, all in one place.”
The ‘Nano Banana’ Factor: Natural Language Control
Analysts have drawn parallels between Kling’s new capabilities and Google’s Nano Banana model, specifically regarding the precision of semantic editing. While “Nano Banana” refers to Google’s image editing technology, the comparison highlights the industry’s push toward granular control in video.
Alvaro Cintas-Canto, a Professor of AI at Marymount University, framed the significance of this capability in a report by the South China Morning Post (SCMP).
“Kling O1 is the Nano Banana for AI video.”
Users can execute complex edits – such as “remove bystanders” or “change weather from day to night” – using conversational natural language.
Partner platform Invideo echoed this framing in a post on X announcing “VFX House,” a full VFX studio built on Kling’s O1 model inside Invideo:
“AI companies have been LYING to creators for 2 years. Cool demos but – zero control. Zero reliability. Zero continuity. Today… that ends. The chaos era is over. The control era begins now. Kling o1…”
— Invideo (@invideoOfficial), December 1, 2025
This approach replaces the traditional, labor-intensive workflow of manual rotoscoping, masking, and keyframing, which often requires specialized software and significant time investment. The Kling AI team emphasized the operational benefits, stating:
“No need for manual masking or keyframes – just type in prompts like ‘remove bystanders,’ ‘change daylight to dusk,’ or ‘swap the main character’s outfit,’ and the model will understand the visual logic.”
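No API details appear in the quoted material, so the request below is purely illustrative: the endpoint, field names, and identity-lock parameter are hypothetical stand-ins showing how a prompt-driven edit of this kind might be submitted programmatically.

```python
import requests

# Hypothetical endpoint and schema -- illustrative only, not Kling's real API.
API_URL = "https://api.example.com/v1/video/edit"

payload = {
    "model": "video-o1",                  # assumed model identifier
    "video_id": "src_clip_123",           # previously uploaded source clip
    "prompt": "remove bystanders, change daylight to dusk",
    "preserve": ["main_character"],       # hypothetical identity-lock field
}

resp = requests.post(
    API_URL,
    json=payload,
    headers={"Authorization": "Bearer <token>"},
    timeout=60,
)
resp.raise_for_status()
print(resp.json().get("task_id"))  # video edits typically complete asynchronously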
Beyond environmental edits, the model introduces “All-in-One Reference” technology to solve the persistent issue of temporal coherence. By locking onto character and prop identities, the system maintains visual consistency across dynamic shots, preventing the “flicker” or identity drift common in diffusion models.
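How “All-in-One Reference” works internally is not disclosed; one plausible mechanism is blending a fixed reference embedding into every frame’s latent so character identity cannot drift from shot to shot. The function below is a guessed sketch under that assumption, not Kling’s documented method.

```python
import torch

def lock_identity(frame_latents: torch.Tensor,
                  ref_embedding: torch.Tensor,
                  alpha: float = 0.1) -> torch.Tensor:
    """Hypothetical sketch: nudge each frame's latent toward a single
    reference embedding (e.g., derived from character images) to suppress
    flicker and identity drift. `alpha` trades edit freedom for consistency.

    frame_latents: (num_frames, dim); ref_embedding: (1, dim)."""
    ref = ref_embedding.expand_as(frame_latents)
    return (1.0 - alpha) * frame_latents + alpha * ref
```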
Super Sunday: The Four-Way Market Clash
Arriving amid a “Super Sunday” for the AI video sector, the launch coincides with simultaneous releases from major global competitors. Runway’s Gen-4.5 claimed the top spot on the Video Arena leaderboard with a focus on physics and world models, intensifying competitive pressure across the sector.
Domestically, Tencent’s HunyuanVideo-1.5 (8.3B parameters) and ByteDance’s Vidi2 (12B parameters) launched the same day, targeting the open-source community. Kling AI has positioned itself aggressively against these rivals, releasing internal benchmark data to support its claims of superiority.
Performance metrics in the official announcement provide specific comparisons against Google’s proprietary model.
“In the ‘Image Reference’ task… we compared the Kling AI VIDEO O1 Model with Google Veo 3.1’s Ingredients to Video as the benchmark model… The comparison results show that the VIDEO O1 Model excels… with a performance win ratio of 247% compared to Google Veo 3.1’s Ingredients to Video.”
In transformation tasks, the company claims a 230% performance win ratio over Runway’s Aleph model. These figures suggest a significant lead in specific editing workflows, particularly those involving image references and style transfer.
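The announcement does not define “win ratio”; a common reading is wins divided by losses in pairwise human preference tests. Under that assumption, the back-solved split below shows what a 247% ratio would imply (illustrative arithmetic, not reported data).

```python
# Assumes win ratio = wins / losses in pairwise preference trials (ties ignored).
ratio = 2.47                   # the claimed 247% win ratio
wins = ratio / (1 + ratio)     # share of trials won  -> ~71%
losses = 1 / (1 + ratio)       # share of trials lost -> ~29%
print(f"wins ~ {wins:.0%}, losses ~ {losses:.0%}")
```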
However, these figures remain internally sourced, creating a sharp contrast with the reproducible, open-weight models released by Tencent and ByteDance. Independent verification will be crucial to validate whether these “win ratios” translate to real-world production environments.
Pricing and Commercial Strategy
Unlike its open-source domestic rivals, Kuaishou is pursuing a closed SaaS revenue model, restricting Video O1 to its “Pro Mode.” Pricing is tiered based on computational load: 8 credits per second for standard generation, rising to 12 credits per second when using video inputs, as detailed in the user guide.
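With only the two per-second rates stated in the user guide, per-clip cost is straightforward arithmetic; the helper below is a convenience sketch (the function name and flag are mine, only the 8 and 12 credits-per-second figures come from the guide).

```python
def clip_cost(seconds: float, uses_video_input: bool = False) -> float:
    """Estimate Video O1 Pro Mode credits from the published rates:
    8 credits/s for standard generation, 12 credits/s with video inputs."""
    rate = 12 if uses_video_input else 8
    return seconds * rate

print(clip_cost(10))                          # 80 credits: 10 s text-to-video
print(clip_cost(10, uses_video_input=True))   # 120 credits: 10 s video-input edit
```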
Aligning with the company’s broader strategy, this premium positioning aims to capture the professional media production market.
The commercial viability of this strategy is already being tested: the Kling AI business unit reported 300 million yuan in Q3 sales. By targeting high-end creators with a “unified” tool, Kling aims to justify its subscription costs in an increasingly commoditized market where open-source alternatives are rapidly gaining ground.

