ByteDance Releases Vidi2 Open-Source AI Model For Video Understanding and Creation


ByteDance has released Vidi2, a 12-billion parameter multimodal model designed to democratize professional video editing.

Unlike standard generators, the system introduces “Spatio-Temporal Grounding” (STG), allowing users to pinpoint and edit specific objects across timeframes with pixel-level precision.

Reportedly outperforming proprietary giants like Google’s and OpenAI’s offerings on internal benchmarks, the model already underpins TikTok’s “Smart Split” feature. The deployment signals a strategic shift from pure generation to granular, agentic video control on consumer hardware.


Beyond Generation: The Spatio-Temporal Breakthrough

Most current AI video models operate somewhat blindly. They generate pixels based on text prompts but lack a deep, structural understanding of where specific objects are located within the frame or how they move over time. Vidi2 attempts to solve this by implementing Spatio-Temporal Grounding (STG).

By moving beyond simple timestamp identification, the model assigns pixel-level bounding boxes to objects, tracking them continuously as “tubes” through the video duration. This allows the system to maintain a persistent identity for a subject, even if they leave the frame and re-enter later.
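The paper does not publish a data format for these tubes, but the idea reduces to a simple structure: a persistent object identity mapped to per-frame bounding boxes, with gaps wherever the subject is off-screen. Here is a minimal sketch, with every name invented for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class Box:
    """Axis-aligned bounding box in pixel coordinates."""
    x1: float
    y1: float
    x2: float
    y2: float

@dataclass
class Tube:
    """A spatio-temporal 'tube': one object tracked across a video.

    Frames where the object is off-screen simply have no entry, so the
    same identity can resume when the subject re-enters the shot.
    """
    object_id: str                                        # persistent identity
    boxes: dict[int, Box] = field(default_factory=dict)   # frame index -> box

    def add(self, frame: int, box: Box) -> None:
        self.boxes[frame] = box

    def visible_at(self, frame: int) -> bool:
        return frame in self.boxes

# One tube can span disjoint segments: the subject leaves the frame and
# re-enters later under the same identity.
tube = Tube(object_id="speaker_1")
tube.add(0, Box(410, 80, 690, 520))
tube.add(300, Box(120, 95, 380, 540))
print(tube.visible_at(0), tube.visible_at(150))  # True False
```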

According to the technical report, this precision is central to the model’s design. ByteDance says that “Given a text query, Vidi2 can identify not only the corresponding timestamps but also the bounding boxes of target objects within the output time ranges.”

For creators, this capability shifts the workflow from random generation to targeted manipulation. Instead of regenerating an entire clip to fix a mistake, a user could theoretically target a specific character or object for removal or alteration without affecting the surrounding scene.
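ByteDance has not published a query interface, so the following is only a guess at the shape such a result could take, based on the behavior the report describes (time ranges plus per-frame boxes). Every field name here is hypothetical:

```python
# Hypothetical result shape for an STG query; the paper specifies the
# behavior (time ranges plus bounding boxes per range) but not an API.
result = {
    "query": "the dog wearing a red collar",
    "segments": [
        {
            "start_s": 12.4,            # time range where the query matches
            "end_s": 18.9,
            "boxes": {                  # frame index -> [x1, y1, x2, y2]
                310: [402, 118, 655, 540],
                311: [405, 120, 659, 544],
            },
        },
    ],
}

# An editor could then mask or replace only these pixels, leaving the
# rest of the scene untouched.
for seg in result["segments"]:
    print(f"edit {seg['start_s']}s-{seg['end_s']}s, {len(seg['boxes'])} boxes")
```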

Example of highlight extraction application

Practical applications for this technology are already emerging in vertical video formats.

By understanding the “plot” of a scene, such as distinguishing between a main speaker and a background character, the model can automatically crop horizontal footage into vertical 9:16 formats while keeping the relevant subject in focus.
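The geometry behind this kind of reframing is worth making concrete: given a tracked subject box, slide a 9:16 window to center the subject horizontally and clamp it to the frame. This is generic crop math, not ByteDance’s implementation:

```python
def vertical_crop(frame_w: int, frame_h: int,
                  box: tuple[float, float, float, float]):
    """Return a 9:16 crop window (x, y, w, h) centered on the subject box,
    clamped so it stays inside the original frame."""
    crop_h = frame_h                   # keep the full height
    crop_w = crop_h * 9 / 16           # 9:16 aspect ratio
    x1, y1, x2, y2 = box
    cx = (x1 + x2) / 2                 # subject center (horizontal)
    x = min(max(cx - crop_w / 2, 0), frame_w - crop_w)
    return x, 0, crop_w, crop_h

# 1920x1080 source with the speaker tracked on the right of the frame:
print(vertical_crop(1920, 1080, (1400, 200, 1700, 900)))
# -> (1246.25, 0, 607.5, 1080): the crop follows the speaker
```

With a tube supplying a box per frame, re-running this per frame yields the keyframe-free reframing the article describes.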

Such granular control is essential for complex post-production tasks. Highlighting the broader implications for the industry, ByteDance says that “this end-to-end spatio-temporal grounding capability enables potential applications in complex editing scenarios, such as plot or character understanding, automatic multi-view switching, and intelligent, composition-aware reframing and cropping.”

Example of storyline-based video creation application

Video Analysis & Understanding Features

The core of Vidi2’s architecture is Spatio-Temporal Grounding (STG), which allows the model to understand video in four dimensions (height, width, time, and semantic meaning).

  • Spatio-Temporal Grounding (STG): Unlike models that only identify when an event happens (timestamps), Vidi2 identifies where it happens. It generates “tubes”—pixel-level bounding boxes that track specific objects or characters continuously across frames, even if they leave and re-enter the shot.

  • Temporal Retrieval (TR): The model can scan long-form videos (ranging from 10 seconds to over an hour) to locate specific moments based on complex text queries. It reportedly outperforms Gemini 3 Pro at retrieving precise segments from “ultra-long” content (a possible output shape is sketched after this list).

  • Video Question Answering (Video QA): Vidi2 supports open-ended reasoning. It can answer questions regarding plot points, character motivations, or visual details (e.g., “What are the dentist’s financial practices?”) by synthesizing visual and audio cues.

  • Plot & Character Understanding: The model can distinguish between multiple characters in a scene, track their interactions, and understand narrative causality, which is essential for editing story-driven content.
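None of these interfaces are public. As a rough illustration of the temporal-retrieval item above, such a query conventionally reduces to ranked candidate time ranges; the structure below is hypothetical:

```python
# Hypothetical temporal-retrieval output: ranked time ranges for a text
# query. The paper describes the task (text query -> time ranges in
# videos from 10 s to over an hour) but not a concrete interface.
candidates = [
    {"start_s": 1834.0, "end_s": 1861.5, "score": 0.94},
    {"start_s": 412.0,  "end_s": 440.0,  "score": 0.71},
]
best = max(candidates, key=lambda c: c["score"])
print(f"best match: {best['start_s']}s-{best['end_s']}s")
```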

Video Creation & Automated Editing Features

Vidi2 also operates as an “agentic” editor, outputting instructions for editing engines rather than just raw video frames.

  • Smart Split (Highlight Extraction): Commercially deployed on TikTok, this feature analyzes long-form content (like streams or podcasts) to automatically identify viral moments. It extracts these highlights and cuts them into standalone clips.

  • Composition-Aware Reframing: Leveraging its STG capability, the model automatically crops horizontal (16:9) footage into vertical (9:16) formats. Because it tracks the subject via bounding boxes, it ensures the main speaker or action remains centered without manual keyframing.

  • Storyline-Based Video Creation: The model can take a collection of raw video assets and “direct” a finished product. It generates a narrative script, selects the appropriate clips to match the story, and outputs a complete editing timeline including cuts, transitions, and music placement (a hypothetical timeline is sketched after this list).

  • AI Outline: This tool generates structured content plans, including titles, hooks, and script outlines, based on simple text prompts or trending topics.

  • Multi-View Switching: For footage with multiple camera angles, the model can intelligently switch between views based on who is speaking or where the action is occurring.
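The report shows rendered results rather than a schema, so the following is only a plausible guess at what an “agentic” instruction set might look like: a declarative timeline that an editing engine executes. Every field name is invented:

```python
# Hypothetical editing timeline, as an agentic model might emit it for a
# rendering engine. The report describes such instruction sets but does
# not publish a schema; all field names below are invented.
timeline = {
    "output": {"aspect": "9:16", "duration_s": 42.0},
    "tracks": [
        {"type": "video", "clips": [
            {"src": "raw_03.mp4", "in_s": 12.4, "out_s": 18.9,
             "crop": "follow:speaker_1"},          # reframing via an STG tube
            {"src": "raw_01.mp4", "in_s": 95.0, "out_s": 110.0,
             "transition_in": "crossfade"},
        ]},
        {"type": "audio", "clips": [
            {"src": "music_upbeat.mp3", "in_s": 0.0, "gain_db": -14.0},
        ]},
    ],
}

total = sum(c["out_s"] - c["in_s"]
            for t in timeline["tracks"] if t["type"] == "video"
            for c in t["clips"])
print(f"video content: {total:.1f}s")  # 21.5s before transitions
```

The design point is that the model’s output is cheap, inspectable text rather than pixels, which is what makes the workflow auditable and editable by humans downstream.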

Efficiency vs. Scale: The 12B Parameter Advantage

Contrary to initial reports suggesting a massive 120-billion parameter architecture, the technical paper confirms that Vidi2 relies on a highly efficient 12-billion parameter configuration. This leaner architecture allows it to run on a wider range of hardware while still delivering state-of-the-art results.

According to ByteDance, “Vidi2 retains the multimodal architecture of Vidi [a precursor], designed to jointly process text, visual, and audio inputs, while introducing key enhancements in both the encoder and LLM backbone (e.g., Gemma-3).”

By optimizing the encoder and backbone, ByteDance claims to have achieved performance levels that rival or exceed much larger proprietary models. In its own testing, the company pitted Vidi2 against industry heavyweights on its newly proposed benchmarks.

On the VUE-STG benchmark, which measures spatio-temporal grounding accuracy, Vidi2 achieved a Temporal IoU (Intersection over Union) of 53.19%. By comparison, Google’s Gemini 3 Pro scored 27.50%, and OpenAI’s GPT-5 scored 16.40%.
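Temporal IoU itself is a standard measure: the overlap between predicted and ground-truth time ranges divided by their union. A minimal single-segment version for intuition (the benchmark uses a refined multi-segment scheme, quoted below):

```python
def temporal_iou(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    """Temporal IoU between two time ranges given as (start_s, end_s)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# Two 10-second ranges overlapping by 6 seconds span 14 seconds total:
print(temporal_iou((10.0, 20.0), (14.0, 24.0)))  # 0.428... (6/14)
```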

Regarding these comparative results, ByteDance claims that “the Vidi2 model substantially outperforms leading proprietary systems, such as Gemini 3 Pro (Preview) and GPT-5, on both VUE-TR-V2 and VUE-STG.”

Spatio-temporal grounding and temporal retrieval benchmarks, as reported by ByteDance.

It is important to note that these victories were secured on benchmarks created and released by ByteDance itself. While “home court” advantages are common in AI research, independent verification will be necessary to confirm if these margins hold up in neutral testing environments.

To support this evaluation, the company has released detailed methodologies for its testing protocols. The technical report details the new standard:

“To enable comprehensive evaluation of STG in practical settings, we introduce a new benchmark, VUE-STG, which offers four key improvements over existing STG datasets: 1) Video duration: spans from roughly 10s to 30 mins, enabling long-context reasoning… 3) Annotation quality: all ground-truth time ranges and bounding boxes are manually annotated with high accuracy; 4) Evaluation metric: a refined vIoU/tIoU/vIoU-Intersection scheme for multi-segment spatio-temporal evaluation.”
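In prior spatio-temporal grounding work, vIoU is conventionally the per-frame box IoU summed over the frames where prediction and ground truth overlap, normalized by the frames either one covers; VUE-STG’s refined variant may differ in detail. A sketch of the conventional definition:

```python
def box_iou(a, b):
    """Spatial IoU of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def viou(pred_boxes: dict, gt_boxes: dict) -> float:
    """Conventional vIoU: per-frame box IoU summed over frames where both
    prediction and ground truth exist, normalized by the union of frames
    either covers. VUE-STG's refined scheme may differ."""
    overlap = pred_boxes.keys() & gt_boxes.keys()
    all_frames = pred_boxes.keys() | gt_boxes.keys()
    if not all_frames:
        return 0.0
    return sum(box_iou(pred_boxes[f], gt_boxes[f]) for f in overlap) / len(all_frames)

pred = {10: (0, 0, 100, 100), 11: (0, 0, 100, 100)}
gt   = {10: (0, 0, 100, 100), 11: (50, 0, 150, 100), 12: (0, 0, 100, 100)}
print(round(viou(pred, gt), 3))  # 0.444: (1.0 + 1/3 + 0) / 3 frames
```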

Beyond grounding, the model also showed significant improvements in long-context reasoning. The updated VUE-TR-V2 benchmark tests the model’s ability to understand and retrieve information from videos ranging from 10 seconds up to 30 minutes in length, a critical requirement for editing long-form content.

From Research to TikTok: The ‘Smart Split’ Engine

While many academic papers remain theoretical, Vidi2 has apparently already been battle-tested in a massive production environment. The technology appears to power the “Smart Split” and “AI Outline” features included in TikTok’s announcement from late October 2025.

“Smart Split” automates the labor-intensive process of repurposing long-form content. Designed for streamers and podcasters, the tool analyzes uploaded videos to identify viral moments, edits them into short clips, and automatically reframes them for the TikTok feed.

This functionality demonstrates the model’s “agentic” capabilities. Rather than just generating video frames, Vidi2 acts as a director, outputting a set of instructions that tell an editing engine how to cut, crop, and arrange existing footage. As described in the paper:

“Vidi2 demonstrates the ability to generate end-to-end video creation instructions conditioned on multiple input videos. As shown in Figure 8, the model takes six videos as input and outputs a complete instruction set for producing a publishable, storyline-driven video.”

The report further elaborates on the output quality:

“The final rendered output includes narration, music, animations, and transitions, showcasing the model’s potential to automate the entire creative editing workflow.”

Commercial viability is further supported by the model’s efficiency. The existence of a 7B parameter variant, referenced in the official repository, suggests that ByteDance is optimizing these tools to run effectively on consumer-grade hardware or via efficient cloud inference.
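Some back-of-envelope arithmetic (ours, not the report’s) shows why parameter count matters for deployment: the weights alone set a memory floor that shrinks linearly with numeric precision:

```python
# Back-of-envelope weight memory at different precisions (weights only;
# activations and KV caches add more). Illustrative, not from the report.
def weight_gb(params_billions: float, bits: int) -> float:
    return params_billions * 1e9 * bits / 8 / 1e9

for params in (12, 7):
    for bits in (16, 8, 4):
        print(f"{params}B @ {bits}-bit: ~{weight_gb(params, bits):.1f} GB")
# 12B @ 16-bit: ~24.0 GB -> workstation-class GPU
#  7B @  4-bit: ~3.5 GB  -> fits common consumer cards
```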

This lowers the cost of automated video production significantly.

The Open Source Battlefield: ByteDance vs. Tencent

The release of Vidi2 coincides with another major development in the Chinese tech sector: Tencent recently released HunyuanVideo-1.5, a powerful open-source model focused on video generation.

While Tencent focused on the generative aspect, creating pixels from scratch, ByteDance is prioritizing understanding and manipulation.

Vidi2 is designed to deconstruct and reorganize existing video, a capability that aligns closely with its parent company’s core business.

ByteDance has adopted an aggressive open-source strategy to challenge Western incumbents. By releasing the code and benchmarks immediately, with model weights “coming soon,” the company is positioning itself as a foundational player in the open ecosystem.

This contrasts with the closed-garden approach of Google’s Veo 3.1 update and Runway’s Aleph model, which are accessible only via APIs or proprietary platforms.

By commoditizing the base layer of video understanding, ByteDance and its domestic rivals are attempting to undercut the business models of their US competitors.

Ultimately, Vidi2 represents a step toward “World Models”: AI systems that understand the physics, causality, and temporal dynamics of the physical world, rather than just the visual texture of a single frame.


