TL;DR
- Launch: Chinese artificial intelligence company StepFun has launched StepAudio 2.5 Realtime as an AI live voice model for assistants, support bots, and similar interactive tools.
- Voice Stack: StepFun ties the model to a single audio-in, audio-out design, benchmark claims from testing in April 2026, and WebSocket support for streaming use.
- Competition: OpenAI, Google, Tencent, and other rivals already ship or preview comparable voice systems, so StepFun enters a crowded category.
- Data Risk: Public details still do not explain the consent and copyright boundaries behind the voice data used to train the model.
Chinese developer StepFun has launched the StepAudio 2.5 Realtime live voice model for assistants, support bots, and other interactive tools. Developers already compare low-latency systems on turn-taking, responsiveness, and how naturally they hold a conversation, so the release lands in a crowded field.
StepAudio is presented as an end-to-end real-time speech large language model with persona controls. The launch ties the system to a design that takes audio in and returns audio out instead of splitting speech recognition, reasoning, and synthesis into separate services. The model enters the market with Chinese and English support.
StepFun uses “global scene-level tonal setting” to describe part of its tone-control design, and the official page promises “real warmth, real temper, and real personality”. Missing disclosure about training data keeps the copyright accountability question open from the start.
How StepAudio Frames Its Live Voice Stack
Roleplay-specific RLHF sits at the center of StepFun’s pitch for stronger persona stability. Reinforcement learning from human feedback is training tuned by human preference signals, and the method is meant to reduce out-of-character drift during live exchanges.
StepFun expanded more than 10,000 authored personas into a million-scale persona feature matrix and paired that structure with millions of conversational samples. In that setup, a voice agent is meant to keep its role, pacing, and affect aligned across multiple turns instead of resetting after each reply.
StepFun's benchmark claims state an 80.41 human-evaluation score, plus 86.36 for general dialogue, 84.80 for automotive scenarios, 79.80 for spoken question answering, and 82.18 for paralinguistic comprehension. Paralinguistic comprehension covers cues beyond literal words, such as laughter, hesitation, pace, and emotional tone.
The developer angle extends to infrastructure as well. A WebSocket channel gives developers a persistent path for two-way streaming audio, while real-time latency under 300ms remains a claim that still needs outside reproduction.
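One way an outside tester might check the sub-300ms claim is to log two timestamps per conversational turn and summarize the gaps. The sketch below is illustrative only: the `Turn` record, the `summarize` helper, and the choice to measure from the end of user audio to the first reply chunk are assumptions, not part of StepFun's published API or test method.

```python
import math
from dataclasses import dataclass


@dataclass
class Turn:
    """One conversational turn (hypothetical record for illustration).

    Both timestamps are seconds from the same monotonic clock,
    for example time.monotonic() on the client.
    """
    user_audio_end: float     # when the last user audio chunk was sent
    first_reply_audio: float  # when the first reply audio chunk arrived


def latency_ms(turn: Turn) -> float:
    """Client-observed response latency for one turn, in milliseconds."""
    return (turn.first_reply_audio - turn.user_audio_end) * 1000.0


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile for pct in (0, 100]."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct / 100.0 * len(ordered))
    return ordered[max(rank, 1) - 1]


def summarize(turns: list[Turn], budget_ms: float = 300.0) -> dict[str, float]:
    """Median, tail latency, and the share of turns under the budget."""
    samples = [latency_ms(t) for t in turns]
    within = sum(1 for s in samples if s < budget_ms)
    return {
        "p50_ms": percentile(samples, 50),
        "p95_ms": percentile(samples, 95),
        "share_within_budget": within / len(samples),
    }


if __name__ == "__main__":
    # Made-up timestamps, not measurements of any real system.
    demo = [Turn(0.0, 0.2), Turn(1.0, 1.25), Turn(2.0, 2.28), Turn(3.0, 3.41)]
    print(summarize(demo))
```

Reporting a tail percentile alongside the median matters here because a voice agent that is fast on average can still feel slow if a minority of turns stall. Network distance to the server also affects client-side numbers, so any reproduction would need to state where it was measured from.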
Where It Sits in the Voice AI Race
In comparison, OpenAI’s recently released gpt-realtime model processes and generates audio directly through a single model and API. The variety of voice stacks across vendors shows that rivals are still testing different tradeoffs between latency, reasoning depth, and tool use.
In that same race, Google positioned Gemini’s native-audio system in December 2025 around smoother conversations by retrieving context from previous turns. Google used the same update to pitch Gemini as a voice-agent platform for customer-service and related use cases.
In March 2026, Tencent introduced Covo-Audio as another single-architecture speech model, and a later full-duplex voice preview pushed overlap and interruption handling higher up the checklist for the category.
The Training-Data Questions That Remain Open
That product push leaves a second test unresolved. StepAudio may have used thousands of hours of licensed voice actor recordings, crowdsourced emotional speech clips, and proprietary micro-expression audio. But publicly available descriptions still do not define the consent boundaries, licensing scope, or disclosure standards around that mix.
Any model built to read or reproduce vocal affect would need training material that captures how real people laugh, hesitate, and react emotionally. Buyers can measure latency and output quality, but they still cannot see the provenance limits behind the voices that trained the system.
StepFun has put persona design details and benchmark results into public view, but outsiders still lack enough sourcing detail to judge copyright exposure or data-collection safeguards.
StepFun now has to prove that StepAudio can meet its performance targets in production and that customers and creators can understand the legal boundaries behind the voices that trained it.

