StepFun Launches StepAudio 2.5 Realtime Live Voice AI Model


TL;DR

  • Launch: Chinese artificial intelligence company StepFun has launched StepAudio 2.5 Realtime, a live voice AI model for assistants, support bots, and similar interactive tools.
  • Voice Stack: StepFun ties the model to a single audio-in, audio-out design, benchmark claims from testing in April 2026, and WebSocket support for streaming use.
  • Competition: OpenAI, Google, Tencent, and other rivals already ship or preview comparable voice systems, so StepFun enters a crowded category.
  • Data Risk: Public details still do not explain the consent and copyright boundaries behind the voice data used to train the model.

Chinese developer StepFun has launched the StepAudio 2.5 Realtime live voice model for assistants, support bots, and other interactive tools. Developers already compare low-latency systems on turn-taking, responsiveness, and how naturally they hold a conversation, so the release lands in an already crowded field.

StepFun presents StepAudio as an end-to-end real-time speech large language model with persona controls. The launch also ties the system to a single design that takes audio in and returns audio out, instead of splitting speech recognition, reasoning, and synthesis into separate services. StepAudio enters the market with Chinese and English support.

StepFun uses “global scene-level tonal setting” to describe part of its tone-control design, and the official page promises “real warmth, real temper, and real personality”. The company has not disclosed what voice data the model was trained on, which leaves questions of consent and copyright accountability open from the start.

How StepAudio Frames Its Live Voice Stack

Roleplay-specific RLHF sits at the center of StepFun’s pitch for stronger persona stability. RLHF, or reinforcement learning from human feedback, tunes a model using human preference signals: people rate or rank candidate outputs, and training pushes the model toward the preferred ones. Here the method is meant to reduce out-of-character drift during live exchanges.
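
To make "human preference signals" concrete, here is a minimal sketch of the pairwise preference loss commonly used to train RLHF reward models (the Bradley-Terry form). It is not StepFun's disclosed training code. The function name and the example scores are assumptions for illustration only.

```python
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise loss -log(sigmoid(r_chosen - r_rejected)).

    The loss is small when the reply a human preferred (for example, one
    that stays in character) scores higher than the reply they rejected.
    """
    margin = reward_chosen - reward_rejected
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Hypothetical reward scores for an in-character and an out-of-character reply.
agrees_with_rater = preference_loss(2.0, 0.5)
disagrees_with_rater = preference_loss(0.5, 2.0)
```

When the reward model ranks the pair the same way the human rater did, the loss is lower than when it ranks them the other way round, and that gap is the signal training follows.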

StepFun says it expanded more than 10,000 authored personas into a million-scale persona feature matrix and paired that structure with millions of conversational samples. In that setup, a voice agent is meant to keep its role, pacing, and affect aligned across multiple turns instead of resetting after each reply.
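
StepFun has not published how the expansion works. As a rough illustration of the arithmetic only, the sketch below crosses a small set of base personas with a few feature axes to build a persona feature matrix. The axis names, values, and the `expand_personas` helper are all hypothetical.

```python
from itertools import product

def expand_personas(base_personas, feature_axes):
    """Yield one row per (base persona, feature combination)."""
    axis_names = list(feature_axes)
    for persona in base_personas:
        for combo in product(*(feature_axes[name] for name in axis_names)):
            row = {"persona": persona}
            row.update(zip(axis_names, combo))
            yield row

# Hypothetical inputs: 2 personas and 3 x 2 x 2 = 12 feature combinations.
base = ["ship captain", "night-shift nurse"]
axes = {
    "mood": ["calm", "irritable", "playful"],
    "pace": ["slow", "brisk"],
    "formality": ["casual", "formal"],
}
rows = list(expand_personas(base, axes))  # 2 * 12 = 24 rows
```

By the same multiplication, 10,000 base personas crossed with 100 feature combinations each would give 1,000,000 rows, which is one way a "million-scale" matrix could arise from that many authored personas.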