TL;DR
- New Release: Tencent has open-sourced Covo-Audio, a 7B-parameter speech model that unifies recognition, reasoning, and synthesis in a single end-to-end architecture.
- Benchmark Performance: Covo-Audio achieved the highest scores among 7B-scale models on the MMAU and MMSU benchmarks, matching or exceeding some 32B-parameter systems, according to Tencent's own evaluations.
- Key Capabilities: The model supports full-duplex conversation, voice customization, and multi-turn dialogue under a permissive CC BY 4.0 license.
- Competitive Context: Open-weights speech models are improving rapidly but still trail proprietary systems from providers like Google and xAI by a measurable margin.
Open-source speech models have long required stitching together separate systems for recognition, reasoning, and synthesis. Tencent upended that approach this week when it launched Covo-Audio, a 7B-parameter model that handles all three in a single architecture and posts the highest benchmark scores among models of its size.
According to a technical report authored by 26 Tencent AI Lab researchers, Covo-Audio eliminates the traditional cascaded pipeline of automatic speech recognition, language model processing, and text-to-speech synthesis. Instead of routing audio through separate stages that each introduce latency and potential errors, Covo-Audio directly ingests continuous audio and produces spoken output end-to-end.
Both model weights and inference code are available on GitHub and HuggingFace under a CC BY 4.0 license, with the underlying research paper originally submitted to arXiv in February and revised in March.
How Covo-Audio Unifies Speech Understanding and Generation
Covo-Audio chains several established components in a novel configuration. A Whisper-large-v3 audio encoder captures incoming speech at 50 Hz, while three downsampling modules using linear and convolution layers compress that frame rate to 6.25 Hz for efficient processing. At this reduced rate, each second of speech produces only about six feature frames, keeping the overall computational cost manageable for real-time applications.
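The frame-rate arithmetic above can be checked in a few lines of Python. This is a minimal sketch, not code from the Covo-Audio repository: the reported figures give only the 50 Hz input and the 6.25 Hz output, so the 2x reduction per module is an inference from 50 / 6.25 = 8 = 2^3, and every name below is illustrative.

```python
# Assumption: each of the three downsampling modules halves the frame rate.
# The article states only the endpoints (50 Hz in, 6.25 Hz out).
ENCODER_RATE_HZ = 50.0
NUM_DOWNSAMPLERS = 3
ASSUMED_FACTOR_PER_MODULE = 2


def downsampled_rate(rate_hz: float, stages: int, factor: int) -> float:
    """Frame rate after `stages` successive reductions by `factor`."""
    return rate_hz / (factor ** stages)


def frames_for(seconds: float, rate_hz: float) -> int:
    """Whole feature frames produced for `seconds` of audio."""
    return int(seconds * rate_hz)


llm_rate_hz = downsampled_rate(
    ENCODER_RATE_HZ, NUM_DOWNSAMPLERS, ASSUMED_FACTOR_PER_MODULE
)

print(llm_rate_hz)                      # 6.25
print(frames_for(10, ENCODER_RATE_HZ))  # 500 encoder frames in 10 s
print(frames_for(10, llm_rate_hz))      # 62 frames reach the language model
```

Under that assumption, a ten-second utterance shrinks from 500 encoder frames to 62 positions in the language model's input sequence, which is where the savings for real-time use come from.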
Alibaba’s Qwen2.5-7B-Base language model serves as the backbone, adapted to handle interleaved sequences of continuous acoustic features and textual tokens. On the output side, a speech tokenizer based on WavLM-large produces discrete audio tokens at 25 Hz from a codebook of 16,384 entries. A Flow-Matching framework paired with a BigVGAN vocoder then reconstructs 24 kHz waveforms from that token sequence, producing natural-sounding speech from the model’s internal representations.
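The output-side figures imply a very compact token stream. The sketch below works through that arithmetic using only the numbers quoted above (25 Hz, 16,384 codebook entries, 24 kHz); the variable names are illustrative, and the bits-per-token value is an upper bound that assumes every codebook entry is equally likely.

```python
import math

TOKEN_RATE_HZ = 25        # discrete speech tokens per second
CODEBOOK_SIZE = 16_384    # entries in the speech tokenizer codebook
SAMPLE_RATE_HZ = 24_000   # reconstructed waveform sample rate

# Upper bound on information per token: log2 of the codebook size.
bits_per_token = math.log2(CODEBOOK_SIZE)

# Token stream bitrate if every codebook entry were equally likely.
token_bitrate_bps = TOKEN_RATE_HZ * bits_per_token

# Waveform samples the decoder must reconstruct for each token.
samples_per_token = SAMPLE_RATE_HZ // TOKEN_RATE_HZ

print(bits_per_token)      # 14.0
print(token_bitrate_bps)   # 350.0
print(samples_per_token)   # 960
```

Each token therefore stands in for 960 waveform samples, which helps explain why a generative decoder stage (Flow-Matching plus a vocoder, per the report) is needed to fill in acoustic detail.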
One of Covo-Audio’s training innovations is Hierarchical Tri-modal Speech-Text Interleaving, which aligns continuous acoustic features, discrete speech tokens, and natural language text at both phrase and sentence levels. Previous speech-text interleaving methods operated solely at the word or character level, missing structural relationships that span larger units of meaning.
By capturing alignment at multiple granularities, Covo-Audio better preserves the relationship between how something sounds and what it means. Tencent’s researchers trained the system through a two-stage pre-training pipeline processing 2 trillion tokens in total, spanning both speech and text modalities across multiple languages.
Collapsing what was traditionally a three-stage pipeline into a single forward pass addresses a persistent bottleneck in voice AI development: cascaded systems accumulate errors at each handoff between recognition, understanding, and synthesis. An end-to-end approach allows the unified model to be fine-tuned as one piece, potentially accelerating iteration cycles for developers building voice-enabled applications.
For the open-source community, having a fully integrated system available under a permissive license lowers the barrier to experimenting with speech AI without assembling separate components from different providers.
Full-Duplex Conversation and Voice Customization
Beyond standard turn-based dialogue, Covo-Audio-Chat-FD supports full-duplex voice interaction where both user and model can speak simultaneously. The model reformats its audio encoder into chunk-streaming with a 1:4 user-model chunk ratio, each chunk representing 0.16 seconds of audio. Rather than waiting for a user to finish speaking before generating a response, the model continuously processes incoming audio while producing its own output.
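One way to read the 1:4 ratio is that a single 0.16-second user feature frame (the 6.25 Hz encoder output) is paired with four model speech tokens (25 Hz output), since both cover the same 0.16 seconds. The sketch below illustrates that interleaving under this assumption. It is not Tencent's implementation: the function, the constants' names, and the placeholder strings are hypothetical, and a real system would generate the model tokens autoregressively rather than read them from a list.

```python
import math
from typing import Iterator, List, Tuple

CHUNK_SECONDS = 0.16
USER_FRAME_RATE_HZ = 6.25          # downsampled encoder rate
MODEL_TOKEN_RATE_HZ = 25           # speech tokenizer rate
MODEL_TOKENS_PER_USER_FRAME = 4    # the 1:4 ratio (assumed interpretation)

# Sanity check: both sides of the ratio span the same 0.16 s of audio.
assert math.isclose(1 / USER_FRAME_RATE_HZ, CHUNK_SECONDS)
assert math.isclose(MODEL_TOKENS_PER_USER_FRAME / MODEL_TOKEN_RATE_HZ, CHUNK_SECONDS)


def interleave(
    user_frames: List[str], model_tokens: List[str]
) -> Iterator[Tuple[str, str]]:
    """Yield one user frame followed by four model tokens per chunk."""
    n_chunks = min(
        len(user_frames), len(model_tokens) // MODEL_TOKENS_PER_USER_FRAME
    )
    for i in range(n_chunks):
        yield ("user", user_frames[i])
        start = i * MODEL_TOKENS_PER_USER_FRAME
        for token in model_tokens[start:start + MODEL_TOKENS_PER_USER_FRAME]:
            yield ("model", token)


# Two chunks (0.32 s) of placeholder data.
stream = list(interleave(["u0", "u1"], [f"m{i}" for i in range(8)]))
for source, item in stream:
    print(source, item)
```

Because the model sees a fresh user frame every 0.16 seconds, the longest it can go without new input while speaking is a single chunk.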
Conversational flow relies on three specialized architectural tokens: THINK signals a listening state, SHIFT handles speaking turn transitions, and BREAK detects interruptions and barge-in events. When a user interrupts mid-sentence, BREAK triggers the model to halt its current output and begin processing new input, mimicking the natural rhythm of human conversation.
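The roles of the three tokens can be pictured as a small state machine. The following is a hypothetical sketch based only on the descriptions above: the two-state model, the transition rules, and the literal strings "THINK", "SHIFT", and "BREAK" are assumptions for illustration, not the model's actual vocabulary entries or control logic.

```python
from enum import Enum
from typing import List


class State(Enum):
    LISTENING = "listening"
    SPEAKING = "speaking"


def step(state: State, token: str) -> State:
    """Advance the (hypothetical) conversation state by one emitted token."""
    if state is State.LISTENING and token == "SHIFT":
        return State.SPEAKING      # model takes the turn
    if state is State.SPEAKING and token == "BREAK":
        return State.LISTENING     # user barged in: stop output and listen
    # THINK while listening, or ordinary speech tokens while speaking,
    # leave the state unchanged.
    return state


def run(tokens: List[str]) -> List[State]:
    """Return the state after each token, starting from LISTENING."""
    state = State.LISTENING
    trace = []
    for token in tokens:
        state = step(state, token)
        trace.append(state)
    return trace


trace = run(["THINK", "THINK", "SHIFT", "tok", "tok", "BREAK", "THINK"])
print([s.value for s in trace])
```

In this toy trace the model listens for two steps, takes the turn on SHIFT, emits two speech tokens, and drops back to listening as soon as BREAK appears.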
Covo-Audio also incorporates Chain-of-Thought reasoning and is trained with Group Relative Policy Optimization (GRPO), a reinforcement learning method, using a composite reward function that covers accuracy, format, consistency, and thinking quality.
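Group Relative Policy Optimization scores each sampled response against the other responses drawn for the same prompt, rather than against a learned value function. The sketch below shows that core step with a composite reward. The four component names come from the description above, but the weights, the per-component scores, and the choice of population standard deviation are assumptions made for illustration; the article does not say how Tencent combines or normalizes the components.

```python
from statistics import mean, pstdev
from typing import Dict, List

# Hypothetical weights: the four components are named in the report,
# but how they are weighted is not stated.
WEIGHTS = {"accuracy": 1.0, "format": 0.5, "consistency": 0.5, "thinking": 0.5}


def composite_reward(scores: Dict[str, float]) -> float:
    """Weighted sum of the per-component scores for one response."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward by the mean and std of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up scores for three responses sampled for the same prompt.
group = [
    {"accuracy": 1.0, "format": 1.0, "consistency": 1.0, "thinking": 1.0},
    {"accuracy": 1.0, "format": 0.0, "consistency": 1.0, "thinking": 0.0},
    {"accuracy": 0.0, "format": 1.0, "consistency": 0.0, "thinking": 0.0},
]
rewards = [composite_reward(scores) for scores in group]
advantages = group_relative_advantages(rewards)
print(rewards)
print(advantages)
```

Responses that beat their group's average get a positive advantage and are reinforced; those below it are pushed down, which removes the need for a separate critic model.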
Separately, an Intelligence-Speaker Decoupling strategy separates dialogue intelligence from voice rendering, enabling voice customization with minimal text-to-speech data. Developers can swap voice characteristics while preserving conversational capabilities, allowing a single deployment to serve multiple voice personas without separate fine-tuning runs.
For multi-turn conversations, Covo-Audio uses a recursive context-filling strategy in which continuous audio features and generated tokens from previous turns are prefixed to the current input as historical context. Because the history travels inside the input sequence itself, the model can maintain coherence across extended dialogues without external memory modules.
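The context-filling idea can be sketched as plain list bookkeeping. This is an illustrative reading of the description above, not the released inference code: the `Turn` and `Dialogue` classes, their method names, and the string placeholders standing in for audio features and tokens are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Turn:
    user_audio_features: List[str]   # placeholders for continuous features
    model_tokens: List[str]          # placeholders for generated tokens


@dataclass
class Dialogue:
    history: List[Turn] = field(default_factory=list)

    def build_input(self, new_user_features: List[str]) -> List[str]:
        """Prefix all earlier turns, in order, to the new user audio."""
        prefix: List[str] = []
        for turn in self.history:
            prefix.extend(turn.user_audio_features)
            prefix.extend(turn.model_tokens)
        return prefix + list(new_user_features)

    def commit(self, user_features: List[str], model_tokens: List[str]) -> None:
        """Store a finished turn so later inputs include it."""
        self.history.append(Turn(list(user_features), list(model_tokens)))


dialogue = Dialogue()
first_input = dialogue.build_input(["a0", "a1"])
dialogue.commit(["a0", "a1"], ["t0", "t1", "t2"])
second_input = dialogue.build_input(["b0"])
print(first_input)
print(second_input)
```

The trade-off is that the input grows with every turn, so in practice the usable dialogue length is bounded by the backbone's context window.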
Tencent’s team acknowledged a known limitation: long silent pauses between vocal fragments in full-duplex mode can trigger premature model responses, an issue identified on the GaokaoEval benchmark and flagged as a priority for future optimization. Combined with voice decoupling and multi-turn memory, full-duplex capability positions Covo-Audio closer to the conversational fluency users expect from commercial voice assistants, though the acknowledged early-response issue indicates the technology is not yet production-ready for all deployment scenarios.
Benchmark Results and Competitive Landscape
According to Tencent’s technical report, Covo-Audio achieved 75.30% on the MMAU audio understanding benchmark, placing it highest among evaluated 7B-scale models. Per the same report, it also posted a leading 66.64% average accuracy on the MMSU benchmark for speech understanding.
On the URO-Bench Chinese track, Covo-Audio-Chat outperformed Alibaba’s Qwen3-Omni in speech reasoning and spoken dialogue tasks, a notable result given that Alibaba positions Qwen3-Omni as the industry’s first native end-to-end omnimodal model.
According to the research team’s own evaluations, Covo-Audio matches or exceeds systems with 32B parameters on several audio and speech understanding tasks. Independent verification of those claims remains pending, but the self-reported numbers suggest that parameter efficiency in speech models may be advancing faster than model size alone would predict.
If confirmed, such results would challenge the assumption that competitive speech AI requires scaling to tens of billions of parameters, an assumption that has driven much of the compute investment in voice model development over the past two years.
The broader competitive context reveals a more nuanced picture. According to benchmarking platform Artificial Analysis, NVIDIA’s Nemotron 3 VoiceChat (12B) currently leads the open-weights frontier, and NVIDIA’s PersonaPlex, at roughly 7B parameters, scores 91.0% on the Full Duplex Bench for conversational dynamics.
Proprietary models maintain a wider lead still. According to Artificial Analysis, Step-Audio R1.1 scores 96% on Big Bench Audio, with Grok Voice Agent and Gemini 2.5 Flash each at 92%. Covo-Audio’s benchmarks use different evaluation suites, making direct comparisons difficult, but the pattern is consistent: open-weights models are improving rapidly while still trailing closed systems by a measurable margin.
“Open weights speech to speech models still significantly underperform leading proprietary offerings,” Artificial Analysis noted in a March 2026 report.
“Our results highlight the strong potential of 7B-scale models to integrate sophisticated audio intelligence with high-level semantic reasoning, and suggest a scalable path toward more capable and versatile LALMs,” the Covo-Audio research team at Tencent AI Lab wrote in the paper posted to arXiv.
Covo-Audio arrives amid a surge of open-source speech model releases in early 2026, each targeting a different segment of the voice AI stack. IBM released Granite 4.0 1B Speech in mid-March, targeting multilingual speech recognition and translation with a compact 1B-parameter footprint designed for resource-constrained environments. Alibaba’s Qwen team launched Qwen3-TTS in January with voice-cloning capabilities that can replicate a speaker’s voice from just three seconds of audio.
NVIDIA, Inworld, and FlashLabs all shipped voice AI models earlier this year, collectively expanding the range of open-weights options available to developers building conversational and voice-enabled applications.
Accelerating competition in the open-source speech AI space continues to narrow, though not yet close, the performance gap between open-weights and proprietary systems. For developers evaluating Covo-Audio, the permissive CC BY 4.0 license and full availability of both model weights and inference pipeline code on public repositories lower the barrier to integration, even as real-world performance outside controlled benchmarks remains to be tested at scale.

