Tencent Releases Covo-Audio Open-Source 7B Speech AI Model


TL;DR

  • New Release: Tencent has open-sourced Covo-Audio, a 7B-parameter speech model that unifies recognition, reasoning, and synthesis in a single end-to-end architecture.
  • Benchmark Performance: Covo-Audio achieved the highest scores among 7B-scale models on MMAU and MMSU benchmarks, matching or exceeding some 32B-parameter systems.
  • Key Capabilities: The model supports full-duplex conversation, voice customization, and multi-turn dialogue under a permissive CC BY 4.0 license.
  • Competitive Context: Open-weights speech models are improving rapidly but still trail proprietary systems from providers like Google and xAI by a measurable margin.

Open-source speech models have long required stitching together separate systems for recognition, reasoning, and synthesis. Tencent upended that approach this week when it launched Covo-Audio, a 7B-parameter model that handles all three in a single architecture and posts the highest benchmark scores among models of its size.

According to a technical report authored by 26 Tencent AI Lab researchers, Covo-Audio eliminates the traditional cascaded pipeline of automatic speech recognition, language model processing, and text-to-speech synthesis. Instead of routing audio through separate stages that each introduce latency and potential errors, Covo-Audio directly ingests continuous audio and produces spoken output end-to-end.

Both model weights and inference code are available on GitHub and HuggingFace under a CC BY 4.0 license, with the underlying research paper originally submitted to arXiv in February and revised in March.

How Covo-Audio Unifies Speech Understanding and Generation

Covo-Audio chains several established components in a novel configuration. A Whisper-large-v3 audio encoder captures incoming speech at 50 Hz, and three downsampling modules built from linear and convolution layers compress that frame rate eightfold to 6.25 Hz. At this reduced rate, each second of speech yields only about six feature frames instead of 50, which keeps the computational cost manageable for real-time applications.
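
The frame-rate figures above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the 50 Hz and 6.25 Hz rates and the count of three modules come from the article, while the even split of a factor of 2 per module is an assumption (it is the simplest split that gives 50 / 8 = 6.25), and the function names are hypothetical rather than taken from the Covo-Audio code.

```python
# Frame-rate arithmetic for the encoder side of the model.
ENCODER_RATE_HZ = 50.0   # Whisper-large-v3 output frame rate (from the article)
NUM_DOWNSAMPLERS = 3     # number of downsampling modules (from the article)
ASSUMED_FACTOR = 2       # ASSUMPTION: each module halves the frame rate


def downsampled_rate(rate_hz, num_stages, factor):
    """Return the frame rate after num_stages stages that each divide by factor."""
    for _ in range(num_stages):
        rate_hz /= factor
    return rate_hz


def frames_for(seconds, rate_hz):
    """Number of whole feature frames produced for a clip of the given length."""
    return int(seconds * rate_hz)


llm_rate = downsampled_rate(ENCODER_RATE_HZ, NUM_DOWNSAMPLERS, ASSUMED_FACTOR)
print(llm_rate)                              # 6.25
print(frames_for(10, ENCODER_RATE_HZ))       # 500 frames before downsampling
print(frames_for(10, llm_rate))              # 62 frames after downsampling
```

Under that assumption, a 10-second utterance shrinks from 500 encoder frames to 62 frames entering the language model, which is where the savings in sequence length come from.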

Alibaba’s Qwen2.5-7B-Base language model serves as the backbone, adapted to handle interleaved sequences of continuous acoustic features and textual tokens. On the output side, a speech tokenizer based on WavLM-large produces discrete audio tokens at 25 Hz from a codebook of 16,384 entries. A Flow-Matching framework paired with a BigVGAN vocoder then reconstructs 24 kHz waveforms from that token sequence, producing natural-sounding speech from the model’s internal representations.
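
To put the output-side numbers in perspective, the short calculation below derives a few quantities from the figures quoted above. The token rate, codebook size, and sample rate come from the article; the derived bitrate is a simple upper bound that assumes every codebook entry is equally likely, which is an illustrative assumption and not a figure reported by Tencent.

```python
import math

TOKEN_RATE_HZ = 25        # discrete speech tokens per second (from the article)
CODEBOOK_SIZE = 16_384    # entries in the speech tokenizer codebook (from the article)
SAMPLE_RATE_HZ = 24_000   # output waveform sample rate (from the article)

# Bits needed to index one codebook entry: log2(16384) = 14.
bits_per_token = math.log2(CODEBOOK_SIZE)

# ASSUMPTION: uniform codebook usage, so this is an upper bound on token bitrate.
token_bitrate_bps = TOKEN_RATE_HZ * bits_per_token

# Waveform samples the flow-matching decoder and vocoder must produce per token.
samples_per_token = SAMPLE_RATE_HZ // TOKEN_RATE_HZ

print(bits_per_token)      # 14.0
print(token_bitrate_bps)   # 350.0
print(samples_per_token)   # 960
```

Each token therefore stands in for 960 waveform samples, which suggests how much acoustic detail the Flow-Matching and BigVGAN stages have to fill in during reconstruction.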