LongCat-Audio-Codec

Audio tokenizer and detokenizer for speech large language models

Overview

LongCat-Audio-Codec is an audio tokenizer and detokenizer for speech LLMs that supports low-bitrate, real-time streaming. It converts raw audio into parallel semantic and acoustic token sequences and reconstructs high-fidelity audio from them at extremely low bitrates (0.43–0.87 kbps) with low latency.
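
As a rough check on the bitrate range, the arithmetic can be sketched as frame rate times bits per frame. The values below are assumptions chosen because they reproduce the stated figures, not specifications taken from this document: one token frame every 60 ms, 8192-entry codebooks (13 bits each), and 2 to 4 codebooks in total. The function name bitrate_kbps is illustrative only.

```python
import math

def bitrate_kbps(frame_rate_hz, codebook_sizes):
    """Bitrate of a token stream: frames per second times bits per frame."""
    bits_per_frame = sum(math.log2(size) for size in codebook_sizes)
    return frame_rate_hz * bits_per_frame / 1000.0

# Assumption: one frame every 60 ms (about 16.67 frames per second).
FRAME_RATE_HZ = 1000 / 60
# Assumption: every codebook has 8192 entries, i.e. 13 bits per token.
CODEBOOK_SIZE = 8192

low = bitrate_kbps(FRAME_RATE_HZ, [CODEBOOK_SIZE] * 2)   # assumed 2 codebooks
high = bitrate_kbps(FRAME_RATE_HZ, [CODEBOOK_SIZE] * 4)  # 4 codebooks
print(f"{low:.2f} kbps to {high:.2f} kbps")
```

Under these assumptions the two endpoints round to 0.43 and 0.87 kbps, which matches the range quoted above.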

Key Features

  • Parallel token extraction: Generates semantic and acoustic tokens simultaneously via cascade training and parallel inference
  • Low-bitrate: 0.43–0.87 kbps with flexible acoustic codebook configurations
  • Low-latency streaming: Frame-level incremental processing with ~100 ms decoding latency, suitable for real-time applications
  • Super-resolution: Upsampling capability in the detokenizer for enhanced output quality (16k and 24k variants)
  • High fidelity: At 0.87 kbps (4 codebooks), WER 1.48, PESQ 2.30, STOI 0.921, speaker similarity 0.942
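
Frame-level incremental processing generally means buffering incoming audio and handing fixed-size frames to the tokenizer as soon as each one is complete. The sketch below shows only that buffering step in plain Python. It is not the LongCat-Audio-Codec API: iter_frames is a hypothetical helper, and the 960-sample frame size (60 ms at 16 kHz) is an assumption for illustration.

```python
from typing import Iterable, Iterator, List

def iter_frames(chunks: Iterable[List[float]], frame_size: int) -> Iterator[List[float]]:
    """Yield fixed-size frames from a stream of arbitrarily sized chunks."""
    buffer: List[float] = []
    for chunk in chunks:
        buffer.extend(chunk)
        # Emit every complete frame as soon as enough samples have arrived.
        while len(buffer) >= frame_size:
            yield buffer[:frame_size]
            buffer = buffer[frame_size:]
    # Samples left in the buffer at end of stream are dropped here;
    # a real system might zero-pad and flush them instead.

# Assumption: 960 samples per frame (60 ms of 16 kHz audio).
FRAME_SIZE = 960

# Simulate a microphone delivering five 400-sample chunks of silence.
incoming = [[0.0] * 400 for _ in range(5)]
for index, frame in enumerate(iter_frames(incoming, FRAME_SIZE)):
    # A streaming tokenizer would be called on each frame at this point.
    print(index, len(frame))
```

Because each frame is emitted as soon as it fills, downstream latency is bounded by the frame duration plus per-frame processing time rather than by the length of the whole utterance.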

Use Cases

LongCat-Audio-Codec is designed for speech large language models (Speech LLMs), where it provides efficient audio encoding and decoding. It integrates with the LongCat-Flash-Omni pipeline for real-time multimodal interaction and is well suited to voice assistants, streaming ASR, and real-time dialogue systems.