LongCat-Flash-Omni

First open-source real-time all-modality interaction model (November 2025)

Overview

Since September 1, Meituan has released the LongCat-Flash series, open-sourcing LongCat-Flash-Chat and LongCat-Flash-Thinking. Today the family is extended with LongCat-Flash-Omni, the first open-source, real-time, all-modality interaction model.

Built on the series’ Shortcut-Connected MoE (ScMoE) backbone with Zero-Computation Experts, Omni adds efficient multi-modal perception and a speech reconstruction module. Even at 560B total parameters with ~27B active, it delivers low-latency, real-time audio-video interaction, giving developers an efficient option for multi-modal applications.
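
For intuition, the sketch below shows one way a MoE layer with Zero-Computation Experts can be wired: the router may send a token to an identity "expert" that returns it unchanged, so per-token compute varies while total parameters stay large. This is a minimal illustration under stated assumptions, not the released implementation; all module names and sizes are placeholders.

```python
import torch
import torch.nn as nn

class ZeroComputationMoE(nn.Module):
    """Toy MoE layer: some experts are real FFNs, others are identity
    'zero-computation' experts that return tokens unchanged.
    Illustrative only; not the LongCat-Flash-Omni implementation."""

    def __init__(self, d_model: int = 512, n_ffn_experts: int = 4,
                 n_zero_experts: int = 2):
        super().__init__()
        self.n_ffn = n_ffn_experts
        self.n_total = n_ffn_experts + n_zero_experts
        self.router = nn.Linear(d_model, self.n_total)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_ffn_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, d_model)
        gate, choice = self.router(x).softmax(dim=-1).max(dim=-1)  # top-1 routing
        out = torch.zeros_like(x)
        for e in range(self.n_total):
            sel = choice == e
            if not sel.any():
                continue
            if e < self.n_ffn:      # real expert: run its FFN
                out[sel] = gate[sel].unsqueeze(-1) * self.experts[e](x[sel])
            else:                   # zero-computation expert: identity pass-through
                out[sel] = gate[sel].unsqueeze(-1) * x[sel]
        return out

# Quick check: route 8 random tokens through the toy layer.
layer = ZeroComputationMoE()
print(layer(torch.randn(8, 512)).shape)   # torch.Size([8, 512])
```

Tokens routed to an identity expert incur almost no extra FLOPs, which is the intuition behind a 560B-parameter model activating only ~27B parameters per token.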

Modalities Supported

  • Text: instruction following, reasoning, coding
  • Image: VQA, fine-grained recognition, OCR
  • Audio: speech understanding, streaming ASR
  • Video: temporal reasoning, event grounding

Architectural Highlights

  • End-to-end: visual and audio encoders serve as perceptual front-ends, and the LLM directly produces text/speech tokens (see the pipeline sketch after this list).
  • Speech reconstruction: a lightweight audio decoder reconstructs natural speech waveforms for real-time dialogue.
  • Unified ScMoE: single-trunk expert routing across modalities with Zero-Computation Experts.
  • Streaming-efficient: lightweight codec modules (~0.6B parameters each), with all components optimized for streaming inference.
  • Efficiency/performance balance: retains LongCat-Flash efficiency while achieving strong multi-modal quality.
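
To make the end-to-end flow concrete, here is a schematic sketch of the loop implied by the bullets above: per-chunk visual and audio features feed the LLM, which emits text and speech tokens, and a lightweight audio decoder turns the speech tokens back into a waveform. Every class, shape, and return value below is a placeholder stub under stated assumptions, not the model's actual API.

```python
import numpy as np

# Hypothetical component stubs; the real encoders, LLM, and audio decoder
# are the model's own modules, not these placeholders.
class VisualEncoder:
    def encode(self, frame: np.ndarray) -> np.ndarray:
        return frame.mean(axis=(0, 1))                  # stand-in visual features

class AudioEncoder:
    def encode(self, chunk: np.ndarray) -> np.ndarray:
        return np.abs(np.fft.rfft(chunk))[:32]          # stand-in audio features

class OmniLLM:
    def step(self, features: np.ndarray) -> dict:
        # A real model returns interleaved text tokens and speech tokens.
        return {"text_tokens": [0], "speech_tokens": [1, 2]}

class AudioDecoder:
    def synthesize(self, speech_tokens: list[int]) -> np.ndarray:
        return np.zeros(len(speech_tokens) * 320)       # stand-in waveform

def streaming_turn(frames, audio_chunks):
    """Sketch of the end-to-end loop: perceive -> generate -> reconstruct speech."""
    venc, aenc, llm, adec = VisualEncoder(), AudioEncoder(), OmniLLM(), AudioDecoder()
    waveform_out = []
    for frame, chunk in zip(frames, audio_chunks):
        feats = np.concatenate([venc.encode(frame), aenc.encode(chunk)])
        step = llm.step(feats)
        waveform_out.append(adec.synthesize(step["speech_tokens"]))
    return np.concatenate(waveform_out) if waveform_out else np.zeros(0)

# Example: three 1-second chunks of fake video frames and 16 kHz audio.
frames = [np.zeros((48, 64, 3)) for _ in range(3)]
chunks = [np.zeros(16_000) for _ in range(3)]
print(streaming_turn(frames, chunks).shape)   # (1920,) with the stub decoder
```

Running the whole loop per chunk, rather than waiting for a complete utterance, is what keeps the interaction low-latency.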

Scale & Real-Time IO

  • 560B total / ~27B active with ScMoE backbone.
  • 128K-token context window and audio-video sessions longer than 8 minutes for long-horizon dialogue.
  • Chunked AV feature interleaving for efficient temporal processing (see the interleaving sketch after this list).
  • Low-latency speech generation with high fidelity.
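
The chunked interleaving bullet can be pictured as merging two per-chunk feature streams by timestamp so the LLM consumes a single time-ordered sequence. The chunk durations, tie-breaking rule, and feature layout below are illustrative assumptions, not the model's actual scheme.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    t_start: float          # chunk start time in seconds
    modality: str           # "audio" or "video"
    features: list[float]   # placeholder for encoder output

def interleave_av(video_chunks: list[Chunk], audio_chunks: list[Chunk]) -> list[Chunk]:
    """Merge per-chunk audio and video features into one time-ordered stream.
    Chunk size, ordering ties, and feature layout here are illustrative only."""
    return sorted(video_chunks + audio_chunks, key=lambda c: (c.t_start, c.modality))

# Example: 1 s video chunks interleaved with 0.5 s audio chunks.
video = [Chunk(t, "video", [0.0]) for t in (0.0, 1.0, 2.0)]
audio = [Chunk(t, "audio", [0.0]) for t in (0.0, 0.5, 1.0, 1.5, 2.0, 2.5)]
stream = interleave_av(video, audio)
print([(c.t_start, c.modality) for c in stream])
```

Combined with the 128K context window, this kind of time-ordered stream is what allows a single session to span more than 8 minutes of audio-video input.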

Progressive Early Multi-Modal Fusion

To handle the heterogeneous data distributions across modalities, training follows a staged strategy (summarized in the sketch after this list):

  1. Stage 0: large-scale text pretraining to build a strong LLM base.
  2. Stage 1: introduce speech data and align acoustic-language spaces.
  3. Stage 2: add image-caption pairs and interleaved vision-language corpora for vision-language alignment.
  4. Stage 3: incorporate video for spatio-temporal reasoning; strengthen image datasets.
  5. Stage 4: extend context from 8K to 128K.
  6. Stage 5: audio encoder alignment to mitigate the information loss introduced by discrete audio tokens.
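
As a compact summary, the staged curriculum can be expressed as an ordered configuration like the one below. The stage order and the 8K-to-128K context extension follow the list above; the modality mixes, focus labels, and trainer interface are illustrative assumptions.

```python
# Sketch of the staged curriculum described above. Stage order and context
# lengths follow the list; everything else (mix names, trainer API) is assumed.
CURRICULUM = [
    {"stage": 0, "focus": "text pretraining",         "modalities": ["text"],                             "context": 8_192},
    {"stage": 1, "focus": "speech-text alignment",    "modalities": ["text", "speech"],                   "context": 8_192},
    {"stage": 2, "focus": "vision-language alignment","modalities": ["text", "speech", "image"],          "context": 8_192},
    {"stage": 3, "focus": "spatio-temporal reasoning","modalities": ["text", "speech", "image", "video"], "context": 8_192},
    {"stage": 4, "focus": "context extension",        "modalities": ["text", "speech", "image", "video"], "context": 131_072},
    {"stage": 5, "focus": "audio encoder alignment",  "modalities": ["text", "speech", "image", "video"], "context": 131_072},
]

def run_curriculum(train_one_stage):
    """Drive training through the stages in order; `train_one_stage`
    is a placeholder for whatever trainer is actually used."""
    for cfg in CURRICULUM:
        train_one_stage(**cfg)

# Stub trainer that just logs each stage.
run_curriculum(lambda **cfg: print(f"stage {cfg['stage']}: {cfg['focus']} @ {cfg['context']} ctx"))
```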

Benchmark Highlights

Open-source SOTA across modalities on comprehensive suites such as OmniBench and WorldSense, alongside strong single-modality performance:

  • Text: maintains and improves textual capabilities across domains.
  • Image: RealWorldQA 74.8, comparable to Gemini-2.5-Pro and above the open-source Qwen3-Omni; strong on multi-image tasks.
  • Audio: strong ASR on LibriSpeech/AISHELL-1 and S2TT on CoVoST2, top audio understanding on TUT2017/Nonspeech7k; close to closed-source models on real-time audio-video interaction.
  • Video: SOTA on video-to-text; clearly ahead on short-video understanding; on par with Gemini-2.5-Pro and Qwen3-VL on long-video understanding.
  • Cross-modal: outperforms Gemini-2.5-Flash (non-thinking) and is on par with Gemini-2.5-Pro (non-thinking), with a significant advantage on WorldSense.

Applications & Resources

  • Multi-modal assistants and voice agents
  • Visual Q&A and scene understanding
  • Real-time AI video customer support