LongCat-Flash-Omni

First open-source real-time all-modality interaction model (November 2025)

Overview

Since September 1, Meituan has released the LongCat-Flash series, open-sourcing LongCat-Flash-Chat and LongCat-Flash-Thinking. Today the family is extended with LongCat-Flash-Omni, the first open-source, real-time, all-modality interaction model.

Built on the series’ Shortcut-Connected MoE (ScMoE) backbone with Zero-Computation Experts, Omni adds efficient multi-modal perception and a speech reconstruction module. Even at 560B total parameters with ~27B active, it delivers low-latency, real-time audio-video interaction, giving developers an efficient option for multi-modal applications.
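
For intuition, the sketch below shows one way a MoE layer with Zero-Computation Experts can be wired: the router may send a token to an identity "expert" that returns it unchanged, so per-token compute varies while total parameters stay large. This is a minimal illustration under stated assumptions, not the released implementation; all module names and sizes are placeholders.

```python
import torch
import torch.nn as nn

class ZeroComputationMoE(nn.Module):
    """Toy MoE layer: some experts are real FFNs, others are identity
    'zero-computation' experts that return tokens unchanged.
    Illustrative only; not the LongCat-Flash-Omni implementation."""

    def __init__(self, d_model: int = 512, n_ffn_experts: int = 4,
                 n_zero_experts: int = 2):
        super().__init__()
        self.n_ffn = n_ffn_experts
        self.n_total = n_ffn_experts + n_zero_experts
        self.router = nn.Linear(d_model, self.n_total)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_ffn_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, d_model)
        gate, choice = self.router(x).softmax(dim=-1).max(dim=-1)  # top-1 routing
        out = torch.zeros_like(x)
        for e in range(self.n_total):
            sel = choice == e
            if not sel.any():
                continue
            if e < self.n_ffn:      # real expert: run its FFN
                out[sel] = gate[sel].unsqueeze(-1) * self.experts[e](x[sel])
            else:                   # zero-computation expert: identity pass-through
                out[sel] = gate[sel].unsqueeze(-1) * x[sel]
        return out

# Quick check: route 8 random tokens through the toy layer.
layer = ZeroComputationMoE()
print(layer(torch.randn(8, 512)).shape)   # torch.Size([8, 512])
```

Tokens routed to an identity expert incur almost no extra FLOPs, which is the intuition behind a 560B-parameter model activating only ~27B parameters per token.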

Modalities Supported

  • Text: instruction following, reasoning, coding
  • Image: VQA, fine-grained recognition, OCR
  • Audio: speech understanding, streaming ASR
  • Video: temporal reasoning, event grounding

Architectural Highlights

  • End-to-end: visual and audio encoders serve as perceptual front-ends, and the LLM directly produces text/speech tokens (see the pipeline sketch after this list).
  • Speech reconstruction: a lightweight audio decoder reconstructs natural speech waveforms for real-time dialogue.
  • Unified ScMoE: single-trunk expert routing across modalities with Zero-Computation Experts.
  • Streaming-efficient: lightweight codec modules (~0.6B parameters each), with all components optimized for streaming inference.
  • Efficiency/performance balance: retains LongCat-Flash efficiency while achieving strong multi-modal quality.
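
To make the end-to-end flow concrete, here is a schematic sketch of the loop implied by the bullets above: per-chunk visual and audio features feed the LLM, which emits text and speech tokens, and a lightweight audio decoder turns the speech tokens back into a waveform. Every class, shape, and return value below is a placeholder stub under stated assumptions, not the model's actual API.

```python
import numpy as np

# Hypothetical component stubs; the real encoders, LLM, and audio decoder
# are the model's own modules, not these placeholders.
class VisualEncoder:
    def encode(self, frame: np.ndarray) -> np.ndarray:
        return frame.mean(axis=(0, 1))                  # stand-in visual features

class AudioEncoder:
    def encode(self, chunk: np.ndarray) -> np.ndarray:
        return np.abs(np.fft.rfft(chunk))[:32]          # stand-in audio features

class OmniLLM:
    def step(self, features: np.ndarray) -> dict:
        # A real model returns interleaved text tokens and speech tokens.
        return {"text_tokens": [0], "speech_tokens": [1, 2]}

class AudioDecoder:
    def synthesize(self, speech_tokens: list[int]) -> np.ndarray:
        return np.zeros(len(speech_tokens) * 320)       # stand-in waveform

def streaming_turn(frames, audio_chunks):
    """Sketch of the end-to-end loop: perceive -> generate -> reconstruct speech."""
    venc, aenc, llm, adec = VisualEncoder(), AudioEncoder(), OmniLLM(), AudioDecoder()
    waveform_out = []
    for frame, chunk in zip(frames, audio_chunks):
        feats = np.concatenate([venc.encode(frame), aenc.encode(chunk)])
        step = llm.step(feats)
        waveform_out.append(adec.synthesize(step["speech_tokens"]))
    return np.concatenate(waveform_out) if waveform_out else np.zeros(0)

# Example: three 1-second chunks of fake video frames and 16 kHz audio.
frames = [np.zeros((48, 64, 3)) for _ in range(3)]
chunks = [np.zeros(16_000) for _ in range(3)]
print(streaming_turn(frames, chunks).shape)   # (1920,) with the stub decoder
```

Running the whole loop per chunk, rather than waiting for a complete utterance, is what keeps the interaction low-latency.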

Scale & Real-Time IO

  • 560B total / ~27B active with ScMoE backbone.
  • 128K-token context window and audio-video sessions longer than 8 minutes for long-horizon dialogue.
  • Chunked AV feature interleaving for efficient temporal processing (see the interleaving sketch after this list).
  • Low-latency speech generation with high fidelity.
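
The chunked interleaving bullet can be pictured as merging two per-chunk feature streams by timestamp so the LLM consumes a single time-ordered sequence. The chunk durations, tie-breaking rule, and feature layout below are illustrative assumptions, not the model's actual scheme.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    t_start: float          # chunk start time in seconds
    modality: str           # "audio" or "video"
    features: list[float]   # placeholder for encoder output

def interleave_av(video_chunks: list[Chunk], audio_chunks: list[Chunk]) -> list[Chunk]:
    """Merge per-chunk audio and video features into one time-ordered stream.
    Chunk size, ordering ties, and feature layout here are illustrative only."""
    return sorted(video_chunks + audio_chunks, key=lambda c: (c.t_start, c.modality))

# Example: 1 s video chunks interleaved with 0.5 s audio chunks.
video = [Chunk(t, "video", [0.0]) for t in (0.0, 1.0, 2.0)]
audio = [Chunk(t, "audio", [0.0]) for t in (0.0, 0.5, 1.0, 1.5, 2.0, 2.5)]
stream = interleave_av(video, audio)
print([(c.t_start, c.modality) for c in stream])
```

Combined with the 128K context window, this kind of time-ordered stream is what allows a single session to span more than 8 minutes of audio-video input.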

Progressive Early Multi-Modal Fusion

To handle the heterogeneous data distributions across modalities, training follows a staged strategy (summarized in the sketch after this list):

  1. Stage 0: large-scale text pretraining to build a strong LLM base.
  2. Stage 1: introduce speech data and align acoustic-language spaces.
  3. Stage 2: add image-caption pairs and interleaved vision-language corpora for vision-language alignment.
  4. Stage 3: incorporate video for spatio-temporal reasoning; strengthen image datasets.
  5. Stage 4: extend context from 8K to 128K.
  6. Stage 5: audio encoder alignment to mitigate the information loss introduced by discrete audio tokens.
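
As a compact summary, the staged curriculum can be expressed as an ordered configuration like the one below. The stage order and the 8K-to-128K context extension follow the list above; the modality mixes, focus labels, and trainer interface are illustrative assumptions.

```python
# Sketch of the staged curriculum described above. Stage order and context
# lengths follow the list; everything else (mix names, trainer API) is assumed.
CURRICULUM = [
    {"stage": 0, "focus": "text pretraining",         "modalities": ["text"],                             "context": 8_192},
    {"stage": 1, "focus": "speech-text alignment",    "modalities": ["text", "speech"],                   "context": 8_192},
    {"stage": 2, "focus": "vision-language alignment","modalities": ["text", "speech", "image"],          "context": 8_192},
    {"stage": 3, "focus": "spatio-temporal reasoning","modalities": ["text", "speech", "image", "video"], "context": 8_192},
    {"stage": 4, "focus": "context extension",        "modalities": ["text", "speech", "image", "video"], "context": 131_072},
    {"stage": 5, "focus": "audio encoder alignment",  "modalities": ["text", "speech", "image", "video"], "context": 131_072},
]

def run_curriculum(train_one_stage):
    """Drive training through the stages in order; `train_one_stage`
    is a placeholder for whatever trainer is actually used."""
    for cfg in CURRICULUM:
        train_one_stage(**cfg)

# Stub trainer that just logs each stage.
run_curriculum(lambda **cfg: print(f"stage {cfg['stage']}: {cfg['focus']} @ {cfg['context']} ctx"))
```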

Benchmark Highlights

Open-source SOTA across modalities on comprehensive suites such as OmniBench and WorldSense, alongside strong single-modality performance:

  • Text: maintains and improves textual capabilities across domains.
  • Image: RealWorldQA 74.8, comparable to Gemini-2.5-Pro and above the open-source Qwen3-Omni; strong on multi-image tasks.
  • Audio: strong ASR on LibriSpeech/AISHELL-1 and S2TT on CoVoST2, top audio understanding on TUT2017/Nonspeech7k; close to closed-source models on real-time audio-video interaction.
  • Video: SOTA on video-to-text; clearly ahead on short-video understanding; on par with Gemini-2.5-Pro and Qwen3-VL on long-video understanding.
  • Cross-modal: outperforms Gemini-2.5-Flash (non-thinking) and is on par with Gemini-2.5-Pro (non-thinking), with a significant advantage on WorldSense.

Applications & Resources

  • Multi-modal assistants and voice agents
  • Visual Q&A and scene understanding
  • Real-time AI video customer support