LongCat-Flash
A fast and efficient open-source Mixture-of-Experts LLM by Meituan, optimized for agentic tasks.
About the Model
LongCat-Flash is a 560B-total-parameter language model built on a Mixture-of-Experts (MoE) architecture. It activates approximately 18.6B–31.3B parameters per token (~27B on average) depending on the token's context, enabling competitive quality with high throughput and low latency. The model supports long context lengths (up to 128k tokens) and demonstrates strong instruction following, reasoning, and coding capabilities, especially in agentic tool-use scenarios.
This page summarizes highlights from the official resources and presents a concise overview for new users.
Key Features
- Dynamic MoE routing: Zero-computation experts and a PID-controlled expert bias keep activation near ~27B params/token (see the sketch after this list).
- Shortcut-connected MoE (ScMoE): Overlaps computation and communication for speed at scale.
- High throughput inference: Optimizations enable 100+ tokens per second (hardware dependent).
- Robust scaling & stability: Hyperparameter transfer, model-growth initialization, and stability suite (router balancing, z-loss, tuned optimizers).
- Deterministic training: Exact reproducibility helps detect silent data corruption (SDC).
- Agentic capability pipeline: Multi-stage post-training focused on complex tasks with tools and iterative reasoning.
- Extended context: Up to 128k tokens for information-dense, multi-document tasks.
- Open-source: Released under the MIT License.
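To make the routing idea concrete, here is a minimal, self-contained sketch, not the released implementation: top-k routing over an expert pool that includes zero-computation (identity) experts, with a PI-style controller (the derivative term is omitted) adjusting a per-expert bias so the share of real-expert activations tracks a fixed compute budget. All names and constants below are illustrative assumptions.

```python
# Illustrative sketch of PID-style controlled MoE routing (not the released code).
# Zero-computation experts are identity passthroughs; biasing tokens toward them
# when the compute budget is exceeded reduces activated parameters per token.
import torch

NUM_REAL, NUM_ZERO, TOP_K = 8, 4, 2   # assumed pool sizes and top-k
TARGET_REAL_FRAC = 0.75               # assumed target share of real-expert slots
KP, KI = 0.05, 0.01                   # assumed controller gains (P and I terms)

bias = torch.zeros(NUM_REAL + NUM_ZERO)  # per-expert routing bias
integral = 0.0                           # accumulated error for the I term

def update_bias(real_frac: float) -> None:
    """PI update: favor zero-computation experts when too many real experts fire."""
    global integral
    error = real_frac - TARGET_REAL_FRAC
    integral += error
    delta = KP * error + KI * integral
    bias[NUM_REAL:] += delta   # make zero-computation experts more attractive
    bias[:NUM_REAL] -= delta   # and real experts less so (or vice versa)

def route(logits: torch.Tensor) -> torch.Tensor:
    """Pick top-k experts per token after adding the controller bias."""
    scores = logits + bias
    topk = scores.topk(TOP_K, dim=-1).indices          # (tokens, TOP_K)
    real_frac = (topk < NUM_REAL).float().mean().item()
    update_bias(real_frac)
    return topk

logits = torch.randn(16, NUM_REAL + NUM_ZERO)  # fake router logits for 16 tokens
print(route(logits))
```

Because zero-computation experts pass tokens through unchanged, shifting the bias toward them trims the activated parameter count per token without changing the router's weights, which is what keeps the average near the ~27B budget.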
Selected Benchmarks
Representative results reported by the authors (non-exhaustive):
| Category | Benchmark | Metric | LongCat-Flash |
|---|---|---|---|
| General Domains | MMLU | acc | 89.71 |
| Instruction Following | IFEval | acc | 89.65 |
| Math Reasoning | MATH500 | acc | 96.40 |
| General Reasoning | DROP | F1 | 79.06 |
| Coding | HumanEval+ | pass@1 | 88.41 |
| Agentic Tool Use | τ²-Bench (telecom) | avg@4 | 73.68 |
Values are summarized from public reports; please consult the official resources for full details and evaluation conditions.
Quick Start
LongCat-Flash uses a chat template defined in `tokenizer_config.json`. Examples:
First Turn
[Round 0] USER:{query} ASSISTANT:
With System Prompt
SYSTEM:{system_prompt} [Round 0] USER:{query} ASSISTANT:
Multi-Turn
SYSTEM:{system_prompt} [Round 0] USER:{q} ASSISTANT:{r} ... [Round N-1] USER:{q} ASSISTANT:{r} [Round N] USER:{q} ASSISTANT:
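Rather than assembling these strings by hand, the tokenizer can render them. A minimal sketch using the standard Hugging Face transformers API; the repo id below is an illustrative assumption, so substitute the actual model path:

```python
# Sketch: render the chat template shipped in tokenizer_config.json.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "meituan-longcat/LongCat-Flash-Chat",  # assumed repo id, for illustration
    trust_remote_code=True,
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize MoE routing in one sentence."},
]

# add_generation_prompt=True appends the trailing "ASSISTANT:" marker.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
# Expected shape: SYSTEM:{...} [Round 0] USER:{...} ASSISTANT:
```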
Tool Call Envelope
{tool_description}
## Messages
SYSTEM:{system_prompt} [Round 0] USER:{query} ASSISTANT:
<longcat_tool_call>{"name": <function-name>, "arguments": <args-dict>}</longcat_tool_call>
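On the client side, tool calls can be recovered from the envelope with a small parser. A sketch assuming each <longcat_tool_call> block wraps a single JSON object of the shape shown above:

```python
# Sketch: extract tool calls from a response containing
# <longcat_tool_call>...</longcat_tool_call> envelopes.
import json
import re

TOOL_CALL_RE = re.compile(
    r"<longcat_tool_call>\s*(\{.*?\})\s*</longcat_tool_call>", re.DOTALL
)

def parse_tool_calls(text: str) -> list[dict]:
    """Return [{'name': ..., 'arguments': {...}}, ...] parsed from the response."""
    return [json.loads(m) for m in TOOL_CALL_RE.findall(text)]

response = (
    'Let me check that. <longcat_tool_call>{"name": "get_weather", '
    '"arguments": {"city": "Beijing"}}</longcat_tool_call>'
)
print(parse_tool_calls(response))
# [{'name': 'get_weather', 'arguments': {'city': 'Beijing'}}]
```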
Deployment
The authors provide adaptations for SGLang and vLLM to deploy LongCat-Flash with high throughput. Refer to the Deployment Guide in the official repository for environment setup, tensor parallelism, and inference settings.
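Once a server is up, requests follow the usual OpenAI-compatible pattern. A minimal sketch assuming a vLLM or SGLang server exposing the OpenAI API at localhost:8000; the base URL, API key, and model id are illustrative assumptions:

```python
# Sketch: query an OpenAI-compatible endpoint served by vLLM or SGLang.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="meituan-longcat/LongCat-Flash-Chat",  # assumed model id
    messages=[{"role": "user", "content": "Hello, LongCat!"}],
    max_tokens=256,
)
print(resp.choices[0].message.content)
```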
License & Usage
LongCat-Flash-Chat is released under the MIT License. Evaluate and validate the model before use in sensitive or high-risk scenarios, and ensure compliance with applicable laws and regulations for your use case.