LongCat-Flash

A fast and efficient open-source Mixture-of-Experts LLM by Meituan, optimized for agentic tasks.

About the Model

LongCat-Flash is a Mixture-of-Experts (MoE) language model with 560B total parameters. Depending on context, it activates roughly 18.6B to 31.3B parameters per token (about 27B on average), which lets it pair competitive quality with high throughput and low latency. The model supports contexts of up to 128k tokens and shows strong instruction following, reasoning, and coding capabilities, especially in agentic tool-use scenarios.
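As a rough illustration of the sparsity these figures imply, the sketch below computes the share of the 560B total parameters that is active per token. It uses only the numbers quoted above; the function and variable names are illustrative and do not come from the official resources.

```python
# Parameter counts in billions, taken from the description above.
TOTAL_PARAMS_B = 560.0
ACTIVATED_B = {"min": 18.6, "avg": 27.0, "max": 31.3}


def activated_share(activated_b: float, total_b: float = TOTAL_PARAMS_B) -> float:
    """Return the percentage of total parameters that is active per token."""
    return 100.0 * activated_b / total_b


shares = {label: round(activated_share(value), 1) for label, value in ACTIVATED_B.items()}
print(shares)  # {'min': 3.3, 'avg': 4.8, 'max': 5.6}
```

In other words, on average only about 5% of the parameters take part in computing any one token.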

This page summarizes highlights from the official resources and presents a concise overview for new users.

Key Features

  • Dynamic MoE routing: Zero-computation experts and PID-controlled expert bias maintain ~27B activated params/token.
  • Shortcut-connected MoE (ScMoE): Overlaps computation and communication for speed at scale.
  • High throughput inference: Optimizations enable 100+ tokens per second (hardware dependent).
  • Robust scaling & stability: Hyperparameter transfer, model-growth initialization, and a stability suite (router balancing, z-loss, tuned optimizers).
  • Deterministic training: Exact reproducibility helps detect silent data corruption (SDC).
  • Agentic capability pipeline: Multi-stage post-training focused on complex tasks with tools and iterative reasoning.
  • Extended context: Up to 128k tokens for information-dense, multi-document tasks.
  • Open-source: Released under the MIT License.
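
To give a feel for the dynamic routing idea in the first bullet, here is a toy simulation and not the authors' implementation. It assumes a single shared scalar bias added to the routing scores of the zero-computation experts, random Gaussian scores in place of a learned router, and arbitrary PID gains. The expert counts, the target, and all names (`route_batch`, `run_controller`) are hypothetical. A PID loop nudges the bias until the average number of real (compute-bearing) experts selected per token settles near the target.

```python
import random


def route_batch(rng, n_tokens, n_real, n_zero, top_k, zero_bias):
    """Route a batch of tokens; return the mean number of real experts chosen."""
    total_real = 0
    for _ in range(n_tokens):
        # Random scores stand in for a learned router (assumption).
        scores = [(rng.gauss(0.0, 1.0), True) for _ in range(n_real)]
        scores += [(rng.gauss(0.0, 1.0) + zero_bias, False) for _ in range(n_zero)]
        scores.sort(key=lambda item: item[0], reverse=True)
        total_real += sum(1 for _, is_real in scores[:top_k] if is_real)
    return total_real / n_tokens


def run_controller(target=2.0, steps=60, seed=0):
    """Adjust the zero-expert bias with a PID loop toward `target` real experts."""
    rng = random.Random(seed)
    kp, ki, kd = 0.2, 0.3, 0.05  # illustrative gains, not tuned values
    bias, integral, prev_error = 0.0, 0.0, 0.0
    history = []
    for _ in range(steps):
        avg_real = route_batch(rng, 256, n_real=8, n_zero=4, top_k=4, zero_bias=bias)
        error = avg_real - target          # > 0 means too much real compute
        integral += error
        derivative = error - prev_error
        # A larger bias makes zero-computation experts win more routing slots.
        bias = kp * error + ki * integral + kd * derivative
        prev_error = error
        history.append(avg_real)
    return bias, history


bias, history = run_controller()
print(f"first step: {history[0]:.2f} real experts/token")
print(f"last 10 steps: {sum(history[-10:]) / 10:.2f} real experts/token, bias={bias:.2f}")
```

The integral term is what removes the steady-state offset: as long as the measured average differs from the target, the accumulated error keeps moving the bias.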

Selected Benchmarks

Representative results reported by the authors (non-exhaustive):

Category              | Benchmark           | Metric | LongCat-Flash
----------------------|---------------------|--------|--------------
General Domains       | MMLU                | acc    | 89.71
Instruction Following | IFEval              | acc    | 89.65
Math Reasoning        | MATH500             | acc    | 96.40
General Reasoning     | DROP                | F1     | 79.06
Coding                | HumanEval+          | pass@1 | 88.41
Agentic Tool Use      | τ²-Bench (telecom)  | avg@4  | 73.68

Values summarized from public reports; please consult the official resources for full details and conditions.

Quick Start

LongCat-Flash uses a chat template defined in tokenizer_config.json. Examples:

First Turn

[Round 0] USER:{query} ASSISTANT:

With System Prompt

SYSTEM:{system_prompt} [Round 0] USER:{query} ASSISTANT:

Multi-Turn

SYSTEM:{system_prompt} [Round 0] USER:{q} ASSISTANT:{r} ... [Round N-1] USER:{q} ASSISTANT:{r} [Round N] USER:{q} ASSISTANT:
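
The three layouts above can be produced by one small helper. The following is a minimal sketch that assumes the segments are joined by single spaces exactly as displayed here. The template in tokenizer_config.json remains authoritative and may include special tokens not shown on this page. The name `build_prompt` and the `(query, response)` tuple format are illustrative.

```python
from typing import List, Optional, Tuple


def build_prompt(
    turns: List[Tuple[str, Optional[str]]],
    system_prompt: Optional[str] = None,
) -> str:
    """Format (query, response) turns; the final turn's response must be None."""
    if not turns or turns[-1][1] is not None:
        raise ValueError("the last turn must be an unanswered user query")
    parts = []
    if system_prompt is not None:
        parts.append(f"SYSTEM:{system_prompt}")
    for round_index, (query, response) in enumerate(turns):
        segment = f"[Round {round_index}] USER:{query} ASSISTANT:"
        if response is not None:
            segment += response  # completed turns carry the assistant reply
        parts.append(segment)
    return " ".join(parts)


print(build_prompt([("Hi", "Hello!"), ("What is MoE?", None)], system_prompt="Be brief."))
# SYSTEM:Be brief. [Round 0] USER:Hi ASSISTANT:Hello! [Round 1] USER:What is MoE? ASSISTANT:
```

In practice, applying the tokenizer's own chat template is the safer route; a hand-written formatter like this is mainly useful for inspecting or debugging prompts.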

Tool Call Envelope

{tool_description}

## Messages
SYSTEM:{system_prompt} [Round 0] USER:{query} ASSISTANT:

Tool calls use the following format:

<longcat_tool_call>{"name": <function-name>, "arguments": <args-dict>}</longcat_tool_call>
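
On the client side, tool calls in this format can be pulled out of generated text with a regular expression and a JSON parser. This is a minimal sketch that assumes the payload between the tags is valid JSON with `name` and `arguments` keys, as in the format above. The helper name `extract_tool_calls` and the sample tool `get_weather` are hypothetical.

```python
import json
import re
from typing import Any, List, Tuple

# Non-greedy match so several tool calls in one output are captured separately.
TOOL_CALL_RE = re.compile(r"<longcat_tool_call>(.*?)</longcat_tool_call>", re.DOTALL)


def extract_tool_calls(text: str) -> List[Tuple[str, Any]]:
    """Return (name, arguments) pairs for every well-formed tool call in text."""
    calls = []
    for match in TOOL_CALL_RE.finditer(text):
        try:
            payload = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue  # skip malformed payloads rather than failing the whole parse
        if isinstance(payload, dict) and "name" in payload and "arguments" in payload:
            calls.append((payload["name"], payload["arguments"]))
    return calls


sample = (
    "Let me check. "
    '<longcat_tool_call>{"name": "get_weather", "arguments": {"city": "Beijing"}}'
    "</longcat_tool_call>"
)
print(extract_tool_calls(sample))  # [('get_weather', {'city': 'Beijing'})]
```

Skipping malformed payloads silently is a design choice for this sketch; a production agent loop would more likely log them or ask the model to retry.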

Deployment

The authors provide adaptations for SGLang and vLLM to deploy LongCat-Flash with high throughput. Refer to the Deployment Guide in the official repository for environment setup, tensor parallelism, and inference settings.