LongCat-Flash-Thinking

Enhanced reasoning model with open-source SOTA tool calling capabilities

Latest Version: LongCat-Flash-Thinking-2601

LongCat-Flash-Thinking-2601 is the latest upgraded version, achieving open-source SOTA performance on core evaluation benchmarks including Agentic Search, Agentic Tool Use, and TIR (Tool-Integrated Reasoning). The model demonstrates exceptional generalization in tool calling, outperforming Claude on randomly sampled complex tasks that rely on tool calling.

It is the first fully open-source model to offer a free online experience of the "Re-thinking" mode, which activates 8 parallel reasoning paths simultaneously to ensure thorough thinking and reliable decision-making. It is available to try for free at https://longcat.ai.

Revolutionary "Re-thinking" Mode

The newly upgraded "Re-thinking" mode teaches the model to "think carefully" before acting. When it encounters a high-difficulty problem, the model breaks its thinking process into two phases, with an additional reinforcement learning stage to refine them:

  • Parallel thinking phase: The model simultaneously and independently explores multiple reasoning paths, ensuring diversity of thought to avoid missing optimal solutions
  • Summary synthesis phase: Multiple paths are organized, optimized, and synthesized, with optimized results fed back to form closed-loop iterative reasoning, continuously deepening the thinking process
  • Reinforcement learning enhancement: Additional reinforcement learning components specifically designed to refine the model's summary synthesis capabilities
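The two phases above can be sketched as a simple loop. Everything below (function names, the scoring stub, the feedback via a `hint` argument) is a hypothetical illustration of the width-then-synthesis idea, not the model's actual mechanism:

```python
import random

def explore_path(problem, seed, hint=None):
    """Stand-in for one independent reasoning rollout (hypothetical)."""
    rng = random.Random(seed)
    # A real model would condition on `hint` (the previous round's synthesis).
    return {"answer": f"candidate-{seed}", "score": rng.random()}

def rethink(problem, num_paths=8, rounds=2):
    """Width (parallel paths) + depth (iterative synthesis) loop."""
    best = None
    for r in range(rounds):
        # Parallel thinking phase: independent explorations of the problem.
        paths = [explore_path(problem, seed=r * num_paths + i,
                              hint=None if best is None else best["answer"])
                 for i in range(num_paths)]
        # Summary synthesis phase: organize the paths and keep the strongest;
        # the result is fed back into the next round, closing the loop.
        top = max(paths, key=lambda p: p["score"])
        if best is None or top["score"] > best["score"]:
            best = top
    return best["answer"]
```

The key structural point is that width (`num_paths`) and depth (`rounds`) scale independently, which is what the mode's "dual expansion" refers to.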

Comprehensive Benchmark Performance

LongCat-Flash-Thinking-2601 demonstrates outstanding performance across multiple benchmark tests: achieving top-tier open-source levels on BrowseComp, τ²-Bench, and VitaBench, and even approaching closed-source top models in some tasks.

  • Programming capability: Achieves 82.8 on LCB benchmark and 47.7 on OIBench EN, ranking in the first tier of similar models, demonstrating solid code foundation capabilities
  • Mathematical reasoning: Outstanding performance with Re-thinking mode enabled, achieving 100.0 (perfect score) on AIME-25 and 86.8 (current SOTA) on IMO-AnswerBench
  • Agentic tool calling: Scores 88.2 on τ²-Bench and 29.3 on VitaBench, both achieving open-source SOTA, demonstrating excellent performance in multi-domain tool calling scenarios, meeting practical application needs
  • Agentic search: Achieves 73.1 on BrowseComp (best among all models) and 79.5 on RW Search, demonstrating strong information retrieval and scenario adaptation capabilities, reaching open-source leading levels

The model also demonstrates strong generalization capabilities, performing excellently in unseen random tool combinations and tasks, mastering "meta-abilities for problem-solving." On test sets injected with real noise, performance significantly surpasses other models, validating the effectiveness of active noise training.

Technical Innovation: Multi-Dimensional Approach to Strong Generalization

Current agent systems rely heavily on vertical scenario customization: engineers must carefully craft specific prompts, toolchains, and even environment interfaces. This brings high adaptation costs: a model performs excellently in one scenario, but once it switches domains or tools, or encounters a slightly noisy environment (such as tool-call timeouts or errors), its performance degrades sharply or it fails outright.

The fundamental issue is the lack of a base model that is battle-tested and generalizes stably across diverse, complex, noisy environments. Existing training typically occurs in highly idealized, rule-clear environments and lacks sufficient coverage of real-world complex interactions and uncertainties.

To address this, the Meituan LongCat team proposes a universal agent training paradigm centered on "Two Expansions + Noise Training":

  • Environment Expansion: Building large-scale training grounds covering 20+ domains
  • Reinforcement Learning Expansion: Achieving efficient and stable training in tens of thousands of heterogeneous environments
  • Noise Robustness Training: Systematically injecting real-world perturbations to enhance model resilience

Through this combination, the model gains high-level task execution and cross-domain generalization capabilities, achieving "model as agent": the burden of subsequent vertical-scenario adaptation drops significantly, and the model can handle new tasks and challenges in the complex real world.

Environment Expansion: Building Large-Scale Training Grounds

Environment expansion is the core foundation for models to acquire universal agent capabilities. To truly master practical task execution, models must break free from pure text training limitations and practice in interactive environments that simulate real scenarios.

Facing the pain points of high cost and low iteration efficiency when replicating real-world scenarios, the LongCat team built an end-to-end automated environment generation system, creating large-scale training environments covering 20+ domains and tens of thousands of scenarios. The system generates environments efficiently: from a concise "domain definition" as input, it completes full-chain environment construction, automatically synthesizing an executable environment graph containing 60+ tools with complex dependencies, along with the supporting database schemas, tool-calling interfaces, and validation logic.

Environment types cover diverse scenarios such as file management, data analysis, e-commerce retail, and telecommunications services, providing tool interaction experiences consistent with the real world, supporting full-process training of model tool calling, data processing, and feedback reception.

Solvable Path Priority Strategy

The more complex an automatically synthesized environment becomes, the more associated databases must be synthesized with it, and the harder it is to keep those databases consistent: a single environment can reference dozens of databases, with intricate parameter dependencies between tools. Logical conflicts easily arise that make a task appear solvable when it is actually unsolvable, transmitting erroneous training signals to the model.

To address this, the LongCat team innovated a "Solvable Path Priority" environment construction strategy:

  • Seed Sampling: Randomly sample a long tool-calling chain as an anchor, automatically constructing a complex task for which this chain is one valid solution, while lowering the sampling probability of already-sampled tools
  • Controlled Expansion: Using this "golden tool chain" as root, generate a maximum environment subgraph through BFS-style expansion (ensuring all predecessor dependency nodes are within existing tool sets for controllable expansion), strictly guaranteeing database logical consistency
  • Dynamic Environment Construction: The system dynamically decides whether to add new "golden tool chains" based on current environment complexity, difficulty of finding new valid paths in remaining tool graphs, and number of unused tools, enabling environment scale expansion while ensuring task solvability and training effectiveness
  • Minimum Scale Guarantee: If current environment tool count is too small (less than 20), the system directly randomly selects a medium-scale available tool chain from the global tool library, always maintaining database state consistency to avoid environment failure

This mechanism enables environment scale to grow while guaranteeing task solvability and effective training signals, freeing training from purely theoretical, non-interactive setups.
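A minimal sketch of the construction strategy, under assumed representations: the tool graph is a dict mapping each tool to its set of prerequisite tools, and names like `build_environment` are invented for illustration, not the team's implementation:

```python
from collections import deque

def build_environment(tool_graph, golden_chain, min_tools=20):
    """
    Sketch of "Solvable Path Priority" environment construction.
    tool_graph: {tool: set(prerequisite tools)} dependency graph.
    golden_chain: sampled tool chain guaranteed to solve the seed task.
    """
    selected = set(golden_chain)            # seed: the golden tool chain
    frontier = deque(golden_chain)
    # Controlled BFS expansion: admit a tool only when every prerequisite
    # is already inside the environment, so tasks stay solvable.
    while frontier:
        current = frontier.popleft()
        for tool, prereqs in tool_graph.items():
            if tool not in selected and current in prereqs and prereqs <= selected:
                selected.add(tool)
                frontier.append(tool)
    # Minimum scale guarantee: top up with tools whose dependencies are
    # already satisfied if the environment is too small.
    if len(selected) < min_tools:
        for tool, prereqs in tool_graph.items():
            if prereqs <= selected:
                selected.add(tool)
            if len(selected) >= min_tools:
                break
    return selected
```

Because expansion never admits a tool with an unmet dependency, the golden chain remains a valid solution path however large the environment grows.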

Enhanced DORA: Asynchronous Multi-Environment Training

With massive training environments in place, how can models learn from them efficiently? To support large-scale multi-environment training, the LongCat team upgraded the asynchronous training system DORA. Before RL training begins, the team also redefined the objective of pre-training and fine-tuning: rather than chasing high benchmark scores, these stages provide "cold-start" policies for subsequent RL:

  • Domains with real data (e.g., mathematics, coding): Strict quality control and executability verification to filter high-quality trajectories
  • Domains lacking real data (e.g., search, tool use): Dual-path synthesis, including text-driven synthesis and environment-anchored synthesis

This ensures data quality while providing diverse exploration foundations for subsequent reinforcement learning.

Fully Asynchronous Streaming Training Architecture

DORA's core breakthrough lies in a fully asynchronous streaming training architecture, revolutionizing traditional synchronous training:

  • Multi-version model parallel exploration: Training experiences generated by different version models are "produced and collected on demand," directly stored in sample queues. Trainers can start training without waiting for all tasks to complete, completely eliminating inter-task waiting time. When training devices are idle, the system can elastically scale up generation instances, further improving throughput
  • Distributed scheduling architecture: Replacing the centralized scheduler with a distributed design of "lightweight Rollout Manager + multiple Rollout Controllers", where the former manages global metadata and each of the latter manages the lifecycle of one virtual rollout group, processing environment interactions in a data-parallel manner to eliminate the single-machine scheduling bottleneck
  • Flexible environment deployment: Extending PyTorch RPC framework, supporting remote function calls and object instantiation based on CPU idle state, enabling flexible deployment of massive environments to any idle machine, achieving efficient resource utilization
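The streaming producer/consumer idea behind the first bullet can be illustrated with a toy thread-based sketch (the real system is distributed and GPU-backed; all names here are assumptions):

```python
import queue
import threading
import time

sample_queue = queue.Queue()

def rollout_worker(worker_id, model_version, num_episodes):
    """Generator instance: streams finished trajectories into the queue."""
    for ep in range(num_episodes):
        time.sleep(0.001 * worker_id)   # simulate uneven episode lengths
        sample_queue.put({"worker": worker_id,
                          "model_version": model_version,
                          "episode": ep})

def trainer(total_samples, batch_size=4):
    """Consumes samples as they arrive -- no barrier across workers."""
    consumed = 0
    while consumed < total_samples:
        batch = [sample_queue.get() for _ in range(batch_size)]
        consumed += len(batch)
        # A real train_step(batch) would run here; note that samples
        # from different model versions may be mixed in one batch.
    return consumed

workers = [threading.Thread(target=rollout_worker, args=(i, f"v{i % 2}", 4))
           for i in range(3)]
for w in workers:
    w.start()
trained = trainer(total_samples=12)
for w in workers:
    w.join()
```

The trainer never waits for a slow worker to finish its whole task list; it only waits for the next sample, which is the property that eliminates inter-task idle time.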

Key Optimizations for 560B Parameter MoE Models

To adapt to 560B parameter MoE model training needs, DORA introduces two key optimizations:

  • Prefill-Decode (PD) Decoupling: Deploying prefill and decode tasks on different device groups, avoiding long-context request prefill tasks interfering with decode processes, ensuring generation efficiency in multi-turn interactions
  • KV-cache Swap Mechanism: Reducing data transmission overhead through chunk-level KV-cache aggregation transmission, asynchronous transmission, and computation overlap, combined with CPU-resident KV-cache dynamic swap mechanism, completely solving repeated computation problems caused by insufficient device memory

Two-Layer Balanced Resource Allocation

  • Overall Balance: Allocate training task volume based on environment difficulty, increasing rollout quotas for complex, low-throughput domains to avoid over-training in simple environments
  • Intra-batch Balance: Ensure task domain diversity within single batches, preventing models from only adapting to few environments and overfitting
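A toy sketch of both balancing layers follows; the function names and the simple difficulty/throughput weighting are invented stand-ins for whatever heuristic the team actually uses:

```python
def allocate_rollout_quotas(domains, total_quota):
    """
    Overall balance: give hard, low-throughput domains a larger share
    of the rollout budget (hypothetical weighting).
    domains: {name: {"difficulty": float, "throughput": float}}
    """
    weights = {name: d["difficulty"] / d["throughput"]
               for name, d in domains.items()}
    total_weight = sum(weights.values())
    return {name: round(total_quota * w / total_weight)
            for name, w in weights.items()}

def balanced_batch(samples_by_domain, batch_size):
    """Intra-batch balance: round-robin across domains for diversity."""
    batch, idx = [], 0
    domains = list(samples_by_domain)
    while len(batch) < batch_size:
        domain = domains[idx % len(domains)]
        if samples_by_domain[domain]:
            batch.append(samples_by_domain[domain].pop(0))
        idx += 1
    return batch
```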

The system achieves 2-4x the efficiency of traditional synchronous training and supports stable training for 1000+ steps, enabling models to learn continuously and improve steadily across tens of thousands of heterogeneous environments.

Noise Robustness Training: Real-World Resilience

Real-world environments are inherently imperfect: tools may fail randomly due to network issues or return incomplete results; user instructions may be ambiguous or inconsistent; data transmission may introduce errors. Models trained only in idealized, perfect environments cannot adapt to this noise when deployed to real scenarios, and their performance declines significantly.

To address this, the LongCat team brought real-world "imperfections" into the core of training, designing systematic robustness training to give the model stable decision-making capabilities in uncertain environments.

Systematic Noise Decomposition and Modeling

The team first systematically decomposed and modeled real-world noise, identifying two core noise sources:

  • Tool Noise: Including tool execution failures (e.g., call timeouts, insufficient permissions), incomplete return results (e.g., missing data fields), inconsistent response formats (e.g., sometimes returning JSON, sometimes text), etc.
  • Instruction Noise: Covering user expression ambiguity (e.g., unclear task objectives), redundant instruction information (e.g., containing irrelevant interference content), dynamic requirement changes (e.g., adjusting task parameters mid-way), etc.

These noises are all based on real scenario observations, maximizing restoration of real-world uncertainty.
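Tool noise of this kind can be simulated by wrapping a tool function, as in the hypothetical sketch below (the failure/drop rates and the field-dropping logic are illustrative assumptions, not the team's injection scheme):

```python
import random

def with_tool_noise(tool_fn, rng, fail_rate=0.1, drop_rate=0.1):
    """
    Wrap a tool so its responses exhibit real-world-style noise:
    random execution failures and randomly dropped result fields.
    """
    def noisy_tool(*args, **kwargs):
        if rng.random() < fail_rate:
            # Tool noise type 1: execution failure (e.g. timeout).
            return {"error": "tool call timed out"}
        result = tool_fn(*args, **kwargs)
        if rng.random() < drop_rate and result:
            # Tool noise type 2: incomplete result (missing field).
            dropped = rng.choice(list(result))
            result = {k: v for k, v in result.items() if k != dropped}
        return result
    return noisy_tool
```

Instruction noise would be injected analogously on the input side, by perturbing or appending distractor content to task prompts.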

Curriculum Learning Injection Strategy

To enable models to gradually adapt to noise, the team adopted a "curriculum learning" injection strategy: training initially injects mild perturbations (e.g., tool return results partially missing, instructions having slight ambiguity). After models show sufficient stability at current noise levels, gradually increase noise complexity and interference intensity (e.g., frequent tool failures, severely ambiguous instructions), forming robust decision-making patterns.
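The staged schedule can be sketched as a small controller that raises noise intensity only once a success-rate threshold is met; the specific levels and threshold below are invented for illustration:

```python
class NoiseCurriculum:
    """
    Curriculum noise schedule (illustrative sketch): advance to stronger
    perturbations only after the model is stable at the current level.
    """
    def __init__(self, levels=(0.05, 0.15, 0.30, 0.50), threshold=0.8):
        self.levels = levels          # noise intensity, mild -> severe
        self.threshold = threshold    # success rate required to advance
        self.stage = 0

    def current_noise(self):
        return self.levels[self.stage]

    def report(self, success_rate):
        # Ratchet upward only; a bad round keeps the current level.
        if success_rate >= self.threshold and self.stage < len(self.levels) - 1:
            self.stage += 1
        return self.current_noise()
```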

At the training execution level, the team deeply integrated noise injection with multi-environment training: in tens of thousands of environments across 20+ domains, targeted addition of different types and intensities of noise, enabling models to learn domain task capabilities while simultaneously adapting to noisy environments. Through this progressive training, models ultimately maintain robust decision-making capabilities under various real-world perturbations.

Re-thinking Mode: Width + Depth Dual Expansion

On particularly complex tasks, models sometimes get stuck—following one line of thought to the end, even if that path might be wrong. This is similar to how humans need to consider different possibilities when encountering difficult problems.

The core of "Re-thinking" mode is "Width + Depth" dual expansion: first, let the model simultaneously generate multiple reasoning paths, exploring different solutions; then use a specialized summary model to analyze, filter, and extract optimal ideas from these paths. Moreover, through reinforcement learning, the model learns to integrate intermediate results, continuously improving the reasoning process.

In actual testing, whether in long-chain reasoning, tool-integrated reasoning, or complete agent tool use scenarios, "Re-thinking" mode is particularly effective. As test-time computation budget increases, its performance advantages become increasingly apparent, significantly outperforming strategies that only expand reasoning depth or width.

Zigzag Attention: Ultra-Long Context Support

The quadratic computational complexity of traditional full attention limits its support for million-token contexts, while existing sparse-attention solutions often require complete retraining at high cost.

The LongCat team's proposed Zigzag Attention mechanism innovatively combines two sparse attention patterns: MLA (Multi-head Latent Attention) and SSA (Streaming Sparse Attention). The mechanism adopts a hierarchical design, alternately using these two sparse attention variants in different layers, avoiding common computation imbalance issues in traditional sparse attention, achieving higher hardware utilization.

Core Design

For each query token, attention is limited to two parts:

  • Local Window: The most recent W tokens, capturing short-term dependencies
  • Global Anchors: The first B tokens of the sequence, preserving long-term memory

This design significantly reduces computation and memory complexity while maintaining model perception of short- and long-term contexts.
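The pattern can be made concrete with a minimal mask sketch (pure Python for clarity; `W` and `B` become the `window` and `anchors` parameters, and this illustrates the sparsity pattern only, not the team's kernel):

```python
def zigzag_mask(seq_len, window, anchors):
    """
    Boolean attention mask for the local-window + global-anchor pattern.
    mask[q][k] is True when query position q may attend to key position k.
    """
    mask = [[False] * seq_len for _ in range(seq_len)]
    for q in range(seq_len):
        for k in range(q + 1):              # causal: no future keys
            in_window = q - k < window      # most recent W tokens
            is_anchor = k < anchors         # first B tokens of the sequence
            if in_window or is_anchor:
                mask[q][k] = True
    return mask
```

Each query attends to at most `window + anchors` keys, so cost grows linearly in sequence length rather than quadratically.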

Implementation

Zigzag attention is introduced in mid-training stages, efficiently converting original full-attention models to sparse variants through structured sparsification processes with extremely low conversion overhead. The optimized model supports up to 1 million token context length, providing feasible solutions for ultra-long sequence processing.

The team simultaneously open-sources the model adapted to this mechanism: LongCat-Flash-Thinking-ZigZag (Hugging Face), fully inheriting LongCat-Flash-Thinking-2601's core capabilities while possessing ultra-long context processing advantages, providing developers with ready-to-use long-sequence solutions.

Original Flash-Thinking Features

The original LongCat-Flash-Thinking model (released September 22, 2025) focuses on "Agentic Reasoning" and "Formal Reasoning". It features a dual-path reasoning framework that combines agentic tool use with formal reasoning capabilities, and was built with DORA (Dynamic Orchestration for Asynchronous rollout) for asynchronous large-scale training.

  • 64.5% token savings: Dramatically improved tool-call efficiency (from 19,653 tokens down to 6,965 in AIME25 tests)
  • Dual-path reasoning: Combines agentic tool use with formal reasoning
  • DORA training: Asynchronous large-scale training system
  • Enhanced efficiency: Cost-effective agentic scenarios