LongCat-Flash-Thinking-2601 Released: Open-Source SOTA in Tool Calling Capabilities
Today, the Meituan LongCat team officially releases and open-sources LongCat-Flash-Thinking-2601. An upgraded version of the previously released LongCat-Flash-Thinking model, LongCat-Flash-Thinking-2601 achieves open-source SOTA performance on core evaluation benchmarks including Agentic Search, Agentic Tool Use, and TIR (Tool-Integrated Reasoning).
The model demonstrates exceptional generalization in tool calling, outperforming Claude on randomly generated complex tasks that rely on tool calling and significantly reducing the cost of adapting to new tools in real-world scenarios. It is also the first fully open-source model to offer a free online "Re-thinking" mode, which activates 8 parallel reasoning paths simultaneously to ensure thorough deliberation and reliable decision-making.
This feature is now available to try for free at https://longcat.ai (Re-thinking mode is triggered by selecting the deep-thinking option).
🧠 Revolutionary "Re-thinking" Mode
The newly upgraded "Re-thinking" mode teaches the model to "think carefully" before acting. When encountering high-difficulty problems, the model breaks down the thinking process into two steps: parallel thinking and summary synthesis.
- Parallel thinking phase: The model simultaneously and independently explores multiple reasoning paths, similar to how humans consider different solutions when facing difficult problems, ensuring diversity of thought to avoid missing optimal solutions
- Summary synthesis phase: Multiple paths are organized, optimized, and synthesized, with optimized results fed back to form closed-loop iterative reasoning, continuously deepening the thinking process
- Reinforcement learning enhancement: Additional reinforcement learning components specifically designed to refine the model's summary synthesis capabilities, enabling LongCat-Flash-Thinking-2601 to truly "think clearly before acting"
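The two phases above can be sketched as a simple sample-then-synthesize loop. Everything below is illustrative: `explore_path` and `synthesize` are toy stand-ins for the model's reasoning rollouts and its summary model, not LongCat's actual implementation.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def explore_path(problem: str, seed: int) -> dict:
    """Stand-in for one independent reasoning rollout (a real system
    would sample the model with a distinct seed/temperature)."""
    rng = random.Random(seed)
    answer = rng.choice(["42", "42", "17"])  # toy: paths may disagree
    return {"path_id": seed, "answer": answer, "confidence": rng.random()}

def synthesize(paths: list[dict]) -> str:
    """Toy summary step: confidence-weighted vote over path answers.
    The real summary model analyzes and merges full reasoning traces."""
    scores: dict[str, float] = {}
    for p in paths:
        scores[p["answer"]] = scores.get(p["answer"], 0.0) + p["confidence"]
    return max(scores, key=scores.get)

def rethink(problem: str, n_paths: int = 8) -> str:
    # Parallel thinking phase: n independent explorations
    with ThreadPoolExecutor(max_workers=n_paths) as pool:
        paths = list(pool.map(lambda s: explore_path(problem, s), range(n_paths)))
    # Summary synthesis phase: merge paths into one decision
    return synthesize(paths)

print(rethink("hard problem"))
```

A production system would additionally feed the synthesized summary back into the next reasoning round, forming the closed-loop iteration described above.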
📊 Comprehensive Benchmark Performance
Comprehensive and rigorous evaluation shows that LongCat-Flash-Thinking-2601 leads across programming, mathematical reasoning, agentic tool calling, and agentic search dimensions:
- Programming capability: Achieves 82.8 on LCB benchmark and 47.7 on OIBench EN, ranking in the first tier of similar models, demonstrating solid code foundation capabilities
- Mathematical reasoning: Outstanding performance with Re-thinking mode enabled, achieving 100.0 (perfect score) on AIME-25 and 86.8 (current SOTA) on IMO-AnswerBench
- Agentic tool calling: Scores 88.2 on τ²-Bench and 29.3 on VitaBench, both achieving open-source SOTA, demonstrating excellent performance in multi-domain tool calling scenarios, meeting practical application needs
- Agentic search: Achieves 73.1 on BrowseComp (best among all models) and 79.5 on RW Search, demonstrating strong information retrieval and scenario adaptation capabilities, reaching open-source leading levels
🔬 Generalization Testing
To better test the generalization capabilities of agentic models, we propose a novel evaluation method: an automated task-synthesis pipeline that lets users randomly generate complex tasks for arbitrary scenarios from given keywords. Each generated task ships with a corresponding tool set and an executable environment.
Since tool configurations in such environments are highly random, we measure generalization capabilities by evaluating model performance in these environments. Experimental results show that LongCat-Flash-Thinking-2601 maintains leading performance in the vast majority of tasks, confirming its powerful generalization capabilities in agentic scenarios.
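A toy version of such a pipeline might look like the following. `TOOL_LIBRARY` and `synthesize_task` are hypothetical names invented for illustration, and the real pipeline also generates the executable environment behind each task, which this sketch omits.

```python
import random

# Hypothetical tool registry keyed by domain keyword
TOOL_LIBRARY = {
    "travel": ["search_flights", "book_hotel", "get_weather"],
    "finance": ["get_quote", "transfer_funds", "list_transactions"],
}

def synthesize_task(keywords: list[str], n_tools: int = 4, seed: int = 0) -> dict:
    """Sample a random tool set for the given keywords and wrap it
    in a task spec; tool configurations vary with the seed, which is
    what makes the resulting evaluation probe generalization."""
    rng = random.Random(seed)
    pool = [t for k in keywords for t in TOOL_LIBRARY.get(k, [])]
    tools = rng.sample(pool, min(n_tools, len(pool)))
    return {
        "keywords": keywords,
        "tools": tools,
        "goal": f"Complete a task using only: {', '.join(tools)}",
    }

task = synthesize_task(["travel", "finance"])
```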
🏋️ Environment Expansion: Building Large-Scale Training Grounds
Environment expansion is the core foundation for models to acquire universal agent capabilities. To truly master practical task execution, models must break free from pure text training limitations and practice in interactive environments that simulate real scenarios.
To address the pain points of costly real-world scenario replication and slow iteration, the LongCat team built an end-to-end automated environment-generation system, creating large-scale training environments that span 20+ domains and tens of thousands of scenarios. The system generates environments efficiently: from a concise "domain definition" it completes the full chain of environment construction, automatically synthesizing executable environment graphs containing 60+ tools with complex dependencies, along with the supporting database schemas, tool-calling interfaces, and validation logic.
To ensure task solvability and effective training signals, the team innovated a "Solvable Path Priority" environment construction strategy:
- Seed Sampling: Randomly sample a long tool calling chain as an anchor, automatically constructing a complex task that adopts this tool calling chain as one solution
- Controlled Expansion: Using the "golden tool chain" as root, generate a maximum environment subgraph through BFS-style expansion, strictly guaranteeing database logical consistency
- Dynamic Environment Construction: System dynamically decides whether to add new "golden tool chains" based on complexity and available paths
- Minimum Scale Guarantee: Ensures sufficient tool diversity while maintaining database state consistency
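The seed-and-expand steps above can be sketched as a size-capped BFS over a tool dependency graph. The graph, tool names, and size cap below are invented for illustration, and the real system additionally maintains database-state consistency, which this sketch omits.

```python
from collections import deque

# Hypothetical tool dependency graph: tool -> tools it can feed into
DEPENDS = {
    "search_user": ["get_orders", "get_profile"],
    "get_orders": ["refund_order", "track_shipment"],
    "get_profile": ["update_address"],
    "refund_order": [], "track_shipment": [], "update_address": [],
}

def build_environment(golden_chain: list[str], max_tools: int = 6) -> set[str]:
    """Sketch of 'Solvable Path Priority': seed the environment with the
    sampled golden tool chain (so the task stays solvable by construction),
    then BFS-expand neighboring tools up to the size cap."""
    env = set(golden_chain)        # seed sampling: the anchor chain
    queue = deque(golden_chain)    # controlled BFS expansion from the root
    while queue and len(env) < max_tools:
        for nxt in DEPENDS.get(queue.popleft(), []):
            if nxt not in env and len(env) < max_tools:
                env.add(nxt)
                queue.append(nxt)
    return env

env = build_environment(["search_user", "get_orders", "refund_order"])
```

Because the golden chain is inserted before any expansion, every generated environment is guaranteed to contain at least one valid solution path, which is the "effective training signal" property the strategy targets.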
⚙️ Enhanced DORA: Asynchronous Multi-Environment Training
Traditional agents are mostly trained in only a handful of simple simulated environments, like soldiers who practice only on the shooting range and then falter on the real battlefield. To support large-scale multi-environment training, the LongCat team upgraded its asynchronous training system, DORA.
DORA's core breakthrough lies in a fully asynchronous streaming training architecture, revolutionizing traditional synchronous training:
- Multi-version model parallel exploration: Training experiences generated on demand, directly stored in sample queues. Trainers can start training without waiting for all tasks to complete, completely eliminating inter-task waiting time
- Distributed scheduling architecture: "Lightweight Rollout Manager + multiple Rollout Controllers" distributed mode, processing environment interactions through data parallelism, solving single-machine scheduling bottlenecks
- Flexible environment deployment: Extending PyTorch RPC framework, supporting remote function calls based on CPU idle state, enabling flexible deployment of massive environments to any idle machine
- Prefill-Decode (PD) Decoupling: Deploying prefill and decode tasks on different device groups, avoiding interference and ensuring generation efficiency
- KV-cache Swap Mechanism: Chunk-level aggregation transmission and CPU-resident dynamic swapping, completely solving repeated computation problems caused by insufficient device memory
The system achieves 2-4x the efficiency of traditional synchronous training and supports stable training for 1000+ steps, enabling models to learn continuously and improve steadily across tens of thousands of heterogeneous environments. By balancing task allocation across environments and distributing compute according to difficulty and training progress, it maximizes training efficiency and resource utilization.
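The central idea of the streaming architecture, trainers consuming samples the moment any rollout finishes rather than waiting for all of them, can be illustrated with a plain producer-consumer queue. This is a generic sketch, not DORA's actual distributed sample queue or RPC machinery.

```python
import queue
import threading
import time

sample_q: "queue.Queue[dict]" = queue.Queue()

def rollout_worker(env_id: int, n_episodes: int) -> None:
    """Simulated rollout controller: streams each finished trajectory
    into the shared queue immediately (no barrier across tasks)."""
    for ep in range(n_episodes):
        time.sleep(0.001 * (env_id + 1))   # heterogeneous env latency
        sample_q.put({"env": env_id, "episode": ep})

def trainer(total: int) -> int:
    """Trainer starts consuming as soon as any sample is ready,
    instead of waiting for the slowest environment."""
    steps = 0
    for _ in range(total):
        sample_q.get()   # blocks only until the *next* sample arrives
        steps += 1       # a real trainer would run a gradient step here
    return steps

workers = [threading.Thread(target=rollout_worker, args=(i, 3)) for i in range(4)]
for w in workers:
    w.start()
steps = trainer(total=12)
for w in workers:
    w.join()
```

In a synchronous design, the trainer would idle until all four workers finished; here the slow environments never block consumption of samples from the fast ones.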
🛡️ Noise Robustness Training: Real-World Resilience
Real-world environments are inherently imperfect: tools may fail randomly due to network issues or return incomplete results; user instructions may be ambiguous or inconsistent; data transmission may introduce errors. This noise causes models trained only in idealized environments to break down when deployed to real scenarios, with performance declining significantly.
The team systematically decomposed and modeled real-world noise, identifying two core noise sources:
- Tool Noise: Including tool execution failures (e.g., call timeouts, insufficient permissions), incomplete return results (e.g., missing data fields), inconsistent response formats (e.g., sometimes returning JSON, sometimes text)
- Instruction Noise: Covering user expression ambiguity (e.g., unclear task objectives), redundant instruction information (e.g., containing irrelevant interference content), dynamic requirement changes (e.g., adjusting task parameters mid-way)
To enable models to gradually adapt to noise, the team adopted a curriculum learning injection strategy: training initially injects mild perturbations. After models show sufficient stability at current noise levels, gradually increase noise complexity and interference intensity, forming robust decision-making patterns. At the training execution level, noise injection is deeply integrated with multi-environment training across 20+ domains and tens of thousands of environments.
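A minimal sketch of curriculum-scheduled tool-noise injection follows. The linear ramp and the specific failure probabilities are invented for illustration; the real training injects a much richer mix of tool and instruction noise across 20+ domains.

```python
import random

def inject_tool_noise(result: dict, level: float, rng: random.Random) -> dict:
    """Perturb one tool response: fail it outright, drop a field, or pass
    it through unchanged. `level` in [0, 1] is the interference intensity."""
    roll = rng.random()
    if roll < level * 0.3:
        return {"error": "call timeout"}           # tool execution failure
    if roll < level * 0.6 and result:
        noisy = dict(result)
        noisy.pop(rng.choice(list(noisy)), None)   # incomplete return result
        return noisy
    return result

def curriculum_level(step: int, warmup: int = 1000, max_level: float = 1.0) -> float:
    """Curriculum schedule: mild perturbations early in training,
    ramping up as the model stabilizes (linear ramp as an example)."""
    return min(max_level, step / warmup)

rng = random.Random(0)
response = {"status": "ok", "data": [1, 2]}
noisy = inject_tool_noise(response, curriculum_level(step=500), rng)
```

At `step=0` the level is 0 and responses pass through untouched; past the warmup horizon every call faces the full noise distribution.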
Models without robustness training in noisy environments show significant performance degradation, and even Claude fails to adapt to every noise type. Through this systematic anti-interference training, LongCat-Flash-Thinking-2601 gains strong environmental adaptability, completing tasks effectively and efficiently even in complex, non-ideal scenarios.
🧠 Re-thinking Mode: Width + Depth Dual Expansion
On particularly complex tasks, models sometimes get stuck, following one line of thought to the end even if that path may be wrong. Just as humans consider different possibilities when facing a difficult problem, the model needs a way to step back and explore alternatives.
The core of "Re-thinking" mode is "Width + Depth" dual expansion: first, let the model simultaneously generate multiple reasoning paths, exploring different solutions; then use a specialized summary model to analyze, filter, and extract optimal ideas from these paths. Moreover, through reinforcement learning, the model learns to integrate intermediate results, continuously improving the reasoning process.
In actual testing, whether in long-chain reasoning, tool-integrated reasoning, or complete agent tool use scenarios, "Re-thinking" mode is particularly effective. As test-time computation budget increases, its performance advantages become increasingly apparent, significantly outperforming strategies that only expand reasoning depth or width.
🔗 Zigzag Attention: Ultra-Long Context Support
The quadratic computational complexity of traditional full attention limits its support for million-token contexts, while existing sparse-attention solutions often require costly complete retraining.
The LongCat team's proposed Zigzag Attention mechanism combines two attention patterns: MLA (Multi-head Latent Attention) and SSA (Streaming Sparse Attention). It adopts a hierarchical design, alternating between the two variants across layers, which avoids the computation-imbalance issues common in traditional sparse attention and achieves higher hardware utilization.
For each query token, attention is limited to: Local Window (the most recent W tokens for short-term dependencies) and Global Anchors (the first B tokens of the sequence for long-term memory). This design significantly reduces computation and memory complexity while maintaining model perception of short- and long-term contexts.
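The per-query sparsity pattern just described, a local window plus global anchors, can be written as a boolean attention mask, with `window` and `anchors` standing in for W and B. This is a generic sliding-window-plus-anchor mask for illustration, not LongCat's actual attention kernel.

```python
import numpy as np

def sparse_mask(seq_len: int, window: int, anchors: int) -> np.ndarray:
    """Boolean causal mask: each query token attends only to the first
    `anchors` tokens (long-term memory) and the most recent `window`
    tokens (short-term dependencies)."""
    q = np.arange(seq_len)[:, None]   # query positions
    k = np.arange(seq_len)[None, :]   # key positions
    causal = k <= q
    local = (q - k) < window          # inside the sliding local window
    global_anchor = k < anchors       # sequence-prefix anchor tokens
    return causal & (local | global_anchor)

mask = sparse_mask(seq_len=10, window=3, anchors=2)
```

Each row of the mask has at most `window + anchors` true entries, so per-token attention cost is constant in sequence length rather than linear, which is what makes million-token contexts tractable.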
Zigzag attention is introduced in mid-training stages, efficiently converting original full-attention models to sparse variants through structured sparsification processes with extremely low conversion overhead. The optimized model supports up to 1 million token context length, providing feasible solutions for ultra-long sequence processing.
The team simultaneously open-sources the model adapted to this mechanism: LongCat-Flash-Thinking-ZigZag (Hugging Face), fully inheriting LongCat-Flash-Thinking-2601's core capabilities while possessing ultra-long context processing advantages, providing developers with ready-to-use long-sequence solutions.
📊 Benchmark Results Summary
LongCat-Flash-Thinking-2601 demonstrates outstanding performance across multiple benchmark tests: achieving top-tier open-source levels on BrowseComp, τ²-Bench, and VitaBench, and even approaching closed-source top models in some tasks. The model also demonstrates strong generalization capabilities, performing excellently in unseen random tool combinations and tasks, mastering "meta-abilities for problem-solving." On test sets injected with real noise, performance significantly surpasses other models, validating the effectiveness of active noise training.
Through the deep synergy of algorithms and engineering, automated environment construction reduces adaptation costs, the DORA system improves training efficiency by 2-4x, and the Re-thinking mode amplifies complex-task processing capabilities, together forming an efficient, scalable training system.
📦 Resources & Access
To lower the barrier for developers, the Meituan LongCat team simultaneously opens model weights, inference code, and online experience capabilities, supporting full-process needs from quick trials to deep development:
- Open-source platforms:
- Online experience & API:
- Official website: https://longcat.ai
- API platform: https://longcat.chat/platform/usage
LongCat-Flash-Thinking-2601, through environment expansion and noise training, significantly reduces agents' dependence on vertical scenarios, setting a new reference standard for open-source models' generalization capabilities in real-world tasks. We believe that truly universal agents should not be greenhouse bonsai, but trees that can take root in the real world's storms.
The release of LongCat-Flash-Thinking-2601 is a solid step toward this goal. Open source is a seed we plant, and we look forward to working with the entire community to sail toward a vast future in this starry sea called "agents."
We welcome developers to download, deploy, and experience LongCat-Flash-Thinking-2601, and also welcome you to apply for free API call quotas on the LongCat API platform. If you have collaboration ideas or feedback in areas such as agentic development and large model inference optimization, we look forward to communicating with you.