Technology

Core innovations powering LongCat AI models

Key Technologies

Zero-Computation Experts

MoE routing built around zero-computation (identity) experts, which lets the model allocate compute per token: only 18.6B–31.3B of the 560B total parameters are activated per token (averaging ~27B), achieving cost efficiency while maintaining competitive quality.
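
A minimal sketch of the idea, assuming a simplified top-k router and toy dimensions (names such as ZeroComputationMoE are hypothetical, not LongCat's implementation): the expert pool contains both ordinary FFN experts and identity "zero-computation" experts, so tokens routed to the latter skip FFN compute and the number of activated parameters varies per token.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ZeroComputationMoE(nn.Module):
    """Toy MoE layer in which some "experts" are identities (zero computation)."""

    def __init__(self, d_model: int, n_ffn_experts: int = 4, n_zero_experts: int = 2, top_k: int = 2):
        super().__init__()
        self.n_ffn = n_ffn_experts
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_ffn_experts + n_zero_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
            for _ in range(n_ffn_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        weights, idx = torch.topk(F.softmax(self.router(x), dim=-1), self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e in range(self.router.out_features):
                mask = idx[:, slot] == e
                if not mask.any():
                    continue
                # Tokens routed to a zero-computation expert pass through unchanged,
                # spending no FFN parameters in this layer.
                y = self.experts[e](x[mask]) if e < self.n_ffn else x[mask]
                out[mask] += weights[mask, slot].unsqueeze(-1) * y
        return out

# Example: route 10 tokens; some land on identity experts and skip FFN compute.
layer = ZeroComputationMoE(d_model=16)
print(layer(torch.randn(10, 16)).shape)  # torch.Size([10, 16])
```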

Shortcut-connected MoE (ScMoE)

A shortcut connection lets dense computation overlap with expert communication, reducing latency at scale. It also enables unified expert routing across modalities in the Omni model.
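
A structural sketch of the shortcut idea (hypothetical module names; the published block layout and the actual stream/communication scheduling are not reproduced here): the dense FFN reads a hidden state that does not depend on the MoE output, so the two branches have no data dependency and, in a distributed run, dense compute can proceed while the MoE all-to-all communication is in flight. The `moe` argument can be any module mapping (batch, seq, d_model) to the same shape.

```python
import torch
import torch.nn as nn

class ScMoEBlock(nn.Module):
    """Structural sketch of a shortcut-connected MoE block (not the published layout)."""

    def __init__(self, d_model: int, moe: nn.Module, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.dense_ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.moe = moe
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        h = x + self.attn(h, h, h, need_weights=False)[0]
        g = self.norm2(h)
        # The shortcut removes the data dependency between the two branches below:
        # the dense FFN does not wait for the MoE output, so in a distributed run its
        # computation can overlap with the MoE dispatch/combine (all-to-all) communication.
        dense_out = self.dense_ffn(g)
        moe_out = self.moe(g)
        return h + dense_out + moe_out
```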

DORA Training System

Dynamic Orchestration for Asynchronous rollout enables efficient large-scale training across domains; the system supported training on >20T tokens in ~30 days across large GPU clusters.

Enhanced DORA for Multi-Environment Agents: The upgraded DORA system features a fully asynchronous streaming training architecture, supporting large-scale multi-environment agent training. Key innovations include:

  • Multi-version model parallel exploration: Training experiences generated on demand, eliminating inter-task waiting time
  • Distributed scheduling: Lightweight Rollout Manager + multiple Rollout Controllers architecture for efficient resource utilization
  • Prefill-Decode decoupling: Separate device groups for prefill and decode tasks, ensuring generation efficiency
  • KV-cache swap mechanism: Chunk-level aggregation and CPU-resident dynamic swapping for memory efficiency
  • Two-layer balanced resource allocation: Balances overall load by environment difficulty while preserving intra-batch diversity

Compared to traditional synchronous training, this yields a 2–4x efficiency gain and supports stable training for 1000+ steps across tens of thousands of heterogeneous environments.
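
A minimal asyncio sketch of the fully asynchronous streaming pattern described above, with hypothetical names (rollout_worker, trainer) and a random sleep standing in for real environments: rollouts from many environments stream into a queue as they finish, the trainer consumes them without waiting for the slowest environment, and the policy version advances independently, so different trajectories may come from different model versions.

```python
import asyncio
import random

_policy_version = 0  # stand-in for the trainer's latest model version

async def rollout(env_id: int, policy_version: int) -> dict:
    """Hypothetical stand-in for one agent episode in one environment."""
    await asyncio.sleep(random.uniform(0.01, 0.1))  # environments finish at different speeds
    return {"env_id": env_id, "policy_version": policy_version, "steps": random.randint(5, 50)}

async def rollout_worker(env_id: int, episodes: int, queue: asyncio.Queue) -> None:
    """Each environment streams finished trajectories into the queue on demand."""
    for _ in range(episodes):
        await queue.put(await rollout(env_id, policy_version=_policy_version))

async def trainer(queue: asyncio.Queue, total: int, batch_size: int) -> None:
    """Consumes trajectories as they arrive; no barrier on the slowest environment."""
    global _policy_version
    batch = []
    for _ in range(total):
        batch.append(await queue.get())
        if len(batch) >= batch_size:
            _policy_version += 1  # stand-in for one policy-update step
            batch.clear()

async def main(n_envs: int = 32, episodes_per_env: int = 4, batch_size: int = 16) -> None:
    queue: asyncio.Queue = asyncio.Queue(maxsize=256)
    workers = [rollout_worker(i, episodes_per_env, queue) for i in range(n_envs)]
    await asyncio.gather(trainer(queue, n_envs * episodes_per_env, batch_size), *workers)

if __name__ == "__main__":
    asyncio.run(main())
```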

Environment Expansion & Multi-Environment RL

Automated Environment Generation: End-to-end automated system for building large-scale training environments covering 20+ domains with tens of thousands of scenarios. Each environment integrates 60+ tools with complex dependencies.

  • Solvable Path Priority Strategy: Ensures task solvability and effective training signals through seed sampling, controlled expansion, dynamic construction, and minimum-scale guarantees (see the sketch after this list)
  • Database consistency: Strict logical consistency maintenance across complex tool dependency graphs
  • Domain coverage: File management, data analysis, e-commerce retail, telecommunications services, and more
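
A toy sketch of the solvable-path-priority idea, under strong assumptions (a hand-written TOOL_DEPS graph and a prerequisite check standing in for real environment execution; this is not the actual generation pipeline): start from solvable seed tasks, expand them in controlled rounds, keep only candidates whose tool path remains solvable, and pad up to a minimum task count.

```python
import random

# Hypothetical tool-dependency graph: a tool is usable only after its prerequisites.
TOOL_DEPS = {
    "search_product": [],
    "add_to_cart": ["search_product"],
    "apply_coupon": ["add_to_cart"],
    "checkout": ["add_to_cart"],
}

def is_solvable(path: list[str]) -> bool:
    """Stand-in solvability check: every tool's prerequisites appear earlier in the path."""
    seen: set[str] = set()
    for tool in path:
        if any(dep not in seen for dep in TOOL_DEPS[tool]):
            return False
        seen.add(tool)
    return True

def expand_tasks(seeds: list[list[str]], rounds: int, min_tasks: int) -> list[list[str]]:
    tasks = [t for t in seeds if is_solvable(t)]               # seed sampling
    for _ in range(rounds):                                     # controlled expansion
        candidates = [t + [random.choice(list(TOOL_DEPS))] for t in tasks]
        tasks.extend(c for c in candidates if is_solvable(c))   # keep solvable paths only
    while len(tasks) < min_tasks:                               # minimum-scale guarantee
        tasks.append(random.choice(tasks))
    return tasks

print(len(expand_tasks([["search_product"]], rounds=3, min_tasks=10)))
```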

Noise Robustness Training

Systematic Real-World Perturbation Training: Models are trained with injected noise to handle imperfect real-world conditions:

  • Tool noise: Execution failures, incomplete results, inconsistent response formats
  • Instruction noise: Ambiguity, redundancy, dynamic requirement changes
  • Curriculum learning: Gradually increasing noise complexity and intensity during training
  • Multi-environment integration: Noise injection across 20+ domains and tens of thousands of environments

This training enables models to maintain robust decision-making capabilities under various real-world perturbations, significantly outperforming models trained only in idealized environments.
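
A small sketch of how such noise injection with a curriculum schedule could look; the helper names and noise types below are hypothetical, chosen to mirror the bullets above, and do not reflect the actual training harness.

```python
import random

def inject_tool_noise(tool_result: dict, noise_level: float) -> dict:
    """Hypothetical noise injector: with probability `noise_level`, replace a clean
    tool result with a failure, a truncated result, or an inconsistent format."""
    if random.random() >= noise_level:
        return tool_result
    kind = random.choice(["failure", "truncation", "format"])
    if kind == "failure":
        return {"status": "error", "message": "tool execution failed"}
    if kind == "truncation":
        data = tool_result.get("data", [])
        return {**tool_result, "data": data[: len(data) // 2]}
    return {"payload": tool_result}  # same content, different response schema

def curriculum_noise_level(step: int, total_steps: int, max_level: float = 0.4) -> float:
    """Curriculum: noise intensity grows gradually over the course of training."""
    return max_level * min(1.0, step / max(1, total_steps))

# Usage inside a (hypothetical) rollout loop:
for step in range(1000):
    level = curriculum_noise_level(step, total_steps=1000)
    observation = inject_tool_noise({"status": "ok", "data": [1, 2, 3, 4]}, level)
```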

Re-thinking Mode (Heavy Thinking)

Width + Depth Dual Expansion: A reasoning mode that combines parallel thinking (width) with deep synthesis (depth):

  • Parallel thinking phase: Simultaneously generates multiple independent reasoning paths, ensuring diversity of thought
  • Summary synthesis phase: Organizes, optimizes, and synthesizes multiple paths with closed-loop iterative reasoning
  • Reinforcement learning enhancement: Specialized components to refine summary synthesis capabilities
  • 8 parallel reasoning paths: Simultaneously activates multiple "brains" for thorough thinking and reliable decision-making

Particularly effective in long-chain reasoning, tool-integrated reasoning, and agentic tool-use scenarios; the performance advantage grows with the computation budget.
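
An illustrative sketch of the width + depth pattern, with a placeholder generate() standing in for a real model call and hypothetical prompts: several reasoning paths are sampled in parallel, then a synthesis call reconciles them into one answer.

```python
from concurrent.futures import ThreadPoolExecutor

def generate(prompt: str, temperature: float) -> str:
    """Stand-in for one model call; replace with a real inference or API call."""
    return f"[model output at T={temperature} for: {prompt[:40]}...]"

def rethink(question: str, n_paths: int = 8) -> str:
    """Width + depth sketch: sample independent reasoning paths in parallel (width),
    then synthesize them into a single, verified answer (depth)."""
    think_prompt = f"Reason step by step and answer:\n{question}"
    # Parallel thinking phase: n_paths independent reasoning attempts.
    with ThreadPoolExecutor(max_workers=n_paths) as pool:
        paths = list(pool.map(lambda _: generate(think_prompt, temperature=1.0), range(n_paths)))
    # Summary synthesis phase: reconcile the attempts into one final answer.
    summary_prompt = (
        "Several independent reasoning attempts for the same question follow.\n"
        "Compare them, resolve disagreements, and produce one final answer.\n\n"
        f"Question: {question}\n\n"
        + "\n\n".join(f"Attempt {i + 1}:\n{p}" for i, p in enumerate(paths))
    )
    return generate(summary_prompt, temperature=0.2)

print(rethink("What is 17 * 24?"))
```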

Zigzag Attention Mechanism

Ultra-Long Context Support: Innovative sparse attention mechanism combining MLA (Multi-head Latent Attention) and SSA (Streaming Sparse Attention):

  • Hierarchical design: Alternately uses sparse attention variants in different layers, avoiding computation imbalance
  • Local window + global anchors: Recent W tokens for short-term dependencies, first B tokens for long-term memory
  • Efficient conversion: Converts from full-attention models via structured sparsification with extremely low overhead
  • 1 million token support: Enables ultra-long sequence processing for LongCat-Flash-Thinking-ZigZag variant

Significantly reduces computation and memory complexity while maintaining context perception capabilities.
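
The local-window-plus-anchors pattern can be written down directly as an attention mask. The sketch below (parameter names window and anchors are illustrative) keeps, for each query, the first `anchors` tokens and the most recent `window` tokens under a causal constraint; in the hierarchical design such sparse layers would alternate with other attention variants across layers.

```python
import torch

def streaming_sparse_mask(seq_len: int, window: int, anchors: int) -> torch.Tensor:
    """Boolean attention mask (True = attend) for a streaming-sparse layer:
    each query attends to the first `anchors` tokens (long-term memory) and the
    most recent `window` tokens (short-term dependencies), causally."""
    q = torch.arange(seq_len).unsqueeze(1)   # query positions
    k = torch.arange(seq_len).unsqueeze(0)   # key positions
    causal = k <= q
    local = (q - k) < window
    anchor = k < anchors
    return causal & (local | anchor)

# With this pattern each query attends to at most window + anchors keys, so
# attention cost grows linearly with sequence length instead of quadratically.
mask = streaming_sparse_mask(seq_len=16, window=4, anchors=2)
print(mask.sum(dim=-1))  # number of attended keys per query position
```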

Dual-Path Reasoning Framework

Combines agentic tool use with formal reasoning for enhanced problem-solving capabilities. Featured in the Flash-Thinking model.

Modality-Decoupled Parallel (MDP) Training

A training schedule that enables efficient multi-modal learning by decoupling modalities during training. Used in the Omni model.

Progressive Multi-Modal Fusion

Curriculum learning approach for multi-modal alignment, gradually integrating different modalities during training.

MM-DiT + Single-DiT Hybrid Architecture (Image Generation)

LongCat-Image uses a shared MM-DiT + Single-DiT hybrid backbone with a VLM condition encoder, enabling text-to-image generation and editing to reinforce each other (see the sketch after the list below). At only 6B parameters, it achieves performance comparable to larger models through progressive learning strategies and systematic data engineering.

  • Unified architecture: A shared backbone lets the generation and editing capabilities enhance each other
  • Progressive learning: Curriculum learning approach for multi-modal alignment, gradually integrating different capabilities during training
  • Multi-task joint learning: Joint training on instruction-based editing and text-to-image generation deepens understanding of complex, diverse instructions
  • Mid-training initialization: Initialized from a mid-training checkpoint of the text-to-image model to inherit its knowledge and aesthetics
  • VLM condition encoder: Provides strong instruction following and consistency preservation
  • Character-level encoding: Uses character-level encoding for text specified in prompts, significantly reducing the model's memorization burden
  • Adversarial training: Uses an AIGC content detector as a reward model to improve the realism of physical textures, lighting, and overall quality
  • Efficient inference: Supports inference on consumer-grade GPUs through refined model design and multi-stage training
  • Chinese text rendering: Covers all 8,105 standard Chinese characters with high accuracy on commercial posters and natural scenes
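
A schematic sketch of the hybrid backbone idea, under simplifying assumptions (toy dimensions and block counts; per-modality attention projections, timestep/AdaLN conditioning, and the VLM condition encoder are omitted, so this is not LongCat-Image's actual implementation): dual-stream MM-DiT blocks keep separate per-modality feed-forward weights while attending jointly over text and image tokens, and are followed by single-stream blocks that share one set of weights over the concatenated sequence.

```python
import torch
import torch.nn as nn

class MMDiTBlock(nn.Module):
    """Dual-stream sketch: per-modality feed-forward weights, joint attention over
    the concatenated text+image sequence."""

    def __init__(self, d: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.img_mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        self.txt_mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, img: torch.Tensor, txt: torch.Tensor):
        x = torch.cat([img, txt], dim=1)          # joint attention over both modalities
        a, _ = self.attn(x, x, x)
        ai, at = a[:, : img.shape[1]], a[:, img.shape[1]:]
        return img + ai + self.img_mlp(img + ai), txt + at + self.txt_mlp(txt + at)

class SingleDiTBlock(nn.Module):
    """Single-stream sketch: one set of weights over the concatenated sequence."""

    def __init__(self, d: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = x + self.attn(x, x, x)[0]
        return h + self.mlp(h)

class HybridBackbone(nn.Module):
    """Shared backbone sketch: MM-DiT (dual-stream) blocks followed by Single-DiT
    (single-stream) blocks, reused by both text-to-image generation and editing."""

    def __init__(self, d: int = 256, n_mm: int = 2, n_single: int = 2):
        super().__init__()
        self.mm_blocks = nn.ModuleList([MMDiTBlock(d) for _ in range(n_mm)])
        self.single_blocks = nn.ModuleList([SingleDiTBlock(d) for _ in range(n_single)])

    def forward(self, img_tokens: torch.Tensor, txt_cond: torch.Tensor) -> torch.Tensor:
        for blk in self.mm_blocks:
            img_tokens, txt_cond = blk(img_tokens, txt_cond)
        x = torch.cat([img_tokens, txt_cond], dim=1)
        for blk in self.single_blocks:
            x = blk(x)
        return x[:, : img_tokens.shape[1]]  # keep the image-token stream
```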

Training Innovations

  • Hyperparameter transfer: Hyperparameters tuned on smaller proxy models transfer to the full-scale model for efficient scaling
  • Model-growth initialization: The full model is initialized from a smaller pre-trained model for progressive capacity expansion (sketched below)
  • Variance alignment: Improves training stability
  • Router balancing: Keeps expert load balanced for optimal utilization
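
As one illustration of how model-growth initialization can work in principle (a hypothetical helper; the actual growth schedule may differ): a deeper model is initialized by stacking duplicated layers of a smaller trained model, so training starts from learned features rather than random weights.

```python
import copy
import torch.nn as nn

def grow_by_layer_stacking(small_layers: nn.ModuleList) -> nn.ModuleList:
    """Hypothetical helper: double the depth by interleaving each trained layer
    with a copy of itself, so the grown model starts from learned features."""
    grown = nn.ModuleList()
    for layer in small_layers:
        grown.append(layer)                 # original trained layer
        grown.append(copy.deepcopy(layer))  # stacked duplicate
    return grown

# Usage: a toy 2-layer stack grows into a 4-layer stack with inherited weights.
small = nn.ModuleList([nn.Linear(8, 8) for _ in range(2)])
large = grow_by_layer_stacking(small)
assert len(large) == 4
```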

Architecture Highlights

  • MoE Architecture: 560B parameters with efficient routing
  • High-throughput inference: 100+ tokens/s on H800 GPUs
  • Extended context: Up to 128K tokens
  • Multi-modal support: Unified architecture for text, image, audio, video