Technology
Core innovations powering LongCat AI models
Key Technologies
Zero-Computation Experts
MoE routing that includes zero-computation (identity) experts, letting tokens skip heavy expert compute: each token activates only 18.6B–31.3B of the 560B total parameters (~27B on average), achieving cost efficiency while maintaining competitive quality.
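As a rough illustration of the idea, here is a minimal PyTorch sketch (module and dimension names are hypothetical, not the production router): alongside ordinary FFN experts, the router can select identity "zero-computation" experts that return a token unchanged, so the activated parameter count varies per token.

```python
import torch
import torch.nn as nn

class ZeroComputeMoE(nn.Module):
    """Toy MoE layer with zero-computation (identity) experts.

    Hypothetical sketch: n_ffn real FFN experts plus n_zero identity
    experts. Tokens routed to identity experts skip FFN compute entirely,
    so the number of activated parameters varies per token.
    """

    def __init__(self, d_model=64, d_hidden=256, n_ffn=4, n_zero=2, top_k=2):
        super().__init__()
        self.n_ffn, self.n_zero, self.top_k = n_ffn, n_zero, top_k
        self.router = nn.Linear(d_model, n_ffn + n_zero)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_ffn)
        )

    def forward(self, x):                      # x: (tokens, d_model)
        weights, idx = self.router(x).softmax(-1).topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e in range(self.n_ffn + self.n_zero):
                mask = idx[:, k] == e
                if not mask.any():
                    continue
                # Identity experts (e >= n_ffn) cost zero FFN compute.
                y = x[mask] if e >= self.n_ffn else self.experts[e](x[mask])
                out[mask] += weights[mask, k].unsqueeze(-1) * y
        return out

tokens = torch.randn(8, 64)
print(ZeroComputeMoE()(tokens).shape)  # torch.Size([8, 64])
```

Because the identity experts carry routing weight but no parameters, easy tokens consume less compute, which is what produces a per-token activation range rather than a fixed count.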
Shortcut-connected MoE (ScMoE)
A shortcut connection that reorders the block's data dependencies so computation overlaps with expert-parallel communication, reducing latency at scale. Also enables unified expert routing across modalities in Omni models.
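A single-device structural sketch of the shortcut (hypothetical names; the real benefit appears only under expert parallelism, where the marked MoE call launches all-to-all communication): because the dense FFN path depends only on the block input rather than the attention output, a scheduler can run it while the MoE dispatch/combine is still in flight.

```python
import torch
import torch.nn as nn

class ScMoEBlock(nn.Module):
    """Toy shortcut-connected MoE block (hypothetical layer names).

    The shortcut reorders the data dependency: the dense FFN consumes the
    pre-attention input, so in a distributed run its computation can be
    scheduled while the MoE all-to-all dispatch/combine is in flight.
    Here everything runs on one device, so the overlap is indicated only
    by comments.
    """

    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.dense_ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(),
            nn.Linear(4 * d_model, d_model))
        self.moe = nn.Linear(d_model, d_model)  # stand-in for an MoE layer

    def forward(self, x):
        h, _ = self.attn(x, x, x)
        h = x + h
        # MoE path: under expert parallelism this launches all-to-all comms.
        moe_out = self.moe(h)
        # Shortcut path: depends only on x, so it can run concurrently
        # with the MoE communication above.
        dense_out = self.dense_ffn(x)
        return h + moe_out + dense_out

x = torch.randn(2, 16, 64)
print(ScMoEBlock()(x).shape)  # torch.Size([2, 16, 64])
```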
DORA Training System
Dynamic Orchestration for Asynchronous rollout: decouples rollout generation from optimization for efficient large-scale training across domains. Used to train on over 20T tokens in roughly 30 days across large GPU clusters.
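The rollout side of the name can be pictured with a small asyncio sketch (all function names are hypothetical; the real system adds streaming, fault tolerance, and cluster-level scheduling): workers produce rollouts at uneven speeds into a shared queue, and the trainer consumes whatever is ready instead of waiting for the slowest worker.

```python
import asyncio
import random

async def rollout_worker(wid, queue):
    """Hypothetical generator: produces rollouts at uneven speeds."""
    for step in range(4):
        await asyncio.sleep(random.uniform(0.01, 0.05))  # simulate decode time
        await queue.put({"worker": wid, "step": step, "tokens": [1, 2, 3]})

async def trainer(queue, total, batch_size=4):
    """Consumes whatever rollouts are ready; never waits for stragglers."""
    seen = 0
    while seen < total:
        batch = [await queue.get() for _ in range(min(batch_size, total - seen))]
        seen += len(batch)
        print(f"train step on {len(batch)} rollouts "
              f"from workers {sorted({r['worker'] for r in batch})}")

async def main():
    queue = asyncio.Queue()
    workers = [asyncio.create_task(rollout_worker(w, queue)) for w in range(3)]
    await trainer(queue, total=12)
    await asyncio.gather(*workers)

asyncio.run(main())
```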
Dual-Path Reasoning Framework
Combines agentic tool use with formal reasoning to strengthen problem solving. Featured in the Flash-Thinking model.
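Published details are sparse, so the following toy sketch only illustrates the dual-path idea in general terms (every function here is hypothetical): one path answers via an external tool call, the other via explicit rule-based reasoning, and an answer is accepted only when the two paths agree.

```python
import ast
import operator

def tool_path(problem: str):
    """Hypothetical agentic path: delegate to an external tool.

    Here Python's own evaluator stands in for a calculator or
    code-interpreter call.
    """
    return eval(problem)

def formal_path(problem: str):
    """Hypothetical formal path: derive the answer by explicit
    rule-based evaluation of the expression tree."""
    ops = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul}
    def walk(node):
        if isinstance(node, ast.BinOp):
            return ops[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.Constant):
            return node.value
        raise ValueError("unsupported expression")
    return walk(ast.parse(problem, mode="eval").body)

def dual_path_solve(problem: str):
    """Accept an answer only when the two paths agree."""
    a, b = tool_path(problem), formal_path(problem)
    return a if a == b else None

print(dual_path_solve("2 + 3 * 4"))  # 14
```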
Modality-Decoupled Parallel (MDP) Training
A training schedule that decouples the computation for different modalities, enabling efficient large-scale multi-modal learning. Used in the Omni model.
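A single-process sketch of the decoupling (shapes and optimizers are hypothetical; in a real system each modality's pass would run on its own parallel group): per-modality encoder passes share no dependencies with one another, so they can be scheduled concurrently, with only the shared backbone seeing every modality.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of modality-decoupled training: each modality
# encoder has its own optimizer and (in a real system) its own parallel
# group, so encoder updates never block each other; only the shared
# fusion backbone sees all modalities.
encoders = {m: nn.Linear(32, 64) for m in ("text", "audio", "vision")}
backbone = nn.Linear(64, 64)
enc_opts = {m: torch.optim.AdamW(e.parameters(), lr=1e-4)
            for m, e in encoders.items()}
backbone_opt = torch.optim.AdamW(backbone.parameters(), lr=1e-4)

batch = {m: torch.randn(8, 32) for m in encoders}   # toy per-modality data
target = torch.randn(8, 64)

# Per-modality passes are independent: in a decoupled schedule they can
# run on separate device groups in parallel.
losses = []
for m, enc in encoders.items():
    losses.append(nn.functional.mse_loss(backbone(enc(batch[m])), target))

total = torch.stack(losses).sum()
total.backward()
for opt in (*enc_opts.values(), backbone_opt):
    opt.step()
    opt.zero_grad()
print(f"step done, loss={total.item():.3f}")
```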
Progressive Multi-Modal Fusion
Curriculum learning approach for multi-modal alignment, gradually integrating different modalities during training.
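A minimal sketch of what such a curriculum could look like (the stage boundaries and modality order are invented for illustration): a schedule maps the training step to the set of modalities the batch sampler may draw from.

```python
# Hypothetical curriculum schedule for progressive multi-modal fusion:
# modalities are introduced in stages rather than all at once.
FUSION_STAGES = [
    (0,       {"text"}),                             # text-only warm-up
    (10_000,  {"text", "image"}),                    # add vision
    (30_000,  {"text", "image", "audio"}),           # add audio
    (60_000,  {"text", "image", "audio", "video"}),  # full mix
]

def active_modalities(step: int) -> set[str]:
    """Return which modalities the batch sampler should draw at `step`."""
    active = set()
    for start, modalities in FUSION_STAGES:
        if step >= start:
            active = modalities
    return active

for s in (0, 15_000, 75_000):
    print(s, sorted(active_modalities(s)))
```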
MM-DiT + Single-DiT Hybrid Architecture (Image Generation)
LongCat-Image uses a shared MM-DiT + Single-DiT hybrid backbone with a VLM condition encoder, allowing text-to-image generation and editing to reinforce each other. At only 6B parameters, it achieves performance comparable to larger models through progressive learning strategies and systematic data engineering. Key design points (a structural sketch of the backbone follows the list):
- Unified architecture: Shared backbone enables generation and editing capabilities to mutually enhance each other
- Progressive learning: Curriculum learning approach for multi-modal alignment, gradually integrating different capabilities during training
- Multi-task joint learning: Instruction editing and text-to-image multi-task joint learning mechanism deepens understanding of complex and diverse instructions
- Mid-training initialization: Initializes from mid-training stage of text-to-image model to effectively inherit knowledge and aesthetics
- VLM condition encoder: Provides strong instruction following and preserves consistency during editing
- Character-level encoding: Uses character-level encoding for specified text in prompts, significantly reducing model memory burden
- Adversarial training: Uses an AIGC content detector as a reward model to push outputs toward realistic physical textures, lighting, and overall quality
- Efficient inference: Supports consumer-grade GPU inference through refined model design and multi-stage training
- Chinese text rendering: Covers all 8,105 standard Chinese characters with high accuracy on commercial posters and natural scenes
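As referenced above, here is a toy skeleton of the hybrid backbone (depths, widths, and block internals are illustrative, not LongCat-Image's actual configuration): MM-DiT blocks run joint attention over concatenated condition and image-latent tokens, after which cheaper Single-DiT blocks refine the image tokens alone.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Toy transformer block (pre-norm attention + MLP)."""
    def __init__(self, d=64, heads=4):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(d), nn.LayerNorm(d)
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(),
                                 nn.Linear(4 * d, d))

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]
        return x + self.mlp(self.norm2(x))

class HybridDiT(nn.Module):
    """Hypothetical skeleton of the MM-DiT + Single-DiT hybrid.

    MM-DiT stage: joint attention over concatenated condition (text/VLM)
    tokens and image latent tokens. Single-DiT stage: cheaper blocks over
    image tokens only. Dimensions and depths are illustrative.
    """
    def __init__(self, d=64, mm_depth=2, single_depth=4):
        super().__init__()
        self.mm_blocks = nn.ModuleList(Block(d) for _ in range(mm_depth))
        self.single_blocks = nn.ModuleList(Block(d) for _ in range(single_depth))

    def forward(self, cond_tokens, image_tokens):
        n_cond = cond_tokens.shape[1]
        x = torch.cat([cond_tokens, image_tokens], dim=1)
        for blk in self.mm_blocks:          # joint text+image attention
            x = blk(x)
        img = x[:, n_cond:]                 # drop the condition stream
        for blk in self.single_blocks:      # image-only refinement
            img = blk(img)
        return img

cond = torch.randn(2, 8, 64)      # e.g. VLM condition-encoder output
latents = torch.randn(2, 256, 64)
print(HybridDiT()(cond, latents).shape)  # torch.Size([2, 256, 64])
```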
Training Innovations
- Hyperparameter transfer: Tune hyperparameters on small proxy models, then transfer them for efficient scaling
- Model-growth initialization: Initialize larger models from smaller trained checkpoints for progressive capacity expansion
- Variance alignment: Align activation variance across modules for training stability
- Router balancing: Keep expert load uniform so all experts are utilized (see the sketch after this list)
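LongCat's exact balancing method isn't specified here, so the sketch below uses the standard Switch-Transformer-style auxiliary loss as a stand-in: it penalizes correlation between per-expert token load and per-expert router probability, and is minimized when routing is uniform.

```python
import torch

def load_balance_loss(router_probs, expert_idx, n_experts):
    """Switch-Transformer-style auxiliary load-balancing loss (a common
    technique; not confirmed to be LongCat's exact method).

    Penalizes the dot product of the per-expert routed-token fraction and
    the per-expert mean router probability; minimized under uniform load.
    """
    # Fraction of tokens routed to each expert.
    counts = torch.bincount(expert_idx.flatten(), minlength=n_experts).float()
    frac_tokens = counts / counts.sum()
    # Mean router probability mass assigned to each expert.
    frac_probs = router_probs.mean(dim=0)
    return n_experts * torch.dot(frac_tokens, frac_probs)

probs = torch.softmax(torch.randn(32, 8), dim=-1)   # (tokens, experts)
top1 = probs.argmax(dim=-1)
print(load_balance_loss(probs, top1, n_experts=8))
```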
Architecture Highlights
- MoE Architecture: 560B parameters with efficient routing
- High-throughput inference: 100+ tokens/s on H800 GPUs
- Extended context: Up to 128K tokens
- Multi-modal support: Unified architecture for text, image, audio, video