LongCat Models
Comprehensive overview of all LongCat AI model variants and their capabilities.
Model Variants
LongCat-Flash-Lite
Released: 2026
Lightweight MoE model built on N-gram embedding expansion, with 68.5B total parameters and roughly 2.9B–4.5B activated per inference. Optimized for agentic tool use and coding, with up to 256K context via YaRN.
- 256K context length (YaRN)
- 500–700 tokens/s on a typical 4K-in/1K-out load (LongCat API; usage sketch below)
- Strong performance on agentic tool-use and coding benchmarks
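If the LongCat API exposes an OpenAI-compatible chat endpoint (an assumption here; check the official API documentation for the real base URL and model name), a minimal call might look like this:

```python
from openai import OpenAI

# Assumptions: base_url and model id below are placeholders, not the
# documented LongCat API values; consult the official docs before use.
client = OpenAI(
    base_url="https://api.longcat.example/v1",  # hypothetical endpoint
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="longcat-flash-lite",                 # hypothetical model id
    messages=[
        {"role": "system", "content": "You are a coding assistant."},
        {"role": "user", "content": "Write a Python function that parses an ISO 8601 date."},
    ],
    max_tokens=1024,
)
print(response.choices[0].message.content)
```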
LongCat-Flash-Chat
Released: September 1, 2025
Foundation dialogue model with 560B parameters in a Mixture-of-Experts (MoE) architecture. Activates approximately 18.6B–31.3B parameters per token (averaging ~27B) through Zero-Computation Experts.
- Supports up to 128K context length
- Achieves 100+ tokens/s on H800 GPUs
- Strong instruction following, reasoning, and coding
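The idea behind Zero-Computation Experts is that the router can send a token to experts that simply pass it through unchanged, so the number of activated FFN parameters varies from token to token. The toy PyTorch layer below is a minimal sketch of that routing pattern, not the LongCat-Flash implementation; all sizes and the top-k policy are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ZeroComputeMoE(nn.Module):
    """Toy MoE layer with zero-computation (identity) experts.

    Tokens routed to an identity expert contribute themselves to the
    weighted output, so no FFN FLOPs are spent on that slot. This is a
    simplified illustration, not the LongCat-Flash architecture.
    """
    def __init__(self, d_model=64, n_ffn_experts=4, n_zero_experts=2, top_k=2):
        super().__init__()
        self.n_ffn = n_ffn_experts
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_ffn_experts + n_zero_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_ffn_experts)
        )

    def forward(self, x):                       # x: (tokens, d_model)
        scores = self.router(x)                 # (tokens, n_ffn + n_zero)
        top_val, top_idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(top_val, dim=-1)    # normalize over selected slots
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            idx, w = top_idx[:, slot], weights[:, slot:slot + 1]
            for e in range(self.n_ffn):         # real experts: run the FFN
                mask = idx == e
                if mask.any():
                    out[mask] += w[mask] * self.experts[e](x[mask])
            zero_mask = idx >= self.n_ffn       # zero experts: identity, no FLOPs
            out[zero_mask] += w[zero_mask] * x[zero_mask]
        return out
```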
LongCat-Flash-Thinking
Latest: Flash-Thinking-2601 | Original: September 22, 2025
Enhanced reasoning model with open-source SOTA tool-calling capabilities. Introduces a "Re-thinking Mode" with 8 parallel reasoning paths, a dual-path reasoning framework, and the DORA asynchronous training system.
- Open-source SOTA on Agentic Tool Use, Agentic Search, and TIR benchmarks
- Re-thinking Mode: 8 parallel reasoning paths for thorough decision-making
- 64.5% token savings in tool-call scenarios
- Outperforms Claude in complex random tool-calling tasks
- Perfect score (100.0) on AIME-25, SOTA (86.8) on IMO-AnswerBench
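The description above specifies Re-thinking Mode only as 8 parallel reasoning paths. A common way to combine parallel paths is self-consistency voting, sketched below; `ask_model` is an assumed callable, and the actual aggregation used by LongCat-Flash-Thinking may differ.

```python
from collections import Counter

def rethink(ask_model, question, n_paths=8):
    """Hypothetical parallel-path reasoning: sample several independent
    reasoning traces and keep the most common final answer.
    `ask_model(prompt, temperature)` is an assumed callable returning
    (reasoning, answer); it is not the published LongCat API.
    """
    answers = []
    for _ in range(n_paths):
        _reasoning, answer = ask_model(question, temperature=0.8)
        answers.append(answer)
    best, votes = Counter(answers).most_common(1)[0]
    return best, votes / n_paths   # answer plus agreement ratio
```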
LongCat-Video
Released: October 27, 2025
Video generation model based on Diffusion Transformer (DiT) architecture. Unified support for text-to-video, image-to-video, and video continuation tasks.
- Generates 5-minute coherent videos at 720p/30fps
- Long temporal sequences and cross-frame consistency
- Physical motion plausibility
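Long videos of this kind are typically produced chunk by chunk, with each new chunk conditioned on frames already generated. The loop below is a hypothetical sketch of that pattern; the `generate_chunk` sampler and the conditioning window are assumptions, not LongCat-Video's actual interface.

```python
def continue_video(generate_chunk, context_frames, n_chunks=10, chunk_len=90):
    """Hypothetical chunk-wise continuation loop: each new chunk is
    conditioned on the tail of everything generated so far, which is how
    long sequences can stay temporally coherent. `generate_chunk` is an
    assumed callable wrapping the DiT sampler.
    """
    frames = list(context_frames)
    for _ in range(n_chunks):
        new_chunk = generate_chunk(condition=frames[-chunk_len:])
        frames.extend(new_chunk)
    return frames
```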
LongCat-Video-Avatar
Released: November 2025
SOTA-level avatar video generation model built on LongCat-Video base. Achieves breakthrough improvements in realism, long-video stability, and identity consistency for virtual human generation.
- Audio-Text-to-Video (AT2V) and Audio-Text-Image-to-Video (ATI2V)
- Natural micro-movements during silent segments (blinking, breathing)
- Cross-Chunk Latent Stitching for stable 5-minute+ video generation (sketched after this list)
- Reference Skip Attention mechanism for identity consistency
- SOTA performance on HDTF, CelebV-HQ, EMTD, and EvalTalker benchmarks
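Cross-Chunk Latent Stitching is not specified in detail here. One plausible reading is that adjacent latent chunks share an overlap region that is cross-faded so chunk boundaries stay smooth; the NumPy sketch below illustrates that reading only, and the real mechanism may differ.

```python
import numpy as np

def stitch_latents(chunks, overlap=8):
    """Hypothetical cross-chunk stitching: adjacent latent chunks
    (each shaped (T, C, H, W)) share an `overlap` window that is
    linearly cross-faded to avoid visible jumps at chunk boundaries.
    This is an illustrative guess, not LongCat-Video-Avatar's method.
    """
    out = chunks[0]
    ramp = np.linspace(0.0, 1.0, overlap)[:, None, None, None]  # (T,1,1,1)
    for nxt in chunks[1:]:
        blended = (1 - ramp) * out[-overlap:] + ramp * nxt[:overlap]
        out = np.concatenate([out[:-overlap], blended, nxt[overlap:]], axis=0)
    return out
```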
LongCat-Flash-Omni
Released: November 2025
First open-source real-time all-modality interaction model. Unifies text, image, audio, and video with a single end-to-end ScMoE backbone.
- Open-source SOTA on Omni-Bench and WorldSense
- Low-latency, streaming multi-modal IO
- 128K context with multi-turn dialogue
LongCat-Image
Released: Latest | Parameters: 6B
Open-source AI image generation and editing model. Achieves open-source SOTA on image editing benchmarks (GEdit-Bench, ImgEdit-Bench) and leading performance in Chinese text rendering (ChineseWord: 90.7). Covers all 8,105 standard Chinese characters.
- Image editing: Open-source SOTA (ImgEdit-Bench 4.50, GEdit-Bench 7.60/7.64)
- Chinese text rendering: 90.7 on ChineseWord, covering all 8,105 characters
- Text-to-image: GenEval 0.87, DPG-Bench 86.8
- Available on LongCat Web and LongCat APP (24 templates, image-to-image)
- Fully open-source: Hugging Face | GitHub
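Assuming the Hugging Face release ships a diffusers-compatible pipeline (check the model card; the repo id below is a guess), a text-to-image call could look roughly like this:

```python
import torch
from diffusers import DiffusionPipeline

# Assumption: the release is diffusers-compatible and the repo id is
# illustrative only; follow the official model card for actual loading code.
pipe = DiffusionPipeline.from_pretrained(
    "meituan-longcat/LongCat-Image",      # hypothetical repo id
    torch_dtype=torch.bfloat16,
).to("cuda")

image = pipe(
    prompt="A neon sign that reads 龙猫咖啡 above a rainy street at night",
    num_inference_steps=30,
).images[0]
image.save("longcat_image.png")
```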
LongCat-Audio-Codec
Audio processing module providing low-bitrate, real-time streaming audio tokenization and detokenization for speech LLMs, enabling efficient audio encoding and decoding.
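As a conceptual sketch of how such a codec sits in a speech-LLM pipeline (the method names below are assumptions, not the published LongCat-Audio-Codec API): audio is encoded into discrete tokens the LLM can consume, and tokens are decoded back into a waveform for playback.

```python
def round_trip(codec, waveform, sample_rate=16000):
    """Hypothetical round trip through a streaming audio codec.
    `codec.encode` / `codec.decode` are assumed method names used only
    to illustrate the tokenize/detokenize flow.
    """
    tokens = codec.encode(waveform, sample_rate)   # low-bitrate discrete tokens
    audio = codec.decode(tokens)                   # reconstructed waveform
    return tokens, audio
```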
Model Comparison
| Model | Parameters / Architecture | Key Feature | Use Case |
|---|---|---|---|
| Flash-Lite | 68.5B (MoE, sparse) | N-gram embedding expansion | Agentic tool use, coding, long-context analysis |
| Flash-Chat | 560B (MoE) | High-throughput dialogue | General conversation, coding |
| Flash-Thinking | 560B (MoE) | Enhanced reasoning | Tool use, formal reasoning |
| Video | DiT-based | Video generation | Text/image-to-video, continuation |
| Video-Avatar | DiT-based (LongCat-Video base) | Avatar video generation (SOTA) | Audio/text/image-to-video, virtual human |
| Flash-Omni | ScMoE | All-modality | Multi-modal interaction |
| Image | 6B (MM-DiT+Single-DiT) | Image generation & editing (Open-source SOTA) | Text-to-image, image editing, Chinese text rendering |
Get Started
Choose a model to explore detailed documentation, benchmarks, and deployment guides: