# Benchmarks

Performance metrics and comparisons for LongCat AI models.

## Text-Based Benchmarks

Representative results reported by the authors (non-exhaustive):
| Category | Benchmark | Metric | LongCat-Flash |
|---|---|---|---|
| General Domains | MMLU | acc | 89.71 |
| Instruction Following | IFEval | acc | 89.65 |
| Math Reasoning | MATH500 | acc | 96.40 |
| General Reasoning | DROP | F1 | 79.06 |
| Coding | HumanEval+ | pass@1 | 88.41 |
| Agentic Tool Use | τ²-Bench (telecom) | avg@4 | 73.68 |
## UNO-Bench: Unified All-Modality Benchmark

The LongCat team has released UNO-Bench, a unified, high-quality all-modality benchmark designed to evaluate both single-modality and omni-modality intelligence under a single framework, with strong Chinese-language support. It also reveals a Combination Law governing omni-modality performance.
### Why UNO-Bench?

- One-stop benchmark: evaluates image, audio, video, and text, plus their fusion
- High-quality curation: 1,250 omni-modality samples and 2,480 single-modality samples; 98% require genuine cross-modal fusion
- Chinese-centric: robust coverage of Chinese scenarios and tasks
- Open-ended reasoning: Multi-step Open-ended (MO) questions with weighted human grading and an automatic scoring model (95% accuracy); a minimal scoring sketch follows this list
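
A minimal sketch of how weighted multi-step grading can be aggregated; the `Step` dataclass, the weights, and `grade_mo_question` are illustrative assumptions, not the benchmark's actual implementation:

```python
from dataclasses import dataclass

@dataclass
class Step:
    """One reasoning step of a Multi-step Open-ended (MO) question."""
    weight: float   # human-assigned importance of the step (assumed scheme)
    correct: bool   # verdict from a human grader or the scoring model

def grade_mo_question(steps: list[Step]) -> float:
    """Return a weighted score in [0, 1]: each step earns credit by weight."""
    total = sum(s.weight for s in steps)
    earned = sum(s.weight for s in steps if s.correct)
    return earned / total if total else 0.0

# Example: a three-step question whose final synthesis step carries most weight.
steps = [Step(1.0, True), Step(1.0, True), Step(2.0, False)]
print(grade_mo_question(steps))  # 0.5
```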
### Combination Law (Synergistic Promotion)

Omni-modality performance follows a power law over single-modality perception abilities (audio and vision):

$$
P_{\text{Omni}} \approx 1.0332 \cdot (P_A \times P_V)^{2.1918} + 0.2422
$$

- Bottleneck effect: while either perception ability is weak, omni performance improves slowly
- Synergistic gain: once both abilities are strong, omni performance improves super-linearly (1 + 1 ≫ 2); see the worked example below
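
A worked example of the fitted law, evaluated at a weak and a strong operating point to make the two regimes concrete (the input scores are illustrative, not reported results):

```python
def p_omni(p_audio: float, p_vision: float) -> float:
    """Fitted Combination Law from UNO-Bench; scores normalized to [0, 1]."""
    return 1.0332 * (p_audio * p_vision) ** 2.1918 + 0.2422

# Bottleneck regime: improving weak perception barely moves omni performance.
print(p_omni(0.3, 0.3))  # ~0.25
print(p_omni(0.4, 0.4))  # ~0.26
# Synergy regime: the same +0.1 per modality now yields a large jump.
print(p_omni(0.8, 0.8))  # ~0.63
print(p_omni(0.9, 0.9))  # ~0.89
```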
### Data Pipeline & Quality

- Manual curation to avoid contamination; >90% of visuals are private, crowdsourced material
- Audio-visual decoupling: audio is independently designed/recorded and then paired with video, forcing genuine cross-modal fusion
- Ablation checks: removing one modality at a time verifies that ≥98% of samples require cross-modal information to solve
- Cluster-guided sampling: >90% compute reduction while preserving rank consistency (SRCC/PLCC > 0.98); a sketch of the consistency check follows this list
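
A sketch of the rank-consistency check under stated assumptions: synthetic per-sample scores, a plain KMeans over assumed sample embeddings as the "cluster-guided" selector, and SRCC/PLCC computed between the full-set and subset leaderboards. None of these choices are claimed to be the benchmark's actual pipeline:

```python
import numpy as np
from scipy.stats import spearmanr, pearsonr
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Toy setup (assumed): 8 models, 1,000 samples with 32-dim embeddings.
sample_emb = rng.normal(size=(1000, 32))
skill = np.linspace(0.2, 0.9, 8)  # assumed distinct model skill levels
scores = rng.normal(loc=skill[:, None], scale=0.1, size=(8, 1000)).clip(0, 1)

# Cluster-guided sampling: cluster samples, keep the point nearest each centroid.
k = 100  # ~90% fewer samples to evaluate
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(sample_emb)
subset = [int(np.argmin(np.linalg.norm(sample_emb - c, axis=1)))
          for c in km.cluster_centers_]

full = scores.mean(axis=1)             # per-model score on the full set
sub = scores[:, subset].mean(axis=1)   # per-model score on the subset

srcc, _ = spearmanr(full, sub)  # rank consistency
plcc, _ = pearsonr(full, sub)   # linear consistency
print(f"SRCC={srcc:.3f}  PLCC={plcc:.3f}")
```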
### Highlights
- Omni SOTA: LongCat-Flash-Omni leads among open-source models on UNO-Bench
- Reasoning gap: spatial/temporal/complex reasoning remains the key separator among top models
### Multi-Modal Leaderboard

LongCat-Flash-Omni achieves open-source SOTA across modalities.
| Suite | Metric | LongCat-Flash-Omni | Qwen3-Omni | Gemini-2.5-Flash | Gemini-2.5-Pro |
|---|---|---|---|---|---|
| Omni-Bench | avg | SOTA | - | - | - |
| WorldSense | avg | SOTA | - | - | - |
_Placeholder entries will be replaced with exact scores when officially available._
## Model-Specific Benchmarks

### LongCat-Image

#### Image Editing (Open-source SOTA)
- ImgEdit-Bench: 4.50 (open-source SOTA, approaching top closed-source models)
- GEdit-Bench: Chinese 7.60 / English 7.64 (open-source SOTA)
#### Text-to-Image
- GenEval: 0.87
- DPG-Bench: 86.8
- Competitive with top open-source and closed-source models
#### Chinese Text Rendering (Leading Performance)

- ChineseWord: 90.7, well ahead of all other evaluated models
- Character coverage: all 8,105 characters of the Table of General Standard Chinese Characters
- Supports common characters, rare characters, variant forms, and calligraphy styles
#### Subjective Evaluation

- Text-to-image (MOS): excellent realism relative to mainstream open- and closed-source models; open-source SOTA in text-image alignment and reasonableness
- Image editing (SBS): significantly outperforms other open-source solutions; competitive with commercial models such as Nano Banana and Seedream 4.0
### Flash-Thinking

- AIME25: 64.5% token savings (from 19,653 tokens down to 6,965); a quick arithmetic check follows
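
As a quick arithmetic check, the savings follow directly from the reported token counts:

$$
1 - \frac{6{,}965}{19{,}653} \approx 0.646
$$

in line with the reported figure of roughly 64.5%; any small gap presumably reflects how the original report averages across problems.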
_Values are summarized from public reports; consult the official resources for full details and evaluation conditions._