Benchmarks

Performance metrics and comparisons for LongCat AI models

Text-Based Benchmarks

Representative results reported by the authors (non-exhaustive):

| Category              | Benchmark          | Metric | LongCat-Flash |
|-----------------------|--------------------|--------|---------------|
| General Domains       | MMLU               | acc    | 89.71         |
| Instruction Following | IFEval             | acc    | 89.65         |
| Math Reasoning        | MATH500            | acc    | 96.40         |
| General Reasoning     | DROP               | F1     | 79.06         |
| Coding                | HumanEval+         | pass@1 | 88.41         |
| Agentic Tool Use      | τ²-Bench (telecom) | avg@4  | 73.68         |

UNO-Bench: Unified All-Modality Benchmark

The LongCat team releases UNO-Bench, a unified, high-quality all-modality benchmark designed to evaluate both single-modality and omni-modality intelligence under one framework, with strong Chinese-language support. It also reveals a Combination Law governing omni-modality performance.

Why UNO-Bench?

  • One-stop benchmark: evaluates image, audio, video, and text, plus their fusion
  • High-quality curation: 1,250 omni samples and 2,480 single-modality samples; 98% of omni samples require cross-modal fusion
  • Chinese-centric: robust Chinese scenarios and tasks
  • Open-ended reasoning: Multi-step Open-ended (MO) questions with weighted human grading; automatic scoring model (95% accuracy)

Combination Law (Synergistic Promotion)

Omni-modality performance follows a power law over the single-modality abilities (audio perception P_A and visual perception P_V):

P_Omni ≈ 1.0332 · (P_A × P_V)^2.1918 + 0.2422
  • Bottleneck effect: weaker models grow slowly
  • Synergistic gain: stronger models exhibit accelerated improvement (1 + 1 ≫ 2)
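As a sanity check, the fitted law can be evaluated directly. The coefficients are the ones quoted above; the sample ability scores are illustrative only, not real model results:

```python
def predicted_omni(p_audio: float, p_visual: float) -> float:
    """Fitted power law: P_Omni ≈ 1.0332 · (P_A × P_V)^2.1918 + 0.2422."""
    return 1.0332 * (p_audio * p_visual) ** 2.1918 + 0.2422

# Illustrative single-modality scores in [0, 1]:
weak = predicted_omni(0.60, 0.60)    # ≈ 0.352 — bottleneck: slow growth
strong = predicted_omni(0.90, 0.90)  # ≈ 0.893 — synergy: accelerated gains
print(f"weak={weak:.3f} strong={strong:.3f}")
```

The exponent above 2 is what produces both effects in the bullets: small products of abilities are suppressed, while gains compound once both modalities are strong.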

Data Pipeline & Quality

  • Manual curation to avoid contamination; >90% private, crowdsourced visuals
  • Audio-visual decoupling: independently designed/recorded audio paired with video to force real fusion
  • Ablation checks: modality removal verifies cross-modal solvability (≥98%)
  • Cluster-guided sampling: >90% compute reduction with rank consistency (SRCC/PLCC > 0.98)

Highlights

  • Omni SOTA: LongCat-Flash-Omni leads among open-source models on UNO-Bench
  • Reasoning gap: spatial/temporal/complex reasoning remains the key separator among top models

Multi-Modal Leaderboard

LongCat-Flash-Omni achieves open-source SOTA across modalities.

| Suite      | Metric | LongCat-Flash-Omni | Qwen3-Omni | Gemini-2.5-Flash | Gemini-2.5-Pro |
|------------|--------|--------------------|------------|------------------|----------------|
| Omni-Bench | avg    | SOTA               | -          | -                | -              |
| WorldSense | avg    | SOTA               | -          | -                | -              |

Replace placeholders with exact numbers when available.

Model-Specific Benchmarks

Flash-Thinking

  • AIME25: 64.5% token savings (from 19,653 tokens down to 6,965)
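The quoted savings figure follows directly from the two token counts above (a trivial check; with these counts the ratio rounds to ≈ 64.6%, in line with the reported 64.5%):

```python
# Token counts reported for AIME25, before vs. after Flash-Thinking.
baseline_tokens = 19_653
thinking_tokens = 6_965

savings = 1 - thinking_tokens / baseline_tokens
print(f"{savings:.1%}")
```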

Values summarized from public reports; please consult the official resources for full details and conditions.