Benchmarks

Performance metrics and comparisons for LongCat AI models

Text-Based Benchmarks

Representative results reported by the authors (non-exhaustive):

Category               Benchmark            Metric   LongCat-Flash
General Domains        MMLU                 acc      89.71
Instruction Following  IFEval               acc      89.65
Math Reasoning         MATH500              acc      96.40
General Reasoning      DROP                 F1       79.06
Coding                 HumanEval+           pass@1   88.41
Agentic Tool Use       τ²-Bench (telecom)   avg@4    73.68

UNO-Bench: Unified All-Modality Benchmark

The LongCat team has released UNO-Bench, a unified, high-quality all-modality benchmark designed to evaluate both single-modality and omni-modality intelligence under one framework, with strong Chinese-language support. Its results also reveal a Combination Law governing omni-modality performance.

Why UNO-Bench?

  • One-stop benchmark: evaluates image, audio, video, and text, plus their fusion
  • High-quality curation: 1,250 omni-modality samples and 2,480 single-modality samples; 98% of omni samples require cross-modal fusion to solve
  • Chinese-centric: robust coverage of Chinese-language scenarios and tasks
  • Open-ended reasoning: Multi-step Open-ended (MO) questions graded with weighted human rubrics, plus an automatic scoring model (95% accuracy); see the sketch after this list
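
To make the grading scheme concrete, here is a minimal sketch of how per-step scores for an MO question could be combined under weights. The step weights, the scoring interface, and the helper name mo_score are illustrative assumptions, not UNO-Bench's actual implementation:

```python
# A minimal sketch of weighted grading for Multi-step Open-ended (MO) questions.
# Step weights and the scoring interface are assumptions for illustration only.
from typing import Sequence

def mo_score(step_scores: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted average of per-step scores, each graded in [0, 1]."""
    assert len(step_scores) == len(weights)
    return sum(s * w for s, w in zip(step_scores, weights)) / sum(weights)

# Example: a 3-step answer where later reasoning steps carry more weight.
print(mo_score([1.0, 0.5, 0.0], weights=[1, 2, 3]))  # 0.333...
```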

Combination Law (Synergistic Promotion)

Omni-modality performance follows a power law over single-modality perception abilities (audio and vision):

P_Omni ≈ 1.0332 · (P_A × P_V)^2.1918 + 0.2422

where P_A and P_V are the single-modality audio and visual perception scores, expressed as fractions in [0, 1].

  • Bottleneck effect: when either single-modality score is low, omni performance grows slowly
  • Synergistic gain: once both scores are high, omni performance improves at an accelerating rate ("1 + 1 >> 2"); see the worked example after this list
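
As a worked illustration of the bottleneck and synergy effects, the sketch below plugs hypothetical single-modality scores into the fitted law. The coefficients are from the report; the helper name predict_omni and the example scores are assumptions for illustration only:

```python
# Worked illustration of the Combination Law. Coefficients come from the report;
# the helper name and example scores are hypothetical.

def predict_omni(p_audio: float, p_visual: float) -> float:
    """Predicted omni-modality score from single-modality scores in [0, 1]."""
    return 1.0332 * (p_audio * p_visual) ** 2.1918 + 0.2422

# Bottleneck effect: a weak model gains little from a +0.05 boost per modality.
weak, weak_up = predict_omni(0.40, 0.40), predict_omni(0.45, 0.45)
# Synergistic gain: the same boost yields roughly 10x the gain at the top end.
strong, strong_up = predict_omni(0.85, 0.85), predict_omni(0.90, 0.90)

print(f"weak:   {weak:.3f} -> {weak_up:.3f}  (Δ {weak_up - weak:+.3f})")        # Δ ≈ +0.013
print(f"strong: {strong:.3f} -> {strong_up:.3f}  (Δ {strong_up - strong:+.3f})")  # Δ ≈ +0.144
```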

Data Pipeline & Quality

  • Manual curation to avoid contamination; more than 90% of visuals are private, crowdsourced assets
  • Audio-visual decoupling: audio is designed and recorded independently of the paired video, forcing genuine cross-modal fusion
  • Ablation checks: removing individual modalities verifies that samples truly require cross-modal fusion (≥98% pass)
  • Cluster-guided sampling: over 90% compute reduction while preserving rank consistency (SRCC/PLCC > 0.98); see the sketch after this list
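
Below is a minimal sketch of what cluster-guided subset sampling can look like, assuming per-sample embeddings are available. The function names, the embedding source, and KMeans as the clustering step are assumptions, not UNO-Bench's published pipeline:

```python
# A sketch of cluster-guided subset sampling (assumed approach, not the official one).
# Requires numpy, scikit-learn, and scipy >= 1.9 (for the .statistic attribute).
import numpy as np
from sklearn.cluster import KMeans
from scipy.stats import spearmanr, pearsonr

def select_subset(embeddings: np.ndarray, n_clusters: int) -> np.ndarray:
    """Pick one representative sample per cluster (nearest to each centroid)."""
    km = KMeans(n_clusters=n_clusters, n_init="auto", random_state=0).fit(embeddings)
    reps = []
    for c in range(n_clusters):
        members = np.flatnonzero(km.labels_ == c)
        dists = np.linalg.norm(embeddings[members] - km.cluster_centers_[c], axis=1)
        reps.append(members[np.argmin(dists)])
    return np.array(reps)

def rank_consistency(full_scores, subset_scores):
    """Check that subset scores preserve full-set model rankings.

    UNO-Bench reports SRCC/PLCC > 0.98 at >90% compute savings; the score
    vectors here (one entry per evaluated model) are assumed inputs.
    """
    srcc = spearmanr(full_scores, subset_scores).statistic
    plcc = pearsonr(full_scores, subset_scores).statistic
    return srcc, plcc
```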

Highlights

  • Omni SOTA: LongCat-Flash-Omni leads among open-source models on UNO-Bench
  • Reasoning gap: spatial/temporal/complex reasoning remains the key separator among top models

Multi-Modal Leaderboard

LongCat-Flash-Omni achieves open-source SOTA across modalities.

Suite        Metric   LongCat-Flash-Omni   Qwen3-Omni   Gemini-2.5-Flash   Gemini-2.5-Pro
Omni-Bench   avg      SOTA                 -            -                  -
WorldSense   avg      SOTA                 -            -                  -

Placeholder entries ("SOTA" / "-") will be replaced with exact numbers as they become available.

Model-Specific Benchmarks

LongCat-Image

Image Editing (Open-source SOTA)

  • ImgEdit-Bench: 4.50 (open-source SOTA, approaching top closed-source models)
  • GEdit-Bench: Chinese 7.60 / English 7.64 (open-source SOTA)

Text-to-Image

  • GenEval: 0.87
  • DPG-Bench: 86.8
  • Competitive with top open-source and closed-source models

Chinese Text Rendering (Leading Performance)

  • ChineseWord: 90.7 (leading all evaluated models by a significant margin)
  • Character coverage: All 8,105 standard Chinese characters
  • Supports common characters, rare characters, variant forms, and calligraphy styles

Subjective Evaluation

  • Text-to-image (mean opinion score, MOS): excellent realism relative to mainstream open- and closed-source models; open-source SOTA in text-image alignment and reasonableness
  • Image editing (side-by-side, SBS): significantly outperforms other open-source solutions; competitive with commercial models such as Nano Banana and Seedream 4.0

Flash-Thinking

  • AIME25: 64.5% token savings (from 19,653 tokens down to 6,965; see the arithmetic check below)
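
As a quick sanity check on the reported savings, the direct ratio of the two published token counts is computed below; the report's 64.5% presumably reflects rounding or per-instance averaging:

```python
# Direct ratio of the published AIME25 token counts.
baseline, with_thinking = 19_653, 6_965
savings = (baseline - with_thinking) / baseline
print(f"{savings:.2%}")  # 64.56% — consistent with the reported 64.5%
```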

Values are summarized from public reports; consult the official resources for full details and evaluation conditions.