Benchmarks

Performance metrics and comparisons for LongCat AI models

Text-Based Benchmarks

Representative results reported by the authors (non-exhaustive):

Category                Benchmark            Metric   LongCat-Flash
General Domains         MMLU                 acc      89.71
Instruction Following   IFEval               acc      89.65
Math Reasoning          MATH500              acc      96.40
General Reasoning       DROP                 F1       79.06
Coding                  HumanEval+           pass@1   88.41
Agentic Tool Use        τ²-Bench (telecom)   avg@4    73.68

UNO-Bench: Unified All-Modality Benchmark

The LongCat team releases UNO-Bench, a unified, high-quality all-modality benchmark designed to evaluate both single-modality and omni-modality intelligence under one framework, with strong Chinese-language support. It also reveals a Combination Law governing omni-modality performance.

Why UNO-Bench?

  • One-stop benchmark: evaluates image, audio, video, and text, plus their fusion
  • High-quality curation: 1,250 omni samples and 2,480 single-modality samples; 98% of omni samples require cross-modal fusion
  • Chinese-centric: robust Chinese-language scenarios and tasks
  • Open-ended reasoning: Multi-step Open-ended (MO) questions with weighted human grading and an automatic scoring model (95% accuracy)

Combination Law (Synergistic Promotion)

Omni-modality performance follows a power law over the single-modality perception abilities for audio (P_A) and vision (P_V):

P_omni ≈ 1.0332 · (P_A × P_V)^2.1918 + 0.2422
  • Bottleneck effect: weaker models improve slowly, held back by their weaker modality
  • Synergistic gain: stronger models exhibit accelerated improvement; 1 + 1 >> 2
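
To make the law concrete, here is a minimal numeric sketch: the coefficients come from the formula above, while the example perception scores are hypothetical.

```python
# Sketch of the fitted Combination Law. Coefficients are from the
# UNO-Bench report; the example scores below are made up.

def predict_omni(p_audio: float, p_visual: float) -> float:
    """Predict omni-modality performance from single-modality
    perception scores (all in [0, 1]) using the reported fit."""
    return 1.0332 * (p_audio * p_visual) ** 2.1918 + 0.2422

# Bottleneck effect: a weak modality drags the product down.
print(predict_omni(0.50, 0.90))  # ~0.42 -- weak audio limits the gain
# Synergistic gain: jointly strong modalities improve super-linearly.
print(predict_omni(0.90, 0.90))  # ~0.89 -- the 1 + 1 >> 2 regime
```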

Data Pipeline & Quality

  • Manual curation to avoid contamination; >90% private, crowdsourced visuals
  • Audio-visual decoupling: independently designed/recorded audio paired with video to force real fusion
  • Ablation checks: modality removal verifies cross-modal solvability (≥98%)
  • Cluster-guided sampling: >90% compute reduction with rank consistency (SRCC/PLCC > 0.98)
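
The rank-consistency check in the last bullet can be reproduced with standard correlation measures. A minimal sketch, assuming per-model scores on the full benchmark and on the cluster-guided subset (all scores below are hypothetical):

```python
# Rank-consistency check for cluster-guided sampling.
# Model scores are hypothetical; the thresholds follow the
# UNO-Bench description (SRCC/PLCC > 0.98).
from scipy.stats import spearmanr, pearsonr

# Per-model accuracy on the full benchmark vs. the sampled subset.
full_set_scores = [0.72, 0.65, 0.58, 0.81, 0.49]  # hypothetical
subset_scores   = [0.71, 0.66, 0.57, 0.80, 0.50]  # hypothetical

srcc, _ = spearmanr(full_set_scores, subset_scores)  # rank correlation
plcc, _ = pearsonr(full_set_scores, subset_scores)   # linear correlation

# The subset is a faithful proxy if both exceed 0.98.
print(f"SRCC={srcc:.3f}, PLCC={plcc:.3f}")
```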

Highlights

  • Omni SOTA: LongCat-Flash-Omni leads among open-source models on UNO-Bench
  • Reasoning gap: spatial/temporal/complex reasoning remains the key separator among top models

LARYBench: Latent Action Representation Benchmark

LARYBench (Latent Action Representation Yielding Benchmark) is a large-scale benchmark designed to evaluate latent action representations learned from visual sequences—aiming to be an “ImageNet for embodied action representations”. It decouples representation quality from policy learning and evaluates generalization across action granularities and embodiments.

  • Multi-granularity: semantic action classification + proprioceptive action regression
  • Cross-embodiment: diverse robot morphologies and human video sources
  • Probing protocol: shallow heads measure the quality of action embedding z
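
A minimal sketch of the shallow-probe idea: freeze the learned embedding z and fit a lightweight head per task, so the scores reflect representation quality rather than policy-learning capacity. The data, dimensions, and heads below are hypothetical; the benchmark's actual protocol may differ.

```python
# Linear-probe sketch over frozen action embeddings z.
# All data and dimensions are hypothetical placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression, Ridge

rng = np.random.default_rng(0)
z_train = rng.normal(size=(1000, 256))   # frozen embeddings from visual sequences
y_cls = rng.integers(0, 10, size=1000)   # semantic action labels (classification)
y_reg = rng.normal(size=(1000, 7))       # proprioceptive targets, e.g. 7-DoF actions

# Shallow heads only: they probe the quality of z, not a full policy.
cls_head = LogisticRegression(max_iter=1000).fit(z_train, y_cls)
reg_head = Ridge().fit(z_train, y_reg)

print("probe accuracy:", cls_head.score(z_train, y_cls))
print("probe R^2:     ", reg_head.score(z_train, y_reg))
```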

Multi-Modal Leaderboard

LongCat-Flash-Omni achieves open-source SOTA across modalities.

Suite        Metric   LongCat-Flash-Omni   Qwen3-Omni   Gemini-2.5-Flash   Gemini-2.5-Pro
Omni-Bench   avg      SOTA                 -            -                  -
WorldSense   avg      SOTA                 -            -                  -

Scores are shown as placeholders; exact numbers will be added when available.

Model-Specific Benchmarks

LongCat-Image

Image Editing (Open-source SOTA)

  • ImgEdit-Bench: 4.50 (open-source SOTA, approaching top closed-source models)
  • GEdit-Bench: Chinese 7.60 / English 7.64 (open-source SOTA)

Text-to-Image

  • GenEval: 0.87
  • DPG-Bench: 86.8
  • Competitive with top open-source and closed-source models

Chinese Text Rendering (Leading Performance)

  • ChineseWord: 90.7 (leads all evaluated models by a wide margin)
  • Character coverage: All 8,105 standard Chinese characters
  • Supports common characters, rare characters, variant forms, and calligraphy styles

Subjective Evaluation

  • Text-to-image (MOS, mean opinion score): excellent realism relative to mainstream open- and closed-source models; open-source SOTA in text-image alignment and reasonableness
  • Image editing (SBS, side-by-side comparison): significantly outperforms other open-source solutions; competitive with commercial models such as Nano Banana and Seedream 4.0

Flash-Thinking

  • AIME25: 64.5% token savings (from 19,653 tokens down to 6,965)
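
As a quick arithmetic check, the savings percentage follows directly from the two reported token counts:

```python
# Sanity-check the reported savings from the published token counts.
before, after = 19_653, 6_965
savings = (before - after) / before
print(f"{savings:.2%}")  # 64.56%, i.e. the reported ~64.5%
```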

Values summarized from public reports; please consult the official resources for full details and conditions.