Benchmarks

Performance metrics and comparisons for LongCat AI models

Text-Based Benchmarks

Representative results reported by the authors (non-exhaustive):

Category                Benchmark            Metric   LongCat-Flash
General Domains         MMLU                 acc      89.71
Instruction Following   IFEval               acc      89.65
Math Reasoning          MATH500              acc      96.40
General Reasoning       DROP                 F1       79.06
Coding                  HumanEval+           pass@1   88.41
Agentic Tool Use        τ²-Bench (telecom)   avg@4    73.68

UNO-Bench: Unified All-Modality Benchmark

The LongCat team releases UNO-Bench, a unified, high-quality all-modality benchmark designed to evaluate both single-modality and omni-modality intelligence under one framework, with strong Chinese-language support. It also reveals a Combination Law governing omni-modality performance.

Why UNO-Bench?

  • One-stop benchmark: evaluates image, audio, video, and text, plus their fusion
  • High-quality curation: 1,250 omni samples and 2,480 single-modality samples; 98% of omni samples require cross-modal fusion
  • Chinese-centric: robust Chinese-language scenarios and tasks
  • Open-ended reasoning: Multi-step Open-ended (MO) questions with weighted human grading and an automatic scoring model (95% accuracy)

Combination Law (Synergistic Promotion)

Omni-modality performance follows a power law over the single-modality perception abilities for audio (P_A) and vision (P_V):

P_omni ≈ 1.0332 · (P_A × P_V)^2.1918 + 0.2422
  • Bottleneck effect: weaker models improve slowly, held back by their weaker modality
  • Synergistic gain: stronger models exhibit accelerated improvement; 1 + 1 >> 2
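
To make the law concrete, here is a minimal numeric sketch: the coefficients come from the formula above, while the example perception scores are hypothetical.

```python
# Sketch of the fitted Combination Law. Coefficients are from the
# UNO-Bench report; the example scores below are made up.

def predict_omni(p_audio: float, p_visual: float) -> float:
    """Predict omni-modality performance from single-modality
    perception scores (all in [0, 1]) using the reported fit."""
    return 1.0332 * (p_audio * p_visual) ** 2.1918 + 0.2422

# Bottleneck effect: a weak modality drags the product down.
print(predict_omni(0.50, 0.90))  # ~0.42 -- weak audio limits the gain
# Synergistic gain: jointly strong modalities improve super-linearly.
print(predict_omni(0.90, 0.90))  # ~0.89 -- the 1 + 1 >> 2 regime
```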

Data Pipeline & Quality

  • Manual curation to avoid contamination; >90% private, crowdsourced visuals
  • Audio-visual decoupling: independently designed/recorded audio paired with video to force real fusion
  • Ablation checks: modality removal verifies cross-modal solvability (≥98%)
  • Cluster-guided sampling: >90% compute reduction with rank consistency (SRCC/PLCC > 0.98)
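
The rank-consistency check in the last bullet can be reproduced with standard correlation measures. A minimal sketch, assuming per-model scores on the full benchmark and on the cluster-guided subset (all scores below are hypothetical):

```python
# Rank-consistency check for cluster-guided sampling.
# Model scores are hypothetical; the thresholds follow the
# UNO-Bench description (SRCC/PLCC > 0.98).
from scipy.stats import spearmanr, pearsonr

# Per-model accuracy on the full benchmark vs. the sampled subset.
full_set_scores = [0.72, 0.65, 0.58, 0.81, 0.49]  # hypothetical
subset_scores   = [0.71, 0.66, 0.57, 0.80, 0.50]  # hypothetical

srcc, _ = spearmanr(full_set_scores, subset_scores)  # rank correlation
plcc, _ = pearsonr(full_set_scores, subset_scores)   # linear correlation

# The subset is a faithful proxy if both exceed 0.98.
print(f"SRCC={srcc:.3f}, PLCC={plcc:.3f}")
```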

Highlights

  • Omni SOTA: LongCat-Flash-Omni leads among open-source models on UNO-Bench
  • Reasoning gap: spatial/temporal/complex reasoning remains the key separator among top models

LARYBench: Latent Action Representation Benchmark

LARYBench (Latent Action Representation Yielding Benchmark) is a large-scale benchmark designed to evaluate latent action representations learned from visual sequences—aiming to be an “ImageNet for embodied action representations”. It decouples representation quality from policy learning and evaluates generalization across action granularities and embodiments.

  • Multi-granularity: semantic action classification + proprioceptive action regression
  • Cross-embodiment: diverse robot morphologies and human video sources
  • Probing protocol: shallow heads measure the quality of action embedding z
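
A minimal sketch of the shallow-probe idea: freeze the learned embedding z and fit a lightweight head per task, so the scores reflect representation quality rather than policy-learning capacity. The data, dimensions, and heads below are hypothetical; the benchmark's actual protocol may differ.

```python
# Linear-probe sketch over frozen action embeddings z.
# All data and dimensions are hypothetical placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression, Ridge

rng = np.random.default_rng(0)
z_train = rng.normal(size=(1000, 256))   # frozen embeddings from visual sequences
y_cls = rng.integers(0, 10, size=1000)   # semantic action labels (classification)
y_reg = rng.normal(size=(1000, 7))       # proprioceptive targets, e.g. 7-DoF actions

# Shallow heads only: they probe the quality of z, not a full policy.
cls_head = LogisticRegression(max_iter=1000).fit(z_train, y_cls)
reg_head = Ridge().fit(z_train, y_reg)

print("probe accuracy:", cls_head.score(z_train, y_cls))
print("probe R^2:     ", reg_head.score(z_train, y_reg))
```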

Multi-Modal Leaderboard

LongCat-Flash-Omni achieves open-source SOTA across modalities.

Suite        Metric   LongCat-Flash-Omni   Qwen3-Omni   Gemini-2.5-Flash   Gemini-2.5-Pro
Omni-Bench   avg      SOTA                 -            -                  -
WorldSense   avg      SOTA                 -            -                  -

Scores are shown as placeholders; exact numbers will be added when available.

Model-Specific Benchmarks

LongCat-Image

Image Editing (Open-source SOTA)

  • ImgEdit-Bench: 4.50 (open-source SOTA, approaching top closed-source models)
  • GEdit-Bench: Chinese 7.60 / English 7.64 (open-source SOTA)

Text-to-Image

  • GenEval: 0.87
  • DPG-Bench: 86.8
  • Competitive with top open-source and closed-source models

Chinese Text Rendering (Leading Performance)

  • ChineseWord: 90.7 (leads all evaluated models by a wide margin)
  • Character coverage: All 8,105 standard Chinese characters
  • Supports common characters, rare characters, variant forms, and calligraphy styles

Subjective Evaluation

  • Text-to-image (MOS, mean opinion score): excellent realism relative to mainstream open- and closed-source models; open-source SOTA in text-image alignment and reasonableness
  • Image editing (SBS, side-by-side comparison): significantly outperforms other open-source solutions; competitive with commercial models such as Nano Banana and Seedream 4.0

Flash-Thinking

  • AIME25: 64.5% token savings (from 19,653 tokens down to 6,965)
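
As a quick arithmetic check, the savings percentage follows directly from the two reported token counts:

```python
# Sanity-check the reported savings from the published token counts.
before, after = 19_653, 6_965
savings = (before - after) / before
print(f"{savings:.2%}")  # 64.56%, i.e. the reported ~64.5%
```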

Values summarized from public reports; please consult the official resources for full details and conditions.