LongCat-Next Released
Native discrete multimodal modeling: vision and speech become first-class languages of AI.
Meituan LongCat officially releases and open-sources LongCat-Next and its dNaViT tokenizer. The core idea is to unify image, audio, and text into the same discrete token space, and train a single autoregressive model with pure Next Token Prediction (NTP).
From Language-Centered AI to Physical-World AI
Physical reality is naturally multimodal: visual scenes, speech signals, and symbolic text are all different projections of the same world. LongCat-Next starts from a foundational question: can AI process multimodal physical signals with the same elegance and efficiency as language? The LongCat team argues that if this is possible, then tokenization should not be limited to text. Instead, the token becomes a native representation for diverse physical signals.
This release presents an early but concrete answer: under a unified modeling framework and objective, multimodal signals can be mapped into a semantically rich discrete space, then learned with one autoregressive predictor. In short, LongCat-Next turns multimodal learning into one universal task: predict the next token.
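Under this framing, a single loss covers every modality. The following is a minimal sketch of a shared next-token objective over a toy unified vocabulary; the id ranges and shapes are illustrative assumptions, not LongCat's actual token layout:

```python
import numpy as np

def next_token_loss(logits, tokens):
    """Average cross-entropy of predicting tokens[t+1] from position t.

    logits: (T, V) array of unnormalized scores, one row per position.
    tokens: (T,) array of token ids drawn from the shared vocabulary.
    """
    # Shift by one: position t predicts token t+1.
    logits, targets = logits[:-1], tokens[1:]
    # Log-softmax, computed stably.
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

# Toy shared vocabulary: ids 0-9 "text", 10-19 "image", 20-29 "audio".
rng = np.random.default_rng(0)
seq = np.array([3, 7, 12, 15, 11, 24, 21, 5])  # interleaved modalities
logits = rng.normal(size=(len(seq), 30))
loss = next_token_loss(logits, seq)            # one objective for all tokens
```

The point of the sketch is that nothing in the loss distinguishes modalities: once signals live in one discrete id space, "understand" and "generate" both reduce to this same prediction problem.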
Core Technologies
1) DiNA: Discrete Native Autoregressive
Mainstream multimodal systems often rely on a "language core + external modality modules" pattern. This can create structural fragmentation: understanding and generation use different pipelines, losses, and optimization dynamics. DiNA is designed to remove that split.
- One model, one objective: a shared autoregressive backbone for text, image, and speech
- Understanding/generation symmetry: image-to-text and text-to-image become the same token prediction problem
- Modality internalization: model behavior suggests fused multimodal representation instead of loose alignment
In the release, LongCat-Next is trained on top of a Flash-Lite MoE backbone (68.5B total parameters, around 3B activated). The team reports that under DiNA training, MoE routing gradually shows modality specialization while preserving a shared representation space.
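The routing mechanism behind such specialization can be sketched generically. The snippet below is an illustration of top-k MoE gating only; the expert count and k value are arbitrary assumptions, not LongCat's actual router configuration:

```python
import numpy as np

def top_k_route(scores, k=2):
    """Pick the k highest-scoring experts for one token and renormalize
    their gate weights with a softmax; all other experts get weight 0."""
    top = np.argsort(scores)[::-1][:k]          # indices of the k best experts
    w = np.exp(scores[top] - scores[top].max()) # stable softmax over the top-k
    return top, w / w.sum()

rng = np.random.default_rng(2)
router_scores = rng.normal(size=64)             # one score per expert (64 assumed)
experts, gates = top_k_route(router_scores, k=2)
```

Modality specialization, in this picture, would show up as different token modalities systematically selecting different expert subsets while the non-expert layers keep a shared representation.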
A key empirical signal from ablations is that the unified setup reduces the historical conflict between understanding and generation objectives: understanding loss rises only marginally versus pure-understanding training, while generation loss improves versus pure-generation training.
2) dNaViT: Discrete Native Vision Tokenizer
If DiNA answers how to model, dNaViT answers how to discretize images faithfully. Like a tokenizer in language modeling, dNaViT converts images into meaningful visual tokens and supports an image-to-token-to-image loop for both understanding and generation.
- Native arbitrary resolution: images are encoded and decoded without forced resizing, cropping, or padding
- 8-layer RVQ: residual vector quantization for high compression while preserving detail
- Dual-path detokenization: structure-first reconstruction followed by detail refinement, for stable rendering quality
The native-resolution strategy is particularly relevant for OCR-heavy and chart-heavy workloads where fine visual details matter. The layered RVQ design progressively fits residual information and enables strong compression while retaining reconstruction quality.
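The layer-by-layer residual idea is simple to sketch. The toy example below illustrates generic residual vector quantization only; the codebook sizes, vector dimension, and random codebooks are assumptions for illustration, not dNaViT's actual configuration:

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Residual vector quantization: each layer quantizes what the
    previous layers failed to capture.

    x: (D,) feature vector; codebooks: list of (K, D) arrays.
    Returns the chosen code indices and the reconstruction.
    """
    indices, recon = [], np.zeros_like(x)
    residual = x.copy()
    for cb in codebooks:
        # Nearest code to the current residual.
        i = int(np.argmin(((cb - residual) ** 2).sum(axis=1)))
        indices.append(i)
        recon += cb[i]
        residual = x - recon  # what is still unexplained
    return indices, recon

rng = np.random.default_rng(1)
x = rng.normal(size=16)
codebooks = [rng.normal(size=(256, 16)) for _ in range(8)]  # 8 layers, as in dNaViT
idx, recon = rvq_encode(x, codebooks)
```

Each layer stores only an index, so the representation stays compact, while successive layers add back progressively finer detail to the reconstruction.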
3) Semantically Complete Discrete Representation
A common belief is that discrete modeling is fundamentally weaker than continuous representations on fine-grained tasks. LongCat-Next reframes this: the true bottleneck is not "discrete vs continuous", but whether the discrete tokens are semantically complete.
- LongCat argues discrete modeling is not inherently weak; the key is semantic completeness
- With strong representation foundations and RVQ, discrete tokens can preserve semantics and fine-grained details
- Results indicate discrete tokens can support both high-quality understanding and high-fidelity generation
The release highlights that with suitable representation pretraining and hierarchical quantization, discrete tokens can carry high-level semantics (concepts, relations, reasoning cues) and low-level cues (color, texture, geometry), allowing one representation family to support both analysis and synthesis.
Reported Performance Highlights
LongCat-Next is presented as a unified model spanning visual understanding, image generation, audio, tool use, and coding. The release reports strong competitiveness, and in some tasks, leading results:
- Visual understanding: strong OCR/document/chart performance, including competitive OmniDocBench and OCRBench behavior
- Understanding + generation synergy: unified training shows measurable generation gains without sacrificing understanding quality
- Language capability: MMLU-Pro 77.02 and C-Eval 86.80 are reported while retaining native multimodal training
- Agent and coding: strong tool-use behavior (e.g., τ²-Bench retail 73.68) and SWE-Bench 43.0
- Audio: competitive TTS and audio understanding results with low-latency text-speech generation support
Two Illustrative Cases Shared in the Release
Case 1: Symbolic Reasoning Pattern Completion
The release includes a cross-number reasoning puzzle to demonstrate structured reasoning in multimodal contexts. The inferred rule is:
center = (top + bottom) + (left - right).
Applying this to the third cross yields:
(13 + 8) + (11 - 7) = 25.
This example is used to show that the model can identify latent arithmetic structure rather than relying on shallow matching.
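The inferred rule is easy to verify mechanically; a throwaway check of the arithmetic from the release:

```python
def center(top, bottom, left, right):
    # Rule inferred in the release: center = (top + bottom) + (left - right)
    return (top + bottom) + (left - right)

# Third cross from the case: (13 + 8) + (11 - 7) = 21 + 4 = 25
result = center(13, 8, 11, 7)
```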
Case 2: Does Understanding Help Generation?
Another visual analysis case compares a unified model against a pure-generation baseline under token-scaled settings. Reported curves show lower image loss for the unified setting over much of training, with a highlighted gap around delta = 0.0213 in a zoomed interval.
The interpretation presented in the release is that multimodal understanding signals can improve generative quality and optimization efficiency, rather than being an isolated auxiliary objective.
Why This Matters
- Architectural simplicity: one token space and one objective reduce cross-module inconsistency
- Training/deployment practicality: a single modeling paradigm can simplify scaling and productization
- Path to physical-world intelligence: moves from modality attachment toward modality-native internal learning
- Open research value: releases model and tokenizer so the community can validate and extend the approach
Open-Source Resources
- Paper: tech_report.pdf
- GitHub: meituan-longcat/LongCat-Next
- Hugging Face: LongCat-Next
- Demo: longcat.chat/longcat-next
- Blog: longcat-next/intro
Closing Note
LongCat positions this as an early but meaningful milestone: validating the potential of a native discrete architecture at relatively small scale, while leaving many high-impact directions open for community collaboration. The broader goal is clear: build AI that can natively see, hear, and reason about the physical world with the same fluency language models brought to text.