LongCat-Flash-Lite

N-gram embedding expansion for lightweight MoE (Released: 2026)

Overview

LongCat-Flash-Lite is a lightweight Mixture-of-Experts (MoE) model that explores a new scaling direction: N-gram embedding expansion. Rather than relying primarily on adding more experts, it allocates a large share of its total parameters to an N-gram embedding layer that improves local-context semantic capture, while dynamic activation keeps inference sparse.

Key Specs

  • Total parameters: 68.5B
  • Activated per token: ~2.9B–4.5B (dynamic activation)
  • Embedding allocation: 31.4B (46%) to N-gram embedding layer
  • Context length: up to 256K tokens (via YaRN)
  • Throughput: 500–700 tokens/s (typical load: 4K input / 1K output, LongCat API)
  • Strengths: Agentic tool use and coding

Technology Highlights

N-gram Embedding Layer

The N-gram embedding layer enhances the model’s ability to capture local context semantics. Using a hash function, the current token together with its preceding N-1 tokens is mapped into a single N-gram embedding vector, which is then fused with the token’s base embedding.
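
A minimal PyTorch sketch of the idea follows. The rolling polynomial hash, the table size, and the additive fusion rule are illustrative assumptions, not the model's documented implementation.

    import torch
    import torch.nn as nn

    class NGramEmbedding(nn.Module):
        """Hash (current token + N-1 predecessors) into an N-gram table and
        fuse the result with the base token embedding."""
        def __init__(self, vocab_size, ngram_table_size, dim, n=3):
            super().__init__()
            self.n = n
            self.ngram_table_size = ngram_table_size
            self.token_emb = nn.Embedding(vocab_size, dim)        # base embeddings
            self.ngram_emb = nn.Embedding(ngram_table_size, dim)  # hashed N-gram table
            # Fixed random multipliers give a cheap polynomial hash (assumed scheme).
            self.register_buffer(
                "mults", torch.randint(1, 2**31 - 1, (n,), dtype=torch.long))

        def ngram_ids(self, token_ids):
            B, T = token_ids.shape
            # Left-pad with 0 so every position has N-1 predecessors.
            padded = torch.nn.functional.pad(token_ids, (self.n - 1, 0))
            h = torch.zeros(B, T, dtype=torch.long, device=token_ids.device)
            for k in range(self.n):
                h = h + padded[:, k:k + T] * self.mults[k]
            return h % self.ngram_table_size  # collisions are possible by design

        def forward(self, token_ids):
            base = self.token_emb(token_ids)                  # per-token embedding
            ngram = self.ngram_emb(self.ngram_ids(token_ids))
            return base + ngram                               # additive fusion (assumed)

    emb = NGramEmbedding(vocab_size=32000, ngram_table_size=1 << 20, dim=64)
    x = torch.randint(0, 32000, (2, 16))  # (batch, seq_len) token ids
    print(emb(x).shape)                   # torch.Size([2, 16, 64])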

To reduce hash collisions, the design includes:

  • Sub-table decomposition + linear projection: split a large embedding table into multiple sub-tables and project each separately (see the sketch after this list)
  • Vocabulary size design: carefully select table sizes to lower collision probability
  • Embedding amplification: scaling or normalization before output to keep the signal effective through residual paths
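
The sub-table idea can be sketched as below: with several independently hashed sub-tables, a damaging collision requires the same pair of N-grams to clash in every sub-table at once, which is far less likely than in one monolithic table. Table counts, sizes, and dimensions here are illustrative guesses, not the model's actual configuration.

    import torch
    import torch.nn as nn

    class SubTableNGramEmbedding(nn.Module):
        """Several small, independently hashed sub-tables instead of one big one."""
        def __init__(self, num_tables=4, table_size=1 << 18, sub_dim=32, out_dim=64):
            super().__init__()
            self.table_size = table_size
            self.tables = nn.ModuleList(
                nn.Embedding(table_size, sub_dim) for _ in range(num_tables))
            # Per-table projection, then sum: equivalent to concatenating the
            # sub-embeddings and applying one wide linear layer.
            self.projs = nn.ModuleList(
                nn.Linear(sub_dim, out_dim, bias=False) for _ in range(num_tables))
            # Distinct multipliers so sub-tables collide on different N-grams.
            self.register_buffer(
                "seeds", torch.randint(1, 2**31 - 1, (num_tables,), dtype=torch.long))

        def forward(self, raw_hashes):
            """raw_hashes: (batch, seq_len) un-modded N-gram hash values."""
            out = 0
            for table, proj, seed in zip(self.tables, self.projs, self.seeds):
                idx = (raw_hashes * seed) % self.table_size  # re-hash per table
                out = out + proj(table(idx))
            return out

    layer = SubTableNGramEmbedding()
    raw = torch.randint(0, 2**30, (2, 16), dtype=torch.long)
    print(layer(raw).shape)  # torch.Size([2, 16, 64])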

System Co-Design for Speed

Despite the large total parameter count, LongCat-Flash-Lite pairs sparse activation with system-level optimizations that convert theoretical sparsity gains into real throughput.

  • Parameter allocation: shift parameters into O(1) embedding lookup to reduce compute growth and expert communication overhead
  • N-gram Cache + kernel fusion: GPU-managed N-gram ID caching and fused CUDA kernels to reduce I/O latency and improve utilization (a simplified cache is sketched after this list)
  • Speculative decoding collaboration: co-design with speculative decoding; draft model uses standard embeddings to avoid N-gram lookup overhead
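
A rough illustration of the caching idea: during autoregressive decoding only one token arrives per step, so keeping each sequence's last N-1 token IDs is enough to produce the new N-gram ID without re-hashing the whole prefix. The class name and hashing scheme are hypothetical; the real system keeps this state on the GPU inside fused CUDA kernels.

    import torch

    class NGramIDCache:
        """Incrementally compute N-gram ids during decoding."""
        def __init__(self, batch_size, n, table_size, device="cpu"):
            self.table_size = table_size
            self.mults = torch.randint(1, 2**31 - 1, (n,), dtype=torch.long,
                                       device=device)
            # Rolling window of each sequence's last n-1 token ids (0 = pad).
            self.window = torch.zeros(batch_size, n - 1, dtype=torch.long,
                                      device=device)

        def step(self, new_tokens):
            """new_tokens: (batch,) ids of this decode step -> (batch,) N-gram ids."""
            ngram = torch.cat([self.window, new_tokens[:, None]], dim=1)  # (B, n)
            ids = (ngram * self.mults).sum(dim=1) % self.table_size
            self.window = ngram[:, 1:]  # slide: drop oldest, keep newest n-1
            return ids

    cache = NGramIDCache(batch_size=2, n=3, table_size=1 << 20)
    for _ in range(4):                         # simulate four decode steps
        tok = torch.randint(0, 32000, (2,))
        print(cache.step(tok))                 # one hashed id per sequence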

Benchmark Highlights

Agentic Tool Use

  • τ²-Bench: Telecom 72.8, Retail 73.1, Aviation 58.0 (highest among compared models)
  • VitaBench: 7.0 (leading)

Coding

  • SWE-Bench: 54.4% (code fixing)
  • Terminal-Bench: 33.75 (terminal command execution)
  • SWE-Bench Multilingual: 38.10%

General Knowledge & Reasoning

  • MMLU: 85.52
  • C-Eval / CMMLU: 86.55 / 82.48
  • MMLU-Pro / GPQA-Diamond: 78.29 / 66.78
  • MATH-500: 96.80%
  • AIME24 / AIME25: 72.19 / 63.23