LongCat-Video-Avatar 1.5 Released
From high-fidelity to truly usable: commercial-grade digital human video generation, now open source.
Meituan LongCat has open-sourced LongCat-Video-Avatar 1.5, a digital human video model that moves from open-source SOTA toward commercial-grade use. It improves lip synchronization, physical plausibility, long-video stability, multi-person interaction, and inference efficiency, producing stable, natural, high-quality content even in complex business scenarios.
Three Capability Upgrades
Commercial-Grade Core Experience
Lip motion is more precise and smoother on long sentences, fast speech, and singing. Facial expressions, head pose, and body movements are better coordinated, giving a natural, stable overall performance.
Richer Open-Domain Scenes
A high-quality data system enables stable handling of real humans, anime, virtual idols, animals, and more. Multi-person dialogue is more natural, with accurate speaker/listener distinction.
Efficient Inference
DMD (Distribution Matching Distillation) reduces sampling from 50 steps to 8, for a reported speedup of about 15×. A shared base model with LoRA adapters replaces parallel deployment of three separate models, cutting VRAM usage. A 10-second video generates in about 1 minute.
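The VRAM saving from a shared base with LoRA adapters follows from the parameter counts. The sketch below is a minimal pure-Python illustration of the general LoRA idea, not the released implementation: the layer shape, rank, `alpha`, and all function names are hypothetical, and the announcement does not say which layers carry adapters.

```python
from dataclasses import dataclass


@dataclass
class LoRAAdapter:
    """Low-rank update delta_W = (alpha / rank) * B @ A for one task."""
    A: list       # rank x d_in matrix (list of rows)
    B: list       # d_out x rank matrix (list of rows)
    alpha: float  # scaling hyperparameter (hypothetical value in examples)


def matvec(M, x):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def lora_forward(W, adapter, x):
    """Compute y = W x + (alpha / rank) * B (A x).

    W is the shared base weight: every task reuses the same object,
    and only the small A and B matrices differ per task.
    """
    base = matvec(W, x)
    rank = len(adapter.A)
    low_rank = matvec(adapter.B, matvec(adapter.A, x))
    scale = adapter.alpha / rank
    return [b + scale * l for b, l in zip(base, low_rank)]


def param_counts(d_in, d_out, rank, n_tasks):
    """Parameters for n_tasks full copies vs. one base plus n_tasks adapters."""
    full_copies = n_tasks * d_in * d_out
    shared_plus_lora = d_in * d_out + n_tasks * rank * (d_in + d_out)
    return full_copies, shared_plus_lora
```

For a single hypothetical 1024 × 1024 layer with rank 16 and three tasks, `param_counts(1024, 1024, 16, 3)` gives 3,145,728 parameters for three full copies against 1,146,880 for one shared base plus three adapters. On the speedup figure, the step reduction alone accounts for 50 / 8 = 6.25×, so the quoted ~15× presumably includes savings beyond the step count that the announcement does not break down.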
Whisper-Large Audio Upgrade
The audio encoder moves from Wav2Vec2 to Whisper-large, capturing finer phoneme changes, pronunciation rhythm, and multilingual prosody. This improves lip sync and full-body temporal stability—reducing jitter, frame skips, frozen frames, and identity drift in long videos.
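To condition each video frame on audio, encoder features have to be mapped onto the frame timeline. The sketch below shows one generic way to do that with nearest-index windowing; it is not taken from the release. Whisper's encoder emits 1500 positions per 30 seconds of audio (50 per second), while the 25 fps frame rate, the window size, and the function name are assumptions made for illustration.

```python
import math

# Whisper's encoder produces 1500 positions for 30 s of audio, i.e. 50 per second.
WHISPER_FEATURE_HZ = 50.0


def audio_window(frame_idx, n_audio, video_fps=25.0, context=2):
    """Return clamped audio-feature indices centred on one video frame.

    video_fps and context are hypothetical defaults, not values from the release.
    """
    # Position of this frame's timestamp on the audio-feature timeline.
    center = math.floor(frame_idx * WHISPER_FEATURE_HZ / video_fps + 0.5)
    window = range(center - context, center + context + 1)
    # Clamp at clip boundaries so edge frames repeat the nearest valid feature.
    return [min(max(i, 0), n_audio - 1) for i in window]
```

With these defaults, `audio_window(10, 1500)` returns `[18, 19, 20, 21, 22]`: frame 10 sits at 0.4 s, which is audio position 20, plus two positions of context on each side.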
Data Engineering
- Offline annotation: Face keypoints, person count, body composition, audio-visual sync
- Online validation: Filter transitions, black frames, flicker, frame skips
- Multi-person data: Active speaker detection to isolate single-speaker segments
- Silent data: Natural micro-expressions without spurious lip motion on non-speakers
- Emotion data: Frame-level emotion recognition for speech-expression-body alignment
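As an illustration of the online validation step, the sketch below flags black frames, flicker, and near-duplicate (static) frames in a clip. It is a simplified stand-in rather than the project's pipeline: frames are flat lists of grayscale values in 0 to 255, and every threshold and function name is hypothetical. Transition and frame-skip detection need more than these per-frame statistics and are left out.

```python
def mean_luma(frame):
    """Average brightness of a frame given as a flat list of 0-255 values."""
    return sum(frame) / len(frame)


def mean_abs_diff(a, b):
    """Average per-pixel absolute difference between two frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def validate_clip(frames, black_thresh=10.0, static_thresh=0.5, flicker_thresh=40.0):
    """Return (frame_index, issue) pairs; thresholds are illustrative guesses."""
    issues = []
    for i, frame in enumerate(frames):
        # Black frame: overall brightness is close to zero.
        if mean_luma(frame) < black_thresh:
            issues.append((i, "black"))
        if i == 0:
            continue  # the remaining checks compare against the previous frame
        prev = frames[i - 1]
        # Static frame: almost identical to its predecessor.
        if mean_abs_diff(frame, prev) < static_thresh:
            issues.append((i, "static"))
        # Flicker: global brightness jumps sharply between consecutive frames.
        if abs(mean_luma(frame) - mean_luma(prev)) > flicker_thresh:
            issues.append((i, "flicker"))
    return issues
```

A clip would then be kept only if `validate_clip(frames)` returns an empty list, or if the flagged frames fall below some tolerated fraction; which policy to use is a design choice this sketch does not settle.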
GRPO & Hand Stability
GRPO (Group Relative Policy Optimization) applies frame-level human preference alignment, correcting discontinuous motion, hand deformation, structural collapse, and expression-speech mismatch. First-frame hand detection for image-to-video and continuation tasks increases training on visible-hand samples, reducing hand artifacts in e-commerce, product showcase, and education scenarios.
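The core of GRPO is that each sample's reward is normalized against the other samples generated for the same input, so no separate value network is needed. The sketch below shows only that normalization step, using the population standard deviation as one common choice. How frame-level preference scores are aggregated into a reward is not described here, so the `rewards` input and the function name are assumptions.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    rewards: one scalar per video sampled for the same prompt (hypothetical
    output of a preference/reward model).
    eps guards against division by zero when all rewards are equal.
    """
    mean = fmean(rewards)
    std = pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

Samples scoring above the group mean get a positive advantage and are reinforced; those below get a negative one. When every sample in a group scores the same, all advantages are zero and the group contributes no gradient.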
EvalTalker Benchmark Results
The evaluation is built on EvalTalker and covers news, education, entertainment, and commercial scenarios, with 770 evaluators, 13,240 subjective scores, and 10 structured expert analyses.
User Preference
| Competitor | LongCat-Video-Avatar 1.5 Win Rate |
|---|---|
| Kling Avatar 2.0 | 65.9% |
| OmniHuman 1.5 | 61.1% |
| HeyGen | 54.3% |
Key Metrics
| Metric | Result |
|---|---|
| Single-person score | 3.336 |
| Multi-person score | 2.730 (vs. InfiniteTalk 2.339) |
| Subject deformation rate | 23.1% |
| Background deformation rate | 9.4% |
| Frame skip rate | 0.8% (lowest) |
| Face-body sync issue rate | 5.1% (best) |
| Lip sync issue rate | 29.8% (best) |
Open Source Resources
- GitHub: https://github.com/meituan-longcat/LongCat-Video
- Hugging Face: LongCat-Video-Avatar-1.5
- ModelScope: LongCat-Video-Avatar-1.5
- Project Page: LongCat-Video-Avatar-1.5-Page
- Tech Report: LongCat-Video-Avatar-1.5-Tech-Report.pdf