LongCat-Video-Avatar 1.5
Commercial-Grade Digital Human Video Generation (Released: May 2026)
Overview
LongCat-Video-Avatar 1.5 is the latest open-source release in Meituan LongCat's digital human video line, advancing from open-source SOTA toward commercial-grade deployment. Built on the LongCat-Video base with the "one model for multiple tasks" design, it natively supports Audio-Text-to-Video (AT2V), Audio-Text-Image-to-Video (ATI2V), and video continuation.
Version 1.5 upgrades lip synchronization, physical plausibility, long-video stability, multi-person interaction, and inference efficiency. It produces stable, natural, high-quality output even in complex commercial scenarios, moving digital human video generation from controlled demos to diverse real-world production.
This release supersedes LongCat-Video-Avatar v1.0 (November 2025). Model weights, inference code, and documentation are fully open.
Three Core Capability Upgrades
1. Commercial-Grade Baseline Quality
Under complex speech inputs—long sentences, fast speech, and singing—the model delivers more precise and smooth lip motion. Facial expressions, head pose, and body movements are better coordinated for natural, stable overall expression.
2. Richer Open-Domain Scenes
Powered by a high-quality data system, the model stably handles diverse subjects including real humans, anime characters, virtual idols, and animals. Multi-person dialogue is more natural, with accurate distinction between speakers and listeners.
3. Efficient Inference & Deployment
DMD (Distribution Matching Distillation) compresses generation from 50 steps to 8 steps, achieving approximately 15× inference speedup. A shared base model + multiple LoRA adapters replaces the traditional three-model parallel scheme, significantly reducing VRAM usage. In practice, generating a 10-second video takes about 1 minute—better suited for scaled production and real business workflows.
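As a rough sanity check on the figures above, the arithmetic can be sketched as follows. The step counts (50 and 8), the approximate 15× speedup, and the 10-second / 1-minute timing are taken from this page; the function names are illustrative only, and the split of the speedup into a step-count part and a remainder is a back-of-the-envelope reading, not a breakdown published by the team.

```python
def forward_pass_ratio(teacher_steps: int, student_steps: int) -> float:
    """Speedup attributable to running fewer denoising steps, all else equal."""
    return teacher_steps / student_steps


def real_time_factor(wall_clock_seconds: float, video_seconds: float) -> float:
    """Seconds of compute spent per second of generated video."""
    return wall_clock_seconds / video_seconds


step_ratio = forward_pass_ratio(50, 8)    # step counts quoted on this page
reported_speedup = 15.0                   # "approximately 15x", also from this page
residual = reported_speedup / step_ratio  # part not explained by step count alone
rtf = real_time_factor(60.0, 10.0)        # "about 1 minute" for a 10-second video

print(f"step reduction alone: {step_ratio:.2f}x")
print(f"remaining factor:     {residual:.2f}x")
print(f"real-time factor:     {rtf:.1f}")
```

Step reduction alone accounts for 6.25×, so the remaining factor of roughly 2.4× would have to come from other savings that this page does not itemize.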
Whisper-Large Audio Encoding
In the audio feature extraction stage, the encoder is upgraded from Wav2Vec2 to Whisper-large. The larger parameter count and richer multilingual priors enable finer capture of phoneme changes, pronunciation rhythm, and cross-language prosody—helping the model understand "how to speak at every moment."
This upgrade improves both lip sync and full-body temporal stability: facial expressions, head pose, shoulders, and limb movements coordinate more naturally with speech, substantially reducing jitter, frame skipping, frozen frames, and identity drift in long videos.
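To make the idea of per-moment audio conditioning concrete, here is a minimal sketch of mapping a video frame to the audio features that overlap it. Whisper's encoder emits 1,500 feature vectors per 30-second window, i.e. 50 per second. Everything else is an assumption for illustration: the 25 fps frame rate, the `context` padding, and the function name `audio_window_for_frame` are hypothetical and do not describe how LongCat-Video-Avatar 1.5 actually aligns audio to frames.

```python
import math

# Whisper encoder output rate: 1500 positions per 30-second window.
WHISPER_FEATURES_PER_SECOND = 50


def audio_window_for_frame(frame_idx: int, fps: int, num_features: int,
                           context: int = 2) -> range:
    """Indices of audio features overlapping one video frame.

    The window is padded by `context` features on each side (a hypothetical
    choice) and clamped to the valid range [0, num_features).
    """
    first = math.floor(frame_idx * WHISPER_FEATURES_PER_SECOND / fps)
    last = math.ceil((frame_idx + 1) * WHISPER_FEATURES_PER_SECOND / fps)  # exclusive
    return range(max(0, first - context), min(num_features, last + context))


# Example: 10 seconds of audio (500 features) and an assumed 25 fps video.
window = audio_window_for_frame(frame_idx=10, fps=25, num_features=500)
print(list(window))
```

A wider `context` gives each frame more surrounding audio to draw on, at the cost of a larger conditioning tensor per frame.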
High-Quality Data Pipeline
Data quality directly determines generation ceiling. LongCat-Video-Avatar 1.5 uses a multi-stage processing workflow:
- Offline annotation: Extract face keypoints, person count, body composition, audio-visual sync, and related attributes
- Online validation: Automatically filter low-quality segments such as transitions, black frames, flicker, and frame skips
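The online validation step could look something like the sketch below, which flags black frames, abrupt brightness jumps (flicker or transitions), and timestamp gaps (frame skips). This is an illustrative filter only: the thresholds, the flat grayscale frame representation, and the name `find_bad_frames` are assumptions, not the project's actual pipeline.

```python
from statistics import mean


def find_bad_frames(frames, timestamps, fps,
                    black_thresh=16.0, flicker_thresh=40.0):
    """Return {frame_index: reason} for frames that fail simple quality checks.

    frames:     list of frames, each a flat list of grayscale values (0-255)
    timestamps: presentation time of each frame, in seconds
    Thresholds are hypothetical and would need tuning on real data.
    """
    expected_dt = 1.0 / fps
    lumas = [mean(frame) for frame in frames]
    issues = {}
    for i, luma in enumerate(lumas):
        if luma < black_thresh:
            issues[i] = "black_frame"
        elif i > 0 and abs(luma - lumas[i - 1]) > flicker_thresh:
            issues[i] = "flicker_or_transition"
        elif i > 0 and timestamps[i] - timestamps[i - 1] > 1.5 * expected_dt:
            issues[i] = "frame_skip"
    return issues


# Toy clip: a brightness jump at frame 2, a timing gap before frame 3,
# and a black frame at the end.
frames = [[120] * 4, [122] * 4, [200] * 4, [201] * 4, [5] * 4]
timestamps = [0.0, 0.04, 0.08, 0.20, 0.24]
print(find_bad_frames(frames, timestamps, fps=25))
```

A segment would then be dropped, or trimmed around the flagged indices, before it reaches training.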
Three Targeted Data Enhancements
- Multi-person data: Active speaker detection keeps only segments where a single speaker talks at each moment, reducing audio-visual ambiguity in multi-person scenes
- Silent data: Videos where characters are not speaking teach natural micro-expressions, gaze, and body dynamics without speech, which prevents spurious mouth movement on non-speaking characters
- Emotion data: Multi-modal pre-screening plus frame-level emotion recognition inject emotional transitions, strengthening the link between speech, expression, and body response
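The multi-person filter described above can be sketched as a run-length pass over per-frame active-speaker labels, keeping only stretches where exactly one speaker is talking. The input format, the `min_len` cutoff, and the function name are hypothetical; the active speaker detector itself is assumed to exist upstream and is not shown.

```python
def single_speaker_segments(active_speakers, min_len=3):
    """Find runs of frames in which exactly one (and the same) speaker is active.

    active_speakers: list with one set of speaker IDs per frame
    Returns a list of (start, end_exclusive, speaker_id) tuples.
    Frames with zero or several active speakers break a run.
    """
    segments = []
    start, current = 0, None
    # A trailing empty set acts as a sentinel that flushes the last run.
    for i, speakers in enumerate(list(active_speakers) + [set()]):
        speaker = next(iter(speakers)) if len(speakers) == 1 else None
        if speaker != current:
            if current is not None and i - start >= min_len:
                segments.append((start, i, current))
            start, current = i, speaker
    return segments


track = [{"A"}, {"A"}, {"A"}, {"A", "B"}, {"B"}, {"B"}, set(),
         {"B"}, {"B"}, {"B"}, {"B"}]
print(single_speaker_segments(track, min_len=3))
```

In this toy track the overlapping frame and the two-frame run for speaker B are discarded, leaving one clean segment per speaker.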
GRPO Alignment & Hand Stability
On top of high-quality data, the team applies targeted optimization for hand stability and motion continuity. GRPO (Group Relative Policy Optimization) aligns outputs with human preferences at the frame level, correcting local issues such as discontinuous motion, hand deformation, short-term structural collapse, and expression-speech mismatch.
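The core of GRPO is a group-relative advantage: several outputs are sampled for the same input, and each reward is normalized against the group's mean and standard deviation. The sketch below shows that normalization, plus one possible reading of "frame level" in which the normalization is applied independently at each frame index. The per-frame variant, the use of population standard deviation, and all names here are assumptions; this page does not specify the reward model or the exact formulation used.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations differ on this choice
    return [(r - mu) / (sigma + eps) for r in rewards]


def frame_level_advantages(frame_rewards):
    """Hypothetical per-frame variant.

    frame_rewards[g][t] is the preference score of sample g at frame t.
    Each frame index is normalized across the group independently.
    """
    per_frame = [group_relative_advantages(list(column))
                 for column in zip(*frame_rewards)]
    return [list(row) for row in zip(*per_frame)]


# Three samples for one prompt, scored by a (not shown) preference model.
print(group_relative_advantages([1.0, 2.0, 3.0]))
# Two samples, two frames each: only frame 0 distinguishes them.
print(frame_level_advantages([[1.0, 0.0], [3.0, 0.0]]))
```

Samples that score above the group average receive positive advantages and are reinforced; a frame where all samples score the same contributes no gradient signal.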
For image-to-video and video continuation tasks, a first-frame hand detection mechanism prioritizes training samples with visible hands, significantly mitigating hand distortion. These improvements further enhance naturalness and stability in e-commerce live streaming, product showcases, and educational demonstrations.
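One simple way to prioritize samples with visible hands is weighted sampling, sketched below. The `first_frame_has_hands` flag stands in for the output of a hand detector, and the 3× boost is an arbitrary illustrative value; neither comes from this page, and the team's actual prioritization scheme may differ.

```python
import random


def sampling_weights(samples, hand_boost=3.0):
    """Give samples whose first frame shows hands a higher sampling weight."""
    return [hand_boost if s["first_frame_has_hands"] else 1.0 for s in samples]


def draw_batch(samples, batch_size, rng, hand_boost=3.0):
    """Draw a training batch with replacement, favoring hand-visible samples."""
    weights = sampling_weights(samples, hand_boost)
    return rng.choices(samples, weights=weights, k=batch_size)


# Hypothetical metadata: 2 of 6 clips have hands visible in the first frame.
samples = [{"id": f"clip_{i}", "first_frame_has_hands": i < 2} for i in range(6)]

rng = random.Random(0)  # seeded for reproducibility
batch = draw_batch(samples, batch_size=10_000, rng=rng)
share = sum(s["first_frame_has_hands"] for s in batch) / len(batch)
print(f"hand-visible share in batch: {share:.2f}")
```

With these weights, hand-visible clips make up a third of the data but are expected to fill about 60% of each batch (6 of 10 weight units).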
Inherited Foundation (v1.0)
LongCat-Video-Avatar 1.5 builds on capabilities introduced in the original release, including:
- Disentangled Unconditional Guidance: Natural micro-movements (blinking, breathing, posture) during silent segments—silence does not mean freeze
- Cross-Chunk Latent Stitching: Long-video generation without VAE encode-decode quality loss; stable 5-minute+ sequences (~5,000 frames)
- Reference Skip Attention: Identity consistency without rigid "copy-paste" motion; flexible reference frame control via RoPE indices
- Full-body synchronization: Coordinated lip sync, eyes, facial expression, and body gestures
- Multi-mode support: AT2V, ATI2V, and video continuation in one unified model
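The idea behind stitching chunks in latent space can be illustrated with a toy planner: each chunk after the first reuses the last few latent frames of the previous chunk as conditioning, and the overlapping frames are dropped when the chunks are joined, so pixels are decoded only once at the end. The overlap scheme, the chunk sizes, and the function names are assumptions for illustration; this page states only that stitching avoids the VAE encode-decode round trip.

```python
def plan_chunks(total_frames, chunk_len, context_len):
    """Return (start, end_exclusive) latent-frame ranges for each chunk.

    Every chunk after the first begins `context_len` frames before the end of
    the previous one, so those latents can serve directly as conditioning.
    """
    if total_frames < 1 or not 0 <= context_len < chunk_len:
        raise ValueError("need total_frames >= 1 and 0 <= context_len < chunk_len")
    chunks, start = [], 0
    while True:
        end = min(start + chunk_len, total_frames)
        chunks.append((start, end))
        if end == total_frames:
            return chunks
        start = end - context_len


def stitch(latents_per_chunk, context_len):
    """Join chunk latents, dropping the context frames each later chunk repeats."""
    stitched = list(latents_per_chunk[0])
    for chunk in latents_per_chunk[1:]:
        stitched.extend(chunk[context_len:])
    return stitched


plan = plan_chunks(total_frames=10, chunk_len=4, context_len=1)
# Stand-in "latents": each frame is represented by its own index.
latents = [list(range(start, end)) for start, end in plan]
print(plan)
print(stitch(latents, context_len=1))
```

Because the context frames are passed along as latents rather than decoded and re-encoded, no reconstruction error is introduced at chunk boundaries in this scheme.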
EvalTalker Benchmark
The team built a comprehensive evaluation benchmark based on EvalTalker, covering news, education, entertainment, and commercial scenarios with varying difficulty across audio (speech rate, emotion) and visual (person count, pose, occlusion) dimensions.
- 770 evaluators completed 13,240 subjective ratings
- 10 domain experts conducted structured quality analysis
- Four core dimensions: physical plausibility, temporal stability, identity consistency, audio-visual coordination
User Preference Win Rates
| Compared system | LongCat-Video-Avatar 1.5 win rate |
|---|---|
| Kling Avatar 2.0 | 65.9% |
| OmniHuman 1.5 | 61.1% |
| HeyGen | 54.3% |
Scene Performance
| Scenario | Score | Notes |
|---|---|---|
| Single-person | 3.336 | Significantly above HeyGen, OmniHuman 1.5, and others |
| Multi-person | 2.730 | Large lead over InfiniteTalk (2.339); strong speaker/listener distinction |
Physical Plausibility & Long-Temporal Stability
- Subject deformation rate: 23.1% (lowest among compared models)
- Background deformation rate: 9.4%
- Frame skip rate: 0.8% (lowest among all compared models)
- Stable color and detail over long continuous generation; minimal tone drift accumulation
Audio-Visual Coordination
- Face-body sync issue rate: 5.1% (best among compared models)
- Lip sync issue rate: 29.8% (best among compared models)
- Best overall coordination of speech, lips, expression, and motion for speaking characters
Across physical plausibility, temporal stability, identity consistency, and audio-visual coordination, LongCat-Video-Avatar 1.5 achieves a leading, balanced profile on all four dimensions, maintaining high generation quality while delivering major efficiency gains.
Key Features Summary
- Commercial-grade quality: Naturalness, realism, and stability meet production requirements; competitive with leading closed-source systems
- Whisper-large audio: Finer phoneme, rhythm, and multilingual prosody understanding
- Open-domain subjects: Real humans, anime, virtual idols, animals, and more
- Multi-person dialogue: Accurate speaker vs. listener roles in complex scenes
- DMD 8-step inference: ~15× faster than 50-step baseline; ~1 min for 10s video
- Shared base + LoRA: Lower VRAM vs. three-model parallel deployment
- GRPO frame-level alignment: Corrects motion, hands, structure, and expression-speech mismatch
- Long-video stability: Cross-Chunk Latent Stitching; minimal drift, jitter, and identity loss
- Silent-segment naturalness: Micro-expressions and body dynamics without spurious lip motion
Technical Architecture
- Base: LongCat-Video Diffusion Transformer (DiT) with long-video generation priors
- Audio encoder: Whisper-large (upgraded from Wav2Vec2)
- Distillation: DMD — 50-step teacher compressed to 8-step student
- Deployment: One shared foundation model + task-specific LoRA adapters
- Alignment: GRPO with frame-level human preference rewards
- Long video: Cross-Chunk Latent Stitching + Reference Skip Attention
- Guidance: Disentangled Unconditional Guidance for silent segments
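To show why a shared base with LoRA adapters uses less memory than several full models, here is a minimal pure-Python sketch of a LoRA-adapted linear layer and the corresponding parameter counts. The layer width, rank, and adapter names are hypothetical, and the code illustrates the general LoRA technique rather than the project's actual module layout.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, rank):
    """y = W x + (alpha / rank) * B (A x).

    W: shared base weight (d_out x d_in), kept frozen
    A: adapter down-projection (rank x d_in)
    B: adapter up-projection (d_out x rank)
    """
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * d for b, d in zip(base, delta)]


def lora_param_count(d_in, d_out, rank):
    """Extra parameters added by one adapter on one linear layer."""
    return rank * (d_in + d_out)


# Tiny worked example: identity base weight plus a rank-1 adapter.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 0.0]]    # 1 x 2
B = [[0.0], [1.0]]  # 2 x 1
print(lora_forward(W, A, B, x=[2.0, 3.0], alpha=1.0, rank=1))

# Hypothetical sizes: one 4096 x 4096 layer, rank-16 adapters, three adapters.
d, r, n_adapters = 4096, 16, 3
three_full_copies = n_adapters * d * d
shared_plus_adapters = d * d + n_adapters * lora_param_count(d, d, r)
print(three_full_copies, shared_plus_adapters)
```

Switching tasks then means swapping the small `A` and `B` matrices while the large base weight stays resident in memory.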
Resources (v1.5)
- GitHub: https://github.com/meituan-longcat/LongCat-Video
- Hugging Face: https://huggingface.co/meituan-longcat/LongCat-Video-Avatar-1.5
- ModelScope: https://www.modelscope.cn/models/meituan-longcat/LongCat-Video-Avatar-1.5
- Project Page: https://meigen-ai.github.io/LongCat-Video-Avatar-1.5-Page/
- Tech Report: LongCat-Video-Avatar-1.5-Tech-Report.pdf
- Related: LongCat-Video Base Model
- Release Announcement
Version History
LongCat-Video-Avatar 1.5 (May 2026) — Current
Commercial-grade upgrade: Whisper-large, DMD 8-step inference, GRPO alignment, enhanced multi-person and open-domain data, leading results on the EvalTalker-based benchmark.
LongCat-Video-Avatar v1.0 (November 2025)
Initial open-source SOTA release: Cross-Chunk Latent Stitching, Reference Skip Attention, Disentangled Unconditional Guidance, 5-minute+ stable generation.
- Hugging Face (v1.0): LongCat-Video-Avatar
- Project Page (v1.0): LongCat-Video-Avatar
Philosophy
The open-source release of LongCat-Video-Avatar 1.5 is more than a version bump—it is an invitation to developers and creators. Digital human video generation is moving from "showcase effects" to "real usage," facing more open scenarios: different characters, languages, content forms, and complex business needs.
We hope LongCat-Video-Avatar 1.5 becomes a technical foundation that the community can verify, improve, and build on together. The models and code are open: we welcome you to use and test them in your own scenarios and to share feedback, and we look forward to advancing open-source digital human video together with the community.