LongCat-Video-Avatar - SOTA Avatar Video Generation

Overview

LongCat-Video-Avatar is a SOTA-level avatar video generation model built on the LongCat-Video base. Following the core design of "one model for multiple tasks," it natively supports Audio-Text-to-Video (AT2V), Audio-Text-Image-to-Video (ATI2V), and video continuation. With comprehensive upgrades to the underlying architecture, it achieves significant breakthroughs in three key dimensions: realistic motion, long-video stability, and identity consistency, providing developers with a more stable, efficient, and practical creation solution.

Building on the solid foundation of InfiniteTalk and LongCat-Video, LongCat-Video-Avatar addresses core pain points in real-world scenarios. It represents a major evolution in virtual human generation, offering unprecedented realism and stability for commercial applications.

Key Breakthroughs

🎭 Open-Source SOTA Realism - Making Virtual Humans "Come Alive"

Unlike traditional virtual humans where only the mouth moves while the head and body remain static, LongCat-Video-Avatar acts as a complete director, synchronously controlling lip sync, eye movements, facial expressions, and body gestures to achieve rich and full emotional expression, making virtual humans truly "perform."

The model implements Disentangled Unconditional Guidance, allowing it to understand that "silence" does not mean "freeze." Even during speech pauses, the virtual human blinks, adjusts its posture, and relaxes its shoulders as a real person would. LongCat-Video-Avatar is also the first all-around solution to support text-, image-, and video-conditioned generation modes simultaneously.

🎬 Long-Sequence High-Quality Generation - Stabilizing Video

Previous InfiniteTalk models suffered visual quality degradation in long video generation, primarily due to repeated VAE encoding-decoding cycles. Existing methods typically decode the previous segment's output to pixels, then re-encode its final frames into latents as the condition for the next segment. Each "decode→re-encode" pass adds error that accumulates across segments, causing color shifts and blurred details.

LongCat-Video-Avatar proposes the Cross-Chunk Latent Stitching training strategy to fundamentally solve this problem. During training, we sample two consecutive and partially overlapping segments from the same video, performing feature replacement directly in the latent space, enabling the model to seamlessly connect context within the latent space.
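
As a rough illustration of the training-time sampling described above, the sketch below picks two consecutive, partially overlapping segments and replaces the overlapping head of the second segment with the tail of the first. It is only one possible reading of "feature replacement in the latent space," not the released training code. The function names, the `seg_len` and `overlap` parameters, and the one-latent-per-frame simplification are all assumptions made for illustration (a real video VAE typically compresses several frames into one latent).

```python
import random
from typing import List, Tuple


def sample_overlapping_segments(
    num_frames: int, seg_len: int, overlap: int, rng: random.Random
) -> Tuple[range, range]:
    """Pick two consecutive segments that share `overlap` frames (hypothetical helper)."""
    if not 0 < overlap < seg_len:
        raise ValueError("overlap must be in (0, seg_len)")
    span = 2 * seg_len - overlap  # frames covered by both segments together
    if span > num_frames:
        raise ValueError("video too short for two overlapping segments")
    start = rng.randint(0, num_frames - span)  # randint is inclusive on both ends
    first = range(start, start + seg_len)
    second = range(start + seg_len - overlap, start + span)
    return first, second


def stitch_prefix(first_latents: List, second_latents: List, overlap: int) -> List:
    """Replace the overlapping head of the second segment with the first segment's tail."""
    return first_latents[-overlap:] + second_latents[overlap:]


rng = random.Random(0)
first, second = sample_overlapping_segments(num_frames=100, seg_len=16, overlap=4, rng=rng)
# Stand-in latents: each entry is tagged with its source segment and frame index.
first_latents = [("first", i) for i in first]
second_latents = [("second", i) for i in second]
stitched = stitch_prefix(first_latents, second_latents, overlap=4)
```

After stitching, the second segment still covers the same frame indices, but its first four entries come from the first segment, which mirrors how the context would be supplied at inference time.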

During inference, the system directly uses the end portion of the latent sequence generated from the previous segment as the context latent for the next segment, eliminating the need to decode to the pixel domain throughout the process. This design not only eliminates quality loss from VAE cycles but also significantly improves inference efficiency and effectively bridges the gap between training and inference (train-test gap). Experiments show that LongCat-Video-Avatar maintains stable colors and clear details even when generating 5-minute videos with approximately 5,000 frames.
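
The inference loop described above can be sketched as follows. This is a minimal, hedged illustration rather than the project's actual API: `generate_chunk` is a hypothetical callable standing in for the diffusion sampler, `toy_generate_chunk` is a dummy that only continues a counter, and `chunk_len` and `context_len` are assumed parameter names. The point it shows is that the tail of each chunk's latent sequence is passed straight to the next chunk, with no VAE decode or re-encode in between.

```python
from typing import Callable, List

Latent = List[float]  # stand-in for one latent frame (a real one would be a tensor)


def generate_long_latents(
    generate_chunk: Callable[[List[Latent], int], List[Latent]],
    num_chunks: int,
    chunk_len: int,
    context_len: int,
) -> List[Latent]:
    """Chain chunks in latent space: the tail of each chunk conditions the next."""
    if not 0 < context_len < chunk_len:
        raise ValueError("context_len must be in (0, chunk_len)")
    video: List[Latent] = []
    context: List[Latent] = []  # the first chunk has no previous latents
    for _ in range(num_chunks):
        chunk = generate_chunk(context, chunk_len)
        # The first len(context) frames of the chunk are the context itself,
        # so only the newly generated frames are appended.
        video.extend(chunk[len(context):])
        # Reuse the last latent frames directly; no VAE decode or re-encode.
        context = chunk[-context_len:]
    return video


def toy_generate_chunk(context: List[Latent], chunk_len: int) -> List[Latent]:
    """Hypothetical stand-in for the sampler: continues a running counter."""
    start = context[-1][0] + 1 if context else 0.0
    new_frames = [[start + i] for i in range(chunk_len - len(context))]
    return context + new_frames


latents = generate_long_latents(
    toy_generate_chunk, num_chunks=3, chunk_len=8, context_len=2
)
print(len(latents))  # 8 frames from the first chunk, then 6 new frames per later chunk
```

In this sketch, decoding to pixels would be a separate step applied to the finished latent sequence, so the generation loop itself never leaves latent space.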

✅ Commercial-Grade Consistency - Precise Identity Anchoring

To maintain identity (ID) consistency in long videos, InfiniteTalk used reference frame injection, which sometimes led to color shifts or rigid movements (a "copy-paste" effect). LongCat-Video-Avatar upgrades systematically in two ways:

Base Model Upgrade: The video base model migrated to LongCat-Video, which has stronger identity preservation and color consistency priors from large-scale long-video pre-training.
Reference Mechanism Innovation: We introduced a reference frame injection mode with positional encoding. During inference, users can control where the reference frame is inserted in the generation block by specifying its index position in RoPE. More importantly, we designed the Reference Skip Attention mechanism: at time steps adjacent to the reference frame, it masks the reference frame's direct influence on the attention computation, so the reference provides only identity semantic priors and does not dominate specific motion generation. This mechanism preserves ID consistency while suppressing repeated and rigid motion, making long videos both stable and varied.
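
To make the masking idea behind Reference Skip Attention concrete, here is a small sketch that decides, per generated frame, whether its queries may attend to the reference tokens. It is an illustration under assumptions, not the released implementation: `ref_position` stands for the temporal index assigned to the reference frame, and `skip_radius` is a hypothetical parameter for how many neighboring time steps count as "adjacent."

```python
from typing import List


def reference_skip_mask(
    num_frames: int, ref_position: int, skip_radius: int
) -> List[bool]:
    """Per-frame flag: True if that frame's queries may attend to reference tokens."""
    # Frames within skip_radius of the reference position are blocked, so the
    # reference cannot directly dictate their motion.
    return [abs(t - ref_position) > skip_radius for t in range(num_frames)]


mask = reference_skip_mask(num_frames=8, ref_position=3, skip_radius=1)
print(mask)  # → [True, True, False, False, False, True, True, True]
```

In an attention layer, a flag of False would translate into masking out the reference keys for every query token in that frame, while all other frames still see the reference and can draw identity information from it.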

Key Features

Multi-mode support: Audio-Text-to-Video (AT2V), Audio-Text-Image-to-Video (ATI2V), and video continuation
Natural micro-movements: Blinking, breathing, and posture adjustments during silent segments
Long-video stability: Cross-Chunk Latent Stitching enables stable 5-minute+ video generation without quality degradation
Identity consistency: Reference Skip Attention mechanism ensures consistent character appearance throughout long sequences
Full-body synchronization: Synchronous control of lip sync, eye movements, facial expressions, and body gestures
Disentangled Unconditional Guidance: Natural behavior during speech pauses
Cross-language support: Excellent performance in both Chinese and English

Benchmark Performance

Objective Benchmarks

Quantitative evaluation on public datasets including HDTF, CelebV-HQ, EMTD, and EvalTalker shows that LongCat-Video-Avatar achieves state-of-the-art performance across multiple core metrics.

Sync-C/Sync-D (lip-sync accuracy): SOTA performance on all datasets
Consistency metrics: Excellent performance on FID, FVD, and CSIM
Long-video stability: Maintains stable colors and clear details for 5-minute videos (~5,000 frames)

Subjective Evaluation

Based on the EvalTalker benchmark, we organized a large-scale human evaluation in which 492 participants blind-scored generated videos on a 5-point scale for "Naturalness and Realism." LongCat-Video-Avatar received strongly positive feedback across multiple dimensions:

Silent segment performance: Most reviewers noted that LongCat-Video-Avatar maintains natural micro-movements like breathing and blinking during silent segments
Long-video stability: In long sequence generation, compared to InfiniteTalk, the model shows superior identity consistency and visual continuity, effectively alleviating long-standing drift issues
Motion diversity: Thanks to the reference frame mechanism, generated motions are widely considered richer and more natural, avoiding obvious repetition or "copy-paste" effects
Language performance: LongCat-Video-Avatar outperforms all comparison methods in both Chinese and English, demonstrating robust cross-language performance and precise audio-visual synchronization
Application scenarios: LongCat-Video-Avatar performs best in entertainment, daily life, and educational scenarios, showing strong generalization ability across diverse application contexts

In comprehensive subjective evaluation covering commercial promotion, entertainment, news, daily life, and educational scenarios, LongCat-Video-Avatar's overall score leads many mainstream open-source and commercial models, including InfiniteTalk, HeyGen, and Kling Avatar 2.0.

Technical Architecture

LongCat-Video-Avatar is built on the LongCat-Video base, inheriting its Diffusion Transformer (DiT) architecture and long-video generation capabilities. Key technical innovations include:

Cross-Chunk Latent Stitching: Eliminates VAE encode-decode cycles for long-video generation, maintaining quality throughout 5-minute sequences
Reference Skip Attention: Ensures identity consistency without motion rigidity through smart attention masking
Disentangled Unconditional Guidance: Enables natural micro-movements during silent segments
Positional encoding for reference frames: Flexible control over reference frame insertion positions via RoPE indices

Resources

GitHub: https://github.com/meituan-longcat/LongCat-Video
Hugging Face: https://huggingface.co/meituan-longcat/LongCat-Video-Avatar
Project Page: https://meigen-ai.github.io/LongCat-Video-Avatar/
Related: LongCat-Video Base Model

Philosophy

LongCat-Video-Avatar represents our continued iteration in digital human generation following InfiniteTalk. We focus on real problems developers encounter in long-video generation—identity drift, frame freezing, and rigidity during silent segments—and attempt to provide improvements at the model level.

This open-source release is not an "ultimate solution" but an evolving, usable technical foundation. Both InfiniteTalk and LongCat-Video-Avatar are based on real feedback and long-term experimentation, with code and models fully open. We remain committed to open source because we believe that tools gain value through iteration, and iteration requires more people to use, verify, and build together. If you are exploring digital human applications or have ideas about generation technology, we welcome you to follow the project, and we welcome your feedback even more.