Overview

LongCat-Image is a 6B-parameter open-source AI image generation and editing model that achieves performance comparable to larger models. Through efficient architecture design, systematic training strategies, and data engineering, it delivers fast responses, studio-quality output, and accurate Chinese text rendering, giving developers and industry a "high-performance, low-threshold, fully open" solution.

The model achieves open-source SOTA on image editing benchmarks (GEdit-Bench, ImgEdit-Bench) and excels in Chinese text rendering (ChineseWord: 90.7), covering all 8,105 standard Chinese characters. It powers AI image generation across LongCat APP and Web, making professional image creation accessible to everyone.


Key Features

✨ Integrated Generation & Editing

From "text-to-image" to "edit with natural language" in one seamless workflow:

Simple prompts, high-quality output: Deep semantic understanding lets simple prompts produce images that closely match the intended layout, atmosphere, and content, improving creation efficiency while maintaining quality.
15 editing task types: Object add/remove, style transfer, perspective change, portrait refinement, text modification, background replacement, and complex multi-round composite instructions, all expressed in natural language with no complex commands needed.
Multi-round editing without quality loss: Edited images keep the style, lighting, and overall consistency of the originals, with no "stitching artifacts"; portrait editing preserves facial features, and multi-round edits stay on track.

✨ Superior Chinese Text Rendering

Excellent Chinese text generation capabilities with comprehensive character coverage:

Comprehensive character coverage: Covers all 8,105 standard Chinese characters through a curriculum learning strategy. The model scores 90.7 on the ChineseWord benchmark, significantly leading all evaluated models.
High-quality character rendering: Accurate text in shop signs, posters, book covers, and natural scenes (Chinese and English). Supports complex stroke structures for commercial poster design and rare characters for ancient poetry illustrations, couplets, store signs, and text logos.
Multi-stage training: Pre-training learns character forms from millions of synthetic samples; SFT introduces real-world text-image data so that fonts and layouts generalize; RL uses two reward models, OCR and aesthetic, to improve text accuracy and blend text naturally into the background.
Character-level encoding: Text specified in the prompt for rendering is encoded character by character, which significantly reduces the model's memorization burden and greatly improves how efficiently it learns text generation.
Smart typography: Automatically matches font size, color, and spacing to the scene context (e.g., classical scripts for ancient themes, modern sans-serif for tech themes), with no manual adjustment needed.
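
The character-level encoding idea above can be illustrated with a minimal sketch. It is not the model's actual tokenizer: the function name `split_prompt`, the use of double quotes to mark the text to render, and the whitespace split standing in for a regular subword tokenizer are all assumptions made for illustration.

```python
import re

# Assumption: text to render is marked with straight or curly double quotes.
QUOTED = re.compile(r'"([^"]*)"|“([^”]*)”')


def split_prompt(prompt):
    """Return (kind, piece) pairs for a prompt.

    Quoted text becomes one 'char' piece per character; everything else
    becomes 'word' pieces (a stand-in for a normal subword tokenizer).
    """
    pieces = []
    pos = 0
    for match in QUOTED.finditer(prompt):
        # Unquoted text before this quoted span.
        pieces.extend(("word", w) for w in prompt[pos:match.start()].split())
        quoted = match.group(1) if match.group(1) is not None else match.group(2)
        # Quoted text: one piece per character, whitespace dropped.
        pieces.extend(("char", ch) for ch in quoted if not ch.isspace())
        pos = match.end()
    # Unquoted text after the last quoted span.
    pieces.extend(("word", w) for w in prompt[pos:].split())
    return pieces


if __name__ == "__main__":
    for kind, piece in split_prompt('a shop sign that says "开业大吉"'):
        print(kind, piece)
```

In this sketch every character inside the quotes gets its own piece, so a rare character never depends on how a subword vocabulary happens to segment it.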

✨ Studio-Quality Output

Fast response, no waiting: Lightweight optimization makes high-resolution image generation more efficient than similar tools, so users can create frequently without long waits.
Photography-grade quality: Optimized composition and lighting aesthetics, accurate textures and scene lighting that closely replicate the real world, and realistic body proportions and object behavior that follow physical laws together achieve studio photography quality.

Technical Architecture

LongCat-Image adopts a unified architecture for text-to-image generation and image editing, combined with progressive learning strategies. At only 6B parameters, it improves instruction-following accuracy, image generation quality, and text rendering together.

Unified Architecture Design

Shared hybrid backbone: MM-DiT + Single-DiT architecture enables generation and editing capabilities to mutually enhance each other.
VLM condition encoder: Inherits high-quality text-to-image output while maintaining strong instruction following and consistency preservation.
Mid-training initialization: Initializes from the mid-training stage of the text-to-image model to effectively inherit its knowledge and aesthetics while avoiding the limitations of a narrowed state space.

Training Strategy

Multi-task joint learning: Jointly training on instruction-based editing and text-to-image generation deepens the model's understanding of complex and diverse instructions.
Progressive learning: A curriculum learning approach to multi-modal alignment gradually integrates different capabilities during training.
Data engineering: Pre-training uses multi-source data and instruction rewriting strategies; SFT introduces human-annotated data; RL incorporates an AIGC content detector as a reward model for adversarial training.
Texture and realism enhancement: AIGC data is strictly filtered out in the pre-training and mid-training stages to avoid the "plastic" texture local optimum; all SFT data is manually screened for alignment with public aesthetics.
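
How several reward models might feed a single RL signal can be sketched as follows. The source names OCR, aesthetic, and AIGC-detector reward models but does not say how they are combined; the weighted sum, the weights, the position-by-position OCR comparison, and all names (`RewardWeights`, `char_accuracy`, `combined_reward`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    # Hypothetical weights chosen for illustration, not taken from the source.
    ocr: float = 0.5
    aesthetic: float = 0.3
    realism: float = 0.2


def char_accuracy(target, recognized):
    """Fraction of target characters that the OCR output matches, position by position.

    Extra characters in the OCR output beyond the target length are ignored.
    """
    if not target:
        return 1.0
    hits = sum(1 for t, r in zip(target, recognized) if t == r)
    return hits / len(target)


def combined_reward(target, recognized, aesthetic, p_ai_generated, w=RewardWeights()):
    """Weighted sum of three scores, each assumed to lie in [0, 1].

    target:          text the prompt asked to render
    recognized:      text an OCR model read back from the generated image
    aesthetic:       score from an aesthetic reward model
    p_ai_generated:  detector's probability that the image looks AI-generated
    """
    realism = 1.0 - p_ai_generated  # reward images the detector cannot flag
    return (
        w.ocr * char_accuracy(target, recognized)
        + w.aesthetic * aesthetic
        + w.realism * realism
    )


if __name__ == "__main__":
    # One wrong character out of four, decent aesthetics, detector is unsure.
    print(combined_reward("开业大吉", "开业大古", aesthetic=0.8, p_ai_generated=0.5))
```

Scoring realism as one minus the detector's probability is what makes the setup adversarial in this sketch: the generator is rewarded when the detector cannot tell its output from a real photograph.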

Performance Highlights

Image editing: Open-source SOTA on ImgEdit-Bench (4.50) and GEdit-Bench Chinese/English (7.60/7.64), approaching the level of top closed-source models.
Text-to-image: GenEval (0.87) and DPG-Bench (86.8), competitive with top open-source and closed-source models.
Chinese text rendering: ChineseWord (90.7), significantly leading all evaluated models and covering all 8,105 standard Chinese characters.
Efficient inference: Through refined model design and multi-stage training strategy optimization, supports efficient inference on consumer-grade GPUs.

Applications

LongCat APP:
- Text-to-image: Generate high-quality images from text prompts
- Image-to-image: Upload any material (landscape photos, selfies, sketches) and generate new images based on requirements
- 24 zero-threshold image templates: Covering poster design, portrait refinement, scene transformation, and more; click "AI Creation" to use them directly, with no need to worry about writing prompts
- Multi-round editing: Iteratively refine generated images over multiple rounds
LongCat Web: Access AI image generation and multi-round editing at longcat.ai.
Professional creators: Fast, accurate image creation and editing for commercial posters, marketing materials, and creative projects. Supports complex Chinese text rendering for traditional culture, professional domains, and special creative needs.
General users: Easy-to-use AI image generation for personal projects and creative exploration. Zero-threshold templates enable beginners to quickly produce professional-grade works.
Open-source community: Fully open-source text-to-image multi-stage models (Mid-training, Post-training) and image editing models, supporting the full workflow from cutting-edge research to commercial applications.
