[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-845":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":16,"stars7d":17,"stars30d":18,"stars90d":16,"forks30d":16,"starsTrendScore":16,"compositeScore":19,"rankGlobal":10,"rankLanguage":10,"license":10,"archived":20,"fork":20,"defaultBranch":21,"hasWiki":20,"hasPages":20,"topics":22,"createdAt":10,"pushedAt":10,"updatedAt":23,"readmeContent":24,"aiSummary":25,"trendingCount":16,"starSnapshotCount":16,"syncStatus":26,"lastSyncTime":27,"discoverSource":28},845,"LLaDA2.0-Uni","inclusionAI\u002FLLaDA2.0-Uni","inclusionAI","LLaDA2.0-Uni: Understanding and Generation the World. ","",null,"Python",759,49,6,3,0,4,22,50.3,false,"main",[],"2026-06-12 04:00:06","\u003Cp align=\"center\">\n \u003Cimg src=\".\u002Fassets\u002Fllada_logo.png\" width=\"20%\"\u002F>\n\u003C\u002Fp>\n\u003Cdiv align=\"center\">\n \u003Ch1> LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model \u003C\u002Fh1>\n\n\u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2604.20796\" target=\"_blank\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FTechnical%20Report-b5212f.svg?logo=arxiv\" height=\"21px\">\u003C\u002Fa>\n[![Hugging Face](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%A4%97%20Checkpoint-LLaDA2.0--Uni-yellow)](https:\u002F\u002Fhuggingface.co\u002FinclusionAI\u002FLLaDA2.0-Uni)&#160;\n[![Hugging Face](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%A4%97%20Checkpoint-LLaDA2.0--Uni--FP8-yellow)](https:\u002F\u002Fhuggingface.co\u002FinclusionAI\u002FLLaDA2.0-Uni-FP8)&#160;\n\n[![ModelScope 
Model](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🤖%20Checkpoint-LLaDA2.0--Uni-624aff)](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002FinclusionAI\u002FLLaDA2.0-Uni)&#160;\n[![ModelScope Model](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🤖%20Checkpoint-LLaDA2.0--Uni--FP8-624aff)](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002FinclusionAI\u002FLLaDA2.0-Uni-FP8)&#160;\n\n \u003Cb>AGI Research Center, Inclusion AI \u003C\u002Fb>\n\u003C\u002Fdiv>\n\n\n## 🔥 News\n- **[2026-05-06]** ⚡ We release the **FP8 quantized versions** on [HuggingFace](https:\u002F\u002Fhuggingface.co\u002FinclusionAI\u002FLLaDA2.0-Uni-FP8) and [ModelScope](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FinclusionAI\u002FLLaDA2.0-Uni-FP8).\n\n- **[2026-04-23]** 🎉 We release the initial version of **LLaDA2.0-Uni**, including:\n  - 🎯 Model Checkpoints on [HuggingFace](https:\u002F\u002Fhuggingface.co\u002FinclusionAI\u002FLLaDA2.0-Uni)!\n  - 🎯 Text-to-Image (w\u002F thinking mode) Inference Code!\n  - 🎯 Image Understanding Inference Code!\n  - 🎯 Image Editing Inference Code!\n  - 🎯 SPRINT Acceleration for dLLM Backbone!\n  \n## 📝 TODO\n- [x] Quantized model\n- [ ] Diffusers support\n- [ ] ComfyUI support\n- [ ] SGLang support\n- [ ] RL optimization\n\n## 📚 Model Introduction \nWe introduce **LLaDA2.0-Uni**, a unified dLLM-based Mixture-of-Experts (MoE) model that seamlessly integrates multimodal understanding and generation.\n\n \u003Cimg src=\".\u002Fassets\u002Farchitecture.png\" width=\"95%\"\u002F>\n\n#### Architectural Innovations\n- **Unified dLLM-MoE Backbone**: Built on LLaDA 2.0, it unifies multimodal understanding and generation into a simple Mask Token Prediction paradigm.\n\n- **Discrete Semantic Tokenizer**: Utilizes SigLIP-VQ to convert visual inputs into discrete semantic tokens, significantly enhancing multimodal understanding.\n\n- **Efficient Diffusion Decoder**: Pairs discrete tokens with a specialized diffusion decoder for high-fidelity generation, 
enabling rapid 8-step inference via distillation.\n\n#### Core Capabilities\n- **Top-Tier Understanding & Generation**: Matches dedicated VLMs in answering visual questions and understanding documents, while also generating highly detailed images.\n\n- **Flexible Image Editing**: Supports single or multi-reference editing. It enables precise modifications while perfectly preserving original details.\n\n- **Interleaved Generation & Reasoning**: Empowered by unified discrete representations, it effortlessly handles complex interleaved generation and unlocks advanced interleaved reasoning.\n\n\n## 📊 Evaluation Results\n\n\u003Cimg src=\".\u002Fassets\u002Fperformance.png\" width=\"100%\"\u002F>\n\n## 📌 Quick Start\n\n### ⚙️ Installation\n\n#### 1. Create a conda environment\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FinclusionAI\u002FLLaDA2-Uni && cd LLaDA2-Uni\nconda create -n llada2_uni python=3.10 -y\nconda activate llada2_uni\n```\n\n#### 2. Install PyTorch (CUDA 12.4)\n\n```bash\npip install torch torchvision --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu124\n```\n\n#### 3. Install Flash Attention 2 (required for efficient inference)\n\n```bash\npip install flash-attn --no-build-isolation\n```\n\n#### 4. Install remaining dependencies\n\n```bash\npip install -r requirements.txt\n```\n\n### 🧨 Inference\n\n#### 🌟 Text-to-Image Generation\n\n```python\nimport torch\nfrom transformers import AutoModelForCausalLM, AutoTokenizer\nfrom decoder import decode_vq_tokens\n\nmodel_path = \"inclusionAI\u002FLLaDA2.0-Uni\"\ntokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)\nmodel = AutoModelForCausalLM.from_pretrained(\n    model_path, device_map=\"cuda\", torch_dtype=\"bfloat16\", trust_remote_code=True\n).eval()\nmodel.tokenizer = tokenizer\n\n# Generate image tokens\nresult = model.generate_image(\n    \"A modern Scandinavian kitchen with white cabinetry, marble countertops, and a single orchid on the island. 
A Nordic woman with sleek blonde ponytail, wearing an oversized sweater and dainty silver necklaces, stirs a matcha bowl with a bamboo whisk, eyes sparkling with quiet joy. Shot with 50mm, f\u002F2.5, diffused window light, cool white balance, low saturation, clean skin retouch. Mood: serene, wholesome, hygge.\",\n    image_h=1024, image_w=1024,\n    steps=8, cfg_scale=2.0,\n)\n\n# Decode to PIL image (default: 50-step ODE)\nimage = decode_vq_tokens(result[\"token_ids\"], result[\"h\"], result[\"w\"], model_path, \"cuda\")\nimage.save(\"output.png\")\n```\n\n> [!Note]\n>  💡 **Faster decoding** — Use the **decoder-turbo** (distilled decoder) for **~10× faster** image decoding (8 steps instead of 50) with minimal quality loss:\n> ```python\n> image = decode_vq_tokens(\n>     result[\"token_ids\"], result[\"h\"], result[\"w\"], model_path, \"cuda\",\n>     num_steps=8, decode_mode=\"decoder-turbo\",\n> )\n> ```\n\n#### 🌟 Text-to-Image Generation with Thinking\n\n```python\nimport torch\nfrom transformers import AutoModelForCausalLM, AutoTokenizer\nfrom decoder import decode_vq_tokens\n\nmodel_path = \"inclusionAI\u002FLLaDA2.0-Uni\"\ntokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)\nmodel = AutoModelForCausalLM.from_pretrained(\n    model_path, device_map=\"cuda\", torch_dtype=\"bfloat16\", trust_remote_code=True\n).eval()\nmodel.tokenizer = tokenizer\n\n# Generate image tokens with thinking process\nresult = model.generate_image(\n    \"A fox with thick, dense, fluffy fur in a winter setting, possibly surrounded by snow.\",\n    image_h=1024, image_w=1024,\n    mode=\"thinking\",\n    steps=8, cfg_scale=2.0,\n    thinking_steps=32, thinking_gen_length=4096,\n)\n\n# Print thinking trace\nprint(\"Thinking:\", result[\"thinking\"])\n\n# Decode to PIL image\nimage = decode_vq_tokens(result[\"token_ids\"], result[\"h\"], result[\"w\"], model_path, \"cuda\", num_steps=8, 
decode_mode=\"decoder-turbo\",)\nimage.save(\"output_thinking.png\")\n```\n\n#### 🌟 Image Understanding\n\n```python\nimport torch\nfrom transformers import AutoModelForCausalLM, AutoTokenizer\nfrom encoder.image_tokenizer import ImageTokenizer\nfrom decoder.smart_img_process import smart_resize_images\n\nmodel_path = \"inclusionAI\u002FLLaDA2.0-Uni\"\ntokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)\nmodel = AutoModelForCausalLM.from_pretrained(\n    model_path, device_map=\"cuda\", torch_dtype=\"bfloat16\", trust_remote_code=True\n).eval()\nmodel.tokenizer = tokenizer\n\n# Encode image to discrete tokens\nimage_tokenizer = ImageTokenizer(model_path=model_path, device=\"cuda\")\npil_image = smart_resize_images([\".\u002Fassets\u002Funderstanding_example.png\"])[0]\ninfo = image_tokenizer.encode_with_info(pil_image)\nimage_tokens = [x + model.config.image_token_offset for x in info[\"token_ids\"]]\n_, h, w = info[\"grid_thw\"]\n\n# Understand the image\nresponse = model.understand_image(\n    image_tokens, h, w,\n    question=\"Describe this image in detail.\",\n    steps=32, gen_length=2048,\n)\nprint(response)\n```\n\n#### 🌟 Image Editing\n\n```python\nimport torch\nfrom transformers import AutoModelForCausalLM, AutoTokenizer\nfrom encoder.image_tokenizer import ImageTokenizer\nfrom decoder.utils import generate_crop_size_list, var_center_crop\nfrom decoder import decode_vq_tokens\nfrom PIL import Image\n\nmodel_path = \"inclusionAI\u002FLLaDA2.0-Uni\"\ntokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)\nmodel = AutoModelForCausalLM.from_pretrained(\n    model_path, device_map=\"cuda\", torch_dtype=\"bfloat16\", trust_remote_code=True\n).eval()\nmodel.tokenizer = tokenizer\n\n# Encode source image\nimage_tokenizer = ImageTokenizer(model_path=model_path, device=\"cuda\")\ncrop_size_list = generate_crop_size_list((512 \u002F\u002F 32) ** 2, 32)\npil_image = 
var_center_crop(Image.open(\".\u002Fassets\u002Fedit_example.png\").convert(\"RGB\"), crop_size_list=crop_size_list)\ninfo = image_tokenizer.encode_with_info(pil_image)\nimage_tokens = [x + model.config.image_token_offset for x in info[\"token_ids\"]]\n_, h, w = info[\"grid_thw\"]\n\n# Edit the image\nresult = model.edit_image(\n    image_tokens, h, w,\n    instruction=\"Change the background to a beach.\",\n    steps=8, cfg_text_scale=4.0,\n)\n\n# Decode to PIL image\nedited_image = decode_vq_tokens(result[\"token_ids\"], result[\"h\"], result[\"w\"], model_path, \"cuda\", num_steps=8, decode_mode=\"decoder-turbo\",)\nedited_image.save(\"edited.png\")\n```\n\n#### 🌟 SPRINT Acceleration\n\nSPRINT accelerates inference by combining **KV cache reuse**, **adaptive unmasking**, and **threshold-based batch acceptance**:\n\n- **KV Cache Reuse & Pruning**: The prefix KV cache is computed once during warmup steps, then optionally pruned by importance scores (blending KV attention importance with token confidence). Subsequent denoising steps reuse the cached prefix, significantly reducing computation. Per-modality keep ratios (`image_keep_ratio`, `text_keep_ratio`) allow fine-grained control — e.g., retaining all image\u002Ftext tokens for quality while still benefiting from cache reuse.\n- **Adaptive Unmasking**: Instead of unmasking a fixed number of tokens per step, Sprint dynamically decides how many tokens to reveal based on model confidence. At each step, it computes confidence scores (via strategies like `low_confidence`, `top_k_margin`, or `neg_entropy`) and transfers the top-k most confident tokens, where k is adaptively set as `ceil(remaining_masked \u002F steps_left)`. 
This allows easy positions to be resolved quickly while concentrating compute on harder tokens.\n- **Batch Acceptance**: On top of adaptive scheduling, all tokens whose probability exceeds `threshold` are accepted in batch, further reducing the number of denoising iterations needed.\n\n**Image Understanding** with Sprint:\n\n```python\nresponse = model.understand_image(\n    image_tokens, h, w,\n    question=\"Describe this image in detail.\",\n    steps=32, gen_length=4096,\n    use_sprint=True,\n    threshold=0.93,\n    keep_ratio=0.5,\n    cache_warmup_steps=1,\n    image_keep_ratio=1.0,\n    text_keep_ratio=1.0,\n)\n```\n\n**Text-to-Image** with Sprint:\n\n```python\nresult = model.generate_image(\n    \"A modern Scandinavian kitchen with white cabinetry, marble countertops, and a single orchid on the island. A Nordic woman with sleek blonde ponytail, wearing an oversized sweater and dainty silver necklaces, stirs a matcha bowl with a bamboo whisk, eyes sparkling with quiet joy. Shot with 50mm, f\u002F2.5, diffused window light, cool white balance, low saturation, clean skin retouch. Mood: serene, wholesome, hygge.\",\n    image_h=1024, image_w=1024,\n    cfg_scale=2.0,\n    use_sprint=True,\n    block_length=32,\n    steps=8,\n    keep_ratio=0.5,\n    cache_warmup_steps=1,\n)\n```\n\n> [!Note]\n>  Sprint is supported for Simple CFG and no-CFG modes. When using Editing CFG (three-way guidance with `cfg_text_scale` \u002F `cfg_image_scale`), Sprint automatically falls back to baseline.\n\n#### 🌟 Using CLI Scripts\n\n```bash\n# Text-to-Image\npython scripts\u002Ft2i_generate.py --model_path inclusionAI\u002FLLaDA2.0-Uni --prompt \"A modern Scandinavian kitchen with white cabinetry, marble countertops, and a single orchid on the island. A Nordic woman with sleek blonde ponytail, wearing an oversized sweater and dainty silver necklaces, stirs a matcha bowl with a bamboo whisk, eyes sparkling with quiet joy. 
Shot with 50mm, f\u002F2.5, diffused window light, cool white balance, low saturation, clean skin retouch. Mood: serene, wholesome, hygge.\"\n\n# Image Understanding\npython scripts\u002Fmmu_understand.py --model_path inclusionAI\u002FLLaDA2.0-Uni --image .\u002Fassets\u002Funderstanding_example.png\n\n# Image Editing\npython scripts\u002Fimage_edit.py --model_path inclusionAI\u002FLLaDA2.0-Uni --image .\u002Fassets\u002Fedit_example.png --instruction \"Make it a watercolor painting\"\n```\n\n## 🚀 SGLang Support (Coming Soon)\n\nWe are working on integrating [SGLang](https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang) for high-throughput serving and optimized inference. Stay tuned!\n\n## ⚠️ License\n\nThis project is licensed under the terms of the [Apache License 2.0](https:\u002F\u002Fwww.apache.org\u002Flicenses\u002FLICENSE-2.0).\n\n## 📖 BibTeX\n\n```bibtex\n@article{LLaDA2Uni,\ntitle = {LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model},\nauthor = {Tiwei Bie and Haoxing Chen and Tieyuan Chen and Zhenglin Cheng and Long Cui and Kai Gan and Zhicheng Huang and Zhenzhong Lan and Haoquan Li and Jianguo Li and Tao Lin and Qi Qin and Hongjun Wang and Xiaomei Wang and Haoyuan Wu and Yi Xin and Junbo Zhao},\njournal = {arXiv preprint arXiv:2604.20796},\nyear = {2026}\n}\n```\n","LLaDA2.0-Uni is a unified framework for multimodal understanding and generation built on a diffusion large language model. Through a single dLLM-MoE architecture, the project integrates capabilities such as image understanding and text-to-image generation, and supports efficient 8-step inference. Its core techniques include using SigLIP-VQ to convert visual inputs into discrete semantic tokens and a purpose-built diffusion decoder for high-quality generation. LLaDA2.0-Uni also provides flexible image editing that makes precise modifications while preserving original details. The project suits applications that need advanced multimodal processing, such as automated question-answering systems, document understanding tools, and creative content production.",2,"2026-06-11 02:39:46","CREATED_QUERY"]