[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-72226":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":16,"stars7d":17,"stars30d":18,"stars90d":15,"forks30d":15,"starsTrendScore":19,"compositeScore":20,"rankGlobal":9,"rankLanguage":9,"license":21,"archived":22,"fork":22,"defaultBranch":23,"hasWiki":24,"hasPages":22,"topics":25,"createdAt":9,"pushedAt":9,"updatedAt":26,"readmeContent":27,"aiSummary":28,"trendingCount":15,"starSnapshotCount":15,"syncStatus":29,"lastSyncTime":30,"discoverSource":31},72226,"LLaDA","ML-GSAI\u002FLLaDA","ML-GSAI","Official PyTorch implementation for \"Large Language Diffusion Models\"",null,"Python",3821,267,42,84,0,6,11,45,18,29.28,"MIT License",false,"main",true,[],"2026-06-12 02:03:00","# Large Language Diffusion Models\n[![arXiv](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPaper-arXiv-red.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.09992)\n[![deploy](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FHugging%20Face-LLaDA_Base-FFEB3B)](https:\u002F\u002Fhuggingface.co\u002FGSAI-ML\u002FLLaDA-8B-Base)\n[![deploy](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FHugging%20Face-LLaDA_Instruct-FFEB3B)](https:\u002F\u002Fhuggingface.co\u002FGSAI-ML\u002FLLaDA-8B-Instruct)\n[![deploy](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FHugging%20Face-Demo-blue)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fmultimodalart\u002FLLaDA)\n[![deploy](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FZhihu1-知乎1-blue)](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F24214732238)\n[![deploy](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FZhihu2-知乎2-blue)](https:\u002F\u002Fwww.zhihu.com\u002Fquestion\u002F1908479621466396378\u002Fanswer\u002F1910672718174589774?share_code=1kreOq5gzOtnM&utm_psn=1910708245535912148&
utm_source=wechat_timeline&utm_medium=social&s_r=0)\n\n## News\n### New works\n- [2025.02.14] We have uploaded LLaDA paper to [arXiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.09992) and open-sourced [LLaDA-8B-Base](https:\u002F\u002Fhuggingface.co\u002FGSAI-ML\u002FLLaDA-8B-Base) and [LLaDA-8B-Instruct](https:\u002F\u002Fhuggingface.co\u002FGSAI-ML\u002FLLaDA-8B-Instruct).\n\n- [2025.05.23] We introduce [LLaDA-V](https:\u002F\u002Fml-gsai.github.io\u002FLLaDA-V-demo\u002F), a competitive diffusion-based vision-language model, outperforming other diffusion MLLMs.\n\n- [2025.05.25] We introduce [LLaDA 1.5](https:\u002F\u002Fml-gsai.github.io\u002FLLaDA-1.5-Demo\u002F), which incorporates VRPO to reduce gradient variance and enhance preference alignment in LLaDA.\n\n- [2025.09.11] We introduce [LLaDA-MoE-7B-A1B-Base](https:\u002F\u002Fhuggingface.co\u002FinclusionAI\u002FLLaDA-MoE-7B-A1B-Base) and [LLaDA-MoE-7B-A1B-Instruct](https:\u002F\u002Fhuggingface.co\u002FinclusionAI\u002FLLaDA-MoE-7B-A1B-Instruct), the first diffusion language model pretrained from scratch with MoE architecture. 
LLaDA-MoE-7B-A1B-Instruct uses only ~1B active parameters at inference while surpassing LLaDA 1.5 (an 8B dense model) and performing comparably to Qwen2.5-3B-Instruct.\n\n\n### New features in this repo\n- [2025.05.04] We have provided evaluation code based on the [lm-evaluation-harness](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness) for LLaDA-8B-Base.\n\n- [2025.10.27] We have provided batch inference support, along with all evaluation code for [LLaDA-8B-Base](https:\u002F\u002Fhuggingface.co\u002FGSAI-ML\u002FLLaDA-8B-Base), [LLaDA-8B-Instruct](https:\u002F\u002Fhuggingface.co\u002FGSAI-ML\u002FLLaDA-8B-Instruct) and [LLaDA 1.5](https:\u002F\u002Fml-gsai.github.io\u002FLLaDA-1.5-Demo\u002F).\n\n  \n## Introduction\nWe introduce LLaDA (\u003Cb>L\u003C\u002Fb>arge \u003Cb>La\u003C\u002Fb>nguage \u003Cb>D\u003C\u002Fb>iffusion with m\u003Cb>A\u003C\u002Fb>sking), a diffusion model with an unprecedented 8B scale, trained entirely from scratch, \nrivaling LLaMA3 8B in performance.\n\n\u003Cdiv style=\"display: flex; justify-content: center; flex-wrap: wrap;\">\n    \u003Cimg src=\".\u002Fimgs\u002FLLaDA_vs_LLaMA.svg\" style=\"width: 45%\" \u002F>\n    \u003Cimg src=\".\u002Fimgs\u002FLLaDA_vs_LLaMA_chat.svg\" style=\"width: 46%\" \u002F>\n\u003C\u002Fdiv>\n\n\n## Inference\n[LLaDA-8B-Base](https:\u002F\u002Fhuggingface.co\u002FGSAI-ML\u002FLLaDA-8B-Base) and [LLaDA-8B-Instruct](https:\u002F\u002Fhuggingface.co\u002FGSAI-ML\u002FLLaDA-8B-Instruct) are available\non Hugging Face. 
Please first install `transformers==4.38.2` and load the models with the [transformers](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftransformers\u002Findex) library.\n\n```python\nimport torch\nfrom transformers import AutoModel, AutoTokenizer\n\n# trust_remote_code=True runs the model code shipped with the checkpoint.\ntokenizer = AutoTokenizer.from_pretrained('GSAI-ML\u002FLLaDA-8B-Base', trust_remote_code=True)\nmodel = AutoModel.from_pretrained('GSAI-ML\u002FLLaDA-8B-Base', trust_remote_code=True, torch_dtype=torch.bfloat16)\n```\n\nWe provide the `get_log_likelihood()` and `generate()` functions in `get_log_likelihood.py` \nand `generate.py`, respectively, for conditional likelihood evaluation and conditional generation.\n\nYou can directly run `python chat.py` to have multi-round conversations with LLaDA-8B-Instruct.\n\n\n## Gradio demo \nThank you very much to [apolinário](https:\u002F\u002Fgithub.com\u002Fapolinario) for helping us create this amazing demo!\n\nFirst, install [Gradio](https:\u002F\u002Fwww.gradio.app) with `pip install gradio`; you can then directly run `python app.py`.\n\n\u003Cdiv style=\"display: flex; justify-content: center; flex-wrap: wrap;\">\n    \u003Cimg src=\".\u002Fimgs\u002Fexample_gradio.gif\" style=\"width: 80%\" \u002F>\n\u003C\u002Fdiv>\n\n## Pre-training and Supervised Fine-Tuning\n\nWe will not provide the training framework and data as most open-source LLMs do.\n\nHowever, the pre-training and Supervised Fine-Tuning of LLaDA are straightforward. If \nyou have a codebase for training an autoregressive model, you can adapt it to \nLLaDA with just a few lines of code.\n\nWe provide guidelines for the pre-training and SFT of LLaDA in [GUIDELINES.md](GUIDELINES.md). \nYou can also refer to [SMDM](https:\u002F\u002Fgithub.com\u002FML-GSAI\u002FSMDM), which has a similar training process to LLaDA \nand has open-sourced the training framework.\n\n## Evaluation\nPlease refer to [EVAL.md](EVAL.md) for instructions on using the evaluation code.\n\n## FAQ\nHere, we address some common questions about LLaDA.\n\n### 0. 
How do I train my own LLaDA?\nPlease refer to [GUIDELINES.md](GUIDELINES.md) for the guidelines. \nYou can also refer to [SMDM](https:\u002F\u002Fgithub.com\u002FML-GSAI\u002FSMDM), which follows the same training \nprocess as LLaDA and has open-sourced its code.\n\n\n### 1. What is the difference between LLaDA and BERT?\n\nOur motivation is not to improve BERT, nor to apply image generation methods like [MaskGIT](https:\u002F\u002Farxiv.org\u002Fabs\u002F2202.04200) \nto text. **Our goal is to explore a theoretically complete language modeling approach — masked diffusion models.** \nDuring this process, we simplified the approach and discovered that the loss function of masked diffusion models \nis related to the loss functions of BERT and MaskGIT. You can find our theoretical research process in Question 7.\n\nSpecifically, LLaDA employs a masking ratio that varies randomly between 0 and 1, while BERT uses \na fixed ratio. This subtle difference has significant implications. **The training\nobjective of LLaDA is an upper bound on the negative log-likelihood of the model \ndistribution, making LLaDA a generative model.** This enables LLaDA to naturally \nperform in-context learning, instruction-following, and ensures Fisher consistency \nfor scalability with large datasets and models. You can also find a direct answer \nto this question in Section 2.1 of our paper.\n\n\n### 2. What is the relationship between LLaDA and Transformer?\nNetwork structure and probabilistic modeling are two distinct approaches that collectively form the \nfoundation of language models. LLaDA, like GPT, adopts the \nTransformer architecture. The key difference lies in the probabilistic modeling approach: GPT \nutilizes an autoregressive next-token prediction method, \nwhile LLaDA employs a diffusion model for probabilistic modeling.\n\n\n### 3. What is the sampling efficiency of LLaDA?\nCurrently, LLaDA's sampling speed is slower than the autoregressive baseline for three reasons: \n1. 
LLaDA samples with a fixed context length;\n2. LLaDA cannot yet leverage techniques like KV-Cache;\n3. LLaDA achieves optimal performance when the number of sampling steps equals the response length.\nReducing the number of sampling steps leads to a decrease in performance, as detailed in Appendix B.4 \nand Appendix B.6 of our paper.\n\nIn this work, we aim to explore the upper limits of LLaDA's capabilities, **challenging the assumption \nthat the key LLM abilities are inherently tied to autoregressive models**. We will continue \nto optimize its efficiency in the future. We believe this research approach is reasonable, \nas verifying the upper limits of diffusion language models' capabilities will provide us with\nmore resources and sufficient motivation to optimize efficiency.\n\nRecall the development of diffusion models for images, from [DDPM](https:\u002F\u002Farxiv.org\u002Fabs\u002F2006.11239) \nto the [Consistency model](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2410.11081), where sampling speed accelerated nearly \n1000 times over the course of 4 years. **We believe there is significant room for optimization in LLaDA's \nsampling efficiency as well**. Current solutions, including [block diffusion](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.09573), can mitigate the fixed context length issue, and \n[consistency distillation](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2502.05415) can reduce the number of sampling steps. In\naddition, some cache methods (e.g., [Fast-dllm](https:\u002F\u002Fgithub.com\u002FNVlabs\u002FFast-dLLM), [dllm-cache](https:\u002F\u002Fgithub.com\u002Fmaomaocun\u002FdLLM-cache))\ncan also be adapted by LLaDA.\n\n\n### 4. What is the training stability of LLaDA?\nFor details on the pre-training process of LLaDA, please refer to Section 2.2 of our paper. \nDuring the total pre-training on 2.3T tokens, we encountered a training crash (loss becoming NaN) \nonly once at 1.2T tokens. 
Our solution was to resume from the checkpoint and reduce \nthe learning rate from 4e-4 to 1e-4.\n\n\n### 5. Why is the final answer \"72\" generated earlier than the intermediate calculation step (e.g., 12 × 4 = 48) in Table 4?\n\n**The mask predictor has successfully predicted the reasoning process. However, during the \nremasking process, the reasoning steps are masked out again.** As shown in the figure \nbelow, the non-white background represents the model's generation process, while the \nwhite-background boxes indicate the predictions made by the mask predictor at each step. \nWe adopt a random remasking strategy.\n\n\u003Cdiv style=\"display: flex; justify-content: center; flex-wrap: wrap;\">\n    \u003Cimg src=\".\u002Fimgs\u002Fdiff_remask.gif\" style=\"width: 80%\" \u002F>\n\u003C\u002Fdiv>\n\n### 6. Why does LLaDA answer 'Bailing' when asked 'Who are you'?\nThis is because our pre-training and SFT data were designed for training an autoregressive model, \nwhereas LLaDA directly utilizes data that contains identity markers.\n\n\n### 7. What was our journey in developing LLaDA?\nLLaDA is built upon our two prior works, [RADD](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.03736) and \n[SMDM](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.18514). \n\nRADD demonstrated that the **training objective of LLaDA serves as an upper bound on the negative \nlog-likelihood** of the model’s distribution, a conclusion also supported by [MD4](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.04329) \nand [MDLM](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.07524). \nFurthermore, RADD was the first to theoretically prove that **masked diffusion models do not require time t \nas an input to the Transformer**. This insight provides the theoretical \njustification for LLaDA’s unmodified use of the Transformer architecture. Lastly, 
Lastly, \nRADD showed that **the training objective of masked diffusion models is equivalent to that of \nany-order autoregressive models**, offering valuable insights into how masked diffusion models can \novercome the reversal curse.\n\nSMDM introduces the first **scaling law** for masked diffusion models and demonstrates that, with the \nsame model size and training data, masked diffusion models can achieve downstream benchmark results \non par with those of autoregressive models. Additionally, SMDM presents a simple, **unsupervised \nclassifier-free guidance** method that greatly improves downstream benchmark performance, which has \nbeen adopted by LLaDA.\n\n\n## Citation\n\n```bibtex\n@article{nie2025large,\n  title={Large Language Diffusion Models},\n  author={Nie, Shen and Zhu, Fengqi and You, Zebin and Zhang, Xiaolu and Ou, Jingyang and Hu, Jun and Zhou, Jun and Lin, Yankai and Wen, Ji-Rong and Li, Chongxuan},\n  journal={arXiv preprint arXiv:2502.09992},\n  year={2025}\n}\n```\n","LLaDA是一个基于PyTorch实现的大规模语言扩散模型。该项目的核心功能包括80亿参数的预训练模型，支持从零开始训练，并且在性能上可与LLaMA3 8B相媲美。技术特点方面，LLaDA采用了先进的扩散机制和掩码技术，提供了批量推理支持以及基于lm-evaluation-harness的评估代码。此外，项目还推出了视觉-语言模型LLaDA-V和采用MoE架构的LLaDA-MoE-7B-A1B等扩展版本。适合需要高质量文本生成、多模态处理及高效推理的应用场景使用。",2,"2026-06-11 03:40:57","high_star"]