[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-80763":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":13,"subscribersCount":13,"size":13,"stars1d":14,"stars7d":16,"stars30d":17,"stars90d":13,"forks30d":13,"starsTrendScore":18,"compositeScore":13,"rankGlobal":10,"rankLanguage":10,"license":19,"archived":20,"fork":20,"defaultBranch":21,"hasWiki":22,"hasPages":20,"topics":23,"createdAt":10,"pushedAt":10,"updatedAt":24,"readmeContent":25,"aiSummary":26,"trendingCount":13,"starSnapshotCount":13,"syncStatus":14,"lastSyncTime":27,"discoverSource":28},80763,"DiffusionOPD","ali-vilab\u002FDiffusionOPD","ali-vilab","DiffusionOPD: A Unified Perspective of On-Policy Distillation in Diffusion Models","",null,"Python",93,0,2,1,17,53,13,"Apache License 2.0",false,"main",true,[],"2026-06-12 02:04:06","\u003Ch1 align=\"center\"> DiffusionOPD:\u003Cbr>A Unified Perspective of On-Policy Distillation in Diffusion Models \u003C\u002Fh1>\n\u003Cdiv align=\"center\">\n  \u003Ca href='https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.15055'>\u003Cimg src='https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPaper%20(arXiv)-2605.15055-red?logo=arxiv'>\u003C\u002Fa>  &nbsp;\n  \u003Ca href='https:\u002F\u002Fquanhaol.github.io\u002FDiffusionOPD-site\u002F'>\u003Cimg src='https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FWebsite-green?logo=homepage&logoColor=white'>\u003C\u002Fa> &nbsp;\n  \u003Ca href='https:\u002F\u002Fgithub.com\u002Fali-vilab\u002FDiffusionOPD'>\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FCode-9E95B7?logo=github\">\u003C\u002Fa> &nbsp;\n  \u003Ca href='https:\u002F\u002Fhuggingface.co\u002Fquanhaol\u002FDiffusionOPD'>\u003Cimg src='https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FModel-blue?logo=huggingface&logoColor='>\u003C\u002Fa> &nbsp;\n\u003C\u002Fdiv>\n\n## 
Overview\n\n**DiffusionOPD** introduces an on-policy distillation (OPD) framework for multi-task diffusion alignment. Instead of jointly optimizing several rewards from scratch or cascading RL stages, it first learns task-specialized teachers and then distills their capabilities into one unified student along the student's own rollout trajectories.\n\n-   **Decoupled Multi-Stage Training:** Single-task exploration is handled independently by task-specific teachers, while the final student focuses on integrating their capabilities, reducing reward conflict and catastrophic forgetting.\n-   **Principled Diffusion OPD Objective:** We extend OPD from discrete token generation to continuous diffusion Markov processes and derive a closed-form per-step KL objective for denoising transitions.\n-   **Lower-Variance and Sampler-Compatible:** The analytic objective avoids the extra score-function noise in PPO-style policy gradients and naturally covers both stochastic SDE samplers and deterministic ODE samplers through transition\u002Fmean matching.\n-   **Strong Multi-Domain Results:** DiffusionOPD consistently improves training efficiency and final performance across aesthetics, OCR, and GenEval, outperforming multi-reward RL and cascade RL baselines.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\".\u002Fassets\u002Fteaser.png\" alt=\"Result\" style=\"width:90%;\">\n\u003C\u002Fp>\n\nDiffusionOPD follows a simple two-stage recipe (teacher training, then student distillation), carried out in four steps:\n\n1.  **Train Task-Specific Teachers:** Decompose the target capabilities into individual tasks, such as aesthetics, OCR, and GenEval, and train one teacher per task using an off-the-shelf diffusion RL algorithm.\n2.  **Initialize a Unified Student:** Start the student policy from the pretrained diffusion model.\n3.  
**Round-Robin On-Policy Distillation:** For each training round, sample prompts from every task, roll out the current student to obtain on-policy denoising trajectories, and query the corresponding task-specific teacher for supervision at the states visited by the student.\n4.  **Accumulate Full-Task Supervision:** Compute the OPD loss for each task using the closed-form KL objective, accumulate losses across all tasks, and update the student once per round.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\".\u002Fassets\u002Falgo.png\" alt=\"DiffusionOPD Algorithm\" style=\"width:80%;\">\n\u003C\u002Fp>\n\n## Environment Setup\nOur implementation is based on the [DiffusionNFT](https:\u002F\u002Fgithub.com\u002FNVlabs\u002FDiffusionNFT) codebase, and the environment setup largely matches it.\n\nClone this repository and install the packages:\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fali-vilab\u002FDiffusionOPD.git\ncd DiffusionOPD\n\nconda create -n DiffusionOPD python=3.10.16\nconda activate DiffusionOPD\npip install torch==2.6.0 torchvision==0.21.0 --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu126\npip install -e .\n```\n\n## Model Download\nTo avoid redundant downloads and wasted storage during multi-GPU training, download the required models in advance.\n\n**Models**\n* **SD3.5**: `stabilityai\u002Fstable-diffusion-3.5-medium`\n* **GenEval Teacher**: `quanhaol\u002FGenEval-Teacher`\n* **OCR Teacher**: `quanhaol\u002FOCR-Teacher`\n* **Aes Teacher**: `quanhaol\u002FAes-Teacher`\n\n## Reward Preparation\n\nOur supported reward models include [GenEval](https:\u002F\u002Fgithub.com\u002Fdjghosh13\u002Fgeneval), [OCR](https:\u002F\u002Fgithub.com\u002FPaddlePaddle\u002FPaddleOCR), [PickScore](https:\u002F\u002Fgithub.com\u002Fyuvalkirstain\u002FPickScore), [ClipScore](https:\u002F\u002Fgithub.com\u002Fopenai\u002FCLIP), [HPSv2.1](https:\u002F\u002Fgithub.com\u002Ftgxs002\u002FHPSv2), 
[Aesthetic](https:\u002F\u002Fgithub.com\u002Fchristophschuhmann\u002Fimproved-aesthetic-predictor), [ImageReward](https:\u002F\u002Fgithub.com\u002Fzai-org\u002FImageReward) and [UnifiedReward](https:\u002F\u002Fgithub.com\u002FCodeGoat24\u002FUnifiedReward). Beyond the rewards supported by Flow-GRPO, we additionally support `HPSv2.1`, and we run `GenEval` locally instead of through a remote server.\n\n### Checkpoints Downloading\n\n```bash\nmkdir -p reward_ckpts\ncd reward_ckpts\n# Aesthetic\nwget https:\u002F\u002Fgithub.com\u002Fchristophschuhmann\u002Fimproved-aesthetic-predictor\u002Fraw\u002Frefs\u002Fheads\u002Fmain\u002Fsac+logos+ava1-l14-linearMSE.pth\n# GenEval\nwget https:\u002F\u002Fdownload.openmmlab.com\u002Fmmdetection\u002Fv2.0\u002Fmask2former\u002Fmask2former_swin-s-p4-w7-224_lsj_8x2_50e_coco\u002Fmask2former_swin-s-p4-w7-224_lsj_8x2_50e_coco_20220504_001756-743b7d99.pth\n# ClipScore\nwget https:\u002F\u002Fhuggingface.co\u002Flaion\u002FCLIP-ViT-H-14-laion2B-s32B-b79K\u002Fresolve\u002Fmain\u002Fopen_clip_pytorch_model.bin\n# HPSv2.1\nwget https:\u002F\u002Fhuggingface.co\u002Fxswu\u002FHPSv2\u002Fresolve\u002Fmain\u002FHPS_v2.1_compressed.pt\ncd ..\n```\n\n### Reward Environments\n\n```bash\n# GenEval\npip install -U openmim\nmim install mmengine\ngit clone https:\u002F\u002Fgithub.com\u002Fopen-mmlab\u002Fmmcv.git\ncd mmcv; git checkout 1.x\nMMCV_WITH_OPS=1 FORCE_CUDA=1 pip install -e . -v\ncd ..\n\ngit clone https:\u002F\u002Fgithub.com\u002Fopen-mmlab\u002Fmmdetection.git\ncd mmdetection; git checkout 2.x\npip install -e . -v\ncd ..\n\npip install open-clip-torch clip-benchmark\n\n# OCR\npip install paddlepaddle-gpu==2.6.2\npip install paddleocr==2.9.1\npip install python-Levenshtein\n\n# HPSv2.1\npip install hpsv2x==1.2.0\n\n# ImageReward\npip install image-reward\npip install git+https:\u002F\u002Fgithub.com\u002Fopenai\u002FCLIP.git\n```\n\nFor `UnifiedReward`, we deploy the reward service using sglang. 
To avoid conflicts, first create a new environment and install sglang with:\n\n```bash\npip install \"sglang[all]\"\n```\n\nThen launch the service with:\n\n```bash\npython -m sglang.launch_server --model-path CodeGoat24\u002FUnifiedReward-7b-v1.5 --api-key flowgrpo --port 17140 --chat-template chatml-llava --enable-p2p-check --mem-fraction-static 0.85\n```\n\nMemory usage can be reduced by lowering `--mem-fraction-static`, limiting `--max-running-requests`, and increasing `--data-parallel-size` or `--tensor-parallel-size`.\n\n## Training\nThe default configuration file `config\u002Fopd.py` is set up for 8 GPUs; customize it as needed.\n\nSingle-node training example:\n```bash\n# Single Teacher\nbash scripts\u002Fsingle_node\u002Fsopd.sh\n\n# Multi Teacher\nbash scripts\u002Fsingle_node\u002Fmopd.sh\n```\n\n## Evaluation\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\".\u002Fassets\u002Fcomparison.png\" alt=\"Comparison Table\" style=\"width:90%;\">\n\u003C\u002Fp>\n\nThe evaluation process follows DiffusionNFT. The inference script below loads LoRA checkpoints and runs evaluation.\n\n```bash\nbash scripts\u002Fsingle_node\u002Feval.sh\n```\n\nThe `--dataset` flag supports `geneval`, `ocr`, `pickscore`, and `drawbench`.\n\n## Acknowledgement\nWe thank the [Flow-GRPO](https:\u002F\u002Fgithub.com\u002Fyifan123\u002Fflow_grpo) and [DiffusionNFT](https:\u002F\u002Fgithub.com\u002FNVlabs\u002FDiffusionNFT) projects for their excellent open-source diffusion RL codebases.\n\n## Citation\n```\n@article{li2026diffusionopd,\n  title={DiffusionOPD: A Unified Perspective of On-Policy Distillation in Diffusion Models},\n  author={Li, Quanhao and Yu, Junqiu and Jiang, Kaixun and Wei, Yujie and Xing, Zhen and Li, Pandeng and Chu, Ruihang and Zhang, Shiwei and Liu, Yu and Wu, Zuxuan},\n  journal={arXiv preprint arXiv:2605.15055},\n  year={2026}\n}\n```\n","DiffusionOPD 
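is a unified framework for on-policy distillation in diffusion models. The project achieves multi-task alignment by first training task-specific teacher models and then distilling their capabilities into a single unified student model. Its core technical features are decoupled multi-stage training, a principled diffusion OPD objective, low variance, and sampler compatibility. Specifically, it reduces reward conflict and catastrophic forgetting by handling single-task exploration independently and then integrating the resulting capabilities; by extending OPD to continuous diffusion Markov processes, it also provides a closed-form per-step KL objective that avoids the extra score-function noise of PPO-style policy gradients and naturally accommodates both stochastic SDE samplers and deterministic ODE samplers. The method suits scenarios that call for efficient training and strong final performance, such as aesthetic assessment, OCR, and generation evaluation.","2026-06-11 04:01:55","CREATED_QUERY"]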