[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-84130":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":18,"stars90d":16,"forks30d":16,"starsTrendScore":19,"compositeScore":20,"rankGlobal":10,"rankLanguage":10,"license":21,"archived":22,"fork":22,"defaultBranch":23,"hasWiki":24,"hasPages":22,"topics":25,"createdAt":10,"pushedAt":10,"updatedAt":27,"readmeContent":28,"aiSummary":10,"trendingCount":16,"starSnapshotCount":16,"syncStatus":14,"lastSyncTime":29,"discoverSource":30},84130,"UniRL","Tencent-Hunyuan\u002FUniRL","Tencent-Hunyuan","UniRL is a Framework for Unified Multimodal Model Reinforcement Learning","https:\u002F\u002Funirl-project.github.io\u002Funirl\u002F",null,"Python",511,27,2,4,0,93,267,453,8.34,"Other",false,"main",true,[26],"reinforcement-learning","2026-06-12 02:04:38","\u003Cdiv align=\"center\">\n\n\u003Cimg src=\"assets\u002Fbanner.png\" alt=\"UniRL — A Reinforcement Learning Framework for Unified Multimodal Models\" width=\"98%\">\n\n### A Reinforcement Learning Framework for Unified Multimodal Models\n\n**U**(you)·**ni**(need)·**RL** for unified multimodal intelligence\n\n[![Python](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpython-3.12%2B-blue)](pyproject.toml)\n[![License](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Flicense-Apache--2.0-green)](LICENSE)\n[![Documentation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fdocs-unirl--project.github.io-blue)](https:\u002F\u002Funirl-project.github.io\u002Funirl\u002F)\n[![WeChat](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FWeChat-微信群-07C160?logo=wechat&logoColor=white)](assets\u002Fwechat_qr.jpg)\n\n\u003C\u002Fdiv>\n\n## News 🚀\n\n- **[2026-05]** **DRPO** released — *\"Rethinking the Divergence Regularization in LLM RL\"* 
([arXiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F2606.09821)).\n- **[2026-06]** **Flow-DPPO** released — *\"Flow-DPPO: Divergence Proximal Policy Optimization for Flow Matching Models\"* ([paper](FlowDPPO\u002FHY_FlowDPPO.pdf)).\n\n## About 💡\n\nUniRL applies one RL post-training loop — generate samples, score them, compute\nadvantages, update the policy, and sync weights back to rollout workers —\nacross multimodal model families.\n\n\u003Cdiv align=\"center\">\n  \u003Cimg src=\"assets\u002FUniRL_arch_new.png\" alt=\"UniRL architecture\" width=\"900\">\n\u003C\u002Fdiv>\n\nUniRL is a layered, composable system. Each **entrypoint** (`train_diffusion`,\n`train_ar`, `train_pe`, `train_unified_model`) loads a **Hydra example config**\ncovering model, algorithm, rollout, reward, placement, and sync, then creates the\nmatching domain **trainer** (`DiffusionTrainer`, `ARTrainer`, `PETrainer`,\n`UnifiedModelTrainer`). The trainer coordinates the RL loop across pluggable\n**rollout engines**, **algorithms**, **model bundles**, **reward services**, and\nthe shared **distributed runtime**: Ray `DevicePool`, FSDP, Transfer\nQueue (TQ), and LoRA\u002Ffull-weight sync. See [`unirl\u002FREADME.md`](unirl\u002FREADME.md) for the\nruntime loop, deployment modes, and module map.\n\n## Team-Proposed Algorithms 🌟\n\n> **🌟 These algorithms are proposed by our team — the highlight of UniRL.** Each\n> algorithm's folder holds a step-by-step tutorial and a runnable example recipe.\n> We highly recommend trying them in our framework!\n\n| Algorithm | Paper | Tutorial | Notes |\n|---|---|---|---|\n| **Flow-DPPO** | [*\"Flow-DPPO: Divergence Proximal Policy Optimization for Flow Matching Models\"*](FlowDPPO\u002FHY_FlowDPPO.pdf) | [FlowDPPO\u002F](FlowDPPO\u002F) | Diffusion\u002Fflow RL with an exact divergence-based trust-region mask. 
|\n| **DRPO** | [*\"Rethinking the Divergence Regularization in LLM RL\"*](https:\u002F\u002Farxiv.org\u002Fabs\u002F2606.09821) | [DRPO\u002F](DRPO\u002F) | Token-level LLM RL with a smooth advantage-weighted quadratic regularizer. |\n\nUniRL also wires in standard reference algorithms — **GRPO** (for LLMs), **DiffusionNFT**,\n**DanceGRPO**, and **MixGRPO** — in [`unirl\u002Falgorithms\u002F`](unirl\u002Falgorithms\u002FREADME.md).\n\n## Model Support 🎨\n\nModel and algorithm support are **two independent dimensions** that compose within\na domain: any diffusion algorithm (see above) runs on a diffusion\nmodel, and AR algorithms run on AR models — so UniRL covers many more model × algorithm\ncombinations than the shipped example recipes alone. The table below is the model\ndimension; all listed models are supported (✅).\n\n\u003Cdiv align=\"center\">\n\n| Model | Category | Modality | Status |\n|---|---|---|---|\n| Stable Diffusion 3 \u002F 3.5 | Image diffusion | Text → Image | ✅ |\n| Qwen-Image | Image diffusion | Text → Image | ✅ |\n| FLUX.2-Klein | Image diffusion | Text → Image | ✅ |\n| WAN 2.1 | Video diffusion | Text \u002F Image → Video | ✅ |\n| WAN 2.2 | Video diffusion | Text \u002F Image → Video | ✅ |\n| HunyuanVideo 1.0 \u002F 1.5 | Video diffusion | Text → Video | ✅ |\n| Qwen-VL | Vision-language AR | Text + Image → Text | ✅ |\n| Qwen3 | LLM AR | Text → Text | ✅ |\n| Prompt-Enhancer | LLM + diffusion | Text → Text → Image | ✅ |\n| HunyuanImage3 | Unified AR + diffusion | Text → Image | ✅ |\n| Bagel | Unified AR + diffusion | Text → Image | ✅ |\n\n\u003C\u002Fdiv>\n\nEach model maps to a domain entrypoint (`train_diffusion`, `train_ar`, `train_pe`,\n`train_unified_model`); see **Getting Started** below to run any of them.\n\n## Training Modes 🧩\n\nUniRL unifies four training modes, each with its own Hydra example bucket and entrypoint.\nExamples are self-contained YAML files selected with\n`--config-name=\u003Cdomain>\u002F\u003Cexample>`:\n\n| Domain | Trains | Entrypoint 
| Example |\n|---|---|---|---|\n| `diffusion\u002F` | Image \u002F video diffusion models | `train_diffusion` | `diffusion\u002Fsd3_sglang_rollout_colocate` |\n| `ar\u002F` | Autoregressive models — vision-language (VLM) + text-only (LLM) | `train_ar` | `ar\u002Fqwen_vl_grpo_geo3k_mc_4x8`, `ar\u002Fqwen3_drpo_4b_base_dapo_sglang` |\n| `pe\u002F` | Prompt-enhancer (AR rewriter + diffusion reward) | `train_pe` | `pe\u002Fpe_sglang_full_pickscore` |\n| `unified_model\u002F` | Unified AR + diffusion models | `train_unified_model` | `unified_model\u002Fhi3_vllmomni` |\n\nSee [`examples\u002FREADME.md`](examples\u002FREADME.md) for the full launch guide, naming\nschema, and how to add a recipe.\n\n## Getting Started ⚡\n\nInstall dependencies first — see [INSTALL.md](INSTALL.md).\n\n```bash\n# compose-check, then launch a single-node example\npython -m unirl.train_diffusion --config-name=diffusion\u002Fsd3_trainside --cfg job --resolve\nbash examples\u002Frun_experiment_single_node.sh diffusion\u002Fsd3_trainside\n```\n\nFull [launch guide](examples\u002FREADME.md#running-a-recipe) — multi-node, every entrypoint, mooncake.\n\n## Roadmap 🗺️\n\nWe are actively expanding model and algorithm coverage. Near-term directions:\n\n- Broaden algorithm coverage for the newer model families — FLUX.2-Klein,\n  HunyuanVideo 1.0 \u002F 1.5, and Bagel.\n- Extend the team-proposed algorithms (Flow-DPPO, DRPO) to more model families.\n- Broaden reward backends and rollout-engine coverage across domains.\n\nWant a model or algorithm prioritized? [Open an issue](https:\u002F\u002Fgithub.com\u002FTencent-Hunyuan\u002FUniRL\u002Fissues) to discuss.\n\n## Contributing 🤝\n\nContributions and questions are welcome. 
Before opening a pull request, read the\nrepository conventions in [`AGENTS.md`](AGENTS.md), run the\n[pre-PR checks](examples\u002FREADME.md#adding-or-editing-a-recipe) for the files you\ntouched, and fill in the [pull request template](.github\u002Fpull_request_template.md).\nFor questions, bug reports, and feature requests,\n[open an issue](https:\u002F\u002Fgithub.com\u002FTencent-Hunyuan\u002FUniRL\u002Fissues).\n\n## Acknowledgement 🙏\n\nUniRL builds on ideas and infrastructure from the open-source RL and inference\necosystem. We especially thank\n[vLLM](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fvllm),\n[SGLang](https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang),\n[slime](https:\u002F\u002Fgithub.com\u002FTHUDM\u002Fslime), and\n[verl](https:\u002F\u002Fgithub.com\u002Fvolcengine\u002Fverl).\n\n## Citation 📚\n\nIf you find UniRL helpful, please cite:\n\n```bibtex\n@misc{unirl_github,\n  title        = {{UniRL: A Reinforcement Learning Framework for Unified Multimodal Models}},\n  author       = {Haonan Wang and Linyu Wu and Qian Qiu and Lewei Jin and Bowen Ping and Jianghai Chen and Yiheng Du and Guangxin He and Yu Shi and Yongguang Lin and Zhuoxin Zhou and Zhanchao Zhou and Keming Wu and Rizhen Hu and Xuefei Ning and Lvfang Tao and Feiyu Hu and Xiangyan Liu and Siqi Kou and Jiarui Yao and Xiangxin Zhou and Liefeng Bo and Wenxi Zhu and Tianyu Pang},\n  year         = {2026},\n  howpublished = {\\url{https:\u002F\u002Fgithub.com\u002FTencent-Hunyuan\u002FUniRL}},\n  urldate      = {2026-06-05}\n}\n```\n\nIf you use DRPO, please also cite:\n\n```bibtex\n@misc{yao2026drpo,\n  title         = {{Rethinking the Divergence Regularization in LLM RL}},\n  author        = {Jiarui Yao and Xiangxin Zhou and Penghui Qi and Wee Sun Lee and Liefeng Bo and Tianyu Pang},\n  year          = {2026},\n  eprint        = {2606.09821},\n  archivePrefix = {arXiv},\n  primaryClass  = {cs.LG},\n  url           = 
{https:\u002F\u002Farxiv.org\u002Fabs\u002F2606.09821}\n}\n```\n\nIf you use Flow-DPPO, please also cite:\n\n```bibtex\n@misc{ping2026flowdppo,\n  title        = {{Flow-DPPO: Divergence Proximal Policy Optimization for Flow Matching Models}},\n  author       = {Bowen Ping and Xiangxin Zhou and Penghui Qi and Minnan Luo and Liefeng Bo and Tianyu Pang},\n  year         = {2026},\n  howpublished = {\\url{https:\u002F\u002Fgithub.com\u002FTencent-Hunyuan\u002FUniRL\u002Ftree\u002Fmain\u002FFlowDPPO}},\n  note         = {Manuscript dated June 8, 2026}\n}\n```\n","2026-06-11 04:12:22","CREATED_QUERY"]