[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-79878":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":16,"stars7d":16,"stars30d":13,"stars90d":16,"forks30d":16,"starsTrendScore":16,"compositeScore":17,"rankGlobal":10,"rankLanguage":10,"license":10,"archived":18,"fork":18,"defaultBranch":19,"hasWiki":20,"hasPages":18,"topics":21,"createdAt":10,"pushedAt":10,"updatedAt":22,"readmeContent":23,"aiSummary":24,"trendingCount":16,"starSnapshotCount":16,"syncStatus":13,"lastSyncTime":25,"discoverSource":26},79878,"DMPO","OliverLeeXZ\u002FDMPO","OliverLeeXZ","[ICML 2026] Official implementation of 'Beyond Mode Collapse: Distribution Matching for Diverse Reasoning'","",null,"Python",105,2,3,6,0,1.43,false,"main",true,[],"2026-06-12 02:03:55","# Beyond Mode Collapse: Distribution Matching for Diverse Reasoning\n\u003Cdiv align=\"center\">\n\n[📃[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.19461)]\n[🌐[Project Page](https:\u002F\u002Fgithub.com\u002FOliverLeeXZ\u002FDMPO)]\n[🤗[Hugging Face](https:\u002F\u002Fhuggingface.co\u002Fpapers\u002F2605.19461)]\n\u003C\u002Fdiv>\n\n\n## 📣 What's New\n- **[2025.5.21]** We have released our DMPO algorithm in verl: [DMPO](https:\u002F\u002Fgithub.com\u002Fverl-project\u002Fverl-recipe\u002Fpull\u002F105)!\n- **[2025.5.21]** We have released data in [OliverLeeXZ\u002FNP-MM](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FOliverLee\u002FNP_MM) and [OliverLeeXZ\u002FNP](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FOliverLee\u002FNP). 🎉🎉🎉\n- **[2026.5.19]** Our DMPO paper is released! Check it out at 📃[Arxiv: DMPO](https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.19461)!\n- **[2026.5.1]** Our DMPO has been accepted at **ICML 2026**! See you in **Seoul**! 
🎉🎉🎉\n- **[2026.5.1]** Our NPMM-Bench is now integrated into VLMEvalKit via [PR #1463](https:\u002F\u002Fgithub.com\u002Fopen-compass\u002FVLMEvalKit\u002Fpull\u002F1463).\n\n## 🌟 Highlights\n\n\u003Cdiv align=\"center\">\n \u003Cimg src=\".\u002Fimages\u002Fdmpo.png\" width=\"90%\"\u002F>\n\u003C\u002Fdiv>\n\n1. We show that on-policy RL methods suffer from mode collapse due to reverse KL's mode-seeking behavior, and propose DMPO, a simple, practical solution that approximates forward KL minimization at the group level, achieving 9-12\\% relative improvements on optimization tasks.\n2. We introduce MM-NP-Bench, a benchmark for vision-language models built on visual representations of 10 NP-hard tasks. The benchmark features dual-metric evaluation (Success Rate \\& Quality Ratio) that makes mode collapse observable: high SR but low QR reveals a policy that finds solutions but doesn't optimize them. We provide a complete infrastructure including parametric generators, rule-based verifiers, and heuristic solvers, enabling both evaluation and RLVR training.\n3. 
Extensive experiments show that DMPO outperforms five strong baselines by 4.7\\%-3.8\\% on optimization tasks, 2\\% on mathematical reasoning, and 2.3\\% on out-of-domain tasks, with evidence that diversity-preserving training transfers to general reasoning capabilities.\n\n## Quick Start\n\n### Environment Setup\n- We recommend following the official verl installation guide: [Install verl](https:\u002F\u002Fverl.readthedocs.io\u002Fen\u002Flatest\u002Fstart\u002Finstall.html#install-verl).\n\n### NPMM Training Setup\n- You can integrate the vision NP task, ``NP_MM``, into verl, or directly use the existing NP task in verl: [NP Task](https:\u002F\u002Fgithub.com\u002Fverl-project\u002Fverl\u002Fpull\u002F6465).\n\n### Latest verl Recipe for DMPO\n\nTo run DMPO, we recommend using the latest official verl framework together with the recipe from the official codebase:\n\n**Latest verl recipe:**\n\n- verl recipe: [https:\u002F\u002Fgithub.com\u002Fverl-project\u002Fverl-recipe\u002Fpull\u002F105](https:\u002F\u002Fgithub.com\u002Fverl-project\u002Fverl-recipe\u002Fpull\u002F105)\n\n## 🖊️ Citation\n\nIf you find this work helpful, please consider **starring🌟** this repo and citing this paper. 
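Thanks for your support!\n\n```bib\n@misc{li2026modecollapsedistributionmatching,\n      title={Beyond Mode Collapse: Distribution Matching for Diverse Reasoning}, \n      author={Xiaozhe Li and Yang Li and Xinyu Fang and Shengyuan Ding and Peiji Li and Yongkang Chen and Yichuan Ma and Tianyi Lyu and Linyang Li and Dahua Lin and Qipeng Guo and Qingwen Liu and Kai Chen},\n      year={2026},\n      eprint={2605.19461},\n      archivePrefix={arXiv},\n      primaryClass={cs.AI},\n      url={https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.19461}, \n}\n```\n## 🙏 Acknowledgement\n\nDMPO is built on the excellent RL framework [verl](https:\u002F\u002Fgithub.com\u002Fverl-project\u002Fverl), and NPMM-Bench is built on the widely used VLM evaluation framework [VLMEvalKit](https:\u002F\u002Fgithub.com\u002Fopen-compass\u002FVLMEvalKit). We thank the authors and contributors of these projects for their valuable work.\n","This project addresses mode collapse in reinforcement learning by proposing a new algorithm, DMPO. By approximating forward KL divergence minimization at the group level, the algorithm improves performance on optimization tasks, with relative improvements of 9-12%. The project also introduces the MM-NP-Bench benchmark, which contains visual representations of 10 NP-hard problems and uses dual-metric evaluation (Success Rate and Quality Ratio) to measure model performance more comprehensively. It additionally provides a complete infrastructure including parametric generators, rule-based verifiers, and heuristic solvers. It suits scenarios that require diverse reasoning, such as complex optimization tasks, mathematical reasoning, and out-of-domain tasks. The project is written in Python and has been integrated into the verl framework.","2026-06-11 03:58:22","CREATED_QUERY"]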