[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-72586":3},{"id":4,"name":5,"fullName":6,"owner":5,"repo":5,"description":7,"homepage":8,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":15,"stars7d":16,"stars30d":17,"stars90d":15,"forks30d":15,"starsTrendScore":15,"compositeScore":18,"rankGlobal":9,"rankLanguage":9,"license":19,"archived":20,"fork":20,"defaultBranch":21,"hasWiki":20,"hasPages":20,"topics":22,"createdAt":9,"pushedAt":9,"updatedAt":23,"readmeContent":24,"aiSummary":25,"trendingCount":15,"starSnapshotCount":15,"syncStatus":26,"lastSyncTime":27,"discoverSource":28},72586,"Open-Reasoner-Zero","Open-Reasoner-Zero\u002FOpen-Reasoner-Zero","Official Repo for Open-Reasoner-Zero","https:\u002F\u002Fyasminezhang.notion.site\u002FOpen-Reasoner-Zero-19e12cf72d418007b9cdebf44b0e7903",null,"Python",2095,120,11,19,0,3,4,28.25,"MIT License",false,"main",[],"2026-06-12 02:03:05","\u003Cdiv align=\"center\">\n\n# Open Reasoner Zero\n\n\u003Cimg src=\"figure\u002Flogo.jpg\" width=\"300\"\u002F>\n\n\u003Cdiv>\n\nAn Open Source Approach to Scaling Up Reinforcement Learning on the Base Model\n\u003C\u002Fdiv>\n\u003C\u002Fdiv>\n\n\u003Cdiv align=\"center\" style=\"line-height: 1;\">\n    \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FOpen-Reasoner-Zero\u002FOpen-Reasoner-Zero\" style=\"margin: 2px;\">\u003Cimg alt=\"Code\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FOpen%20Reasoner%20Zero-000000?style=for-the-badge&logo=github&logoColor=000&logoColor=white\" style=\"display: inline-block; vertical-align: middle;\"\u002F>\u003C\u002Fa>\n  \n  \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FOpen-Reasoner-Zero\" target=\"_blank\">\u003Cimg alt=\"Hugging Face\"\n    
src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FHuggingFace-fcd022?style=for-the-badge&logo=huggingface&logoColor=000&labelColor\"\u002F>\u003C\u002Fa>\n\n  \u003Ca href=\"https:\u002F\u002Fyasminezhang.notion.site\u002FOpen-Reasoner-Zero-19e12cf72d418007b9cdebf44b0e7903\" target=\"_blank\">\n  \u003Cimg alt=\"Notion Page\"\n    src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FNotion-%23000000.svg?style=for-the-badge&logo=notion&logoColor=white\"\u002F>\u003C\u002Fa>\n\n  \u003Cbr>\n  \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.24290\">\u003Cb>Paper Arxiv Link \u003C\u002Fb>👁️\u003C\u002Fa>\n\u003C\u002Fdiv>\n\n\u003Cdiv>\n\u003Cbr>\n\n\u003C\u002Fdiv>\n\n## Overview 🌊\nWe introduce **Open-Reasoner-Zero**, the first open source implementation of large-scale reasoning-oriented RL training focusing on scalability, simplicity and accessibility.\nUsing the same base model as DeepSeek-R1-Zero-Qwen-32B, our implementation achieves superior performance on AIME2024, MATH500, and the GPQA Diamond benchmark while demonstrating remarkable efficiency—requiring only a tenth of the training steps, compared to DeepSeek-R1-Zero pipeline.\n\nTo enable broader participation in this pivotal moment we witnessed and accelerate research towards artificial general intelligence (AGI), \nwe release our source code, parameter settings, training data, and model weights.\nPlease refer to our [paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.24290) for more insights across various model sizes. \n\n**Let the Reasoner-Zero tide rise!**\n\n\n## Main Results 🏆\n\n![](figure\u002Fteaser.png)\n\n*Figure 1 | Evaluation performance of Open-Reasoner-Zero-\\{7B, 32B\\}. Evaluation performance of Open-Reasoner-Zero-\\{7B, 32B\\} on benchmarks (averaged on 16 responses) during training. 
Using the same base model as DeepSeek-R1-Zero-Qwen-32B, Open-Reasoner-Zero-32B achieves superior performance on the AIME2024, MATH500, and GPQA Diamond benchmarks while requiring only a tenth of the training steps.*\n\n![](figure\u002Ftrain_curve.png)\n*Figure 2 | Train-time scale-up of Train Reward and Response Length for Open-Reasoner-Zero (ORZ) - \\{0.5B, 1.5B, 7B, 32B\\}. Train Reward and Response Length increase steadily, demonstrating consistent scalability across model sizes. Interestingly, the ORZ-32B Response Length exhibits fluctuations without negatively impacting training stability, highlighting the robustness of our minimalist recipe.*\n\n## Releases 📦\n\n\u003Cstrong>[2025\u002F06\u002F03]\u003C\u002Fstrong>\nWe release [ORZ-R1-Distill-Qwen-14B](https:\u002F\u002Fhuggingface.co\u002FOpen-Reasoner-Zero\u002FORZ-R1-Distill-Qwen-14B), obtained by applying the ORZ recipe to reasoning-enhanced models like DeepSeek-R1-Distill-Qwen-14B. ORZ-R1-Distill-Qwen-14B achieves strong results on reasoning benchmarks, surpassing the larger DeepSeek-R1-Distill-Qwen-32B model on AIME 2024 and MATH500 and matching it on AIME 2025.\n\n| Model                        | AIME 2024 | AIME 2025 | MATH500 | GPQA Dia. 
|\n| ---------------------------- | --------- | --------- | ------- | --------- |\n| DeepSeek-R1-Distill-Qwen-14B | 69.7      | 49.1      | 93.9    | 59.1      |\n| DeepSeek-R1-Distill-Qwen-32B | 72.6      | 60.0      | 94.3    | **62.1**     |\n| **ORZ-R1-Distill-Qwen-14B**  | **75.2**  | **60.0**  | **95.6** | 60.4  |\n\n\n\u003Cstrong>[2025\u002F03\u002F31]\u003C\u002Fstrong>\nWe announce a major milestone for `Open-Reasoner-Zero`:\n\n- 🌊 [Updated Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.24290) with new results.\n- 🔭 [Easy-to-use Training Scripts](https:\u002F\u002Fgithub.com\u002FOpen-Reasoner-Zero\u002FOpen-Reasoner-Zero\u002Ftree\u002Fmain\u002Fplayground):\n  - [ORZ-1.5B training scripts](https:\u002F\u002Fgithub.com\u002FOpen-Reasoner-Zero\u002FOpen-Reasoner-Zero\u002Fblob\u002Fmain\u002Fplayground\u002Forz_1p5b_ppo.py) and [ORZ-0.5B training scripts](https:\u002F\u002Fgithub.com\u002FOpen-Reasoner-Zero\u002FOpen-Reasoner-Zero\u002Fblob\u002Fmain\u002Fplayground\u002Forz_0p5b_ppo.py) (main results in Figure 2). \n  - [Minimal resource training scripts](https:\u002F\u002Fgithub.com\u002FOpen-Reasoner-Zero\u002FOpen-Reasoner-Zero\u002Fblob\u002Fmain\u002Fplayground\u002Forz_0p5b_ppo_1gpu.py): ORZ-0.5B can be run on a single A800\u002FH800 gpu!\n- 🤩 [Updated Curated Datasets](https:\u002F\u002Fgithub.com\u002FOpen-Reasoner-Zero\u002FOpen-Reasoner-Zero\u002Ftree\u002Fmain\u002Fdata): \n  - 129k data in total:\n    - [original 57k data](https:\u002F\u002Fgithub.com\u002FOpen-Reasoner-Zero\u002FOpen-Reasoner-Zero\u002Fblob\u002Fmain\u002Fdata\u002Forz_math_57k_collected.json).\n    - [extended 72k data](https:\u002F\u002Fgithub.com\u002FOpen-Reasoner-Zero\u002FOpen-Reasoner-Zero\u002Fblob\u002Fmain\u002Fdata\u002Forz_math_72k_collection_extended.json).\n  - [13k hard data](https:\u002F\u002Fgithub.com\u002FOpen-Reasoner-Zero\u002FOpen-Reasoner-Zero\u002Fblob\u002Fmain\u002Fdata\u002Forz_math_13k_collection_hard.json) mined from the above 129k data. 
\n    - used in the \"annealing\" stage of ORZ-32B training: **AIME2024 from ~41% to ~48%**!\n- 🤗 More HF Models: \n  - Updated HF Models: [`Open-Reasoner-Zero-7B`](https:\u002F\u002Fhuggingface.co\u002FOpen-Reasoner-Zero\u002FOpen-Reasoner-Zero-7B) and [`Open-Reasoner-Zero-32B`](https:\u002F\u002Fhuggingface.co\u002FOpen-Reasoner-Zero\u002FOpen-Reasoner-Zero-32B).\n  - Released HF Models: [`Open-Reasoner-Zero-1.5B`](https:\u002F\u002Fhuggingface.co\u002FOpen-Reasoner-Zero\u002FOpen-Reasoner-Zero-1.5B) and [`Open-Reasoner-Zero-0.5B`](https:\u002F\u002Fhuggingface.co\u002FOpen-Reasoner-Zero\u002FOpen-Reasoner-Zero-0.5B).\n- 🚀 Full Suite of Critic Models for in-depth research: `Open-Reasoner-Zero-Critic-`{[0.5B](https:\u002F\u002Fhuggingface.co\u002FOpen-Reasoner-Zero\u002FOpen-Reasoner-Zero-Critic-0.5B), [1.5B](https:\u002F\u002Fhuggingface.co\u002FOpen-Reasoner-Zero\u002FOpen-Reasoner-Zero-Critic-1.5B), [7B](https:\u002F\u002Fhuggingface.co\u002FOpen-Reasoner-Zero\u002FOpen-Reasoner-Zero-Critic-7B),  [32B](https:\u002F\u002Fhuggingface.co\u002FOpen-Reasoner-Zero\u002FOpen-Reasoner-Zero-Critic-32B)}.\n\n\u003Cstrong>[2025\u002F02\u002F18]\u003C\u002Fstrong>\nWe release `Open-Reasoner-Zero`. 
\n\nAs part of this release, we open-source:\n- 🌊 [Paper(WIP)](https:\u002F\u002Fgithub.com\u002FOpen-Reasoner-Zero\u002FOpen-Reasoner-Zero\u002Fblob\u002Fmain\u002FORZ_paper.pdf) on our comprehensive analysis and insights into Reasoner-Zero training\n- 🤗 HF Model [`Open-Reasoner-Zero-7B`](https:\u002F\u002Fhuggingface.co\u002FOpen-Reasoner-Zero\u002FOpen-Reasoner-Zero-7B) and [`Open-Reasoner-Zero-32B`](https:\u002F\u002Fhuggingface.co\u002FOpen-Reasoner-Zero\u002FOpen-Reasoner-Zero-32B)\n- 🎁 [`Our curated 57k training data`](https:\u002F\u002Fgithub.com\u002FOpen-Reasoner-Zero\u002FOpen-Reasoner-Zero\u002Ftree\u002Fmain\u002Fdata)\n- 📄 [Training Scripts](https:\u002F\u002Fgithub.com\u002FOpen-Reasoner-Zero\u002FOpen-Reasoner-Zero\u002Ftree\u002Fmain\u002Fplayground) to enjoy your own Reasoner-Zero journey!\n\n## Key Features in Codebase 🔑\n\n- A single-controller trainer design that is flexible and researcher-friendly.\n- Training and generation are colocated on the same GPUs to maximize GPU utilization.\n\n## Getting Started 🚀\n### Data\n\nWe release all of our curated high-quality training data in the [`data`](https:\u002F\u002Fgithub.com\u002FOpen-Reasoner-Zero\u002FOpen-Reasoner-Zero\u002Ftree\u002Fmain\u002Fdata) folder:\n* curated 129k data:\n  * [original 57k](https:\u002F\u002Fgithub.com\u002FOpen-Reasoner-Zero\u002FOpen-Reasoner-Zero\u002Fblob\u002Fmain\u002Fdata\u002Forz_math_57k_collected.json), collected from various sources, including AIME (up to 2023), MATH, the Numina-Math collection and Tulu3 MATH.\n  * [extended 72k](https:\u002F\u002Fgithub.com\u002FOpen-Reasoner-Zero\u002FOpen-Reasoner-Zero\u002Fblob\u002Fmain\u002Fdata\u002Forz_math_72k_collection_extended.json), mainly cleaned from OpenR1-Math-220k.\n* [hard 13k](https:\u002F\u002Fgithub.com\u002FOpen-Reasoner-Zero\u002FOpen-Reasoner-Zero\u002Fblob\u002Fmain\u002Fdata\u002Forz_math_13k_collection_hard.json), mined from the first stage of ORZ-32B training.\n\nThe details for how to collect data are described in 
our [paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.24290).\n\n### Installation & Training Scripts\nWe release our [Dockerfile](https:\u002F\u002Fgithub.com\u002FOpen-Reasoner-Zero\u002FOpen-Reasoner-Zero\u002Fblob\u002Fmain\u002Fdocker\u002FDockerfile) in the [docker](https:\u002F\u002Fgithub.com\u002FOpen-Reasoner-Zero\u002FOpen-Reasoner-Zero\u002Ftree\u002Fmain\u002Fdocker) folder to facilitate reproducibility of our training.\n\nTo install the package, run:\n```bash\npip install -e .\n```\n\n#### Start ORZ-32B PPO Training\nHere are the launch commands for 16 nodes. \n\nFirst, on the master node, run:\n```bash\nray start --head\n# you will see log output like:\n# Next steps\n#  To add another node to this Ray cluster, run\n#    ray start --address='\u003Cmaster-node-ip>:\u003Cmaster-node-port>'\n```\n\nThen, on all other nodes, run:\n```bash\nray start --address='\u003Cmaster-node-ip>:\u003Cmaster-node-port>' # \u003Cmaster-node-ip> and \u003Cmaster-node-port> come from the log output above\n```\n\nFinally, on the master node, run:\n```bash\npython -m playground.orz_32b_ppo\n```\nYour training log will be shown in the master node terminal.\n\n------\n\n#### Start ORZ-0.5B PPO Training\nYou can start the ORZ-0.5B PPO training on a single A800\u002FH800 node:\n```bash\npython -m playground.orz_0p5b_ppo\n```\n\nYou can even run it on **a single A800\u002FH800 GPU**: \n```bash\npython -m playground.orz_0p5b_ppo_1gpu\n```\n\nNote: since this is not a multi-node setting, no `ray start` steps are needed.\n\n------\n\n#### Start ORZ-7B PPO Training\n\nMulti-node training on 4 nodes:\n```bash\n# set up for multi-node training\nray start --head # on master node\nray start --address='\u003Cmaster-node-ip>:\u003Cmaster-node-port>' # then on other nodes\n\n# then on master node, run:\npython -m playground.orz_7b_ppo\n```\n\nYour training log will be shown in the master node terminal.\n\n-----\n\n#### Start ORZ-1.5B PPO Training\n\nMulti-node training on 2 nodes:\n```bash\n# set 
up for multi-node training\nray start --head # on master node\nray start --address='\u003Cmaster-node-ip>:\u003Cmaster-node-port>' # then on other nodes\n# then on master node, run:\npython -m playground.orz_1p5b_ppo\n```\n\n----\n\n#### Debug Settings\nThe code provides a `DEBUG_MODE` environment variable that runs training in a debug setting so researchers can iterate quickly. (Though for now, we recommend using `python -m playground.orz_0p5b_ppo_1gpu` for debugging.)\n\nExample debug commands:\n```bash\n# NOTE: for debugging only, not the final settings!\n\n## Debug command on a single GPU with `EleutherAI\u002Fpythia-14m`\nDEBUG_MODE=True python -m playground.orz_14m_ppo_mini\n## Debug command on a single node (8 GPUs) with `Qwen\u002FQwen2.5-7B`\nDEBUG_MODE=True python -m playground.orz_7b_ppo\n```\n\n### How to Use the Model\n#### Policy Model\nPolicy models can be used in the same way as any chat model in transformers and vllm, since the Jinja chat template is included in the tokenizer.\n\n#### Critic Model\nCritic models can be loaded the same way as in the [training code](https:\u002F\u002Fgithub.com\u002FOpen-Reasoner-Zero\u002FOpen-Reasoner-Zero\u002Fblob\u002Fmain\u002Forz\u002Fppo\u002Factors.py#L738). 
\n\n\n## Acknowledgements 💖 \n\n- This work was supported by computing resources and valuable feedback provided by [StepFun](https:\u002F\u002Fwww.stepfun.com\u002F) and Tsinghua University.\n- Our training framework is built on [OpenRLHF](https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF), [vllm](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fvllm), [DeepSpeed](https:\u002F\u002Fgithub.com\u002Fdeepspeedai\u002FDeepSpeed) and [ray](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fray).\n- Our model is based on [Qwen2.5 Series](https:\u002F\u002Fqwenlm.github.io\u002Fblog\u002Fqwen2.5-llm\u002F) of **base models**, including [Qwen2.5-0.5B](https:\u002F\u002Fhuggingface.co\u002FQwen\u002FQwen2.5-0.5B), [Qwen2.5-1.5B](https:\u002F\u002Fhuggingface.co\u002FQwen\u002FQwen2.5-1.5B), [Qwen2.5-7B](https:\u002F\u002Fhuggingface.co\u002FQwen\u002FQwen2.5-7B) and [Qwen2.5-32B](https:\u002F\u002Fhuggingface.co\u002FQwen\u002FQwen2.5-32B).\n- We thank [Project Numina](https:\u002F\u002Fprojectnumina.ai\u002F), [Tulu3](https:\u002F\u002Fallenai.org\u002Fblog\u002Ftulu-3-technical) and [OpenR1-Math-220k](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fopen-r1\u002FOpenR1-Math-220k) for their collected open sourced data.\n\n## Advertisement Time 📣\n\nWe are hiring talented researchers and engineers to join our team. 
If you are interested in our project and would like to contribute to the reasoner scale-up all the way to AGI, please feel free to reach out to us at hanqer@stepfun.com.\n\n\n[![Star History Chart](https:\u002F\u002Fapi.star-history.com\u002Fsvg?repos=Open-Reasoner-Zero\u002FOpen-Reasoner-Zero&type=Timeline)](https:\u002F\u002Fstar-history.com\u002F#Open-Reasoner-Zero\u002FOpen-Reasoner-Zero&Timeline)\n\n## Community Discussions 🍺\n\nWe have several WeChat groups for discussion and sharing; scan the QR code below to join the latest group.\n\n\u003Cimg src=\"figure\u002FWeChatGroup.png\" width=\"300\" style=\"display: block; margin: 0 auto;\"\u002F>\n\n## Citation\n\n```bibtex\n@misc{hu2025openreasonerzeroopensourceapproach,\n      title={Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model}, \n      author={Jingcheng Hu and Yinmin Zhang and Qi Han and Daxin Jiang and Xiangyu Zhang and Heung-Yeung Shum},\n      year={2025},\n      eprint={2503.24290},\n      archivePrefix={arXiv},\n      primaryClass={cs.LG},\n      url={https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.24290}, \n}\n```\n","Open-Reasoner-Zero is an open-source project focused on large-scale reinforcement learning training on base models. Using the same base model as DeepSeek-R1-Zero-Qwen-32B, it performs strongly on benchmarks such as AIME2024, MATH500, and GPQA Diamond while requiring only a tenth of the training steps, a significant efficiency gain. The project is written in Python, and its core strengths include a simplified training pipeline, improved scalability, and ease of use. It is suited to researchers and developers who need efficient reasoning capabilities, particularly in research pursuing artificial general intelligence (AGI). By releasing its source code, parameter settings, training data, and model weights, Open-Reasoner-Zero promotes broader technical exchange and collaboration.",2,"2026-06-11 03:42:42","high_star"]