[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-72216":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":16,"stars7d":17,"stars30d":18,"stars90d":16,"forks30d":16,"starsTrendScore":16,"compositeScore":19,"rankGlobal":10,"rankLanguage":10,"license":20,"archived":21,"fork":21,"defaultBranch":22,"hasWiki":21,"hasPages":21,"topics":23,"createdAt":10,"pushedAt":10,"updatedAt":24,"readmeContent":25,"aiSummary":26,"trendingCount":16,"starSnapshotCount":16,"syncStatus":27,"lastSyncTime":28,"discoverSource":29},72216,"simpleRL-reason","hkust-nlp\u002FsimpleRL-reason","hkust-nlp","Simple RL training for reasoning","",null,"Python",3864,289,32,34,0,3,12,62.09,"MIT License",false,"v1",[],"2026-06-12 04:01:04","\n\n\u003Cdiv align=\"center\">\n\n# Simple Reinforcement Learning for Reasoning\n\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.18892)  [![Hugging Face](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FSimpleRL_Zoo-fcd022?style=for-the-badge&logo=Huggingface&logoColor=000)](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Fhkust-nlp\u002Fsimplerl-zoo-67e0fd24c185423c1e3452d1)\n\n\u003C\u002Fdiv>\n\n\nThis repo contains a simple reinforcement learning recipe to improve models' reasoning abilities. It is simple because only rule-based reward and GSM8K\u002FMath datasets are used. We have used this code to successfully train 10 diverse base models with limited data (8K examples), achieving surprisingly strong results -- the accuracy gains range from 10 to more than 20 absolute points. These models include Llama3 8B, Mistral 7B\u002F24B, DeepSeekMath 7B, Qwen2.5 0.5B\u002F1.5B\u002F7B\u002F14B\u002F32B, and Qwen2.5-Math-7B. 
While we observe a significant increase in both response length and accuracy, we note that different models exhibit distinct reasoning behaviors during training, and the increased response length does not necessarily correlate with the emergence of certain cognitive behaviors such as self-verification. We share many findings and practices in our paper, and we release the code, model checkpoints, and analysis tools here. \n\n> You may find an old version of this repo [here](https:\u002F\u002Fgithub.com\u002Fhkust-nlp\u002FsimpleRL-reason\u002Ftree\u002Fv0), with our early results and codebase using OpenRLHF and PPO.\n\n\u003Cdiv align=\"center\">\n\u003Cimg src=\"assets\u002Fplot_figure1_v2.3_token_length_vs_steps.png\" width=\"700\" alt=\"simplelr-reaoning-intro-figure_00\">\n\u003C\u002Fdiv>\n\n> Accuracy and response length across training iterations for different models. Training starts from base models without any SFT. \n\n## News\n- **[2025\u002F03\u002F24]** We perform successful zero RL training starting from 10 diverse base models. We release all 10 models and the code, and share many findings and practices in our [paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.18892).   \n- **[2025\u002F02\u002F19]** We release checkpoints of [Qwen-2.5-Math-7B-SimpleRL-Zero](https:\u002F\u002Fhuggingface.co\u002Fhkust-nlp\u002FQwen-2.5-Math-7B-SimpleRL-Zero) and [Qwen-2.5-Math-7B-SimpleRL](https:\u002F\u002Fhuggingface.co\u002Fhkust-nlp\u002FQwen-2.5-Math-7B-SimpleRL) to Huggingface. \n- **[2025\u002F01\u002F25]** We release the training\u002Feval code and our blog. 
We are working on the paper and will release it very soon.\n\n\n## Links \n\n* **Paper: SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild**\n  * 📝 [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.18892)\n  * 🤗 [Hugging Face Collection](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Fhkust-nlp\u002Fsimplerl-zoo-67e0fd24c185423c1e3452d1)\n  * 💻 [Github](https:\u002F\u002Fgithub.com\u002Fhkust-nlp\u002FsimpleRL-reason\u002Ftree\u002Fv1)\n\n* **Blog: 7B Model and 8K Examples: Emerging Reasoning with Reinforcement Learning is Both Effective and Efficient**\n  * 📝 [Blog](https:\u002F\u002Fhkust-nlp.notion.site\u002Fsimplerl-reason)\n  * 🤗 [Hugging Face Collection](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Fhkust-nlp\u002Fsimplerl-67b543892b2ec6908ffff710)\n  * 💻 [Github](https:\u002F\u002Fgithub.com\u002Fhkust-nlp\u002FsimpleRL-reason\u002Ftree\u002Fv0)\n\n\n\n## Main Results\n\n### Mistral, Llama and DeepSeek Models\n|            Model           | GSM8K | MATH 500 | Minerva Math | Olympiad Bench | AIME24  (Pass@1) | AIME24  (Avg@32) | AMC23 | Avg. 
|\n|:--------------------------:|:-----:|:--------:|:------------:|:--------------:|:----------------:|:----------------:|:-----:|:----:|\n| Mistral-v0.1-7B                 |  21.2 |    4.2   |      4.0     |       2.4      |        0.0       |        0.0       |  0.0  |  5.3 |\n| 🦁 Mistral-v0.1-7B + SimpleRL-Zoo   |  75.0 |   15.8   |      6.6     |       4.1      |        0.0       |        0.2       |  10.0 | 18.6 |\n| Llama-3.1-8B               |  39.7 |   13.6   |      4.8     |       3.1      |        0.0       |        0.2       |  2.5  | 10.6 |\n| 🦁 Llama-3.1-8B + SimpleRL-Zoo  |  79.2 |   23.0   |      9.6     |       5.3      |        0.0       |        0.2       |  15.0 | 22.0 |\n| DeepSeek-Math-7B           |  28.4 |   19.4   |      5.5     |       4.7      |        0.0       |        0.0       |  10.0 | 11.3 |\n| 🦁 DeepSeek-Math-7B + SimpleRL-Zoo  |  78.5 |   39.6   |     21.0     |      12.6      |        3.3       |        0.6       |  20.0 | 29.2 |\n| Mistral-Small-24B               |  78.6 |   43.6   |     10.7     |      11.6      |        3.3       |        0.5       |  17.5 | 27.6 |\n| 🦁 Mistral-Small-24B + SimpleRL-Zoo    |  92.0 |   70.6   |     36.8     |      36.6      |       16.7       |       13.1       |  45.0 | 49.6 |\n\n\n### Qwen Series Model\n\n|              Model              | GSM8K | MATH 500 | Minerva Math | Olympiad Bench | AIME24  (Pass@1) | AIME24  (Avg@32) | AMC23 | Avg. 
|\n|:-------------------------------:|:-----:|:--------:|:------------:|:--------------:|:----------------:|:----------------:|:-----:|:----:|\n| Qwen-2.5-0.5B                   |  36.7 |   15.8   |      4.8     |       2.8      |        0.0       |        0.3       |  12.5 | 12.1 |\n| 🦁 Qwen-2.5-0.5B + SimpleRL-Zoo          |  49.5 |   34.4   |     10.3     |       8.9      |        0.0       |        0.7       |  22.5 | 20.9 |\n| Qwen-2.5-1.5B                   |  55.7 |   29.6   |      6.6     |       6.5      |        0.0       |        0.1       |  12.5 | 18.5 |\n| 🦁 Qwen-2.5-1.5B + SimpleRL-Zoo          |  74.4 |   59.0   |     20.2     |      21.0      |        6.7       |        4.2       |  35.0 | 36.1 |\n| Qwen-2.5-7B                     |  88.2 |   64.6   |     25.7     |      30.1      |        3.3       |        0.3       |  30.0 | 40.3 |\n| 🦁 Qwen-2.5-7B + SimpleRL-Zoo            |  91.7 |   78.2   |     38.6     |      40.4      |       20.0       |       15.6       |  62.5 | 55.2 |\n| Qwen-2.5-Math-7B                |  65.5 |   63.6   |     12.5     |      25.8      |       13.3       |        8.6       |  42.5 | 37.2 |\n| 🦁 Qwen-2.5-Math-7B + SimpleRL-Zoo       |  90.2 |   80.2   |     37.5     |      39.0      |       40.0       |       24.0       |  70.0 | 59.5 |\n| Qwen-2.5-14B                    |  91.6 |   65.4   |     24.3     |      33.5      |        6.7       |        3.4       |  37.5 | 43.2 |\n| 🦁 Qwen-2.5-14B + SimpleRL-Zoo           |  94.4 |   80.2   |     40.4     |      44.9      |       23.3       |       14.2       |  57.6 | 56.8 |\n| Qwen-2.5-32B                    |  92.9 |   68.6   |     27.9     |      31.1      |       10.0       |        4.5       |  45.0 | 45.9 |\n| 🦁 Qwen-2.5-32B + SimpleRL-Zoo           |  95.9 |   82.4   |     42.6     |      46.4      |       36.7       |       27.2       |  67.5 | 61.9 |\n\n> AIME is evaluated in two ways: Pass@1 (single run) and Avg@32 (average score from 32 runs). 
For AIME24 (Pass@1) and other benchmarks, baselines use greedy decoding, and models with \"zero RL training\" use temperature=1 and top-p=0.95. For AIME24 (Avg@32), we sample 32 responses per model with the same settings. Average scores are based on AIME (Avg@1) and other benchmarks.\n\u003C!-- #### Increase of Response Length does not always correspond to the \"aha moment\"\n\n#### Rigid Format Reward Harms Training of Some Base Models\n\n#### Pass@K Improves Significantly\n\n#### Traditional SFT as a Cold Start Harms RL -->\n\n\n## Model Checkpoints\n**We are open-sourcing all the Zero RL models as outlined below and will also release some of the intermediate checkpoints to support future research. Stay tuned for updates!**\n\nAll these models are also in our [Huggingface Collection](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Fhkust-nlp\u002Fsimplerl-zoo-67e0fd24c185423c1e3452d1). \n|Model|Link|\n|-|-|\n|Mistral-7B-v0.1-SimpleRL-Zoo|[🤗](https:\u002F\u002Fhuggingface.co\u002Fhkust-nlp\u002FMistral-7B-v0.1-SimpleRL-Zoo)|\n|Llama-3.1-8B-SimpleRL-Zoo|[🤗](https:\u002F\u002Fhuggingface.co\u002Fhkust-nlp\u002FLlama-3.1-8B-SimpleRL-Zoo)|\n|DeepSeek-Math-7B-SimpleRL-Zoo|[🤗](https:\u002F\u002Fhuggingface.co\u002Fhkust-nlp\u002FDeepSeek-Math-7B-SimpleRL-Zoo)|\n|Mistral-Small-24B-SimpleRL-Zoo|[🤗](https:\u002F\u002Fhuggingface.co\u002Fhkust-nlp\u002FMistral-Small-24B-SimpleRL-Zoo)|\n|Qwen-2.5-0.5B-SimpleRL-Zoo|[🤗](https:\u002F\u002Fhuggingface.co\u002Fhkust-nlp\u002FQwen-2.5-0.5B-SimpleRL-Zoo)|\n|Qwen-2.5-1.5B-SimpleRL-Zoo|[🤗](https:\u002F\u002Fhuggingface.co\u002Fhkust-nlp\u002FQwen-2.5-1.5B-SimpleRL-Zoo)|\n|Qwen-2.5-7B-SimpleRL-Zoo|[🤗](https:\u002F\u002Fhuggingface.co\u002Fhkust-nlp\u002FQwen-2.5-7B-SimpleRL-Zoo)|\n|Qwen-2.5-14B-SimpleRL-Zoo|[🤗](https:\u002F\u002Fhuggingface.co\u002Fhkust-nlp\u002FQwen-2.5-14B-SimpleRL-Zoo)|\n|Qwen-2.5-32B-SimpleRL-Zoo|[🤗](https:\u002F\u002Fhuggingface.co\u002Fhkust-nlp\u002FQwen-2.5-32B-SimpleRL-Zoo)|\n|Qwen-2.5-Math-7B-SimpleRL-Zo
o|[🤗](https:\u002F\u002Fhuggingface.co\u002Fhkust-nlp\u002FQwen-2.5-Math-7B-SimpleRL-Zoo)|\n\n\n\n\n## Quick Start\n\n### Installation\n\nOur code is built on [Verl](https:\u002F\u002Fgithub.com\u002Fvolcengine\u002Fverl). We provide a basic environment setup for training below, which supports only custom environment setup and [FSDP training](https:\u002F\u002Fpytorch.org\u002Ftutorials\u002Fintermediate\u002FFSDP_tutorial.html). \n\n```bash\nconda create -n verl python==3.9\nconda activate verl\npip3 install torch==2.4.0 --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu124\npip3 install flash-attn --no-build-isolation\npip3 install -e . \n```\n\nTo install from a Docker image or to use Megatron-LM, please refer to [Verl's documentation](https:\u002F\u002Fverl.readthedocs.io\u002Fen\u002Fv0.2.x\u002Fstart\u002Finstall.html).\n\n### Reproducing SimpleRL-Zoo\n\n\n#### Dataset\n\nAs mentioned in [our paper](http:\u002F\u002Farxiv.org\u002Fabs\u002F2503.18892), our data includes three difficulty levels: Easy (GSM8K and MATH lv.1), Medium (MATH lv.1-4), and Hard (MATH lv.3-5). We have processed the data into two ready-to-use formats: simpler prompts (abel) and complex prompts (qwen).\n\n\nYou can download the [dataset](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fhkust-nlp\u002FSimpleRL-Zoo-Data) directly. For example:\n\n```bash\nwget https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fhkust-nlp\u002FSimpleRL-Zoo-Data\u002Fresolve\u002Fmain\u002Fsimplelr_qwen_level3to5\u002Ftrain.parquet\nwget https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fhkust-nlp\u002FSimpleRL-Zoo-Data\u002Fresolve\u002Fmain\u002Fsimplelr_qwen_level3to5\u002Ftest.parquet\n```\nSee the other folders for the other splits.\n\n\n#### Training\n\n\nThe minimum hardware requirement for training Qwen-2.5-0.5B is a single H\u002FA100-80G GPU. 
To accelerate our experiments, we used 2x8 H100-80G GPUs to train 7B and 14B models for approximately 100 steps over 15 hours using 8K examples. For training the 32B models, we used 8x8 H100-80G GPUs, completing the training in 1.5 days with the same dataset.\n\nThe training process leverages GRPO with Ray and vLLM for acceleration. First, launch the Ray cluster using the commands below:\n```bash\n# launch the master node of ray \nray start --head --node-ip-address 0.0.0.0 --num-gpus 8\n\n# if you want to launch ray on more nodes, use\nray start --address {MASTER-NODE-ADDRESS}:6379  --num-gpus 8\n```\nThe main script for training is `train_grpo_math_tune_ray.sh`. You need to specify the required environment variables in this script. Once configured, submit the training job from the master node.\n\nHere are examples for different models:\n\n* Qwen-2.5-7B (For models between 0.5B and 14B, we use `kl_loss_coef=1e-4`)\n```bash\nbash train_grpo_math_tune_ray.sh --model_name Qwen-2.5-7B --max_response_length 8192  --train_batch_size 1024 --rollout_n 8 --kl_loss_coef 0.0001 --entropy_coeffient 0.001 --rollout_gpu_memory_util 0.75 --rollout_tp 2 --save_freq 5  \n```\n\n* Qwen-2.5-32B (For > 14B models, we use `kl_loss_coef=1e-3`)\n```bash\nbash train_grpo_math_tune_ray.sh --model_name Qwen-2.5-32B --max_response_length 8192  --train_batch_size 1024 --rollout_n 8 --kl_loss_coef 0.001 --entropy_coeffient 0.001 --rollout_gpu_memory_util 0.75 --rollout_tp 8 --save_freq 5  \n```\n\nNote: The run name depends on the model name and specific hyperparameters, which together identify the training job. For example, the command above generates a run name like `verl-grpo_Qwen-2.5-32B_max_response8192_batch1024_rollout8_klcoef0.001_entcoef0.001_simplelr_math_35`. You can find the run name in the terminal output. \n\nFor other models, use the same command, adjusting the `--model_name` argument accordingly. 
\n\n### Evaluate\n\nWe used [Qwen Math's codebase](https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen2.5-Math\u002Ftree\u002Fmain\u002Fevaluation) for evaluation, but for fairness considerations, we completely prohibited solving problems by calling code. The `eval_math_nodes.sh` script provides the full pipeline for evaluation, results collection, and analysis. To use it, you'll need to specify a few environment variables within the script, and then run it as shown below:\n\nExample: \n```bash\nbash eval_math_nodes.sh \\\n    --run_name verl-grpo_Qwen-2.5-32B_max_response8192_batch1024_rollout8_klcoef0.001_entcoef0.001_simplelr_math_35   \\\n    --init_model Qwen-2.5-32B \\\n    --template qwen-boxed  \\\n    --tp_size 8 \\\n    --add_step_0 true  \\\n    --temperature 1.0 \\\n    --top_p 0.95 \\\n    --max_tokens 16000 \\\n    --benchmarks aime24,amc23,math500,olympiadbench,gsm8k,minerva_math \\\n    --n_sampling 1 \n```\n\nAfter running the script, the evaluation results will be saved in `$RUN_NAME\u002Feval_results`, with the metrics from our paper (e.g., clip ratio, average response length, etc.) saved in `$RUN_NAME\u002Feval_results\u002Feval_results.csv`.\n\n### Visualization\n\nTo compare the model's responses across different training steps, we offer a visualization tool that displays the model's reasoning process across various steps and benchmarks using Gradio. 
You can run the following script to access this tool:\n\n```bash\n# install gradio and httpx\npip install gradio\npip install httpx==0.23.0\n\nbash launch_gradio.sh \\\n    --data_dir SimpleRL-verl\u002Fcheckpoints \\\n    --run_names verl-grpo_Qwen-2.5-32B_max_response8192_batch1024_rollout8_klcoef0.001_entcoef0.001_simplelr_math_35  \\\n    --temperature 1.0   # temperature for evaluation\n```\n\n\n\n## Citation\n\nIf you find our paper\u002Fblog or our code useful, we would appreciate it if you could cite our work:\n\nCite our paper:\n```bibtex\n@inproceedings{zeng2025simplerlzooinvestigatingtamingzero,\n      title={SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild}, \n      author={Weihao Zeng and Yuzhen Huang and Qian Liu and Wei Liu and Keqing He and Zejun Ma and Junxian He},\n      booktitle={Second Conference on Language Modeling},\n      year={2025}\n}\n```\n\n\nCite our blog:\n```bibtex\n@misc{zeng2025simplerl,\n      title={7B Model and 8K Examples: Emerging Reasoning with Reinforcement Learning is Both Effective and Efficient},\n      author={Weihao Zeng and Yuzhen Huang and Wei Liu and Keqing He and Qian Liu and Zejun Ma and Junxian He},\n      year={2025},\n      howpublished={\\url{https:\u002F\u002Fhkust-nlp.notion.site\u002Fsimplerl-reason}},\n      note={Notion Blog}\n}\n```\n\n\n## Acknowledgement\nWe implement our reinforcement learning algorithm extending from [Verl](https:\u002F\u002Fgithub.com\u002Fvolcengine\u002Fverl). We utilize [vLLM](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fvllm) for inference and develop evaluation scripts based on [Qwen2.5-Math](https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen2.5-Math\u002Ftree\u002Fmain\u002Fevaluation). 
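Particularly, we thank the developers of DeepSeek-R1 and Kimi-k1.5 for their innovation and contribution to the open-source community.\n\n## Star History\n\n[![Star History Chart](https:\u002F\u002Fapi.star-history.com\u002Fsvg?repos=hkust-nlp\u002FsimpleRL-reason&type=Date)](https:\u002F\u002Fstar-history.com\u002F#hkust-nlp\u002FsimpleRL-reason&Date)\n\n\n","This project aims to improve models' reasoning abilities through a simple reinforcement learning recipe. Its core function is training models with rule-based rewards on the GSM8K\u002FMath datasets, achieving significant performance gains without complex setup. It applies to a range of base models, including Llama3 8B, Mistral 7B\u002F24B, DeepSeekMath 7B, and the Qwen2.5 series. Its technical hallmark is training on limited data (about 8,000 examples), which yields accuracy gains of 10 to more than 20 percentage points across models, along with simultaneous increases in response length and accuracy. The project is especially suited to researchers and developers who need to quickly strengthen a language model's mathematical problem-solving and logical reasoning abilities.",2,"2026-06-11 03:40:54","high_star"]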