[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-73766":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":16,"stars7d":17,"stars30d":18,"stars90d":16,"forks30d":16,"starsTrendScore":16,"compositeScore":19,"rankGlobal":10,"rankLanguage":10,"license":20,"archived":21,"fork":21,"defaultBranch":22,"hasWiki":21,"hasPages":21,"topics":23,"createdAt":10,"pushedAt":10,"updatedAt":31,"readmeContent":32,"aiSummary":33,"trendingCount":16,"starSnapshotCount":16,"syncStatus":34,"lastSyncTime":35,"discoverSource":36},73766,"Awesome-RL-for-LRMs","TsinghuaC3I\u002FAwesome-RL-for-LRMs","TsinghuaC3I","A Survey of Reinforcement Learning for Large Reasoning Models","https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.08827",null,"TeX",2464,130,21,3,0,5,12,62.05,"MIT License",false,"main",[24,25,26,27,28,29,30],"awesome-list","deepseek-r1","llm","lrm","open-source","reasoning","rl","2026-06-12 04:01:11","\u003Cdiv align=\"center\">\n\n\u003Cimg src=\"figs\u002Fsurvey_logo.png\" style=\"width: 70%;\"\u002F>\n\n## A Survey of Reinforcement Learning for Large Reasoning Models\n\n[![Awesome](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FAwesome-0066CC?style=for-the-badge&logo=awesome-lists&logoColor=white)](https:\u002F\u002Fgithub.com\u002Fsindresorhus\u002Fawesome) [![Survey](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.08827)  [![Github](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FAwesome--RL--for--LRMs-000000?style=for-the-badge&logo=github&logoColor=white)](https:\u002F\u002Fgithub.com\u002FTsinghuaC3I\u002FAwesome-RL-Reasoning-Recipes)  [![HF 
Papers](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FHF--Paper-%23FFD14D?style=for-the-badge&logo=huggingface&logoColor=black)](https:\u002F\u002Fhuggingface.co\u002Fpapers\u002F2509.08827)  [![Twitter](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FTwitter-%23000000.svg?style=for-the-badge&logo=x&logoColor=white)](https:\u002F\u002Fx.com\u002FOkhayIea\u002Fstatus\u002F1965989894163235111)\n\n\u003C\u002Fdiv>\n\n> We welcome everyone to open an issue for any related work we haven’t discussed, and we’ll try to address it in the next release!\n\n\n## 🎉 News\n\n- **[2025-11-05]** 🔥 Excited to release our paper list about **Memory for Agents**, covering breakthroughs in Context Management and Learning from Experience powering self-improving AI agents. Check it out: [GitHub](https:\u002F\u002Fgithub.com\u002FTsinghuaC3I\u002FAwesome-Memory-for-Agents)\n- **[2025-10]** 🎉 Honored to give talks at [BAAI](https:\u002F\u002Fevent.baai.ac.cn\u002Factivities\u002F961), [Qingke Talk](https:\u002F\u002Fqingkeai.online\u002Farchives\u002F0h3Cm8Bi) and Tencent Wiztalk! Here are the [slides](Survey@RL4LRM-v1.pdf).\n- **[2025-09-18]** 🎉 We updated the full list of papers to follow the category structure of the survey!\n- **[2025-09-12]** 🎉 Our survey was ranked **#1 Paper of the Day** on 🤗 [Hugging Face Daily Papers](https:\u002F\u002Fhuggingface.co\u002Fpapers\u002F2509.08827)!\n- **[2025-09-11]** 🔥 Excited to release our **RL for LRMs Survey**! We’ll be updating the full list of papers with a new category structure soon. Check it out: [Paper](https:\u002F\u002Fhuggingface.co\u002Fpapers\u002F2509.08827).\n- **[2025-08-15]** 🔥 Introducing **SSRL**: an investigation of Agentic Search RL without reliance on external search engines. 
Check it out: [GitHub](https:\u002F\u002Fgithub.com\u002FTsinghuaC3I\u002FSSRL) and [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.10874).\n- **[2025-05-27]** 🔥 Introducing **MARTI**: A Framework for LLM-based Multi-Agent Reinforced Training and Inference. Check it out: [GitHub](https:\u002F\u002Fgithub.com\u002FTsinghuaC3I\u002FMARTI).\n- **[2025-04-23]** 🔥 Introducing **TTRL**: an open-source solution for online RL on data without ground-truth labels, especially test data. Check it out: [GitHub](https:\u002F\u002Fgithub.com\u002FPRIME-RL\u002FTTRL) and [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.16084).\n- **[2025-03-20]** 🔥 We are excited to introduce a collection of papers and projects on RL for reasoning models!\n\n\n## 🎈 Citation\n\nIf you find this survey helpful, please cite our work:\n\n```bibtex\n@article{zhang2025survey,\n  title={A survey of reinforcement learning for large reasoning models},\n  author={Zhang, Kaiyan and Zuo, Yuxin and He, Bingxiang and Sun, Youbang and Liu, Runze and Jiang, Che and Fan, Yuchen and Tian, Kai and Jia, Guoli and Li, Pengfei and others},\n  journal={arXiv preprint arXiv:2509.08827},\n  year={2025}\n}\n```\n\n## 📖 Contents\n- [A Survey of Reinforcement Learning for Large Reasoning Models](#a-survey-of-reinforcement-learning-for-large-reasoning-models)\n- [🎉 News](#-news)\n- [🎈 Citation](#-citation)\n- [📖 Contents](#-contents)\n- [🗺️ Overview](#️-overview)\n- [📄 Paper List](#-paper-list)\n  - [Frontier Models](#frontier-models)\n  - [Reward Design](#reward-design)\n    - [Generative Rewards](#generative-rewards)\n    - [Dense Rewards](#dense-rewards)\n    - [Unsupervised Rewards](#unsupervised-rewards)\n    - [Rewards Shaping](#rewards-shaping)\n  - [Policy Optimization](#policy-optimization)\n    - [Policy Gradient Objective](#policy-gradient-objective)\n    - [Critic-based Algorithms](#critic-based-algorithms)\n    - [Critic-Free Algorithms](#critic-free-algorithms)\n    - [Off-policy 
Optimization](#off-policy-optimization)\n    - [Off-policy Optimization (Exp replay)](#off-policy-optimization-exp-replay)\n    - [Regularization Objectives](#regularization-objectives)\n  - [Sampling Strategy](#sampling-strategy)\n    - [Dynamic and Structured Sampling](#dynamic-and-structured-sampling)\n    - [Sampling Hyper-Parameters](#sampling-hyper-parameters)\n  - [Training Resource](#training-resource)\n    - [Static Corpus (Code)](#static-corpus-code)\n    - [Static Corpus (STEM)](#static-corpus-stem)\n    - [Static Corpus (Math)](#static-corpus-math)\n    - [Static Corpus (Agent)](#static-corpus-agent)\n    - [Static Corpus (Mix)](#static-corpus-mix)\n    - [Dynamic Environment (Rule-based)](#dynamic-environment-rule-based)\n    - [Dynamic Environment (Code-based)](#dynamic-environment-code-based)\n    - [Dynamic Environment (Game-based)](#dynamic-environment-game-based)\n    - [Dynamic Environment (Model-based)](#dynamic-environment-model-based)\n    - [Dynamic Environment (Ensemble-based)](#dynamic-environment-ensemble-based)\n    - [RL Infrastructure (Primary)](#rl-infrastructure-primary)\n    - [RL Infrastructure (Secondary)](#rl-infrastructure-secondary)\n  - [Applications](#applications)\n    - [Coding Agent](#coding-agent)\n    - [Search Agent](#search-agent)\n    - [Browser-Use Agent](#browser-use-agent)\n    - [DeepResearch Agent](#deepresearch-agent)\n    - [GUI\\&Computer Agent](#guicomputer-agent)\n    - [Recommendation Agent](#recommendation-agent)\n    - [Agent (Others)](#agent-others)\n    - [Code Generation](#code-generation)\n    - [Software Engineering](#software-engineering)\n    - [Multimodal Understanding](#multimodal-understanding)\n    - [Multimodal Generation](#multimodal-generation)\n    - [Robotics Tasks](#robotics-tasks)\n    - [Multi-Agent Systems](#multi-agent-systems)\n    - [Scientific Tasks](#scientific-tasks)\n- [🌟 Acknowledgment](#-acknowledgment)\n- [✨ Star History](#-star-history)\n\n\n## 🗺️ Overview\n\nOur survey 
provides a comprehensive examination of **Reinforcement Learning for Large Reasoning Models**.\n\n\u003Cp align=\"center\">\n   \u003Cimg src=\"figs\u002Fteaser.png\" alt=\"Overview of RL for LRMs Survey\" style=\"width: 100%;\">\n\u003C\u002Fp>\n\nWe organize the survey into five main sections:\n\n1. \u003Cu>Foundational Components:\u003C\u002Fu> Reward design, policy optimization, and sampling strategies\n2. \u003Cu>Foundational Problems:\u003C\u002Fu> Key debates and challenges in RL for LRMs\n3. \u003Cu>Training Resources:\u003C\u002Fu> Static corpora, dynamic environments, and infrastructure\n4. \u003Cu>Applications:\u003C\u002Fu> Real-world implementations across diverse domains\n5. \u003Cu>Future Directions:\u003C\u002Fu> Emerging research opportunities and challenges\n\n## 📄 Paper List\n\n### Frontier Models\n\n| Date | Name | Title | Paper | Github |\n|:-:|:-:|:-|:-:|:-:|\n| 2025-08 | `Intern-S1` | Intern-S1: A Scientific Multimodal Foundation Model | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.15763v1) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FInternLM\u002FIntern-S1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FInternLM\u002FIntern-S1) |\n| 2025-08 | `GLM-4.5` | GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.06471) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fzai-org\u002FGLM-4.5?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fzai-org\u002FGLM-4.5) |\n| 2025-08 | `gpt-oss` | gpt-oss-120b & gpt-oss-20b Model Card | 
[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.10925) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fopenai\u002Fgpt-oss?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fgpt-oss) |\n| 2025-08 | `InternVL3.5` | InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.18265) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenGVLab\u002FInternVL?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FInternVL) |\n| 2025-07 | `Kimi K2` | Kimi K2: Open Agentic Intelligence | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.20534) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FMoonshotAI\u002FKimi-K2?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FMoonshotAI\u002FKimi-K2) |\n| 2025-07 | `Step 3` | Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.19427) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fstepfun-ai\u002FStep3?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fstepfun-ai\u002FStep3) |\n| 2025-07 | `GLM-4.1V-Thinking` | GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with 
Scalable Reinforcement Learning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.01006) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fzai-org\u002FGLM-V?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fzai-org\u002FGLM-V) |\n| 2025-07 | `Skywork-R1V3` | Skywork-R1V3 Technical Report | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.06167) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FSkyworkAI\u002FSkywork-R1V?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FSkyworkAI\u002FSkywork-R1V) |\n| 2025-07 | `GLM-4.5V` | GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.01006) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fzai-org\u002FGLM-V?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fzai-org\u002FGLM-V) |\n| 2025-06 | `Magistral` | Magistral | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.10910) | - |\n| 2025-06 | `Minimax-M1` | MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.13585) | [![GitHub 
Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FMiniMax-AI\u002FMiniMax-M1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FMiniMax-AI\u002FMiniMax-M1) |\n| 2025-05 | `MiMo` | MiMo: Unlocking the Reasoning Potential of Language Model -- From Pretraining to Posttraining | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.07608) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FXiaomiMiMo\u002FMiMo?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FXiaomiMiMo\u002FMiMo) |\n| 2025-05 | `Qwen3` | Qwen3 Technical Report | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.09388) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FQwenLM\u002FQwen3?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen3) |\n| 2025-05 | `Llama-Nemotron-Ultra` | Llama-Nemotron: Efficient Reasoning Models | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.00949) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FNVIDIA\u002FMegatron-LM?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FMegatron-LM) |\n| 2025-05 | `INTELLECT-2` | INTELLECT-2: A Reasoning Model Trained Through Globally Decentralized Reinforcement Learning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.07291) | - |\n| 2025-05 | `Hunyuan-TurboS` | Hunyuan-TurboS: 
Advancing Large Language Models through Mamba-Transformer Synergy and Adaptive Chain-of-Thought | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.15431) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTencent\u002FHunyuan-TurboS?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FTencent\u002FHunyuan-TurboS) |\n| 2025-05 | `Skywork OR-1` | Skywork Open Reasoner 1 Technical Report | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.22312) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FSkyworkAI\u002FSkywork-OR1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FSkyworkAI\u002FSkywork-OR1) |\n| 2025-04 | `Phi-4 Reasoning` | Phi-4-reasoning Technical Report | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.21318) | - |\n| 2025-04 | `Skywork-R1V2` | Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.16656) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FSkyworkAI\u002FSkywork-R1V?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FSkyworkAI\u002FSkywork-R1V) |\n| 2025-04 | `InternVL3` | InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models | 
[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.10479) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenGVLab\u002FInternVL?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FInternVL) |\n| 2025-03 | `ORZ` | Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.24290) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpen-Reasoner-Zero\u002FOpen-Reasoner-Zero?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FOpen-Reasoner-Zero\u002FOpen-Reasoner-Zero) |\n| 2025-01 | `DeepSeek-R1` | DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.12948) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdeepseek-ai\u002FDeepSeek-R1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FDeepSeek-R1) |\n| - | `QwQ` | QwQ-32B: Embracing the Power of Reinforcement Learning | [![Blog](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FBlog-1F4E79?style=for-the-badge)](https:\u002F\u002Fqwenlm.github.io\u002Fblog\u002Fqwq-32b\u002F) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FQwenLM\u002FQwQ?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwQ) |\n| - | `Seed-OSS` | Seed-OSS Open-Source Models | 
[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPaper-6772E5?style=for-the-badge)](https:\u002F\u002Fgithub.com\u002FByteDance-Seed\u002Fseed-oss) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FByteDance-Seed\u002Fseed-oss?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FByteDance-Seed\u002Fseed-oss) |\n| - | `ERNIE-4.5-Thinking` | ERNIE 4.5 Technical Report | [![Blog](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FBlog-1F4E79?style=for-the-badge)](https:\u002F\u002Fernie.baidu.com\u002Fblog\u002Fpublication\u002FERNIE_Technical_Report.pdf) | - |\n\n\n### Reward Design\n#### Generative Rewards\n\n| Date | Name | Title | Paper | Github |\n|:-:|:-:|:-|:-:|:-:|\n| 2025-08 | `CAPO` | CAPO: Towards Enhancing LLM Reasoning through Verifiable Generative Credit Assignment | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.02298) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fandyclsr\u002FCAPO?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fandyclsr\u002FCAPO) |\n| 2025-08 | `CompassVerifier` | CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.03686) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fopen-compass\u002FCompassVerifier?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fopen-compass\u002FCompassVerifier) |\n| 2025-08 | `Cooper` | Cooper: Co-Optimizing Policy and Reward Models in Reinforcement Learning for Large Language Models | 
[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.05613) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fzju-real\u002Fcooper?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fzju-real\u002Fcooper) |\n| 2025-08 | `ReviewRL` | ReviewRL: Towards Automated Scientific Review with RL | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.10308) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTsinghuaC3I\u002FMARTI?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FTsinghuaC3I\u002FMARTI) |\n| 2025-08 | `Rubicon` | Reinforcement Learning with Rubric Anchors | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.12790) | - |\n| 2025-08 | `RuscaRL` | Breaking the Exploration Bottleneck: Rubric-Scaffolded Reinforcement Learning for General LLM Reasoning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.16949) | - |\n| 2025-07 | `OMNI-THINKER` | OMNI-THINKER: Scaling Cross-Domain Generalization in LLMs via Multi-Task RL with Hybrid Rewards | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.14783) | - |\n| 2025-07 | `URPO` | URPO: A Unified Reward & Policy Optimization Framework for Large Language Models | 
[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.17515) | - |\n| 2025-07 | `RaR` | Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.17746) | - |\n| 2025-07 | `RLCF` | Checklists Are Better Than Reward Models For Aligning Language Models | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.18624) | - |\n| 2025-07 | `PCL` | Post-Completion Learning for Language Models | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.20252) | - |\n| 2025-07 | `K2` | Kimi K2: Open Agentic Intelligence | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.20534) | - |\n| 2025-07 | `LIBRA` | LIBRA: Assessing and Improving Reward Model by Learning to Think | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.21645) | - |\n| 2025-07 | `TP-GRPO` | Good Learners Think Their Thinking: Generative PRM Makes Large Reasoning Model More Efficient Math Learner | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.23317) | [![GitHub 
Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fcs-holder\u002Ftp_grpo?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fcs-holder\u002Ftp_grpo) |\n| 2025-06 | `RewardAnything` | RewardAnything: Generalizable Principle-Following Reward Models | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.03637) | [![Blog](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FBlog-1F4E79?style=for-the-badge)](https:\u002F\u002Fzhuohaoyu.github.io\u002FRewardAnything\u002F) |\n| 2025-06 | `Writing-Zero` | Writing-Zero: Bridge the Gap Between Non-verifiable Tasks and Verifiable Rewards | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.00103) | - |\n| 2025-06 | `Critique-GRPO` | Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.03106) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fzhangxy-2019\u002Fcritique-GRPO?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fzhangxy-2019\u002Fcritique-GRPO) |\n| 2025-06 | `PAG` | PAG: Multi-Turn Reinforced LLM Self-Correction with Policy as Generative Verifier | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.10406) | - |\n| 2025-06 | `GRAM` | GRAM: A Generative Foundation Reward Model for Reward Generalization | 
[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.14175) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FNiuTrans\u002FGRAM?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FNiuTrans\u002FGRAM) |\n| 2025-06 | `ProxyReward` | From General to Targeted Rewards: Surpassing GPT-4 in Open-Ended Long-Context Generation | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.16024) | - |\n| 2025-06 | `QA-LIGN` | QA-LIGN: Aligning LLMs through Constitutionally Decomposed QA | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.08123) | - |\n| 2025-05 | `RM-R1` | RM-R1: Reward Modeling as Reasoning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.02387) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FRM-R1-UIUC\u002FRM-R1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FRM-R1-UIUC\u002FRM-R1) |\n| 2025-05 | `J1` | J1: Incentivizing Thinking in LLM-as-a-Judge via RL | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.10320) | - |\n| 2025-05 | `TinyV` | TinyV: Reducing False Negatives in Verification Improves RL for LLM Reasoning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.14625) | [![GitHub 
Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fuw-nsl\u002FTinyV?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fuw-nsl\u002FTinyV) |\n| 2025-05 | `General-Reasoner` | General-Reasoner: Advancing LLM Reasoning Across All Domains | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.14652) | - |\n| 2025-05 | `RRM` | Reward Reasoning Model | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.14674) | - |\n| 2025-05 | `RL Tango` | RL Tango: Reinforcing Generator and Verifier Together for Language Reasoning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.15034) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fkaiwenzha\u002Frl-tango?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fkaiwenzha\u002Frl-tango) |\n| 2025-05 | `Think-RM` | Think-RM: Enabling Long-Horizon Reasoning in Generative Reward Models | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.16265) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FIlgeeHong\u002FThink-RM?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FIlgeeHong\u002FThink-RM) |\n| 2025-04 | `JudgeLRM` | JudgeLRM: Large Reasoning Models as a Judge | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.00050) | [![GitHub 
Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FNuoJohnChen\u002FJudgeLRM?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FNuoJohnChen\u002FJudgeLRM) |\n| 2025-04 | `GenPRM` | GenPRM: Scaling Test-Time Compute of Process Reward Models via Generative Reasoning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.00891) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FRyanLiu112\u002FGenPRM?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FRyanLiu112\u002FGenPRM) |\n| 2025-04 | `DeepSeek-GRM` | Inference-Time Scaling for Generalist Reward Modeling | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.02495) | - |\n| 2025-04 | `AIR` | AIR: A Systematic Analysis of Annotations, Instructions, and Response Pairs in Preference Dataset | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.03612) | - |\n| 2025-04 | `Pairwise-RL` | A Unified Pairwise Framework for RLHF: Bridging Generative Reward Modeling and Policy Optimization | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.04950) | - |\n| 2025-04 | `xVerify` | xVerify: Efficient Answer Verifier for Reasoning Model Evaluations | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.10481) | [![GitHub 
Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FIAAR-Shanghai\u002FxVerify?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FIAAR-Shanghai\u002FxVerify) |\n| 2025-04 | `Seed-Thinking-v1.5` | Seed1.5-Thinking: Advancing Superb Reasoning Models with Reinforcement Learning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.13914) | - |\n| 2025-04 | `ThinkPRM` | Process Reward Models That Think | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.16828) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmukhal\u002Fthinkprm?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fmukhal\u002Fthinkprm) |\n| 2025-03 | - | Crossing the Reward Bridge: Expanding RL with Verifiable Rewards Across Diverse Domains | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.23829) | - |\n| 2025-02 | - | Self-rewarding correction for mathematical reasoning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.19613) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FRLHFlow\u002FSelf-rewarding-reasoning-LLM?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FRLHFlow\u002FSelf-rewarding-reasoning-LLM) |\n| 2024-10 | `GenRM` | Generative Reward Models | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.12832) | - |\n| 
2024-08 | `CLoud` | Critique-out-Loud Reward Models | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.11791) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fzankner\u002FCLoud?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fzankner\u002FCLoud) |\n| 2024-08 | `Generative Verifier` | Generative Verifiers: Reward Modeling as Next-Token Prediction | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.15240) | - |\n| 2024-01 | `Self-Rewarding LM` | Self-Rewarding Language Models | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.10020) | - |\n| 2023-10 | `Auto-J` | Generative Judge for Evaluating Alignment | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.05470) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FGAIR-NLP\u002Fauto-j?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FGAIR-NLP\u002Fauto-j) |\n| 2023-06 | `Judge LLM-as-a-Judge` | Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.05685) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Flm-sys\u002FFastChat?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Flm-sys\u002FFastChat) |\n\n#### Dense Rewards\n\n| Date | Name | Title | Paper | Github 
|\n|:-:|:-:|:-|:-:|:-:|\n| 2025-09 | `Tree-GRPO` | Tree Search for LLM Agent Reinforcement Learning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.21240) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAMAP-ML\u002FTree-GRPO?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FAMAP-ML\u002FTree-GRPO) |\n| 2025-09 | `AttnRL` | Attention as a Compass: Efficient Exploration for Process-Supervised RL in Reasoning Models | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2509.26628) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FRyanLiu112\u002FAttnRL?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FRyanLiu112\u002FAttnRL) |\n| 2025-09 | `TARL` | Process-Supervised Reinforcement Learning for Interactive Multimodal Tool-Use Agents | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2509.14480) | - |\n| 2025-09 | `PROF` | Beyond Correctness: Harmonizing Process and Outcome Rewards through RL Training | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.03403) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FChenluye99\u002FPROF?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FChenluye99\u002FPROF) |\n| 2025-09 | `HICRA` | Emergent Hierarchical Reasoning in LLMs through Reinforcement Learning | 
[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.03646) | - |\n| 2025-08 | `KlearReasoner` | Klear-Reasoner: Advancing Reasoning Capability via Gradient-Preserving Clipping Policy Optimization | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.07629) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FKwai-Klear\u002FKlearReasoner?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FKwai-Klear\u002FKlearReasoner) |\n| 2025-08 | `CAPO` | CAPO: Towards Enhancing LLM Reasoning through Verifiable Generative Credit Assignment | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.02298) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fandyclsr\u002FCAPO?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fandyclsr\u002FCAPO) |\n| 2025-08 | `GTPO & GRPO-S` | GTPO and GRPO-S: Token and Sequence-Level Reward Shaping with Policy Entropy | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.04349) | - |\n| 2025-08 | `VSRM` | Promoting Efficient Reasoning with Verifiable Stepwise Reward | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.10293) | - |\n| 2025-08 | `G-RA` | Stabilizing Long-term Multi-turn Reinforcement Learning with Gated Rewards | 
[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.10548) | - |\n| 2025-08 | `SSPO` | SSPO: Self-traced Step-wise Preference Optimization for Process Supervision and Reasoning Compression | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.12604) | - |\n| 2025-08 | `AIRL-S` | Your Reward Function for RL is Your Best PRM for Search: Unifying RL and Search-Based TTS | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.14313) | - |\n| 2025-08 | `TreePO` | TreePO: Bridging the Gap of Policy Optimization and Efficacy and Inference Efficiency with Heuristic Tree-based Modeling | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.17445) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmultimodal-art-projection\u002FTreePO?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fmultimodal-art-projection\u002FTreePO) |\n| 2025-08 | `MUA-RL` | MUA-RL: Multi-turn User-interacting Agent Reinforcement Learning for agentic tool use | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.18669) | - |\n| 2025-07 | `SPRO` | Self-Guided Process Reward Optimization with Redefined Step-wise Advantage for Process Reinforcement | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.01551) | - |\n| 2025-07 | `FR3E` | First Return, 
Entropy-Eliciting Explore | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.07017) | - |\n| 2025-07 | `ARPO` | Agentic Reinforced Policy Optimization | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.19849) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FRUC-NLPIR\u002FARPO?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FRUC-NLPIR\u002FARPO) |\n| 2025-07 | `TP-GRPO` | Good Learners Think Their Thinking: Generative PRM Makes Large Reasoning Model More Efficient Math Learner | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.23317) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fcs-holder\u002Ftp_grpo?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fcs-holder\u002Ftp_grpo) |\n| 2025-06 | `TreeRPO` | TreeRPO: Tree Relative Policy Optimization | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.05183) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fyangzhch6\u002FTreeRPO?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fyangzhch6\u002FTreeRPO) |\n| 2025-06 | `TreeRL` | TreeRL: LLM Reinforcement Learning with On-Policy Tree Search | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.11902) | [![GitHub 
Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTHUDM\u002FTreeRL?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FTHUDM\u002FTreeRL) |\n| 2025-06 | `Entropy Advantage` | Reasoning with Exploration: An Entropy Perspective on Reinforcement Learning for LLMs | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.14758) | - |\n| 2025-06 | `ReasonFlux-PRM` | ReasonFlux-PRM: Trajectory-Aware PRMs for Long Chain-of-Thought Reasoning in LLMs | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.18896) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FGen-Verse\u002FReasonFlux?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FGen-Verse\u002FReasonFlux) |\n| 2025-05 | `S-GRPO` | S-GRPO: Early Exit via Reinforcement Learning in Reasoning Models | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.07686) | - |\n| 2025-05 | `GiGPO` | Group-in-Group Policy Optimization for LLM Agent Training | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.10978) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FlangfengQ\u002Fverl-agent?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FlangfengQ\u002Fverl-agent) |\n| 2025-05 | - | Reinforcing Multi-Turn Reasoning in LLM Agents via Turn-Level Credit Assignment | 
[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.11821) | - |\n| 2025-05 | `Tango` | RL Tango: Reinforcing Generator and Verifier Together for Language Reasoning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.15034) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fkaiwenzha\u002Frl-tango?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fkaiwenzha\u002Frl-tango) |\n| 2025-05 | `StepSearch` | StepSearch: Igniting LLMs Search Ability via Step-Wise Proximal Policy Optimization | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.15107) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FZillwang\u002FStepSearch?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FZillwang\u002FStepSearch) |\n| 2025-05 | - | Aligning Dialogue Agents with Global Feedback via Large Language Model Reward Decomposition | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.15922) | - |\n| 2025-05 | `Tool-Star` | Tool-Star: Empowering LLM-Brained Multi-Tool Reasoner via Reinforcement Learning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.16410) | [![GitHub 
Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdongguanting\u002FTool-Star?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fdongguanting\u002FTool-Star) |\n| 2025-05 | `SPA-RL` | SPA-RL: Reinforcing LLM Agents via Stepwise Progress Attribution | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.20732) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FWangHanLinHenry\u002FSPA-RL-Agent?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FWangHanLinHenry\u002FSPA-RL-Agent) |\n| 2025-05 | `SPO` | Segment Policy Optimization: Effective Segment-Level Credit Assignment in RL for Large Language Models | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.23564) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAIFrameResearch\u002FSPO?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FAIFrameResearch\u002FSPO) |\n| 2025-04 | `GenPRM` | GenPRM: Scaling Test-Time Compute of Process Reward Models via Generative Reasoning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.00891) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FRyanLiu112\u002FGenPRM?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FRyanLiu112\u002FGenPRM) |\n| 2025-04 | `PURE` | Stop Summation: Min-Form Credit Assignment Is All Process Reward Model Needs for Reasoning | 
[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.15275) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FCJReinforce\u002FPURE?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FCJReinforce\u002FPURE) |\n| 2025-03 | `MRT` | Optimizing Test-Time Compute via Meta Reinforcement Fine-Tuning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.07572) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FCMU-AIRe\u002FMRT?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FCMU-AIRe\u002FMRT) |\n| 2025-03 | `SWEET-RL` | SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.15478) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Ffacebookresearch\u002Fsweet_rl?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fsweet_rl) |\n| 2025-02 | `PRIME` | Process Reinforcement through Implicit Rewards | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.01456) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FPRIME-RL\u002FPRIME?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FPRIME-RL\u002FPRIME) |\n| 2024-12 | `Implicit PRM` | Free Process Rewards without Process Labels | 
[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.01981) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FPRIME-RL\u002FImplicitPRM?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FPRIME-RL\u002FImplicitPRM) |\n| 2024-10 | `VinePPO` | VinePPO: Refining Credit Assignment in RL Training of LLMs | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.01679) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FMcGill-NLP\u002FVinePPO?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FMcGill-NLP\u002FVinePPO) |\n| 2024-10 | `PAV` | Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.08146) | - |\n| 2024-04 | - | From $r$ to $Q^*$: Your Language Model is Secretly a Q-Function | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.12358) | - |\n| 2024-03 | `GELI` | Improving Dialogue Agents by Decomposing One Global Explicit Annotation with Local Implicit Multimodal Feedback | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.11330) | - |\n| 2023-12 | `Math-Shepherd` | Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations | 
[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.08935) | - |\n| 2023-05 | `PRM800K` | Let's Verify Step by Step | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.20050) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fopenai\u002Fprm800k?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fprm800k) |\n| 2022-11 | - | Solving math word problems with process- and outcome-based feedback | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2211.14275) | - |\n\n#### Unsupervised Rewards\n\n| Date | Name | Title | Paper | Github |\n|:-:|:-:|:-|:-:|:-:|\n| 2025-09 | `Vision-Zero` | Vision-Zero: Scalable VLM Self-Improvement via Strategic Gamified Self-Play | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.25541) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fwangqinsi1\u002FVision-Zero?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fwangqinsi1\u002FVision-Zero) |\n| 2025-08 | `Co-Reward` | Co-Reward: Self-supervised Reinforcement Learning for Large Language Model Reasoning via Contrastive Agreement | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.00410) | [![GitHub 
Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Ftmlr-group\u002FCo-Reward?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Ftmlr-group\u002FCo-Reward) |\n| 2025-08 | `SQLM` | Self-Questioning Language Models | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.03682) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Flili-chen\u002Fself-questioning-lm?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Flili-chen\u002Fself-questioning-lm) |\n| 2025-08 | `R-zero` | R-Zero: Self-Evolving Reasoning LLM from Zero Data | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.05004) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FChengsong-Huang\u002FR-Zero?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FChengsong-Huang\u002FR-Zero) |\n| 2025-08 | `ETTRL` | ETTRL: Balancing Exploration and Exploitation in LLM Test-Time Reinforcement Learning Via Entropy Mechanism | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.11356) | - |\n| 2025-07 | `RLSF` | Post-Training Large Language Models via Reinforcement Learning from Self-Feedback | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.21931) | - |\n| 2025-06 | `RLSC` | Confidence Is All You Need: Few-Shot RL Fine-Tuning of Language Models | 
[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.06395) | - |\n| 2025-06 | `RPT` | Reinforcement Pre-Training | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.08007) | - |\n| 2025-06 | `CoVo` | Consistent Paths Lead to Truth: Self-Rewarding Reinforcement Learning for LLM Reasoning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.08745) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fsastpg\u002FCoVo?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fsastpg\u002FCoVo) |\n| 2025-06 | `SEAL` | Self-Adapting Language Models | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.10943) | - |\n| 2025-06 | `Spurious Rewards` | Spurious Rewards: Rethinking Training Signals in RLVR | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.10947) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fruixin31\u002FSpurious_Rewards?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fruixin31\u002FSpurious_Rewards) |\n| 2025-06 | `No Free Lunch` | No Free Lunch: Rethinking Internal Feedback for LLM Reasoning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.17219) | - |\n| 2025-05 | `Absolute Zero` | Absolute Zero: Reinforced Self-play Reasoning 
with Zero Data | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.03335) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FLeapLabTHU\u002FAbsolute-Zero-Reasoner?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FLeapLabTHU\u002FAbsolute-Zero-Reasoner) |\n| 2025-05 | `EM-RL` | The Unreasonable Effectiveness of Entropy Minimization in LLM Reasoning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.15134) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fshivamag125\u002FEM_PT?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fshivamag125\u002FEM_PT) |\n| 2025-05 | `SSR-Zero` | SSR-Zero: Simple Self-Rewarding Reinforcement Learning for Machine Translation | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.16637) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FKelaxon\u002FSSR-Zero?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FKelaxon\u002FSSR-Zero) |\n| 2025-05 | - | Surrogate Signals from Format and Length: Reinforcement Learning for Solving Mathematical Problems without Ground Truth Answers | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.19439) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FinsightLLM\u002Frl-without-gt?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FinsightLLM\u002Frl-without-gt) 
|\n| 2025-05 | `RLIF` | Learning to Reason without External Rewards | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.19590) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fsunblaze-ucb\u002FIntuitor?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fsunblaze-ucb\u002FIntuitor) |\n| 2025-05 | `SeRL` | SeRL: Self-Play Reinforcement Learning for Large Language Models with Limited Data | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.20347) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fwantbook-book\u002FSeRL?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fwantbook-book\u002FSeRL) |\n| 2025-05 | `SRT` | Can Large Reasoning Models Self-Train? 
| [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.21444) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Ftajwarfahim\u002Fsrt?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Ftajwarfahim\u002Fsrt) |\n| 2025-05 | `RENT-RL` | Maximizing Confidence Alone Improves Reasoning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.22660) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fsatrams\u002Frent-rl?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fsatrams\u002Frent-rl) |\n| 2025-04 | `EMPO` | Right Question is Already Half the Answer: Fully Unsupervised LLM Reasoning Incentivization | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.05812) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FQingyangZhang\u002FEMPO?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FQingyangZhang\u002FEMPO) |\n| 2025-04 | `TRANS-ZERO` | TRANS-ZERO: Self-Play Incentivizes Large Language Models for Multilingual Translation Without Parallel Data | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.14669) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FNJUNLP\u002Ftrans0?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FNJUNLP\u002Ftrans0) |\n| 2025-04 | `TTRL` | TTRL: Test-Time Reinforcement Learning | 
[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.16084) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FPRIME-RL\u002FTTRL?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FPRIME-RL\u002FTTRL) |\n| 2025-04 | `One-Shot-RLVR` | Reinforcement Learning for Reasoning in Large Language Models with One Training Example | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.20571) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fypwang61\u002FOne-Shot-RLVR?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fypwang61\u002FOne-Shot-RLVR) |\n| 2025-02 | `CAGSR` | A Self-Supervised Reinforcement Learning Approach for Fine-Tuning Large Language Models Using Cross-Attention Signals | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.10482) | - |\n| 2024-07 | `MINIMO` | Learning Formal Mathematics From Intrinsic Motivation | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.00695) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fgpoesia\u002Fminimo?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fgpoesia\u002Fminimo) |\n\n#### Rewards Shaping\n\n| Date | Name | Title | Paper | Github |\n|:-:|:-:|:-|:-:|:-:|\n| 2025-09 | `CDE` | CDE: Curiosity-Driven Exploration for Efficient Reinforcement Learning in Large Language Models | 
[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.09675) | - |\n| 2025-09 | `DARLING` | Jointly Reinforcing Diversity and Quality in Language Model Generations | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.02534) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Ffacebookresearch\u002Fdarling?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fdarling) |\n| 2025-09 | `DRER` | Rethinking Reasoning Quality in Large Language Models through Enhanced Chain-of-Thought via RL | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.06024) | - |\n| 2025-09 | `OBE` | Outcome-based Exploration for LLM Reasoning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.06941) | - |\n| 2025-08 | `Pass@kTraining` | Pass@k Training for Adaptively Balancing Exploration and Exploitation of Large Reasoning Models | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.10751) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FRUCAIBox\u002FPassk_Training?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FRUCAIBox\u002FPassk_Training) |\n| 2025-05 | `PKPO` | Pass@K Policy Optimization: Solving Harder Reinforcement Learning Problems | 
[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.15201) | - |\n| 2025-05 | `rl-without-gt` | Surrogate Signals from Format and Length: Reinforcement Learning for Solving Mathematical Problems without Ground Truth Answers | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.19439) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FinsightLLM\u002Frl-without-gt?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FinsightLLM\u002Frl-without-gt) |\n| 2025-03 | `CrossDomain-RLVR` | Crossing the Reward Bridge: Expanding RL with Verifiable Rewards Across Diverse Domains | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.23829) | - |\n| 2025-01 | `DeepSeek-R1` | DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.12948) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdeepseek-ai\u002FDeepSeek-R1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FDeepSeek-R1) |\n| 2024-09 | `Qwen2.5-Math` | Qwen2.5-Math technical report: Toward mathematical expert model via self-improvement | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.12122) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FQwenLM\u002FQwen2.5-Math?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen2.5-Math) |\n\n### Policy Optimization\n#### Policy Gradient Objective\n\n| Date | Name | Title | Paper | Github |\n|:-:|:-:|:-|:-:|:-:|\n| 2017-07 | `PPO` | Proximal policy optimization algorithms | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1707.06347) | - |\n| - | `PG` | Policy gradient methods for reinforcement learning with function approximation. | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPaper-6772E5?style=for-the-badge)](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper_files\u002Fpaper\u002F1999\u002Ffile\u002F464d828b85b0bed98e80ade0a5c43b0f-Paper.pdf) | - |\n| - | `REINFORCE` | Simple statistical gradient-following algorithms for connectionist reinforcement learning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPaper-6772E5?style=for-the-badge)](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002F10.1007\u002FBF00992696) | - |\n| - | `TRPO` | Trust region policy optimization | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPaper-6772E5?style=for-the-badge)](https:\u002F\u002Fproceedings.mlr.press\u002Fv37\u002Fschulman15.pdf) | - |\n\n#### Critic-based Algorithms\n\n| Date | Name | Title | Paper | Github |\n|:-:|:-:|:-|:-:|:-:|\n| 2025-08 | `VL-DAC` | Enhancing Vision-Language Model Training with Reinforcement Learning in Synthetic Worlds for Real-World Success | 
[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.04280) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fcorl-team\u002FVL-DAC?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fcorl-team\u002FVL-DAC) |\n| 2025-08 | `VRPO` | VRPO: Rethinking Value Modeling for Robust RL Training under Noisy Supervision | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2508.03058) | - |\n| 2025-05 | `VerIPO` | VerIPO: Long Reasoning Video-R1 Model with Iterative Policy Optimization | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.19000) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FHITsz-TMG\u002FVerIPO?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FHITsz-TMG\u002FVerIPO) |\n| 2025-04 | `VAPO` | VAPO: Efficient and reliable reinforcement learning for advanced reasoning tasks | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2504.05118?) | - |\n| 2025-03 | `VCPPO` | What’s Behind PPO’s Collapse in Long-CoT? 
Value Optimization Holds the Secret | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2503.01491) | - |\n| 2025-03 | `Open reasoner-zero` | open reasoner-zero: An open source approach to scaling up reinforcement learning on the base model | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2503.24290) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpen-Reasoner-Zero\u002FOpen-Reasoner-Zero?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FOpen-Reasoner-Zero\u002FOpen-Reasoner-Zero) |\n| 2025-02 | `PRIME` | PROCESS REINFORCEMENT THROUGH IMPLICIT REWARDS | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2502.01456) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FPRIME-RL\u002FPRIME?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FPRIME-RL\u002FPRIME) |\n| 2024-12 | `Implicit PRM` | FREE PROCESS REWARDS WITHOUT PROCESS LABELS | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.01981) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Flifan-yuan\u002FImplicitPRM?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Flifan-yuan\u002FImplicitPRM) |\n| 2023-12 | `Math-shepherd` | Math-shepherd: Verify and reinforce LLMs step-by-step without human annotations | 
[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.08935) | - |\n| 2015-06 | `GAE` | High-dimensional continuous control using generalized advantage estimation | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1506.02438) | - |\n| - | `Autopsv` | Autopsv: Automated process-supervised verifier. | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPaper-6772E5?style=for-the-badge)](https:\u002F\u002Fproceedings.neurips.cc\u002Fpaper_files\u002Fpaper\u002F2024\u002Ffile\u002F9246aa822579d9b29a140ecdac36ad60-Paper-Conference.pdf) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Frookie-joe\u002FAutoPSV?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Frookie-joe\u002FAutoPSV) |\n\n#### Critic-Free Algorithms\n\n| Date | Name | Title | Paper | Github |\n|:-:|:-:|:-|:-:|:-:|\n| 2025-09 | `UPGE` | Towards a Unified View of Large Language Model Post-Training | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2509.04419) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTsinghuaC3I\u002FUnify-Post-Training?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FTsinghuaC3I\u002FUnify-Post-Training) |\n| 2025-09 | `SPO` | Single-stream Policy Optimization | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.13232) | - |\n| 2025-08 | `LitePPO` | Part I: Tricks or Traps? 
A Deep Dive into RL for LLM Reasoning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2508.08221v1) | - |\n| 2025-07 | `R1-RE` | R1-RE: Cross-Domain Relation Extraction with RLVR | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.04642) | - |\n| 2025-07 | `GSPO` | Group Sequence Policy Optimization | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2507.18071) | - |\n| 2025-06 | `CISPO` | MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2506.13585) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FMiniMax-AI\u002FMiniMax-M1?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FMiniMax-AI\u002FMiniMax-M1) |\n| 2025-05 | `KRPO` | Kalman Filter Enhanced Group Relative Policy Optimization for Language Model Reasoning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.07527) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fbillhhh\u002FKRPO_LLMs_RL?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002Fbillhhh\u002FKRPO_LLMs_RL) |\n| 2025-05 | `CPGD` | CPGD: Toward Stable Rule-based Reinforcement Learning for Language Models | 
[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.12504) | [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FModalMinds\u002FMM-EUREKA?style=for-the-badge&logo=github&label=GitHub&color=black)](https:\u002F\u002Fgithub.com\u002FModalMinds\u002FMM-EUREKA) |\n| 2025-05 | `NFT` | Bridging Supervised Learning and Reinforcement Learning in Math Reasoning | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.18116) | - |\n| 2025-05 | `Clip-Cov\u002FKL-Cov` | The Entropy Mechanism of Reinforcement Learning for Reasonin","This project surveys reinforcement learning techniques for large reasoning models. It provides a comprehensive analysis of how reinforcement learning can be applied to large language models to strengthen their reasoning abilities, covering recent research results and application cases. It is written in TeX, which keeps the document professional and readable. It is suited to researchers, developers, and anyone interested in combining reinforcement learning with large models, especially readers who want to understand or explore the frontier of this area in depth.",2,"2026-06-11 03:47:18","high_star"]