[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-72423":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":18,"compositeScore":20,"rankGlobal":10,"rankLanguage":10,"license":21,"archived":22,"fork":22,"defaultBranch":23,"hasWiki":24,"hasPages":22,"topics":25,"createdAt":10,"pushedAt":10,"updatedAt":26,"readmeContent":27,"aiSummary":28,"trendingCount":16,"starSnapshotCount":16,"syncStatus":29,"lastSyncTime":30,"discoverSource":31},72423,"RAGEN","mll-lab-nu\u002FRAGEN","mll-lab-nu","RAGEN leverages reinforcement learning to train LLM reasoning agents in interactive, stochastic environments.","https:\u002F\u002Fragen-ai.github.io",null,"Python",2695,226,23,28,0,6,18,41,84.17,"MIT License",false,"main",true,[],"2026-06-12 04:01:05","\u003Ch1 align=\"center\">RAGEN: Training Agents by Reinforcing Reasoning\u003C\u002Fh1>\n\u003Ch3 align=\"center\">\u003Cem>Diagnose agent failure modes. 
Make your RL training better.\u003C\u002Fem>\u003C\u002Fh3>\n\n\u003Cp align=\"center\">\u003Cimg src=\"public\u002Fragen_logo.jpeg\" width=\"300px\" alt=\"RAGEN icon\" \u002F>\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  \u003Cstrong>RAGEN\u003C\u002Fstrong> (\u003Cb>R\u003C\u002Fb>easoning \u003Cb>AGEN\u003C\u002Fb>T) is a flexible RL framework for training reasoning agents.\n\u003C\u002Fp>\n\u003Cp align=\"center\">\n  We develop \u003Cstrong>diagnostics to understand \u003Ci>how\u003C\u002Fi> agent RL training works \u003C\u002Fstrong>, and how to fix hidden issues.\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2604.06268\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F📄_V2_Paper-DC143C?style=for-the-badge&logoColor=white\" alt=\"V2 Paper\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.20073\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F📄_v1_Paper-FF8C00?style=for-the-badge&logoColor=white\" alt=\"v1 Paper\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fragen-ai.github.io\u002F\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F📝_HomePage-FF5722?style=for-the-badge&logoColor=white\" alt=\"Blog\">\u003C\u002Fa>\n  \u003C!-- \u003Ca href=\"https:\u002F\u002Fragen-doc.readthedocs.io\u002F\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F📚_Documentation-4285F4?style=for-the-badge&logoColor=white\" alt=\"Documentation\">\u003C\u002Fa> -->\n  \u003Ca href=\"https:\u002F\u002Fx.com\u002Fwzihanw\u002Fstatus\u002F1915052871474712858\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🔍_Post-34A853?style=for-the-badge&logoColor=white\" alt=\"Post\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fapi.wandb.ai\u002Flinks\u002Fzihanwang-ai-northwestern-university\u002Fa8er8l7b\">\u003Cimg 
src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🧪_Experiment_Log-AB47BC?style=for-the-badge&logoColor=white\" alt=\"Experiment Log\">\u003C\u002Fa>\n\u003C\u002Fp>\n\n> **Looking for the V1 README?** Please take a look [here](docs\u002Freadme_v1.md).\n\n## News\n\n- **2026.3.12.** We are excited to release \u003Cfont color=\"#DC143C\">RAGEN-2\u003C\u002Ffont>! We introduce a systematic study of reasoning collapse in agent RL and lightweight interventions for stable training. See the [\u003Cfont color=\"#DC143C\">v2 paper\u003C\u002Ffont>](https:\u002F\u002Fragen-ai.github.io\u002Fv2).\n- **2025.4.20.** RAGEN V1 [paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.20073) published on arXiv.\n- **2025.1.27.** Initial RAGEN release. [Post](https:\u002F\u002Fx.com\u002Fwzihanw\u002Fstatus\u002F1884092805598826609).\n\n\n## About\n\nRAGEN is built around **StarPO** (**S**tate-**T**hinking-**A**ctions-**R**eward **P**olicy **O**ptimization), a unified RL framework for training multi-turn, trajectory-level agents with flexible control over reasoning processes, reward assignment mechanisms, and prompt-rollout structures.\n\n**RAGEN is flexible with:**\n\n- **StarPO framework.** Unified optimization for multi-turn agents, supporting both trajectory-level and turn-wise training.\n- **10 built-in environments.** Sokoban, FrozenLake, WebShop, DeepCoder, SearchQA, Lean, Bandit, Countdown, MetaMathQA, Sudoku.\n- **Gym-compatible interface.** Easy to add custom environments.\n\n**\u003Cfont color=\"#DC143C\">RAGEN-2\u003C\u002Ffont> additionally introduces:**\n\n- **SNR-Adaptive Filtering (\u003Cfont color=\"#DC143C\">V2\u003C\u002Ffont>).** Lightweight rollout filtering based on reward variance to mitigate noisy gradient updates.\n- **Reasoning collapse diagnostics (\u003Cfont color=\"#DC143C\">V2\u003C\u002Ffont>).** Mutual information proxy metrics to detect and monitor template collapse during training.\n\n\n## Algorithm\n\n### StarPO: Reinforcing Reasoning via 
Trajectory-Level Optimization\n\n\u003Cp align=\"center\">\u003Cimg src=\"public\u002Fstarpo_logo.png\" width=\"800px\" alt=\"StarPO Framework\" \u002F>\u003C\u002Fp>\n\u003Cp align=\"center\" style=\"font-size: 16px; max-width: 800px; margin: 0 auto;\">\nThe StarPO (State-Thinking-Action-Reward Policy Optimization) framework with two interleaved stages: \u003Cb>rollout stage\u003C\u002Fb> and \u003Cb>update stage\u003C\u002Fb>. The LLM generates reasoning-guided actions to interact with the environment, collecting trajectory-level rewards to jointly optimize reasoning and action strategies.\n\u003C\u002Fp>\n\n**MDP Formulation.** Agent-environment interactions are formulated as Markov Decision Processes (MDPs) where states and actions are token sequences, allowing LLMs to reason over environment dynamics. The objective is to maximize expected cumulative rewards across multiple interaction turns.\n\n**Rollout Stage.** Given an initial state, the LLM generates multiple trajectories. At each step, the model produces a reasoning-guided action: `\u003Cthink>...\u003C\u002Fthink>\u003Cans> action \u003C\u002Fans>`. The environment returns feedback (reward and next state).\n\n**Update Stage.** StarPO optimizes entire trajectories using importance sampling. It supports:\n- **PPO.** Token-level advantage estimation via a value function over trajectories.\n- **GRPO.** Normalized reward assigned to the full trajectory.\n\n### \u003Cfont color=\"#DC143C\">V2\u003C\u002Ffont>: Diagnosing Template Collapse\n\nEntropy alone cannot detect *template collapse*, where reasoning appears diverse within a single input but becomes input-agnostic across inputs. 
\u003Cfont color=\"#DC143C\">RAGEN-2\u003C\u002Ffont> decomposes reasoning quality into two axes:\n- **Within-input diversity:** Conditional Entropy H(Z|X)\n- **Cross-input distinguishability:** Mutual Information I(X;Z)\n\nSNR-Adaptive Filtering uses reward variance as a lightweight proxy to select high-signal prompts each iteration, directly addressing the root cause of template collapse.\n\n\n## Update Log\n\n**2026.3.12.** \u003Cfont color=\"#DC143C\">RAGEN-2\u003C\u002Ffont> is released! Check out our [\u003Cfont color=\"#DC143C\">v2 paper\u003C\u002Ffont>](https:\u002F\u002Fragen-ai.github.io\u002Fv2).\n\n\u003Cdetails>\n\u003Csummary>Older updates\u003C\u002Fsummary>\n\n**2025.5.8.** Official [Documentation](https:\u002F\u002Fragen-doc.readthedocs.io\u002F) released. NOTE: this document is now outdated.\n\n**2025.5.2.** A [tracking document](https:\u002F\u002Fdocs.google.com\u002Fdocument\u002Fd\u002F1bg7obeiKTExuHHBl5uOiSpec5uLDZ2Tgvxy6li5pHX4\u002Fedit?usp=sharing) for logging minor codebase updates is released.\n\n**2025.4.20.** RAGEN V1 [paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.20073) published. Codebase restructured: veRL integrated as a submodule; architecture decomposed into three modules — Environment State Manager, Context Manager, and Agent Proxy.\n\n**2025.3.13.** RAGEN codebase refactoring underway. See the [developing branch](https:\u002F\u002Fgithub.com\u002FZihanWang314\u002FRAGEN\u002Ftree\u002Fmain-new).\n\n**2025.3.8.** KL term issue in veRL [fixed](https:\u002F\u002Fgithub.com\u002Fvolcengine\u002Fverl\u002Fpull\u002F179\u002Ffiles). Default advantage estimator changed to GAE (PPO) for more stable training.\n\n**2025.1.27.** Initial RAGEN release. 
[Post](https:\u002F\u002Fx.com\u002Fwzihanw\u002Fstatus\u002F1884092805598826609).\n\n\u003C\u002Fdetails>\n\n\n## Getting Started\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fmll-lab-nu\u002FRAGEN.git\ncd RAGEN\nconda create -n ragen python=3.12 -y && conda activate ragen\nbash scripts\u002Fsetup_ragen.sh\n```\n\nUse `bash scripts\u002Fsetup_ragen.sh --with-search` to include the search environment. For WebShop, see [docs\u002Fexperiment_webshop_release.md](docs\u002Fexperiment_webshop_release.md).\n\n### The Four Reasoning Regimes\n\n\u003Cfont color=\"#DC143C\">RAGEN-2\u003C\u002Ffont> diagnoses agent behavior along two axes — **within-input diversity** (Conditional Entropy) and **cross-input distinguishability** (Mutual Information) — yielding four distinct reasoning regimes:\n\n\u003Cp align=\"center\">\u003Cimg src=\"public\u002Fteaser.png\" width=\"800px\" alt=\"Four reasoning regimes: diverse reasoning, template collapse, compressed reasoning, low-entropy collapse\" \u002F>\u003C\u002Fp>\n\u003Cp align=\"center\" style=\"font-size: 15px; max-width: 800px; margin: 0 auto;\">\n\u003Cb>Left:\u003C\u002Fb> Input-driven reasoning adapts to the current state; templated reasoning produces nearly identical responses across different inputs. \u003Cb>Right:\u003C\u002Fb> Four reasoning regimes along two axes — conditional entropy H(Z|X) (within-input diversity) and mutual information I(X;Z) (input dependence). 
Template collapse (high entropy, low MI) is invisible to existing entropy-based metrics.\n\u003C\u002Fp>\n\n**Train (no filter, default):**\n```bash\npython train.py --config-name _2_sokoban\n```\n\n**Train with SNR-Adaptive Filtering (\u003Cfont color=\"#DC143C\">V2\u003C\u002Ffont>, Top-p):**\n```bash\npython train.py --config-name _2_sokoban \\\n  actor_rollout_ref.rollout_filter_strategy=top_p \\\n  actor_rollout_ref.rollout.rollout_filter_value=0.9\n```\n\n**Evaluate:**\n```bash\npython -m ragen.llm_agent.agent_proxy --config-name _2_sokoban\n```\n\nSNR-Adaptive Filtering consistently improves training across algorithms, model scales, and modalities (green = gain from filtering):\n\n\u003Cp align=\"center\">\u003Cimg src=\"public\u002Fmain_results.png\" width=\"800px\" alt=\"Main results: filtering vs no filtering\" \u002F>\u003C\u002Fp>\n\nSee the [Rollout Filtering Guide](docs\u002Fguide_rollout_filtering.md) for more filtering strategies (Top-k, linear mode, etc.).\n\n\n## Future Plans\n\nWe are actively developing the next generation of RAGEN infrastructure and diagnostics, targeting a release in **late March 2026**.\n\n**Infrastructure**\n- [ ] **Async rollout engine** \n- [ ] **HTTP-based environment interface** \n- [ ] **Layered Env Wrapper** \n- [ ] **Optional environment dependencies** \n\n**Diagnostics & Training Quality**\n- [ ] **Expanded benchmark suite** to stress-test diagnostics across diverse, real-world agent tasks\n- [ ] **Extended MI diagnostic dashboard**, including richer WandB visualizations for entropy, MI proxy, and gradient decomposition over training\n- [ ] **RL training metrics guide**, including a practitioner's blog on how to read training signals (reward distribution, entropy, MI, gradient norms) and act on them before committing to a full run\n\n**Framework**\n- [ ] Update full documentation for \u003Cfont color=\"#DC143C\">RAGEN-2\u003C\u002Ffont>\n- [ ] Multi-modal agent support (building upon 
[VAGEN](https:\u002F\u002Fgithub.com\u002FRAGEN-AI\u002FVAGEN))\n- [ ] Public leaderboard for benchmark results\n\n\n## Documentation\n\n- [Full Documentation](https:\u002F\u002Fragen-doc.readthedocs.io\u002F) *(We will release an updated version soon.)*\n- [Evaluation Guide](docs\u002Feval.md) — How to evaluate models and configure output formats\n- [Rollout Filtering Guide](docs\u002Fguide_rollout_filtering.md)\n- [MI Metrics Reference](docs\u002Freference_mutual_information_metrics.md)\n- Adding Custom Environments — Gym-compatible interface, see `config\u002Fenvs.yaml`\n- Experiment reproduction: [Main Table](docs\u002Fexperiment_main_table.md) | [Intervention Sweep](docs\u002Fexperiment_intervention_sweep.md) | [FrozenLake](docs\u002Fexperiment_frozen_lake_slipper_sweep.md) | [Sokoban Gradient](docs\u002Fexperiment_sokoban_gradient_analysis.md) | [Search](docs\u002Fexperiment_search.md) | [DeepCoder](docs\u002Fexperiment_deepcoder.md) | [WebShop](docs\u002Fexperiment_webshop_release.md)\n\n\n## Awesome Work Powered or Inspired by RAGEN\n\n- [ROLL](https:\u002F\u002Fgithub.com\u002Falibaba\u002FROLL): Efficient Scaling Library for RL with LLMs ![GitHub Repo stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Falibaba\u002FROLL?style=social)\n- [VAGEN](https:\u002F\u002Fgithub.com\u002FRAGEN-AI\u002FVAGEN): Training Visual Agents with multi-turn RL ![GitHub Repo stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FRAGEN-AI\u002FVAGEN?style=social)\n- [Search-R1](https:\u002F\u002Fgithub.com\u002FPeterGriffinJin\u002FSearch-R1): Train LLMs to reason and call a search engine with RL ![GitHub Repo stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FPeterGriffinJin\u002FSearch-R1?style=social)\n- [ZeroSearch](https:\u002F\u002Fgithub.com\u002FAlibaba-nlp\u002FZeroSearch): Incentivize LLM search capability without searching ![GitHub Repo 
stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAlibaba-nlp\u002FZeroSearch?style=social)\n- [Agent-R1](https:\u002F\u002Fgithub.com\u002FAgentR1\u002FAgent-R1): Training Powerful LLM Agents with End-to-End RL\n- [OpenManus-RL](https:\u002F\u002Fgithub.com\u002FOpenManus\u002FOpenManus-RL): RL tuning for LLM agents ![GitHub Repo stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenManus\u002FOpenManus-RL?style=social)\n- [MetaSpatial](https:\u002F\u002Fgithub.com\u002FPzySeere\u002FMetaSpatial): Reinforcing 3D Spatial Reasoning in VLMs ![GitHub Repo stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FPzySeere\u002FMetaSpatial?style=social)\n- [s3](https:\u002F\u002Fgithub.com\u002Fpat-jj\u002Fs3): Efficient Yet Effective Search Agent Training via RL\n\n\n## Contributors\n\n[**Zihan Wang**\\*](https:\u002F\u002Fzihanwang314.github.io\u002F), [**Kangrui Wang**\\*](https:\u002F\u002Fjameskrw.github.io\u002F), [**Qineng Wang**\\*](https:\u002F\u002Fqinengwang-aiden.github.io\u002F), [**Pingyue Zhang**\\*](https:\u002F\u002Fwilliamzhangsjtu.github.io\u002F), [**Linjie Li**\\*](https:\u002F\u002Fscholar.google.com\u002Fcitations?user=WR875gYAAAAJ&hl=en), [**Zhengyuan Yang**](https:\u002F\u002Fzyang-ur.github.io\u002F), [**Xing Jin**](https:\u002F\u002Fopenreview.net\u002Fprofile?id=~Xing_Jin3), [**Kefan Yu**](https:\u002F\u002Fwww.linkedin.com\u002Fin\u002Fkefan-yu-22723a25b\u002Fen\u002F), [**Minh Nhat Nguyen**](https:\u002F\u002Fwww.linkedin.com\u002Fin\u002Fmenhguin\u002F?originalSubdomain=sg), [**Licheng Liu**](https:\u002F\u002Fx.com\u002Fliulicheng10), [**Eli Gottlieb**](https:\u002F\u002Fwww.linkedin.com\u002Fin\u002Feli-gottlieb1\u002F), [**Yiping Lu**](https:\u002F\u002F2prime.github.io), [**Kyunghyun Cho**](https:\u002F\u002Fkyunghyuncho.me\u002F), [**Jiajun Wu**](https:\u002F\u002Fjiajunwu.com\u002F), [**Li Fei-Fei**](https:\u002F\u002Fprofiles.stanford.edu\u002Ffei-fei-li), [**Lijuan 
Wang**](https:\u002F\u002Fwww.microsoft.com\u002Fen-us\u002Fresearch\u002Fpeople\u002Flijuanw\u002F), [**Yejin Choi**](https:\u002F\u002Fhomes.cs.washington.edu\u002F~yejin\u002F), [**Manling Li**](https:\u002F\u002Flimanling.github.io\u002F)\n\n\\*Equal Contribution.\n\n\n## Acknowledgements\n\nWe thank the [DeepSeek](https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FDeepSeek-R1) team for early conceptual inspirations. We are grateful to the [veRL](https:\u002F\u002Fgithub.com\u002Fvolcengine\u002Fverl) team for infrastructure support. We thank the [TinyZero](https:\u002F\u002Fgithub.com\u002FJiayi-Pan\u002FTinyZero) team for discoveries that informed our initial exploration. We appreciate insightful discussions with Han Liu, Xinyu Xing, Li Erran Li, John Schulman, Akari Asai, Eiso Kant, Lu Lu, Runxin Xu, Huajian Xin, Zijun Liu, Weiyi Liu, Weimin Wu, Yibo Wen, Jiarui Liu, Lorenzo Xiao, Ishan Mukherjee, Anabella Isaro, Haosen Sun, How-Yeh Wan, Lester Xue, Matthew Khoriaty, Haoxiang Sun, Jiajun Liu.\n\nFor \u003Cfont color=\"#DC143C\">RAGEN-2\u003C\u002Ffont>, we additionally thank Yuxiang Lin and Kyunghyun Cho for their support.\n\n\n## Star History\n\n[![Star History Chart](https:\u002F\u002Fapi.star-history.com\u002Fsvg?repos=mll-lab-nu\u002Fragen&type=Date)](https:\u002F\u002Fwww.star-history.com\u002F#mll-lab-nu\u002Fragen&Date)\n\n\n## Citation\n\n```bibtex\n@misc{ragen2,\n      title={RAGEN-2: Reasoning Collapse in Agentic RL}, \n      author={Zihan Wang and Chi Gui and Xing Jin and Qineng Wang and Licheng Liu and Kangrui Wang and Shiqi Chen and Linjie Li and Zhengyuan Yang and Pingyue Zhang and Yiping Lu and Jiajun Wu and Li Fei-Fei and Lijuan Wang and Yejin Choi and Manling Li},\n      year={2026},\n      eprint={2604.06268},\n      archivePrefix={arXiv},\n      primaryClass={cs.LG},\n      url={https:\u002F\u002Farxiv.org\u002Fabs\u002F2604.06268}, \n}\n```\n\n```bibtex\n@misc{ragen,\n      title={RAGEN: Understanding Self-Evolution in LLM Agents via 
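Multi-Turn Reinforcement Learning},\n      author={Zihan Wang and Kangrui Wang and Qineng Wang and Pingyue Zhang and Linjie Li and Zhengyuan Yang and Xing Jin and Kefan Yu and Minh Nhat Nguyen and Licheng Liu and Eli Gottlieb and Yiping Lu and Kyunghyun Cho and Jiajun Wu and Li Fei-Fei and Lijuan Wang and Yejin Choi and Manling Li},\n      year={2025},\n      eprint={2504.20073},\n      archivePrefix={arXiv},\n      primaryClass={cs.LG},\n      url={https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.20073},\n}\n```\n","RAGEN is a project that uses reinforcement learning to train reasoning agents in interactive, stochastic environments. Its core features include unified optimization of multi-turn agents through the StarPO framework, with support for both trajectory-level and turn-level training, and 10 built-in environments such as Sokoban and FrozenLake, along with a Gym-compatible interface that makes it easy to add custom environments. RAGEN places particular emphasis on diagnosing and fixing problems in agent RL training, helping to identify and resolve hidden failure modes. In addition, the latest version, RAGEN-2, introduces SNR-Adaptive Filtering, a lightweight rollout filtering technique based on reward variance, which further improves training stability. The project is suited to researchers and developers who need to build complex decision logic or want a deeper understanding of how reinforcement learning training works.",2,"2026-06-11 03:41:58","high_star"]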