[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-72087":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":23,"hasPages":23,"topics":25,"createdAt":10,"pushedAt":10,"updatedAt":40,"readmeContent":41,"aiSummary":42,"trendingCount":16,"starSnapshotCount":16,"syncStatus":43,"lastSyncTime":44,"discoverSource":45},72087,"rllm","rllm-org\u002Frllm","rllm-org","Democratizing Reinforcement Learning for LLMs","https:\u002F\u002Fdocs.rllm-project.com",null,"Python",5606,575,30,77,0,12,25,116,36,39.28,"Apache License 2.0",false,"main",[26,27,28,29,30,31,32,33,34,35,36,37,38,39],"agent-framework","agentic-workflow","coding-agent","distributed-training","llm-reasoning","llm-training","machine-learning","ml-infrastructure","ml-platform","reinforcement-learning","search-agent","swe-agent","tinker","verl","2026-06-12 02:02:58","\u003Cdiv align=\"center\">\n\n# rLLM\n\n**Train your AI agents with RL. Any framework. 
Minimal code changes.**\n\n[![Documentation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDocumentation-blue?style=for-the-badge&logo=googledocs&logoColor=white)](https:\u002F\u002Fdocs.rllm-project.com\u002F)\n[![Slack](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FSlack-4A154B?style=for-the-badge&logo=slack&logoColor=white)](https:\u002F\u002Fjoin.slack.com\u002Ft\u002Frllmproject\u002Fshared_invite\u002Fzt-3pyblo6ef-m9kqAoInI8xSyUBkpuOyXA)\n[![Website](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FSite-%233f72af.svg?style=for-the-badge&logo=semanticweb&logoColor=white)](https:\u002F\u002Frllm-project.com)\n[![Blogs](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FBlogs-007AFF?style=for-the-badge)](https:\u002F\u002Frllm-project.com\u002Fblog)\n[![X](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F-black?logo=X&style=for-the-badge)](https:\u002F\u002Fx.com\u002Frllm_project)\n\n\u003C!-- [![PyPI](https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fv\u002Frllm?style=for-the-badge)](https:\u002F\u002Fpypi.org\u002Fproject\u002Frllm\u002F) -->\n\n\u003C\u002Fdiv>\n\nrLLM is an open-source framework for training AI agents with reinforcement learning. Swap in a tracked client, define a reward function, and let RL handle the rest — no matter what agent framework you use.\n\n## Core Features\n\n- **Works with any agent framework** — LangGraph, SmolAgent, Strands, OpenAI Agents SDK, Google ADK, or plain `openai.OpenAI`. Just swap the client. 🔌\n- **Near-zero code changes** — Add `@rllm.rollout` to wrap your agent code, and rLLM traces every LLM call automatically. 🪄\n- **CLI-first workflow** — Eval and train from the command line with 50+ built-in benchmarks. `rllm eval gsm8k` just works. ⚡\n- **Battle-tested results** — rLLM-trained agents beat models 50x their size (4B → outperforms 235B on finance, 1.5B → surpasses O1-Preview on math). 📈\n- **Multiple RL algorithms** — GRPO, REINFORCE, RLOO, rejection sampling, and more. 
🧠\n- **Two training backends** — `verl` for distributed multi-GPU training, `tinker` for single-machine \u002F CPU setups. Same API either way. 🔧\n\nRead more on our [documentation site](https:\u002F\u002Fdocs.rllm-project.com\u002F).\n\n## Installation\n\nrLLM requires `Python >= 3.11`. You can either install it directly with pip or build it from source.\n\n```bash\nuv pip install \"rllm @ git+https:\u002F\u002Fgithub.com\u002Frllm-org\u002Frllm.git\"\n```\n\nThis installs the dependencies for the `rllm` CLI, which uses Tinker as the training backend.\n\nTo use `verl` as the training backend (GPU machine required), install with the `verl` extra:\n\n```bash\n# For distributed GPU training (verl + vLLM\u002FSGLang)\n# Quote the requirement so the shell does not glob the brackets or split on the spaces\nuv pip install \"rllm[verl] @ git+https:\u002F\u002Fgithub.com\u002Frllm-org\u002Frllm.git\"\n```\n\nFor building from source or Docker, see the [installation guide](https:\u002F\u002Fdocs.rllm-project.com\u002Finstallation).\n\n## Quickstart\n\n### Option A: CLI (no code needed)\n\n```bash\n# 1. Configure your model provider\nrllm model setup\n\n# 2. Evaluate on a benchmark\nrllm eval gsm8k\n\n# 3. 
Train with RL\nrllm train gsm8k\n```\n\n### Option B: Python API\n\nDefine a rollout (your agent) and an evaluator (your reward function), then hand them to the trainer:\n\n```python\n# my_flow.py\nfrom openai import OpenAI\nimport rllm\nfrom rllm.types import AgentConfig, Episode, Task, Trajectory\n\n@rllm.rollout\ndef solve(task: Task, config: AgentConfig) -> Episode:\n    client = OpenAI(base_url=config.base_url, api_key=\"EMPTY\")\n    response = client.chat.completions.create(\n        model=config.model,\n        messages=[{\"role\": \"user\", \"content\": task.instruction}],\n    )\n    answer = response.choices[0].message.content or \"\"\n    return Episode(\n        trajectories=[Trajectory(name=\"solver\", steps=[])],\n        artifacts={\"answer\": answer},\n    )\n```\n\n```python\n# my_evaluator.py\nimport rllm\nfrom rllm.eval.types import EvalOutput, Signal\nfrom rllm.types import Episode\n\n@rllm.evaluator\ndef score(task: dict, episode: Episode) -> EvalOutput:\n    answer = str(episode.artifacts.get(\"answer\", \"\"))\n    is_correct = answer.strip() == task[\"ground_truth\"].strip()\n    reward = 1.0 if is_correct else 0.0\n    return EvalOutput(reward=reward, is_correct=is_correct,\n                      signals=[Signal(name=\"accuracy\", value=reward)])\n```\n\n```python\n# train.py\nfrom rllm.experimental.unified_trainer import AgentTrainer\n\ntrainer = AgentTrainer(\n    backend=\"tinker\",\n    agent_flow=solve,\n    evaluator=score,\n    config=config,\n    train_dataset=dataset,\n)\ntrainer.train()\n```\n\nDuring training, `config.base_url` points to a gateway that transparently captures token IDs and logprobs — your agent code stays the same for eval and training.\n\nSee the [cookbooks](.\u002Fcookbooks) for complete working examples (single-turn VLM solver, multi-agent solver-judge, and more).\n\n## Architecture\n\nrLLM follows a simple pipeline: **run your agent → collect traces → compute rewards → update the 
model**.\n\n```\n┌──────────────┐    ┌──────────────┐    ┌──────────────┐    ┌──────────────┐\n│  Your Agent  │───▶│    Traces     │───▶│   Rewards    │───▶│  RL Update   │\n│  (any code)  │    │  (auto-logged)│    │ (your logic) │    │  (GRPO etc.) │\n└──────────────┘    └──────────────┘    └──────────────┘    └──────────────┘\n```\n\nYour agent runs as-is — rLLM's model gateway captures LLM calls (token IDs + logprobs) by URL-routed sessions and structures them into **Episodes** (one task) containing **Trajectories** (one agent run) made of **Steps** (one LLM call). A reward function scores the result, and the RL algorithm updates the model weights. The same agent code works for both eval and training.\n\nUnder the hood:\n- **Workflow Engine** runs N parallel agent instances to collect rollouts\n- **Model Gateway** routes requests and captures token IDs + logprobs\n- **Transform Pipeline** groups trajectories for advantage computation\n- **Training Backend** (verl or tinker) handles the policy update\n\n## Community Projects\n\n- [Tongyi DeepResearch](https:\u002F\u002Fgithub.com\u002FAlibaba-NLP\u002FDeepResearch) — Open-source AI researchers by Alibaba NLP [![Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAlibaba-NLP\u002FDeepResearch)](https:\u002F\u002Fgithub.com\u002FAlibaba-NLP\u002FDeepResearch)\n- [Terminal-Bench-RL](https:\u002F\u002Fgithub.com\u002FDanau5tin\u002Fterminal-bench-rl) — Training long-horizon terminal agents with RL [![Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FDanau5tin\u002Fterminal-bench-rl)](https:\u002F\u002Fgithub.com\u002FDanau5tin\u002Fterminal-bench-rl)\n- [PettingLLMs](https:\u002F\u002Fgithub.com\u002Fpettingllms-ai\u002FPettingLLMs) — Multi-agent RL with on-policy training [![Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fpettingllms-ai\u002FPettingLLMs)](https:\u002F\u002Fgithub.com\u002Fpettingllms-ai\u002FPettingLLMs)\n- 
[SETA](https:\u002F\u002Fgithub.com\u002Fcamel-ai\u002Fseta) — Scaling environments for terminal agents [![Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fcamel-ai\u002Fseta)](https:\u002F\u002Fgithub.com\u002Fcamel-ai\u002Fseta)\n- [LLM-in-Sandbox](https:\u002F\u002Fgithub.com\u002Fllm-in-sandbox\u002Fllm-in-sandbox) — Building general agents by running LLMs in a sandbox [![Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fllm-in-sandbox\u002Fllm-in-sandbox)](https:\u002F\u002Fgithub.com\u002Fllm-in-sandbox\u002Fllm-in-sandbox)\n- [Vision-DeepResearch](https:\u002F\u002Fgithub.com\u002FOsilly\u002FVision-DeepResearch) — The first long-horizon multimodal deep-research MLLM [![Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOsilly\u002FVision-DeepResearch)](https:\u002F\u002Fgithub.com\u002FOsilly\u002FVision-DeepResearch)\n- [Cogito, Ergo Ludo](https:\u002F\u002Fwww.arxiv.org\u002Fabs\u002F2509.25052) — An agent that learns to play by reasoning and planning\n- [Cut the Bill, Keep the Turns](https:\u002F\u002Fagate-slipper-ef0.notion.site\u002FCut-the-Bill-Keep-the-Turns-Affordable-Multi-Turn-Search-RL-003f78214a4d451fb06f453d084e666c) — Affordable multi-turn search RL\n- [Experiential Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.13949) — Experience-reflection-consolidation loop for RL with sparse rewards\n- [V1: Unifying Generation and Self-Verification](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.04304) — Pairwise self-verification for parallel test-time scaling\n- [TherapyGym](https:\u002F\u002Ftherapygym.stanford.edu\u002F) - Evaluating and Aligning Clinical Fidelity and Safety in Therapy Chatbots\n- [SandMLE](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2604.04872) - Synthetic Sandbox for Training MLE Agents\n## Articles & Blog Posts\n\n- [rLLM UI: Real-Time Observability Tool for Agent Training & Evaluation](https:\u002F\u002Frllm-project.com\u002Fpost.html?post=rllm_ui.md) — 
Mar 2026\n- [rLLM On-Policy Distillation: Training Smaller Students from Stronger Teachers](https:\u002F\u002Frllm-project.com\u002Fpost.html?post=opd.md) — Mar 2026\n- [Faster and Better: Open-Source Recipe for Deep Research Agents with Fully Async Training](https:\u002F\u002Frllm-project.com\u002Fpost.html?post=async_rl.md) — Feb 2026\n- [rLLM-FinQA: How a 4B Model Outperforms 235B and Rivals Gemini 2.5 Pro on Financial Analysis](https:\u002F\u002Frllm-project.com\u002Fpost.html?post=finqa.md) — Feb 2026\n- [rLLM SDK: Training Any Agentic Program without Code Changes](https:\u002F\u002Frllm-project.com\u002Fpost.html?post=sdk.md) — Dec 2025\n- [rLLM v0.2: RL Training for General Agentic Programs](https:\u002F\u002Frllm-project.com\u002Fpost.html?post=rllm_v0.2.md) — Oct 2025\n- [DeepSWE: Open-source SWE Agent via RL](https:\u002F\u002Fpretty-radio-b75.notion.site\u002FDeepSWE-Training-a-Fully-Open-sourced-State-of-the-Art-Coding-Agent-by-Scaling-RL-22281902c1468193aabbe9a8c59bbe33) — Jul 2025\n- [DeepCoder: 14B Coder at O3-mini Level](https:\u002F\u002Fpretty-radio-b75.notion.site\u002FDeepCoder-A-Fully-Open-Source-14B-Coder-at-O3-mini-Level-1cf81902c14680b3bee5eb349a512a51) — Apr 2025\n- [DeepScaleR: 1.5B Surpasses O1-Preview](https:\u002F\u002Fpretty-radio-b75.notion.site\u002FDeepScaleR-Surpassing-O1-Preview-with-a-1-5B-Model-by-Scaling-RL-19681902c1468005bed8ca303013a4e2) — Feb 2025\n\n## Acknowledgements\n\nOur work is done as part of [Berkeley Sky Computing Lab](https:\u002F\u002Fsky.cs.berkeley.edu\u002F). The rLLM team is generously supported by grants from [Laude Institute](https:\u002F\u002Fwww.laude.org\u002F), [AWS](https:\u002F\u002Faws.amazon.com\u002F), [Hyperbolic](https:\u002F\u002Fwww.hyperbolic.ai\u002F), [Fireworks AI](https:\u002F\u002Ffireworks.ai\u002F), and [Modal](https:\u002F\u002Fmodal.com\u002F). 
We extend special thanks to [Together AI](https:\u002F\u002Fwww.together.ai\u002F) for the research partnership and compute support.\n\n## Citation\n\n```bibtex\n@misc{rllm2025,\n  title={rLLM: A Framework for Post-Training Language Agents},\n  author={Sijun Tan and Michael Luo and Colin Cai and Tarun Venkat and Kyle Montgomery and Aaron Hao and Tianhao Wu and Arnav Balyan and Manan Roongta and Chenguang Wang and Li Erran Li and Raluca Ada Popa and Ion Stoica},\n  year={2025},\n  howpublished={\\url{https:\u002F\u002Fpretty-radio-b75.notion.site\u002FrLLM-A-Framework-for-Post-Training-Language-Agents-21b81902c146819db63cd98a54ba5f31}},\n  note={Notion Blog},\n}\n```\n\nYou may also cite our prior work [DeepScaleR](https:\u002F\u002Fscholar.googleusercontent.com\u002Fscholar.bib?q=info:PrmBADk39GwJ:scholar.google.com\u002F&output=citation&scisdr=CgIJFx-xEMCQ6zOgcuI:AAZF9b8AAAAAaPCmauIfzg8Rm9ImNYDad0uPUK8&scisig=AAZF9b8AAAAAaPCmahXsNqb1jTQBw2iPfw2vm9g&scisf=4&ct=citation&cd=-1&hl=en&scfhb=1), [DeepCoder](https:\u002F\u002Fscholar.googleusercontent.com\u002Fscholar.bib?q=info:xpZNEPI6opAJ:scholar.google.com\u002F&output=citation&scisdr=CgIJFx-xEMCQ6zOgjM8:AAZF9b8AAAAAaPCmlM_hb3S0tzBSVrRYBZYDLWg&scisig=AAZF9b8AAAAAaPCmlG109SG8d8230AiDP4jMxlw&scisf=4&ct=citation&cd=-1&hl=en&scfhb=1), and [DeepSWE](https:\u002F\u002Fscholar.googleusercontent.com\u002Fscholar.bib?q=info:J9rT3SnY_aMJ:scholar.google.com\u002F&output=citation&scisdr=CgIJFx-xEMCQ6zOg3D4:AAZF9b8AAAAAaPCmxD7Nl0xA_AcAeydpcE1BXCo&scisig=AAZF9b8AAAAAaPCmxE2Spzf5lf-2Toys5xEpnuA&scisf=4&ct=citation&cd=-1&hl=en&scfhb=1).\n","rLLM is an open-source framework for training AI agents with reinforcement learning. It supports any agent framework: users only need to define a reward function to train with reinforcement learning, with almost no changes to their existing code. The project offers a choice of reinforcement learning algorithms and two training backends (distributed multi-GPU training and single-machine\u002FCPU setups), and ships with more than 50 built-in benchmarks. rLLM is particularly suited to research settings that need fast iteration and evaluation of models at different scales, and to enterprise applications that want to simplify their machine learning infrastructure.",2,"2026-06-11 03:40:18","high_star"]