[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-74032":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":25,"hasPages":23,"topics":26,"createdAt":10,"pushedAt":10,"updatedAt":40,"readmeContent":41,"aiSummary":42,"trendingCount":16,"starSnapshotCount":16,"syncStatus":43,"lastSyncTime":44,"discoverSource":45},74032,"OpenClaw-RL","Gen-Verse\u002FOpenClaw-RL","Gen-Verse","OpenClaw-RL: Train any agent simply by talking","https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.10165",null,"Python",5467,590,41,50,0,18,52,185,54,114.31,"Apache License 2.0",false,"main",true,[27,28,29,30,31,32,33,34,35,36,37,38,39],"async","coding","grpo","gui-application","memory-systems","on-policy-distillation","open-claw","openclaw-skills","rlhf","sglang","skill-learning","slime","tinker","2026-06-12 04:01:12","\u003Cdiv align=\"center\">\n  \u003Ch1 align=\"center\">\n    \u003Cimg src=\"assets\u002Fspacer.png\" alt=\"\" width=\"23\" height=\"40\" align=\"absmiddle\" \u002F>\n    OpenClaw-RL\u003C!--\n-->\u003Csup>\n    \u003Cimg src=\"assets\u002Fclawistool.png\" alt=\"Claw-RL logo\" width=\"23\" height=\"40\" align=\"absmiddle\" \u002F>\n    \u003Csup>\n  \u003C\u002Fh1>\n\n  \u003Cp>\u003Cb>Empowering OpenClaw with RL — Train a personalized agent simply by talking to it.\u003C\u002Fb>\u003C\u002Fp>\n  \u003Cp>\u003Cb>Scalable RL in real-world settings — Agentic RL for terminal, GUI, SWE, and tool-call settings.\u003C\u002Fb>\u003C\u002Fp>\n\u003C\u002Fdiv>\n\n\n\u003Cp align=\"center\">\n  \u003Cimg 
src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F⚡_Fully_Async-yellow?style=for-the-badge\" alt=\"Fully Async\" \u002F>\n  \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F💰_Zero_API_or_Zero_GPU-blue?style=for-the-badge\" alt=\"Zero API or Zero GPU\" \u002F>\n  \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🤖_Personalized-success?style=for-the-badge\" alt=\"Personalized\" \u002F>\n  \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🛠️_Auto_Optimization-orange?style=for-the-badge\" alt=\"Auto\" \u002F>\n  \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F💬_Language_Feedback-purple?style=for-the-badge\" alt=\"Language Feedback\" \u002F>\n  \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🧠_Hybrid_RL-red?style=for-the-badge\" alt=\"Hybrid RL\" \u002F>\n  \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🌍_Real_World_Agentic_RL-green?style=for-the-badge\" alt=\"General Agentic RL\" \u002F>\n  \u003Cbr>\u003Cbr>\n  \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.10165\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F📄_Tech_Report-red?style=flat-square\" alt=\"Tech Report\" \u002F>\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fyinjjiew.github.io\u002Fprojects\u002Fopenclawrl1\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FBlog-Page-blue?style=flat-square\" alt=\"OpenClaw-RL Blog\" \u002F>\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fopenclaw.ai\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FOpenClaw-Plugin-orange?style=flat-square\" alt=\"OpenClaw Plugin\" \u002F>\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FTHUDM\u002Fslime\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FSlime-Supported-purple?style=flat-square\" alt=\"Slime Based\" \u002F>\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fthinkingmachines.ai\u002Ftinker\u002F\">\u003Cimg 
src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FTinker-Supported-yellow?style=flat-square\" alt=\"Tinker Supported\" \u002F>\u003C\u002Fa>\n  \u003Ca href=\"LICENSE\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-Apache_2.0-green?style=flat-square\" alt=\"License Apache 2.0\" \u002F>\u003C\u002Fa>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  \u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Fa58aacad-3c1d-47aa-bbd1-cf8c5f36de6f\" controls width=\"200\">\u003C\u002Fvideo>\n\u003C\u002Fp>\n\n\n\n\n\n\n\n\n\n## 📰 News\n\n- **[2026\u002F4\u002F15]** 🙌 We sincerely thank [Fireworks AI](https:\u002F\u002Ffireworks.ai) for its generous support of this project, which has enabled more experiments and faster iteration.\n- **[2026\u002F4\u002F11]** ✨ Qwen3.5-4B\u002F9B\u002F27B is now supported, for both text and multimodal inputs!\n- **[2026\u002F4\u002F4]** 👨‍👦‍👦 We support optimizing a single model based on feedback from a group of people.\n- **[2026\u002F3\u002F25]** 🙌 We sincerely thank [Tinker](https:\u002F\u002Fthinkingmachines.ai\u002Ftinker\u002F) for its generous support of this project, which has enabled more experiments and faster iteration.\n- **[2026\u002F3\u002F20]** 💻 You can now use your own OpenClaw: simply install [this extension](https:\u002F\u002Fgithub.com\u002FGen-Verse\u002FOpenClaw-RL\u002Ftree\u002Fmain\u002Fextensions\u002Frl-training-headers).\n- **[2026\u002F3\u002F13]** ☁️ OpenClaw-RL now supports both local GPU and cloud ([Tinker](https:\u002F\u002Fthinkingmachines.ai\u002Ftinker\u002F)) deployment. Launch with [**one line of code**](#combinemethod); Hybrid RL, OPD, and Binary RL are all supported!\n- **[2026\u002F3\u002F12]** ⚡ We support LoRA training now!\n- **[2026\u002F3\u002F10]** 📃 We have released our [**Technical Report**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.10165)! 
🏆 Ranked **#1** on [HuggingFace Daily Papers](https:\u002F\u002Fhuggingface.co\u002Fpapers\u002F2603.10165)!\n- **[2026\u002F3\u002F10]** 🔥 Huge updates today! We released a [new combination method](.\u002Fopenclaw-combine), along with an [interesting evaluation](.\u002Fopenclaw-test) of these OpenClaw-RL methods. Track 2 is released too, featuring scalable RL implementations for general agent settings across [terminal](.\u002Fterminal-rl), [GUI](.\u002Fgui-rl), [SWE](.\u002Fswe-rl), and [tool-call](.\u002Ftoolcall-rl) scenarios. We only focus on real-world settings!\n- **[2026\u002F3\u002F3]** 🙌 Working with the authors of [SDFT](https:\u002F\u002Farxiv.org\u002Fabs\u002F2601.19897) and [SDPO](https:\u002F\u002Farxiv.org\u002Fabs\u002F2601.20802), we have integrated their methods into [openclaw-opd](.\u002Fopenclaw-opd). We welcome the integration of novel and effective methods!\n- **[2026\u002F3\u002F3]** 📺 Check out these community tutorial videos on OpenClaw-RL: [Video 1](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=5xnm1vB7G64) | [Video 2](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=ZtN6Gg_bdJE)\n- **[2026\u002F2\u002F26]** 🔥 We release **OpenClaw-RL v1** — a fully asynchronous RL framework for training personalized AI agents from natural conversation feedback. \n\n---\n\n## 💡 TL;DR\n\n> **OpenClaw-RL** is a fully asynchronous reinforcement learning framework that turns everyday conversations into training signals for personalized AI agents, and supports training general agents with large-scale environment parallelization.\n\nMost RL-for-LLM systems assume centralized, batch-mode training with pre-collected datasets. 
**OpenClaw-RL** takes a fundamentally different approach: it wraps your self-hosted model in [OpenClaw](https:\u002F\u002Fopenclaw.ai) as an OpenAI-compatible API, intercepts live multi-turn conversations, and continuously optimizes the policy in the background — all without interrupting your usage.\n\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002Fframework.png\"  alt=\"Overview\"  width=\"600\">\n\u003C\u002Fp>\n\n\n\n> **Highlights:** Fully async 4-component loop · Self-hosted & private · Zero manual labeling · Three learning paradigms (Binary RL \u002F OPD \u002F Combine) · Personal + General agent support\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>🌈 Features\u003C\u002Fb>\u003C\u002Fsummary>\n\n### Fully Asynchronous 4-Component Architecture\nOpenClaw-RL decouples **agent serving**, **rollout collection**, **PRM\u002Fjudge evaluation**, and **policy training** into independent async loops. None of them block one another: the model continues serving requests while training runs in the background, and judging happens concurrently with new interactions.\n\n### Self-Hosted & Private by Design\nThe entire stack, including the **policy model**, **judge\u002FPRM**, and **trainer**, runs on **your own infrastructure**. Conversation data stays within your system, and no third-party model API is required.\n\n### From Feedback to Gradient — Automatically\nYou do not need to manually label data. The system automatically:\n- Organizes multi-turn interactions into session-aware training trajectories\n- Classifies API messages into **main-line** (trainable) vs. 
**side** (non-trainable) turns\n- Uses the next user, environment, or tool feedback as a natural \"next-state\" signal\n- Runs PRM\u002Fjudge evaluation asynchronously, with majority voting when needed for more robust scoring\n- Submits ready samples to the trainer as they become available\n\n### Three Optimization Methods in One Framework\n\n**Binary RL (GRPO):** A Process Reward Model scores each turn based on next-state feedback. The scalar reward is then used with GRPO advantage estimation and a PPO-style clipped surrogate loss.\n\n**On-Policy Distillation (OPD):** When the next state reveals useful hindsight, a judge model extracts a textual hint. This hint augments the original prompt to create an enhanced teacher, whose token-level log-probability gap with the student becomes a directional advantage signal richer than any scalar reward.\n\n**Hybrid Method:** OpenClaw-RL further combines Binary RL and OPD in a unified training recipe, leveraging the dense scalar supervision of Binary RL together with the richer token-level directional signal from OPD. This combination achieves stronger and more robust optimization than either method alone.\n\n### From Personal Agents to Real-World Agentic RL\nThe same framework supports both personalized OpenClaw optimization and scalable RL for **terminal**, **GUI**, **SWE**, and **tool-call** agents in real-world settings.\n\n\n\n\u003C\u002Fdetails>\n\n---\n\n\n\n## 🎯 Roadmap\n\nOur long-term goal is to **advance personalized, practically useful agents with reinforcement learning**. 
The roadmap has two tracks:\n\n#### Track 1 — [Personal Agent Optimization](#personalagent) (Small-Scale but Personal)\n✅ **Release Track 1:** Fully async OpenClaw-RL framework with Binary RL + OPD  \n✅ Best recipe discovery via demonstration experiments  \n✅ Support LoRA Training  \n✅ Deploy training on [Tinker](https:\u002F\u002Fthinkingmachines.ai\u002Ftinker\u002F)  \n✅ Deploy training on [Fireworks AI](https:\u002F\u002Ffireworks.ai)\n\n#### Track 2 — [General Agents Optimization](#generalagent) (Scalable Infra)\n✅ **Release Track 2:** Scalable agentic RL infra for general agents  \n✅ Support Qwen3.5  \n⬜ Support more cloud services  \n\n\n\u003C!--\n## 🤝 Contributing\n\nWe welcome contributions that integrate new learning methods into the OpenClaw-RL framework! The integration of [SDFT](https:\u002F\u002Farxiv.org\u002Fabs\u002F2601.19897) \u002F [SDPO](https:\u002F\u002Farxiv.org\u002Fabs\u002F2601.20802) into [openclaw-opd](.\u002Fopenclaw-opd), and [supporting LoRA](https:\u002F\u002Fgithub.com\u002FGen-Verse\u002FOpenClaw-RL\u002Fpull\u002F23) are great examples of successful community contributions.\n\n\n\n**Highly wanted contributions:**\n- 🤖 **Qwen3.5 model support with slime** — launch scripts and model configs for the Qwen3.5 family\n- 🔧 **Low-precision training examples** — FP8\u002FINT4 training scripts for existing methods\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>📋 Full contribution guidelines & feature wishlist\u003C\u002Fb>\u003C\u002Fsummary>\n\n\n# Call for Contributions\n\nWe welcome community contributions to OpenClaw-RL! 
This document outlines our contribution principles and the features we'd love help with.\n\n## Contribution Guidelines\n\nOpenClaw-RL is organized as a collection of **self-contained method folders** (e.g., `openclaw-rl\u002F`, `openclaw-opd\u002F`, `openclaw-combine\u002F`), each sitting alongside the shared `slime\u002F` training framework and `openclaw\u002F` runtime.\n\nContributions generally fall into two categories:\n\n### Adding a new method or deployment target\n\nCreate a new top-level folder (parallel to existing ones like `openclaw-opd\u002F`). All method-specific code — launch scripts, custom loss functions, rollout logic, API server adapters, data processing, and the README — should live inside this folder.\n\n### Extending an existing method\n\nFor changes within an existing method folder — such as supporting a new model family, adding a LoRA variant, or a low-precision example — **add new files** (e.g., a new `.sh` script, a new data processing script) rather than modifying existing ones. This way the original working examples stay intact and your addition can be reviewed independently.\n\n### General principles\n\n1. **Do not modify the core framework.** Avoid changes to `slime\u002F`, `Megatron-LM\u002F`, or `openclaw\u002F` unless absolutely necessary. The framework exposes extension points (`--custom-loss-function-path`, `--rollout-function-path`, `--custom-generate-function-path`, `--custom-rm-path`, etc.) specifically so that new methods can plug in without touching shared code. If a framework change is truly needed, please open a separate PR for it with a clear justification.\n\n2. **Include documentation.** For a new method folder, add a `README.md` explaining what the method does, how to run it, key environment variables, and file structure. For additions to existing folders, update the existing `README.md` with a new section. 
See [`openclaw-combine\u002FREADME.md`](.\u002Fopenclaw-combine\u002FREADME.md) or [`toolcall-rl\u002FREADME.md`](.\u002Ftoolcall-rl\u002FREADME.md) for good examples.\n\n3. **Follow existing conventions.** Use the same shell script structure (GPU partitioning, `CKPT_ARGS`, `ROLLOUT_ARGS`, `OPTIMIZER_ARGS`, etc.), environment variable naming, and `ray job submit` launch pattern used by the existing methods.\n\n\n\n\n\n## Highly Preferred Features\n\n\n### 1. 🤖 Qwen3.5 Model Support of slime\n\n**Type:** Extend existing method folders\n\n**Goal:** Add launch scripts and model configurations for the Qwen3.5 family across existing methods.\n\n**Requirements:**\n\n- Add new `.sh` scripts for Qwen3.5 in relevant method folders (e.g., `openclaw-combine\u002Frun_qwen35_4b_openclaw_combine.sh`).\n- Add the corresponding model config in `slime\u002Fscripts\u002Fmodels\u002F` if Qwen3.5 requires different architecture parameters (hidden size, num layers, etc.) from Qwen3.\n- Verify and document any changes needed for tokenizer, chat template, reasoning parser, or tool-call parser compatibility.\n- Update READMEs to list Qwen3.5 as a supported model.\n\n\n### 2. 
🔧 Low-Precision Training\u002FInference Examples\n\n**Type:** Extend existing method folders\n\n**Goal:** Add low-precision (e.g., INT8\u002FINT4 inference, BF16\u002FFP8 training) example scripts to existing method folders, enabling users to run OpenClaw-RL on consumer-grade hardware with fewer GPUs.\n\n**Requirements:**\n\n- Add **new** `.sh` scripts within existing method folders; do not modify existing scripts.\n- Low-precision inference: demonstrate launching the SGLang rollout engine with quantized weights (e.g., AWQ\u002FGPTQ INT4) to reduce VRAM for the serving side.\n- Low-precision training: if supported by the Megatron backend, demonstrate FP8 or mixed-precision configurations that reduce training memory.\n- Update the corresponding `README.md` in each method folder with a new section documenting these scripts.\n\n---\n\nIf you're interested in any of these, feel free to open an issue to discuss your approach before submitting a PR. We're happy to provide guidance and review!\n\n\n\u003C\u002Fdetails>\n\n-->\n\n\n## 📝 Contents\n\n- [Personal OpenClaw Optimization](#personalagent)\n  - [Hybrid RL](#combinemethod)\n  - [Method Evaluation](#evalmethod)\n- [Agentic RL in Real-World Settings](#agentrl)\n  - [Terminal Agent](#terminal)\n  - [GUI Agent](#gui)\n  - [SWE Agent](#swe)\n  - [Tool-call Agent](#toolcall)\n\n---\n\n\n\n\u003Ca id=\"personalagent\">\u003C\u002Fa>\n## 🔧 Personal Agent Optimization Quick Start\n\n### 1. 
Deployment Requirements\n\n\n- **Hardware:** 8× GPUs (default; configurable via `NUM_GPUS`, `ACTOR_GPUS`, `ROLLOUT_GPUS`, `PRM_GPUS`)\n- **Software:** CUDA 12.9, Python 3.12\n- **Framework:** [Slime](https:\u002F\u002Fgithub.com\u002FTHUDM\u002Fslime) (our base RL framework)\n\nFor detailed environment setup, see [Slime](https:\u002F\u002Fgithub.com\u002FTHUDM\u002Fslime) or [`.\u002Finstructions\u002FREADME.md`](.\u002Finstructions\u002FREADME.md).\n\n\u003C!--\n\n#### Don't have a GPU?\n\n**Option 1: Tinker**\n\nCreate a [Tinker API](https:\u002F\u002Fthinkingmachines.ai\u002Ftinker\u002F). That's all you need. Tinker only supports LoRA, which may not be as effective as full fine-tuning, but it's really cheap.\n\nSee [`.\u002Fopenclaw-tinker\u002FREADME.md`](.\u002Fopenclaw-tinker\u002FREADME.md) for setup details.\n\n**Option 2: Fireworks Training SDK**\n\nUse the [Fireworks Training SDK](https:\u002F\u002Ffireworks.ai). Supports full-parameter and LoRA training.\n\nSee [`.\u002Fopenclaw-fireworks\u002FREADME.md`](.\u002Fopenclaw-fireworks\u002FREADME.md) for setup details.\n\n-->\n\n\n\n### 2. Start the RL Server\n\n\n\n\u003Ca id=\"combinemethod\">\u003C\u002Fa>\n\n```bash\ncd slime\nbash ..\u002Fopenclaw-combine\u002Frun_qwen3_4b_openclaw_topk_select.sh\n```\n\nThis method combines binary RL and OPD to achieve the best optimization.\n\nSee [`.\u002Fopenclaw-combine\u002FREADME.md`](.\u002Fopenclaw-combine\u002FREADME.md) for algorithm details.\n\n\n\n\n\n\nOnce running, the model is served as an OpenAI-compatible API at:\n```\nhttp:\u002F\u002F\u003CHOST_IP>:30000\u002Fv1\n```\n\nwhere `\u003CHOST_IP>` is the **IP address** of the machine running the RL server (e.g. `115.190.98.251`). The port `30000` is the default and can be changed via the `PORT` environment variable.\n\n**Take note of this endpoint** — you will need it when configuring OpenClaw in the next step.\n\nWe also provide an interesting case for evaluation. 
A student uses OpenClaw to do homework and does not want to be caught using AI. A teacher also uses OpenClaw to grade students' homework and wants the comments to be specific and friendly.\n\n\u003Ca id=\"evalmethod\">\u003C\u002Fa>\n\u003Cdetails>\n\u003Csummary>\u003Cb>Evaluation Setting\u003C\u002Fb>: the student, TA, and teacher all use AI!\u003C\u002Fsummary>\n\n\n\u003Cimg src=\"assets\u002Fexample.png\" alt=\"Overview\" width=\"750\">\n\nSee [`.\u002Fopenclaw-test\u002FREADME.md`](.\u002Fopenclaw-test\u002FREADME.md) for setup and algorithm details. Example evaluation results are available [here](https:\u002F\u002Fgithub.com\u002FGen-Verse\u002FOpenClaw-RL\u002Fblob\u002Fmain\u002Fopenclaw-test\u002Fresults.txt).\n\u003C\u002Fdetails>\n\n\n### 3. OpenClaw Setup\n\nYou can use your own OpenClaw; just install [this extension](https:\u002F\u002Fgithub.com\u002FGen-Verse\u002FOpenClaw-RL\u002Ftree\u002Fmain\u002Fextensions\u002Frl-training-headers).\n\n\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>Then configure OpenClaw to route requests to your RL server. 
\u003C\u002Fb>\u003C\u002Fsummary>\n\nOpen your `openclaw.json` (or the equivalent settings file) and add a provider entry under `\"models\"` → `\"providers\"`:\n\nExample of Slime-based RL server:\n\n```json\n{\n  \"models\": {\n    \"providers\": {\n      \"qwen\": {\n        \"baseUrl\": \"http:\u002F\u002F\u003CHOST_IP>:30000\u002Fv1\",\n        \"apiKey\": \"apiKey\",\n        \"api\": \"openai-completions\",\n        \"models\": [\n          {\n            \"id\": \"qwen3-4b\",\n            \"name\": \"Qwen3 4B\",\n            \"reasoning\": true,\n            \"input\": [\"text\"],\n            \"cost\": {\n              \"input\": 0,\n              \"output\": 0,\n              \"cacheRead\": 0,\n              \"cacheWrite\": 0\n            },\n            \"contextWindow\": 32768,\n            \"maxTokens\": 8192\n          }\n        ]\n      }\n    }\n  }\n}\n```\n\nReplace `\u003CHOST_IP>` with the IP address of your RL server machine. The `apiKey` should match the `SGLANG_API_KEY` you set when starting the server.\n\nExample of Tinker-based RL server:\n\n\n```json\n{\n  \"models\": {\n    \"providers\": {\n      \"openclaw-rl\": {\n        \"baseUrl\": \"http:\u002F\u002Flocalhost:30000\u002Fv1\",\n        \"apiKey\": \"no-auth-needed\",\n        \"api\": \"openai-completions\",\n        \"models\": [\n          {\n            \"id\": \"qwen3-4b-lora\",\n            \"name\": \"Qwen3 4B (OpenClaw-RL LoRA)\",\n            \"reasoning\": true,\n            \"input\": [\"text\"],\n            \"cost\": {\n              \"input\": 0,\n              \"output\": 0,\n              \"cacheRead\": 0,\n              \"cacheWrite\": 0\n            },\n            \"contextWindow\": 32768,\n            \"maxTokens\": 8192\n          }\n        ]\n      }\n    }\n  }\n}\n```\n\n\n\nThat's it — start chatting with your OpenClaw agent. The RL server will automatically collect conversation trajectories, compute rewards, and train the model. 
Your agent gets better the more you use it.\n\n\u003C\u002Fdetails>\n\n\n\n\n---\n\n\u003Ca id=\"agentrl\">\u003C\u002Fa>\n## 🔧 Agentic RL in Real-world Settings\n\nThe same asynchronous RL backbone that powers our personal-agent setting can also support large-scale optimization for these broader real-world environments.\n\n| Setting | Environment | Next-state signal | Horizon |\n|---|---|---|---|\n| Terminal | Shell execution sandbox | stdout\u002Fstderr, exit code | Long |\n| GUI | Screen state + accessibility tree | Visual state diff, task progress | Long |\n| SWE | Code repository + test suite | Test verdicts, diff, lint output | Long |\n| Tool-call | API\u002Ffunction execution | Return values, error traces | Medium |\n\n\u003Ca id=\"terminal\">\u003C\u002Fa>\n### 🖥️ Terminal Agent — the most widely used computer-use agent\n\n```bash\ncd slime\nbash ..\u002Fterminal-rl\u002Fterminal_qwen3_8b_rl.sh\n```\n\n\nSee [`.\u002Fterminal-rl\u002FREADME.md`](.\u002Fterminal-rl\u002FREADME.md) for setup details.\n\n\n\u003Ca id=\"gui\">\u003C\u002Fa>\n### 📟 GUI Agent — the most general computer-use agent\n\n```bash\ncd slime\nbash ..\u002Fgui-rl\u002Fgui_qwen3vl_8b_rl.sh\n```\n\n\nSee [`.\u002Fgui-rl\u002FREADME.md`](.\u002Fgui-rl\u002FREADME.md) for setup details.\n\n\u003Ca id=\"swe\">\u003C\u002Fa>\n### 👨‍💻 SWE Agent — software engineering agent\n\n```bash\ncd slime\nbash ..\u002Fswe-rl\u002Frun_swe_rl_32b_remote_8nodes.sh\n```\n\n\nSee [`.\u002Fswe-rl\u002FREADME.md`](.\u002Fswe-rl\u002FREADME.md) for setup details.\n\n\u003Ca id=\"toolcall\">\u003C\u002Fa>\n### 🛠️ Tool-call Agent — the most practical agent\n\n```bash\ncd slime\nbash ..\u002Ftoolcall-rl\u002Fretool_qwen3_4b_rl.sh\n```\n\nSee [`.\u002Ftoolcall-rl\u002FREADME.md`](.\u002Ftoolcall-rl\u002FREADME.md) for setup details.\n\n\n\n\n\n## 📖 Citation\n\n```\n@article{wang2026openclawrl,\n  title={OpenClaw-RL: Train Any Agent Simply by Talking},\n  author={Wang, Yinjie and Chen, Xuyang and Jin, Xiaolong and 
Wang, Mengdi and Yang, Ling},\n  journal={arXiv preprint arXiv:2603.10165},\n  year={2026}\n}\n\n@article{wang2026rlanything,\n  title={RLAnything: Forge Environment, Policy, and Reward Model in Completely Dynamic RL System},\n  author={Wang, Yinjie and Xie, Tianbao and Shen, Ke and Wang, Mengdi and Yang, Ling},\n  journal={arXiv preprint arXiv:2602.02488},\n  year={2026}\n}\n```\n\n## 🙏 Acknowledgements\n\nThis work aims to explore more effective paradigms for Agentic RL. Our implementation builds upon the excellent codebases of [slime](https:\u002F\u002Fgithub.com\u002FTHUDM\u002Fslime), [OpenClaw](https:\u002F\u002Fgithub.com\u002Fopenclaw\u002Fopenclaw), [Tinker](https:\u002F\u002Fthinkingmachines.ai\u002Ftinker\u002F), and [Open-AgentRL](https:\u002F\u002Fgithub.com\u002FGen-Verse\u002FOpen-AgentRL).\n\nWe also build terminal RL using [SETA](https:\u002F\u002Fgithub.com\u002Fcamel-ai\u002Fseta)'s dataset and agent framework, GUI RL using [OSWorld](https:\u002F\u002Fgithub.com\u002Fxlang-ai\u002FOSWorld)'s evaluation scripts, SWE RL using [mini-swe-agent](https:\u002F\u002Fgithub.com\u002FSWE-agent\u002Fmini-swe-agent)'s evaluation scripts, and tool-call RL based on the work of [Retool](https:\u002F\u002Fgithub.com\u002FReTool-RL\u002FReTool).\n\nWe sincerely thank these projects for their valuable insights and high-quality implementations, which have greatly facilitated our research.\n\n## ⚠️ Reminder\n\nWhen using OpenClaw-RL, please do not provide sensitive personal information during conversations with the model. Also, make sure to keep your API keys secure and never expose them in prompts, logs, or shared files.\n\n\n---\n\n\n\n","OpenClaw-RL is a project for training personalized agents through natural-language interaction. It supports fully asynchronous operation, requires no API calls or GPU resources, performs automatic optimization, and applies reinforcement learning driven by language feedback. The project uses hybrid reinforcement learning and targets a range of real-world settings, including terminal, GUI, software engineering, and tool-call scenarios. Its core capability is training agents to carry out specific tasks through conversation, which suits applications that need flexibly customized assistants, such as personal productivity, automated testing, and development assistance.",2,"2026-06-11 03:48:30","high_star"]