[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-4775":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":16,"stars7d":17,"stars30d":18,"stars90d":15,"forks30d":15,"starsTrendScore":19,"compositeScore":20,"rankGlobal":10,"rankLanguage":10,"license":21,"archived":22,"fork":22,"defaultBranch":23,"hasWiki":24,"hasPages":22,"topics":25,"createdAt":10,"pushedAt":10,"updatedAt":26,"readmeContent":27,"aiSummary":28,"trendingCount":15,"starSnapshotCount":15,"syncStatus":29,"lastSyncTime":30,"discoverSource":31},4775,"RL_Envs_101","adithya-s-k\u002FRL_Envs_101","adithya-s-k","Building and Scaling RL environments in the age of LLMs","https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FAdithyaSK\u002Frl-environments-guide",null,"Python",144,15,30,0,6,8,56,18,3.61,"Apache License 2.0",false,"main",true,[],"2026-06-12 02:01:03","# RL Environments 101: A Guide to Building RL Environments\n\n---\n\n[![Blog post](.\u002Fassets\u002Fblog_thumbnail.png)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FAdithyaSK\u002Frl-environments-guide)\n\n> 📝 **This repo is the companion code to the blog post:** **[RL Environments Guide →](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FAdithyaSK\u002Frl-environments-guide)**\n> Read the blog for the full write-up. This repo contains the runnable implementations referenced throughout.\n\n---\n\nA practical, hands-on guide to building RL environments for LLMs.\n\nThe idea is simple. Take the **same environment** and reimplement it across **multiple RL environment frameworks** (currently OpenEnv, ORS, NeMo Gym, Verifiers, SkyRL Gym, and GEM) so you can see, side by side, how each one models tools, state, rewards, and episodes. The goal isn't training. 
It's helping you understand the ecosystem: what each framework actually gives you, where the boundaries are, and what code you have to write yourself.\n\nWe start with three reference environments — a **Jupyter agent** (multi-turn, real code execution in an E2B sandbox), a **Wordle solver** (multi-turn, pure Python), and a **Desktop computer-use** env (multi-turn, vision-driven, full Linux desktop in an E2B sandbox) — and will keep adding more over time. Each new environment is another \"Rosetta stone\" entry: same logic, different framework dialects.\n\nIf you've ever wondered:\n- What is an \"RL environment\" really made of?\n- Why do six frameworks call the same thing by six different names?\n- Should I build my env as an HTTP server, or run it in-process?\n- How do I plug any of these into TRL's `GRPOTrainer`?\n\n…this repo is the answer. Each framework folder is a **runnable, minimal example** showing how to set up the environment and do a sample LLM rollout against it. We also walk through **how to think about designing an environment** in the first place: the components, the key decisions, and the common pitfalls, independent of any framework.\n\n## Agent Skills\n\nThis repo also ships **5 agent skills** at `.claude\u002Fskills\u002F` that turn a plain-English env description into runnable code across the 4 target frameworks. They follow the open [SKILL.md spec](https:\u002F\u002Fgithub.com\u002Fanthropics\u002Fskills) and work with any agent that supports it — **Claude Code, Cursor, Codex, OpenCode, Gemini CLI**, and dozens more.\n\n```bash\n# install into your current project (auto-detects which agent you use)\nnpx skills add adithya-s-k\u002FRL_Envs_101\n```\n\nSkills included:\n- **`rl-env-from-description`** — orchestrator. 
Just describe the env in plain English; it interviews you, picks an archetype, builds the shared domain module, and ports across all 4 frameworks.\n- **`generate-openenv-env`**, **`generate-ors-env`**, **`generate-verifiers-env`**, **`generate-nemo-gym-env`** — single-framework variant builders. Useful when you only want one.\n\nThe skills are **folder-agnostic** — they work in any project, don't assume the `envs\u002F\u003Cenv>\u002F` layout this repo uses, and ask where you want files written. See [Agent Skills](#agent-skills-detail) below for trigger phrases and design notes.\n\n## Table of Contents\n\n- [Repository Layout](#repository-layout)\n- [The Reference Environments](#the-reference-environments)\n- [Framework Cheat Sheet](#framework-cheat-sheet)\n- [How to Set Up the Jupyter Agent Environment](#how-to-set-up-the-jupyter-agent-environment)\n- [How to Set Up the Wordle Environment](#how-to-set-up-the-wordle-environment)\n- [How to Set Up the Desktop Environment](#how-to-set-up-the-desktop-environment)\n- [How to Build an RL Environment](#how-to-build-an-rl-environment) (framework-agnostic)\n- [Agent Skills](#agent-skills-detail)\n- [Further Reading](#further-reading)\n- [Contributing](#contributing)\n\n---\n\n## Repository Layout\n\n```\nRL_Envs_101\u002F\n├── README.md                       # this file\n├── assets\u002F                         # blog thumbnail, diagrams\n└── envs\u002F\n    ├── jupyter_env\u002F                # E2B-sandboxed Jupyter agent (multi-turn, 4 tools)\n    │   ├── openenv\u002F                # HTTP, MCP protocol\n    │   ├── ors\u002F                    # HTTP, REST + SSE\n    │   ├── nemo_gym\u002F               # HTTP, REST + cookies\n    │   ├── verifiers\u002F              # in-process (Python)\n    │   ├── skyrl_gym\u002F              # in-process (Gym-style)\n    │   └── gem\u002F                    # in-process (Gymnasium)\n    ├── wordle_env\u002F                 # Wordle solver (multi-turn, 1 tool)\n    │   ├── 
openenv\u002F\n    │   ├── ors\u002F\n    │   ├── nemo_gym\u002F\n    │   ├── verifiers\u002F\n    │   ├── skyrl_gym\u002F\n    │   └── gem\u002F\n    └── desktop_env\u002F                # Computer-use desktop (multi-turn, 19 tools, vision)\n        ├── desktop.py              # shared DesktopController (E2B + 19 actions)\n        ├── tasks.py                # shared task list\n        ├── openenv\u002F                # MCP + Gradio UI, image-block screenshots\n        ├── ors\u002F                    # ORS protocol, terminate-as-reward\n        ├── nemo_gym\u002F               # HTTP, REST + cookies, \u002Fverify\n        ├── verifiers\u002F              # in-process, plain Python (DesktopToolkit)\n        ├── skyrl_gym\u002F              # in-process, BaseTextEnv with action tags\n        └── gem\u002F                    # in-process, Gymnasium 5-tuple with action tags\n```\n\n---\n\n## The Reference Environments\n\n### Jupyter Agent (multi-turn, tool-using)\n- **What the model does:** writes and executes Python in a real Jupyter kernel running inside an E2B cloud sandbox, until it answers the question.\n- **Tools (4):** `add_and_execute_code_cell`, `edit_and_execute_current_cell`, `execute_shell_command`, `get_notebook_state`.\n- **Why it's interesting:** real code execution, persistent state across turns, a real external backend (E2B).\n\n### Wordle (multi-turn, deterministic)\n- **What the model does:** plays Wordle over multiple turns. It guesses a 5-letter word, sees per-letter feedback, refines, and repeats until it solves the puzzle or runs out of attempts.\n- **Tools (1):** `guess(word)`.\n- **Why it's interesting:** pure-Python logic, no external services, persistent state across turns. 
The cleanest way to see how each framework models multi-turn episodes without the noise of a sandbox backend.\n\nWordle is also the *cross-domain proof*: same training and rollout patterns work on a totally different problem with no changes.\n\n### Desktop Computer-Use (multi-turn, vision-driven)\n- **What the model does:** sees a screenshot of a full Linux desktop and drives the mouse\u002Fkeyboard with tool calls until the task is done.\n- **Tools (19):** mirror Anthropic's `computer_20251124` schema — `screenshot`, `left\u002Fright\u002Fmiddle\u002Fdouble\u002Ftriple_click`, `mouse_move`, `left_click_drag`, `left_mouse_down\u002Fup`, `scroll`, `type`, `key`, `hold_key`, `wait`, `terminate`, `run_command`, `cursor_position`, `get_screen_size`. Coordinates are `[x, y]` pixel arrays so OpenAI Operator and Qwen3-VL output drives the env with minimal token-level adaptation.\n- **Why it's interesting:** real cloud VM (E2B Desktop), screenshots returned as **MCP image blocks** (the model sees pixels, not base64 text), terminal reward via `terminate(status)`. 
Goes well beyond text-only envs.\n\n---\n\n## Framework Cheat Sheet\n\n| Framework  | Type             | Tool syntax            | Reward model     | Deployable | Best for |\n|------------|------------------|------------------------|------------------|------------|----------|\n| **OpenEnv**    | HTTP (MCP)       | `@mcp.tool`            | External         | ✅ Docker \u002F HF Space | Long-running sandboxes; MCP ecosystem |\n| **ORS**        | HTTP (REST+SSE)  | `@tool` + Pydantic     | Per-tool-call    | ✅ Docker \u002F HF Space \u002F OpenReward | Server-decided rewards; OpenReward marketplace |\n| **NeMo Gym**   | HTTP (REST)      | `app.post()`           | Post-episode `\u002Fverify` | ✅ Docker \u002F HF Space | NVIDIA stack; Ray-based scaling |\n| **Verifiers**  | in-process       | plain Python `def`     | `Rubric` system  | ⚙️          | Fast prototyping; bundled datasets |\n| **SkyRL Gym**  | in-process       | inside `step()`        | `step()` returns | ⚙️          | Gym-style RL; SkyRL training stack |\n| **GEM**        | in-process       | inside `step()`        | `step()` returns | ⚙️          | Gymnasium API; pure-Python games |\n\nHTTP frameworks (OpenEnv, ORS, NeMo Gym) wrap a remote server. In-process frameworks (Verifiers, SkyRL, GEM) run the env class in the same Python process as the trainer or rollout script.\n\n---\n\n## How to Set Up the Jupyter Agent Environment\n\nEvery framework folder under `envs\u002Fjupyter_env\u002F\u003Cframework>\u002F` ships a working `rollout.py`. Each rollout connects to the env (deployed HF Space or local server, depending on framework), wires up the env's tools, and drives a multi-turn loop with **Qwen3-Coder-480B** through Hugging Face Inference Providers using the standard `openai` Python client. 
Auto-detect: if `ROLLOUT_MODEL` contains a `:provider` suffix it's routed via the HF Router, otherwise it goes to OpenAI native.\n\n### Credentials (one-time setup)\n\n```bash\ncp .env.example .env       # at the repo root\n# fill in:\n#   HF_TOKEN=hf_...        for HF Inference Providers (Qwen)\n#   OPENAI_API_KEY=sk-...  optional, only if ROLLOUT_MODEL is an OpenAI model\n#   E2B_API_KEY=e2b_...    required for in-process envs and for running HTTP servers locally\n```\n\nEvery `rollout.py` reads these via `python-dotenv` from the **repo-root `.env`** — you don't need a `.env` per folder.\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>1. OpenEnv\u003C\u002Fb> &nbsp;·&nbsp; HTTP \u002F MCP &nbsp;·&nbsp; \u003Ccode>MCPToolClient\u003C\u002Fcode> &nbsp;·&nbsp; deployed + local both verified\u003C\u002Fsummary>\n\n```bash\ncd envs\u002Fjupyter_env\u002Fopenenv\nuv sync\nuv run python rollout.py                 # talks to deployed HF Space by default\n# or run the env locally first:\nuv run python -m server.app              # serves on :8000\nOPENENV_URL=http:\u002F\u002Flocalhost:8000 uv run python rollout.py\n```\nThe rollout uses `openenv-core`'s generic `MCPToolClient` — no env-specific package install required. Tools are auto-discovered via `list_tools()` and converted to OpenAI tool schemas. Deployed: [`AdithyaSK\u002Fjupyter-agent-openenv`](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FAdithyaSK\u002Fjupyter-agent-openenv). Verified end-to-end with both Qwen and `gpt-4o-mini`.\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>2. 
ORS\u003C\u002Fb> &nbsp;·&nbsp; HTTP \u002F REST + SSE &nbsp;·&nbsp; \u003Ccode>openreward\u003C\u002Fcode> &nbsp;·&nbsp; per-call reward &nbsp;·&nbsp; deployed + local both verified\u003C\u002Fsummary>\n\n```bash\ncd envs\u002Fjupyter_env\u002Fors\nuv sync\nuv run python rollout.py                 # talks to deployed HF Space\n# or local:\nuv run python server.py                  # serves on :8080\nORS_URL=http:\u002F\u002Flocalhost:8080 uv run python rollout.py\n```\nUses the official [`openreward`](https:\u002F\u002Fpypi.org\u002Fproject\u002Fopenreward\u002F) client: `EnvironmentsAPI(base_url=..., api_key=\"\").get(\"jupyteragentors\").session(task=tasks[0])`. Reward arrives **per tool call** as `ToolOutput.reward`. Deployed: [`AdithyaSK\u002Fjupyter-agent-ors`](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FAdithyaSK\u002Fjupyter-agent-ors). Verified end-to-end (`reward=1.18 finished=True`).\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>3. NeMo Gym\u003C\u002Fb> &nbsp;·&nbsp; HTTP \u002F REST + cookies &nbsp;·&nbsp; raw \u003Ccode>requests\u003C\u002Fcode> &nbsp;·&nbsp; deployed only (Ray blocks local)\u003C\u002Fsummary>\n\n```bash\ncd envs\u002Fjupyter_env\u002Fnemo_gym\nuv sync                                  # needs Python 3.12\nuv run python rollout.py                 # talks to deployed HF Space\n```\nRaw HTTP via `requests` + cookies, no SDK needed. `POST \u002Fseed_session` sets the session cookie, then `POST \u002F\u003Ctool_name>` for each call. Deployed: [`AdithyaSK\u002Fjupyter-agent-nemo-gym`](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FAdithyaSK\u002Fjupyter-agent-nemo-gym).\n\n> ⚠️ NeMo Gym **requires Ray** at server startup, which fails on shared HF \u002F SLURM cluster nodes (`gcs_server` can't bind). Local `python server.py` does not work on those machines, so the deployed Space is the path. 
See `envs\u002Fjupyter_env\u002Fnemo_gym\u002FREADME.md` for the full story.\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>4. Verifiers\u003C\u002Fb> &nbsp;·&nbsp; in-process \u002F plain Python &nbsp;·&nbsp; auto-built OpenAI tool schemas via \u003Ccode>inspect\u003C\u002Fcode>\u003C\u002Fsummary>\n\n```bash\ncd envs\u002Fjupyter_env\u002Fverifiers\nuv sync\nuv run python rollout.py\n```\nNo server. The 4 tool functions are imported directly from `env.py`; OpenAI tool schemas are auto-generated from each function's signature + docstring via `inspect`. The E2B sandbox is created in-process, so `E2B_API_KEY` is required.\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>5. SkyRL Gym\u003C\u002Fb> &nbsp;·&nbsp; in-process \u002F \u003Ccode>BaseTextEnv\u003C\u002Fcode> &nbsp;·&nbsp; text-action with tag parsing\u003C\u002Fsummary>\n\n```bash\ncd envs\u002Fjupyter_env\u002Fskyrl_gym\nuv sync\nuv run python rollout.py\n```\n`JupyterSkyRLEnv(BaseTextEnv)` with `init()` \u002F `step()`. **No OpenAI tool-calling** — the rollout passes the raw assistant text as the action; the env parses `\u003Ccode>...\u003C\u002Fcode>` \u002F `\u003Cshell>...\u003C\u002Fshell>` \u002F `\u003Cedit>...\u003C\u002Fedit>` tags out of it. `step()` returns `BaseTextEnvStepOutput(observations, reward, done, ...)`.\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>6. GEM\u003C\u002Fb> &nbsp;·&nbsp; in-process \u002F \u003Ccode>gem.Env\u003C\u002Fcode> &nbsp;·&nbsp; Gymnasium 5-tuple\u003C\u002Fsummary>\n\n```bash\ncd envs\u002Fjupyter_env\u002Fgem\nuv sync\nuv run python rollout.py\n```\n`JupyterGemEnv(gem.Env)` with `reset()` \u002F `step()`. Same text-action + tag-parsing pattern as SkyRL, but `step()` returns the classic Gymnasium 5-tuple `(obs, reward, terminated, truncated, info)`. 
Has `spawn()` for parallel rollouts.\n\n\u003C\u002Fdetails>\n\n### Common rollout knobs\n\n| Variable | Default | Where it goes |\n|---|---|---|\n| `ROLLOUT_MODEL` | `Qwen\u002FQwen3-Coder-480B-A35B-Instruct:together` | If it contains `:` → HF Router. Else → OpenAI native. |\n| `MAX_TURNS` | `6`–`8` | Hard cap on tool-call \u002F step turns per rollout. |\n| `OPENENV_URL` \u002F `ORS_URL` \u002F `NEMO_GYM_URL` | deployed HF Space | Set to `http:\u002F\u002Flocalhost:\u003Cport>` to hit a local server. |\n\n### Local-server status (verified)\n\n| Framework | Deployed Space | Local server |\n|---|---|---|\n| openenv | ✅ | ✅ `uv run python -m server.app` (:8000) |\n| ors | ✅ | ✅ `uv run python server.py` (:8080) |\n| nemo_gym | ✅ | ⚙️ Ray init fails on shared cluster nodes |\n| verifiers \u002F skyrl_gym \u002F gem | n\u002Fa (in-process) | n\u002Fa (in-process) |\n\n> Each framework subfolder has its own `README.md` with the canonical consumption pattern, configuration knobs, and full sample rollout output.\n\n---\n\n## How to Set Up the Wordle Environment\n\nWordle has **no external backend** — it's pure Python (the shared `WordleGame` lives in `envs\u002Fwordle_env\u002Fgame.py`). The same `guess(word)` tool, the same dictionary, the same scoring, written six different ways. Each framework folder ships a working `rollout.py` and `README.md` following the exact same pattern as the Jupyter agent rollouts.\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>1. OpenEnv\u003C\u002Fb> &nbsp;·&nbsp; HTTP \u002F MCP &nbsp;·&nbsp; 3 tools: \u003Ccode>guess\u003C\u002Fcode>, \u003Ccode>get_history\u003C\u002Fcode>, \u003Ccode>reset_game\u003C\u002Fcode>\u003C\u002Fsummary>\n\n```bash\ncd envs\u002Fwordle_env\u002Fopenenv && uv sync && uv run python rollout.py\n```\nGeneric `MCPToolClient` against [`AdithyaSK\u002Fwordle-openenv`](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FAdithyaSK\u002Fwordle-openenv).\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>2. 
ORS\u003C\u002Fb> &nbsp;·&nbsp; HTTP \u002F REST + SSE &nbsp;·&nbsp; 50 bundled tasks in the \u003Ccode>train\u003C\u002Fcode> split\u003C\u002Fsummary>\n\n```bash\ncd envs\u002Fwordle_env\u002Fors && uv sync && uv run python rollout.py\n```\n[`openreward`](https:\u002F\u002Fpypi.org\u002Fproject\u002Fopenreward\u002F) client → `EnvironmentsAPI(base_url=..., api_key=\"\").get(\"wordleors\")` against [`AdithyaSK\u002Fwordle-ors`](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FAdithyaSK\u002Fwordle-ors). Each task has the answer in `task_spec`.\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>3. NeMo Gym\u003C\u002Fb> &nbsp;·&nbsp; HTTP \u002F REST + cookies &nbsp;·&nbsp; raw \u003Ccode>requests\u003C\u002Fcode>\u003C\u002Fsummary>\n\n```bash\ncd envs\u002Fwordle_env\u002Fnemo_gym && uv sync && uv run python rollout.py\n```\nRaw `requests` against [`AdithyaSK\u002Fwordle-nemo-gym`](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FAdithyaSK\u002Fwordle-nemo-gym). Same Ray-blocks-local caveat as the Jupyter sibling — deployed Space is the path.\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>4. Verifiers\u003C\u002Fb> &nbsp;·&nbsp; in-process \u002F \u003Ccode>WordleToolkit\u003C\u002Fcode>\u003C\u002Fsummary>\n\n```bash\ncd envs\u002Fwordle_env\u002Fverifiers && uv sync && uv run python rollout.py\n```\nImports `WordleToolkit`, auto-builds OpenAI tool schemas via `inspect`, drives the loop manually.\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>5. 
SkyRL Gym\u003C\u002Fb> &nbsp;·&nbsp; in-process \u002F \u003Ccode>BaseTextEnv\u003C\u002Fcode> &nbsp;·&nbsp; \u003Ccode>&lt;guess&gt;word&lt;\u002Fguess&gt;\u003C\u002Fcode> tag parsing\u003C\u002Fsummary>\n\n```bash\ncd envs\u002Fwordle_env\u002Fskyrl_gym && uv sync && uv run python rollout.py\n```\n`WordleSkyRLEnv(BaseTextEnv)` with text-action: model emits `\u003Cguess>word\u003C\u002Fguess>`, env parses.\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>6. GEM\u003C\u002Fb> &nbsp;·&nbsp; in-process \u002F \u003Ccode>gem.Env\u003C\u002Fcode> &nbsp;·&nbsp; Gymnasium 5-tuple\u003C\u002Fsummary>\n\n```bash\ncd envs\u002Fwordle_env\u002Fgem && uv sync && uv run python rollout.py\n```\n`WordleGemEnv(gem.Env)` returns `(obs, reward, terminated, truncated, info)`.\n\n\u003C\u002Fdetails>\n\n> Compare any two `server.py` (or env class) files side-by-side and you'll learn more about the frameworks in 10 minutes than from any docs page.\n\nThe HTTP variants are deployed on HF Spaces (cold-start may take a minute):\n\n- OpenEnv: [`AdithyaSK\u002Fwordle-openenv`](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FAdithyaSK\u002Fwordle-openenv)\n- ORS: [`AdithyaSK\u002Fwordle-ors`](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FAdithyaSK\u002Fwordle-ors)\n- NeMo Gym: [`AdithyaSK\u002Fwordle-nemo-gym`](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FAdithyaSK\u002Fwordle-nemo-gym)\n\nThe shared `WordleGame` logic lives at `envs\u002Fwordle_env\u002Fgame.py` and is reused by all six framework folders.\n\n---\n\n## How to Set Up the Desktop Environment\n\nThe Desktop env is the third reference: a full Linux desktop in a cloud sandbox, controlled by the model with vision + computer-use tools. 
**Six framework variants**, all sharing the same 19-tool action schema modelled on Anthropic's `computer_20251124` (the broadest superset across Claude \u002F OpenAI Operator \u002F Qwen3-VL ComputerUse) so a model's native computer-use output drives the env with minimal token-level adaptation.\n\nThe shared `DesktopController` in `envs\u002Fdesktop_env\u002Fdesktop.py` wraps E2B Desktop with all 19 actions (`screenshot`, `left\u002Fright\u002Fmiddle\u002Fdouble\u002Ftriple_click`, `mouse_move`, `left_click_drag`, `left_mouse_down\u002Fup`, `scroll`, `type`, `key`, `hold_key`, `wait`, `terminate`, `run_command`, `cursor_position`, `get_screen_size`). Coordinates are `[x, y]` arrays in pixel space.\n\nThe HTTP variants ship **two rollouts**: OpenAI `computer-use-preview` (Responses API) and Qwen3-VL via HF Router. The in-process variants ship one Qwen3-VL rollout (multimodal per turn).\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>1. OpenEnv\u003C\u002Fb> &nbsp;·&nbsp; HTTP \u002F MCP &nbsp;·&nbsp; Gradio UI &nbsp;·&nbsp; \u003Ccode>ImageContent\u003C\u002Fcode> screenshots &nbsp;·&nbsp; deployed + local\u003C\u002Fsummary>\n\n```bash\ncd envs\u002Fdesktop_env\u002Fopenenv\nuv sync\nuv run uvicorn server.app:app --port 8000 &\nuv run python rollout_openai.py                  # OpenAI computer-use-preview\nuv run python rollout_qwen.py                    # Qwen3-VL via HF Router\n```\nGeneric `MCPToolClient` against [`AdithyaSK\u002Fdesktop-openenv`](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FAdithyaSK\u002Fdesktop-openenv). Custom Gradio UI mounted at `\u002Fweb` reuses the original `e2b_desktop` reference UI. Screenshots come back as MCP image blocks so the model actually sees pixels.\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>2. 
ORS\u003C\u002Fb> &nbsp;·&nbsp; HTTP \u002F REST + SSE &nbsp;·&nbsp; \u003Ccode>openreward\u003C\u002Fcode> &nbsp;·&nbsp; per-call reward + \u003Ccode>terminate\u003C\u002Fcode> signal\u003C\u002Fsummary>\n\n```bash\ncd envs\u002Fdesktop_env\u002Fors && uv sync\nuv run python server.py --port 8080 &\nuv run python rollout_openai.py\nuv run python rollout_qwen.py\n```\n[`openreward`](https:\u002F\u002Fpypi.org\u002Fproject\u002Fopenreward\u002F) client → `EnvironmentsAPI(base_url=..., api_key=\"\").get(\"desktopors\")` against [`AdithyaSK\u002Fdesktop-ors`](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FAdithyaSK\u002Fdesktop-ors). `terminate(status=\"success\")` → `reward=1.0, finished=True`.\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>3. NeMo Gym\u003C\u002Fb> &nbsp;·&nbsp; HTTP \u002F REST + cookies &nbsp;·&nbsp; raw \u003Ccode>requests\u003C\u002Fcode> &nbsp;·&nbsp; \u003Ccode>\u002Fverify\u003C\u002Fcode> grader\u003C\u002Fsummary>\n\n```bash\ncd envs\u002Fdesktop_env\u002Fnemo_gym && uv sync && uv run python server.py\nuv run python rollout.py\n```\n19 tools as `app.post(\"\u002F\u003Ctool>\")` endpoints + `\u002Fseed_session` + `\u002Fverify`. Same Ray-blocks-local caveat as the Jupyter sibling — deployed Space is the path on shared cluster nodes.\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>4. Verifiers\u003C\u002Fb> &nbsp;·&nbsp; in-process \u002F plain Python &nbsp;·&nbsp; \u003Ccode>DesktopToolkit\u003C\u002Fcode>\u003C\u002Fsummary>\n\n```bash\ncd envs\u002Fdesktop_env\u002Fverifiers && uv sync && uv run python rollout.py\n```\n`DesktopToolkit` owns one E2B sandbox per episode; public methods are introspected as tools by both the TRL adapter and `vf.ToolEnv`. `screenshot()` returns the image as base64 PNG embedded in markdown.\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>5. 
SkyRL Gym\u003C\u002Fb> &nbsp;·&nbsp; in-process \u002F \u003Ccode>BaseTextEnv\u003C\u002Fcode> &nbsp;·&nbsp; tag-parsed actions\u003C\u002Fsummary>\n\n```bash\ncd envs\u002Fdesktop_env\u002Fskyrl_gym && uv sync && uv run python rollout.py\n```\n`DesktopSkyRLEnv(BaseTextEnv)` parses action tags from free text: `\u003Cclick x=\"100\" y=\"200\"\u002F>`, `\u003Ctype>hello\u003C\u002Ftype>`, `\u003Ckey>ctrl+s\u003C\u002Fkey>`, `\u003Cterminate status=\"success\"\u002F>`, etc. The rollout sends the latest screenshot as an image in the user message each turn so a multimodal model can ground its coordinates.\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>6. GEM\u003C\u002Fb> &nbsp;·&nbsp; in-process \u002F \u003Ccode>gem.Env\u003C\u002Fcode> &nbsp;·&nbsp; Gymnasium 5-tuple, same tag grammar\u003C\u002Fsummary>\n\n```bash\ncd envs\u002Fdesktop_env\u002Fgem && uv sync && uv run python rollout.py\n```\n`DesktopGemEnv(gem.Env)` returns `(obs, reward, terminated, truncated, info)`. Same tag grammar as SkyRL — only the framework wrapping differs.\n\n\u003C\u002Fdetails>\n\nThe HTTP variants are deployed on HF Spaces (cold-start may take a minute):\n\n- OpenEnv: [`AdithyaSK\u002Fdesktop-openenv`](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FAdithyaSK\u002Fdesktop-openenv)\n- ORS: [`AdithyaSK\u002Fdesktop-ors`](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FAdithyaSK\u002Fdesktop-ors)\n\nBoth Spaces expect `E2B_API_KEY` set as a Space secret. 
The in-process variants need `E2B_API_KEY` in your repo-root `.env`.\n\n### Local-rollout status (verified)\n\n| Framework | Result |\n|---|---|\n| openenv | ✅ end-to-end vs deployed Space (OpenAI computer-use-preview + Qwen3-VL) |\n| ors     | ✅ end-to-end vs deployed Space (both models) |\n| nemo_gym | ⚙️ Ray init fails on shared cluster nodes (same as wordle\u002Fjupyter siblings) |\n| verifiers | ✅ in-process rollout via `DesktopToolkit` (Qwen3-VL) |\n| skyrl_gym | ✅ in-process rollout — tag-parsed actions reach E2B (Qwen3-VL) |\n| gem       | ✅ in-process rollout — `reward=1.0` on first turn (Qwen3-VL emitted `\u003Cclick>`+`\u003Ctype>`+`\u003Ckey>`+`\u003Cterminate>` inline) |\n\n> Note on coordinate spaces: Qwen3-VL emits coordinates outside the configured display (e.g. y≈965 in a 768-px screen), suggesting an internal normalized scale. A small rescaling adapter in the rollout will be needed before training.\n\n---\n\n## How to Build an RL Environment\n\nFramework-agnostic. This section is about **how to think** before you start writing code.\n\n### Step 1. Define the loop in plain English\n\nBefore opening any framework's docs, write down:\n\n1. **What is the model trying to do?** (\"Solve coding tasks\", \"Play Wordle\", \"Browse the web until it finds X\").\n2. **What can it DO?** List the actions and tools.\n3. **What does it SEE back?** The observation format.\n4. **When is it done?** Termination condition.\n5. **How do you score it?** The reward function, even a sketch.\n\nIf you can't write this in 10 lines, you don't have an environment yet. You have an idea.\n\n### Step 2. Identify the components\n\nEvery RL environment, regardless of framework, is made of these eight pieces:\n\n| Component | What it answers | Decide before coding |\n|-----------|----------------|----------------------|\n| **Tasks \u002F Dataset** | *What problems should the model solve?* | List 5 to 10 example tasks by hand. 
|\n| **Prompt template** | *How is the task presented?* | Write the system + user prompt. |\n| **Tools \u002F Actions** | *What can the model DO?* | Sketch function signatures. |\n| **Observations** | *What does the model SEE back?* | Decide: raw string? structured? |\n| **Execution backend** | *Where do actions actually run?* | Sandbox? In-process Python? None? |\n| **State** | *What persists across turns?* | Session-scoped dict? File system? |\n| **Reward \u002F Rubric** | *How is success measured?* | Exact match? LLM-as-judge? Unit tests? |\n| **Termination** | *When does it end?* | Max turns? `done` from a tool? |\n\nPicking a framework before you've written these down is putting the cart before the horse.\n\n### Step 3. Make four key decisions\n\nThese four decisions, more than any framework feature, determine what your environment will look like.\n\n#### Decision A. In-process or HTTP server?\n\n| Factor | Pick **in-process** if… | Pick **HTTP server** if… |\n|---|---|---|\n| Backend | Pure Python (game logic, math) | Sandbox \u002F Docker \u002F external service |\n| Scale | \u003C100 parallel rollouts | 100s to 1000s of concurrent sessions |\n| Iteration speed | You're prototyping | Production deployment |\n| Resource isolation | Doesn't matter | Env shouldn't share GPU node deps |\n| Languages | Python only | Mixed (env can be in any language) |\n\n**Rule of thumb:** start in-process. Move to HTTP only when you outgrow it.\n\n#### Decision B. Single-turn or multi-turn?\n\n- **Single-turn:** the model produces one output, you score it, done. (A math problem, classification, single-shot guess.) Reward is a function over the final answer.\n- **Multi-turn:** the model takes multiple actions, sees results, decides what to do next. (Coding agent, Wordle, web browser, dialog.) State must persist, and you must decide *who* controls the loop (trainer, framework, or env).\n\nMulti-turn is **far** more complex. 
If you can frame your task as single-turn, do it.\n\n#### Decision C. Where does the reward come from?\n\n| Pattern | When to use | Example framework |\n|---|---|---|\n| **External** (training script computes from final output) | Reward depends on the trajectory as a whole | OpenEnv, Verifiers, SkyRL, GEM |\n| **Per tool call** (env returns reward with each action) | You can score every step independently | ORS |\n| **Post-episode `\u002Fverify`** (separate endpoint scores the run) | Holistic LLM-as-judge or unit-test scoring | NeMo Gym |\n\nIf you're unsure, **start with external**. It's the most flexible and the easiest to debug.\n\n#### Decision D. Stateless or stateful tools?\n\n- **Stateless tools** (`add(a,b)` returning `a+b`) are trivial: no session needed.\n- **Stateful tools** (`run_code(...)` in a Jupyter kernel) need session management. Every concurrent rollout needs its own isolated state. This is where session IDs, cookies, and sandbox lifetimes start to matter.\n\nIf your tools are stateful, you'll spend half your engineering time on state management. Plan for it.\n\n### Step 4. Pick the framework that matches your decisions\n\n| If you decided… | Strong match |\n|---|---|\n| In-process + bundled dataset + rubric system | **Verifiers** |\n| In-process + Gymnasium API + parallel `make_vec()` | **GEM** |\n| In-process + Gym-style + SkyRL trainer | **SkyRL Gym** |\n| HTTP + MCP \u002F community + HF Spaces | **OpenEnv** |\n| HTTP + per-call rewards + OpenReward marketplace | **ORS** |\n| HTTP + post-episode verify + NVIDIA stack | **NeMo Gym** |\n\nWhen in doubt: **prototype in Verifiers (fastest), productionize in OpenEnv or ORS (deployable).**\n\n### Step 5. Implement the smallest possible version first\n\nDon't try to build the final environment on day one. Build the dumbest possible version:\n\n1. **One task.** Hardcoded.\n2. **One tool.** Even if your real env has ten.\n3. **No reward.** Just print \"got result: X\".\n4. 
**One rollout.** With a known model, e.g. `Qwen3-4B`, no training.\n\nGet that working end-to-end. Only then add: more tasks, more tools, real rewards, batching, async, deployment.\n\n### Step 6. Validate with a rollout, not with training\n\nTraining is a slow, expensive way to find out your environment is broken. Before you run *any* training:\n\n- Manually call `env.reset()`, then call each tool, then `env.close()`.\n- Run a single LLM rollout and **read the trajectory by hand**. Did the model see what you expected? Did the tool outputs make sense? Did the reward fire correctly?\n- If a human can't read the trajectory and tell whether the model did well, neither can a reward function.\n\nThe biggest mistakes in RL env design are caught by reading 5 trajectories. They will *not* be caught by 1000 training steps.\n\n### Common pitfalls\n\n- **Reward is too sparse.** Every rollout returns 0.0, so GRPO has no signal. Fix: design partial credit, or pick easier tasks for the smoke test.\n- **Reward is too dense or leaky.** Model gets reward for behaviors that don't generalize. 
Fix: read trajectories, look for shortcuts.\n- **Tasks are too easy.** Model solves them in one tool call, so there's no learning signal in multi-turn settings.\n- **Tools are too powerful.** One tool can solve everything, so there's no exploration and no interesting behavior.\n- **State leaks across rollouts.** Same sandbox or dict reused without reset, so episodes contaminate each other.\n- **No timeout or max turns.** A buggy model loops forever and stalls training.\n- **Observation format the model can't parse.** Huge JSON dumps, or stack traces longer than the context window.\n\n---\n\n\u003Ca id=\"agent-skills-detail\">\u003C\u002Fa>\n\n## Agent Skills\n\n5 agent skills under `.claude\u002Fskills\u002F`, written to the open [SKILL.md spec](https:\u002F\u002Fgithub.com\u002Fanthropics\u002Fskills) so any spec-compliant agent (Claude Code, Cursor, Codex, OpenCode, Gemini CLI, …) can load them.\n\n| Skill | What it builds |\n|---|---|\n| `rl-env-from-description` | Orchestrator — interview, archetype selection, shared domain module, all 4 framework variants, smoke-test rollouts |\n| `generate-openenv-env` | OpenEnv (Meta) MCP variant |\n| `generate-ors-env` | OpenReward (ORS) per-call-reward variant |\n| `generate-verifiers-env` | Verifiers (PrimeIntellect) in-process variant |\n| `generate-nemo-gym-env` | NeMo Gym (NVIDIA) Resources Server variant |\n\n### Install\n\n```bash\n# auto-detects your agent (Claude Code, Cursor, Codex, etc.) and installs into the right place\nnpx skills add adithya-s-k\u002FRL_Envs_101\n```\n\nIf you've cloned this repo, the skills are already loaded — every spec-compliant agent auto-discovers `.claude\u002Fskills\u002F` when launched in the repo (verify with `ls .claude\u002Fskills\u002F`).\n\n### Use\n\nTriggering is automatic from the descriptions. 
Examples:\n\n| What you type | Triggers |\n|---|---|\n| *\"make me an env where the agent plays connect-four\"* | `rl-env-from-description` (orchestrator) |\n| *\"wrap my game in OpenEnv\"* | `generate-openenv-env` |\n| *\"add per-call rewards via OpenReward\"* | `generate-ors-env` |\n| *\"build a Verifiers toolkit for X\"* | `generate-verifiers-env` |\n| *\"make a NeMo Gym resources server\"* | `generate-nemo-gym-env` |\n\nThe skills are **folder-agnostic** — they work in any project, don't assume the `envs\u002F\u003Cenv>\u002F` layout this repo uses, and ask where you want files written.\n\n---\n\n## Further Reading\n\n📝 **Blog post:** [RL Environments Guide](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FAdithyaSK\u002Frl-environments-guide), the full write-up this repo accompanies.\n\n### Framework links\n\n- [OpenEnv](https:\u002F\u002Fgithub.com\u002Fmeta-pytorch\u002FOpenEnv) (Meta)\n- [ORS \u002F OpenReward](https:\u002F\u002Fopenrewardstandard.io\u002F) (General Reasoning)\n- [NeMo Gym](https:\u002F\u002Fgithub.com\u002FNVIDIA-NeMo\u002FGym) (NVIDIA)\n- [Verifiers](https:\u002F\u002Fgithub.com\u002FPrimeIntellect-ai\u002Fverifiers) (PrimeIntellect)\n- [SkyRL Gym](https:\u002F\u002Fgithub.com\u002FNovaSky-AI\u002FSkyRL\u002Ftree\u002Fmain\u002Fskyrl-gym) (NovaSky-AI)\n- [GEM](https:\u002F\u002Fgithub.com\u002Faxon-rl\u002Fgem) (Axon-RL)\n\n---\n\n## Contributing\n\n🚧 **More environments and framework implementations are on the way. PRs welcome!**\n\nGood ways to contribute:\n- **Port an existing env to a new framework** (e.g. 
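add a 7th implementation).\n- **Add a new reference environment.** Pick something with a clear loop and reward, and ship it across as many frameworks as you can.\n- **Improve the rollout or setup scripts.** Make them clearer, faster, more portable.\n- **Fix bugs or docs.** Typos, broken commands, outdated links.\n\nOpen an issue first if you're planning anything larger than a small fix.\n","This project is a practical guide that helps developers build reinforcement learning (RL) environments for large language models (LLMs). By reimplementing the same environment across multiple RL environment frameworks (such as OpenEnv, ORS, and NeMo Gym), it lets users directly compare how each framework handles tools, state, rewards, and episodes. The project provides three reference environments (a Jupyter agent, a Wordle solver, and a desktop computer-use environment) and ships five agent skills that turn natural-language descriptions into runnable code, with support for multiple agents such as Claude Code and Cursor. It suits developers who want a deeper understanding of how RL environments are built and how the frameworks differ, as well as researchers who need to set up custom RL environments quickly.",2,"2026-06-11 03:00:26","CREATED_QUERY"]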