[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-76159":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":13,"contributorsCount":14,"subscribersCount":14,"size":14,"stars1d":14,"stars7d":14,"stars30d":15,"stars90d":14,"forks30d":14,"starsTrendScore":14,"compositeScore":16,"rankGlobal":9,"rankLanguage":9,"license":17,"archived":18,"fork":18,"defaultBranch":19,"hasWiki":20,"hasPages":18,"topics":21,"createdAt":9,"pushedAt":9,"updatedAt":22,"readmeContent":23,"aiSummary":24,"trendingCount":14,"starSnapshotCount":14,"syncStatus":12,"lastSyncTime":25,"discoverSource":26},76159,"SERL","OliverLeeXZ\u002FSERL","OliverLeeXZ","Official implement on 'What and When to Distill: Selective Hindsight Distillation for Multi-Turn Agents'",null,"Python",122,2,1,0,3,41.73,"Apache License 2.0",false,"main",true,[],"2026-06-12 04:01:20","\u003Cp align=\"center\">\n  \u003Cimg src=\".\u002Fdocs\u002Fserl\u002Fserl_icon.png\" alt=\"SERL logo\" width=\"30%\">\n\u003C\u002Fp>\n\n# SERL: Selective Hindsight Distillation for Long-Horizon LLM Agents\n\n\u003Cp align=\"center\">\n  \u003Ca href=\".\u002FLICENSE\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-Apache_2.0-blue.svg\" alt=\"License\">\u003C\u002Fa>\n  \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPython-3.10%20%7C%203.12-blue.svg\" alt=\"Python\">\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  \u003Ca href=\"#\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-Paper-b31b1b.svg\" alt=\"arXiv\">\u003C\u002Fa>\n  \u003Ca href=\"#\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FGitHub-Code-181717.svg\" alt=\"GitHub\">\u003C\u002Fa>\n  \u003Ca href=\"#\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FTwitter%2FX-Thread-000000.svg\" alt=\"Twitter\u002FX\">\u003C\u002Fa>\n  \u003Ca 
href=\"#\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FHugging%20Face-Project-ffcc4d.svg\" alt=\"Hugging Face\">\u003C\u002Fa>\n\u003C\u002Fp>\n\nSERL is a reinforcement-learning recipe for text-based LLM agents. It uses multi-feedback from agent-environment rollouts to build a teacher signal, then applies that signal selectively to action tokens while leaving chain-of-thought and formatting tokens under the original GRPO objective.\n\nThis release focuses on two long-horizon agent environments:\n\n- ALFWorld\n- WebShop\n\nMain entrypoints:\n\n- `recipe\u002Fserl\u002Frun_alfworld.sh`\n- `recipe\u002Fserl\u002Frun_webshop.sh`\n\n## 📣 What's New\n\n- **[2026.05.12]** SERL is released with training recipes for ALFWorld and WebShop.\n\n## ✨ Highlights\n\n1. **Multi-feedback hindsight signal.** SERL can condition the teacher on immediate feedback, next observation, future trajectory, successful trajectory, current trajectory, or combinations of these signals.\n2. **Action-token-only distillation.** The teacher signal reweights action tokens, while thinking tokens keep the normal GRPO full-response credit. This matches the method design in which feedback should guide what the agent does, not overwrite every reasoning token.\n3. **Flexible feedback granularity.** SERL supports step-level feedback and anchor-level variants that group semantically related states before applying hindsight feedback.\n4. **Practical agent recipes.** The repository keeps a compact open-source surface: one ALFWorld script, one WebShop script, and a single SERL config.\n\n## 🧭 Method Overview\n\nSERL targets the sparse-reward setting common in interactive agent tasks. During rollout, each sampled trajectory contains states, actions, task rewards, and immediate feedback. 
SERL builds privileged hindsight contexts from these records and asks a synchronized teacher policy to score the student's action tokens under that feedback.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\".\u002Fdocs\u002Fserl\u002Ffigure1.png\" alt=\"SERL multi-feedback sources and granularity\" width=\"100%\">\n\u003C\u002Fp>\n\nFigure 1 illustrates the feedback design. SERL can draw feedback from the current step, the next observation, the current trajectory, the future trajectory, and successful trajectories. It can also switch between step-level and anchor-level granularity.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\".\u002Fdocs\u002Fserl\u002Ffigure2.png\" alt=\"SERL selective action-token distillation\" width=\"100%\">\n\u003C\u002Fp>\n\nFigure 2 shows the selective distillation objective. Teacher-student probability gaps are converted into bounded action-token weights. Thinking tokens are masked from teacher reweighting, while action tokens are promoted, suppressed, or kept unchanged according to the teacher signal.\n\n\n## 🗂️ Repository Layout\n\n```text\nrecipe\u002Fserl\u002F                         SERL training recipe, config, trainer, and launch scripts\nrecipe\u002Fserl\u002Frun_alfworld.sh          ALFWorld launch script\nrecipe\u002Fserl\u002Frun_webshop.sh           WebShop launch script\nagent_system\u002Fenvironments\u002F           Multi-turn agent environment wrappers\njudge_utils\u002F                         Utilities for LLM-judged feedback\nexamples\u002Fdata_preprocess\u002Fprepare.py  Text-mode parquet preparation\ndocs\u002Fserl\u002F                           SERL logo and paper figures\n```\n\n## ⚙️ Installation\n\n### 🧱 Base Runtime\n\nCreate the base SERL environment from the repository root:\n\n```bash\nconda create -n serl python==3.12 -y\nconda activate serl\n\npip3 install vllm==0.11.0\npip3 install flash-attn==2.7.4.post1 --no-build-isolation --no-cache-dir\npip install -e .\n```\n\nEnvironment packages may have 
conflicting Python and dependency requirements. Use a separate conda environment for each backend when needed.\n\n### 🧪 ALFWorld\n\nInstall ALFWorld:\n\n```bash\npip3 install gymnasium==0.29.1\npip3 install stable-baselines3==2.6.0\npip install alfworld\n```\n\nDownload PDDL files, game files, and the pretrained MaskRCNN detector:\n\n```bash\nalfworld-download -f\n```\n\nSERL reads ALFWorld games from `ALFWORLD_DATA`. If you install the data outside the default `~\u002F.cache\u002Falfworld`, export the path before launching:\n\n```bash\nexport ALFWORLD_DATA=\u002Fpath\u002Fto\u002Falfworld\n```\n\nUse `--extra` if you also want pretrained checkpoints and seq2seq data:\n\n```bash\nalfworld-download -f --extra\n```\n\nVerify the text-game installation:\n\n```bash\nalfworld-play-tw\n```\n\n### 🛒 WebShop\n\nWebShop requires Python `\u003C=3.10`, so create a dedicated environment:\n\n```bash\nconda create -n serl-webshop python==3.10 -y\nconda activate serl-webshop\n```\n\nInstall WebShop dependencies and data inside the bundled WebShop directory:\n\n```bash\ncd .\u002Fagent_system\u002Fenvironments\u002Fenv_package\u002Fwebshop\u002Fwebshop\n.\u002Fsetup.sh -d small\n```\n\nThe default SERL WebShop config uses the 1k WebShop split and expects these files to exist under `agent_system\u002Fenvironments\u002Fenv_package\u002Fwebshop\u002Fwebshop\u002F`:\n\n```text\ndata\u002Fitems_shuffle_1000.json\ndata\u002Fitems_ins_v2_1000.json\nsearch_engine\u002Findexes\u002F\n```\n\nUse `.\u002Fsetup.sh -d all` instead if you plan to run with `env.webshop.use_small=False`. If `gdown` fails, visit `https:\u002F\u002Fdrive.google.com\u002F`, get your Google Drive cookie, and paste it into `.cache\u002Fgdown\u002Fcookies.txt`. 
Manual download of the required files is also acceptable, as long as the files are placed in the WebShop directory above and the search index has been built.\n\nAfter WebShop is installed, return to the SERL repository root and install the training dependencies in the same `serl-webshop` environment:\n\n```bash\ncd \u002Fpath\u002Fto\u002FSERL\npip3 install torch==2.6.0 --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu124\npip3 install flash-attn==2.7.4.post1 --no-build-isolation\npip3 install -e .\npip3 install vllm==0.8.2\n```\n\nWarnings about `spacy` or `weasel` requiring an older `typer` can be ignored for the WebShop training scripts.\n\n## 🚀 Quickstart\n\nPrepare the text-mode parquet files. The parquet files provide the text modality marker and dataset size. Task observations, valid actions, rewards, and feedback are produced online by the environment during rollout.\n\n```bash\nmkdir -p ~\u002Fdata\u002Fserl\u002Ftext\npython3 examples\u002Fdata_preprocess\u002Fprepare.py \\\n  --mode text \\\n  --local_dir ~\u002Fdata\u002Fserl \\\n  --train_data_size 256 \\\n  --val_data_size 256\n```\n\nThis creates:\n\n```text\n~\u002Fdata\u002Fserl\u002Ftext\u002Ftrain.parquet\n~\u002Fdata\u002Fserl\u002Ftext\u002Ftest.parquet\n```\n\nRun ALFWorld:\n\n```bash\nconda activate serl\nbash recipe\u002Fserl\u002Frun_alfworld.sh\n```\n\nRun WebShop:\n\n```bash\nconda activate serl-webshop\nbash recipe\u002Fserl\u002Frun_webshop.sh\n```\n\nThe launch scripts default to `Qwen\u002FQwen2.5-7B-Instruct`, `SAMPLING_MODE=immediate_feedback`, and `TRAJECTORY_FORMAT=response`.\n\nCommon ALFWorld override:\n\n```bash\nMODEL_PATH=Qwen\u002FQwen2.5-7B-Instruct \\\nTRAIN_FILE=~\u002Fdata\u002Fserl\u002Ftext\u002Ftrain.parquet \\\nVAL_FILE=~\u002Fdata\u002Fserl\u002Ftext\u002Ftest.parquet \\\nOUTPUT_ROOT=.\u002Foutputs\u002Falfworld \\\nSAMPLING_MODE=immediate_feedback \\\nTRAJECTORY_FORMAT=response \\\nbash recipe\u002Fserl\u002Frun_alfworld.sh\n```\n\nCommon WebShop 
override:\n\n```bash\nMODEL_PATH=Qwen\u002FQwen2.5-7B-Instruct \\\nTRAIN_FILE=~\u002Fdata\u002Fserl\u002Ftext\u002Ftrain.parquet \\\nVAL_FILE=~\u002Fdata\u002Fserl\u002Ftext\u002Ftest.parquet \\\nOUTPUT_ROOT=.\u002Foutputs\u002Fwebshop \\\nSAMPLING_MODE=immediate_feedback \\\nTRAJECTORY_FORMAT=response \\\nbash recipe\u002Fserl\u002Frun_webshop.sh\n```\n\nThe first positional argument can switch the rollout engine:\n\n```bash\nbash recipe\u002Fserl\u002Frun_alfworld.sh vllm\nbash recipe\u002Fserl\u002Frun_webshop.sh vllm\n```\n\nArbitrary Hydra overrides can be appended after the script:\n\n```bash\nSAMPLING_MODE=anchor_successful_sample_immediate_feedback \\\nbash recipe\u002Fserl\u002Frun_webshop.sh \\\n  trainer.total_epochs=150 \\\n  actor_rollout_ref.actor.optim.lr=1e-6\n```\n\n## 💬 Supported Feedback Modes\n\nSet the feedback source with `SAMPLING_MODE=\u003Cmode>`. Implementation names use `successful_sample` for a successful trajectory reference.\n\n| `SAMPLING_MODE` | Feedback source |\n| --- | --- |\n| `immediate_feedback` | immediate per-step feedback |\n| `next_observation` | next observation |\n| `future_trajectory` | future trajectory |\n| `successful_sample_or_immediate_feedback` | successful trajectory or immediate feedback |\n| `successful_sample_immediate_feedback` | successful trajectory and immediate feedback |\n| `successful_sample_next_observation` | successful trajectory and next observation |\n| `successful_sample_future_trajectory` | successful trajectory and future trajectory |\n| `successful_sample_future_trajectory_immediate_feedback` | successful trajectory, future trajectory, and immediate feedback |\n| `successful_sample_future_trajectory_next_observation` | successful trajectory, future trajectory, and next observation |\n\nExamples:\n\n```bash\nSAMPLING_MODE=immediate_feedback bash recipe\u002Fserl\u002Frun_alfworld.sh\nSAMPLING_MODE=successful_sample_immediate_feedback bash 
recipe\u002Fserl\u002Frun_webshop.sh\nSAMPLING_MODE=successful_sample_future_trajectory_next_observation bash recipe\u002Fserl\u002Frun_webshop.sh\n```\n\n## ⚓ Anchor-Level Feedback\n\nAnchor placement is enabled with the `anchor_` prefix. To disable anchor placement, use the corresponding non-anchor mode.\n\nSupported anchor modes:\n\n```text\nanchor_immediate_feedback\nanchor_next_observation\nanchor_future_trajectory\nanchor_successful_sample_or_immediate_feedback\nanchor_successful_sample_immediate_feedback\nanchor_successful_sample_next_observation\nanchor_successful_sample_future_trajectory\nanchor_successful_sample_future_trajectory_immediate_feedback\nanchor_successful_sample_future_trajectory_next_observation\n```\n\nExamples:\n\n```bash\nSAMPLING_MODE=anchor_immediate_feedback bash recipe\u002Fserl\u002Frun_alfworld.sh\nSAMPLING_MODE=anchor_successful_sample_immediate_feedback bash recipe\u002Fserl\u002Frun_webshop.sh\n```\n\nOptional similarity filtering can be enabled with Hydra overrides:\n\n```bash\nSAMPLING_MODE=anchor_immediate_feedback \\\nbash recipe\u002Fserl\u002Frun_webshop.sh \\\n  actor_rollout_ref.actor.serl.anchor_enable_similarity=True \\\n  actor_rollout_ref.actor.serl.anchor_similarity_thresh=0.95\n```\n\n## ⚖️ LLM-Judged Feedback\n\nSERL also supports judged feedback, where an OpenAI-compatible judge model summarizes a trajectory into concise guidance before teacher scoring.\n\n| `SAMPLING_MODE` | Meaning |\n| --- | --- |\n| `judge_current_traj` | Judge the current trajectory. |\n| `judge_current_traj_on_successful_sample` | Judge the current trajectory with a successful trajectory as reference. 
|\n\nExample:\n\n```bash\nJUDGE_API_URL=http:\u002F\u002Flocalhost:8000\u002Fv1 \\\nJUDGE_MODEL=your-judge-model \\\nJUDGE_API_KEY=your-api-key \\\nSAMPLING_MODE=judge_current_traj \\\nbash recipe\u002Fserl\u002Frun_alfworld.sh\n```\n\n## 🔀 Trajectory Format\n\nSERL supports two trajectory organization formats:\n\n| Format | Description |\n| --- | --- |\n| `response` | Response-oriented trajectory rendering. This is the default. |\n| `observation_action` | Observation-action turn rendering. |\n\nChoose the format with `TRAJECTORY_FORMAT=\u003Cformat>`:\n\n```bash\nTRAJECTORY_FORMAT=response bash recipe\u002Fserl\u002Frun_alfworld.sh\nTRAJECTORY_FORMAT=observation_action bash recipe\u002Fserl\u002Frun_webshop.sh\n```\n\n## ✏️ Citation\n\nBibTeX will be added when the paper metadata is public.\n\n## 🙏 Acknowledgement\n\nSERL is implemented on top of [verl-agent](https:\u002F\u002Fgithub.com\u002Flangfengq\u002Fverl-agent). The environment integrations build on [ALFWorld](https:\u002F\u002Fgithub.com\u002Falfworld\u002Falfworld) and [WebShop](https:\u002F\u002Fgithub.com\u002Fprinceton-nlp\u002FWebShop). We thank the authors and contributors of these projects.\n","SERL is a reinforcement-learning recipe for text-based LLM agents. It builds a teacher signal from multi-feedback collected during agent-environment interaction and applies that signal selectively to action tokens, while keeping chain-of-thought and formatting tokens under the original GRPO objective. Its core technical features include a hindsight signal built from multiple feedback sources (such as immediate feedback, next observation, and future trajectory), distillation on action tokens only, and support for flexible feedback granularity. The project is particularly suited to tasks that require long-horizon planning and decision-making, such as the two long-horizon agent environments ALFWorld and WebShop. By providing concise, easy-to-use training scripts, SERL lowers the barrier for researchers exploring solutions to complex interactive tasks.","2026-06-11 03:54:41","CREATED_QUERY"]