SERL

OliverLeeXZ

Official implementation of 'What and When to Distill: Selective Hindsight Distillation for Multi-Turn Agents'

AI Summary

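SERL is a reinforcement learning training method for long-horizon, text-based LLM agents. It implements the selective hindsight distillation mechanism proposed in the paper 'What and When to Distill: Selective Hindsight Distillation for Multi-Turn Agents'. Its core idea is to build teacher signals from multi-source hindsight feedback (such as immediate feedback, future trajectories, and successful trajectories) and to distill selectively on action tokens only, keeping the original GRPO objective for chain-of-thought tokens. It supports step-level and anchor-level feedback granularity and has been validated on two sparse-reward interactive environments, ALFWorld and WebShop. It suits text-agent training scenarios that need to use sparse feedback efficiently to improve multi-step decision-making.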
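The token-level selection described above can be illustrated with a minimal sketch. This is not the repository's code: the function names, the `beta` weight, the use of a forward KL term, and the choice to add the distillation term on top of a precomputed per-token GRPO loss are all assumptions made for illustration.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as probability lists."""
    return sum(
        pi * (math.log(pi + eps) - math.log(qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def selective_token_losses(grpo_losses, student_probs, teacher_probs,
                           action_mask, beta=1.0):
    """Combine per-token losses (hypothetical helper, not the SERL API).

    grpo_losses   : precomputed per-token GRPO loss values
    student_probs : per-token next-token distributions from the student policy
    teacher_probs : per-token distributions from the hindsight-informed teacher
    action_mask   : 1 for action tokens, 0 for chain-of-thought tokens
    beta          : assumed weight on the distillation term
    """
    losses = []
    for grpo, p_s, p_t, is_action in zip(
        grpo_losses, student_probs, teacher_probs, action_mask
    ):
        if is_action:
            # Action token: add a distillation term toward the teacher.
            losses.append(grpo + beta * kl_divergence(p_t, p_s))
        else:
            # Chain-of-thought token: keep the GRPO term unchanged.
            losses.append(grpo)
    return losses


# Three tokens: one chain-of-thought token followed by two action tokens.
grpo = [0.5, 0.5, 0.5]
student = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
teacher = [[0.9, 0.1], [0.9, 0.1], [0.5, 0.5]]
mask = [0, 1, 1]
result = selective_token_losses(grpo, student, teacher, mask)
```

In this example the first token is left at its GRPO loss even though the teacher disagrees with the student there, because it is not an action token; only the second token picks up a positive distillation penalty.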

Python

Apache License 2.0

Star growth

Today: 0

Last 7 days: 0

Last 30 days: +1

Overall score: 41

Default branch: main
