[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-76157":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":16,"stars7d":12,"stars30d":17,"stars90d":15,"forks30d":15,"starsTrendScore":18,"compositeScore":13,"rankGlobal":9,"rankLanguage":9,"license":19,"archived":20,"fork":20,"defaultBranch":21,"hasWiki":22,"hasPages":20,"topics":23,"createdAt":9,"pushedAt":9,"updatedAt":24,"readmeContent":25,"aiSummary":26,"trendingCount":15,"starSnapshotCount":15,"syncStatus":27,"lastSyncTime":28,"discoverSource":29},76157,"SOD","YoungZ365\u002FSOD","YoungZ365","PyTorch-based open-source code for paper \"SOD: Step-wise On-policy Distillation for Small Language Model Agents\"",null,"Python",141,9,3,4,0,7,41,21,"Apache License 2.0",false,"main",true,[],"2026-06-12 02:03:40","\u003Cdiv align=\"center\">\n\n\u003Ch2>SOD: Step-wise On-policy Distillation for\u003Cbr>Small Language Model Agents\u003C\u002Fh2>\n\n\u003Cp>\n  \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.07725\">\n    \u003Cimg\n      src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPaper-Arxiv-red?logo=arxiv&logoColor=red\"\n      alt=\"Paper on arXiv\"\n    \u002F>\n  \u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Fyoungzhong\u002Fsod\">\n    \u003Cimg \n        src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FModels-SOD-blue?logo=huggingface&logoColor=yellow\" \n        alt=\"Models\u002FSOD\"\n    \u002F>\n  \u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fcollections\u002FGen-Verse\u002Fopen-agentrl-68eda4c05755ca5a8c663656\">\n    \u003Cimg \n        src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDatasets-Agent%20RL%20Datasets-orange?logo=huggingface&logoColor=yellow\" \n        alt=\"Datasets for Agent 
RL\"\n    \u002F>\n  \u003C\u002Fa>\n\u003C\u002Fp>\n\n\u003C\u002Fdiv>\n\n## Introduction\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002Fintro.png\" width=\"100%\">\n\u003C\u002Fp>\n\nApplying On-Policy Distillation (OPD) to Tool-Integrated Reasoning (TIR) suffers from **cascading error propagation**: incorrect tool calls inject out-of-distribution observations that progressively amplify the student-teacher distribution shift, rendering the teacher's token-level supervision unreliable or even harmful.\n\n**SOD (Step-wise On-policy Distillation)** addresses this by introducing an adaptive step-level weighting mechanism that:\n- **Suppresses** distillation loss on steps where the student has drifted far from the teacher (erroneous pattern)\n- **Restores** full supervision when the student recovers alignment (recovery pattern)\n- **Maintains** dense token-level guidance on well-aligned steps (stable pattern)\n\nAll at **negligible additional computational cost** — the divergence metric reuses log-probabilities already computed in the OPD forward pass.\n\nExperiments on challenging math, science, and code benchmarks show that SOD achieves up to **20.86%** improvement over the second-best baseline. **Notably, our 0.6B student achieves 26.13% on average@32 at AIME 2025.**\n\n## Framework\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002Fframework.png\" width=\"100%\">\n\u003C\u002Fp>\n\n## 📊 Main Results\n\nPerformance comparison of the Qwen3 series on 4 benchmarks. 
We report average@32.\n\n| Params | Method | AIME 2024 | AIME 2025 | GPQA | LiveCodeBench | Average |\n|--------|--------|:---------:|:---------:|:----:|:-------------:|:-------:|\n| **0.6B** | Vanilla | 7.71 | 12.81 | 13.24 | 14.89 | 12.16 |\n| | SFT | 5.67 | 5.42 | 15.20 | 9.61 | 8.97 |\n| | GRPO | 4.06 | 4.90 | 20.38 | 15.95 | 11.32 |\n| | OPD | 16.82 | 22.95 | 17.76 | 22.65 | 20.04 |\n| | OPSD_gt | 12.63 | 17.04 | 17.32 | 16.73 | 15.93 |\n| | OPSD_hint | 9.77 | 14.12 | 15.98 | 12.65 | 13.13 |\n| | **SOD** | **20.84** | **26.13** | **22.19** | **27.72** | **24.22** |\n| **1.7B** | Vanilla | 9.90 | 8.96 | 26.80 | 22.73 | 17.10 |\n| | SFT | 26.77 | 22.40 | 29.85 | 24.63 | 25.91 |\n| | GRPO | 25.63 | 21.67 | 33.55 | 20.70 | 25.39 |\n| | OPD | 43.86 | 37.04 | 31.73 | 32.45 | 36.27 |\n| | OPSD_gt | 33.85 | 24.69 | 35.02 | 22.73 | 29.07 |\n| | OPSD_hint | 34.42 | 21.43 | 33.46 | 23.12 | 28.11 |\n| | **SOD** | **50.83** | **41.72** | **38.72** | **40.63** | **42.98** |\n\n## 🚀 Get Started\n### Environment Setup\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FYoungZ365\u002FSOD.git\nconda create -n SOD python=3.11\nconda activate SOD\ncd SOD\nbash scripts\u002Finstall_vllm_sglang_mcore.sh\npip install -e .[vllm]\n```\n\n### Data Preparation\n\nDownload the following datasets:\n\n| Dataset | Link | Usage |\n|---------|------|-------|\n| 3K Agentic SFT Data | [🤗 HuggingFace](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FGen-Verse\u002FOpen-AgentRL-SFT-3K) | Cold-start SFT |\n| 30K Agentic RL Data | [🤗 HuggingFace](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FGen-Verse\u002FOpen-AgentRL-30K) | RL \u002F Distillation Training |\n| Evaluation Benchmarks | [🤗 HuggingFace](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FGen-Verse\u002FOpen-AgentRL-Eval) | AIME2024\u002F2025, GPQA-Diamond, LiveCodeBench |\n\n### Sandbox Configuration\n\nConfigure [SandboxFusion](https:\u002F\u002Fgithub.com\u002Fbytedance\u002FSandboxFusion) for code execution:\n\n1. 
**Local Deployment**: Refer to [SandboxFusion deployment docs](https:\u002F\u002Fbytedance.github.io\u002FSandboxFusion\u002Fdocs\u002Fdocs\u002Fget-started#local-deployment)\n2. **Cloud Service**: Use [Volcano Engine Code Sandbox](https:\u002F\u002Fwww.volcengine.com\u002Fdocs\u002F6662\u002F1539235)\n\nAfter obtaining an API endpoint, configure it in:\n- `recipe\u002Fdemystify\u002Fsandbox_fusion_tool_config.yaml`\n- The function `check_correctness` in `verl\u002Futils\u002Freward_score\u002Flivecodebench\u002Fcode_math.py`\n\n## 🔧 Training\n\n### Step 1: Cold-Start SFT\n\nConfigure `examples\u002FSOD\u002Frun_sft.sh` with your paths:\n\n- `MODEL_PATH`: Base model path (e.g., [Qwen3-1.7B](https:\u002F\u002Fhuggingface.co\u002FQwen\u002FQwen3-1.7B) or [Qwen3-0.6B](https:\u002F\u002Fhuggingface.co\u002FQwen\u002FQwen3-0.6B))\n- `TRAIN_DATA`: Path to the SFT `.parquet` file\n- `SAVE_PATH`: Directory to save SFT checkpoints\n\n```bash\nbash examples\u002FSOD\u002Frun_sft.sh\n```\n\nAfter SFT, merge the model checkpoint:\n\n```bash\npython3 -m verl.model_merger merge --backend fsdp \\\n    --local_dir \u003Ccheckpoint_dir>\u002Fglobal_step_xxx \\\n    --target_dir \u003Ccheckpoint_dir>\u002Fglobal_step_xxx\u002Fhuggingface\n```\n\n### Step 2: SOD Training (Step-wise On-policy Distillation)\n\nConfigure `examples\u002FSOD\u002Frun_sod.sh` with your paths:\n\n- `MODEL_PATH`: Path to the SFT student model\n- `TEACHER_MODEL_PATH`: Path to the teacher model (e.g., a GRPO-trained 4B model)\n- `TRAIN_DATA`: Path to the RL `.parquet` file (30K dataset)\n- Evaluation data paths for AIME2024\u002F2025\n\n```bash\nbash examples\u002FSOD\u002Frun_sod.sh\n```\n\n**Training Resources**: 8× NVIDIA H20 96GB GPUs, batch size 64.\n\nYou can monitor training dynamics and evaluation results via Weights & Biases (wandb).\n\n## 📊 Evaluation\n\nWe support evaluation on **AIME 2024\u002F2025**, **GPQA-Diamond**, and **LiveCodeBench-v6**.\n\nTaking AIME as an example:\n\n```bash\nbash 
examples\u002FSOD\u002Feval\u002Frun_eval_aime.sh\n```\n\nYou can view the average@32 \u002F pass@32 \u002F maj@32 metrics in your wandb project.\n\n## 🤗 Model Zoo\n\nIf needed, you can download our distilled models directly:\n\n| Model | Link | Description |\n|-------|------|-------------|\n| SOD-0.6B | [🤗 HuggingFace](https:\u002F\u002Fhuggingface.co\u002Fyoungzhong\u002FSOD-0.6B) | SOD distilled from Qwen3-4B teacher |\n| SOD-1.7B | [🤗 HuggingFace](https:\u002F\u002Fhuggingface.co\u002Fyoungzhong\u002FSOD-1.7B) | SOD distilled from Qwen3-4B teacher |\n| SOD-GRPO_teacher-4B | [🤗 HuggingFace](https:\u002F\u002Fhuggingface.co\u002Fyoungzhong\u002FSOD-GRPO_teacher-4B) | GRPO-trained Qwen3-4B teacher model |\n\nAll models are also available in our [HuggingFace Collection](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Fyoungzhong\u002Fsod).\n\n## 📝 Citation\n\n```bibtex\n@article{zhong2026sod,\n      title={SOD: Step-wise On-policy Distillation for Small Language Model Agents}, \n      author={Qiyong Zhong and Mao Zheng and Mingyang Song and Xin Lin and Jie Sun and Houcheng Jiang and Xiang Wang and Junfeng Fang},\n      journal={arXiv preprint arXiv:2605.07725},\n      year={2026}\n}\n```\n\n## 🙏 Acknowledgements\n\nOur implementation builds upon the excellent codebases of [VeRL](https:\u002F\u002Fgithub.com\u002Fvolcengine\u002Fverl), [Open-AgentRL](https:\u002F\u002Fgithub.com\u002FGen-Verse\u002FOpen-AgentRL), and [ReTool](https:\u002F\u002Fgithub.com\u002FReTool-RL\u002FReTool). We sincerely thank these projects for their valuable contributions to the community.\n","SOD is a PyTorch-based open-source project that aims to improve the performance of small language model agents through Step-wise On-policy Distillation. Its core feature is an adaptive step-level weighting mechanism that suppresses the distillation loss when the student model has drifted far from the teacher model, restores full supervision when the student recovers alignment, and maintains dense token-level guidance on well-aligned steps. This adds almost no extra computational cost. SOD is particularly suited to small language models that need high-precision reasoning on complex tasks in mathematics, science, and programming. Experiments show that SOD delivers significant performance gains over other methods on multiple benchmarks.",2,"2026-06-11 03:54:41","CREATED_QUERY"]