[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-1356":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":10,"archived":22,"fork":22,"defaultBranch":23,"hasWiki":24,"hasPages":22,"topics":25,"createdAt":10,"pushedAt":10,"updatedAt":26,"readmeContent":27,"aiSummary":28,"trendingCount":16,"starSnapshotCount":16,"syncStatus":15,"lastSyncTime":29,"discoverSource":30},1356,"OPD","thunlp\u002FOPD","thunlp","Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe","",null,"Python",637,35,238,2,0,38,103,347,114,8.67,false,"main",true,[],"2026-06-12 02:00:26","﻿\u003Cdiv align=\"center\">\n\n# Rethinking On-Policy Distillation of Large Language Models:\u003Cbr>Phenomenology, Mechanism, and Recipe\n\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2604.13016)  [![Github](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FOPD-000000?style=for-the-badge&logo=github&logoColor=white)](https:\u002F\u002Fgithub.com\u002Fthunlp\u002FOPD)  [![HF Papers](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FHF--Paper-%23FFD14D?style=for-the-badge&logo=huggingface&logoColor=black)](https:\u002F\u002Fhuggingface.co\u002Fpapers\u002F2604.13016)  [![Twitter](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FTwitter-%23000000.svg?style=for-the-badge&logo=x&logoColor=white)](https:\u002F\u002Fx.com\u002FHBX_hbx\u002Fstatus\u002F2044464414829777354)\n\n\n\n\u003C\u002Fdiv>\n\n\u003Cdiv align=\"center\" style=\"font-family: Arial, sans-serif;\">\n  \u003Cp>\n    \u003Ca href=\"#news\" 
style=\"text-decoration: none; font-weight: bold;\">🎉 News\u003C\u002Fa> •\n    \u003Ca href=\"#overview\" style=\"text-decoration: none; font-weight: bold;\">📖 Overview\u003C\u002Fa> •\n    \u003Ca href=\"#getting-started\" style=\"text-decoration: none; font-weight: bold;\">✨ Getting Started\u003C\u002Fa>\n  \u003C\u002Fp>\n  \u003Cp>\n    \u003Ca href=\"#contact\" style=\"text-decoration: none; font-weight: bold;\">📨 Contact\u003C\u002Fa> •\n    \u003Ca href=\"#citation\" style=\"text-decoration: none; font-weight: bold;\">🎈 Citation\u003C\u002Fa> •\n    \u003Ca href=\"#star-history\" style=\"text-decoration: none; font-weight: bold;\">⭐ Star History\u003C\u002Fa>\n  \u003C\u002Fp>\n\u003C\u002Fdiv>\n\n---\n\n## 🎉News\n\n- **[2026-04-15]** We investigate the dynamics and mechanisms of on-policy distillation (OPD) of LLMs, and propose practical strategies to recover failing OPD. Check it out: [Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2604.13016).\n\n## 📖Overview\n\n![1776212644959](figs\u002Fopd_teaser.png)\n\nOn-policy distillation (OPD) has become a core technique in the post-training of large language models, yet its training dynamics remain poorly understood.\nThis paper provides a systematic investigation of OPD dynamics and mechanisms.\nWe first identify that two conditions govern whether OPD succeeds or fails: (i) the student and teacher should share compatible thinking patterns; and (ii) even with consistent thinking patterns and higher scores, the teacher must offer genuinely new capabilities beyond what the student has seen during training.\nWe validate these findings through weak-to-strong reverse distillation, showing that same-family 1.5B and 7B teachers are distributionally indistinguishable from the student’s perspective.\nProbing into the token-level mechanism, we show that successful OPD is characterized by progressive alignment on high-probability tokens at student-visited states, a small shared token set that concentrates most of the 
probability mass (97\\%--99\\%).\nWe further propose two practical strategies to recover failing OPD: off-policy cold start and teacher-aligned prompt selection.\nFinally, we show that OPD's apparent free lunch of dense token-level reward comes at a cost, raising the question of whether OPD can scale to long-horizon distillation.\n\n## ✨Getting Started\n\n### Environment Setup\n\nOur code is mainly based on [verl](https:\u002F\u002Fgithub.com\u002Fverl-project\u002Fverl) (v0.7.0). To prepare the environment used for OPD and RL:\n\n```bash\nconda create -n verl python==3.12\nconda activate verl\ncd verl\u002F\nUSE_MEGATRON=0 bash scripts\u002Finstall_vllm_sglang_mcore.sh\npip install math-verify\n```\n\nWe use [LlamaFactory](https:\u002F\u002Fgithub.com\u002Fhiyouga\u002FLLaMA-Factory) (v0.9.5) for SFT training. To prepare the environment for SFT:\n\n```bash\nconda create -n sft python==3.11\n# Activate the new environment so the packages below install into it\nconda activate sft\ncd LlamaFactory\u002F\npip install -e .\npip install -r requirements\u002Fmetrics.txt\n```\n\n### Training\n\n#### OPD\n\nUse the following command to start on-policy distillation:\n\n```bash\nbash on_policy_distillation.sh\n```\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>Key Parameters\u003C\u002Fb>\u003C\u002Fsummary>\n\n| Parameter | Default | Description |\n|-----------|---------|-------------|\n| **Distillation Method** |||\n| `ADV_ESTIMATOR` | `token_reward_direct` | Must keep this value when running OPD |\n| `ACTOR_MODEL_PATH` | — | Path to the student (policy) model to be trained |\n| `REWARD_MODEL_PATH` | — | Path to the teacher model that provides token-level reward signals |\n| **Generation Control** |||\n| `N_RESPONSES` | `4` | Number of rollout responses generated per prompt |\n| `MAX_PROMPT_LENGTH` | `1024` | Maximum token length for prompts |\n| `MAX_RESP_LENGTH` | `7168` | Maximum token length for responses during training |\n| `MAX_VAL_RESP_LENGTH` | `31744` | Maximum token length for responses during validation (set larger to ensure complete generation) |\n| 
**Top-K & Weighting Strategy** |||\n| `LOG_PROB_TOP_K` | `16` | Number of Top-K tokens retained when computing token-level rewards; setting it to `0` falls back to sampled-token OPD |\n| `TOP_K_STRATEGY` | `only_stu` | Strategy for selecting the Top-K token set. Options: `only_stu` (select Top-K from the student, then query the teacher for corresponding log-probs), `only_tch` (select Top-K from the teacher), `intersection` (keep tokens appearing in both student and teacher Top-K), `union` (merge student and teacher Top-K), `union-intersection` (tokens in either Top-K but not both, i.e. symmetric difference) |\n| `REWARD_WEIGHT_MODE` | `student_p` | Weighting scheme for token rewards. `student_p`: weighted by student probability; `teacher_p`: weighted by teacher probability; `none`: no weighting |\n\n\u003C\u002Fdetails>\n\n> [!NOTE]\n> You can use `scripts\u002Finfer\u002Fdedup_deepmath.py` to deduplicate DeepMath against DAPO-Math-17K and avoid data overlap, as in the experiments of Section 5.2 of our paper.\n\n#### SFT\n\nUse `scripts\u002Finfer\u002Fvllm_rollout.py` to roll out teacher responses that will later be used for student SFT.\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>Key Parameters\u003C\u002Fb>\u003C\u002Fsummary>\n\n| Parameter | Default | Description |\n|-----------|---------|-------------|\n| `--input-parquet` | required | Path to the parquet file that provides prompts for teacher rollout |\n| `--model-path` | required | Path to the teacher model checkpoint used to generate responses |\n| `--gpu-ids` | `0,1,2,3,4,5,6,7` | Comma-separated GPU IDs used for multiprocessing rollout |\n| `--enable-thinking` | `false` | Whether to enable the model's thinking template when formatting prompts |\n| `--enable-rejection-sampling` | `true` | Whether to reject invalid outputs and retry generation |\n| `--max-attempts-per-rollout` | `3` | Maximum number of retries for each rollout slot when rejection sampling is enabled |\n\n\u003C\u002Fdetails>\n\nBelow is an 
example command for generating teacher responses with `Qwen3-4B (Non-thinking)`:\n\n```bash\npython scripts\u002Finfer\u002Fvllm_rollout.py \\\n  --input-parquet datasets\u002FOpenThoughts3-1.2M-math.parquet \\\n  --model-path model\u002FQwen3-4B \\\n  --gpu-ids 0,1,2,3,4,5,6,7 \\\n  --enable-thinking false \\\n  --enable-rejection-sampling true \\\n  --max-attempts-per-rollout 3\n```\n\nAfter the rollout finishes, use the generated teacher responses for student SFT. An example SFT training command is:\n\n```bash\nllamafactory-cli train LlamaFactory\u002Fexamples\u002Ftrain_full\u002Fqwen3_base_full_sft.yaml\n```\n\nWe release the resulting SFT checkpoint [Qwen3-1.7B-SFT](https:\u002F\u002Fhuggingface.co\u002Flllyx\u002FQwen3-1.7B-SFT), which is obtained by supervised fine-tuning from `Qwen3-1.7B-Base`.\n\n#### RL (GRPO)\n\nWe use GRPO as the RL algorithm. To enable RL, set `ADV_ESTIMATOR=grpo` and `LOG_PROB_TOP_K=0`. A reference script `grpo.sh` is provided.\n\nWe release the resulting RL checkpoint [Qwen3-4B-Base-GRPO](https:\u002F\u002Fhuggingface.co\u002Flllyx\u002FQwen3-4B-Base-GRPO), which is obtained by zero RL from `Qwen3-4B-Base`.\n\n> [!IMPORTANT]\n> **Non-thinking Models:** When training a non-thinking model (e.g., `Qwen3-1.7B (Non-thinking)`) using OPD or RL, you must add `+data.apply_chat_template_kwargs.enable_thinking=False` to the training script.\n\n### Validation\n\nWe reuse the evaluation pipeline from [JustRL](https:\u002F\u002Fgithub.com\u002Fthunlp\u002FJustRL).\n\n**Generation (Optional)**\n\n```bash\ncd scripts\u002Fval\u002Feval\npython gen_vllm.py\n```\n\nBefore running generation, set `MODEL_NAMES` in `gen_vllm.py` to the checkpoint(s) you want to evaluate, and set `available_workers` to an appropriate value.\n\n**Grading**\n\n```bash\ncd scripts\u002Fval\u002Feval\npython grade.py\n```\n\nThe grading script processes all JSONL files in the output directory and generates `grading_results.json`. 
If needed, you can enable the LLM-based verifier with:\n\n```bash\npython grade.py --enable_model_verifier\n```\n\n*All experiments were conducted on 8 x NVIDIA A800 80GB GPUs.*\n\n## 📨Contact\n\n- Bingxiang He: hebx24@mails.tsinghua.edu.cn\n- Ning Ding: dingning@mail.tsinghua.edu.cn\n\n## 🎈Citation\n\nIf you find this work helpful, please cite us:\n\n```bibtex\n@article{li2026rethinking,\n  title={Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe},\n  author={Li, Yaxuan and Zuo, Yuxin and He, Bingxiang and Zhang, Jinqian and Xiao, Chaojun and Qian, Cheng and Yu, Tianyu and Gao, Huan-ang and Yang, Wenkai and Liu, Zhiyuan and others},\n  journal={arXiv preprint arXiv:2604.13016},\n  year={2026}\n}\n```\n\n## ⭐Star History\n\n\u003Ca href=\"https:\u002F\u002Fwww.star-history.com\u002F#thunlp\u002FOPD&Date\">\n  \u003Cpicture>\n    \u003Csource media=\"(prefers-color-scheme: dark)\" srcset=\"https:\u002F\u002Fapi.star-history.com\u002Fchart?repos=thunlp\u002FOPD&type=date&theme=dark\" \u002F>\n    \u003Csource media=\"(prefers-color-scheme: light)\" srcset=\"https:\u002F\u002Fapi.star-history.com\u002Fchart?repos=thunlp\u002FOPD&type=date\" \u002F>\n    \u003Cimg alt=\"Star History Chart\" src=\"https:\u002F\u002Fapi.star-history.com\u002Fchart?repos=thunlp\u002FOPD&type=date\" \u002F>\n  \u003C\u002Fpicture>\n\u003C\u002Fa>\n","This project rethinks on-policy distillation (OPD) of large language models, examining its phenomenology and mechanism and proposing a practical recipe. Its core contributions are a systematic analysis of OPD training dynamics and mechanisms, the identification of the key conditions that determine whether OPD succeeds or fails, and two practical strategies for recovering failing OPD: off-policy cold start and teacher-aligned prompt selection. It is implemented in Python and suits scenarios that need to optimize or improve the performance of large language models, particularly model compression and knowledge transfer.","2026-06-11 02:43:15","CREATED_QUERY"]