[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-2004":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":10,"languages":10,"totalLinesOfCode":10,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":16,"stars7d":17,"stars30d":18,"stars90d":15,"forks30d":15,"starsTrendScore":19,"compositeScore":20,"rankGlobal":10,"rankLanguage":10,"license":10,"archived":21,"fork":21,"defaultBranch":22,"hasWiki":23,"hasPages":21,"topics":24,"createdAt":10,"pushedAt":10,"updatedAt":29,"readmeContent":30,"aiSummary":31,"trendingCount":15,"starSnapshotCount":15,"syncStatus":32,"lastSyncTime":33,"discoverSource":34},2004,"AwesomeOPD","thinkwee\u002FAwesomeOPD","thinkwee","Awesome List for On-Policy Distillation","",null,609,11,7,1,0,26,76,386,78,7.24,false,"main",true,[25,26,27,28],"awesome-list","distillation","large-language-model","on-policy-distillation","2026-06-12 02:00:35","\u003Cdiv align=\"center\">\n  \u003Cimg src=\"banner.png\" alt=\"banner Logo\" width=\"800\">\n\u003C\u002Fdiv>\n\n\n\u003Cdiv 
align=\"center\">\n\n[![oosmetrics](https:\u002F\u002Fapi.oosmetrics.com\u002Fapi\u002Fv1\u002Fbadge\u002Fachievement\u002Fd1bd4be6-a545-4fb5-8c5a-0069d4c0b0d8.svg)](https:\u002F\u002Foosmetrics.com\u002Frepo\u002Fthinkwee\u002FAwesomeOPD)\n\n![Surveys](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FSurveys_&_Position-7-4E6813?style=for-the-badge)\n![White-Box](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FWhite--Box_OPD-17-BFA2DB?style=for-the-badge)\n![Black-Box](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FBlack--Box_OPD-3-845C40?style=for-the-badge)\n\u003Cbr>\n![OPSD](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FOPSD-15-A259FF?style=for-the-badge)\n![Iterative](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FIterative_Self--Bootstrapping-2-50C878?style=for-the-badge)\n![OPD-RL](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FOPD--RL_Hybrids-16-9B59B6?style=for-the-badge)\n\u003Cbr>\n![Reasoning](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FReasoning_OPD-3-FF69B4?style=for-the-badge)\n![Multimodal](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FMultimodal_OPD-6-2ECC71?style=for-the-badge)\n![Agent](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FAgent_&_Embodied-5-1F4CAD?style=for-the-badge)\n\u003Cbr>\n![SpecDec](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FSpeculative_Decoding-11-D89F7B?style=for-the-badge)\n![Frameworks](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FFrameworks-11-FA5A4C?style=for-the-badge)\n![Industrial](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProduction_Reports-8-ffc884?style=for-the-badge)\n\n\u003C\u002Fdiv>\n\n# When LLMs Distill On-Policy\n\n**AwesomeOPD** is an awesome list summarising **open-source repositories and papers** for training LLMs (and VLMs \u002F agents \u002F draft models) with **On-Policy Distillation (OPD)** and **On-Policy Self-Distillation (OPSD)**:\n - 🎯 **OPD = C1 + C2.** `C1`: student samples its own trajectories `y ~ π_student(·|x)` during training. 
`C2`: teacher provides per-token \u002F sequence supervision on those student samples. Methods that only partially satisfy these two conditions are flagged in the **📝 Strictness notes** of each section.\n - 🪞 **OPSD** = special case where teacher *is the same model*, conditioned on privileged context (verified trace \u002F answer \u002F \"be concise\" prefix \u002F longer context) or an earlier checkpoint.\n - 🚀 Each entry is annotated along four design axes — **teacher source** (external · same model with privileged context · earlier checkpoint · multi-teacher · discriminator), **supervision signal** (logits \u002F top-k \u002F sequence reward \u002F verbal score \u002F discriminator \u002F verifier \u002F feature), **rollout consumption** (all \u002F selected \u002F truncated \u002F replaced \u002F as PG samples), and **pipeline slot** (cold-start \u002F mid \u002F RL-replacement \u002F inside-RL \u002F inter-stage \u002F compression \u002F continual-anchor).\n - ⚠️ Built by reading paper PDFs, project pages, and source code with LLM coding agents; manually reviewed, but errors are possible. 
PRs welcome.\n - 📌 If you find this repository helpful for your research, please cite it via the **\"Cite this repository\"** button in the right sidebar of the GitHub page.\n - 📅 Last updated: 2026-05-18\n\nTaxonomy:\n - **📚 Surveys, Foundations & Position Papers** — meta-references and seed papers (GKD, MiniLLM, Thinking Machines blog, Tencent \u002F THUNLP surveys)\n - **🔬 White-Box** — logit-based OPD on student rollouts with an external teacher\n - **🎭 Black-Box** — discriminator \u002F verbal \u002F preference, no teacher logits\n - **♻️ OPSD** — privileged-context self-distillation (same model, different conditioning)\n - **🔁 Iterative Self-Bootstrapping** — same model as previous-checkpoint teacher\n - **🤝 OPD-RL Hybrids** — inside-RL OPD: KL-as-reward, RL+OPD fusion\n - **🧠 Reasoning \u002F 🖼️ Multimodal \u002F 🤖 Agent & Embodied** — by application; cuts across all teacher-source categories\n - **⚡ Speculative-Decoding Distillation** — drafter distillation; \"student\" is a draft model\n - **🛠️ Frameworks & Toolkits** — what to actually run\n - **🏭 Industrial \u002F Production Reports** — what the labs ship\n\nShorthand: **FKL** = forward KL · **RKL** = reverse KL · **JSD** = Jensen–Shannon · **Skew-KL** \u002F **AKL** = skewed \u002F adaptive KL · `📄 paper-only` = no public code yet.\n\n## Updates\n\n\u003Cdetails>\n\u003Csummary>📢 click to expand\u003C\u002Fsummary>\n\n- **2026-05-18** — add COPSD, MSD\n- **2026-05-15** — add TCOD, Healthcare AI GYM, HyperEyes (and cross-list Skill-SD into Agent)\n- **2026-05-14** — add CORD\n- **2026-05-13** — add Uni-OPD; add HY-MT, Baichuan-M3, KAT-Coder-V2, HY-Embodied, Qwen3.5-Omni\n- **2026-05-11** — add Skill-SD\n- **2026-04-30** — add π-Play\n- **2026-04-29** — add SD-Zero, *Why Does Self-Distillation (Sometimes) Degrade Reasoning?*\n- **2026-04-28** — initial release; add NPO, VLA-OPD, KDFlow, HPD, DeepSeek-V4\n\n\u003C\u002Fdetails>\n\n---\n\n## 📚 Surveys, Foundations & Position Papers\n\n| Resource | 🌟 Stars | 
Date | Org | Paper \u002F Link | Title \u002F Notes |\n| :----: | :----: | :----: |  :----: | :----: | :---- |\n| [GKD](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.13649) | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F📄-paper-845C40?style=for-the-badge)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.13649) | 2023.06 | Google DeepMind (Agarwal et al.) | [arXiv 2306.13649](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.13649) — implemented in [TRL `GKDTrainer`](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fblob\u002Fmain\u002Ftrl\u002Fexperimental\u002Fgkd\u002Fgkd_trainer.py) | **GKD: On-Policy Distillation of Language Models — Learning from Self-Generated Mistakes** (Seminal · ICLR 2024) |\n| [Blog](https:\u002F\u002Fthinkingmachines.ai\u002Fblog\u002Fon-policy-distillation\u002F) | [![Blog](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fblog_post-3.2k_cookbook-blue?style=for-the-badge)](https:\u002F\u002Fthinkingmachines.ai\u002Fblog\u002Fon-policy-distillation\u002F) | 2025.10 | Thinking Machines Lab (Kevin Lu et al.) | [Blog](https:\u002F\u002Fthinkingmachines.ai\u002Fblog\u002Fon-policy-distillation\u002F) · [tinker-cookbook](https:\u002F\u002Fgithub.com\u002Fthinking-machines-lab\u002Ftinker-cookbook) | **Thinking Machines Lab — On-Policy Distillation (blog)** |\n| [tinker-cookbook](https:\u002F\u002Fgithub.com\u002Fthinking-machines-lab\u002Ftinker-cookbook) | \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fthinking-machines-lab\u002Ftinker-cookbook?style=for-the-badge&logo=github&logoColor=white&labelColor=181717&color=ffd700\" alt=\"Stars\"> | 2025.10 | Thinking Machines Lab | — | Reference impl. 
of the OPD recipe on the Tinker SDK |\n| [revisiting_opd](https:\u002F\u002Fgithub.com\u002Fhhh675597\u002Frevisiting_opd) | \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fhhh675597\u002Frevisiting_opd?style=for-the-badge&logo=github&logoColor=white&labelColor=181717&color=ffd700\" alt=\"Stars\"> | 2026.03 | CASIA (Fu et al.) | [arXiv 2603.25562](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.25562) | Revisiting OPD: Failure Modes & Simple Fixes |\n| [Tencent OPD Survey](https:\u002F\u002Farxiv.org\u002Fabs\u002F2604.00626) | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F📄-paper-845C40?style=for-the-badge)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2604.00626) | 2026.04 | Tencent (Mingyang Song & Mao Zheng) | [arXiv 2604.00626](https:\u002F\u002Farxiv.org\u002Fabs\u002F2604.00626) | **A Survey of On-Policy Distillation for LLMs** |\n| [OPD](https:\u002F\u002Fgithub.com\u002Fthunlp\u002FOPD) | \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fthunlp\u002FOPD?style=for-the-badge&logo=github&logoColor=white&labelColor=181717&color=ffd700\" alt=\"Stars\"> | 2026.04 | Tsinghua THUNLP | [arXiv 2604.13016](https:\u002F\u002Farxiv.org\u002Fabs\u002F2604.13016) | **Rethinking On-Policy Distillation: Phenomenology, Mechanism & Recipe** |\n| [Lightning OPD](https:\u002F\u002Farxiv.org\u002Fabs\u002F2604.13010) | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F📄-paper-845C40?style=for-the-badge)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2604.13010) | 2026.04 | Wu, Han, Cai | [arXiv 2604.13010](https:\u002F\u002Farxiv.org\u002Fabs\u002F2604.13010) | **Lightning OPD: Efficient Post-Training with Offline OPD** |\n\n\u003Cdetails>\n\u003Csummary>📋 Click to view technical details\u003C\u002Fsummary>\n\n| Resource | Loss \u002F Divergence | Data | Teacher Access | Granularity | Notes |\n| :----: | :----: | :----: | :----: | :----: | :---- |\n| GKD (Agarwal) | Generalised JSD (FKL\u002FRKL configurable) 
| Mixed (`λ` interpolates teacher↔student) | White-box | Token | The seminal paper that named OPD; introduced student-self-rollout supervision. |\n| Thinking Machines blog | Reverse KL (student‖teacher) | Student rollouts | White-box | Token | \"Swap KL ref model for stronger teacher\" recipe; one-line addition to RL trainer. Replicates Qwen3 result at ~1\u002F10 RL cost. |\n| Revisiting OPD | Truncated reverse KL + top-p sampling + special-token masking | Student | White-box | Token (filtered) | Diagnoses 3 failure modes: imbalanced one-token signal, unreliable prefix guidance, tokenizer mismatch. |\n| Tencent OPD Survey | (survey) | (survey) | (survey) | (survey) | Catalogues 50+ methods; useful as a reference index. |\n| THUNLP Rethinking OPD | Reverse KL with progressive top-K alignment | Student | White-box | Token | Identifies two success conditions: compatible thinking patterns + genuinely new teacher capability. Recipe = **off-policy cold-start + teacher-aligned prompt selection**. |\n| Lightning OPD | Cached teacher log-probs over SFT rollouts (offline OPD) | Student (cached) | White-box | Token | Introduces \"teacher consistency\": the same teacher must be used for SFT and OPD, otherwise the gradient is biased. Eliminates the live teacher server. |\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>📝 \u003Cb>Strictness notes\u003C\u002Fb> (against the strict OPD definition \u003Ccode>C1: student samples its own trajectories during training\u003C\u002Fcode> + \u003Ccode>C2: teacher provides supervision on those samples\u003C\u002Fcode>)\u003C\u002Fsummary>\n\n- **Lightning OPD** — ⚠️ partially satisfies C1: teacher log-probs are pre-computed *once* over SFT rollouts and reused during training; the student doesn't actively sample during the OPD step. The authors explicitly call this \"offline OPD\". 
Listed in OPD because the data is past-student-generated rollouts, not teacher-generated.\n\n\u003C\u002Fdetails>\n\n---\n\n## 🔬 OPD with Larger External Teachers — White-Box\n\nWhite-box methods use **teacher logits \u002F log-probabilities** to supervise the student on **student-generated rollouts**. Each entry below has been verified to (a) train on student rollouts and (b) operate at the token level.\n\nMethods that turned out to be RL-style on verification have been moved to [OPD-RL Hybrids](#-opd-rl-hybrids); off-policy \u002F pure-loss-function \u002F pretraining-side methods are excluded from this list.\n\n| Resource | 🌟 Stars | Date | Org | Paper Link | Title \u002F Notes |\n| :----: | :----: | :----: |  :----: | :----: | :---- |\n| [LMOps `\u002Fminillm`](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FLMOps\u002Ftree\u002Fmain\u002Fminillm) | \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FLMOps?style=for-the-badge&logo=github&logoColor=white&labelColor=181717&color=ffd700\" alt=\"Stars\"> | 2023.06 | Microsoft \u002F Tsinghua | [arXiv 2306.08543](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.08543) | MiniLLM (ICLR 2024) |\n| [distillm](https:\u002F\u002Fgithub.com\u002Fjongwooko\u002Fdistillm) | \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fjongwooko\u002Fdistillm?style=for-the-badge&logo=github&logoColor=white&labelColor=181717&color=ffd700\" alt=\"Stars\"> | 2024.02 | KAIST \u002F Microsoft | [arXiv 2402.03898](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.03898) | DistiLLM (ICML 2024) |\n| [google-research `\u002Fspeculative_kd`](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Fgoogle-research\u002Ftree\u002Fmaster\u002Fspeculative_kd) | \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fgoogle-research\u002Fgoogle-research?style=for-the-badge&logo=github&logoColor=white&labelColor=181717&color=ffd700\" alt=\"Stars\"> | 2024.10 | UCSB \u002F 
Google | [arXiv 2410.11325](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.11325) | Speculative KD (ICLR 2025) |\n| [distillm-2](https:\u002F\u002Fgithub.com\u002Fjongwooko\u002Fdistillm-2) | \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fjongwooko\u002Fdistillm-2?style=for-the-badge&logo=github&logoColor=white&labelColor=181717&color=ffd700\" alt=\"Stars\"> | 2025.03 | KAIST \u002F Microsoft | [arXiv 2503.07067](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.07067) | DistiLLM-2 (ICML 2025 Oral) |\n| [DSKDv2](https:\u002F\u002Fgithub.com\u002Fsongmzhang\u002FDSKDv2) | \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fsongmzhang\u002FDSKDv2?style=for-the-badge&logo=github&logoColor=white&labelColor=181717&color=ffd700\" alt=\"Stars\"> | 2025.04 | BJTU | [arXiv 2504.11426](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.11426) | DSKDv2 — cross-tokenizer; supports on-policy mode |\n| [Constrained OPD](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.22921) | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F📄-paper-845C40?style=for-the-badge)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.22921) | 2025.09 | Huawei Noah's Ark | [arXiv 2509.22921](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.22921) | Constrained OPD (CMDP) |\n| [AdaSwitch](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.07842) | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F📄-paper-845C40?style=for-the-badge)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.07842) | 2025.10 | RUC \u002F Baidu | [arXiv 2510.07842](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.07842) | AdaSwitch (on-\u002Foff-policy switching) |\n| [Veto](https:\u002F\u002Farxiv.org\u002Fabs\u002F2601.07155) | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F📄-paper-845C40?style=for-the-badge)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2601.07155) | 2026.01 | SNU | [arXiv 2601.07155](https:\u002F\u002Farxiv.org\u002Fabs\u002F2601.07155) | Veto (Stable 
OPD) — ACL 2026 Findings |\n| [G-OPD](https:\u002F\u002Fgithub.com\u002FRUCBM\u002FG-OPD) | \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FRUCBM\u002FG-OPD?style=for-the-badge&logo=github&logoColor=white&labelColor=181717&color=ffd700\" alt=\"Stars\"> | 2026.02 | RUC \u002F Tencent | [arXiv 2602.12125](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.12125) | G-OPD |\n| [Fast OPD](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.15260) | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F📄-paper-845C40?style=for-the-badge)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.15260) | 2026.02 | Industrial | [arXiv 2602.15260](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.15260) | Fast OPD (prefix-truncated) |\n| [Entropy-Aware OPD](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.07079) | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F📄-paper-845C40?style=for-the-badge)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.07079) | 2026.03 | KAIST \u002F IBM | [arXiv 2603.07079](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.07079) | Entropy-Aware OPD |\n| [REOPOLD](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.11137) | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F📄-paper-845C40?style=for-the-badge)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.11137) | 2026.03 | KAIST \u002F Microsoft | [arXiv 2603.11137](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.11137) | REOPOLD (Relaxed OPD) — code soon |\n| [OPSD_OnPolicyDistillation](https:\u002F\u002Fgithub.com\u002FHJSang\u002FOPSD_OnPolicyDistillation) | \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FHJSang\u002FOPSD_OnPolicyDistillation?style=for-the-badge&logo=github&logoColor=white&labelColor=181717&color=ffd700\" alt=\"Stars\"> | 2026.03 | LinkedIn | [arXiv 2603.11178](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.11178) | PACED — frontier curriculum self-distill |\n| [TSD-KD](https:\u002F\u002Fgithub.com\u002Fkmswin1\u002FTSD-KD) | 
\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fkmswin1\u002FTSD-KD?style=for-the-badge&logo=github&logoColor=white&labelColor=181717&color=ffd700\" alt=\"Stars\"> | 2026.03 | Korea Univ. | [arXiv 2603.13260](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.13260) | TSD-KD — token-selective dual KD (ICLR 2026) |\n| [SCOPE](https:\u002F\u002Fgithub.com\u002Fmachine981\u002FSCOPE) | \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmachine981\u002FSCOPE?style=for-the-badge&logo=github&logoColor=white&labelColor=181717&color=ffd700\" alt=\"Stars\"> | 2026.04 | USTC \u002F Meituan \u002F Fudan | [arXiv 2604.10688](https:\u002F\u002Farxiv.org\u002Fabs\u002F2604.10688) | SCOPE — signal-calibrated dual-path |\n| [OPSD_OnPolicyDistillation](https:\u002F\u002Fgithub.com\u002FHJSang\u002FOPSD_OnPolicyDistillation) | \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FHJSang\u002FOPSD_OnPolicyDistillation?style=for-the-badge&logo=github&logoColor=white&labelColor=181717&color=ffd700\" alt=\"Stars\"> | 2026.04 | Meta \u002F LinkedIn | [arXiv 2604.14084](https:\u002F\u002Farxiv.org\u002Fabs\u002F2604.14084) | TIP — Token Importance, shares LinkedIn OPSD repo with PACED |\n| [Hybrid-Policy-Distillation](https:\u002F\u002Fgithub.com\u002Fzwhong714\u002FHybrid-Policy-Distillation) | \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fzwhong714\u002FHybrid-Policy-Distillation?style=for-the-badge&logo=github&logoColor=white&labelColor=181717&color=ffd700\" alt=\"Stars\"> | 2026.04 | zwhong714 | [arXiv 2604.20244](https:\u002F\u002Farxiv.org\u002Fabs\u002F2604.20244) | HPD — Hybrid Policy Distillation; LlamaFactory + verl backends |\n\n\u003Cdetails>\n\u003Csummary>📋 Click to view technical details\u003C\u002Fsummary>\n\n| Method | Loss \u002F Divergence | Data | Granularity | Domain | Notes |\n| :----: | :----: | :----: | :----: | :----: | :---- |\n| MiniLLM | Reverse KL via 
policy gradient | Student | Sequence (PG) | General | The seminal \"OPD\" recipe by Yuxian Gu et al.; predates GKD by days. Mode-seeking. |\n| DistiLLM | Skewed-KL (mix of FKL\u002FRKL) | Mixed (adaptive off→on, with student samples) | Token | General | Skew parameter `α` interpolates between FKL and RKL; importance-reweighted student samples. |\n| Speculative KD (Xu) | Interleaved propose-and-correct (gated KL) | Student-proposed, teacher-corrected | Token | General | Bridges teacher-student gap via interleaved sampling. |\n| DistiLLM-2 | Contrastive: Skew-FKL on teacher data + Skew-RKL on student data | Mixed | Token | General | Asymmetric losses on each data source; ICML 2025 oral. |\n| DSKDv2 | KL in dual aligned space; explicit on-policy mode | Student | Token | Cross-tokenizer | Cross-vocabulary distillation; supports both on\u002Foff-policy. |\n| Constrained OPD | KL-constrained CMDP | Student | Token | General | Hard KL constraint instead of soft penalty. Borderline OPD-RL. |\n| AdaSwitch | Adaptive on\u002Foff-policy switching | Mixed | Token | General | Switches between teacher-data and student-rollout based on divergence threshold. |\n| Veto | Logit-space geometric bridge with adaptive gradient veto | Student | Token | General | Adaptive Target Reformulation. |\n| G-OPD \u002F ExOPD | Reverse KL + scaled reward extrapolation | Student | Token | General | Generalises OPD as KL-constrained RL; allows reward scale > 1 to \"exceed\" the teacher. |\n| Fast OPD | Prefix-truncated distillation reducing FLOPs | Student | Token (truncated) | Reasoning | 2× to 47× speedup via reasoning-prefix truncation. |\n| Entropy-Aware OPD | Switch between FKL and RKL based on teacher entropy | Student | Token | Reasoning | When teacher entropy high → FKL; low → RKL. |\n| REOPOLD | Mixture-based reward clipping + entropy-based dynamic sampling | Student | Token | Reasoning | \"Relaxed OPD\"; views OPD as policy optimisation with teacher-student log-ratio reward. 
|\n| PACED | Frontier curriculum at student competence boundary | Student | Token | General | Self-distill style (privileged-context \u002F earlier-checkpoint); difficulty weighting `w(p)=p(1−p)`. |\n| TSD-KD | Indirect (student-propose \u002F teacher re-rank) + direct selective logit KD | Mixed | Token (selected) | General | Hybrid; partial OPD + partial preference. |\n| SCOPE | Teacher-PPL-weighted KL on incorrect rollouts; student-PPL-weighted MLE on correct | Student | Token | Reasoning | Signal-Calibrated OPD with Dual-Path Adaptive Weighting; verifier-routing. |\n| TIP | Top-50% high-entropy student tokens carry the OPD signal | Student (selected) | Token (filtered) | Reasoning | ~47% memory savings; only high-entropy student tokens are trained. |\n| HPD | Reweighted log-likelihood unifying FKL + RKL | Mixed (off-policy + lightweight approximate on-policy sampling) | Token | General | Unifies KD as token-level reweighted likelihood; lightweight on-policy sampling preserves training efficiency. |\n\n\u003C\u002Fdetails>\n\n---\n\n## 🎭 OPD with Black-Box \u002F Outcome-Based Teachers\n\nWhen the teacher is **API-only** (no logits), OPD uses scalar rewards, verbal scores, preferences, or adversarial discriminators — all evaluated on **student rollouts**. 
Entries that turned out to use static teacher data only (Lion, SuperCorrect, DAIL, SODA) are excluded from this list.\n\n| Resource | 🌟 Stars | Date | Org | Paper Link | Title \u002F Notes |\n| :----: | :----: | :----: |  :----: | :----: | :---- |\n| [ORPO-Distill](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.25100) | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F📄-paper-845C40?style=for-the-badge)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.25100) | 2025.09 | Industrial | [arXiv 2509.25100](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.25100) | ORPO-Distill |\n| [LMOps `\u002Fgad`](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FLMOps) | \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FLMOps?style=for-the-badge&logo=github&logoColor=white&labelColor=181717&color=ffd700\" alt=\"Stars\"> | 2025.11 | Microsoft Research | [arXiv 2511.10643](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.10643) · [project](https:\u002F\u002Fytianzhu.github.io\u002FGenerative-Adversarial-Distillation\u002F) | GAD — Black-Box OPD |\n| [OVD](https:\u002F\u002Farxiv.org\u002Fabs\u002F2601.21968) | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F📄-paper-845C40?style=for-the-badge)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2601.21968) | 2026.01 | HKU \u002F Huawei | [arXiv 2601.21968](https:\u002F\u002Farxiv.org\u002Fabs\u002F2601.21968) | OVD (On-policy Verbal Distillation) — project page `OVD.github.io` 404s |\n\n\u003Cdetails>\n\u003Csummary>📋 Click to view technical details\u003C\u002Fsummary>\n\n| Method | Feedback Signal | Data | Granularity | Domain | Notes |\n| :----: | :----: | :----: | :----: | :----: | :---- |\n| ORPO-Distill | Student-Generated Outputs (SGO) + ORPO contrastive | Mixed (student-generated negatives, teacher positives) | Sequence | Cross-architecture | \"Mixed-policy strategy utilizing student-generated outputs\"; NeurIPS 2025 WS. 
|\n| GAD (Generative Adversarial Distillation) | Discriminator (on-policy reward model) | Student | Sequence | General | A trained discriminator distinguishes student outputs from teacher (e.g. GPT-5) responses; minimax game makes the discriminator co-evolve into an on-policy reward model. Qwen2.5-14B student becomes comparable to GPT-5-Chat on LMSYS. |\n| OVD | Verbal scores (0–9) on student trajectories | Student | Sequence | General | Replaces token-level logit matching with verbal scoring; +25.7% over baselines. |\n\n\u003C\u002Fdetails>\n\n---\n\n## ♻️ Self-Distillation with Privileged Context — OPSD\n\n**Same model = teacher = student**, but the teacher is conditioned on something the student doesn't see (verified trace, ground-truth answer, \"be concise\" prefix, longer context, document, …). The gap exists *because of the conditioning*, not weights.\n\nSeveral entries previously listed here turned out on verification to use static teacher data or a fixed self-rewritten dataset rather than student rollouts; those have been excluded. 
SPIN was reclassified to [Iterative Self-Bootstrapping](#-iterative-self-bootstrapping).\n\n| Resource | 🌟 Stars | Date | Org | Paper Link | Title \u002F Notes |\n| :----: | :----: | :----: |  :----: | :----: | :---- |\n| [OPSD](https:\u002F\u002Fgithub.com\u002Fsiyan-zhao\u002FOPSD) | \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fsiyan-zhao\u002FOPSD?style=for-the-badge&logo=github&logoColor=white&labelColor=181717&color=ffd700\" alt=\"Stars\"> | 2026.01 | UCLA \u002F Meta FAIR | [arXiv 2601.18734](https:\u002F\u002Farxiv.org\u002Fabs\u002F2601.18734) · [blog](https:\u002F\u002Fsiyan-zhao.github.io\u002Fblog\u002F2026\u002Fopsd\u002F) | OPSD — Self-Distilled Reasoner |\n| [Self-Distillation](https:\u002F\u002Fgithub.com\u002Fidanshen\u002FSelf-Distillation) | \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fidanshen\u002FSelf-Distillation?style=for-the-badge&logo=github&logoColor=white&labelColor=181717&color=ffd700\" alt=\"Stars\"> | 2026.01 | MIT \u002F ETH | [arXiv 2601.19897](https:\u002F\u002Farxiv.org\u002Fabs\u002F2601.19897) | SDFT-Continual |\n| [mtp-lm](https:\u002F\u002Fgithub.com\u002Fjwkirchenbauer\u002Fmtp-lm) | \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fjwkirchenbauer\u002Fmtp-lm?style=for-the-badge&logo=github&logoColor=white&labelColor=181717&color=ffd700\" alt=\"Stars\"> | 2026.02 | UMD \u002F LLNL | [arXiv 2602.06019](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.06019) | MTP Self-Distill |\n| [LMOps `\u002Fopcd`](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FLMOps) | \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FLMOps?style=for-the-badge&logo=github&logoColor=white&labelColor=181717&color=ffd700\" alt=\"Stars\"> | 2026.02 | Microsoft Research | [arXiv 2602.12275](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.12275) | OPCD — On-Policy Context Distillation |\n| 
[GATES](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.20574) | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F📄-paper-845C40?style=for-the-badge)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.20574) | 2026.02 | UMD | [arXiv 2602.20574](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.20574) | GATES (Self-Distillation under Privileged Context) |\n| [CRISP_Reasoning_Compression](https:\u002F\u002Fgithub.com\u002FHJSang\u002FCRISP_Reasoning_Compression) | \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FHJSang\u002FCRISP_Reasoning_Compression?style=for-the-badge&logo=github&logoColor=white&labelColor=181717&color=ffd700\" alt=\"Stars\"> | 2026.03 | LinkedIn | [arXiv 2603.05433](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.05433) | OPSDC \u002F CRISP |\n| [LMOps `\u002Foel`](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FLMOps) | \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FLMOps?style=for-the-badge&logo=github&logoColor=white&labelColor=181717&color=ffd700\" alt=\"Stars\"> | 2026.03 | Microsoft Research | [arXiv 2603.16856](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.16856) | OEL — Online Experiential Learning |\n| [self-distillation-analysis](https:\u002F\u002Fgithub.com\u002Fbeanie00\u002Fself-distillation-analysis) | \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fbeanie00\u002Fself-distillation-analysis?style=for-the-badge&logo=github&logoColor=white&labelColor=181717&color=ffd700\" alt=\"Stars\"> | 2026.03 | MSR \u002F KAIST \u002F SNU | [arXiv 2603.24472](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.24472) | **Why Does Self-Distillation (Sometimes) Degrade Reasoning?** — diagnostic study of OPSD failure modes |\n| [ml-ssd](https:\u002F\u002Fgithub.com\u002Fapple\u002Fml-ssd) | \u003Cimg 
src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fapple\u002Fml-ssd?style=for-the-badge&logo=github&logoColor=white&labelColor=181717&color=ffd700\" alt=\"Stars\"> | 2026.04 | Apple MLR | [arXiv 2604.01193](https:\u002F\u002Farxiv.org\u002Fabs\u002F2604.01193) | Apple — Embarrassingly Simple Self-Distillation |\n| [Skill-SD](https:\u002F\u002Fskill-sd.github.io\u002F) | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F📄-paper-845C40?style=for-the-badge)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2604.10674) | 2026.04 | UCAS \u002F CUHK \u002F USTC \u002F vivo AI Lab | [arXiv 2604.10674](https:\u002F\u002Farxiv.org\u002Fabs\u002F2604.10674) | **Skill-SD** — skill-conditioned OPSD for multi-turn LLM agents |\n| [SD-Zero](https:\u002F\u002Farxiv.org\u002Fabs\u002F2604.12002) | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F📄-paper-845C40?style=for-the-badge)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2604.12002) | 2026.04 | Princeton \u002F Toronto \u002F CMU | [arXiv 2604.12002](https:\u002F\u002Farxiv.org\u002Fabs\u002F2604.12002) | **SD-Zero** — Self-Revision turns binary rewards into dense supervision |\n| [π-Play](https:\u002F\u002Farxiv.org\u002Fabs\u002F2604.14054) | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F📄-paper-845C40?style=for-the-badge)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2604.14054) | 2026.04 | CASIA \u002F UCAS \u002F Meituan | [arXiv 2604.14054](https:\u002F\u002Farxiv.org\u002Fabs\u002F2604.14054) | **π-Play** — multi-agent self-play turns the question-construction path into privileged context for OPSD on search agents |\n| [OPSDL](https:\u002F\u002Farxiv.org\u002Fabs\u002F2604.17535) | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F📄-paper-845C40?style=for-the-badge)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2604.17535) | 2026.04 | Baidu | [arXiv 2604.17535](https:\u002F\u002Farxiv.org\u002Fabs\u002F2604.17535) | OPSDL (Long-Context Self-Distillation) |\n| 
[MSD](https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.02971) | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F📄-paper-845C40?style=for-the-badge)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.02971) | 2026.05 | Tongji \u002F Shanghai AI Lab | [arXiv 2605.02971](https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.02971) | **MSD** — multilingual safety OPSD; teacher conditioned on English query translation + CoT instruction; DPSW weights safety-critical tokens |\n| [COPSD](https:\u002F\u002Fgithub.com\u002Fcisnlp\u002FCOPSD) | \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fcisnlp\u002FCOPSD?style=for-the-badge&logo=github&logoColor=white&labelColor=181717&color=ffd700\" alt=\"Stars\"> | 2026.05 | LMU Munich \u002F MCML | [arXiv 2605.09548](https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.09548) | **COPSD** — crosslingual OPSD; teacher sees English problem translation + reference solution, student rolls out in low-resource language (17 African languages) |\n\n\u003Cdetails>\n\u003Csummary>📋 Click to view technical details\u003C\u002Fsummary>\n\n| Method | Privileged Context (Teacher) | Loss \u002F Divergence | Granularity | Domain | Notes |\n| :----: | :----: | :----: | :----: | :----: | :---- |\n| OPSD (Self-Distilled Reasoner) | Verified reasoning trace | Per-token RKL with point-wise clipping | Token | Math reasoning | Same-model OPSD; matches GRPO with 1×8 rollouts and 1024 length vs. GRPO's 8×16 \u002F 16k. **The canonical OPSD paper.** Built on TRL's GOLD trainer. |\n| SDFT-Continual (idanshen) | Demo-conditioned same model | RKL on student rollouts vs. demo-conditioned teacher | Token | Continual learning | Self-distillation enables continual learning. |\n| MTP Self-Distill | Multi-token prediction same model | RKL on student rollouts | Token | General | Multi-Token Prediction via Self-Distillation. Author-stated on-policy. 
|\n| OPCD | In-context-knowledge-augmented same model | RKL on student rollouts | Token | Knowledge internalisation | Internalise context to be faithful even after context is removed. |\n| GATES | Document-conditioned tutor (same model) | RKL gated by tutor consensus | Token (gated) | Document QA | Both tutor and student sample rollouts; on-policy student-rollout updates contribute \"modest additional improvement\" on top of off-policy distillation. Mixed. |\n| CRISP \u002F OPSDC | \"Be concise\" instruction prefix | Per-token RKL on student rollouts | Token | Reasoning compression | Compresses long-CoT without entropy collapse (unlike RL-with-length-penalty). |\n| OEL (Online Experiential Learning) | Same model with interactive game environment | RKL on student rollouts | Token | Game \u002F planning | Self-distillation on interactive trajectories. |\n| Why-Does-SD-Degrade (analysis) | Varies (controlled study over rich-vs-thin context teachers) | RKL on student rollouts (analysis only) | Token | Math reasoning (in-domain + OOD) | **Diagnostic paper**, not a training method. Finds that conditioning the teacher on richer privileged context suppresses *epistemic verbalization* (uncertainty expression) in the student → fast in-domain gains but up to 40% OOD drops on Qwen3-8B \u002F DeepSeek-Distill-Qwen-7B \u002F Olmo3-7B-Instruct. Implication: privileged-context richness is a double-edged knob in OPSD. |\n| Apple SSD | Same model w\u002F temperature\u002Ftruncation sampling | Cross-entropy on its own samples | Sequence | Code generation | \"Embarrassingly simple\" — sample, then SFT on those samples. Degenerate OPSD; \"decoding-config\" privilege. |\n| Skill-SD | Trajectory-derived skill summaries condition the teacher only | GRPO + importance-weighted reverse-KL on student rollouts | Token | Multi-turn agentic tasks (AppWorld, Sokoban) | Extends OPSD to multi-turn agentic interaction with dynamic training-only skills. 
|\n| SD-Zero | Reviser conditioned on generator's response + binary reward | Per-token KL: distill reviser → generator on student rollouts | Token | Math \u002F code reasoning | Single model plays Generator + Reviser; reviser's reward-conditioned token distribution becomes dense supervision over the generator's response. Outperforms RFT, GRPO, SDFT under matched sample budget on Qwen3-4B-Instruct \u002F Olmo-3-7B-Instruct (≥10% over base). Exhibits token-level self-localization and iterative self-evolution. |\n| π-Play | Teacher conditioned on **Question Construction Path (QCP)** — the reverse-direction artifact emitted by an examiner agent when it generates the task | Per-token reverse KL on student rollouts; teacher is an EMA copy of the student (τ=0.05) | Token | Search \u002F deep-research \u002F multi-hop QA agents (NQ, TriviaQA, HotpotQA, 2WikiMQA, MuSiQue, …) | Self-play loop *examiner ↔ student\u002Fteacher* with no external data. The QCP is privileged because it captures the reverse solution path the examiner used to construct the task; the teacher sees it, the student doesn't. Converts sparse-reward self-play into dense per-token supervision; data-free π-Play surpasses fully supervised search agents and is 2–3× more sample-efficient than conventional self-play. |\n| OPSDL | Short-context same model | Point-wise RKL | Token | Long-context | On-Policy Self-Distillation for Long-Context LMs. |\n| MSD | English (high-resource) query translation + CoT instruction (privileged *crosslingual* context) | Per-token reverse KL with **Dual-Perspective Safety Weighting (DPSW)**: w_t = w_t^T · w_t^S, combining teacher top-K entropy (safety-criticality) × student disagreement risk (1−p_S) | Token (DPSW-weighted) | Multilingual safety alignment (jailbreak + utility benchmarks; e.g., English → Javanese) | Same model = teacher = student; ships **on-policy MSD** (student samples its own multilingual responses, strict C1+C2) and **off-policy MSD** (teacher-sampled) variants. 
Requires no translated response data — only multilingual queries. |\n| COPSD | English problem translation + reference solution (privileged *crosslingual* context) | Per-token reverse KL with full-vocabulary logit distillation; teacher fixed during training; gradients flow only through student | Token | Multilingual math reasoning (PolyMath, AfriMGSM — 17 low-resource African languages) | Same model serves as both student (rollouts in low-resource language) and teacher (English-conditioned). Transfers a model's own high-resource reasoning behavior to low-resource languages; improves answer-format adherence and test-time scaling. |\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>📝 \u003Cb>Strictness notes\u003C\u002Fb>\u003C\u002Fsummary>\n\n- **Apple SSD** — ⚠️ C2 is degenerate: no teacher KL signal; pure self-generated SFT (sample with temperature\u002Ftruncation, then SFT on those samples). Closer to STaR-style self-bootstrapping than to OPSD. Kept because the \"teacher\" is the same model with a different decoding config — privileged-context-by-decoding.\n- **GATES** — ⚠️ Authors' own ablation says off-policy trajectory-level distillation drives the *primary gains*; on-policy student-rollout updates contribute only \"modest additional improvement\". Mixed; the OPSD leg is genuine but secondary.\n- **SD-Zero** — privileged context is *non-textual*: the reviser is conditioned on the generator's full response **plus its scalar binary reward**. C1 ✓ (generator samples its own rollouts), C2 ✓ (per-token KL from reviser). Compared head-to-head against GRPO in the paper but is not itself an RL method — there is no policy-gradient objective; the reward is a conditioning signal, not a return. Listed in OPSD rather than OPD-RL Hybrids for that reason.\n- **Why-Does-SD-Degrade** — analysis-only; no new training algorithm proposed. 
Listed here because the failure mode it characterises (epistemic-verbalization collapse under rich privileged context) is specific to OPSD.\n- **π-Play** — teacher and student have *separate parameter sets*; the teacher is an EMA-tracking copy of the student rather than literally the same weights. Listed in OPSD because (i) the paper itself frames the method as \"Privileged Self-Distillation\" and (ii) the gap between teacher and student exists *because of QCP conditioning*, not weight divergence (the EMA target collapses to the student in the limit). C1 ✓ (student samples its own rollouts), C2 ✓ (per-token RKL from QCP-conditioned teacher).\n\n\u003C\u002Fdetails>\n\n### 🔁 Iterative Self-Bootstrapping\n\nSame model is the teacher, but as a *frozen earlier checkpoint*, not a privileged-context view. The teacher snapshot is frozen for one round, the student trains, then the snapshot rolls forward. Listed separately because the supervision is typically sequence-level \u002F preference, not per-token logit-distillation.\n\n| Resource | 🌟 Stars | Date | Org | Paper Link | Title \u002F Notes |\n| :----: | :----: | :----: |  :----: | :----: | :---- |\n| [SPIN](https:\u002F\u002Fgithub.com\u002Fuclaml\u002FSPIN) | \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fuclaml\u002FSPIN?style=for-the-badge&logo=github&logoColor=white&labelColor=181717&color=ffd700\" alt=\"Stars\"> | 2024.01 | UCLA | [arXiv 2401.01335](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.01335) | SPIN — Self-Play Fine-Tuning (ICML 2024) |\n| [rStar](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FrStar) | \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmicrosoft\u002FrStar?style=for-the-badge&logo=github&logoColor=white&labelColor=181717&color=ffd700\" alt=\"Stars\"> | 2025.01 | Microsoft Research | [rStar-Math 2501.04519](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.04519) · [rStar2-Agent 
2508.20722](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.20722) | rStar \u002F rStar-Math \u002F rStar2-Agent |\n\n\u003Cdetails>\n\u003Csummary>📝 \u003Cb>Strictness notes\u003C\u002Fb>\u003C\u002Fsummary>\n\n- **SPIN** — ⚠️ C1 ✓ (student samples), but C2 fails the strict per-token logit form: supervision is a *sequence-level DPO preference* against the previous frozen checkpoint. More accurately \"iterative on-policy DPO\" than per-token OPD. Kept because the \"teacher = previous self\" pattern is what people search for in OPD lists.\n- **rStar \u002F rStar-Math \u002F rStar2-Agent** — ⚠️ MCTS-filtered student samples + SFT; the \"teacher signal\" is a step-level PPM \u002F discriminator score, not per-token logit KL. Iterative self-improvement, not classical OPD.\n\n\u003C\u002Fdetails>\n\n---\n\n## 🤝 OPD-RL Hybrids — Inside-RL OPD\n\nMethods that fuse OPD with **RLVR \u002F GRPO \u002F PPO \u002F DPO**. Either teacher logits act as dense reward shaping or a trust-region anchor inside an RL objective, or BoN \u002F preference signals serve as the imitation target.\n\n| Resource | 🌟 Stars | Date | Org | Paper Link | Title \u002F Notes |\n| :----: | :----: | :----: |  :----: | :----: | :---- |\n| [BOND](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.14622) | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F📄-paper-845C40?style=for-the-badge)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.14622) | 2024.07 | Google DeepMind | [arXiv 2407.14622](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.14622) | BOND (Best-of-N Distillation) |\n| [Faster WIND](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.20727) | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F📄-paper-845C40?style=for-the-badge)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.20727) | 2024.10 | CMU \u002F Google | [arXiv 2410.20727](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.20727) | Faster WIND (iterative BoN) — AISTATS 2025 |\n| 
[AlignDistil](https:\u002F\u002Fgithub.com\u002Fsongmzhang\u002FAlignDistil) | \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fsongmzhang\u002FAlignDistil?style=for-the-badge&logo=github&logoColor=white&labelColor=181717&color=ffd700\" alt=\"Stars\"> | 2025.03 | BJTU \u002F Tencent | [arXiv 2503.02832](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.02832) | AlignDistil — RLHF-equivalent KD (ACL 2025) |\n| [LUFFY](https:\u002F\u002Fgithub.com\u002FElliottYan\u002FLUFFY) | \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FElliottYan\u002FLUFFY?style=for-the-badge&logo=github&logoColor=white&labelColor=181717&color=ffd700\" alt=\"Stars\"> | 2025.04 | Westlake U. | [arXiv 2504.14945](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.14945) | LUFFY — mixed-policy GRPO |\n| [KETCHUP](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.19024) | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F📄-paper-845C40?style=for-the-badge)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.19024) | 2025.04 | U. 
Alberta | [arXiv 2504.19024](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.19024) | KETCHUP (k-step RL-KD) |\n| [KDRL](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.02208) | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F📄-paper-845C40?style=for-the-badge)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.02208) | 2025.06 | HIT \u002F Huawei | [arXiv 2506.02208](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.02208) | KDRL (Joint KD + RL) |\n| [SDPO](https:\u002F\u002Fgithub.com\u002Flasgroup\u002FSDPO) | \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Flasgroup\u002FSDPO?style=for-the-badge&logo=github&logoColor=white&labelColor=181717&color=ffd700\" alt=\"Stars\"> | 2026.01 | ETH \u002F MIT | [arXiv 2601.20802](https:\u002F\u002Farxiv.org\u002Fabs\u002F2601.20802) · [project](https:\u002F\u002Fself-distillation.github.io\u002FSDPO) | SDPO — RL via Self-Distillation |\n| [KEPO](https:\u002F\u002Fgithub.com\u002FCorleno\u002FKEPO) | \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FCorleno\u002FKEPO?style=for-the-badge&logo=github&logoColor=white&labelColor=181717&color=ffd700\" alt=\"Stars\"> | 2026.01 | Industrial | [arXiv 2602.00400](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.00400) | KEPO |\n| [Open-AgentRL](https:\u002F\u002Fgithub.com\u002FGen-Verse\u002FOpen-AgentRL) | \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FGen-Verse\u002FOpen-AgentRL?style=for-the-badge&logo=github&logoColor=white&labelColor=181717&color=ffd700\" alt=\"Stars\"> | 2026.02 | Gen-Verse | — | Open-AgentRL — RLAnything \u002F DemyAgent multi-domain |\n| [Towards-On-Policy-SFT](https:\u002F\u002Fgithub.com\u002Fzhangmiaosen2000\u002FTowards-On-Policy-SFT) | \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fzhangmiaosen2000\u002FTowards-On-Policy-SFT?style=for-the-badge&logo=github&logoColor=white&labelColor=181717&color=ffd700\" alt=\"Stars\"> | 2026.02 | MSRA 
\u002F Shopee | [arXiv 2602.12222](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.12222) | DDT — on-policy SFT theory |\n| [𝒳-KD](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.12674) | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F📄-paper-845C40?style=for-the-badge)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.12674) | 2026.02 | BUPT | [arXiv 2602.12674](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.12674) | 𝒳-KD (IRL-style) |\n| [RLAD](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.22495) | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F📄-paper-845C40?style=for-the-badge)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.22495) | 2026.02 | AWS | [arXiv 2602.22495](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.22495) | RLAD (Reinforcement-aware KD) |\n| [OpenClaw-RL](https:\u002F\u002Fgithub.com\u002FGen-Verse\u002FOpenClaw-RL) | \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FGen-Verse\u002FOpenClaw-RL?style=for-the-badge&logo=github&logoColor=white&labelColor=181717&color=ffd700\" alt=\"Stars\"> | 2026.03 | Gen-Verse | [arXiv 2603.10165](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.10165) | OpenClaw-RL — combines GRPO + OPD |\n| [ExGRPO](https:\u002F\u002Fgithub.com\u002FZhen-Tan-dmml\u002FExGRPO) | \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FZhen-Tan-dmml\u002FExGRPO?style=for-the-badge&logo=github&logoColor=white&labelColor=181717&color=ffd700\" alt=\"Stars\"> | 2026.03 | UNC \u002F ASU | [arXiv 2603.19266](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.19266) | Probing-to-Refine \u002F EI \u002F EXGRPO |\n| [HDPO](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.23871) | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F📄-paper-845C40?style=for-the-badge)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.23871) | 2026.03 | NVIDIA | [arXiv 2603.23871](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.23871) | HDPO (Hybrid Distillation PO) |\n| 
[RLSD](https:\u002F\u002Farxiv.org\u002Fabs\u002F2604.03128) | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F📄-paper-845C40?style=for-the-badge)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2604.03128) | 2026.04 | Multi-org | [arXiv 2604.03128](https:\u002F\u002Farxiv.org\u002Fabs\u002F2604.03128) | Self-Distilled RLVR (RLSD) |\n| [NPO](https:\u002F\u002Farxiv.org\u002Fabs\u002F2604.20733) | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F📄-paper-845C40?style=for-the-badge)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2604.20733) | 2026.04 | IIE CAS \u002F UCAS \u002F JD.COM | [arXiv 2604.20733](https:\u002F\u002Farxiv.org\u002Fabs\u002F2604.20733) | NPO \u002F AutoNPO — mixed-policy GRPO with **near-future self** as teacher |\n\n\u003Cdetails>\n\u003Csummary>📋 Click to view technical details\u003C\u002Fsummary>\n\n| Method | Inner RL | Teacher Role | Data | Granularity | Domain | Notes |\n| :----: | :----: | :----: | :----: | :----: | :----: | :---- |\n| BOND | Best-of-N distillation | Same model's BoN target | Student (iterative) | Sequence | Alignment | Treats Best-of-N as the target distribution; iterative anchor; Jeffreys divergence. |\n| Faster WIND | Win-rate dominance | Same model BoN | Student (iterative) | Sequence | Alignment | Game-theoretic acceleration of BOND. |\n| AlignDistil | RLHF-equivalent KD | DPO-derived combination of DPO model + ref-model logits | Student | Token | Alignment | Re-frames DPO as policy distillation. |\n| LUFFY | Mixed-Policy GRPO + policy shaping | Off-policy R1 traces inserted into student rollouts | Mixed | Token + sequence | Reasoning | \"Learn to reason under off-policy guidance\". On-policy student-roll + off-policy teacher-trace mix. |\n| KETCHUP | k-step return REINFORCE on KD | External teacher | Student | Sequence | General | RL-based KD with k-step Bellman returns. 
|\n| KDRL | Joint reverse-KL + GRPO rule-based reward | External teacher (Skywork-OR1) | Student | Token + outcome | Reasoning | Unified KD + RL objective. |\n| SDPO | Custom self-distillation policy gradient | Feedback-conditioned same model = self-teacher | Student | Token | Code, tool-use, science | Sample student rollout, get tokenised feedback, re-evaluate under feedback-conditioned self-teacher, distill the corrected next-token distribution back into policy. |\n| KEPO | Knowledge-enhanced PO | Knowledge-base teacher | Mixed | Sequence | Reasoning | Adds KB grounding to preference RL. |\n| Open-AgentRL | GRPO-TCR | Multi-domain teachers | Student | Token | Reasoning \u002F GUI \u002F Coding | Includes process-reward modelling via SandboxFusion. |\n| DDT | On-policy SFT theory | Theoretical | Student | Token | General | Distribution Discriminant Theory; foundations for on-policy SFT. |\n| 𝒳-KD | AVRIL inverse-RL | Joint reward + policy distillation | Student | Token + sequence | General | IRL-flavoured experiential KD. |\n| RLAD | PPO\u002FGRPO ratio anchored to teacher–old-policy mixture | External teacher (Qwen3-32B) | Student | Token | Reasoning | Trust-region likelihood-ratio. |\n| OpenClaw-RL | GRPO + OPD | Judge model extracts hindsight hints, teacher token-logprob gap = directional advantage | Mixed | Token | Terminal \u002F GUI \u002F SWE \u002F Tool-call | Unifies binary RL and OPD in one trainer. |\n| Probing-to-Refine | \"Explanatory probes\" force logical articulation; GRPO + dialogue-structure reward | Self-probe | Student | Sequence | Reasoning | Reinforcement Distillation via Explanatory Inversion. |\n| HDPO | RL on most prompts; on \"cliff\" prompts generate privileged rollouts and self-distill | Same model w\u002F privilege | Student | Token | Reasoning | Privileged self-distillation as RL fallback. 
|\n| Self-Distilled RLVR (RLSD) | RLVR direction + teacher evidence-ratio modulates magnitude | Same model + privileged answer | Student | Token + outcome | Reasoning | Combines self-distillation magnitudes with RLVR directions. |\n| NPO \u002F AutoNPO | Mixed-Policy GRPO | Verifier-filtered trajectories from a **later checkpoint of the same training run** | Mixed | Sequence | Reasoning (RLVR) | \"Learn from your near-future self\". Picks a teacher that is *strong enough* (higher Q than current policy) yet *close enough* (low V vs. external teachers like R1), maximising effective Q\u002FV signal. AutoNPO adaptively schedules the interventions; preserves higher entropy than vanilla GRPO. |\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>📝 \u003Cb>Strictness notes\u003C\u002Fb>\u003C\u002Fsummary>\n\n- **LUFFY** — ⚠️ Mixed-policy: half on-policy student rollouts (C1+C2 ✓) + half *off-policy R1 traces* inserted into GRPO (C1 ✗ on the off-policy half). Net is OPD-flavor with off-policy import.\n- **NPO \u002F AutoNPO** — ⚠️ Same mixed-policy GRPO pattern as LUFFY, but the off-policy traces come from a **near-future checkpoint of the same run** instead of an external R1 teacher. Authors frame it as RLVR, not OPD; included here as an OPD variant because (a) the imported trajectories play the same \"stronger-self teacher\" role, and (b) the paper itself explicitly invites follow-up work to inject the near-future-self signal via on-policy distillation. Strict per-token logit KL (C2) is *not* the loss — supervision is verifier-filtered sequence-level trajectory mixing inside GRPO.\n- **BOND, Faster WIND** — ⚠️ Iterative self-bootstrapping; teacher = same model's BoN distribution. Loss is Jeffreys \u002F win-rate-dominance at the **sequence level** — *no per-token logit supervision* (C2 partially fails strict form). 
More accurately \"on-policy iterative alignment\" than OPD.\n- **KETCHUP** — ⚠️ Sequence-level RL-based KD with k-step Bellman returns; the paper describes itself as \"RL-based KD\". Closer to RL with a KD-anchor reward than to per-token OPD.\n- **𝒳-KD** — ⚠️ Built on the AVRIL inverse-RL framework with joint reward modeling; closer to an IRL+OPD hybrid than to pure OPD.\n- **DDT** — ⚠️ Theoretical foundations paper for \"on-policy SFT\" (Distribution Discriminant Theory); not a specific deployable algorithm. Kept for completeness.\n- **KEPO, Open-AgentRL, Probing-to-Refine** — ⚠️ C1 ✓ (on-policy student rollouts), but the abstracts do not make clear whether C2 is a per-token KL term, sequence-level reward shaping, or preference optimization. Listed because the papers describe themselves as OPD\u002Fon-policy distillation; the exact form of C2 needs a full-paper reading.\n\n\u003C\u002Fdetails>\n\n---\n\n## 🧠 Reasoning OPD (by application)\n\nGenuine OPD work on math \u002F code \u002F long-CoT reasoning. 
Off-policy SFT distillation from R1, pure-RL methods (Skywork-OR1, SimpleRL-Zoo, Time-R1), and analysis-only papers are excluded from this list because none of them trains on student rollouts under teacher supervision.\n\n| Resource | 🌟 Stars | Date | Org | Paper Link | Title \u002F Notes |\n| :----: | :----: | :----: |  :----: | :----: | :---- |\n| [G-OPD](https:\u002F\u002Fgithub.com\u002FRUCBM\u002FG-OPD) | \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FRUCBM\u002FG-OPD?style=for-the-badge&logo=github&logoColor=white&labelColor=181717&color=ffd700\" alt=\"Stars\"> | 2026.02 | RUC \u002F Tencent | [arXiv 2602.12125](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.12125) | G-OPD (cross-list) |\n| [OPD-AVMP](https:\u002F\u002Farxiv.org\u002Fabs\u002F2604.07944) | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F📄-paper-845C40?style=for-the-badge)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2604.07944) | 2026.04 | Academic | [arXiv 2604.07944](https:\u002F\u002Farxiv.org\u002Fabs\u002F2604.07944) | OPD for Autonomous Vehicle Motion Planning |\n| [OPD](https:\u002F\u002Fgithub.com\u002Fthunlp\u002FOPD) | \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fthunlp\u002FOPD?style=for-the-badge&logo=github&logoColor=white&labelColor=181717&color=ffd700\" alt=\"Stars\"> | 2026.04 | Tsinghua THUNLP | [arXiv 2604.13016](https:\u002F\u002Farxiv.org\u002Fabs\u002F2604.13016) | Rethinking OPD recipe |\n\nMost reasoning-OPD work is already listed under **OPSD** (siyan-zhao\u002FOPSD, CRISP, SD-Zero), **Iterative Self-Bootstrapping** (rStar \u002F rStar-Math), **OPD-RL Hybrids** (LUFFY, RLAD, KDRL, RLSD, HDPO), and **White-Box** (REOPOLD, Fast OPD, Entropy-Aware OPD, TIP, SCOPE, PACED). 
This section lists only items not already covered above.\n\n\u003Cdetails>\n\u003Csummary>📋 Click to view technical details\u003C\u002Fsummary>\n\n| Method | Loss \u002F Objective | Data | Teacher | Granularity | Base \u002F Benchmark | Notes |\n| :----: | :----: | :----: | :----: | :----: | :----: | :---- |\n| OPD for AV Motion Planning | GPT-Driver framework + GKD on student-generated trajectories | Student | White-box (LLM teacher) | Token | Driving | 5× model-size reduction. |\n| Rethinking OPD (THUNLP) | RKL with progressive top-K alignment + off-policy cold-start | Mixed | White-box (Qwen3-4B\u002F1.7B teacher pairs) | Token | Math reasoning | Identifies *teacher-novelty* and *thinking-pattern compatibility* as success conditions. |\n\n\u003C\u002Fdetails>\n\n---\n\n## 🖼️ Multimodal OPD (VLM, Video, Audio, Image)\n\nStrict OPD work in non-text modalities. Many multimodal models branded \"R1\" or \"GRPO\" are pure RL (no teacher-distillation loss) and are excluded.\n\n| Resource | 🌟 Stars | Date | Org | Paper Link | Title \u002F Notes |\n| :----: | :----: | :----: |  :----: | :----: | :---- |\n| [piFlow](https:\u002F\u002Fgithub.com\u002FLakonik\u002FpiFlow) | \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FLakonik\u002FpiFlow?style=for-the-badge&logo=github&logoColor=white&labelColor=181717&color=ffd700\" alt=\"Stars\"> | 2025.10 | Multi-org | [arXiv 2510.14974](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.14974) | π-Flow — image \u002F flow OPD (ICLR 2026) |\n| [VOLD](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.23497) | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F📄-paper-845C40?style=for-the-badge)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.23497) | 2025.10 | INRIA \u002F Goethe Univ. 
| [arXiv 2510.23497](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.23497) · [project page](https:\u002F\u002Fwalidbousselham.com\u002FVOLD\u002F) | VOLD (LLM→VLM OPD) — repo placeholder; ICLR 2026 |\n| [Step-Audio-R1](https:\u002F\u002Fgithub.com\u002Fstepfun-ai\u002FStep-Audio-R1) | \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fstepfun-ai\u002FStep-Audio-R1?style=for-the-badge&logo=github&logoColor=white&labelColor=181717&color=ffd700\" alt=\"Stars\"> | 2025.11 | StepFun | [arXiv 2511.15848](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.15848) | Step-Audio-R1 |\n| [CORD](https:\u002F\u002Farxiv.org\u002Fabs\u002F2601.16547) | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F📄-paper-845C40?style=for-the-badge)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2601.16547) | 2026.01 | Baidu Ernie | [arXiv 2601.16547](https:\u002F\u002Farxiv.org\u002Fabs\u002F2601.16547) | Reasoning: Text ➡️ Audio |\n| [Video-OPD](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.02994) | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F📄-paper-845C40?style=for-the-badge)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.02994) | 2026.02 | Industrial | [arXiv 2602.02994](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.02994) | Video-OPD |\n| [X-OPD](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.24596) | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F📄-paper-845C40?style=for-the-badge)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.24596) | 2026.03 | Tencent Hunyuan \u002F ZJU | [arXiv 2603.24596](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.24596) | X-OPD (Speech LLM) |\n| [Uni-OPD](https:\u002F\u002Fgithub.com\u002FWenjinHou\u002FUni-OPD) | \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FWenjinHou\u002FUni-OPD?style=for-the-badge&logo=github&logoColor=white&labelColor=181717&color=ffd700\" alt=\"Stars\"> | 2026.05 | Multi-org | [arXiv 2605.03677](https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.03677) 
| **Uni-OPD** — unified OPD across LLMs & MLLMs via a dual-perspective recipe |\n\n\u003Cdetails>\n\u003Csummary>📋 Click to view technical details\u003C\u002Fsummary>\n\n| Method | Modality | Teacher | Loss | Data | Notes |\n| :----: | :----: | :----: | :----: | :----: | :---- |\n| π-Flow | Image generation (flow models) | Teacher velocity field | L2 imitation distillation | Student | Strict OPD for diffusion: student predicts policy at each timestep along its own trajectory. |\n| Step-Audio-R1 | Audio reasoning | Self (modality-grounded) | Iterative self-distillation + SFT + PPO\u002FRLVR | Student | Iterative on-policy cycles; only audio-relevant questions used in self-distill. |\n| VOLD | LLM → VLM | Text-only LLM | GRPO + on-policy KL distillation | Student | Cold-start SFT alignment + unified RL+KD; ICLR 2026. The flagship VLM OPD recipe. |\n| CORD | LLM → Audio | Self with text | Token-level RKL and sequence-level KL + GRPO | Student | Aligns cross-modal reasoning between text and audio inputs. |\n| Video-OPD | MLLM | LLM teacher | Token-level KL on student rollouts | Student | Temporal video grounding via OPD. |\n| X-OPD | Speech LLM | Text LLM | Cross-modal token-level KL | Student | Capability alignment in speech LLMs. |\n| Uni-OPD | LLM & MLLM (5 domains \u002F 16 benchmarks) | Single- or multi-teacher; supports strong-to-weak and cross-modal | Outcome-guided margin calibration + offline\u002Fonline data balancing | Student rollouts | Dual-perspective recipe: addresses (i) insufficient exploration of informative student states via data balancing and (ii) unreliable teacher supervision via margin calibration restoring order-consistency between correct\u002Fincorrect trajectories. |\n\n\u003C\u002Fdetails>\n\n---\n\n## 🤖 Agent & Embodied OPD (by application)\n\nGenuine OPD where the **student is an agent** rolling out actions; a teacher (or the model itself) supervises those trajectories. 
Pure-RL agent works (WebRL, WebAgent-R1, InfiGUI-G1, GUI-R1) and off-policy SFT-on-teacher-trajectories (Nardien, AgentRefine, Chain-of-Agents, MapCoder-Lite, SAD, Structured-Web) are excluded.\n\n| Resource | 🌟 Stars | Date | Org | Paper Link | Title \u002F Notes |\n| :----: | :----: | :----: |  :----: | :----: | :---- |\n| [LLM4Teach](https:\u002F\u002Fgithub.com\u002FZJLAB-AMMI\u002FLLM4Teach) | \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FZJLAB-AMMI\u002FLLM4Teach?style=for-the-badge&logo=github&logoColor=white&labelColor=181717&color=ffd700\" alt=\"Stars\"> | 2023.11 (updated 2025) | ZJ Lab AMMI | [arXiv 2311.13373](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.13373) | LLM4Teach — small-RL agent guided by LLM |\n| [RPD](https:\u002F\u002Fgithub.com\u002FRefined-Policy-Distillation\u002FRPD) | \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FRefined-Policy-Distillation\u002FRPD?style=for-the-badge&logo=github&logoColor=white&labelColor=181717&color=ffd700\" alt=\"Stars\"> | 2025.03 | TUM \u002F Freiburg | [arXiv 2503.05833](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.05833) · [project](https:\u002F\u002Frefined-policy-distillation.github.io\u002F) | Refined Policy Distillation, VLA (IROS 2026) |\n| [easydistill](https:\u002F\u002Fgithub.com\u002Fmodelscope\u002Feasydistill) | \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmodelscope\u002Feasydistill?style=for-the-badge&logo=github&logoColor=white&labelColor=181717&color=ffd700\" alt=\"Stars\"> | 2025.09 | Alibaba ModelScope | [SCoRe arXiv 2509.14257](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.14257) | `\u002Fprojects\u002FSCoRe` |\n| [OpenClaw-RL](https:\u002F\u002Fgithub.com\u002FGen-Verse\u002FOpenClaw-RL) | \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FGen-Verse\u002FOpenClaw-RL?style=for-the-badge&logo=github&logoColor=white&labelColor=181717&color=ffd700\" alt=\"Stars\"> | 
2026.03 | Gen-Verse | [arXiv 2603.10165](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.10165) | OpenClaw-RL (cross-list with OPD-RL) |\n| [VLA-OPD](https:\u002F\u002Firpn-lab.github.io\u002FVLA-OPD\u002F) | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F📄-paper-845C40?style=for-the-badge)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.26666) | 2026.03 | HKUST (Guangzhou) — IRPN Lab | [arXiv 2603.26666](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.26666) · [project](https:\u002F\u002Firpn-lab.github.io\u002FVLA-OPD\u002F) | **VLA-OPD** — bridging offline SFT & online RL for VLA via OPD (code coming soon) |\n| [Skill-SD](https:\u002F\u002Farxiv.org\u002Fabs\u002F2604.10674) | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F📄-paper-845C40?style=for-the-badge)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2604.10674) | 2026.04 | UCAS \u002F CUHK \u002F USTC \u002F vivo AI Lab | [arXiv 2604.10674](https:\u002F\u002Farxiv.org\u002Fabs\u002F2604.10674) | Skill-SD — skill-conditioned self-distillation for multi-turn LLM agents (cross-list with OPSD) |\n| [TCOD](https:\u002F\u002Farxiv.org\u002Fabs\u002F2604.24005) | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F📄-paper-845C40?style=for-the-badge)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2604.24005) | 2026.04 | Tongyi Lab, Alibaba \u002F CUHK | [arXiv 2604.24005](https:\u002F\u002Farxiv.org\u002Fabs\u002F2604.24005) | TCOD — temporal curriculum OPD for multi-turn agents; F2B & B2F schedules |\n| [Healthcare AI GYM](https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.02943) | \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fminstar\u002FHealthcare_GYM?style=for-the-badge&logo=github&logoColor=white&labelColor=181717&color=ffd700\" alt=\"Stars\"> | 2026.05 | Upstage AI \u002F Korea University | [arXiv 2605.02943](https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.02943) | Healthcare AI GYM — medical agent RL environment + turn-level truncated OPD |\n| [HyperEyes](https:\u002F\u002Fgithub.com\u002FDeepExperience\u002FHyperEyes) | 
\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FDeepExperience\u002FHyperEyes?style=for-the-badge&logo=github&logoColor=white&labelColor=181717&color=ffd700\" alt=\"Stars\"> | 2026.05 | Xiaohongshu \u002F Cambridge | [arXiv 2605.07177](https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.07177) | HyperEyes — parallel multimodal search agent with dual-grained efficiency-aware RL (TRACE + OPD) |\n\n\n\u003Cdetails>\n\u003Csummary>📋 Click to view technical details\u003C\u002Fsummary>\n\n| Method | Domain | Teacher Role | Loss | Notes |\n| :----: | :----: | :----: | :----: | :---- |\n| LLM4Teach | Small RL agent | LLM teacher (action-level) | Distillation + RL annealed | Strict OPD for embodied; predates the wave. |\n| RPD | VLA \u002F robot manipulation | Teacher VLA actions | PPO + behavioural cloning on student rollouts | Cleanest VLA-OPD recipe. |\n| SCoRe | 12 agent benchmarks | Larger teacher (72B) corrects earliest error in student rollout | SFT-on-corrections + short-horizon RL | 7B student matches 72B teacher. |\n| OpenClaw-RL | Terminal \u002F GUI \u002F SWE \u002F Tool-call | Judge model + token-logprob gap | GRPO + OPD | Hindsight-hint extraction; combines binary RL and per-token OPD. 
|\n| VLA-OPD | VLA \u002F robot manipulation (LIBERO, RoboTwin2.0) | Expert VLA teacher, dense token-level supervision on student trajectories | Reverse-KL (avoids FKL entropy explosion + Hard-CE collapse) | Replaces sparse RL reward; preserves generalist priors and mitigates catastrophic forgetting. |\n| Skill-SD | Multi-turn LLM agents | Skill-conditioned teacher: teacher conditions on analytical skills distilled from completed trajectories; student acts under plain task prompt | Importance-weighted RKL (Schulman K3 + importance weighting for unbiased gradients) + GRPO | Skills guide teacher (not student); dynamic teacher synchronization; sampled-token (not full-vocab) distillation |\n| TCOD | Multi-turn autonomous agents | Full-trajectory teacher; curriculum controls exposed depth: F2B (shallow→deep) or B2F (teacher demos front, student learns back) | Trajectory-level KL with temporal curriculum scheduling (linear pacing) | Solves trajectory-level KL instability in multi-turn OPD |\n| Healthcare AI GYM | Clinical agent | EMA teacher with outcome-privileged info provides dense turn-level KL regularization | GRPO + TT-OPD (turn-level truncated OPD) | Also provides a gym environment for clinical agent training |\n| HyperEyes | Parallel multimodal search agent | External teacher | TRACE (trajectory-level adaptive cost efficiency) + OPD (token-level) + GRPO | Macro (trajectory) + micro (token) dual-grained |\n\n\u003C\u002Fdetails>\n  \n---\n\n## ⚡ Speculative-Decoding Distillation\n\nDistillation **of the draft model** so it better mimics the verifier\u002Ftarget. The on-policy element here is over the *drafter*'s own continuations as judged by the *target*. Listed separately because the goal is *inference speedup*, not student capability.\n\nThis section only lists drafters trained with the drafter's own rollouts. 
Off-policy drafter training (EAGLE-1\u002F2, Medusa, Hydra, Kangaroo, ReDrafter, BiTA, SpecDec++, LayerSkip, FREE, AdaSPEC, POSS) and training-free system tricks (Ouroboros, Sequoia, TriForce, SwiftKV, SuffixDecoding) are excluded.\n\n| Resource | 🌟 Stars | Date | Org | Paper Link | Title \u002F Notes |\n| :----: | :----: | :----: |  :----: | :----: | :---- |\n| [OSD](https:\u002F\u002Fgithub.com\u002FLiuXiaoxuanPKU\u002FOSD) | \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FLiuXiaoxuanPKU\u002FOSD?style=for-the-badge&logo=github&logoColor=white&labelColor=181717&color=ffd700\" alt=\"Stars\"> | 2023.10 | UCB \u002F NVIDIA | [arXiv 2310.07177](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.07177) | Online Speculative Decoding |\n| [DistillSpec](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.08461) | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F📄-paper-845C40?style=for-the-badge)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.08461) | 2023.10 | Google DeepMind | [arXiv 2310.08461](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.08461) | DistillSpec (ICLR 2024) |\n| [HASS](https:\u002F\u002Fgithub.com\u002FHArmonizedSS\u002FHASS) | \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FHArmonizedSS\u002FHASS?style=for-the-badge&logo=github&logoColor=white&labelColor=181717&color=ffd700\" alt=\"Stars\"> | 2024.08 | Academic | [arXiv 2408.15766](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.15766) | HASS |\n| [Falcon](https:\u002F\u002Fgithub.com\u002FBestpay-inc\u002FFalcon) | \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FBestpay-inc\u002FFalcon?style=for-the-badge&logo=github&logoColor=white&labelColor=181717&color=ffd700\" alt=\"Stars\"> | 2024.12 | Bestpay | [arXiv 2412.12639](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.12639) | Falcon |\n| [CORAL](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.16880) | 
[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F📄-paper-845C40?style=for-the-badge)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.16880) | 2025.02 | Academic | [arXiv 2502.16880](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.16880) | CORAL (Cross-Step Representation Alignment) — ACL 2025 |\n| [EAGLE](https:\u002F\u002Fgithub.com\u002FSafeAILab\u002FEAGLE) | \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FSafeAILab\u002FEAGLE?style=for-the-badge&logo=github&logoColor=white&labelColor=181717&color=ffd700\" alt=\"Stars\"> | 2025.03 | PKU \u002F Microsoft | [EAGLE-3](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.01840) | EAGLE-3 — on-policy multi-step TTT |\n| [MASSV](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.10526) | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F📄-paper-845C40?style=for-the-badge)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.10526) | 2025.05 | Cerebras | [arXiv 2505.10526](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.10526) | MASSV (multimodal SD draft) |\n| [DVI](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.05421) | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F📄-paper-845C40?style=for-the-badge)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.05421) | 2025.10 | Academic | [arXiv 2510.05421](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.05421) | DVI (Draft-Verify-Improve, online RL) |\n| [SpecKD](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.24021) | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F📄-paper-845C40?style=for-the-badge)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.24021) | 2025.10 | XJTU (Haiduo Huang et al.) 
| [arXiv 2510.24021](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.24021) | SpecKD \u002F SelecTKD (verification-gated KD; v1=SpecKD, v2 retitled SelecTKD) |\n| [ReSpec](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.26475) | [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F📄-paper-845C40?style=for-the-badge)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.26475) | 2025.10 | Academic | [arXiv 2510.26475](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.26475) | ReSpec (RL drafter evolution) |\n| [SpecForge](https:\u002F\u002Fgithub.com\u002Fsgl-project\u002FSpecForge) | \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fsgl-project\u002FSpecForge?style=for-the-badge&logo=github&logoColor=white&labelColor=181717&color=ffd700\" alt=\"Stars\"> | 2026.03 | SGLang | [LMSYS blog](https:\u002F\u002Fwww.lmsys.org\u002Fblog\u002F2025-07-25-spec-forge\u002F) | SpecForge — open EAGLE-3 training framework |\n\n\u003Cdetails>\n\u003Csummary>📝 \u003Cb>Strictness notes\u003C\u002Fb>\u003C\u002Fsummary>\n\n- **HASS, Falcon** — ⚠️ Partial on-policy: multi-step draft trajectory \u002F glancing distillation uses drafter samples for a subset of the training signal. Listed because the on-policy leg drives the gains.\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>📋 Click to view technical details\u003C\u002Fsummary>\n\n| Method | Drafter type | On-\u002FOff-policy | Loss | Notes |\n| :----: | :----: | :----: | :----: | :---- |\n| Online Speculative Decoding (OSD) | Draft-model | **On-policy \u002F online** | Online KD on rejected tokens | The canonical online\u002Fon-policy SD paper. |\n| DistillSpec | Draft-model | **On-policy** (draft samples) | Choice of FKL\u002FRKL\u002FJSD\u002FTVD | The seminal \"OPD for SD\" paper. |\n| HASS | Self-speculative | **Partial on-policy** (multi-step draft trajectory in training) | Multi-step KD CE + feature alignment | Harmonized objective + harmonized context alignment. 
|\n| Falcon | Draft-model (semi-AR) | **Partial on-policy** (glancing uses draft samples) | Glancing CE + KD | Coupled Sequential Glancing Distillation. |\n| CORAL | Self-speculative | **On-policy multi-step** | Cross-step alignment + CE | Fixes draft training\u002Finference mismatch. |\n| EAGLE-3 | Self-speculative (uses target features) | **On-policy multi-step (TTT)** | Smooth-L1 (feature) + CE (token) | \"Training-Time Test\" simulates draft rollouts during training. |\n| MASSV | Multimodal draft-model | **On-policy** (drafter samples) | KD CE | Multimodal speculative-decoding drafter. |\n| DVI | Self-speculative | **On-policy online (RL on verifier signal)** | KL → reward-masked CE + PG | Continual online training. |\n| SpecKD | Distillation framework | **On-policy with verification gating** | Gated KL (accepted tokens only) | Inverts SD: uses accept\u002Freject as KD-loss gate. |\n| ReSpec | Draft-model | **On-policy online (RL rollouts)** | KD weighted by rollout reward | Drafter evolved during RL training. |\n| SpecForge | Self-speculative (EAGLE-3 framework) | **On-policy TTT supported** | EAGLE-3 losses | Open-source EAGLE-3 training framework. |\n\n\u003C\u002Fdetails>\n\n---\n\n## 🛠️ Frameworks & Toolkits\n\nOpen-source frameworks ","AwesomeOPD is a curated list of open-source repositories and papers that train large language models (LLMs), vision-language models (VLMs), agents, and draft models with On-Policy Distillation (OPD) and On-Policy Self-Distillation (OPSD). The project lays out the core idea of OPD: the student model generates its own trajectories during training, and a teacher model supervises those samples at the token or sequence level. OPSD is the case where, under specific conditions, the teacher and the student are the same model. Each entry is annotated along four design dimensions: teacher source, supervision signal type, rollout data handling, and pipeline position. This lets researchers quickly locate methods that fit their needs. The list suits research that aims to improve existing model performance, use resources more efficiently, or explore new architectures.",2,"2026-06-11 02:47:32","CREATED_QUERY"]