[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-80115":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":12,"subscribersCount":12,"size":12,"stars1d":14,"stars7d":14,"stars30d":15,"stars90d":12,"forks30d":12,"starsTrendScore":16,"compositeScore":17,"rankGlobal":9,"rankLanguage":9,"license":18,"archived":19,"fork":19,"defaultBranch":20,"hasWiki":21,"hasPages":19,"topics":22,"createdAt":9,"pushedAt":9,"updatedAt":23,"readmeContent":24,"aiSummary":25,"trendingCount":12,"starSnapshotCount":12,"syncStatus":13,"lastSyncTime":26,"discoverSource":27},80115,"UniSD","Ahren09\u002FUniSD","Ahren09","Official implementation for \"Towards a Unified Self-Distillation Framework for Large Language Models\" (https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.06597).",null,"Python",110,0,2,1,52,3,47.7,"Apache License 2.0",false,"main",true,[],"2026-06-12 04:01:26","\u003Cdiv align=\"center\">\n\n# 🧬 UniSD\n\n### *A Unified Self-Distillation Framework for Large Language Models*\n\n[![Website](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🌐_Project-Website-2EA44F.svg)](https:\u002F\u002Funifiedsd.github.io\u002F)\n[![arXiv](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2605.06597-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.06597)\n[![HF 
Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🤗_Hugging_Face-Paper-FFD21E.svg)](https:\u002F\u002Fhuggingface.co\u002Fpapers\u002F2605.06597)\n[![License](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-Apache_2.0-blue.svg)](LICENSE)\n[![Python](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPython-3.12-3776AB.svg?logo=python&logoColor=white)](https:\u002F\u002Fwww.python.org\u002F)\n[![PyTorch](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPyTorch-2.9-EE4C2C.svg?logo=pytorch&logoColor=white)](https:\u002F\u002Fpytorch.org\u002F)\n[![Transformers](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🤗_Transformers-4.57-FFD21E.svg)](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftransformers)\n[![vLLM](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FvLLM-0.12-30A14E.svg)](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fvllm)\n\n[Yiqiao Jin](https:\u002F\u002Fahren09.github.io)¹\\*, [Yiyang Wang](https:\u002F\u002Fhello-diana.github.io\u002F)¹\\*, [Lucheng Fu](https:\u002F\u002Fluchengfu6.github.io\u002F)¹, [Yijia Xiao](https:\u002F\u002Fyijia-xiao.com\u002F)², [Yinyi Luo](https:\u002F\u002Fwww.linkedin.com\u002Fin\u002Fyinyi-luo-5b0805324\u002F)³,\n[Haoxin Liu](https:\u002F\u002Fscholar.google.com\u002Fcitations?user=8xaTRNsAAAAJ)¹, [B. Aditya Prakash](https:\u002F\u002Ffaculty.cc.gatech.edu\u002F~badityap\u002F)¹, [Josiah Hester](https:\u002F\u002Fjosiahhester.com\u002F)¹, [Jindong Wang](https:\u002F\u002Fjd92.wang\u002F)⁴†, [Srijan Kumar](https:\u002F\u002Ffaculty.cc.gatech.edu\u002F~srijan\u002F)¹†\n\n¹ Georgia Institute of Technology · ² UCLA · ³ Carnegie Mellon University · ⁴ William & Mary\n\n\u003Csub>\\* Equal contribution &nbsp;·&nbsp; † Corresponding authors\u003C\u002Fsub>\n\n\u003C\u002Fdiv>\n\n---\n\n## 📖 Abstract\n\nSelf-distillation (SD) offers a promising path for adapting large language models (LLMs) without relying on stronger external teachers. 
However, SD in autoregressive LLMs remains challenging because self-generated trajectories are free-form, correctness is task-dependent, and plausible rationales can still provide unstable or unreliable supervision. Existing methods mainly examine isolated design choices, leaving their effectiveness, roles, and interactions unclear. In this paper, we propose **UniSD**, a **Uni**fied framework to systematically study **S**elf-**D**istillation. UniSD integrates complementary mechanisms that address supervision reliability, representation alignment, and training stability, including multi-teacher agreement, EMA teacher stabilization, token-level contrastive learning, feature matching, and divergence clipping. Across six benchmarks and six models from three model families, UniSD reveals when self-distillation improves over static imitation, which components drive the gains, and how these components interact across tasks. Guided by these insights, we construct **UniSD\\***, an integrated pipeline that combines complementary components and achieves the strongest overall performance, improving over the base model by +5.4 and the strongest baseline by +2.8. 
Extensive evaluation highlights self-distillation as a practical and steerable approach for efficient LLM adaptation without stronger external teachers.\n\n## ✨ Highlights\n\n- 🧩 **Unified framework** spanning the three axes of self-distillation: supervision reliability, representation alignment, and training stability.\n- 🔬 **Five complementary mechanisms** studied in isolation *and* in combination across **6 benchmarks × 6 models × 3 model families**.\n- 🏆 **UniSD\\*** — the integrated recipe — achieves the strongest overall performance using **only self-derived supervision**, no stronger external teacher required.\n\n## 🧩 The UniSD Framework\n\nUniSD is built from five complementary mechanisms that can be enabled independently or composed into the integrated **UniSD\\*** recipe.\n\n| Component | `--mode` | Key flag(s) |\n| :--- | :--- | :--- |\n| 🤝 **Multi-Teacher Agreement** *(sequence-level)* | `agreement_seq_{random,retrieval,induction}` | `--num-auxiliary-contexts`, `--gamma_agreement` |\n| 🎯 **Multi-Teacher Agreement** *(token-level)* | `agreement_tok_{random,retrieval,induction}` | `--num-auxiliary-contexts`, `--gamma_agreement`, `--agreement_stat` |\n| 🌊 **EMA Teacher Stabilization** | `ema` | `--ref_model_sync_steps`, `--ref_model_mixup_beta` |\n| ⚖️ **Token-Level Contrastive Learning** | `contrastive` | `--contrastive_weight`, `--contrastive_margin` |\n| 🧠 **Feature Matching** | `match_joint` \u002F `match_repr` | `--final_layer_distill_weight` |\n| ✂️ **Divergence Clipping** *(JSD-Clip)* | `clip` | `--alpha`, `--token_clip` |\n| ⭐ **UniSD\\*** *(integrated)* | `unisd_star` | combines EMA + matching + contrastive + agreement |\n\n## 🚀 Installation\n\nUniSD targets **Python 3.12 + CUDA 12.8** (cu128 wheels). 
The install has a few prerequisite steps before the final `pip install -r requirements.txt`, because (a) PyTorch's cu128 build lives on the PyTorch wheel index and (b) flash-attention-2 must be compiled against the installed torch.\n\n```bash\n# 1) Create and activate the env\nconda create -n unisd python=3.12 -y\nconda activate unisd\npip install -U pip setuptools wheel packaging ninja\n\n# 2) Install cu128 PyTorch from the PyTorch wheel index (must precede flash-attn build)\npip install --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu128 \\\n    torch==2.11.0 torchvision==0.26.0 torchaudio==2.11.0\n\n# 3) Point flash-attn's CUDA build at a 12.x toolkit\n#    (on many hosts \u002Fusr\u002Flocal\u002Fcuda → 13.x, which mismatches torch's cu128 ABI)\nexport CUDA_HOME=\u002Fusr\u002Flocal\u002Fcuda-12.6\n\n# 4) Install everything else — flash-attn builds from source here (~20 min the first time)\npip install -r requirements.txt --no-build-isolation\n```\n\n> 💡 **Don't have `\u002Fusr\u002Flocal\u002Fcuda-12.6`?** Any CUDA 12.x toolkit (12.4–12.8) works. Run `ls -d \u002Fusr\u002Flocal\u002Fcuda-12*` to see what's available and set `CUDA_HOME` to that path.\n\n> ⚠️ **trl ↔ vLLM compatibility**: this environment ships `trl==1.4.0` (officially supports vLLM 0.12.0–0.18.0) with `vllm==0.20.2`. The combination works in our smoke tests but trl will print a warning at import time. 
If you hit a runtime error from `VLLMClient`, pin `vllm\u003C0.19`.\n\n### Verify the install\n\n```bash\npython -c \"\nimport torch, vllm, flash_attn, flashinfer\nprint('torch       ', torch.__version__, 'cuda_ok:', torch.cuda.is_available())\nprint('vllm        ', vllm.__version__)\nprint('flash_attn  ', flash_attn.__version__)\nprint('flashinfer  ', flashinfer.__version__)\n\"\n```\n\nOptional environment variables: `WANDB_API_KEY` (logging), `HF_TOKEN` (gated models).\n\n## ⚡ Quick Start\n\nUniSD provides **two ways** to launch training: a high-level orchestrator with sane defaults, and a direct command for full per-flag control.\n\n### Option 1 — Preset orchestrator *(preferred)*\n\n`scripts\u002Frun_experiments.py` handles GPU scheduling, dependency-aware sweeps, and sensible defaults.\n\n```bash\n# Template\npython scripts\u002Frun_experiments.py \u003CSUBCOMMAND> [--gpus \u003CGPU_IDS>] [subcommand-flags...]\n\n# Example: token-level contrastive learning\npython scripts\u002Frun_experiments.py contrastive --weight 0.1 --margin 0.5\n```\n\n> 💡 Run `python scripts\u002Frun_experiments.py --dry-run` to preview every job before launch.\n\n### Option 2 — Direct command\n\n`python -m src.train.train_unisd` exposes every UniSD flag for fine-grained control.\n\n```bash\n# Template\npython -m src.train.train_unisd \\\n    --mode \u003CMODE> --dataset \u003CDATASET> \\\n    --model_name \u003CMODEL> \\\n    --per_device_train_batch_size \u003CBATCH> \\\n    --num-auxiliary-contexts \u003CN> \\\n    --use_vllm\n\n# Example: token-level contrastive on MBPP with Qwen2.5-7B\npython -m src.train.train_unisd \\\n    --mode contrastive --dataset mbpp \\\n    --model_name Qwen\u002FQwen2.5-7B-Instruct \\\n    --per_device_train_batch_size 4 \\\n    --contrastive_weight 0.1 --use_vllm\n```\n\n### Valid placeholder values\n\n| Placeholder | Values |\n| :--- | :--- |\n| `\u003CSUBCOMMAND>` | `agreement`, `ema`, `contrastive`, `match_joint`, `match_repr`, `clip`, `unisd_star` 
*(= UniSD\\*)*, `induction` |\n| `\u003CMODE>` | `agreement_{seq,tok}_{random,retrieval,induction}`, `ema`, `contrastive`, `match_joint`, `match_repr`, `clip`, `unisd_star` |\n| `\u003CDATASET>` | `mbpp`, `tooluse`, `scienceqa`, `cos_e`, `medmcqa` *(eval-only: `gpqa`, `humaneval`)* |\n| `\u003CMODEL>` | Qwen2.5 (0.5B\u002F1.5B\u002F3B\u002F7B-Instruct), Llama-3.1-8B-Instruct, Gemma-3-4B-IT, InternLM3-8B-Instruct |\n\n### One-time cache prep\n\nA few modes require a one-time cache build:\n\n- 🔻 **`contrastive` and `unisd_star`** need a negative-demonstration cache:\n  ```bash\n  python -m src.teacher.negative_demonstrations \\\n      --model_name Qwen\u002FQwen2.5-7B-Instruct --dataset mbpp\n  ```\n- 🪄 **`agreement_*_induction`** modes need an induction cache:\n  ```bash\n  python scripts\u002Frun_experiments.py induction --num-demos 5\n  ```\n- ✅ **`random` and `retrieval`** agreement modes need no prep — embeddings auto-build on first run.\n\n## 📊 Datasets\n\nUniSD is evaluated across **six benchmarks** spanning four task families.\n\n| Dataset | Role | Task |\n| :--- | :--- | :--- |\n| 🔬 **ScienceQA** | train + eval | Scientific reasoning |\n| 💻 **MBPP** | train + eval | Code generation |\n| 💭 **CoS-E** | train + eval | Commonsense reasoning |\n| 🛠️ **ToolAlpaca** | train + eval | Tool usage |\n| 🎓 **GPQA** | OOD eval | Scientific reasoning |\n| 🧪 **HumanEval** | OOD eval | Code generation |\n\n## 🤖 Supported Models\n\nUniSD supports models from the following families:\n\n- **Qwen2.5** — 0.5B \u002F 1.5B \u002F 3B \u002F 7B-Instruct *(default: `Qwen\u002FQwen2.5-7B-Instruct`)*\n- **Llama-3.1** — 8B-Instruct\n- **Gemma-3** — 4B-IT\n- **InternLM3** — 8B-Instruct\n\n## 🧪 Evaluation\n\nEvaluation entry points live under `src\u002Feval\u002F`:\n\n```bash\n# Code generation (MBPP \u002F HumanEval)\npython -m src.eval.eval_code   --mode \u003CMODE> --dataset humaneval \\\n    --model_name_or_path \u003CCKPT_OR_HF_ID>\n\n# Multiple-choice QA (ScienceQA \u002F GPQA \u002F 
CoS-E \u002F MedMCQA)\npython -m src.eval.eval_mcqa   --mode \u003CMODE> --dataset gpqa \\\n    --model_name_or_path \u003CCKPT_OR_HF_ID>\n\n# Tool usage (ToolAlpaca)\npython -m src.eval.eval_tooluse --mode \u003CMODE> --dataset tooluse \\\n    --model_name_or_path \u003CCKPT_OR_HF_ID>\n```\n\n## 📝 Citation\n\nIf you find UniSD useful in your research, please cite:\n\n```bibtex\n@article{jin2026unisd,\n  title={UniSD: Towards a Unified Self-Distillation Framework for Large Language Models},\n  author={Jin, Yiqiao and Wang, Yiyang and Fu, Lucheng and Xiao, Yijia and Luo, Yinyi and Liu, Haoxin and Prakash, B Aditya and Hester, Josiah and Wang, Jindong and Kumar, Srijan},\n  journal={arXiv preprint arXiv:2605.06597},\n  year={2026}\n}\n```\n\n## 🙏 Acknowledgements\n\nUniSD is built on top of excellent open-source work from the community:\n[🤗 Transformers](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftransformers) ·\n[🤗 TRL](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl) ·\n[vLLM](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fvllm) ·\n[DeepSpeed](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FDeepSpeed) ·\n[PEFT](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fpeft) ·\n[Accelerate](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate).\n\n## ⚖️ License\n\nThis project is released under the [Apache License 2.0](LICENSE).\n","UniSD is a unified self-distillation framework designed for large language models. The project integrates several mechanisms to address supervision reliability, representation alignment, and training stability, including multi-teacher agreement, EMA teacher stabilization, token-level contrastive learning, feature matching, and divergence clipping. It suits scenarios that require adapting or improving large language models without a stronger external teacher, such as model compression and acceleration in natural language processing tasks. Built in Python with support from PyTorch and the Transformers library, it lets researchers readily reproduce the paper's experimental results and explore how the different components interact.","2026-06-11 03:59:18","CREATED_QUERY"]