[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-80888":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":8,"htmlUrl":8,"language":9,"languages":8,"totalLinesOfCode":8,"stars":10,"forks":11,"watchers":12,"openIssues":13,"contributorsCount":13,"subscribersCount":13,"size":13,"stars1d":11,"stars7d":14,"stars30d":14,"stars90d":13,"forks30d":13,"starsTrendScore":15,"compositeScore":16,"rankGlobal":8,"rankLanguage":8,"license":17,"archived":18,"fork":18,"defaultBranch":19,"hasWiki":18,"hasPages":18,"topics":20,"createdAt":8,"pushedAt":8,"updatedAt":21,"readmeContent":22,"aiSummary":23,"trendingCount":13,"starSnapshotCount":13,"syncStatus":14,"lastSyncTime":24,"discoverSource":25},80888,"Lite-OPD","yedaotian9\u002FLite-OPD","yedaotian9",null,"Python",36,1,34,0,2,3,41.1,"MIT License",false,"main",[],"2026-06-12 04:01:30","# Lite-OPD: On-Policy Distillation\n\n[English](README.md) | [Chinese](README_zh.md)\n\nLite-OPD is an on-policy distillation training framework designed for research. The student model generates rollouts in real time, the teacher model scores them, and the student learns the teacher's distribution via KL divergence loss.\n\nSupported models: Qwen2.5 \u002F Qwen3 \u002F Llama 3.x \u002F Gemma 3\n\nSupported losses: forward KL \u002F reverse KL \u002F JSD (full vocabulary, not approximated)\n\n## Design Philosophy\n\nLite-OPD targets research workflows that require deep customization of the training loop. The framework prioritizes **modifiability** and **maintainability** over feature completeness.\n\nCore design choices:\n\n- **Single-process synchronous architecture**: Training and inference run in the same process with zero-copy weight sharing, eliminating communication overhead. No multi-worker coordination, no cross-node scheduling, no async state synchronization.\n- **Minimal abstraction**: No callback system, no plugin mechanism, no multi-layer configuration abstraction. 
The core training logic lives in a single file. To modify behavior, you edit the code directly — no need to understand framework extension points.\n- **Low hardware barrier**: A single GPU can run the full on-policy distillation loop (rollout → teacher scoring → student backward), suitable for resource-constrained lab environments.\n\n## Code Scale\n\nThe entire framework is ~9000 lines of Python and ~2100 lines of C\u002FCUDA kernels.\n\n```\nsrc\u002Fliteopd\u002F\n├── data\u002F       Data loading\n├── eval\u002F       Evaluation scoring\n├── losses\u002F     Distillation losses (forward KL \u002F reverse KL \u002F JSD)\n├── train\u002F      Training loop, ZeRO-2, sequence packing\n├── runtime\u002F    Inference engine lifecycle management\n└── inference\u002F  Embedded inference engine\n```\n\n## Training Flow\n\n```\n┌─────────────────────────────────────────────────────────────────┐\n│                        Training Loop                             │\n│                                                                 │\n│  1. Rollout (embedded inference engine)                         │\n│     - Student generates N responses                             │\n│     - Inference engine shares weights with training (zero-copy) │\n│                                                                 │\n│  2. Teacher Forward                                             │\n│     - Teacher computes logits over student responses            │\n│     - Sequence packing + torch.compile acceleration             │\n│                                                                 │\n│  3. Loss + Backward (two-stage chunk)                           │\n│     - Compute KL loss gradient w.r.t. hidden per chunk          │\n│     - Backpropagate gradients through student backbone at once  │\n│     - Peak memory = O(chunk_size), not O(total_response_tokens) │\n│                                                                 │\n│  4. 
Optimizer Step                                              │\n│     - Weight updates are instantly visible to inference engine  │\n│       (shared memory)                                           │\n│                                                                 │\n│  5. Repeat                                                      │\n└─────────────────────────────────────────────────────────────────┘\n```\n\n## Performance Comparison\n\n2× H20 96GB, `global_batch_size=128`, `reverse_kl`, 500 steps. Both frameworks use essentially the same training configuration — see [benchmarks\u002Fms-swift\u002F](benchmarks\u002Fms-swift\u002F) for details.\n\n\u003Ctable>\n\u003Ctr>\n\u003Ctd>\u003Cimg src=\"asset\u002Fbenchmark_qwen3_1b7_8b.png\" width=\"400\"\u002F>\u003C\u002Ftd>\n\u003Ctd>\u003Cimg src=\"asset\u002Fbenchmark_qwen25_1b5_7b.png\" width=\"400\"\u002F>\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003Ctr>\n\u003Ctd align=\"center\">Qwen3-1.7B → Qwen3-8B\u003Cbr\u002F>Lite-OPD ~374s\u002Fstep vs ms-swift ~474s\u002Fstep\u003C\u002Ftd>\n\u003Ctd align=\"center\">Qwen2.5-1.5B → Qwen2.5-7B\u003Cbr\u002F>Lite-OPD ~93s\u002Fstep vs ms-swift ~96s\u002Fstep\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003C\u002Ftable>\n\n## Acceleration Techniques\n\nSee [docs\u002Facceleration_techniques.md](docs\u002Facceleration_techniques.md) for details.\n\n**Lite-OPD-specific techniques:**\n\n| Technique | Benefit |\n|-----------|---------|\n| Zero-copy weight sharing | Eliminates weight sync overhead + saves one copy of model memory |\n| Two-stage chunk backward | Full-vocab KL with ~8x reduction in logits peak memory |\n| ZeRO-2 gradient buffer dynamic release | ~3GB\u002FGPU freed for KV cache during rollout |\n| Shortest-first scheduling | ~10% reduction in batch makespan |\n| VMM KV cache release | Frees all KV cache memory during training |\n| Teacher compile + bucket packing | Bounds compilation count, prevents memory growth |\n\n**Standard techniques:** Prefix cache (radix tree), Chunked 
prefill, Paged attention, CUDA graph, FlashInfer \u002F sgl_kernel, ZeRO-2 data parallelism, Sequence packing\n\n## Quick Start\n\n### Requirements\n\n- Python 3.10+\n- PyTorch 2.4+\n- CUDA 12.1+\n\n### Installation\n\n```bash\npip install -e .\n```\n\n### Training\n\n```bash\n# Single GPU\nCUDA_VISIBLE_DEVICES=0 NPROC_PER_NODE=1 bash scripts\u002Ftrain.sh configs\u002Fqwen25_1b5_7b.yaml\n\n# Multi-GPU (ZeRO-2)\nCUDA_VISIBLE_DEVICES=0,1 NPROC_PER_NODE=2 bash scripts\u002Ftrain.sh configs\u002Fqwen25_1b5_7b.yaml\n```\n\n### Configuration\n\nSee [configs\u002FREADME.md](configs\u002FREADME.md) for all configuration parameters.\n\nExample configs using bundled sample datasets are provided for quick validation:\n\n```bash\nCUDA_VISIBLE_DEVICES=0 NPROC_PER_NODE=1 bash scripts\u002Ftrain.sh configs\u002Fexample_qwen25_1b5_7b.yaml\n```\n\n## Documentation\n\n- [Configuration Reference](configs\u002FREADME.md)\n- [Acceleration Techniques](docs\u002Facceleration_techniques.md)\n- [Code Architecture](docs\u002Farchitecture.md)\n\n## Acknowledgements\n\nLite-OPD's inference engine is built upon ideas and code from the following open-source projects:\n\n- [SGLang](https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang) — High-performance LLM serving framework.\n- [mini-sglang](https:\u002F\u002Fgithub.com\u002FEvolvingLMMs-Lab\u002Fmini-sglang) — A minimal reimplementation of SGLang's core inference loop.\n- [FlashInfer](https:\u002F\u002Fgithub.com\u002Fflashinfer-ai\u002Fflashinfer) — High-performance kernels.\n","Lite-OPD is an on-policy distillation training framework designed for research. Its core functionality has a student model generate rollouts in real time and learn the teacher model's distribution through a KL divergence loss, with support for multiple models (such as Qwen2.5 and Qwen3) and several loss types (forward KL, reverse KL, JSD). The project uses a single-process synchronous architecture that eliminates multi-process communication overhead, and it keeps abstraction layers to a minimum so the code stays easy to modify and maintain, which suits research workflows that need deep customization of the training loop. Hardware requirements are also low: a single GPU can run the entire on-policy distillation loop, making it a good fit for resource-constrained research environments.","2026-06-11 04:02:41","CREATED_QUERY"]