[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-80769":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":8,"htmlUrl":8,"language":9,"languages":8,"totalLinesOfCode":8,"stars":10,"forks":11,"watchers":12,"openIssues":13,"contributorsCount":13,"subscribersCount":13,"size":13,"stars1d":14,"stars7d":15,"stars30d":16,"stars90d":13,"forks30d":13,"starsTrendScore":17,"compositeScore":18,"rankGlobal":8,"rankLanguage":8,"license":19,"archived":20,"fork":20,"defaultBranch":21,"hasWiki":22,"hasPages":20,"topics":23,"createdAt":8,"pushedAt":8,"updatedAt":24,"readmeContent":25,"aiSummary":26,"trendingCount":13,"starSnapshotCount":13,"syncStatus":14,"lastSyncTime":27,"discoverSource":28},80769,"ROPD_official","Peregrine123\u002FROPD_official","Peregrine123",null,"Python",52,4,40,0,2,10,12,6,49.3,"Apache License 2.0",false,"main",true,[],"2026-06-12 04:01:30","# ROPD: Rubric-based On-policy Distillation\n\nThis repository contains the official implementation of **ROPD**, a rubric-based on-policy distillation framework for scalable black-box distillation of large language models.\n\n> **Paper:** Rubric-based On-policy Distillation  \n> **Method:** Rubric-based On-policy Distillation (ROPD)  \n> **Codebase:** ROPD trainer built on a customized [verl](https:\u002F\u002Fgithub.com\u002Fvolcengine\u002Fverl) backend\n\n## Overview\n\nOn-policy distillation (OPD) is an effective paradigm for transferring capabilities from a teacher model to a student policy. Conventional OPD methods, however, typically depend on access to the teacher's token-level logits. This white-box requirement limits their applicability to proprietary frontier models and complicates distillation across heterogeneous architectures.\n\nROPD studies a complementary direction:\n\n> **Can we preserve the on-policy nature of OPD without accessing teacher logits?**\n\nWe answer this question with **rubric-based on-policy distillation**. 
Instead of supervising student trajectories with token-level probability distributions, ROPD converts teacher-student behavioral differences into prompt-specific semantic rubrics. These rubrics are then used to score student rollouts and provide rewards for on-policy optimization.\n\n## Key Idea\n\nROPD replaces token-level imitation with structured semantic supervision.\n\nFor each input prompt:\n\n1. The student policy generates multiple on-policy rollouts.\n2. The teacher provides one or more reference responses.\n3. A **Rubricator** induces prompt-specific rubrics by contrasting teacher and student responses.\n4. A **Verifier** evaluates each rollout against the induced rubric.\n5. The weighted rubric score is used as the reward for GRPO-style policy optimization.\n\nThis design allows ROPD to operate in black-box teacher settings, requiring only textual teacher responses rather than logits, hidden states, or tokenizer alignment.\n\n## Method\n\n### Problem Setting\n\nLet $x$ denote an input prompt, $\\pi_T$ a teacher model, and $\\pi_\\theta$ a trainable student policy. Traditional OPD supervises student-generated trajectories using the teacher's next-token distribution. In contrast, ROPD assumes that only teacher-generated text is available.\n\nThe goal is to construct an effective reward signal from black-box teacher interactions while retaining the on-policy training dynamics of OPD.\n\n### Rubric Induction\n\nGiven a prompt $x$, ROPD collects:\n\n- **Teacher responses** $\\mathcal{Y}^T_x$\n- **Student rollouts** $\\mathcal{Y}^S_x$\n\nA Rubricator then produces a prompt-specific rubric set:\n\n$$\n\\mathcal{C}_x = \\{c_k\\}_{k=1}^{K}\n$$\n\nwhere each rubric item contains a textual criterion and an importance weight. 
These rubrics are shared across all student rollouts for the same prompt, providing a consistent group-level reward signal.\n\n### Rubric-based Verification\n\nFor each student rollout, the Verifier determines whether the response satisfies each rubric criterion. The final rollout score is computed as a weighted pass rate:\n\n$$\ns_i =\n\\frac{\n\\sum_{k=1}^{K} w_k v_{i,k}\n}{\n\\sum_{k=1}^{K} w_k + \\epsilon\n}\n$$\n\nwhere $v_{i,k} \\in \\{0,1\\}$ indicates whether rollout $i$ satisfies criterion $k$. This score is used as the reward for on-policy optimization.\n\n## Why Rubrics?\n\nROPD reframes distillation from token-level imitation to semantic principle transfer.\n\nCompared with logit-based OPD, rubric-based OPD offers several practical and conceptual advantages:\n\n- **Black-box compatibility**  \n  ROPD only requires teacher text outputs, making it applicable to proprietary API-based teachers.\n\n- **Cross-architecture flexibility**  \n  ROPD does not require aligned tokenizers, shared vocabularies, or comparable output distributions.\n\n- **Improved sample efficiency**  \n  By filtering out surface-form token-level noise, rubrics emphasize task-level reasoning principles and can substantially improve data utilization.\n\n- **Interpretable supervision**  \n  The reward signal is decomposed into explicit criteria, making the training signal easier to inspect, debug, and analyze.\n\n## Main Results\n\nROPD is evaluated across mathematical reasoning, scientific reasoning, medical reasoning, and instruction-following benchmarks.\n\nThe evaluation suite includes:\n\n- **Mathematics:** AIME 2024, AIME 2025, HMMT 2025\n- **Science:** GPQA-Diamond\n- **Medical reasoning:** HealthBench\n- **Instruction following:** IFEval\n\nThe experiments study both black-box and white-box teacher settings, including teacher-student pairs based on Qwen, Gemma, and GPT-series models.\n\n### Black-box Distillation\n\nIn black-box scenarios, ROPD consistently outperforms 
representative black-box distillation baselines, including static teacher-output SFT, Teacher-as-Judge rewards, OVD, and GAD.\n\n### White-box Comparison\n\nEven when teacher logits are available, ROPD remains competitive with, and often surpasses, advanced logit-based OPD methods while still using only textual teacher responses.\n\n### Sample Efficiency\n\nROPD achieves up to a **10x improvement in sample efficiency**, suggesting that high-level semantic rubrics can provide more useful supervision than dense token-level logits for complex reasoning tasks.\n\n## Repository Structure\n\n```text\n.\n├── algo\u002F\n│   ├── ropd\u002F                    # Core ROPD implementation\n│   │   ├── client.py            # Rubricator and verifier clients\n│   │   ├── prompts.py           # Prompt rendering utilities\n│   │   └── reward_manager.py    # ROPD reward manager\n│   ├── ropd_pipeline.py         # Rollout and group bookkeeping\n│   ├── ropd_scheduler.py        # Bounded judge request scheduler\n│   ├── ropd_teacher_index.py    # Offline teacher response index\n│   └── openai_env.py            # OpenAI-compatible environment resolution\n├── prompts\u002F\n│   ├── rubricator.txt           # Rubric induction template\n│   └── verifier.txt             # Rubric verification template\n├── verl\u002F                        # Vendored\u002Fcustomized verl trainer\n├── training\u002F\n│   ├── train.sh                 # Training entry point\n│   └── launch_judge_vllm.sh     # Optional local vLLM judge server\n├── pyproject.toml               # Dependency and package configuration\n└── README.md\n```\n\n## Installation\n\nROPD uses `uv` and `pyproject.toml` for dependency management.\n\n```bash\ncurl -LsSf https:\u002F\u002Fastral.sh\u002Fuv\u002Finstall.sh | sh\nuv sync --extra math --extra gpu-generic\n```\n\n## Configuration\n\nBefore training, configure the environment variables:\n\n```bash\ncp .env.example .env\n```\n\nImportant variables include:\n\n- `ROPD_MODEL_PATH`\n- 
`ROPD_TEACHER_INDEX_PATH`\n- `OPENAI_API_KEY`\n- `OPENAI_BASE_URL`\n- `ROPD_RUBRICATOR_MODEL`\n- `ROPD_VERIFIER_MODEL`\n- `ROPD_TEACHER_MODEL`\n\n## Training\n\nTo launch ROPD training:\n\n```bash\nbash training\u002Ftrain.sh\n```\n\nThis wraps:\n\n```bash\nuv run --no-sync python -m verl.trainer.main_ppo --config-name ropd\n```\n\nHydra overrides can be passed directly:\n\n```bash\nbash training\u002Ftrain.sh \\\n    trainer.total_training_steps=100 \\\n    data.train_batch_size=16 \\\n    actor_rollout_ref.rollout.n=4\n```\n\n## Teacher Response Index\n\nROPD supports an offline teacher-response index to decouple teacher generation from policy optimization. The default training configuration uses an offline teacher provider, where teacher answers are keyed by prompt fingerprints and replayed deterministically during training.\n\nThis design makes it possible to:\n\n- run teacher inference outside the training loop;\n- reduce GPU memory pressure during policy optimization;\n- reproduce training rewards from a fixed teacher-response cache.\n\n## Prompt Templates\n\nROPD uses two core prompt templates:\n\n- **`prompts\u002Frubricator.txt`**  \n  Generates prompt-specific rubrics from teacher-student contrasts.\n\n- **`prompts\u002Fverifier.txt`**  \n  Scores each response against the generated rubric.\n\nThese templates implement the Rubricator-Verifier pipeline described in the paper and are central to the method.\n\n## Reproducing Paper Experiments\n\nThe main experiments use:\n\n- GRPO optimization\n- learning rate $1 \\times 10^{-6}$\n- batch size 32\n- 8 student rollouts per prompt\n- 4 teacher references\n- 4 to 12 rubric items\n\nTraining data:\n\n- DAPO-Math-17K\n- RaR-Science-20K\n- RaR-Medical-20K\n\nEvaluation settings:\n\n- temperature 1.0\n- top-p 0.95\n- 16 samples per problem\n- maximum generation length 32,768 tokens\n\n## Citation\n\nIf you find ROPD useful, please cite:\n\n```bibtex\n@article{ropd2026,\n  title   = {Rubric-based On-policy 
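Distillation},\n  author  = {Fang, Junfeng and Hong, Zhepei and Zheng, Mao and Song, Mingyang and Li, Gengsheng and Jiang, Houcheng and Zhang, Dan and Guo, Haiyun and Wang, Xiang and Chua, Tat-Seng},\n  journal = {arXiv preprint},\n  year    = {2026}\n}\n```\n\n## Acknowledgements\n\nThis repository builds upon [verl](https:\u002F\u002Fgithub.com\u002Fvolcengine\u002Fverl). We thank the open-source community for making scalable reinforcement learning training infrastructure available.\n\n## License\n\nApache License 2.0. See [`LICENSE`](LICENSE).\n","ROPD is a rubric-based on-policy distillation framework for scalable black-box distillation of large language models. Its core idea is to guide the student's learning by converting behavioral differences between the teacher and student models into prompt-specific semantic rubrics, without access to the teacher's token-level logits, which enables knowledge transfer in black-box settings. Technically, ROPD uses a Rubricator to generate rubrics specific to each input prompt and a Verifier to evaluate how well student outputs satisfy them; the resulting score serves as the reward signal for optimizing the student policy. This design is particularly suited to scenarios where knowledge must be transferred without direct access to the internals of a complex or proprietary teacher model, such as cross-architecture training or replicating frontier-model capabilities in commercial settings.","2026-06-11 04:01:57","CREATED_QUERY"]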