[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-80985":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":16,"stars7d":17,"stars30d":18,"stars90d":15,"forks30d":15,"starsTrendScore":13,"compositeScore":19,"rankGlobal":10,"rankLanguage":10,"license":20,"archived":21,"fork":21,"defaultBranch":22,"hasWiki":21,"hasPages":21,"topics":23,"createdAt":10,"pushedAt":10,"updatedAt":24,"readmeContent":25,"aiSummary":26,"trendingCount":15,"starSnapshotCount":15,"syncStatus":27,"lastSyncTime":28,"discoverSource":29},80985,"ZEDA","TsinghuaC3I\u002FZEDA","TsinghuaC3I","Post-Trained MoE Can Skip Half Experts via Self-Distillation","https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.18643",null,"Python",35,3,30,0,1,4,5,1.81,"MIT License",false,"main",[],"2026-06-12 02:04:09","\u003Cdiv align=\"center\">\n\n# Post-Trained MoE Can Skip Half Experts via Self-Distillation\n\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.18643)  [![Github](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FZEDA-000000?style=for-the-badge&logo=github&logoColor=white)](https:\u002F\u002Fgithub.com\u002FTsinghuaC3I\u002FZEDA) [![HuggingFace](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FHuggingFace-%23FFD14D?style=for-the-badge&logo=huggingface&logoColor=black)](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002FTsinghuaC3I\u002Fzeda)\n\n\u003C\u002Fdiv>\n\n\u003Cdiv align=\"center\" style=\"font-family: Arial, sans-serif;\">\n  \u003Cp>\n    \u003Ca href=\"#news\" style=\"text-decoration: none; font-weight: bold;\">🎉 News\u003C\u002Fa> •\n    \u003Ca href=\"#introduction\" style=\"text-decoration: none; font-weight: bold;\">📖 Introduction\u003C\u002Fa> •\n    
\u003Ca href=\"#zeda\" style=\"text-decoration: none; font-weight: bold;\">✨ ZEDA\u003C\u002Fa>\n  \u003C\u002Fp>\n  \u003Cp>\n    \u003Ca href=\"#getting-started\" style=\"text-decoration: none; font-weight: bold;\">🚀 Getting Started\u003C\u002Fa> •\n    \u003Ca href=\"#main-results\" style=\"text-decoration: none; font-weight: bold;\">📊 Main Results\u003C\u002Fa> •\n    \u003Ca href=\"#acknowledgements\" style=\"text-decoration: none; font-weight: bold;\">💖 Acknowledgements\u003C\u002Fa> •\n    \u003Ca href=\"#contact\" style=\"text-decoration: none; font-weight: bold;\">📨 Contact\u003C\u002Fa> •\n    \u003Ca href=\"#citation\" style=\"text-decoration: none; font-weight: bold;\">🎈 Citation\u003C\u002Fa>\n  \u003C\u002Fp>\n\u003C\u002Fdiv>\n\n> Fully trained Mixture-of-Experts (MoE) models are expensive to serve. Dynamic variants of MoE reduce computation by adjusting the activated experts in an input-dependent manner, but most existing dynamic MoE methods rely on pre-training from scratch or task-specific adaptation.\n> \n> **In this paper, we introduce ZEDA, a low-cost framework that transforms post-trained static MoE models into efficient dynamic ones, eliminating over 50% of expert FLOPs at marginal accuracy loss.**\n\n# 🎉News\n\n- **[2026-05-19]** We introduce **Zero-Expert Self-Distillation Adaptation (ZEDA)**.\n\n# 📖Introduction\n\nWe introduce **Zero-Expert Self-Distillation Adaptation (ZEDA)**, a low-cost framework that transforms post-trained static MoE models into efficient dynamic ones without substantially sacrificing their established capabilities. ZEDA targets the practical deployment scenario where MoE models have already undergone expensive pre-training and post-training, and further inference-cost reduction is desired after the main training pipeline is finalized. 
\n\nTo stabilize this architectural conversion, ZEDA injects parameter-free zero-output experts into each MoE layer and adapts the augmented model through **two-stage self-distillation**, utilizing the original MoE as a frozen teacher and applying a **group-level balancing loss**. \nOn Qwen3-30B-A3B and GLM-4.7-Flash across 11 benchmarks spanning math, code, and instruction following, ZEDA eliminates over 50% of expert FLOPs at marginal accuracy loss. It outperforms the strongest dynamic MoE baseline by 6.1 and 4.0 points on the two models, and delivers ~1.20× end-to-end inference speedup.\n\n\u003Cp align=\"center\">\n   \u003Cimg src=\"figs\u002Fzeda.png\" alt=\"Overview of Unified Post-Training Framework.\" style=\"width: 100%;\">\n\u003C\u002Fp>\n\n\n\n# ✨ZEDA\n\n> ZEDA first injects zero experts into a post-trained MoE, architecturally converting it into a dynamic one, and then adapts it through two-stage self-distillation with the original MoE as a fixed teacher.\n\nZEDA introduces parameterless zero experts, whose outputs are identically zero, into the existing expert pool of a post-trained MoE model. This expands the router candidate pool with zero-computation experts while the activation number remains unchanged, naturally reducing active normal experts. The augmented model is then adapted through a two-stage self-distillation process:\n   - **SFT Stage**: Trains the student on responses sampled from the teacher (original MoE).\n   - **OPD Stage**: Shifts to on-policy learning, where responses are sampled from the current student and the teacher supplies token-level targets via reverse KL.\n\nZEDA incorporates the Group Auxiliary Loss $\\mathcal{L}_{GA}$ to regulate the relative activation frequency between normal experts and zero experts, while preserving the learned routing structures among normal experts. 
The loss is defined as:\n\n```math\n\\mathcal{L}_{GA} = \\alpha \\cdot \\frac{N + N_Z \\cdot w}{K} \\cdot \\left( \\frac{f_{\\mathcal{E}} \\cdot P_{\\mathcal{E}}}{N} + \\frac{f_{\\mathcal{Z}} \\cdot P_{\\mathcal{Z}}}{N_Z \\cdot w} \\right)\n```\n\n\n# 🚀Getting Started\n\nTo run ZEDA, follow these steps:\n\n### Env Setup\n\nZEDA is built upon large-scale MoE training and serving codebases, including [slime](https:\u002F\u002Fgithub.com\u002FTHUDM\u002Fslime), [SGLang](https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang), and [Megatron](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FMegatron-LM). Please use the Docker image [`slimerl\u002Fslime:20251113-v1`](https:\u002F\u002Fhub.docker.com\u002Fr\u002Fslimerl\u002Fslime) released by [slime](https:\u002F\u002Fgithub.com\u002FTHUDM\u002Fslime):\n```bash\n# Pull the image\ndocker pull slimerl\u002Fslime:20251113-v1\n\n# Start the container from the image pulled above\ndocker run --rm --gpus all --ipc=host --shm-size=16g \\\n  --ulimit memlock=-1 --ulimit stack=67108864 \\\n  -it slimerl\u002Fslime:20251113-v1 \u002Fbin\u002Fbash\n```\n\n\nAfter pulling the image and starting the container, install our modified versions of SGLang, transformers, and slime:\n```bash\ncd zeda\u002Fsglang\u002Fpython\npip install -e . --no-deps\ncd ..\npatch -p1 \u003C ..\u002Fslime\u002Fdocker\u002Fpatch\u002Flatest\u002Fsglang.patch\n\ncd ..\u002Ftransformers\npip install -e . --no-deps\n\ncd ..\u002Fslime\npip install -e . --no-deps\n```\n\n### Data Preparation\n\nZEDA uses 60k prompts including math, code, and chat data, and the corresponding self-distillation rollouts. \n  - **Prompts**: The prompts are used for rollout and OPD. 
The prompts are chosen from [AceReason-1.1-SFT](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fnvidia\u002FAceReason-1.1-SFT) and [Llama-Nemotron-Post-Training-Dataset](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fnvidia\u002FLlama-Nemotron-Post-Training-Dataset), and we release them in [ZEDA-prompts-60k](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FTsinghuaC3I\u002FZEDA\u002Fblob\u002Fmain\u002FZEDA-prompts-60k.jsonl).\n  - **Rollouts**: The rollouts are used for SFT. You need to use the specific post-trained MoE model intended for adaptation to perform the rollout. You can also directly use our released rollout results [ZEDA-Qwen3-30B-A3B-rollout-60k](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FTsinghuaC3I\u002FZEDA\u002Fblob\u002Fmain\u002FZEDA-Qwen3-30B-A3B-rollout-60k.jsonl) and [ZEDA-GLM-4.7-Flash-rollout-60k](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FTsinghuaC3I\u002FZEDA\u002Fblob\u002Fmain\u002FZEDA-GLM-4.7-Flash-rollout-60k.jsonl).\n\nAfter downloading the data, place the files in the `data` folder.\n\n\n### Model Preparation\n\nAfter downloading the specific post-trained MoE model intended for adaptation from Hugging Face, first convert the model into a dynamic one through zero-expert injection:\n```bash\npython scripts\u002Fconvert-hf-to-ZCE.py --input-dir path_Qwen3-30B-A3B --output-dir path_Qwen3-30B-A3B-dynamic --new-num-experts 192 # for Qwen3-30B-A3B\n\npython scripts\u002Fconvert-hf-to-ZCE.py --input-dir path_GLM-4.7-Flash --output-dir path_GLM-4.7-Flash-dynamic --new-num-experts 96 # for GLM-4.7-Flash\n```\n\nThen edit the `config.json` in the output dynamic MoE directory: update `\"architectures\"` and add `\"use_zce_mask\"`, `\"zce_nums\"`, and `\"zce_types\"`. 
For Qwen3-30B-A3B:\n```json\n\"architectures\": [\"Qwen3MoePlusPlusForCausalLM\"],\n\"use_zce_mask\": false,\n\"zce_nums\": [64],\n\"zce_types\": [\"zero\"]\n```\n\nFor GLM-4.7-Flash:\n```json\n\"architectures\": [\"Glm4MoeLitePlusPlusForCausalLM\"],\n\"use_zce_mask\": false,\n\"zce_nums\": [32],\n\"zce_types\": [\"zero\"]\n```\n\nFinally, convert the dynamic MoE into the format compatible with Megatron:\n```bash\nbash scripts\u002Fconvert_hf_to_torch_dist.sh\n```\n\n\n\n### Training\nZEDA consists of zero-expert injection, SFT, and OPD. You can run the following scripts to start the adaptation pipeline:\n\n```bash\n# For Qwen3-30B-A3B\nbash scripts\u002Ftrain_zeda_qwen_sft.sh # SFT\nbash scripts\u002Fconvert_torch_dist_to_hf.sh # Convert Model\nbash scripts\u002Frun_teacher_server.sh # Start Teacher Server\nbash scripts\u002Ftrain_zeda_qwen_opd.sh # After the teacher server starts, run OPD\n\n# For GLM-4.7-Flash\nbash scripts\u002Ftrain_zeda_glm_sft.sh # SFT\nbash scripts\u002Fconvert_torch_dist_to_hf.sh # Convert Model\nbash scripts\u002Frun_teacher_server.sh # Start Teacher Server\nbash scripts\u002Ftrain_zeda_glm_opd.sh # After the teacher server starts, run OPD\n```\n\n### Evaluation\n\nWe provide a unified evaluation pipeline covering the released math, code, instruction-following, and science benchmarks. 
To reproduce the reported results, first download the benchmark data from Hugging Face to `evaluation\u002Fbenchmark`:\n\n```bash\ncd ZEDA\nhf download TsinghuaC3I\u002FZEDA-Evaluation \\\n  --repo-type dataset \\\n  --local-dir evaluation\u002Fbenchmark\n```\n\nThen install the required evaluation dependencies:\n\n```bash\npip install -r requirements.txt\n```\n\nNext, configure `evaluation\u002Frun_sglang_server.sh` by specifying:\n\n- `MODEL_PATH`: path to the evaluated model\n- `TP_SIZE`: tensor parallel size\n- `PORT`: server port\n\nLaunch the SGLang server with:\n\n```bash\nbash evaluation\u002Frun_sglang_server.sh\n```\n\nOnce the server is ready, configure `evaluation\u002Frun_evaluation.sh` by setting:\n\n- `MODEL_NAME`: model name used in output filenames\n- `MODEL_PATH`: path to the evaluated model\n- `SERVER_URL`: server endpoint, for example `http:\u002F\u002F0.0.0.0:$PORT\u002Fgenerate`\n\nThen start the evaluation pipeline:\n\n```bash\nbash evaluation\u002Frun_evaluation.sh\n```\n\nThe script iterates over all released benchmarks, stores raw generations in `evaluation\u002Fraw_output\u002F`, and finally invokes `compute_reward.py` to aggregate the evaluation metrics.\n\n### Models and Datasets\n\nWe release our adapted dynamic MoE models and rollout data in Huggingface:\n\n| **Model**                          | **Huggingface** |  **Base Model** |\n|-----------------------------------|------------------|------------------|\n| ZEDA-Qwen3-30B-A3B-Dynamic | [TsinghuaC3I\u002FZEDA-Qwen3-30B-A3B-Dynamic](https:\u002F\u002Fhuggingface.co\u002FTsinghuaC3I\u002FZEDA-Qwen3-30B-A3B-Dynamic) |  Qwen3-30B-A3B |\n| ZEDA-GLM-4.7-Flash-Dynamic | [TsinghuaC3I\u002FZEDA-GLM-4.7-Flash-Dynamic](https:\u002F\u002Fhuggingface.co\u002FTsinghuaC3I\u002FZEDA-GLM-4.7-Flash-Dynamic) | GLM-4.7-Flash |\n\n\n| **Rollout Data**                          | **Huggingface** |\n|-----------------------------------|------------------|\n| ZEDA-Qwen3-30B-A3B-rollout-60k | 
[TsinghuaC3I\u002FZEDA](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FTsinghuaC3I\u002FZEDA) |\n| ZEDA-GLM-4.7-Flash-rollout-60k | [TsinghuaC3I\u002FZEDA](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FTsinghuaC3I\u002FZEDA) |\n\n\n# 📊Main Results\n\nZEDA demonstrates consistent improvements across multiple models and benchmarks:\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"figs\u002Fperformance.png\" width=\"90%\">\n\u003C\u002Fp>\n\u003Cp align=\"center\">\n  \u003Cimg src=\"figs\u002Ftraining_time.png\" width=\"90%\">\n\u003C\u002Fp>\n\u003Cp align=\"center\">\n  \u003Cimg src=\"figs\u002Finference_time.png\" width=\"90%\">\n\u003C\u002Fp>\n\n# 💖Acknowledgements\nOur project mainly builds upon [slime](https:\u002F\u002Fgithub.com\u002FTHUDM\u002Fslime), [SGLang](https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang), and [Megatron](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FMegatron-LM). We leverage the datasets of [AceReason](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fnvidia\u002FAceReason-1.1-SFT) and [Llama-Nemotron-Post-Training-Dataset](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fnvidia\u002FLlama-Nemotron-Post-Training-Dataset), and backbone models of [Qwen3-30B-A3B](https:\u002F\u002Fhuggingface.co\u002FQwen\u002FQwen3-30B-A3B) and [GLM-4.7-Flash](https:\u002F\u002Fhuggingface.co\u002Fzai-org\u002FGLM-4.7-Flash). 
We are grateful for these significant open-source contributions.\n\n# 📨Contact\n\nFor questions about this work, please contact:\n\n- Xingtai Lv: lvxt24@mails.tsinghua.edu.cn\n\n\n# 🎈Citation\n\nIf you find this work helpful, please cite our paper:\n\n```bibtex\n@misc{lv2026posttrainedmoeskiphalf,\n      title={Post-Trained MoE Can Skip Half Experts via Self-Distillation}, \n      author={Xingtai Lv and Li Sheng and Kaiyan Zhang and Yichen You and Siyan Gao and Xueheng Luo and Yuxin Zuo and Yuchen Fan and Junlin Yang and Ganqu Cui and Bingning Wang and Fan Yang and Youbang Sun and Ning Ding and Bowen Zhou},\n      year={2026},\n      eprint={2605.18643},\n      archivePrefix={arXiv},\n      primaryClass={cs.LG},\n      url={https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.18643}, \n}\n```\n","ZEDA is a low-cost framework for converting fully trained static Mixture-of-Experts (MoE) models into efficient dynamic ones. The project injects parameter-free zero-output experts into each MoE layer and adapts the augmented model through two-stage self-distillation, cutting expert computation by more than 50% with almost no loss of accuracy. Its technical features include using the original MoE as a fixed teacher model and applying a group-level balancing loss to stabilize the architectural conversion. ZEDA is particularly suited to MoE models that have already completed expensive pre-training and post-training, where further reducing inference cost is a key requirement.",2,"2026-06-11 04:03:06","CREATED_QUERY"]