[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-72445":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":25,"hasPages":23,"topics":26,"createdAt":10,"pushedAt":10,"updatedAt":27,"readmeContent":28,"aiSummary":29,"trendingCount":16,"starSnapshotCount":16,"syncStatus":30,"lastSyncTime":31,"discoverSource":32},72445,"Muon","KellerJordan\u002FMuon","KellerJordan","Muon is an optimizer for hidden layers in neural networks","",null,"Python",2656,122,18,20,0,8,22,89,24,94.17,"MIT License",false,"master",true,[],"2026-06-12 04:01:05","# Muon: An optimizer for the hidden layers of neural networks\n\nThis repo contains an implementation of the `Muon` optimizer originally described in [this thread](https:\u002F\u002Fx.com\u002Fkellerjordan0\u002Fstatus\u002F1842300916864844014) and [this writeup](https:\u002F\u002Fkellerjordan.github.io\u002Fposts\u002Fmuon\u002F).\n\n## Installation\n\n```\npip install git+https:\u002F\u002Fgithub.com\u002FKellerJordan\u002FMuon\n```\n\n## Usage\n\nMuon is an optimizer for the hidden weights of a neural network.\nOther parameters, such as embeddings, classifier heads, and hidden gains\u002Fbiases should be optimized using standard AdamW.\nMuon should be used as follows:\n\n```python\n# optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, betas=(0.90, 0.95), weight_decay=0.01)\n\n# To replace the above, do the following:\n\nfrom muon import MuonWithAuxAdam\nhidden_weights = [p for p in model.body.parameters() if p.ndim >= 2]\nhidden_gains_biases = [p for p in model.body.parameters() if p.ndim \u003C 
2]\nnonhidden_params = [*model.head.parameters(), *model.embed.parameters()]\nparam_groups = [\n    dict(params=hidden_weights, use_muon=True,\n         lr=0.02, weight_decay=0.01),\n    dict(params=hidden_gains_biases+nonhidden_params, use_muon=False,\n         lr=3e-4, betas=(0.9, 0.95), weight_decay=0.01),\n]\noptimizer = MuonWithAuxAdam(param_groups)\n```\n\nYou'll have to replace `model.body`, `model.head`, and `model.embed` with whatever is appropriate for your model.\nE.g., for a ConvNet, you should use Muon to optimize all the convolutional filters except the first one, and AdamW to optimize everything else.\n\n## Example usage\n\n[Example use in the NanoGPT speedrun](https:\u002F\u002Fgithub.com\u002FKellerJordan\u002Fmodded-nanogpt\u002Fblob\u002Fmaster\u002Frecords\u002F052525_MuonWithAuxAdamExample\u002Fb01550f9-03d8-4a9c-86fe-4ab434f1c5e0.txt#L470)\n\n[Example use in the CIFAR-10 speedrun](https:\u002F\u002Fgithub.com\u002FKellerJordan\u002Fcifar10-airbench\u002Fblob\u002F28bff5f5b31e95aa45b5b20e1f48baf1ed98d5f6\u002Fairbench94_muon.py#L362)\n\n## Hyperparameter tuning\n\nTypically, the default values of momentum (0.95), nesterov (True), and ns_steps (5) work well. 
Only the learning rate and weight decay must be tuned.\nThe learning rate should have built-in muP scaling: That is, as you scale up the model size, you shouldn't need to retune it.\n\n## Benchmarks\n\nFor a comparison between AdamW, Shampoo, SOAP, and Muon for training a 124M-parameter transformer, see [here](https:\u002F\u002Fgithub.com\u002FKellerJordan\u002Fmodded-nanogpt\u002Ftree\u002Fmaster\u002Frecords\u002F102924_Optimizers).\n\n## Accomplishments\n\n* [Lowered the record for training to 94% on CIFAR-10 from 3.3 A100-seconds to 2.6 A100-seconds](https:\u002F\u002Fgithub.com\u002FKellerJordan\u002Fcifar10-airbench)\n* [Used to train a transformer to GPT-2 (XL) performance in $175 of compute](https:\u002F\u002Fx.com\u002Fkellerjordan0\u002Fstatus\u002F1850995958697308307)\n* [Improved the training speed record for attaining GPT-2 (small) performance by a factor of 1.35x](https:\u002F\u002Fx.com\u002Fkellerjordan0\u002Fstatus\u002F1842300916864844014)\n* [Used by the Kimi.ai frontier lab for scaled LLM training](https:\u002F\u002Fx.com\u002FKimi_Moonshot\u002Fstatus\u002F1893379158472044623)\n* [Ashish Vaswani's lab essential.ai showed that Muon is especially good for training with large batch size](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.02222)\n\n## More learning resources and results about Muon\n\n* [Blog post on Muon by Jianlin Su (the creator of RoPE)](https:\u002F\u002Fkexue.fm\u002Farchives\u002F10592)\n* [Blog post by Jeremy Bernstein on theoretical background of Muon](https:\u002F\u002Fjeremybernste.in\u002Fwriting\u002Fderiving-muon)\n* [Tech report by Kimi.ai on using Muon for scaled training](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.16982v1)\n* [Why we chose Muon: Our chain of thought (by Jianlin Su at Kimi.ai)](https:\u002F\u002Fx.com\u002FKimi_Moonshot\u002Fstatus\u002F1897929976948965870)\n\n## Citation\n\n```bibtex\n@misc{jordan2024muon,\n  author       = {Keller Jordan and Yuchen Jin and Vlado Boza and You Jiacheng and\n          
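        Franz Cesista and Laker Newhouse and Jeremy Bernstein},\n  title        = {Muon: An optimizer for hidden layers in neural networks},\n  year         = {2024},\n  url          = {https:\u002F\u002Fkellerjordan.github.io\u002Fposts\u002Fmuon\u002F}\n}\n```\n","Muon is an optimizer designed specifically for the hidden layers of neural networks. Its core function is to optimize hidden-layer weights, while recommending standard AdamW for other parameters such as embeddings, classifier heads, and hidden gains\u002Fbiases. Technically, it improves training efficiency by tuning momentum, Nesterov accelerated gradient, and the learning rate and weight decay. It suits scenarios that require efficient training of large-scale neural network models, especially image recognition (e.g. the CIFAR-10 dataset) or natural language processing tasks, where it can significantly reduce the compute needed to reach a given level of performance.",2,"2026-06-11 03:42:06","high_star"]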