[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-82168":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":12,"contributorsCount":12,"subscribersCount":12,"size":12,"stars1d":14,"stars7d":14,"stars30d":14,"stars90d":12,"forks30d":12,"starsTrendScore":15,"compositeScore":16,"rankGlobal":9,"rankLanguage":9,"license":17,"archived":18,"fork":18,"defaultBranch":19,"hasWiki":20,"hasPages":18,"topics":21,"createdAt":9,"pushedAt":9,"updatedAt":22,"readmeContent":23,"aiSummary":24,"trendingCount":12,"starSnapshotCount":12,"syncStatus":25,"lastSyncTime":26,"discoverSource":27},82168,"amuse","kjeiun\u002Famuse","kjeiun","AMUSE optimizer implementation",null,"Python",32,0,29,3,9,47.8,"Apache License 2.0",false,"main",true,[],"2026-06-12 04:01:37","\n\u003Ch1 align=\"center\">AMUSE\u003C\u002Fh1>\n\n\u003Cp align=\"center\">\n  \u003Cstrong>AMUSE: Anytime Muon with Stable Gradient Evaluation\u003C\u002Fstrong>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  Jueun Kim* · Baekrok Shin* · Jihun Yun · Beomhan Baek · Minhak Song · Chulhee Yun\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.22432\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2605.22432-b31b1b.svg\" alt=\"arXiv\">\u003C\u002Fa>\n  \u003Ca href=\"#citation\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FBibTeX-Citation-orange.svg\" alt=\"BibTeX\">\u003C\u002Fa>\n  \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPython-3.10-blue.svg\" alt=\"Python\">\n\u003C\u002Fp>\n\n## Method Overview\n\nAMUSE combines Muon with Schedule-Free updates by maintaining three sequences:\nthe fast base sequence $Z_t$, the averaged sequence $X_t$, and the gradient-evaluation\npoint $Y_t$. 
At each step, AMUSE evaluates the gradient at\n\n$$\nY_t = (1-\\beta_t) Z_t + \\beta_t X_t,\n$$\n\nwhere the interpolation coefficient increases after warmup as\n\n$$\n\\beta_t =\n\\begin{cases}\n\\beta_1, & t \\le T_0, \\\\\n1 - \\left(\\frac{T_0 - 1}{t - 1}\\right)^\\rho (1-\\beta_1), & t > T_0.\n\\end{cases}\n$$\n\nThe parameter $\\rho$ controls how quickly the gradient-evaluation point shifts from\nthe fast Muon trajectory $Z_t$ toward the stable averaged trajectory $X_t$.\n\nFor matrix-valued hidden parameters, AMUSE applies Muon at $Y_t$:\n\n$$\nM_t = \\mu M_{t-1} + \\nabla L(Y_t), \\qquad\nO_t = \\mathrm{NewtonSchulz}(M_t),\n$$\n\n$$\nZ_{t+1} = Z_t - \\eta O_t,\n\\qquad\nX_{t+1} = \\left(1-\\frac{1}{t+1}\\right) X_t + \\frac{1}{t+1} Z_{t+1}.\n$$\n\nThus, AMUSE preserves Muon's rapid progress in early training while gradually\nstabilizing the trajectory through Schedule-Free averaging, which reduces valley-wall oscillations and enables schedule-free, anytime training.\n\n**Full paper abstract**:\n> Modern deep learning commonly relies on AdamW with prescribed learning rate schedules, but recent works challenge both components: Schedule-Free optimization removes explicit schedules via iterate averaging, and Muon improves the update geometry by orthogonalizing momentum for matrix parameters. Despite Muon's strong empirical performance, its underlying mechanism remains partially understood.\n> We study Muon through the river-valley loss landscape, where useful training progress occurs along a flat, low-curvature bulk subspace, while high-curvature dominant directions form steep valley walls that induce oscillations. 
We empirically show that while Muon's orthogonalization accelerates river progress by increasing the bulk component, it also amplifies dominant-direction noise, causing oscillatory trajectories.\n> Building on this, we propose **Anytime MUon with Stable gradient Evaluation (AMUSE)**, which integrates Muon's rapid bulk progress with the stabilizing effect of Schedule-Free averaging. AMUSE uses a time-varying interpolation coefficient that initially evaluates gradients near the fast Muon sequence for rapid adaptation, then gradually shifts toward the stable averaged sequence to suppress valley-wall oscillations. As a result, AMUSE requires no learning rate schedules and supports anytime training.\n> Across vision tasks and large language model pretraining, AMUSE consistently improves the performance-iteration Pareto frontier over (Schedule-Free) AdamW and Muon.\n\n\n## Repository Structure\n\n```text\namuse\u002F\n├── src\u002Flm\u002F       # language model pretraining experiments\n├── src\u002Fimage\u002F    # vision\u002Fimage experiments\n├── src\u002Foptim\u002F    # AMUSE and optimizer implementations\n├── scripts\u002F      # launch scripts\n└── assets\u002F       # figures and result plots\n```\n\n\n## Installation\n\n```bash\nconda create -n amuse python=3.10\nconda activate amuse\npip install -r requirements.txt\n```\n\n## Quick Start\n\nFor language model pretraining, run AMUSE on a 124M Llama-style model with:\n\n```bash\nbash scripts\u002Flm\u002F124m\u002Famuse.sh\n```\n\nSet `YOUR_DATASET_DIR` in the script to the root directory used by the FineWeb-100B loader.\n\nFor image classification, run AMUSE on CIFAR-10 with:\n\n```bash\nbash scripts\u002Fimage\u002Fcifar10\u002Famuse.sh\n```\n\nOther image experiments are available through:\n\n```bash\nbash scripts\u002Fimage\u002Fcifar100\u002Famuse.sh\nbash scripts\u002Fimage\u002Fsvhn\u002Famuse.sh\nbash scripts\u002Fimage\u002Fimagenet\u002Famuse.sh\n```\n\nFor ImageNet, set `YOUR_DATASET_DIR` in the 
corresponding script. See [`src\u002Flm\u002FREADME.md`](src\u002Flm\u002FREADME.md) and [`src\u002Fimage\u002FREADME.md`](src\u002Fimage\u002FREADME.md) for task-specific optimizer and parameter grouping details.\n\n\n\n## Results\n\n### Language Model Pretraining\n\nAMUSE improves the performance-iteration Pareto frontier over (Schedule-Free) AdamW and Muon in Llama-style pretraining on FineWeb-100B.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002Ffineweb_llama_124M.png\" width=\"720\" alt=\"FineWeb Llama 124M pretraining results\">\n\u003C\u002Fp>\n\n\nThe same trend holds across model scales.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002Ffineweb_llama_720m_13b.png\" width=\"720\" alt=\"FineWeb Llama scaling results for 720M and 1.3B models\">\n\u003C\u002Fp>\n\n\n### Image Classification\n\nAMUSE also performs strongly across standard image classification benchmarks.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002Fimage_benchmark.png\" width=\"720\" alt=\"image classification results\">\n\u003C\u002Fp>\n\n\n\n## Citation\n\n```bibtex\n@article{kim2026amuse,\n  title={{AMUSE}: Anytime Muon with Stable Gradient Evaluation},\n  author={Kim, Jueun and Shin, Baekrok and Yun, Jihun and Baek, Beomhan and Song, Minhak and Yun, Chulhee},\n  journal={arXiv preprint arXiv:2605.22432},\n  year={2026}\n}\n```\n","AMUSE is an optimizer implementation that aims to improve the training efficiency of deep learning models by combining Muon with Schedule-Free updates. Its core mechanism maintains three sequences: the fast base sequence $Z_t$, the averaged sequence $X_t$, and the gradient-evaluation point $Y_t$. AMUSE exploits Muon's fast convergence early in training and gradually shifts toward the stable averaged trajectory later on, which reduces valley-wall oscillations and enables anytime training without an explicit learning rate schedule. The optimizer is particularly suited to deep learning tasks that need efficient and stable training, and it performs especially well on matrix parameters. The project is written in Python and released under the Apache License 2.0.",2,"2026-06-11 04:07:55","CREATED_QUERY"]