[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-72100":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":25,"hasPages":23,"topics":26,"createdAt":10,"pushedAt":10,"updatedAt":27,"readmeContent":28,"aiSummary":29,"trendingCount":16,"starSnapshotCount":16,"syncStatus":30,"lastSyncTime":31,"discoverSource":32},72100,"modded-nanogpt","KellerJordan\u002Fmodded-nanogpt","KellerJordan","NanoGPT (124M) in 90 seconds","",null,"Python",5377,804,72,16,0,34,58,153,102,39.72,"MIT License",false,"master",true,[],"2026-06-12 02:02:58","# Modded-NanoGPT\n\nThis repository hosts the *NanoGPT speedrun*, in which we (collaboratively|competitively) search for the fastest algorithm to use 8 NVIDIA H100 GPUs to train a language model that attains 3.28 cross-entropy loss on the [FineWeb](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FHuggingFaceFW\u002Ffineweb) validation set.\n\n(Note: Besides the main track, there is also an [optimization track](records\u002Ftrack_3_optimization) where we try to minimize steps subject to fixed arch\u002Fdata\u002Fbsz and with unlimited wallclock budget.)\n\nThe target (3.28 validation loss on FineWeb) follows Andrej Karpathy's [GPT-2 replication in llm.c, which attains that loss after running for 45 minutes](https:\u002F\u002Fgithub.com\u002Fkarpathy\u002Fllm.c\u002Fdiscussions\u002F481#:~:text=By%20the%20end%20of%20the%20optimization%20we%27ll%20get%20to%20about%203.29).\nThe speedrun code also descends from llm.c's [PyTorch trainer](https:\u002F\u002Fgithub.com\u002Fkarpathy\u002Fllm.c\u002Fblob\u002Fmaster\u002Ftrain_gpt2.py), 
which itself descends from NanoGPT, hence the name of the repo.\nThanks to the efforts of many contributors, this repo now contains a training algorithm which attains the target performance in:\n* Under 90 seconds on 8xH100 (the llm.c GPT-2 replication needed 45 minutes)\n* Under 400M tokens (the llm.c GPT-2 replication needed 10B)\n\nThis improvement in training speed has been brought about by the following techniques:\n* Modernized architecture: Rotary embeddings, QK-Norm, and ReLU²\n* The Muon optimizer [[writeup](https:\u002F\u002Fkellerjordan.github.io\u002Fposts\u002Fmuon\u002F)] [[repo](https:\u002F\u002Fgithub.com\u002FKellerJordan\u002FMuon)]\n* FP8 matmul for the head, with asymmetric rescaling and softcapping of the logits\n* Initialization of projections to zero (muP-like)\n* Skip connections from embedding to every block as well as from block 3 to 6\n* Extra embeddings which are mixed into the values in attention layers (inspired by Zhou et al. 2024)\n* Flash Attention 3 with long-short sliding window attention pattern (inspired by Gemma 2) and window size warmup with YaRN\n* Align training batch starts with EoS and set a max document length\n* Accumulate gradients for 2 steps for embedding and lm_head before updating parameters\n* Backout, with single activation input for last 3 attention layers\n* Polar Express implementation in Muon\n* Smear module to enable 1-token lookback\n* Sparse attention gate\n* NorMuon\n* Cautious Weight Decay w\u002F schedule tied to LR\n* Exponential decay of residual stream\n* Batch size schedule\n* Max seq length schedule\n* Partial Key Offset\n* Multi-token prediction\n* Untie embed and lm_head at 2\u002F3 of training\n* Additional gating on value embeddings and skip connection\n* Paired head attention\n* Bigram hash embedding\n\nAs well as many systems optimizations.\n\nContributors list (growing with each new record): [@bozavlado](https:\u002F\u002Fx.com\u002Fbozavlado); 
[@brendanh0gan](https:\u002F\u002Fx.com\u002Fbrendanh0gan);\n[@fernbear.bsky.social](https:\u002F\u002Fbsky.app\u002Fprofile\u002Ffernbear.bsky.social); [@Grad62304977](https:\u002F\u002Fx.com\u002FGrad62304977); \n[@jxbz](https:\u002F\u002Fx.com\u002Fjxbz); [@kellerjordan0](https:\u002F\u002Fx.com\u002Fkellerjordan0);\n[@KoszarskyB](https:\u002F\u002Fx.com\u002FKoszarskyB); [@leloykun](https:\u002F\u002Fx.com\u002F@leloykun);\n[@YouJiacheng](https:\u002F\u002Fx.com\u002FYouJiacheng); [@jadenj3o](https:\u002F\u002Fx.com\u002Fjadenj3o);\n[@KonstantinWilleke](https:\u002F\u002Fgithub.com\u002FKonstantinWilleke), [@alexrgilbert](https:\u002F\u002Fgithub.com\u002Falexrgilbert), [@adricarda](https:\u002F\u002Fgithub.com\u002Fadricarda),\n[@tuttyfrutyee](https:\u002F\u002Fgithub.com\u002Ftuttyfrutyee), [@vdlad](https:\u002F\u002Fgithub.com\u002Fvdlad); \n[@ryanyang0](https:\u002F\u002Fx.com\u002Fryanyang0), [@vagrawal](https:\u002F\u002Fgithub.com\u002Fvagrawal), [@classiclarryd](https:\u002F\u002Fx.com\u002Fclassiclarryd), \n[@byronxu99](https:\u002F\u002Fgithub.com\u002Fbyronxu99), [@varunneal](https:\u002F\u002Fx.com\u002Fvarunneal), [@EmelyanenkoK](https:\u002F\u002Fgithub.com\u002FEmelyanenkoK), \n[@bernard24](https:\u002F\u002Fgithub.com\u002Fbernard24)\u002Fhttps:\u002F\u002Fwww.hiverge.ai\u002F, [@Gusarich](https:\u002F\u002Fx.com\u002FGusarich), [@li_zichong](https:\u002F\u002Fx.com\u002Fli_zichong),\n[@akash5474](https:\u002F\u002Fgithub.com\u002Fakash5474), [@snimu](https:\u002F\u002Fx.com\u002Fomouamoua), [@roeeshenberg](https:\u002F\u002Fx.com\u002Froeeshenberg),\n[@ChrisJMcCormick](https:\u002F\u002Fx.com\u002FChrisJMcCormick), [@dominikkallusky](https:\u002F\u002Fgithub.com\u002Fdominikkallusky), [@acutkosky](https:\u002F\u002Fgithub.com\u002Facutkosky), \n[@manikbhandari](https:\u002F\u002Fgithub.com\u002Fmanikbhandari), [@andrewbriand](https:\u002F\u002Fx.com\u002Fandrewbriand8), 
[@jrauvola](https:\u002F\u002Fx.com\u002FJoshrav21),\n[@soren_dunn_](https:\u002F\u002Fx.com\u002Fsoren_dunn_), [@photon_mz](https:\u002F\u002Fx.com\u002Fphoton_mz), [@srashedll](https:\u002F\u002Fx.com\u002Fsrashedll), [@dhrvji](https:\u002F\u002Fx.com\u002Fdhrvji),\n[@EmmettBicker](https:\u002F\u002Fgithub.com\u002FEmmettBicker), [@dualverse-ai](https:\u002F\u002Fgithub.com\u002Fdualverse-ai), [@sisovicm](https:\u002F\u002Fx.com\u002Fsisovicm),\n[@moof2x](https:\u002F\u002Fgithub.com\u002Fmoof2x), [@samacqua](https:\u002F\u002Fgithub.com\u002Fsamacqua)\n\n\n---\n\n## Running the current record\n\nTo run the current record, run the following commands.\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FKellerJordan\u002Fmodded-nanogpt.git && cd modded-nanogpt\npip install -r requirements.txt\n# downloads only the first 900M training tokens to save time\npython data\u002Fcached_fineweb10B.py 9\n.\u002Frun.sh\n```\nAdd torchrun to path if .\u002Frun.sh gives error `torchrun: command not found`.\n\n**Note: torch.compile will add around 7 minutes of latency the first time you run the code.**\n\nOfficial records are timed on 8 NVIDIA H100 GPUs from https:\u002F\u002Fapp.primeintellect.ai\u002F. PrimeIntellect has generously sponsored recent validation runs.\n\n## Alternative: Running with Docker (recommended for precise timing)\n\nFor cases where CUDA or NCCL versions aren't compatible with your current system setup, Docker can be a helpful alternative.\nThis approach standardizes versions for CUDA, NCCL, CUDNN, and Python, reducing dependency issues and simplifying setup. 
\nNote: an NVIDIA driver must already be installed on the system (useful if only the NVIDIA driver and Docker are available).\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FKellerJordan\u002Fmodded-nanogpt.git && cd modded-nanogpt\nsudo docker build -t modded-nanogpt .\nsudo docker run -it --rm --gpus all -v $(pwd):\u002Fmodded-nanogpt modded-nanogpt python data\u002Fcached_fineweb10B.py 8\nsudo docker run -it --rm --gpus all -v $(pwd):\u002Fmodded-nanogpt modded-nanogpt sh run.sh\n```\n\nTo get an interactive shell inside the container, run\n```bash\nsudo docker run -it --rm --gpus all -v $(pwd):\u002Fmodded-nanogpt modded-nanogpt bash\n```\n\n---\n\n## World record history\n\nThe table below gives the historical progression of world speed records for this competitive task:\n\n> *Train a neural network to ≤3.28 validation loss on FineWeb using 8x NVIDIA H100s.*\n\nNote: The 3.28 target was selected to match [Andrej Karpathy's GPT-2 (small) reproduction](https:\u002F\u002Fgithub.com\u002Fkarpathy\u002Fllm.c\u002Fdiscussions\u002F481).\n\n| # | Record time | Description | Date | Log | Contributors |\n| - | - | - | - | - | - |\n1 | 45 minutes | [llm.c baseline](https:\u002F\u002Fgithub.com\u002Fkarpathy\u002Fllm.c\u002Fdiscussions\u002F481) | 05\u002F28\u002F24 | [log](records\u002Ftrack_1_short\u002F2024-10-13_llmc\u002Fmain.log) | @karpathy, llm.c contributors\n2 | 31.4 minutes | [Tuned learning rate & rotary embeddings](https:\u002F\u002Fx.com\u002Fkellerjordan0\u002Fstatus\u002F1798863559243513937) | 06\u002F06\u002F24 | [log](records\u002Ftrack_1_short\u002F2024-06-06_AdamW\u002Ff66d43d7-e449-4029-8adf-e8537bab49ea.log) | @kellerjordan0\n3 | 24.9 minutes | [Introduced the Muon optimizer](https:\u002F\u002Fx.com\u002Fkellerjordan0\u002Fstatus\u002F1842300916864844014) | 10\u002F04\u002F24 | none | @kellerjordan0, @jxbz\n4 | 22.3 minutes | [Muon improvements](https:\u002F\u002Fx.com\u002Fkellerjordan0\u002Fstatus\u002F1844820919061287009) | 10\u002F11\u002F24 | 
[log](records\u002Ftrack_1_short\u002F2024-10-10_Muon\u002Feb5659d0-fb6a-49e5-a311-f1f89412f726.txt) | @kellerjordan0, @bozavlado\n5 | 15.2 minutes | [Pad embeddings, ReLU², zero-init projections, QK-norm](https:\u002F\u002Fx.com\u002Fkellerjordan0\u002Fstatus\u002F1845865698532450646) | 10\u002F14\u002F24 | [log](records\u002Ftrack_1_short\u002F2024-10-14_ModernArch\u002Fdabaaddd-237c-4ec9-939d-6608a9ed5e27.txt) | @Grad62304977, @kellerjordan0\n6 | 13.1 minutes | [Distributed the overhead of Muon](https:\u002F\u002Fx.com\u002Fkellerjordan0\u002Fstatus\u002F1847291684016783746) | 10\u002F18\u002F24 | [log](records\u002Ftrack_1_short\u002F2024-10-17_DistributedMuon\u002F22d24867-eb5a-4fcc-ae2c-263d0277dfd1.txt) | @kellerjordan0\n7 | 12.0 minutes | [Upgraded PyTorch 2.5.0](https:\u002F\u002Fx.com\u002Fkellerjordan0\u002Fstatus\u002F1847358578686152764) | 10\u002F18\u002F24 | [log](records\u002Ftrack_1_short\u002F2024-10-18_PyTorch25\u002Fd4bfb25f-688d-4da5-8743-33926fad4842.txt) | @kellerjordan0\n8 | 10.8 minutes | [Untied embedding and head](https:\u002F\u002Fx.com\u002Fkellerjordan0\u002Fstatus\u002F1853188916704387239) | 11\u002F03\u002F24 | [log](records\u002Ftrack_1_short\u002F2024-11-03_UntieEmbed\u002Fd6b50d71-f419-4d26-bb39-a60d55ae7a04.txt) | @Grad62304977, @kellerjordan0\n9 | 8.2 minutes | [Value and embedding skip connections, momentum warmup, logit softcap](https:\u002F\u002Fx.com\u002Fkellerjordan0\u002Fstatus\u002F1854296101303800108) | 11\u002F06\u002F24 | [log](records\u002Ftrack_1_short\u002F2024-11-06_ShortcutsTweaks\u002Fdd7304a6-cc43-4d5e-adb8-c070111464a1.txt) | @Grad62304977, @kellerjordan0\n10 | 7.8 minutes | [Bfloat16 activations](https:\u002F\u002Fx.com\u002Fkellerjordan0\u002Fstatus\u002F1855267054774865980) | 11\u002F08\u002F24 | [log](records\u002Ftrack_1_short\u002F2024-11-08_CastBf16\u002Fa833bed8-2fa8-4cfe-af05-58c1cc48bc30.txt) | @kellerjordan0\n11 | 7.2 minutes | [U-net pattern skip connections & double 
lr](https:\u002F\u002Fx.com\u002Fkellerjordan0\u002Fstatus\u002F1856053121103093922) | 11\u002F10\u002F24 | [log](records\u002Ftrack_1_short\u002F2024-11-10_UNetDoubleLr\u002Fc87bb826-797b-4f37-98c7-d3a5dad2de74.txt) | @brendanh0gan\n12 | 5.03 minutes | [1024-ctx dense causal attention → 64K-ctx FlexAttention](https:\u002F\u002Fx.com\u002Fkellerjordan0\u002Fstatus\u002F1859331370268623321) | 11\u002F19\u002F24 | [log](records\u002Ftrack_1_short\u002F2024-11-19_FlexAttention\u002F8384493d-dba9-4991-b16b-8696953f5e6d.txt) | @KoszarskyB\n13 | 4.66 minutes | [Attention window warmup](https:\u002F\u002Fx.com\u002Fhi_tysam\u002Fstatus\u002F1860851011797053450) | 11\u002F24\u002F24 | [log](records\u002Ftrack_1_short\u002F2024-11-24_WindowWarmup\u002Fcf9e4571-c5fc-4323-abf3-a98d862ec6c8.txt) | @fernbear.bsky.social\n14 | 4.41 minutes | [Value Embeddings](https:\u002F\u002Fx.com\u002FKoszarskyB\u002Fstatus\u002F1864746625572257852) | 12\u002F04\u002F24 | [log](records\u002Ftrack_1_short\u002F2024-12-04_ValueEmbed) | @KoszarskyB\n15 | 3.95 minutes | [U-net pattern value embeddings, assorted code optimizations](https:\u002F\u002Fx.com\u002FYouJiacheng\u002Fstatus\u002F1865761473886347747) | 12\u002F08\u002F24 | [log](records\u002Ftrack_1_short\u002F2024-12-08_UNetValueEmbedsTweaks) | @leloykun, @YouJiacheng\n16 | 3.80 minutes | [Split value embeddings, block sliding window, separate block mask](https:\u002F\u002Fx.com\u002FYouJiacheng\u002Fstatus\u002F1866734331559071981) | 12\u002F10\u002F24 | [log](records\u002Ftrack_1_short\u002F2024-12-10_MFUTweaks) | @YouJiacheng\n17 | 3.57 minutes | [Sparsify value embeddings, improve rotary embeddings, drop an attn layer](https:\u002F\u002Fx.com\u002FYouJiacheng\u002Fstatus\u002F1868938024731787640) | 12\u002F17\u002F24 | [log](records\u002Ftrack_1_short\u002F2024-12-17_SparsifyEmbeds) | @YouJiacheng\n18 | 3.4 minutes | [Lower logit softcap from 30 to 15](https:\u002F\u002Fx.com\u002Fkellerjordan0\u002Fstatus\u002F1876048851158880624) 
| 01\u002F04\u002F25 | [log](records\u002Ftrack_1_short\u002F2025-01-04_SoftCap\u002F31d6c427-f1f7-4d8a-91be-a67b5dcd13fd.txt) | @KoszarskyB\n19 | 3.142 minutes | [FP8 head, offset logits, lr decay to 0.1 instead of 0.0](https:\u002F\u002Fx.com\u002FYouJiacheng\u002Fstatus\u002F1878827972519772241) | 01\u002F13\u002F25 | [log](records\u002Ftrack_1_short\u002F2025-01-13_Fp8LmHead\u002Fc51969c2-d04c-40a7-bcea-c092c3c2d11a.txt) | @YouJiacheng\n20 | 2.992 minutes | [Merged QKV weights, long-short attention, attention scale, lower Adam epsilon, batched Muon](https:\u002F\u002Fx.com\u002Fleloykun\u002Fstatus\u002F1880301753213809016) | 01\u002F16\u002F25 | [log](records\u002Ftrack_1_short\u002F2025-01-16_Sub3Min\u002F1d3bd93b-a69e-4118-aeb8-8184239d7566.txt) | @leloykun, @fernbear.bsky.social, @YouJiacheng, @brendanh0gan, @scottjmaddox, @Grad62304977\n21 | 2.933 minutes | [Reduced batch size](https:\u002F\u002Fx.com\u002Fleloykun\u002Fstatus\u002F1885640350368420160) | 01\u002F26\u002F25 | [log](records\u002Ftrack_1_short\u002F2025-01-26_BatchSize\u002Fc44090cc-1b99-4c95-8624-38fb4b5834f9.txt) | @leloykun\n21 | 2.997 minutes | 21st record with new timing | 02\u002F01\u002F25 | [log](records\u002Ftrack_1_short\u002F2025-02-01_RuleTweak\u002Feff63a8c-2f7e-4fc5-97ce-7f600dae0bc7.txt) | not a new record, just re-timing #21 with the [updated rules](#timing-change-after-record-21)\n21 | 3.014 minutes | 21st record with latest torch | 05\u002F24\u002F25 | [log](records\u002Ftrack_1_short\u002F2025-05-24_StableTorch\u002F89d9f224-3b01-4581-966e-358d692335e0.txt) | not a new record, just re-timing #21 with latest torch\n22 | 2.990 minutes | [Faster gradient all-reduce](https:\u002F\u002Fx.com\u002FKonstantinWille\u002Fstatus\u002F1927137223238909969) | 05\u002F24\u002F25 | [log](records\u002Ftrack_1_short\u002F2025-05-24_FasterReduce\u002F23f40b75-06fb-4c3f-87a8-743524769a35.txt) | @KonstantinWilleke, @alexrgilbert, @adricarda, @tuttyfrutyee, @vdlad; The Enigma project\n23 | 
2.979 minutes | [Overlap computation and gradient communication](https:\u002F\u002Fx.com\u002Fkellerjordan0\u002Fstatus\u002F1927460573098262616) | 05\u002F25\u002F25 | [log](records\u002Ftrack_1_short\u002F2025-05-25_EvenFasterReduce\u002F6ae86d05-5cb2-4e40-a512-63246fd08e45.txt) | @ryanyang0\n24 | 2.966 minutes | Replace gradient all_reduce with reduce_scatter | 05\u002F30\u002F25 | [log](records\u002Ftrack_1_short\u002F2025-05-30_noallreduce\u002F8054c239-3a18-499e-b0c8-dbd27cb4b3ab.txt) | @vagrawal\n25 | 2.896 minutes | Upgrade PyTorch to 2.9.0.dev20250713+cu126 | 07\u002F13\u002F25 | [log](records\u002Ftrack_1_short\u002F2025-07-13_UpgradeTorch190\u002F692f80e0-5e64-4819-97d4-0dc83b7106b9.txt) | @kellerjordan0\n26 | 2.863 minutes | Align training batch starts with EoS, increase cooldown frac to .45 | 07\u002F13\u002F25 | [log](records\u002Ftrack_1_short\u002F2025-07-12_BosAlign\u002Fc1fd8a38-bb9f-45c4-8af0-d37f70c993f3.txt) | @classiclarryd\n27 | 2.817 minutes | Transpose one of the MLP matrices + add Triton kernel for symmetric matmul | 07\u002F18\u002F25 | [log](records\u002Ftrack_1_short\u002F2025-07-18_TritonMuon\u002Frecord.txt),[PR](https:\u002F\u002Fgithub.com\u002FKellerJordan\u002Fmodded-nanogpt\u002Fpull\u002F109) | @byronxu99\n28 | 2.812 minutes | Sparse attention gate | 08\u002F23\u002F25 | [log](records\u002Ftrack_1_short\u002F2025-08-23_SparseAttnGate\u002F020630eb-2191-4ba2-9ee4-4cdc94316943.txt),[PR](https:\u002F\u002Fgithub.com\u002FKellerJordan\u002Fmodded-nanogpt\u002Fpull\u002F117) | @classiclarryd\n29 | 2.731 minutes | Flash Attention 3, 2048 max_doc_len, update ws schedule | 09\u002F03\u002F25 | [log](records\u002Ftrack_1_short\u002F2025-09-03_FA3\u002F44fc1276-0510-4961-92c0-730c65e5feba.txt),[PR](https:\u002F\u002Fgithub.com\u002FKellerJordan\u002Fmodded-nanogpt\u002Fpull\u002F118) | @varunneal\n30 | 2.717 minutes | Drop first MLP layer | 09\u002F05\u002F25 | 
[log](records\u002Ftrack_1_short\u002F2025-09-05_SkipMLPBlocks\u002F07e7ae76-b7d0-4481-b149-01e7d81b5ad4.txt),[PR](https:\u002F\u002Fgithub.com\u002FKellerJordan\u002Fmodded-nanogpt\u002Fpull\u002F120) | @EmelyanenkoK\n31 | 2.656 minutes | Dynamically incorporate YaRN during training and validation | 09\u002F10\u002F25 | [log](records\u002Ftrack_1_short\u002F2025-09-10_Yarn\u002F0ecdb695-510b-4c3b-b030-09861a162ce8.txt),[PR](https:\u002F\u002Fgithub.com\u002FKellerJordan\u002Fmodded-nanogpt\u002Fpull\u002F122) | @classiclarryd\n32 | 2.625 minutes | Optimize distributed training, improve skip connection gating, and enhance bfloat16 usage | 09\u002F11\u002F25 | [log](records\u002Ftrack_1_short\u002F2025-09-11_VectSigmoidBFloat16\u002F0d0d9882-c34f-4d82-b961-a17d5659c988.txt),[PR](https:\u002F\u002Fgithub.com\u002FKellerJordan\u002Fmodded-nanogpt\u002Fpull\u002F125) | @bernard24 & AI system [hiverge.ai](https:\u002F\u002Fwww.hiverge.ai\u002F) \n33 | 2.565 minutes | Asynchronously fetch and index data batches, extend final layer attention window for validation | 09\u002F15\u002F25 | [log](records\u002Ftrack_1_short\u002F2025-09-15_AsyncDataLoadAttnFinalWindow\u002F25db37c7-2bab-4ef4-ae63-d593590ef823.txt),[PR](https:\u002F\u002Fgithub.com\u002FKellerJordan\u002Fmodded-nanogpt\u002Fpull\u002F127) | @classiclarryd\n34 | 2.547 minutes | Smear token embeddings 1 position forward | 09\u002F18\u002F25 | [log](records\u002Ftrack_1_short\u002F2025-09-18_Smear\u002F18a1e5c7-947e-479d-bc3a-a57a61a98fc9.txt),[PR](https:\u002F\u002Fgithub.com\u002FKellerJordan\u002Fmodded-nanogpt\u002Fpull\u002F130) | @classiclarryd\n35 | 2.527 minutes | Drop first attn layer, extend all long windows for validation, update schedule | 09\u002F21\u002F25 | [log](records\u002Ftrack_1_short\u002F2025-09-21_DropAttn\u002F01fc4a96-f2a0-47a1-8a6a-c7d10bac99fe.txt),[PR](https:\u002F\u002Fgithub.com\u002FKellerJordan\u002Fmodded-nanogpt\u002Fpull\u002F131) | @classiclarryd\n36 | 2.495 minutes | 
MuonCustomSizing, perform mlp and attn reduce scatter in shared call | 09\u002F23\u002F25 | [log](records\u002Ftrack_1_short\u002F2025-09-23_MuonCustomSizing\u002Fb067b4ac-72a6-4436-a6f8-ea51c1efeef3.txt),[PR](https:\u002F\u002Fgithub.com\u002FKellerJordan\u002Fmodded-nanogpt\u002Fpull\u002F132) | @classiclarryd\n37 | 2.483 minutes | Compute cross entropy in BF16 during training | 09\u002F27\u002F25 | [log](records\u002Ftrack_1_short\u002F2025-09-27_BF16CE\u002F08c0770f-17fc-44cd-971d-734a7a28a3e3.txt),[PR](https:\u002F\u002Fgithub.com\u002FKellerJordan\u002Fmodded-nanogpt\u002Fpull\u002F133) | @Gusarich\n38 | 2.476 minutes | Polar Express, replacement for Newton-Schulz | 09\u002F29\u002F25 | [log](records\u002Ftrack_1_short\u002F2025-09-29_PolarExpress\u002F0e3f0af5-ad08-47a6-813d-0c709b50d422.txt),[PR](https:\u002F\u002Fgithub.com\u002FKellerJordan\u002Fmodded-nanogpt\u002Fpull\u002F134) | @varunneal\n39 | 2.447 minutes | Only update Adam params every other step, reduce batch size | 09\u002F30\u002F25 | [log](records\u002Ftrack_1_short\u002F2025-09-30_CustomBatching\u002F40b101b1-77ea-45ea-a089-1d3a647daa22.txt),[PR](https:\u002F\u002Fgithub.com\u002FKellerJordan\u002Fmodded-nanogpt\u002Fpull\u002F136) | @classiclarryd\n40 | 2.358 minutes | Backout, misc hyperparameter tuning, optimize lambda padding | 10\u002F04\u002F25 | [log](records\u002Ftrack_1_short\u002F2025-10-04_Backout\u002F514e7581-fbd4-4338-a3e4-e556f9c958ce.txt),[PR](https:\u002F\u002Fgithub.com\u002FKellerJordan\u002Fmodded-nanogpt\u002Fpull\u002F140) | @classiclarryd\n41 | 2.345 minutes | [NorMuon](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2510.05491) | 10\u002F24\u002F25 | [log](records\u002Ftrack_1_short\u002F2025-10-24_NorMuon\u002F088a77ee-9b67-475a-bbb9-3e92e4698799.txt),[PR](https:\u002F\u002Fgithub.com\u002FKellerJordan\u002Fmodded-nanogpt\u002Fpull\u002F144) | @li_zichong\n42 | 2.313 minutes | Update NorMuon LR, Step Logic  | 10\u002F27\u002F25 | 
[log](records\u002Ftrack_1_short\u002F2025-10-27_FixMuonLR\u002F14afd380-d3d9-48d7-ad23-4c13cb96754b.txt),[PR](https:\u002F\u002Fgithub.com\u002FKellerJordan\u002Fmodded-nanogpt\u002Fpull\u002F146) | @varunneal\n43 | 2.284 minutes | Cautious Weight Decay w\u002F schedule  | 11\u002F10\u002F25 | [log](records\u002Ftrack_1_short\u002F2025-11-10_CautiousWD\u002F1aac0132-a891-4ed9-b358-0fd2abd1b019.txt),[PR](https:\u002F\u002Fgithub.com\u002FKellerJordan\u002Fmodded-nanogpt\u002Fpull\u002F154) | @varunneal\n44 | 2.269 minutes | Backward hooks on Adam, [Profiling 101](https:\u002F\u002Fblog.underfit.ai\u002Fprofiling-101-nanogpt)  | 11\u002F16\u002F25 | [log](records\u002Ftrack_1_short\u002F2025-10-31_AdamSyncGradientHook\u002F0c17cdfd-772c-4906-8d11-141b370599a0.txt),[PR](https:\u002F\u002Fgithub.com\u002FKellerJordan\u002Fmodded-nanogpt\u002Fpull\u002F149) | @akash5474\n45 | 2.248 minutes | Refine skip arch, update exponential decay init| 11\u002F18\u002F25 | [log](records\u002Ftrack_1_short\u002F2025-11-18_RefineSkip\u002F00f4e1e6-0044-4a08-b88a-3b7ec0624081.txt),[PR](https:\u002F\u002Fgithub.com\u002FKellerJordan\u002Fmodded-nanogpt\u002Fpull\u002F159) | @classiclarryd\n46 | 2.203 minutes | [Batch size schedule](https:\u002F\u002Fx.com\u002Fclassiclarryd\u002Fstatus\u002F1998212158770065844) | 11\u002F29\u002F25 | [log](records\u002Ftrack_1_short\u002F2025-11-29_BatchSizeSchedule\u002F10e8f7c6-7175-4467-bdb0-a5de25d771a6.txt),[PR](https:\u002F\u002Fgithub.com\u002FKellerJordan\u002Fmodded-nanogpt\u002Fpull\u002F163) | @varunneal\n47 | 2.193 minutes | [Multiply attn lambda with weight instead of data, fix warmup](https:\u002F\u002Fx.com\u002Fclassiclarryd\u002Fstatus\u002F1999630732814348451) | 12\u002F10\u002F25 | [log](records\u002Ftrack_1_short\u002F2025-12-10_SALambdaOnWeights\u002F15ef5eaf-56e1-40e1-9ddf-af010027c9dd.txt),[PR](https:\u002F\u002Fgithub.com\u002FKellerJordan\u002Fmodded-nanogpt\u002Fpull\u002F166) | @roeeshenberg\n48 | 2.170 minutes | [Speed up 
Muon, additional pre-multiply lambda, reshape matrices, update lr, update NorMuon axis](https:\u002F\u002Fx.com\u002Fclassiclarryd\u002Fstatus\u002F2000272495644152317) | 12\u002F11\u002F25 | [log](records\u002Ftrack_1_short\u002F2025-12-11_NorMuonOptimsAndFixes\u002F82edf6be-f343-475d-b93a-47c32acf4de2.txt),[PR](https:\u002F\u002Fgithub.com\u002FKellerJordan\u002Fmodded-nanogpt\u002Fpull\u002F168) | @ChrisJMcCormick\n49 | 2.146 minutes | [Partial Key Offset](https:\u002F\u002Fx.com\u002Fclassiclarryd\u002Fstatus\u002F2000841339299402142) | 12\u002F14\u002F25 | [log](records\u002Ftrack_1_short\u002F2025-12-14_PartialKeyOffset\u002F150d40bf-c20b-4568-aac9-26eb919e25fd.txt),[PR](https:\u002F\u002Fgithub.com\u002FKellerJordan\u002Fmodded-nanogpt\u002Fpull\u002F169) | @classiclarryd\n50 | 2.128 minutes | [Extend Cautious Weight Decay to Adam parameters](https:\u002F\u002Fx.com\u002Fclassiclarryd\u002Fstatus\u002F2002482925741486381) | 12\u002F18\u002F25 | [log](records\u002Ftrack_1_short\u002F2025-12-18_CautiousWDAdam\u002F1981d492-bc65-4ba9-a0fa-2b30fc5c3eba.txt),[PR](https:\u002F\u002Fgithub.com\u002FKellerJordan\u002Fmodded-nanogpt\u002Fpull\u002F172) | @roeeshenberg\n51 | 2.075 minutes | [Retie Embed to lm_head, retune fp8 scales](https:\u002F\u002Fx.com\u002Fclassiclarryd\u002Fstatus\u002F2003167208483209668) | 12\u002F19\u002F25 | [log](records\u002Ftrack_1_short\u002F2025-12-19_RetieLMHead\u002F0828d309-ecfe-4442-9ee9-68fed3a4b599.txt),[PR](https:\u002F\u002Fgithub.com\u002FKellerJordan\u002Fmodded-nanogpt\u002Fpull\u002F175) | @varunneal\n52 | 2.037 minutes | [Smooth scalars via beta increase, decrease smear gate lr, freeze scalars during transitions, adam all reduce](https:\u002F\u002Fx.com\u002Fclassiclarryd\u002Fstatus\u002F2003863282613190656)  | 12\u002F21\u002F25 | 
[log](records\u002Ftrack_1_short\u002F2025-12-21_SmoothedScalars\u002F12-21-Smoothed-Scalars\u002F0bc6e909-8ee8-4ae3-ac62-0070e151a808.txt),[PR](https:\u002F\u002Fgithub.com\u002FKellerJordan\u002Fmodded-nanogpt\u002Fpull\u002F177) | @ChrisJMcCormick\n53 | 1.988 minutes | [Multi-token prediction, untie embed\u002Flm_head at 2\u002F3 training, lr update, tweak CWD](https:\u002F\u002Fx.com\u002Fclassiclarryd\u002Fstatus\u002F2004248941878296580)  | 12\u002F22\u002F25 | [log](records\u002Ftrack_1_short\u002F2025-12-22_MultiTokenPrediction\u002F17aaf854-f338-4d0d-9767-a5db30fd7980.txt),[PR](https:\u002F\u002Fgithub.com\u002FKellerJordan\u002Fmodded-nanogpt\u002Fpull\u002F178) | @varunneal, feat. @classiclarryd\n54 | 1.940 minutes | [Asymmetric Logit Rescale](https:\u002F\u002Fx.com\u002Fclassiclarryd\u002Fstatus\u002F2004791008098480232)  | 12\u002F26\u002F25 | [log](records\u002Ftrack_1_short\u002F2025-12-26_LogitRescale\u002F03e41c2d-2951-4546-a599-24cd723247fc.txt),[PR](https:\u002F\u002Fgithub.com\u002FKellerJordan\u002Fmodded-nanogpt\u002Fpull\u002F181) | @classiclarryd\n55 | 1.918 minutes | [Gates on value embeds and skip connection](https:\u002F\u002Fx.com\u002Fclassiclarryd\u002Fstatus\u002F2005659526960492638)  | 12\u002F29\u002F25 | [log](records\u002Ftrack_1_short\u002F2025-12-29_VeSkipGates\u002F2851d7dc-d6a5-4e74-8623-57031425db16.txt),[PR](https:\u002F\u002Fgithub.com\u002FKellerJordan\u002Fmodded-nanogpt\u002Fpull\u002F186) | @classiclarryd\n56 | 1.894 minutes | [Optimize and compile Adam, increase Adam buffer precision, move gates from Muon to Adam parameter banks](https:\u002F\u002Fx.com\u002Fclassiclarryd\u002Fstatus\u002F2007882371576873445) | 12\u002F31\u002F25 | [log](records\u002Ftrack_1_short\u002F2025-12-31_GatesToCompiledAdam\u002F12-31-gates-to-adam-20stps\u002F219a5f2f-151e-4c56-ab91-3735ae4610b8.txt),[PR](https:\u002F\u002Fgithub.com\u002FKellerJordan\u002Fmodded-nanogpt\u002Fpull\u002F187) | @ChrisJMcCormick\n57 | 1.878 minutes | [Bfloat16 
attn\u002Fmlp weights, mixed precision Muon, interweave Adam\u002FMuon, finer-grain Adam beta](https:\u002F\u002Fx.com\u002Fclassiclarryd\u002Fstatus\u002F2008261904566022590) | 01\u002F04\u002F26 | [log](records\u002Ftrack_1_short\u002F2026-01-04_MixedPrecisionInterweavedOptimizer\u002F41f606b6-1b9c-46a3-b46e-2beff1521d18.txt),[PR](https:\u002F\u002Fgithub.com\u002FKellerJordan\u002Fmodded-nanogpt\u002Fpull\u002F190) | @classiclarryd, feat. @YouJiacheng, @ChrisJMcCormick\n58 | 1.820 minutes | [Paired Head Attention](https:\u002F\u002Fx.com\u002Fclassiclarryd\u002Fstatus\u002F2008963501688324228) | 01\u002F07\u002F26 | [log](records\u002Ftrack_1_short\u002F2026-01-07_PairedHeadAttention\u002F2a5d5cde-db5f-4aab-a4a8-cc8e183ea671.txt),[PR](https:\u002F\u002Fgithub.com\u002FKellerJordan\u002Fmodded-nanogpt\u002Fpull\u002F191) | @classiclarryd\n59 | 1.781 minutes | [Fused triton kernel for linear relu square MLP step](https:\u002F\u002Fx.com\u002Fclassiclarryd\u002Fstatus\u002F2010545452832407943) | 01\u002F10\u002F26 | [log](records\u002Ftrack_1_short\u002F2026-01-10_FusedLinearReLUSquare\u002F3c47e63b-075e-4b5b-9c76-9dbe7bad9ad4.txt),[PR](https:\u002F\u002Fgithub.com\u002FKellerJordan\u002Fmodded-nanogpt\u002Fpull\u002F197) | @andrewbriand8, @Joshrav21\n60 | 1.765 minutes | [Fused triton kernel for softcapped multi-token prediction cross entropy step](https:\u002F\u002Fx.com\u002Fclassiclarryd\u002Fstatus\u002F2012927211448516796) | 01\u002F16\u002F26 | [log](records\u002Ftrack_1_short\u002F2026-01-16_FusedSoftcappedEntropy\u002F45beba56-93e2-4995-bc5b-caff3cb2c1b5.txt),[PR](https:\u002F\u002Fgithub.com\u002FKellerJordan\u002Fmodded-nanogpt\u002Fpull\u002F199) | @soren_dunn_ & AI System [Locus](https:\u002F\u002Fwww.intology.ai\u002Fblog\u002Fpreviewing-locus)\n61 | 1.748 minutes | [Unified Optimizers and Transposed LM Head](https:\u002F\u002Fx.com\u002Fclassiclarryd\u002Fstatus\u002F2013399457841160702) | 01\u002F18\u002F26 | 
[log](records\u002Ftrack_1_short\u002F2026-01-18_UnifiedOptimizers\u002Funified-optimizer\u002F2fc79469-a527-4bde-8540-8426ed3352d1.txt),[PR](https:\u002F\u002Fgithub.com\u002FKellerJordan\u002Fmodded-nanogpt\u002Fpull\u002F200) | @ChrisJMcCormick\n62 | 1.655 minutes | [Bigram Hash Embedding](https:\u002F\u002Fx.com\u002Fclassiclarryd\u002Fstatus\u002F2013520088297558274) | 01\u002F19\u002F26 | [log](records\u002Ftrack_1_short\u002F2026-01-19_BigramHashEmbedding\u002F40ec7bb6-14b3-46f8-90b7-bb5ed188faba.txt),[PR](https:\u002F\u002Fgithub.com\u002FKellerJordan\u002Fmodded-nanogpt\u002Fpull\u002F201) | @classiclarryd\n63 | 1.650 minutes | [Untie Value Embeds](https:\u002F\u002Fx.com\u002Fclassiclarryd\u002Fstatus\u002F2016968386476200301) | 01\u002F26\u002F26 | [log](records\u002Ftrack_1_short\u002F2026-01-26-UntieValueEmbeddings\u002F43955d93-6914-40cb-bdf8-786ace93784f.txt),[PR](https:\u002F\u002Fgithub.com\u002FKellerJordan\u002Fmodded-nanogpt\u002Fpull\u002F209) | @photon_mz\n64 | 1.630 minutes | [Tuned nonzero Attn V and O init](https:\u002F\u002Fx.com\u002Fclassiclarryd\u002Fstatus\u002F2017735338601726357) | 01\u002F30\u002F26 | [log](records\u002Ftrack_1_short\u002F2026-01-30_MimeticValueOutput\u002Fruns\u002F0f262f64-20c4-4268-9ae7-d7440c810abd.txt),[PR](https:\u002F\u002Fgithub.com\u002FKellerJordan\u002Fmodded-nanogpt\u002Fpull\u002F214) | @srashedll\n65 | 1.613 minutes | [Group Value Embeds into single parameter](https:\u002F\u002Fx.com\u002Fclassiclarryd\u002Fstatus\u002F2018057653742920016) | 01\u002F30\u002F26 | [log](records\u002Ftrack_1_short\u002F2026-01-30_VeFused\u002F0ba09d92-4ef1-440f-85e3-9d2766294db4.txt),[PR](https:\u002F\u002Fgithub.com\u002FKellerJordan\u002Fmodded-nanogpt\u002Fpull\u002F215) | @varunneal\n66 | 1.595 minutes | Torch 2.10 | 01\u002F31\u002F26 | - | -\n67 | 1.540 minutes | [Tune fused softcap kernels and fuse fp8 quantization in LM head](https:\u002F\u002Fx.com\u002Fclassiclarryd\u002Fstatus\u002F2021015642472869978) | 
01\u002F31\u002F26 | [log](records\u002Ftrack_1_short\u002F2026-01-24_ImprovedLMHead\u002Frecord\u002F73a071ac-522d-4ce0-a4d6-cf3955a376e4.txt),[PR](https:\u002F\u002Fgithub.com\u002FKellerJordan\u002Fmodded-nanogpt\u002Fpull\u002F207) | @andrewbriand8\n68 | 1.535 minutes | [Move bigram hash to GPU](https:\u002F\u002Fx.com\u002Fclassiclarryd\u002Fstatus\u002F2021450730117460439) | 01\u002F31\u002F26 | [log](records\u002Ftrack_1_short\u002F2026-01-31-BigramHashH2D\u002F112c686e-b0d6-4dc8-814a-1ad1f5d5b274.txt),[PR](https:\u002F\u002Fgithub.com\u002FKellerJordan\u002Fmodded-nanogpt\u002Fpull\u002F216) | @dhrvji\n69 | 1.528 minutes | [Kernel Optimizations](https:\u002F\u002Fx.com\u002Fclassiclarryd\u002Fstatus\u002F2023319358303510719) | 02\u002F02\u002F26 | [log](records\u002Ftrack_1_short\u002F2026-02-02_KernelTuning\u002F25afb73a-332f-4d69-b9ab-f6261497f2d8.txt),[PR](https:\u002F\u002Fgithub.com\u002FKellerJordan\u002Fmodded-nanogpt\u002Fpull\u002F217) | @EmmettBicker & AI System [Aster](https:\u002F\u002Fwww.asterlab.ai\u002F)\n70 | 1.521 minutes | [Tune value embed layout and ve_gates](https:\u002F\u002Fx.com\u002Fclassiclarryd\u002Fstatus\u002F2023319358303510719) | 02\u002F03\u002F26 | [log](records\u002Ftrack_1_short\u002F2026-02-03_VeTuned\u002F42cbebac-0599-4a89-a00e-26d1c4cad140.txt),[PR](https:\u002F\u002Fgithub.com\u002FKellerJordan\u002Fmodded-nanogpt\u002Fpull\u002F218) | @photon_mz\n71 | 1.516 minutes | [Sparse bigram gradient comms and optimized loading on CPU](https:\u002F\u002Fx.com\u002Fclassiclarryd\u002Fstatus\u002F2023319358303510719) | 02\u002F06\u002F26 | [log](records\u002Ftrack_1_short\u002F2026-02-06_SparseBigramGradient\u002F02fee7bd-cd22-478b-9e8e-12e857ff3152.txt),[PR](https:\u002F\u002Fgithub.com\u002FKellerJordan\u002Fmodded-nanogpt\u002Fpull\u002F221) | @roeeshenberg\n72 | 1.496 minutes | [Increase minimum lr and add max_seq_len schedule](https:\u002F\u002Fx.com\u002Fclassiclarryd\u002Fstatus\u002F2023319358303510719) | 
02\u002F10\u002F26 | [log](records\u002Ftrack_1_short\u002F2026-02-10_ShortWindow\u002FShort-Window_1_1.txt),[PR](https:\u002F\u002Fgithub.com\u002FKellerJordan\u002Fmodded-nanogpt\u002Fpull\u002F224) | @dualverse-ai & AI System [Station](https:\u002F\u002Fgithub.com\u002Fdualverse-ai\u002Fstation)\n73 | 1.485 minutes | [Partitioned Hyperconnections](https:\u002F\u002Fx.com\u002Fclassiclarryd\u002Fstatus\u002F2026131531207761924) | 02\u002F12\u002F26 | [log](records\u002Ftrack_1_short\u002F2026-02-12_ParallelResiduals\u002F451050db-d471-49db-b19b-be824bb896d0.txt),[PR](https:\u002F\u002Fgithub.com\u002FKellerJordan\u002Fmodded-nanogpt\u002Fpull\u002F230) | @sisovicm\n74 | 1.468 minutes | [Flattened GPT forward, removed post attention lambdas, added transpose kernels](https:\u002F\u002Fx.com\u002Fclassiclarryd\u002Fstatus\u002F2027228782483182059) | 02\u002F16\u002F26 | [log](records\u002Ftrack_1_short\u002F2026-02-16_FlattenForward\u002Fpr233\u002F2026-02-16_21-30-05_time-362_secs_F-inject-post-attn_9f12a3.txt),[PR](https:\u002F\u002Fgithub.com\u002FKellerJordan\u002Fmodded-nanogpt\u002Fpull\u002F233) | @ChrisJMcCormick\n75 | 1.453 minutes | [Cross Entropy Kernel Optimizations](https:\u002F\u002Fx.com\u002Fclassiclarryd\u002Fstatus\u002F2030087884854939947) | 02\u002F23\u002F26 | [log](records\u002Ftrack_1_short\u002F2026-02-23_CrossEntropyKernel\u002F1e51be6b-7dd4-41ab-b95d-e57da5814776.txt),[PR](https:\u002F\u002Fgithub.com\u002FKellerJordan\u002Fmodded-nanogpt\u002Fpull\u002F235) | @moof2x\n76 | 1.446 minutes | [Reuse and tune backward transpose kernel](https:\u002F\u002Fx.com\u002Fclassiclarryd\u002Fstatus\u002F2030403421027852337) | 02\u002F28\u002F26 | [log](records\u002Ftrack_1_short\u002F2026-02-28_TransposeCopyBackward\u002Fthis_pr\u002F14c9cefc-c840-493f-870e-61bb1d2b1d97.txt),[PR](https:\u002F\u002Fgithub.com\u002FKellerJordan\u002Fmodded-nanogpt\u002Fpull\u002F240) | @samacqua\n77 | 1.435 minutes | [Replace partitioned hyperconnections with single saved 
activation](https:\u002F\u002Fx.com\u002Fclassiclarryd\u002Fstatus\u002F2030465730718908884) | 03\u002F06\u002F26 | [log](records\u002Ftrack_1_short\u002F2026-03-06_SimplifyHC\u002F0ab4a843-8c3a-4fb4-9fff-8e1d39852646.txt),[PR](https:\u002F\u002Fgithub.com\u002FKellerJordan\u002Fmodded-nanogpt\u002Fpull\u002F241) | @classiclarryd\n78 | 1.426 minutes | [Tighten bounds on fa3 max_num_docs to match fineweb distribution](https:\u002F\u002Fx.com\u002Fclassiclarryd\u002Fstatus\u002F2038077427180851240) | 03\u002F22\u002F26 | [log](records\u002Ftrack_1_short\u002F2026-03-22_VarlenMaxDocs\u002Fcombined\u002F2026-03-22_20-07-32_time-186_secs_06-mbeta2-max-docs_227ce8.txt),[PR](https:\u002F\u002Fgithub.com\u002FKellerJordan\u002Fmodded-nanogpt\u002Fpull\u002F246) | @ChrisJMcCormick\n79 | 1.411 minutes | Fuse Cross Entropy Fwd\u002FBwd Kernel to avoid recomputing the softcap sigmoid | 04\u002F04\u002F26 | [log](records\u002Ftrack_1_short\u002F2026-04-04_FuseCEFwdAndBwd\u002Fruns\u002F19ad9161-37c0-4985-8dd4-6db4e27f34b4.txt),[PR](https:\u002F\u002Fgithub.com\u002FKellerJordan\u002Fmodded-nanogpt\u002Fpull\u002F251) | @andrewbriand8\n80 | 1.406 minutes | In Muon, orthogonalize Q and K matrices in pairs of heads instead of across the full 6-head matrix | 04\u002F08\u002F26 | [log](records\u002Ftrack_1_short\u002F2026-04-08_PairedHeadMuon\u002Flogs\u002Fsplit_qk0-1480.txt),[PR](https:\u002F\u002Fgithub.com\u002FKellerJordan\u002Fmodded-nanogpt\u002Fpull\u002F253) | @samacqua\n\n## Rules\n\nNew records must:\n\n1. Not modify the train or validation data pipelines. (You can change the batch size, sequence length, attention structure etc.; just don't change the underlying streams of tokens.)\n2. Attain ≤3.28 mean val loss. (Due to inter-run variance, submissions must provide enough run logs to attain a statistical significance level of p\u003C0.01 that their mean val loss is ≤3.28. 
Example code to compute the p-value can be found [here](records\u002Ftrack_1_short\u002F2025-01-04_SoftCap#softer-softcap). For submissions which improve speed by optimizing systems performance, without touching the ML, this requirement is waived.)\n3. Not use any extra `torch._inductor.config` or `torch.compile` flags. (These can save a few seconds, but they can also make compilation take >30min. This rule was introduced after the 21st record.)\n4. Run faster than the prior record when baselined on the same hardware.\n\nDiscretionary reasons why a PR may not be accepted:\n1. Disproportionately degrades the readability of the codebase. A 200-line kernel to drop 300ms is considered worthwhile. 500 lines that convolute the optimizer layout for a 50ms gain will likely be rejected.\n2. The current record is intentionally kept roughly 0.001-0.002 loss below 3.28 to make validation simpler. If a PR substantially consumes this buffer, it should do so in a way that outperforms a simple step count decrease, when measured at equivalent loss.\n\n> Note: `torch._inductor.config.coordinate_descent_tuning` is allowed for the GPT-2 Medium track (a.k.a. the 2.92 track).\n\nOther than that, anything and everything is fair game!\n\n[further clarifications](https:\u002F\u002Fgithub.com\u002FKellerJordan\u002Fmodded-nanogpt\u002Fdiscussions\u002F23?sort=new#discussioncomment-12109560)\n\n---\n\n### Comment on the target metric\n\nThe target metric is *cross-entropy loss on the FineWeb val set*. To speak mathematically, the goal of the speedrun is *to obtain a probability model of language which assigns a probability of at least `math.exp(-3.28 * 10485760)` to the first 10,485,760 tokens of the FineWeb val set*. Hence, e.g., we allow evaluation at any sequence length, so long as we still have a valid probability model of language.\n\n---\n\n### Timing change after record 21\n\nAfter the 21st record, we made two changes to the timing. 
First, there used to be an initial \"grace period\" of 10 untimed steps to allow kernel warmup. We replaced this with an explicit kernel-warmup section which is untimed and uses dummy data. This results in an extra runtime of 850ms from the 10 extra timed steps.\nSecond, we banned the use of `torch._inductor.config.coordinate_descent_tuning`. This saves ~25min of untimed pre-run compilation, but results in an extra runtime of ~3s.\n\n\u003C!--Note: The original llm.c baseline is intended to be closer to a replication of GPT-2 than to an optimized LLM training.\nSo it's no surprise that there is room to improve; as @karpathy has said, 'llm.c still has a lot of pending optimizations.'\nIn addition, many of the techniques used in these records are completely standard, such as rotary embeddings.\nThe goal of this benchmark\u002Fspeedrun is simply to find out which techniques actually work, and maybe come up with some new ones.-->\n\u003C!--The goal of this benchmark is simply to find out all the techniques which actually work, because I'm going crazy reading all these\nLLM training papers\nwhich claim a huge benefit but then use their own idiosyncratic non-competitive benchmark and therefore no one in the community has any idea if it's legit for months.-->\n\u003C!--[LLM](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.14342) [training](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.17764) [papers](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.01131)-->\n\u003C!--I mean hello??? We're in a completely empirical field; it is insane to not have a benchmark. Ideally everyone uses the same LLM training benchmark,\nand then reviewing LLM training papers becomes as simple as checking if they beat the benchmark. It's not like this would be unprecedented, that's how things\nwere in the ImageNet days.\nThe only possible 'benefit' I can think of for any empirical field to abandon benchmarks is that it would make it easier to publish false results. 
Oh, I guess that's why it happened.\nHilarious to think about how, in the often-commented-upon and ongoing collapse of the peer review system, people blame the *reviewers* --\nyeah, those guys doing free labor who everyone constantly musters all of their intelligence to lie to, it's *their* fault! My bad, you caught me monologuing.-->\n\n---\n\n### Notable attempts & forks\n\n**Notable runs:**\n\n* [@alexjc's 01\u002F20\u002F2025 2.77-minute TokenMonster-based record](https:\u002F\u002Fx.com\u002Falexjc\u002Fstatus\u002F1881410039639863622).\nThis record is technically outside the rules of the speedrun, since we specified that the train\u002Fval tokens must be kept fixed.\nHowever, it's very interesting, and worth including. The run is not more data-efficient; rather, the speedup comes from the improved tokenizer allowing\nthe vocabulary size to be reduced (nearly halved!) while preserving the same bytes-per-token, which saves lots of parameters and FLOPs in the head and embeddings.\n* [@samacqua's 1\u002F23\u002F2026 test time training run](https:\u002F\u002Fgithub.com\u002FKellerJordan\u002Fmodded-nanogpt\u002Fpull\u002F205). Sam found that prediction accuracy on the later portions of a given document could be improved by performing a training\nupdate on Adam parameters based on the early portion of the document. This 'parameter nudging' is repeated independently for each document. Interestingly, these gradient updates prove effective while only using ~500 tokens, substantially less than the over 200k tokens typically used on a normal training step. 
While this is technically a valid probability model, we do not allow untimed backward passes.\n\n**Notable forks:**\n* [https:\u002F\u002Fgithub.com\u002FBlinkDL\u002Fmodded-nanogpt-rwkv](https:\u002F\u002Fgithub.com\u002FBlinkDL\u002Fmodded-nanogpt-rwkv)\n* [https:\u002F\u002Fgithub.com\u002Fnikhilvyas\u002Fmodded-nanogpt-SOAP](https:\u002F\u002Fgithub.com\u002Fnikhilvyas\u002Fmodded-nanogpt-SOAP)\n\n---\n\n## Speedrun track 2: GPT-2 Medium\n\nThe target loss for this track is lowered from 3.28 to 2.92, as per Andrej Karpathy's 350M-parameter llm.c baseline.\nThis baseline generates a model with performance similar to the original GPT-2 Medium, whereas the first track's baseline generates a model on par with GPT-2 Small.\nAll other rules remain the same.\n\n> Note: `torch._inductor.config.coordinate_descent_tuning` is turned on after record 6 (*).\n\n| # | Record time | Description | Date | Log | Contributors |\n| - | - | - | - | - | - |\n1 | 5.8 hours | [llm.c baseline (350M parameters)](https:\u002F\u002Fgithub.com\u002Fkarpathy\u002Fllm.c\u002Fdiscussions\u002F481) | 05\u002F28\u002F24 | [log](records\u002Ftrack_2_medium\u002F2025-01-18\u002Fmain.log) | @karpathy, llm.c contributors\n2 | 29.3 minutes | [Initial record based on scaling up the GPT-2 small track speedrun](https:\u002F\u002Fx.com\u002Fkellerjordan0\u002Fstatus\u002F1881959719012847703) | 01\u002F18\u002F25 | [log](records\u002Ftrack_2_medium\u002F2025-01-18\u002F241dd7a7-3d76-4dce-85a4-7df60387f32a.txt) | @kellerjordan0\n3 | 28.1 minutes | [Added standard weight decay](https:\u002F\u002Fx.com\u002Fkellerjordan0\u002Fstatus\u002F1888320690543284449) | 02\u002F08\u002F25 | [log](records\u002Ftrack_2_medium\u002F2025-02-08_WeightDecay\u002Fb01743db-605c-4326-b5b1-d388ee5bebc5.txt) | @kellerjordan0\n4 | 27.7 minutes | [Tuned Muon Newton-Schulz coefficients](https:\u002F\u002Fx.com\u002Fleloykun\u002Fstatus\u002F1892793848163946799) | 02\u002F14\u002F25 | 
[log](records\u002Ftrack_2_medium\u002F2025-02-14_OptCoeffs\u002F1baa66b2-bff7-4850-aced-d63885ffb4b6.txt) | @leloykun\n5 | 27.2 minutes | [Increased learning rate cooldown phase duration](records\u002Ftrack_2_medium\u002F2025-03-06_LongerCooldown\u002F779c041a-2a37-45d2-a18b-ec0f223c2bb7.txt) | 03\u002F06\u002F25 | [log](records\u002Ftrack_2_medium\u002F2025-03-06_LongerCooldown\u002F779c041a-2a37-45d2-a18b-ec0f223c2bb7.txt) | @YouJiacheng\n6 | 25.95 minutes* | [2x MLP wd, qkv norm, all_reduce\u002Fopt.step() overlap, optimized skip pattern](https:\u002F\u002Fx.com\u002FYouJiacheng\u002Fstatus\u002F1905861218138804534) | 03\u002F25\u002F25 | [log](records\u002Ftrack_2_medium\u002F2025-03-25_ArchOptTweaks\u002Ftrain_gpt-20250329.txt) | @YouJiacheng\n7 | 25.29 minutes | [Remove FP8 head; ISRU logits softcap; New sharded mixed precision Muon; merge weights](https:\u002F\u002Fx.com\u002FYouJiacheng\u002Fstatus\u002F1912570883878842527) | 04\u002F16\u002F25 | [log](records\u002Ftrack_2_medium\u002F2025-04-16_Record7\u002F223_3310d0b1-b24d-48ee-899f-d5c2a254a195.txt) | @YouJiacheng\n8 | 24.50 minutes | [Cubic sliding window size schedule, 2× max window size (24.84 minutes)](https:\u002F\u002Fx.com\u002Fjadenj3o\u002Fstatus\u002F1914893086276169754) [24.5min repro](https:\u002F\u002Fx.com\u002FYouJiacheng\u002Fstatus\u002F1915667616913645985) | 04\u002F22\u002F25 | [log](records\u002Ftrack_2_medium\u002F2025-04-22_Record8\u002F075_640429f2-e726-4e83-aa27-684626239ffc.txt) | @jadenj3o\n9 | 24.12 minutes | [Add two value embeddings](https:\u002F\u002Fsnimu.github.io\u002F2025\u002F10\u002F07\u002Fmodded-nanogpt-value-embeddings.html) | 08\u002F28\u002F25 | [log](records\u002Ftrack_2_medium\u002F2025-08-28_NewValemb\u002F036_61ef4351-7b68-4897-b440-a99221a1a629.txt), [PR](https:\u002F\u002Fgithub.com\u002FKellerJordan\u002Fmodded-nanogpt\u002Fpull\u002F119) | @snimu\n10 | 24.07 minutes | [Second input 
embedding](https:\u002F\u002Fsnimu.github.io\u002F2025\u002F10\u002F10\u002Fmodded-nanogpt-x0.html) | 09\u002F11\u002F25 | [log](records\u002Ftrack_2_medium\u002F2025-09-11_SecondInputEmbed\u002F000_592014ec-6781-4f59-b274-c4af68ccfe75.txt), [PR](https:\u002F\u002Fgithub.com\u002FKellerJordan\u002Fmodded-nanogpt\u002Fpull\u002F124) | @snimu\n11 | 23.45 minutes | Upgrade from torch 2.7 to torch==2.10.0.dev20251210+cu126 | - | - | -\n12 | 23.28 minutes | Snoo Optimizer (Outer optimizer around Adam and Muon) | 09\u002F16\u002F25 | [log](records\u002Ftrack_2_medium\u002F2025-09-16_Snoo\u002F000_01db7a67-f715-4114-a7b5-6bfe23bac1b1.txt),[PR](https:\u002F\u002Fgithub.com\u002FKellerJordan\u002Fmodded-nanogpt\u002Fpull\u002F128) | @dominikkallusky\n13 | 23.14 minutes | EMA Wrapper on Muon | 09\u002F17\u002F25 | [log](records\u002Ftrack_2_medium\u002F2025-09-17_UpdateSmoothing\u002F001_8379f695-6bc3-4f76-b58b-8fadd3b6ebb0.txt),[PR](https:\u002F\u002Fgithub.com\u002FKellerJordan\u002Fmodded-nanogpt\u002Fpull\u002F129) | @acutkosky\n14 | 23.08 minutes | Combine both records 12 & 13 | 09\u002F30\u002F25 | [log](records\u002Ftrack_2_medium\u002F2025-09-30_SmoothedSnooMedium\u002F101_5bc91cd0-cb46-428c-a5da-9d8d228f1f97.txt),[PR](https:\u002F\u002Fgithub.com\u002FKellerJordan\u002Fmodded-nanogpt\u002Fpull\u002F137) | @acutkosky\n15 | 23.03 minutes | Backout (Skip from 2\u002F3 point to pre-lm_head) | 10\u002F04\u002F25 | [log](records\u002Ftrack_2_medium\u002F2025-10-04_GPT2MediumLayerReuse\u002F000_cc3943e4-02b5-4ae3-9441-839d32dfd9b2.txt),[PR](https:\u002F\u002Fgithub.com\u002FKellerJordan\u002Fmodded-nanogpt\u002Fpull\u002F139) | @snimu\n16 | 22.99 minutes | Smear-MTP | 11\u002F02\u002F25 | [log](records\u002Ftrack_2_medium\u002F2025-11-02-Smear-MTP\u002F000_3b50518d-d542-44bc-8566-3abf633f83ad.txt),[PR](https:\u002F\u002Fgithub.com\u002FKellerJordan\u002Fmodded-nanogpt\u002Fpull\u002F151) | @snimu\n17 | 22.98 minutes | Remove Redundant Mask Op | 11\u002F12\u002F25 | 
[log](records\u002Ftrack_2_medium\u002F2025-11-12_BlockMaskRedundantOp\u002F000_3b22a9d4-b52e-4916-99bf-3d48b38747a7.txt),[PR](https:\u002F\u002Fgithub.com\u002FKellerJordan\u002Fmodded-nanogpt\u002Fpull\u002F157\u002F) | @manikbhandari\n18 | 17.35 minutes | Bulk transfer short track features | 12\u002F31\u002F25 | [log](records\u002Ftrack_2_medium\u002F2025-12-31_BulkSmallTrackTransfer\u002F354be270-7d41-44b7-8064-f040923f024f.txt),[PR](https:\u002F\u002Fgithub.com\u002FKellerJordan\u002Fmodded-nanogpt\u002Fpull\u002F188) | -\n---\n\n### Q: What is the point of NanoGPT speedrunning?\n\nA: The officially stated goal of NanoGPT speedrunning is as follows: `gotta go fast`. But for something a little more verbose involving an argument for good benchmarking, here's some kind of manifesto, adorned with a blessing from the master. [https:\u002F\u002Fx.com\u002Fkarpathy\u002Fstatus\u002F1846790537262571739](https:\u002F\u002Fx.com\u002Fkarpathy\u002Fstatus\u002F1846790537262571739)\n\n### Q: What makes \"NanoGPT speedrunning\" not just another idiosyncratic benchmark?\n\nA: Because it is a *competitive* benchmark. In particular, if you attain a new speed record (using whatever method you want), there is an open invitation for you\nto post that record (on arXiv or X) and thereby vacuum up all the clout for yourself. I will even help you do it by reposting you as much as I can.\n\n\u003C!--On the contrary, for example, the benchmark used in the [Sophia](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.14342) paper does *not* have this property.\nThere is no such open invitation for anyone to compete on the benchmark they used. 
In particular, if, for a random and definitely not weirdly specific example, you happen to find better AdamW hyperparameters for their training setup than\nthe ones they used which significantly close the gap between AdamW and their proposed optimizer,\nthen there is no clear path for you to publish that result in *any* form.\nYou could try posting it on X.com, but then you would be risking being perceived as aggressive\u002Fconfrontational, which is *not a good look* in this racket.\nSo if you're rational, the result probably just dies with you and no one else learns anything\n(unless you're in a frontier lab, in which case you can do a nice internal writeup. Boy I'd love to get my hands on those writeups).-->\n\n[\"Artificial intelligence advances by inventing games and gloating to goad others to play\" - Professor Ben Recht](https:\u002F\u002Fwww.argmin.net\u002Fp\u002Ftoo-much-information)\n\n### Q: NanoGPT speedrunning is cool and all, but meh it probably won't scale and is just overfitting to val loss\n\nA: This is hard to refute, since \"at scale\" is an infinite category (what if the methods stop working only for >100T models?), making it impossible to fully prove.\nAlso, I would agree that some of the methods used in the speedrun are unlikely to scale, particularly those which *impose additional structure* on the network, such as logit softcapping.\nBut if the reader cares about 1.5B models, they might be convinced by this result:\n\n*Straightforwardly scaling up the speedrun (10\u002F18\u002F24 version) to 1.5B parameters yields a model with GPT-2 (1.5B)-level HellaSwag performance 2.5x more cheaply than [@karpathy's baseline](https:\u002F\u002Fgithub.com\u002Fkarpathy\u002Fllm.c\u002Fdiscussions\u002F677) ($233 instead of $576):*\n\n![](img\u002Fnanogpt_speedrun51.png)\n[[reproducible 
log](https:\u002F\u002Fgithub.com\u002FKellerJordan\u002Fmodded-nanogpt\u002Fblob\u002Fmaster\u002Frecords\u002Ftrack_1_short\u002F2024-10-20_ScaleUp1B\u002Fad8d7ae5-7b2d-4ee9-bc52-f912e9174d7a.txt)]\n![](img\u002Fnanogpt_speedrun52.png)\n\n---\n\n## [Muon optimizer](https:\u002F\u002Fgithub.com\u002FKellerJordan\u002FMuon)\n\nMuon is defined as follows:\n\n![](img\u002Falgo_optimizer.png)\n\nWhere NewtonSchulz5 is the following Newton-Schulz iteration [2, 3], which approximately replaces `G` with `U @ V.T` where `U, S, V = G.svd()`.\n```python\n@torch.compile\ndef zeroth_power_via_newtonschulz5(G, steps=5, eps=1e-7):\n    assert len(G.shape) == 2\n    a, b, c = (3.4445, -4.7750,  2.0315)\n    X = G.bfloat16() \u002F (G.norm() + eps)\n    if G.size(0) > G.size(1):\n        X = X.T \n    for _ in range(steps):\n        A = X @ X.T\n        B = b * A + c * A @ A\n        X = a * X + B @ X\n    if G.size(0) > G.size(1):\n        X = X.T \n    return X.to(G.dtype)\n```\n\nFor this training scenario, Muon has the following favorable properties:\n* Lower memory usage than Adam\n* ~1.5x better sample-efficiency\n* \u003C2% wallclock overhead\n\n\n### Provenance\n\nMany of the choices made to generate this optimizer were obtained experimentally by our pursuit of [CIFAR-10 speedrunning](https:\u002F\u002Fgithub.com\u002FKellerJordan\u002Fcifar10-airbench).\nIn particular, we experimentally obtained the following practices:\n* Using Nesterov momentum inside the update, with orthogonalization applied after momentum.\n* Using a specifically quintic Newton-Schulz iteration as the method of orthogonalization.\n* Using non-convergent coefficients for the quintic polynomial in order to maximize slope at zero, and thereby minimize the number of necessary Newton-Schulz iterations.\nIt turns out that the variance doesn't actually matter that much, so we end up with a quintic that rapidly converges to the range 0.68, 1.13 upon repeated application, rather than converging more slowly 
to 1.\n* Running the Newton-Schulz iteration in bfloat16 (whereas Shampoo implementations often depend on inverse-pth-roots run in fp32 or fp64).\n\nOur use of a Newton-Schulz iteration for orthogonalization traces to [Bernstein & Newhouse (2024)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.20325),\nwho suggested it as a way to compute Shampoo [5, 6] preconditioners, and theoretically explored Shampoo without preconditioner accumulation.\nIn particular, Jeremy Bernstein @jxbz sent us the draft, which caused us to experiment with various Newton-Schulz iterations as the\northogonalization method for this optimizer.\nIf we had used SVD instead of a Newton-Schulz iteration, this optimizer would have been too slow to be useful.\nBernstein & Newhouse also pointed out that Shampoo without preconditioner accumulation is equivalent to steepest descent in the spectral norm,\nand therefore Shampoo can be thought of as a way to smooth out spectral steepest descent.\nThe proposed optimizer can be thought of as a second way of smoothing spectral steepest descent, with a different set of memory and runtime tradeoffs\ncompared to Shampoo.\n\n---\n\n## Running on fewer GPUs\n\n* To run experiments on fewer GPUs, simply modify `run.sh` to have a different `--nproc_per_node`. This should not change the behavior of the training.\n* If you're running out of memory, you may need to reduce the sequence length for FlexAttention (which does change the training; see [here](https:\u002F\u002Fgithub.com\u002FKellerJordan\u002Fmodded-nanogpt\u002Fpull\u002F38) for a guide).\n\n---\n\n## References\n\n1. [Guilherme Penedo et al. \"The FineWeb datasets: Decanting the web for the finest text data at scale.\" arXiv preprint arXiv:2406.17557 (2024).](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.17557)\n2. Nicholas J. Higham. Functions of Matrices. Society for Industrial and Applied Mathematics (2008). Equation 5.22.\n3. Günther Schulz. Iterative Berechnung der reziproken Matrix. Z. Angew. 
Math. Mech., 13:57–59 (1933).\n4. [Jeremy Bernstein and Laker Newhouse. \"Old Optimizer, New Norm: An Anthology.\" arXiv preprint arXiv:2409.20325 (2024).](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.20325)\n5. [Vineet Gupta, Tomer Koren, and Yoram Singer. \"Shampoo: Preconditioned stochastic tensor optimization.\" International Conference on Machine Learning. PMLR, 2018.](https:\u002F\u002Farxiv.org\u002Fabs\u002F1802.09568)\n6. [Rohan Anil et al. \"Scalable second order optimization for deep learning.\" arXiv preprint arXiv:2002.09018 (2020).](https:\u002F\u002Farxiv.org\u002Fabs\u002F2002.09018)\n7. [Alexander Hägele et al. \"Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations.\" arXiv preprint arXiv:2405.18392 (2024).](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.18392)\n8. [Zhanchao Zhou et al. \"Value Residual Learning For Alleviating Attention Concentration In Transformers.\" arXiv preprint arXiv:2410.17897 (2024).](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.17897)\n9. [Gemma Team et al. \"Gemma 2: Improving open language models at a practical size.\" arXiv preprint arXiv:2408.00118 (2024).](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.00118)\n10. [Alec Radford et al. 
\"Language models are unsupervised multitask learners.\" OpenAI blog 1.8 (2019).](https:\u002F\u002Fcdn.openai.com\u002Fbetter-language-models\u002Flanguage_models_are_unsupervised_multitask_learners.pdf)\n\n## Citation\n\n```\n@misc{modded_nanogpt_2024,\n  author       = {Keller Jordan and Jeremy Bernstein and Brendan Rappazzo and\n                  @fernbear.bsky.social and Boza Vlado and You Jiacheng and\n                  Franz Cesista and Braden Koszarsky and @Grad62304977},\n  title        = {modded-nanogpt: Speedrunning the NanoGPT baseline},\n  year         = {2024},\n  url          = {https:\u002F\u002Fgithub.com\u002FKellerJordan\u002Fmodded-nanogpt}\n}\n```\n\n\u003Cimg src=\"img\u002Fdofa.jpg\" alt=\"itsover_wereback\" style=\"width:100%;\">\n\n","Modded-NanoGPT 是一个专注于使用8个NVIDIA H100 GPU在90秒内训练出达到3.28交叉熵损失的语言模型的项目。其核心功能包括通过现代架构如旋转嵌入、QK-Norm和ReLU²，以及Muon优化器等技术显著提高训练效率。此外，项目还采用了FP8矩阵乘法、跳过连接、Flash Attention 3等多种先进技术和系统优化手段。适合于追求高效语言模型训练的研究者和开发者，特别是在资源有限但需要快速获得高质量模型的情况下使用。",2,"2026-06-11 03:40:21","high_star"]