[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-73960":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":25,"hasPages":23,"topics":26,"createdAt":10,"pushedAt":10,"updatedAt":27,"readmeContent":28,"aiSummary":29,"trendingCount":16,"starSnapshotCount":16,"syncStatus":30,"lastSyncTime":31,"discoverSource":32},73960,"llm.c","karpathy\u002Fllm.c","karpathy","LLM training in simple, raw C\u002FCUDA","",null,"Cuda",30185,3635,292,92,0,46,105,320,138,45,"MIT License",false,"master",true,[],"2026-06-12 02:03:20","# llm.c\n\nLLMs in simple, pure C\u002FCUDA with no need for 245MB of PyTorch or 107MB of cPython. Current focus is on pretraining, in particular reproducing the [GPT-2](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fgpt-2) and [GPT-3](https:\u002F\u002Farxiv.org\u002Fabs\u002F2005.14165) miniseries, along with a parallel PyTorch reference implementation in [train_gpt2.py](train_gpt2.py). You'll recognize this file as a slightly tweaked [nanoGPT](https:\u002F\u002Fgithub.com\u002Fkarpathy\u002FnanoGPT), an earlier project of mine. Currently, llm.c is a bit faster than PyTorch Nightly (by about 7%). In addition to the bleeding edge mainline code in [train_gpt2.cu](train_gpt2.cu), we have a simple reference CPU fp32 implementation in ~1,000 lines of clean code in one file [train_gpt2.c](train_gpt2.c). I'd like this repo to only maintain C and CUDA code. Ports to other languages or repos are very welcome, but should be done in separate repos, and I am happy to link to them below in the \"notable forks\" section. 
Developer coordination happens in the [Discussions](https:\u002F\u002Fgithub.com\u002Fkarpathy\u002Fllm.c\u002Fdiscussions) and on Discord, either the `#llmc` channel on the [Zero to Hero](https:\u002F\u002Fdiscord.gg\u002F3zy8kqD9Cp) channel, or on `#llmdotc` on [GPU MODE](https:\u002F\u002Fdiscord.gg\u002Fgpumode) Discord.\n\n## quick start\n\nThe best introduction to the llm.c repo today is reproducing the GPT-2 (124M) model. [Discussion #481](https:\u002F\u002Fgithub.com\u002Fkarpathy\u002Fllm.c\u002Fdiscussions\u002F481) steps through this in detail. We can reproduce other models from the GPT-2 and GPT-3 series in both llm.c and in the parallel implementation of PyTorch. Have a look at the [scripts README](scripts\u002FREADME.md).\n\ndebugging tip: when you run the `make` command to build the binary, modify it by replacing `-O3` with `-g` so you can step through the code in your favorite IDE (e.g. vscode).\n\n## quick start (1 GPU, fp32 only)\n\nIf you won't be training on multiple nodes, aren't interested in mixed precision, and are interested in learning CUDA, the fp32 (legacy) files might be of interest to you. These are files that were \"checkpointed\" early in the history of llm.c and frozen in time. They are simpler, more portable, and possibly easier to understand. Run the 1 GPU, fp32 code like this:\n\n```bash\nchmod u+x .\u002Fdev\u002Fdownload_starter_pack.sh\n.\u002Fdev\u002Fdownload_starter_pack.sh\nmake train_gpt2fp32cu\n.\u002Ftrain_gpt2fp32cu\n```\n\nThe download_starter_pack.sh script is a quick & easy way to get started and it downloads a bunch of .bin files that help get you off the ground. 
These contain: 1) the GPT-2 124M model saved in fp32 and in bfloat16, 2) a \"debug state\" used in unit testing (a small batch of data, and target activations and gradients), 3) the GPT-2 tokenizer, and 4) the tokenized [tinyshakespeare](https:\u002F\u002Fraw.githubusercontent.com\u002Fkarpathy\u002Fchar-rnn\u002Fmaster\u002Fdata\u002Ftinyshakespeare\u002Finput.txt) dataset. Alternatively, instead of running the .sh script, you can re-create these artifacts manually as follows:\n\n```bash\npip install -r requirements.txt\npython dev\u002Fdata\u002Ftinyshakespeare.py\npython train_gpt2.py\n```\n\n## quick start (CPU)\n\nThe \"I am so GPU poor that I don't even have one GPU\" section. You can still enjoy seeing llm.c train! But you won't go too far. Just like the fp32 version above, the CPU version is an even earlier checkpoint in the history of llm.c, back when it was just a simple reference implementation in C. For example, instead of training from scratch, you can finetune a GPT-2 small (124M) to output Shakespeare-like text:\n\n```bash\nchmod u+x .\u002Fdev\u002Fdownload_starter_pack.sh\n.\u002Fdev\u002Fdownload_starter_pack.sh\nmake train_gpt2\nOMP_NUM_THREADS=8 .\u002Ftrain_gpt2\n```\n\nIf you'd prefer to avoid running the starter pack script, then as mentioned in the previous section you can reproduce the exact same .bin files and artifacts by running `python dev\u002Fdata\u002Ftinyshakespeare.py` and then `python train_gpt2.py`.\n\nThe above lines (1) download an already tokenized [tinyshakespeare](https:\u002F\u002Fraw.githubusercontent.com\u002Fkarpathy\u002Fchar-rnn\u002Fmaster\u002Fdata\u002Ftinyshakespeare\u002Finput.txt) dataset and download the GPT-2 (124M) weights, (2) init from them in C and train for 40 steps on tinyshakespeare with AdamW (using batch size 4, context length only 64), evaluate validation loss, and sample some text. 
Honestly, unless you have a beefy CPU (and can crank up the number of OMP threads in the launch command), you're not going to get that far on CPU training LLMs, but it might be a good demo\u002Freference. The output looks like this on my MacBook Pro (Apple Silicon M3 Max):\n\n```\n[GPT-2]\nmax_seq_len: 1024\nvocab_size: 50257\nnum_layers: 12\nnum_heads: 12\nchannels: 768\nnum_parameters: 124439808\ntrain dataset num_batches: 1192\nval dataset num_batches: 128\nnum_activations: 73323776\nval loss 5.252026\nstep 0: train loss 5.356189 (took 1452.121000 ms)\nstep 1: train loss 4.301069 (took 1288.673000 ms)\nstep 2: train loss 4.623322 (took 1369.394000 ms)\nstep 3: train loss 4.600470 (took 1290.761000 ms)\n... (truncated) ...\nstep 39: train loss 3.970751 (took 1323.779000 ms)\nval loss 4.107781\ngenerating:\n---\nCome Running Away,\nGreater conquer\nWith the Imperial blood\nthe heaviest host of the gods\ninto this wondrous world beyond.\nI will not back thee, for how sweet after birth\nNetflix against repounder,\nwill not\nflourish against the earlocks of\nAllay\n---\n```\n\n## datasets\n\nThe data files inside `\u002Fdev\u002Fdata\u002F(dataset).py` are responsible for downloading, tokenizing and saving the tokens to .bin files, readable easily from C. So for example when you run:\n\n```bash\npython dev\u002Fdata\u002Ftinyshakespeare.py\n```\n\nWe download and tokenize the [tinyshakespeare](https:\u002F\u002Fraw.githubusercontent.com\u002Fkarpathy\u002Fchar-rnn\u002Fmaster\u002Fdata\u002Ftinyshakespeare\u002Finput.txt) dataset. The output of this looks like this:\n\n```\nwriting 32,768 tokens to .\u002Fdev\u002Fdata\u002Ftinyshakespeare\u002Ftiny_shakespeare_val.bin\nwriting 305,260 tokens to .\u002Fdev\u002Fdata\u002Ftinyshakespeare\u002Ftiny_shakespeare_train.bin\n```\n\nThe .bin files contain a short header (1024 bytes) and then a stream of tokens in uint16, indicating the token ids with the GPT-2 tokenizer. 
More datasets are available in `\u002Fdev\u002Fdata`.\n\n## test\n\nI am also attaching a simple unit test for making sure our C code agrees with the PyTorch code. On the CPU as an example, compile and run with:\n\n```bash\nmake test_gpt2\n.\u002Ftest_gpt2\n```\n\nThis now loads the `gpt2_124M_debug_state.bin` file that gets written by train_gpt2.py, runs a forward pass, compares the logits and loss with the PyTorch reference implementation, then it does 10 iterations of training with Adam and makes sure the losses match PyTorch. To test the GPU version we run:\n\n```bash\n# fp32 test (cudnn not supported)\nmake test_gpt2cu PRECISION=FP32 && .\u002Ftest_gpt2cu\n# mixed precision cudnn test\nmake test_gpt2cu USE_CUDNN=1 && .\u002Ftest_gpt2cu\n```\n\nThis tests both the fp32 path and the mixed precision path. The test should pass and print `overall okay: 1`.\n\n## tutorial\n\nI attached a very small tutorial here, in [doc\u002Flayernorm\u002Flayernorm.md](doc\u002Flayernorm\u002Flayernorm.md). It's a simple, step-by-step guide to implementing a single layer of the GPT-2 model, the layernorm layer. This is a good starting point to understand how the layers are implemented in C.\n\n**flash attention**. As of May 1, 2024 we use the Flash Attention from cuDNN. Because cuDNN bloats the compile time from a few seconds to ~minute and this code path is right now very new, this is disabled by default. You can enable it by compiling like this:\n\n```bash\nmake train_gpt2cu USE_CUDNN=1\n```\n\nThis will try to compile with cudnn and run it. You have to have cuDNN installed on your system. The [cuDNN installation instructions](https:\u002F\u002Fdeveloper.nvidia.com\u002Fcudnn) with apt-get will grab the default set of cuDNN packages. For a minimal setup, the cuDNN dev package is sufficient, e.g. 
on Ubuntu 22.04 for CUDA 12.x:\n\n```bash\nwget https:\u002F\u002Fdeveloper.download.nvidia.com\u002Fcompute\u002Fcuda\u002Frepos\u002Fubuntu2204\u002Fx86_64\u002Fcuda-keyring_1.1-1_all.deb\nsudo dpkg -i cuda-keyring_1.1-1_all.deb\nsudo apt-get update\nsudo apt-get -y install libcudnn9-dev-cuda-12\n```\n\nOn top of this you need the [cuDNN frontend](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcudnn-frontend\u002Ftree\u002Fmain), but this is just header files. Simply clone the repo to your disk. The Makefile currently looks for it in either your home directory or the current directory. If you have put it elsewhere, add `CUDNN_FRONTEND_PATH=\u002Fpath\u002Fto\u002Fyour\u002Fcudnn-frontend\u002Finclude` to the `make` command-line.\n\n## multi-GPU training\n\nMake sure you install MPI and NCCL, e.g. on Linux:\n\n```bash\nsudo apt install openmpi-bin openmpi-doc libopenmpi-dev\n```\n\nFor NCCL follow the instructions from the [official website](https:\u002F\u002Fdeveloper.nvidia.com\u002Fnccl\u002Fnccl-download) (e.g. network installer)\n\nand then:\n\n```bash\nmake train_gpt2cu\nmpirun -np \u003Cnumber of GPUs> .\u002Ftrain_gpt2cu\n```\n\nor simply run one of our scripts under `.\u002Fscripts\u002F`.\n\n## multi-node training\n\nMake sure you've installed `NCCL` following instructions from [multi-GPU](#multi-gpu-training) section.\n\nThere are 3 ways we currently support that allow you to run multi-node training:\n1) Use OpenMPI to exchange nccl id and initialize NCCL. See e.g. `.\u002Fscripts\u002Fmulti_node\u002Frun_gpt2_124M_mpi.sh` script for details.\n2) Use shared file system to init NCCL. See `.\u002Fscripts\u002Fmulti_node\u002Frun_gpt2_124M_fs.sbatch` script for details.\n3) Use TCP sockets to init NCCL. 
See `.\u002Fscripts\u002Fmulti_node\u002Frun_gpt2_124M_tcp.sbatch` script for details.\n\nNote:\n* If you're running in a slurm environment and your slurm doesn't support PMIx (which we assume will be a common situation given that `slurm-wlm` dropped PMIx support) you will have to use FS (2) or TCP (3) approach. To test whether your slurm supports PMIx run: `srun --mpi=list` and see whether you get `pmix` in the output.\n* If you don't have slurm set up, you can kick off a multi-node run using `mpirun` - MPI (1).\n\nNone of these 3 methods is superior, we just offer you options so that you can run in your specific environment.\n\n## experiments \u002F sweeps\n\nJust as an example process to sweep learning rates on a machine with 4 GPUs on TinyStories. Run a shell script `sweep.sh` (after you of course `chmod u+x sweep.sh`):\n\n```bash\n#!\u002Fbin\u002Fbash\n\nlearning_rates=(3e-5 1e-4 3e-4 1e-3)\n\nfor i in {0..3}; do\n    export CUDA_VISIBLE_DEVICES=$i\n    screen -dmS \"tr$i\" bash -c \".\u002Ftrain_gpt2cu -i data\u002FTinyStories -v 250 -s 250 -g 144 -l ${learning_rates[$i]} -o stories$i.log\"\ndone\n\n# you can bring these down with\n# screen -ls | grep -E \"tr[0-3]\" | cut -d. -f1 | xargs -I {} screen -X -S {} quit\n```\n\nThis example opens up 4 screen sessions and runs the four commands with different LRs. This writes the log files `stories$i.log` with all the losses, which you can plot as you wish in Python. A quick example of how to parse and plot these logfiles is in [dev\u002Fvislog.ipynb](dev\u002Fvislog.ipynb).\n\n## repo\n\nA few more words on what I want this repo to be:\n\nFirst, I want `llm.c` to be a place for education. E.g. our `dev\u002Fcuda` folder is a place for a library of kernels for all the layers that are manually hand-written and very well documented, starting from very simple kernels all the way to more complex \u002F faster kernels. 
If you have a new kernel with various different tradeoffs, please feel free to contribute it here.\n\nThat said, I also want `llm.c` to be very fast too, even practically useful to train networks. E.g. to start, we should be able to reproduce the big GPT-2 (1.6B) training run. This requires that we incorporate whatever fastest kernels there are, including the use of libraries such as cuBLAS, cuBLASLt, CUTLASS, cuDNN, etc. I also think doing so serves an educational purpose to establish an expert upper bound, and a unit of measurement, e.g. you could say that your manually written kernels are 80% of cuBLAS speed, etc. Then you can choose to do a super fast run, or you can choose to \"drag and drop\" whatever manual kernels you wish to use, and run with those.\n\nHowever, as a constraint, I want to keep the mainline `llm.c` in the root folder simple and readable. If there is a PR that e.g. improves performance by 2% but it \"costs\" 500 lines of complex C code, and maybe an exotic 3rd party dependency, I may reject the PR because the complexity is not worth it. As a concrete example - making cuBLAS for matmuls the default in the root training loop is a no-brainer: it makes the mainline code much faster, it is a single line of interpretable code, and it is a very common dependency. On the side of this, we can have manual implementations that can compete with cuBLAS in `dev\u002Fcuda`.\n\nLastly, I will be a lot more sensitive to complexity in the root folder of the project, which contains the main \u002F default files of the project. 
In comparison, the `dev\u002F` folder is a bit more of a scratch space for us to develop a library of kernels or classes and share useful or related or educational code, and some of this code could be ok to be (locally) complex.\n\n## notable forks\n\n- AMD support\n  - [llm.c](https:\u002F\u002Fgithub.com\u002Fanthonix\u002Fllm.c) by @[anthonix](https:\u002F\u002Fgithub.com\u002Fanthonix): support for AMD devices, such as the 7900 XTX\n\n- C#\n  - [llm.cs](https:\u002F\u002Fgithub.com\u002Fazret\u002Fllm.cs) by @[azret](https:\u002F\u002Fgithub.com\u002Fazret): a C# port of this project\n  - [Llm.cs](https:\u002F\u002Fgithub.com\u002Fnietras\u002FLlm.cs) by @[nietras](https:\u002F\u002Fgithub.com\u002Fnietras): a C# port of this project with a focus on being easy to get started with on any platform. Clone and run ✅\n\n- CUDA C++\n  - [llm.cpp](https:\u002F\u002Fgithub.com\u002Fgevtushenko\u002Fllm.c) by @[gevtushenko](https:\u002F\u002Fgithub.com\u002Fgevtushenko): a port of this project using the [CUDA C++ Core Libraries](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl)\n     - This fork was presented in [this lecture](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=WiB_3Csfj_Q) in the [GPU MODE Discord Server](https:\u002F\u002Fdiscord.gg\u002Fcudamode)\n\n- C++\u002FCUDA\n  - [llm.cpp](https:\u002F\u002Fgithub.com\u002Fzhangpiu\u002Fllm.cpp\u002Ftree\u002Fmaster\u002Fllmcpp) by @[zhangpiu](https:\u002F\u002Fgithub.com\u002Fzhangpiu): a port of this project using [Eigen](https:\u002F\u002Fgitlab.com\u002Flibeigen\u002Feigen), supporting CPU\u002FCUDA.\n\n- WebGPU C++\n  - [gpu.cpp](https:\u002F\u002Fgithub.com\u002FAnswerDotAI\u002Fgpu.cpp) by @[austinvhuang](https:\u002F\u002Fgithub.com\u002Faustinvhuang): a library for portable GPU compute in C++ using native WebGPU. 
Aims to be a general-purpose library, but also porting llm.c kernels to WGSL.\n  \n- C++\n  - [llm.cpp](https:\u002F\u002Fgithub.com\u002FGaoYusong\u002Fllm.cpp) by @[GaoYusong](https:\u002F\u002Fgithub.com\u002FGaoYusong): a port of this project featuring a C++ single-header [tinytorch.hpp](https:\u002F\u002Fgithub.com\u002FGaoYusong\u002Fllm.cpp\u002Fblob\u002Fmain\u002Ftinytorch.hpp) library\n\n- Go\n  - [llm.go](https:\u002F\u002Fgithub.com\u002Fjoshcarp\u002Fllm.go) by @[joshcarp](https:\u002F\u002Fgithub.com\u002Fjoshcarp): a Go port of this project\n\n- Java\n  - [llm.java](https:\u002F\u002Fgithub.com\u002Fharryjackson\u002Fllm.java) by @[harryjackson](https:\u002F\u002Fgithub.com\u002Fharryjackson): a Java port of this project\n\n- Metal\n  - [llm.metal](https:\u002F\u002Fgithub.com\u002Fregrettable-username\u002Fllm.metal) by @[regrettable-username](https:\u002F\u002Fgithub.com\u002Fregrettable-username): LLM training in simple, raw C\u002FMetal Shading Language\n\n- Mojo\n  - [llm.🔥](https:\u002F\u002Fgithub.com\u002Fdorjeduck\u002Fllm.mojo) by @[dorjeduck](https:\u002F\u002Fgithub.com\u002Fdorjeduck): a Mojo port of this project\n\n- OpenCL\n  - [llm.c](https:\u002F\u002Fgithub.com\u002Fkrrishnarraj\u002Fllm.c) by @[krrishnarraj](https:\u002F\u002Fgithub.com\u002Fkrrishnarraj): an OpenCL port of this project\n\n- Rust\n  -  [llm.rs](https:\u002F\u002Fgithub.com\u002Fyijunyu\u002Fllm.rs) by @[Yijun Yu](https:\u002F\u002Fgithub.com\u002Fyijunyu): a Rust rewrite with the aim to have same performance\n  -  [llm.rs](https:\u002F\u002Fgithub.com\u002FToJen\u002Fllm.rs) by @[ToJen](https:\u002F\u002Fgithub.com\u002FToJen): a Rust port of this project\n\n- Swift\n  - [llm.swift](https:\u002F\u002Fgithub.com\u002Fotabuzzman\u002Fllm.swift) by @[otabuzzman](https:\u002F\u002Fgithub.com\u002Fotabuzzman): a Swift port of this project\n\n- Zig\n  - [llm.zig](https:\u002F\u002Fgithub.com\u002FSaimirbaci\u002Fllm.zig) by 
@[saimirbaci](https:\u002F\u002Fgithub.com\u002FSaimirbaci): a Zig port of this project\n \n- Habana Gaudi2\n  - [llm.tpc](https:\u002F\u002Fgithub.com\u002Fabhilash1910\u002Fllm.tpc) by @[abhilash1910](https:\u002F\u002Fgithub.com\u002Fabhilash1910): a Habana Gaudi2 port of this project \n\n- Nim\n  - [llm.nim](https:\u002F\u002Fgithub.com\u002FVindaar\u002Fllm.nim) by @[Vindaar](https:\u002F\u002Fgithub.com\u002FVindaar): a Nim port of this project\n\n## discussions\n\nWays of organizing development:\n\n- Experiencing a concrete issue with the repo? Use [Issues](https:\u002F\u002Fgithub.com\u002Fkarpathy\u002Fllm.c\u002Fissues).\n- Have some code to contribute? Open a [PR](https:\u002F\u002Fgithub.com\u002Fkarpathy\u002Fllm.c\u002Fpulls).\n- Chat about the repo, ask questions, etc.? Look at [Discussions](https:\u002F\u002Fgithub.com\u002Fkarpathy\u002Fllm.c\u002Fdiscussions).\n- Something faster? I created a new `#llmc` channel on my [Zero to Hero Discord channel](https:\u002F\u002Fdiscord.gg\u002F3zy8kqD9Cp).\n\n## license\n\nMIT\n","llm.c is a language model training project written in pure C\u002FCUDA, with no dependency on the large PyTorch or cPython libraries. Its core features include pretraining the GPT-2 and GPT-3 series of models, along with a parallel PyTorch reference implementation. The project achieves efficient training with concise code and is currently about 7% faster than PyTorch Nightly. It also provides a simple CPU fp32 implementation that is easy to understand and debug. It is suited to researchers and developers who are interested in CUDA programming and want to understand the details of language model training.",2,"2026-06-11 03:48:08","high_star"]