[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-75113":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":8,"htmlUrl":8,"language":8,"languages":8,"totalLinesOfCode":8,"stars":9,"forks":10,"watchers":11,"openIssues":12,"contributorsCount":13,"subscribersCount":13,"size":13,"stars1d":14,"stars7d":15,"stars30d":16,"stars90d":13,"forks30d":13,"starsTrendScore":17,"compositeScore":18,"rankGlobal":8,"rankLanguage":8,"license":8,"archived":19,"fork":19,"defaultBranch":20,"hasWiki":19,"hasPages":19,"topics":21,"createdAt":8,"pushedAt":8,"updatedAt":22,"readmeContent":23,"aiSummary":24,"trendingCount":13,"starSnapshotCount":13,"syncStatus":25,"lastSyncTime":26,"discoverSource":27},75113,"llm-from-scratch","angelos-p\u002Fllm-from-scratch","angelos-p",null,3041,306,25,1,0,14,35,382,42,29.46,false,"main",[],"2026-06-12 02:03:32","# Train Your Own LLM From Scratch\n\nA hands-on workshop where you write every piece of a GPT training pipeline yourself, understanding what each component does and why.\n\nAndrej Karpathy's [nanoGPT](https:\u002F\u002Fgithub.com\u002Fkarpathy\u002FnanoGPT) was my first real exposure to LLMs and transformers. Seeing how a working language model could be built in a few hundred lines of PyTorch completely changed how I thought about AI and inspired me to go deeper into the space.\n\nThis workshop is my attempt to give others that same experience. nanoGPT targets reproducing GPT-2 (124M params) and covers a lot of ground. This project strips it down to the essentials and scales it down to a ~10M param model that trains on a laptop in under an hour, so it can be completed in a single workshop session.\n\n## What You'll Build\n\nA working GPT model trained from scratch on your own laptop or desktop, capable of generating Shakespeare-like text. 
You'll write:\n\n- **Tokenizer** — turning text into numbers the model can process\n- **Model architecture** — the transformer: embeddings, attention, feed-forward layers\n- **Training loop** — forward pass, loss, backprop, optimizer, learning rate scheduling\n- **Text generation** — sampling from your trained model\n\n## Prerequisites\n\n- Any laptop or desktop (Mac, Linux, or Windows)\n- Python 3.12+\n- Comfort reading Python code (you don't need ML experience)\n\nTraining uses Apple Silicon GPU (MPS), NVIDIA GPU (CUDA), or CPU automatically. Also works on [Google Colab](https:\u002F\u002Fcolab.research.google.com\u002F) — upload the files and run with `!python train.py`.\n\n## Getting Started\n\n### Local (recommended)\n\nInstall [uv](https:\u002F\u002Fdocs.astral.sh\u002Fuv\u002F) if you don't have it:\n\n```bash\n# macOS \u002F Linux\ncurl -LsSf https:\u002F\u002Fastral.sh\u002Fuv\u002Finstall.sh | sh\n\n# Windows\npowershell -ExecutionPolicy ByPass -c \"irm https:\u002F\u002Fastral.sh\u002Fuv\u002Finstall.ps1 | iex\"\n```\n\nThen set up the project:\n\n```bash\nuv sync\nmkdir scratchpad && cd scratchpad\n```\n\n### Google Colab\n\nIf you don't have a local setup, upload the repo to Colab and install dependencies:\n\n```python\n!pip install torch numpy tqdm tiktoken\n```\n\nUpload `data\u002Fshakespeare.txt` to your Colab files, then write your code in notebook cells or upload `.py` files and run them with `!python train.py`.\n\n---\n\nWork through the docs in order. Each part walks you through writing a piece of the pipeline, explaining what each component does and why. 
By the end, you'll have a working `model.py`, `train.py`, and `generate.py` that you wrote yourself.\n\n| Part | What You'll Write | Concepts |\n|------|-------------------|----------|\n| [Part 1: Tokenization](docs\u002F01-tokenization.md) | Character-level tokenizer | Character encoding, vocabulary size, why BPE fails on small data |\n| [Part 2: The Transformer](docs\u002F02-the-transformer.md) | Full GPT model architecture | Embeddings, self-attention, layer norm, MLP blocks |\n| [Part 3: The Training Loop](docs\u002F03-training-loop.md) | Complete training pipeline | Loss functions, AdamW, gradient clipping, LR scheduling |\n| [Part 4: Text Generation](docs\u002F04-text-generation.md) | Inference and sampling | Temperature, top-k, autoregressive decoding |\n| [Part 5: Putting It All Together](docs\u002F05-putting-it-together.md) | Train on real data, experiment | Loss curves, scaling experiments, next steps |\n| [Part 6: Competition](docs\u002F06-competition.md) | Train the best AI poet | Find datasets, scale up, submit your best poem |\n\n## Architecture: GPT at a Glance\n\n```\nInput Text\n    │\n    ▼\n┌─────────────────┐\n│   Tokenizer     │  \"hello\" → [20, 43, 50, 50, 53]  (character-level)\n└────────┬────────┘\n         ▼\n┌─────────────────┐\n│  Token Embed +  │  token IDs → vectors (n_embd dimensions)\n│  Position Embed │  + positional information\n└────────┬────────┘\n         ▼\n┌─────────────────┐\n│  Transformer    │  × n_layer\n│  Block:         │\n│  ┌────────────┐ │\n│  │ LayerNorm  │ │\n│  │ Self-Attn  │ │  n_head parallel attention heads\n│  │ + Residual │ │\n│  ├────────────┤ │\n│  │ LayerNorm  │ │\n│  │ MLP (FFN)  │ │  expand 4x, GELU, project back\n│  │ + Residual │ │\n│  └────────────┘ │\n└────────┬────────┘\n         ▼\n┌─────────────────┐\n│   LayerNorm     │\n│  Linear → logits│  vocab_size outputs (scores over next token; softmax gives probabilities)\n└─────────────────┘\n```\n\n## Model Configs for This Workshop\n\n| Config | Params | n_layer | n_head | 
n_embd | Train Time (M3 Pro) |\n|--------|--------|---------|--------|--------|---------------------|\n| Tiny | ~0.5M | 2 | 2 | 128 | ~5 min |\n| Small | ~4M | 4 | 4 | 256 | ~20 min |\n| **Medium (default)** | **~10M** | **6** | **6** | **384** | **~45 min** |\n\nAll configs use character-level tokenization (vocab_size=65) and block_size=256.\n\n## Tokenization: Characters vs BPE\n\nThis workshop uses **character-level** tokenization on Shakespeare. BPE tokenization (GPT-2's 50k vocab) works poorly on small datasets because most token bigrams are too rare for the model to learn patterns from.\n\n| Tokenizer | Vocab Size | Dataset Size Needed |\n|-----------|-----------|-------------------|\n| **Character-level** | ~65 | Small (Shakespeare, ~1MB) |\n| **BPE (tiktoken)** | 50,257 | Large (TinyStories+, 100MB+) |\n\nPart 5 covers switching to BPE for larger datasets.\n\n## Key References\n\n- [nanoGPT](https:\u002F\u002Fgithub.com\u002Fkarpathy\u002FnanoGPT) — The project this workshop is based on. 
Minimal GPT training in ~300 lines of PyTorch\n- [build-nanogpt video lecture](https:\u002F\u002Fgithub.com\u002Fkarpathy\u002Fbuild-nanogpt) — 4-hour video building GPT-2 from an empty file\n- [Karpathy's microgpt](http:\u002F\u002Fkarpathy.github.io\u002F2026\u002F02\u002F12\u002Fmicrogpt\u002F) — A full GPT in 200 lines of pure Python, no dependencies\n- [nanochat](https:\u002F\u002Fgithub.com\u002Fkarpathy\u002Fnanochat) — Full ChatGPT clone training pipeline\n- [Attention Is All You Need (2017)](https:\u002F\u002Farxiv.org\u002Fabs\u002F1706.03762) — The original transformer paper\n- [GPT-2 paper (2019)](https:\u002F\u002Fcdn.openai.com\u002Fbetter-language-models\u002Flanguage_models_are_unsupervised_multitask_learners.pdf) — Language models as unsupervised learners\n- [TinyStories paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.07759) — Why small models trained on curated data punch above their weight\n","This project helps users train their own language model from scratch, building a deep understanding of how it works by writing every part of a GPT training pipeline by hand. Core features include building a tokenizer, defining the model architecture (embedding layers, attention, feed-forward networks), implementing the training loop (forward pass, loss computation, backpropagation, and optimizer setup), and text generation. Technically, it simplifies the nanoGPT project, focuses on the essential components, and scales down to roughly 10M parameters so that the whole process runs quickly on a laptop (usually in under an hour). It suits learners who have some familiarity with Python but not necessarily a machine learning background, and it can be followed on any operating system that supports Python 3.12+, either locally or on Google Colab.",2,"2026-06-11 03:52:23","high_star"]