[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-27":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":25,"hasPages":23,"topics":26,"createdAt":10,"pushedAt":10,"updatedAt":42,"readmeContent":43,"aiSummary":44,"trendingCount":16,"starSnapshotCount":16,"syncStatus":45,"lastSyncTime":46,"discoverSource":47},27,"how-to-train-your-gpt","raiyanyahya\u002Fhow-to-train-your-gpt","raiyanyahya","Build a modern LLM from scratch. Every line commented. Explained like we are five.","",null,"Jupyter Notebook",2226,294,15,1,0,16,50,1203,48,29.41,"MIT License",false,"master",true,[27,28,29,30,31,32,33,34,35,36,37,38,39,40,41],"attention-mechanism","deep-learning","educational","from-scratch","gpt","language-model","llama","llm","machine-learning","natural-language-processing","python","pytorch","tokenisation","transformers","tutorial","2026-06-12 02:00:06","# 🧠 How to Train Your GPT\n\n> *A guide to building a world-class language model from absolute scratch. Taught like you're five. Built like you're an engineer.*\n>\n> *I made this with the goal of learning something I didn't understand completely. Specifically the attention part. 
I use AI a lot to understand key concepts and to verify them.*\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fchapters-12-blue\" alt=\"12 chapters\">\n  \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Flines-3%2C900%2B-green\" alt=\"3,900+ lines\">\n  \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcode%20commented-100%25-brightgreen\" alt=\"100% commented\">\n  \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fprerequisite-python%20basics-orange\" alt=\"Python basics only\">\n  \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Farchitecture-LLaMA%203%20style-purple\" alt=\"LLaMA 3 style\">\n  \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpurpose-learning%20only-lightgrey\" alt=\"Learning only\">\n\u003C\u002Fp>\n\n---\n\n## 📖 What Is This?\n\nThis is a **12-chapter, 3,900+ line interactive textbook** that teaches you how to build, train and run a modern language model from absolute scratch. The same family of architecture behind ChatGPT, Claude, LLaMA and Mistral.\n\nYou won't just read about Transformers. You'll **write every line yourself**: tokenizer, embeddings, attention, training loop, inference engine. Every single line is annotated to explain **what** it does and **why** it's there.\n\n---\n\n## 🤔 Why This Exists\n\nMost ML tutorials fall into one of two traps:\n\n| ❌ Too Shallow | ❌ Too Academic | ✅ This Guide |\n|---|---|---|\n| `model = GPT().fit(data)` | 40-page papers, dense notation | 5-year-old analogies → full working code |\n| You learn to call APIs | Assumes PhD in ML | Zero ML experience required |\n| No understanding of internals | No worked examples | Every line annotated with WHAT & WHY |\n\n**The goal:** After finishing, you won't just know that attention \"works\". You'll understand the variance argument behind `1\u002F√d_k`. How RoPE captures relative position through rotation. 
Why pre-norm beats post-norm for deep networks. And exactly where every gradient flows during backpropagation.\n\n---\n\n## 👥 Who Is This For?\n\n| 🧑‍💻 You Are... | 📚 You Need... |\n|---|---|\n| A Python developer curious about how ChatGPT actually works | Basic Python (functions, classes, lists). No ML experience |\n| A student who wants to deeply understand Transformers | Willingness to read ~3,500 lines of commented code |\n| An engineer evaluating LLM architectures | Understanding of tradeoffs (RoPE vs learned, RMSNorm vs LayerNorm) |\n| Someone who got lost at \"attention\" in other tutorials | Party analogy + worked numeric example with real numbers |\n\n**🔧 Prerequisites:** Python basics (variables, functions, classes, `pip install`). That's it. No calculus, no linear algebra, no PyTorch experience required. We teach those as we go.\n\n---\n\n## 🗺️ Chapters\n\n| Chapter | What You'll Learn |\n|---|---|\n| **[0: Overview](chapters\u002F00_overview.md)** | What is a GPT? The big picture |\n| **[1: Setup](chapters\u002F01_setup.md)** | Install tools, GPU vs CPU, venv, PyTorch basics |\n| **[2: Tokenization](chapters\u002F02_tokenization.md)** | BPE walkthrough: how \"unbelievably\" becomes tokens |\n| **[3: Embeddings](chapters\u002F03_embeddings.md)** | How numbers become meaning. king − man + woman = queen |\n| **[4: Positional Encoding](chapters\u002F04_positional_encoding.md)** | RoPE: why LLaMA rotates vectors, not adds numbers |\n| **[5: Attention](chapters\u002F05_attention.md)** | ⭐ THE CORE. 
Q,K,V, scaling, causal mask, 8-step walkthrough |\n| **[6: Transformer Block](chapters\u002F06_transformer_block.md)** | RMSNorm, SwiGLU, residuals, pre-norm vs post-norm |\n| **[7: Complete GPT Model](chapters\u002F07_gpt_model.md)** | 151M parameter model (with SwiGLU), weight tying, logits explained |\n| **[8: Training Pipeline](chapters\u002F08_training.md)** | Cross-entropy, backprop, AdamW, cosine warmup, mixed precision |\n| **[9: Inference](chapters\u002F09_inference.md)** | KV cache, temperature, top-k\u002Fp, beam search, repetition penalty |\n| **[10: Full Script](chapters\u002F10_full_script.md)** | Runnable `main.py`: everything in one file |\n| **[11: Glossary](chapters\u002F11_glossary.md)** | Architecture provenance table, parameter breakdown |\n\n> ⭐ **Start with [Chapter 0](chapters\u002F00_overview.md) and read sequentially.** Each builds on the previous.\n\n---\n\n## 🏗️ What You'll Build\n\n| 🧩 Component | 📝 Lines | 💡 What You'll Understand |\n|---|---|---|\n| **BPE Tokenizer** | ~60 | How GPT-4 splits \"unbelievably\" → \"un\" + \"believ\" + \"ably\" |\n| **Embeddings** | ~30 | How \"cat\" and \"dog\" end up near each other in 768D space |\n| **RoPE** | ~70 | Why LLaMA rotates vectors instead of adding position numbers |\n| **Multi-Head Attention** | ~120 | The exact 8-step computation behind every modern LLM |\n| **Transformer Block** | ~50 | Why residual connections are the \"gradient highway\" |\n| **Full GPT Model** | ~200 | 151M parameter model with SwiGLU, weight tying and pre-norm |\n| **Training Pipeline** | ~250 | AdamW, cosine warmup, mixed precision, gradient accumulation |\n| **Inference Engine** | ~80 | KV cache, temperature, top-k\u002Fp, beam search |\n\n> 💎 **~860 lines of core model code, ~2,600 lines of explanation and diagrams**\n\n---\n\n## 🏛️ Architecture\n\nThis guide implements the **latest publicly-documented** decoder-only Transformer:\n\n| 🧬 Technique | 📦 Source Model | ⚡ Why It Matters |\n|---|---|---|\n| **RoPE** | 
LLaMA, Mistral, Qwen | Relative position without learned parameters |\n| **RMSNorm** | LLaMA, Mistral, Gemma | 15% faster than LayerNorm, equally effective |\n| **SwiGLU** | PaLM, LLaMA, Gemini | Learns which information to pass or block |\n| **Pre-Norm** | GPT-3, all modern | Stable training at 100+ layers |\n| **AdamW** | GPT-3+ | Better generalization than vanilla Adam |\n| **BPE** | GPT-2\u002F3\u002F4 | Handles any text. Even unseen words and emoji |\n| **Weight Tying** | GPT-2\u002F3 | Saves 30% parameters, improves training signal |\n| **Mixed Precision** | All production LLMs | 2× speed, half memory, same quality |\n\n> ℹ️ GPT-4 and Claude architectures are proprietary\u002Fundisclosed. This teaches the best publicly-confirmed architecture: what LLaMA 3, Mistral and Qwen 2.5 use.\n\n---\n\n## 🚀 Quick Start\n\n```bash\n# 1. Clone\ngit clone https:\u002F\u002Fgithub.com\u002Fraiyanyahya\u002Fhow-to-train-your-gpt.git\ncd how-to-train-your-gpt\n\n# 2. Create environment\npython -m venv gpt_env\nsource gpt_env\u002Fbin\u002Factivate          # Mac\u002FLinux\n# gpt_env\\Scripts\\activate           # Windows\n\n# 3. Install dependencies (CPU version; for GPU see below)\n# torch comes from the PyTorch wheel index; the other packages come from PyPI\npip install torch --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcpu\npip install tiktoken datasets numpy matplotlib\n\n# 4. Check for a GPU (optional; prints False with the CPU build above)\npython -c \"import torch; print(f'CUDA: {torch.cuda.is_available()}')\"\n\n# 5. Start reading!\nopen chapters\u002F00_overview.md         # macOS; use xdg-open on Linux\n```\n\nRun the training script:\n\n```bash\npython main.py\n```\n\nThis uses the tiny config (d_model=256, 4 layers) by default. Training takes a few minutes on CPU. For the GPT-2 scale config (151M params, 768 dims, 12 layers), edit the config in `main.py` and uncomment the larger configuration.\n\n> 💻 The default config uses a tiny model (d_model=256, 4 layers, 17M params) that runs in minutes on CPU. 
For the full GPT-2 scale (151M params, 768 dims, 12 layers), edit the config in `main.py` and uncomment the larger configuration. You'll need a GPU for that one.\n\n---\n\n## 📓 Jupyter Notebooks\n\nAlongside the textbook, each chapter has a companion notebook you can run live. These strip away the explanations and give you pure, clean code that executes from top to bottom. If the textbook teaches you why, the notebooks let you see it happen.\n\nWe're going to run this whole project on a very small dataset so you can watch training happen in minutes rather than weeks. Every notebook is self-contained: open it, run all cells, and you'll see the model learn in real time.\n\n```bash\n# Install everything you need\n# torch comes from the PyTorch wheel index; the other packages come from PyPI\npip install torch --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcpu\npip install jupyter tiktoken numpy datasets matplotlib\n\n# Start with chapter 2 (tokenization)\njupyter notebook notebooks\u002F02_tokenization.ipynb\n```\n\nNotebooks live in the `notebooks\u002F` directory, one per chapter. Open any of them and hit **Cell → Run All**.\n\n---\n\n## 📖 How to Read\n\nEach chapter follows the same **4-step structure**:\n\n| Step | Format | Purpose |\n|---|---|---|\n| 1️⃣ **Analogy** | Plain English, 5-year-old level | Build intuition before math |\n| 2️⃣ **Worked Example** | Real numbers traced through | See exactly what happens |\n| 3️⃣ **Annotated Code** | Every line: `WHAT` + `WHY` | Understand every decision |\n| 4️⃣ **Diagram** | Mermaid flowchart or ASCII | Visualize data flow |\n\n> 💡 **Tip:** Lost in the code? Jump back to the analogy. Confused by the math? 
Skip to the worked example.\n\n---\n\n## ✨ What Makes This Different\n\n| Aspect | 😴 Typical Tutorial | 🔥 This Guide |\n|---|---|---|\n| **Explanation depth** | \"Attention helps the model focus\" | 8-step worked example with real numbers + variance math + causal mask visualization |\n| **Code comments** | Few or none | Every single line: WHAT + WHY |\n| **Modern techniques** | GPT-2 style (2019) | LLaMA 3 style (2024): RoPE, RMSNorm, SwiGLU |\n| **Training** | Uses HuggingFace Trainer | Full custom loop: AdamW, cosine warmup, mixed precision, grad accumulation |\n| **Inference** | `model.generate()` | Temperature, top-k, top-p, beam search, KV cache explained |\n| **Target audience** | ML engineers | Python developers with zero ML experience |\n| **Diagrams** | None | Mermaid flowcharts + ASCII matrices + worked examples |\n\n---\n\n## 🎯 Skills You'll Gain\n\n- ✅ Explain how GPT-4 tokenizes text using BPE\n- ✅ Understand why RoPE, RMSNorm and SwiGLU replaced older techniques\n- ✅ Compute attention scores manually for a 3-token sentence\n- ✅ Debug a Transformer training loop (loss spikes, flat lines, overfitting)\n- ✅ Choose sampling parameters (temperature, top_k, top_p) for different use cases\n- ✅ Understand why KV caching is critical for production inference\n- ✅ Read modern ML papers with confidence (you'll recognize every component)\n\n---\n\n## 🔮 Next Steps After Finishing\n\n| Experiment | What to Change | What You'll Learn |\n|---|---|---|\n| **Bigger model** | `num_layers` 12 → 24 | How depth improves reasoning |\n| **More data** | Add BookCorpus, C4, The Pile | Impact of data quality and diversity |\n| **Flash Attention** | Install `flash-attn`, swap attention | 2-5× faster training, longer context |\n| **Grouped Query Attention** | Set `num_kv_heads` \u003C `num_heads` | How Mistral achieves efficient inference |\n| **LoRA fine-tuning** | Add low-rank adapter layers | Customize models without full retraining |\n| **RLHF \u002F DPO** | Add reward model 
training | How ChatGPT learns to follow instructions |\n| **KV Cache** | Implement persistent key-value storage | 500× faster text generation |\n| **Mixture of Experts** | Route tokens through different FFN experts | How GPT-4 scales to trillions of params |\n\n---\n\n## 📁 File Structure\n\n```\n📦 how-to-train-your-gpt\u002F\n├── 📄 README.md              ← You are here\n└── 📂 chapters\u002F\n    ├── 🏠 00_overview.md     ← What is a GPT? Why build one?\n    ├── 🔧 01_setup.md        ← Install tools, GPU vs CPU, venv basics\n    ├── 🔪 02_tokenization.md ← BPE walkthrough, EOS tokens, emoji handling\n    ├── 🧊 03_embeddings.md   ← How numbers become meaning, king − man + woman\n    ├── 📍 04_positional_encoding.md ← RoPE math, numerical example, theta\n    ├── 🧠 05_attention.md    ← ⭐ THE CORE (713 lines). Q,K,V, scaling, causal mask\n    ├── 🧱 06_transformer_block.md ← RMSNorm, SwiGLU, residuals, pre-norm vs post\n    ├── 🏗️ 07_gpt_model.md    ← Complete 151M model, weight tying, logits explained\n    ├── 🏋️ 08_training.md     ← Cross-entropy, backprop, AdamW, cosine warmup\n    ├── 🎤 09_inference.md    ← KV cache, temperature, top-k\u002Fp, beam search\n    ├── 📜 10_full_script.md  ← Runnable main.py\n    └── 📊 11_glossary.md     ← Architecture provenance, parameter breakdown\n```\n\n---\n\n\u003Cp align=\"center\">\n  \u003Ci>\"Any sufficiently explained technology is indistinguishable from magic. Until you build it yourself.\"\u003C\u002Fi>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  \u003Csub>⭐ Star this repo if you found it useful | 🐛 Issues & PRs welcome | 📖 Happy learning!\u003C\u002Fsub>\n\u003C\u002Fp>\n","This project is an educational guide to building a modern language model from scratch. Across 12 chapters and more than 3,900 lines of code, it explains step by step how to implement a GPT-like language model, including core components such as the tokenizer, embedding layer, attention mechanism, training loop and inference engine. Every line of code is commented, explaining what it does and why in the simplest possible terms. It is especially suited to developers with basic Python skills, scholars interested in the Transformer architecture, and engineers who want a deeper understanding of how large language models work internally. No machine learning background is required; basic Python programming skills are enough to get started.",2,"2026-06-11 02:30:30","CREATED_QUERY"]