[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-767":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":10,"languages":10,"totalLinesOfCode":10,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":14,"subscribersCount":14,"size":14,"stars1d":15,"stars7d":16,"stars30d":17,"stars90d":14,"forks30d":14,"starsTrendScore":18,"compositeScore":19,"rankGlobal":10,"rankLanguage":10,"license":20,"archived":21,"fork":21,"defaultBranch":22,"hasWiki":23,"hasPages":21,"topics":24,"createdAt":10,"pushedAt":10,"updatedAt":30,"readmeContent":31,"aiSummary":32,"trendingCount":14,"starSnapshotCount":14,"syncStatus":33,"lastSyncTime":34,"discoverSource":35},767,"llm-internals","amitshekhariitbhu\u002Fllm-internals","amitshekhariitbhu","Learn LLM internals step by step - from tokenization to attention to inference optimization.","https:\u002F\u002Foutcomeschool.com\u002Fprogram\u002Fai-and-machine-learning",null,1064,93,12,0,23,29,91,69,18.92,"Apache License 2.0",false,"main",true,[25,26,27,28,29,5],"attention-is-all-you-need","attention-mechanism","large-language-models","learn-llm","llm","2026-06-12 02:00:18","\u003Cp align=\"center\">\n    \u003Cimg alt=\"AI Engineering Interview Questions and Answers\" src=\"https:\u002F\u002Fgithub.com\u002Famitshekhariitbhu\u002Fllm-internals\u002Fblob\u002Fmain\u002Fassets\u002Fbanner.png\">\n\u003C\u002Fp>\n\n# LLM Internals\n\n**Learn LLM internals step by step - from tokenization to attention to inference optimization.**\n\n---\n\nPrepared and maintained by the **Founder** of Outcome School: [Amit Shekhar](https:\u002F\u002Fx.com\u002Famitiitbhu)\n\n---\n\n**Note: This series will continue to grow as I write more blogs and create more videos on new topics. 
Keep learning.**\n\n---\n\n## Large Language Models (LLMs)\n\nBefore diving into the internals of an LLM, it’s a good idea to first understand what an LLM actually is.\n\nIn this video, we will cover the following:\n\n* LLM\n* RAG\n* MCP\n* Agent\n* Fine-tuning\n* Quantization\n\nLet's get started: [AI Engineering Explained: LLM, RAG, MCP, Agent, Fine-Tuning, Quantization](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=lnfWvX66FUk)\n\n---\n\n## Tokenization in Large Language Models (LLMs)\n\nIn this video, we will learn about Tokenization and why it is essential for Large Language Models.\n\nLet's get started: [Tokenization in Large Language Models (LLMs)](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=sK2s9I84EVI)\n\n---\n\n\n## Byte Pair Encoding in LLMs\n\nIn this blog, we will learn about BPE (Byte Pair Encoding) - the tokenization algorithm used by most modern Large Language Models (LLMs) to break text into smaller pieces before processing it.\n\nWe will understand what BPE is, why it is needed, and how it works step by step with a simple example.\n\nWe will cover the following:\n\n* What is Tokenization?\n* The Problem: How to Break Text into Tokens?\n* What is BPE (Byte Pair Encoding)?\n* How BPE Works: Step by Step\n* How BPE Tokenizes New Text\n* Why BPE is Used in Modern LLMs\n\nLet's get started: [Byte Pair Encoding in LLMs](https:\u002F\u002Foutcomeschool.com\u002Fblog\u002Fbpe-in-llms)\n\n---\n\n## Math behind Attention - Q, K, and V\n\nIn this blog, we will learn about the math behind Attention: Query (Q), Key (K), and Value (V) with a step-by-step numeric example.\n\nWe will cover the following:\n\n* The Attention Formula\n* Setting Up: From Words to Vectors\n* Creating Q, K, and V Matrices\n* Computing Attention Scores (Q x K^T)\n* Scaling the Scores\n* Applying Softmax\n* Computing the Final Output (Attention Weights x V)\n* Putting It All Together\n\nLet's get started: [Math behind Attention - Q, K, and 
V](https:\u002F\u002Foutcomeschool.com\u002Fblog\u002Fmath-behind-attention-qkv)\n\n---\n\n## Math behind √dₖ Scaling Factor in Attention\n\nIn this blog, we will learn about why we scale the dot product attention by √dₖ in the Transformer architecture with a step-by-step numeric example.\n\nWe will cover the following:\n\n* The Attention Formula (Quick Recap)\n* What Happens Without Scaling?\n* Why Do Dot Products Grow with dₖ?\n* Understanding Variance of the Dot Product\n* Proving It Step by Step: Variance of the Dot Product is dₖ\n* What Large Dot Products Do to Softmax\n* Why √dₖ is the Right Scaling Factor\n* Seeing It with Real Numbers\n* Putting It All Together\n\nLet's get started: [Math behind √dₖ Scaling Factor in Attention](https:\u002F\u002Foutcomeschool.com\u002Fblog\u002Fscaling-dot-product-attention)\n\n---\n\n## Causal Masking in Attention\n\nIn this blog, we will learn about causal masking in attention.\n\nWe will start with the introduction of causal masking, understand the problem of seeing future tokens through an example, and then walk through its implementation to see how masked attention prevents the model from accessing future tokens.\n\nWe will cover the following:\n\n* Without Causal Masking\n* With Causal Masking\n* Implementation of Causal Masking\n* The Causal Mask Matrix\n\nLet's get started: [Causal Masking in Attention](https:\u002F\u002Foutcomeschool.com\u002Fblog\u002Fcausal-masking-in-attention)\n\n---\n\n## Math Behind Backpropagation\n\nIn this blog, we will learn about the math behind backpropagation in neural networks.\n\nBackpropagation is the core algorithm that allows neural networks to learn from their mistakes. Without it, training neural networks efficiently would not be possible. Understanding the math behind it gives us a deeper understanding of how neural networks actually learn. 
Do not worry, we will learn about each concept step by step so that everything is clear.\n\nWe will cover the following:\n\n* What is backpropagation?\n* The chain rule of calculus\n* Forward pass\n* Loss calculation\n* Backward pass (backpropagation)\n* Step-by-step numeric example\n* Weight update using gradient descent\n* Backpropagation in Python\n\nLet's get started: [Math Behind Backpropagation](https:\u002F\u002Foutcomeschool.com\u002Fblog\u002Fmath-behind-backpropagation)\n\n---\n\n## Math Behind Cross-Entropy Loss\n\nIn this blog, we will learn about the math behind Cross-Entropy Loss with a step-by-step numeric example.\n\nWhen we train a classification model in machine learning, the model predicts probabilities for each class. For example, given an image, the model outputs something like: \"I am 70% sure this is a cat, 20% sure it is a dog, and 10% sure it is a rabbit.\" To train the model, we need a way to measure how wrong these predictions are compared to the true answer. This is exactly what Cross-Entropy Loss does. 
It is the most widely used loss function in classification tasks, and it powers the training of almost every modern AI model, including GPT, BERT, and image classifiers.\n\nWe will cover the following:\n\n* The Big Picture\n* What is Cross-Entropy\n* The Cross-Entropy Loss Formula\n* Why We Take the Negative Log\n* Binary Cross-Entropy Loss\n* Categorical Cross-Entropy Loss\n* Step-by-Step Numeric Example\n* Cross-Entropy Loss for Language Models\n* The Gradient of Cross-Entropy Loss\n* Quick Summary\n\nLet's get started: [Math Behind Cross-Entropy Loss](https:\u002F\u002Foutcomeschool.com\u002Fblog\u002Fmath-behind-cross-entropy-loss)\n\n---\n\n## Decoding Transformer Architecture\n\nIn this blog, we will learn about the Transformer architecture by decoding it piece by piece - understanding what each component does, how they work together, and why this architecture powers every modern Large Language Model (LLM).\n\nWe will cover the following:\n\n* Why the Transformer was needed\n* The two halves of the architecture\n* Tokenization, Embedding, and Positional Encoding\n* The Attention Mechanism and Multi-Head Attention\n* Feed-Forward Networks, Residual Connections, and Layer Normalization\n* How the Encoder and Decoder work\n* How data flows through the entire architecture\n* The three variants of the Transformer\n* Why the Transformer is so powerful\n\nLet's get started: [Decoding Transformer Architecture](https:\u002F\u002Foutcomeschool.com\u002Fblog\u002Fdecoding-transformer-architecture)\n\n---\n\n## Feed-Forward Networks in LLMs\n\nIn this blog, we will learn about Feed-Forward Networks in LLMs - understanding what they are, how they work inside the Transformer architecture, why every Transformer layer needs one, and what role they play in making Large Language Models so powerful.\n\nWe will cover the following:\n\n* What is a Feed-Forward Network?\n* Understanding Feed-Forward Networks with a Real-World Analogy\n* Where Does the Feed-Forward Network Sit in a 
Transformer?\n* How Does a Feed-Forward Network Work - Step by Step\n* The Expand-then-Contract Pattern\n* Why Does the FFN Expand and Then Contract?\n* ReLU and Activation Functions\n* What Does the Feed-Forward Network Actually Learn?\n* How Much of the Model is the Feed-Forward Network?\n* Feed-Forward Networks in Mixture of Experts\n* Why Feed-Forward Networks Are So Important\n\nLet's get started: [Feed-Forward Networks in LLMs](https:\u002F\u002Foutcomeschool.com\u002Fblog\u002Ffeed-forward-networks-in-llms)\n\n---\n\n## KV Cache in LLMs\n\nIn this blog, we will learn about KV Cache - where K stands for Key and V stands for Value - and why it is used in Large Language Models (LLMs) to speed up text generation.\n\nWe will start with how LLMs generate text one token at a time, understand the role of Key, Value, and Query inside the model, see the problem of repeated computation through an example, and then walk through how KV Cache solves this problem by storing and reusing past results.\n\nWe will cover the following:\n\n* How LLMs Generate Text\n* What Happens Inside the Model\n* The Problem: Repeated Computation\n* The Solution: KV Cache\n* Why Only Key and Value Are Cached, Not Query\n* How Much Faster Does It Get\n* The Trade-Off: Speed vs Memory\n\nLet's get started: [KV Cache in LLMs](https:\u002F\u002Foutcomeschool.com\u002Fblog\u002Fkv-cache-in-llms)\n\n---\n\n## Paged Attention in LLMs\n\nIn this blog, we will learn about Paged Attention, a technique that solves the memory waste problem of KV Cache, allowing LLMs to serve many more users at the same time.\n\nWe will start with a quick recap of KV Cache, understand the memory problem it creates, see how traditional memory allocation wastes space through an example, and then walk through how Paged Attention solves this problem by borrowing an idea from how computers manage memory.\n\nWe will cover the following:\n\n* Quick Recap: KV Cache\n* The Problem: Memory Waste in KV Cache\n* What is Paged 
Attention?\n* How Paged Attention Works\n* Why Paged Attention Is So Effective\n* Memory Sharing Across Requests\n\nLet's get started: [Paged Attention in LLMs](https:\u002F\u002Foutcomeschool.com\u002Fblog\u002Fpaged-attention-in-llms)\n\n---\n\n## Decoding Flash Attention in LLMs\n\nIn this blog, we will learn about Flash Attention by decoding it piece by piece - understanding why standard attention is slow, what makes Flash Attention fast, how it uses GPU memory cleverly, and why it is used in almost every modern Large Language Model (LLM).\n\nWe will cover the following:\n\n* A quick recap of standard attention\n* Why standard attention is slow\n* How GPU memory actually works (HBM vs SRAM)\n* The core idea behind Flash Attention\n* Tiling: breaking the work into small blocks\n* Online softmax: computing softmax without the full matrix\n* Recomputation in the backward pass\n* Flash Attention 2\n* Flash Attention 3\n* Advantages and impact of Flash Attention\n\nLet's get started: [Decoding Flash Attention in LLMs](https:\u002F\u002Foutcomeschool.com\u002Fblog\u002Fdecoding-flash-attention)\n\n---\n\n## Speculative Decoding\n\nIn this blog, we will learn about Speculative Decoding - what it is, why LLM generation is slow without it, how a small draft model and a big target model work together to produce tokens faster, the rejection sampling math that guarantees no quality loss, real numbers showing the 2x to 3x speedup, where it is used in production, and the trade-offs to watch out for.\n\nWe will cover the following:\n\n* What problem does Speculative Decoding solve?\n* The Big Picture\n* Why is LLM generation slow?\n* The core idea behind Speculative Decoding\n* Step-by-step walkthrough\n* The verification step\n* Real numbers and speedup\n* Where it is used\n* Trade-offs\n* Quick Summary\n\nLet's get started: [Speculative Decoding](https:\u002F\u002Foutcomeschool.com\u002Fblog\u002Fspeculative-decoding)\n\n---\n\n## Mixture of Experts Explained\n\nIn this 
blog, we will learn about the Mixture of Experts (MoE) architecture - understanding what experts are, how the router picks them, why MoE makes large models faster and cheaper, and why it powers many of today's most powerful Large Language Models (LLMs).\n\nWe will cover the following:\n\n* Why Mixture of Experts was needed\n* What an \"expert\" really means\n* The router and how it picks experts\n* Where MoE sits inside a Transformer\n* Sparse activation and why it saves compute\n* Load balancing across experts\n* Advantages and challenges of MoE\n* Why MoE powers many modern LLMs\n\nLet's get started: [Mixture of Experts Explained](https:\u002F\u002Foutcomeschool.com\u002Fblog\u002Fmixture-of-experts)\n\n---\n\n## Grouped Query Attention\n\nIn this blog, we will learn about Grouped-Query Attention (GQA) and how it differs from Multi-Head Attention (MHA). We will also learn about Multi-Query Attention (MQA) along the way and see when to use which one.\n\nWe will cover the following:\n\n* The Big Picture\n* Quick Recap: Multi-Head Attention (MHA)\n* The Problem with Multi-Head Attention\n* What is Multi-Query Attention (MQA)?\n* What is Grouped-Query Attention (GQA)?\n* How Grouped-Query Attention Works\n* GQA is a Generalization of MHA and MQA\n* GQA vs MHA vs MQA\n* Real-World Use Cases\n* A Note on Terminology\n* Uptraining: Converting MHA to GQA\n* Quick Summary\n\nLet's get started: [Grouped Query Attention](https:\u002F\u002Foutcomeschool.com\u002Fblog\u002Fgrouped-query-attention)\n\n---\n\n## Math Behind RoPE (Rotary Position Embedding)\n\nIn this blog, we will learn about the math behind Rotary Position Embedding (RoPE) and why it is used in modern Large Language Models.\n\nWe will cover the following:\n\n* The Big Picture\n* Why a Transformer Needs Position Information\n* Older Approaches and Their Problems\n* The Core Idea Behind RoPE\n* The 2D Rotation Math\n* How RoPE Is Applied to Q and K\n* Why the Dot Product Captures Relative Position\n* A Small 
Numeric Example\n* Real-World Use Cases\n* Quick Summary\n\nLet's get started: [Math Behind RoPE (Rotary Position Embedding)](https:\u002F\u002Foutcomeschool.com\u002Fblog\u002Fmath-behind-rope-rotary-position-embedding)\n\n---\n\n## RMSNorm (Root Mean Square Layer Normalization)\n\nIn this blog, we will learn about RMSNorm, a faster and simpler alternative to Layer Normalization that powers most modern Large Language Models like Llama, Mistral, Gemma, Qwen, PaLM, and DeepSeek.\n\nWe will cover the following:\n\n* Why normalization is needed in deep networks\n* A quick recap of Layer Normalization (LayerNorm)\n* What RMSNorm is and how it works\n* The math behind RMSNorm with a concrete numeric example\n* LayerNorm vs RMSNorm - the key differences\n* Why modern LLMs prefer RMSNorm\n* A code example\n* Where RMSNorm fits in a Transformer\n* Quick Summary\n\nLet's get started: [RMSNorm (Root Mean Square Layer Normalization)](https:\u002F\u002Foutcomeschool.com\u002Fblog\u002Frmsnorm-root-mean-square-layer-normalization)\n\n---\n\n## LoRA: Low-Rank Adaptation of LLMs\n\nIn this blog, we will learn about LoRA - Low-Rank Adaptation of Large Language Models.\n\nWe will cover the following:\n\n* The Big Picture\n* Why Full Fine-Tuning Is Expensive\n* The Core Idea Behind LoRA\n* How LoRA Works Step by Step\n* A Small Numeric Example\n* Where LoRA Is Applied in a Transformer\n* Merging LoRA Back Into the Model\n* Real-World Use Cases\n* Quick Summary\n\nLet's get started: [LoRA: Low-Rank Adaptation of LLMs](https:\u002F\u002Foutcomeschool.com\u002Fblog\u002Flora-low-rank-adaptation-of-llms)\n\n---\n\n## Decoding DeepSeek-V4\n\nIn this blog, we will learn about DeepSeek-V4, the new family of open Mixture-of-Experts language models that natively supports a one-million-token context with dramatically lower inference cost.\n\nDeepSeek-V4 makes one-million-token context roughly a tenth as expensive as it was in DeepSeek-V3.2. 
It introduces a new attention design, a new way of doing residual connections, a new optimizer, and a new post-training pipeline. We will decode each of these one by one.\n\nWe will cover the following:\n\n* The Big Picture\n* Two Models: DeepSeek-V4-Pro and DeepSeek-V4-Flash\n* Hybrid Attention with CSA and HCA\n* Manifold-Constrained Hyper-Connections (mHC)\n* Muon Optimizer\n* FP4 Quantization-Aware Training\n* Pre-Training\n* Post-Training: Specialist Training and On-Policy Distillation\n* Reasoning Modes\n* Putting It All Together\n* Quick Summary\n\nLet's get started: [Decoding DeepSeek-V4](https:\u002F\u002Foutcomeschool.com\u002Fblog\u002Fdecoding-deepseek-v4)\n\n---\n\n## Harness Engineering in AI\n\nIn this blog, we will learn about Harness Engineering in AI. We will understand what a harness is, why we need it, and how it is used in AI Agents and evaluation systems.\n\nWe will cover the following:\n\n* What is a Harness in AI?\n* Why do we need Harness Engineering?\n* Components of an AI Harness\n* Harness Engineering for AI Agents\n* Harness Engineering for Evaluation\n* Best Practices in Harness Engineering\n* Putting It All Together\n\nLet's get started: [Harness Engineering in AI](https:\u002F\u002Foutcomeschool.com\u002Fblog\u002Fharness-engineering-in-ai)\n\n---\n\n## More blogs and videos coming soon!\n\n### License\n```\n   Copyright (C) 2026 Outcome School\n\n   Licensed under the Apache License, Version 2.0 (the \"License\");\n   you may not use this file except in compliance with the License.\n   You may obtain a copy of the License at\n\n       http:\u002F\u002Fwww.apache.org\u002Flicenses\u002FLICENSE-2.0\n\n   Unless required by applicable law or agreed to in writing, software\n   distributed under the License is distributed on an \"AS IS\" BASIS,\n   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n   See the License for the specific language governing permissions and\n   limitations under the 
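License.\n```\n","This project teaches the internals of large language models (LLMs) step by step, from tokenization to attention to inference optimization. Through videos and blog posts, it explains key LLM components and technical details, such as Byte Pair Encoding (BPE), the Q, K, and V computation in attention, and the role of the √dₖ scaling factor. It is suited to researchers, engineers, and students who want a deeper understanding of how LLMs work, how they are optimized, and the underlying math.",2,"2026-06-11 02:39:11","CREATED_QUERY"]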