[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-73991":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":16,"stars7d":16,"stars30d":16,"stars90d":16,"forks30d":16,"starsTrendScore":16,"compositeScore":17,"rankGlobal":10,"rankLanguage":10,"license":18,"archived":19,"fork":19,"defaultBranch":20,"hasWiki":21,"hasPages":19,"topics":22,"createdAt":10,"pushedAt":10,"updatedAt":23,"readmeContent":24,"aiSummary":25,"trendingCount":16,"starSnapshotCount":16,"syncStatus":26,"lastSyncTime":27,"discoverSource":28},73991,"llama3-from-scratch","naklecha\u002Fllama3-from-scratch","naklecha","llama3 implementation one matrix multiplication at a time","",null,"Jupyter Notebook",15231,1284,103,16,0,44.33,"MIT License",false,"main",true,[],"2026-06-12 02:03:20","# llama3 implemented from scratch\nin this file, i implemented llama3 from scratch, one tensor and matrix multiplication at a time.\n\u003Cbr>\nalso, im going to load tensors directly from the model file that meta provided for llama3, you need to download the weights before running this file.\nhere is the offical link to download the weights: https:\u002F\u002Fllama.meta.com\u002Fllama-downloads\u002F\n\n\u003Cdiv>\n    \u003Cimg src=\"images\u002Farchi.png\"\u002F>\n\u003C\u002Fdiv>\n\n## tokenizer\nim not going to implement a bpe tokenizer (but andrej karpathy has a really clean implementation)\n\u003Cbr>\nlink to his implementation: https:\u002F\u002Fgithub.com\u002Fkarpathy\u002Fminbpe\n\n\u003Cdiv>\n    \u003Cimg src=\"images\u002Fkarpathyminbpe.png\" width=\"600\"\u002F>\n\u003C\u002Fdiv>\n\n\n\n```python\nfrom pathlib import Path\nimport tiktoken\nfrom tiktoken.load import load_tiktoken_bpe\nimport torch\nimport json\nimport matplotlib.pyplot as plt\n\ntokenizer_path = 
\"Meta-Llama-3-8B\u002Ftokenizer.model\"\nspecial_tokens = [\n            \"\u003C|begin_of_text|>\",\n            \"\u003C|end_of_text|>\",\n            \"\u003C|reserved_special_token_0|>\",\n            \"\u003C|reserved_special_token_1|>\",\n            \"\u003C|reserved_special_token_2|>\",\n            \"\u003C|reserved_special_token_3|>\",\n            \"\u003C|start_header_id|>\",\n            \"\u003C|end_header_id|>\",\n            \"\u003C|reserved_special_token_4|>\",\n            \"\u003C|eot_id|>\",  # end of turn\n        ] + [f\"\u003C|reserved_special_token_{i}|>\" for i in range(5, 256 - 5)]\nmergeable_ranks = load_tiktoken_bpe(tokenizer_path)\ntokenizer = tiktoken.Encoding(\n    name=Path(tokenizer_path).name,\n    pat_str=r\"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+\",\n    mergeable_ranks=mergeable_ranks,\n    special_tokens={token: len(mergeable_ranks) + i for i, token in enumerate(special_tokens)},\n)\n\ntokenizer.decode(tokenizer.encode(\"hello world!\"))\n```\n\n\n\n\n    'hello world!'\n\n\n\n## reading the model file\nnormally, reading this depends on how the model classes are written and the variable names inside them.\n\u003Cbr>\nbut since we are implementing llama3 from scratch we will read the file one tensor at a time.\n\u003Cdiv>\n    \u003Cimg src=\"images\u002Fmodel.png\" width=\"600\"\u002F>\n\u003C\u002Fdiv>\n\n\n```python\nmodel = torch.load(\"Meta-Llama-3-8B\u002Fconsolidated.00.pth\")\nprint(json.dumps(list(model.keys())[:20], indent=4))\n```\n\n    [\n        \"tok_embeddings.weight\",\n        \"layers.0.attention.wq.weight\",\n        \"layers.0.attention.wk.weight\",\n        \"layers.0.attention.wv.weight\",\n        \"layers.0.attention.wo.weight\",\n        \"layers.0.feed_forward.w1.weight\",\n        \"layers.0.feed_forward.w3.weight\",\n        \"layers.0.feed_forward.w2.weight\",\n        \"layers.0.attention_norm.weight\",\n 
       \"layers.0.ffn_norm.weight\",\n        \"layers.1.attention.wq.weight\",\n        \"layers.1.attention.wk.weight\",\n        \"layers.1.attention.wv.weight\",\n        \"layers.1.attention.wo.weight\",\n        \"layers.1.feed_forward.w1.weight\",\n        \"layers.1.feed_forward.w3.weight\",\n        \"layers.1.feed_forward.w2.weight\",\n        \"layers.1.attention_norm.weight\",\n        \"layers.1.ffn_norm.weight\",\n        \"layers.2.attention.wq.weight\"\n    ]\n\n\n\n```python\nwith open(\"Meta-Llama-3-8B\u002Fparams.json\", \"r\") as f:\n    config = json.load(f)\nconfig\n```\n\n\n\n\n    {'dim': 4096,\n     'n_layers': 32,\n     'n_heads': 32,\n     'n_kv_heads': 8,\n     'vocab_size': 128256,\n     'multiple_of': 1024,\n     'ffn_dim_multiplier': 1.3,\n     'norm_eps': 1e-05,\n     'rope_theta': 500000.0}\n\n\n\n## we use this config to infer details about the model like\n1. the model has 32 transformer layers\n2. each multi-head attention block has 32 heads\n3. the vocab size and so on\n\n\n```python\ndim = config[\"dim\"]\nn_layers = config[\"n_layers\"]\nn_heads = config[\"n_heads\"]\nn_kv_heads = config[\"n_kv_heads\"]\nvocab_size = config[\"vocab_size\"]\nmultiple_of = config[\"multiple_of\"]\nffn_dim_multiplier = config[\"ffn_dim_multiplier\"]\nnorm_eps = config[\"norm_eps\"]\nrope_theta = torch.tensor(config[\"rope_theta\"])\n```\n\n## converting text to tokens\nhere we use tiktoken (i think an openai library) as the tokenizer\n\u003Cdiv>\n    \u003Cimg src=\"images\u002Ftokens.png\" width=\"600\"\u002F>\n\u003C\u002Fdiv>\n\n\n```python\nprompt = \"the answer to the ultimate question of life, the universe, and everything is \"\ntokens = [128000] + tokenizer.encode(prompt)\nprint(tokens)\ntokens = torch.tensor(tokens)\nprompt_split_as_tokens = [tokenizer.decode([token.item()]) for token in tokens]\nprint(prompt_split_as_tokens)\n```\n\n    [128000, 1820, 4320, 311, 279, 17139, 3488, 315, 2324, 11, 279, 15861, 11, 323, 4395, 374, 220]\n    
['\u003C|begin_of_text|>', 'the', ' answer', ' to', ' the', ' ultimate', ' question', ' of', ' life', ',', ' the', ' universe', ',', ' and', ' everything', ' is', ' ']\n\n\n## converting tokens to their embedding\nIM SORRY but this is the only part of the codebase where i use an inbuilt neural network module\n\u003Cbr>\nanyway, so our [17x1] tokens are now [17x4096], i.e. 17 embeddings (one for each token) of length 4096\n\u003Cbr>\n\u003Cbr>\nnote: keep track of the shapes, it makes it much easier to understand everything\n\n\u003Cdiv>\n    \u003Cimg src=\"images\u002Fembeddings.png\" width=\"600\"\u002F>\n\u003C\u002Fdiv>\n\n\n```python\nembedding_layer = torch.nn.Embedding(vocab_size, dim)\nembedding_layer.weight.data.copy_(model[\"tok_embeddings.weight\"])\ntoken_embeddings_unnormalized = embedding_layer(tokens).to(torch.bfloat16)\ntoken_embeddings_unnormalized.shape\n```\n\n\n\n\n    torch.Size([17, 4096])\n\n\n\n## we then normalize the embedding using rms normalization\nplease note that after this step the shapes dont change, only the values are normalized\n\u003Cbr>\nthings to keep in mind, we need a norm_eps (from config) because we dont want to accidentally set rms to 0 and divide by 0\n\u003Cbr>\nhere is the formula:\n\u003Cdiv>\n    \u003Cimg src=\"images\u002Frms.png\" width=\"600\"\u002F>\n\u003C\u002Fdiv>\n\n\n```python\n# def rms_norm(tensor, norm_weights):\n#     rms = (tensor.pow(2).mean(-1, keepdim=True) + norm_eps)**0.5\n#     return tensor * (norm_weights \u002F rms)\ndef rms_norm(tensor, norm_weights):\n    return (tensor * torch.rsqrt(tensor.pow(2).mean(-1, keepdim=True) + norm_eps)) * norm_weights\n```\n\n# building the first layer of the transformer\n\n### normalization\nyou will see me accessing layers.0 from the model dict (this is the first layer)\n\u003Cbr>\nanyway, so after normalizing our shapes are still [17x4096] same as embedding but normalized \n\n\u003Cdiv>\n    \u003Cimg src=\"images\u002Fnorm.png\" 
width=\"600\"\u002F>\n\u003C\u002Fdiv>\n\n\n```python\ntoken_embeddings = rms_norm(token_embeddings_unnormalized, model[\"layers.0.attention_norm.weight\"])\ntoken_embeddings.shape\n```\n\n\n\n\n    torch.Size([17, 4096])\n\n\n\n### attention implemented from scratch\nlet's load the attention heads of the first layer of the transformer\n\u003Cdiv>\n    \u003Cimg src=\"images\u002Fqkv.png\" width=\"600\"\u002F>\n\u003C\u002Fdiv>\n\n\u003Cbr>\n\n&gt; when we load the query, key, value and output weight matrices from the model we notice the shapes to be [4096x4096], [1024x4096], [1024x4096], [4096x4096]\n\u003Cbr>\n&gt; at first glance this is weird because ideally we want separate q, k, v and o weights for each head\n\u003Cbr>\n&gt; the authors of the code bundled them together because it makes it easy to parallelize the attention head multiplications.\n\u003Cbr>\n&gt; im going to unwrap everything... \n\n\n```python\nprint(\n    model[\"layers.0.attention.wq.weight\"].shape,\n    model[\"layers.0.attention.wk.weight\"].shape,\n    model[\"layers.0.attention.wv.weight\"].shape,\n    model[\"layers.0.attention.wo.weight\"].shape\n)\n```\n\n    torch.Size([4096, 4096]) torch.Size([1024, 4096]) torch.Size([1024, 4096]) torch.Size([4096, 4096])\n\n\n### unwrapping query\nin the next section we will unwrap the queries from multiple attention heads, the resulting shape is [32x128x4096]\n\u003Cbr>\u003Cbr>\nhere, 32 is the number of attention heads in llama3, 128 is the size of the query vector and 4096 is the size of the token embedding\n\n\n```python\nq_layer0 = model[\"layers.0.attention.wq.weight\"]\nhead_dim = q_layer0.shape[0] \u002F\u002F n_heads\nq_layer0 = q_layer0.view(n_heads, head_dim, dim)\nq_layer0.shape\n```\n\n\n\n\n    torch.Size([32, 128, 4096])\n\n\n\n### im going to implement the first head of the first layer\nhere i access the query weight matrix of the first head of the first layer, the size of this query weight matrix is [128x4096]\n\n\n```python\nq_layer0_head0 = 
q_layer0[0]\nq_layer0_head0.shape\n```\n\n\n\n\n    torch.Size([128, 4096])\n\n\n\n### we now multiply the query weights with the token embedding, to receive a query for the token\nhere you can see the resulting shape is [17x128], this is because we have 17 tokens and for each token there is a 128 length query.\n\u003Cdiv>\n    \u003Cimg src=\"images\u002Fq_per_token.png\" width=\"600\"\u002F>\n\u003C\u002Fdiv>\n\n\n```python\nq_per_token = torch.matmul(token_embeddings, q_layer0_head0.T)\nq_per_token.shape\n```\n\n\n\n\n    torch.Size([17, 128])\n\n\n\n## positional encoding\nwe are now at a stage where we have a query vector for each token in our prompt, but if you think about it -- each individual query vector has no idea about its position in the prompt.\n\u003Cbr>\u003Cbr>\nprompt: \"the answer to the ultimate question of life, the universe, and everything is \"\n\u003Cbr>\u003Cbr>\nin our prompt we have used \"the\" three times, we need the query vectors of all 3 \"the\" tokens (each of size [1x128]) to differ based on their positions in the prompt. we perform these rotations using RoPE (rotary positional embedding).\n\u003Cbr>\u003Cbr>\n### RoPE\nwatch this video (this is what i watched) to understand the math.\nhttps:\u002F\u002Fwww.youtube.com\u002Fwatch?v=o29P0Kpobz0&t=530s\n\n\n\u003Cdiv>\n    \u003Cimg src=\"images\u002Frope.png\" width=\"600\"\u002F>\n\u003C\u002Fdiv>\n\n\n```python\nq_per_token_split_into_pairs = q_per_token.float().view(q_per_token.shape[0], -1, 2)\nq_per_token_split_into_pairs.shape\n```\n\n\n\n\n    torch.Size([17, 64, 2])\n\n\n\nin the above step, we split the query vectors into pairs, we apply a rotational angle shift to each pair!\n\u003Cbr>\u003Cbr>\nwe now have a tensor of size [17x64x2], this is the 128 length queries split into 64 pairs for each token in the prompt! 
each of those 64 pairs will be rotated by m*(theta) where m is the position of the token for which we are rotating the query!\n\n\n\u003Cdiv>\n    \u003Cimg src=\"images\u002Fqsplit.png\" width=\"600\"\u002F>\n\u003C\u002Fdiv>\n\n## using dot product of complex numbers to rotate a vector\n\u003Cdiv>\n    \u003Cimg src=\"images\u002Ffreq_cis.png\" width=\"600\"\u002F>\n\u003C\u002Fdiv>\n\n\n```python\nzero_to_one_split_into_64_parts = torch.tensor(range(64))\u002F64\nzero_to_one_split_into_64_parts\n```\n\n\n\n\n    tensor([0.0000, 0.0156, 0.0312, 0.0469, 0.0625, 0.0781, 0.0938, 0.1094, 0.1250,\n            0.1406, 0.1562, 0.1719, 0.1875, 0.2031, 0.2188, 0.2344, 0.2500, 0.2656,\n            0.2812, 0.2969, 0.3125, 0.3281, 0.3438, 0.3594, 0.3750, 0.3906, 0.4062,\n            0.4219, 0.4375, 0.4531, 0.4688, 0.4844, 0.5000, 0.5156, 0.5312, 0.5469,\n            0.5625, 0.5781, 0.5938, 0.6094, 0.6250, 0.6406, 0.6562, 0.6719, 0.6875,\n            0.7031, 0.7188, 0.7344, 0.7500, 0.7656, 0.7812, 0.7969, 0.8125, 0.8281,\n            0.8438, 0.8594, 0.8750, 0.8906, 0.9062, 0.9219, 0.9375, 0.9531, 0.9688,\n            0.9844])\n\n\n\n\n```python\nfreqs = 1.0 \u002F (rope_theta ** zero_to_one_split_into_64_parts)\nfreqs\n```\n\n\n\n\n    tensor([1.0000e+00, 8.1462e-01, 6.6360e-01, 5.4058e-01, 4.4037e-01, 3.5873e-01,\n            2.9223e-01, 2.3805e-01, 1.9392e-01, 1.5797e-01, 1.2869e-01, 1.0483e-01,\n            8.5397e-02, 6.9566e-02, 5.6670e-02, 4.6164e-02, 3.7606e-02, 3.0635e-02,\n            2.4955e-02, 2.0329e-02, 1.6560e-02, 1.3490e-02, 1.0990e-02, 8.9523e-03,\n            7.2927e-03, 5.9407e-03, 4.8394e-03, 3.9423e-03, 3.2114e-03, 2.6161e-03,\n            2.1311e-03, 1.7360e-03, 1.4142e-03, 1.1520e-03, 9.3847e-04, 7.6450e-04,\n            6.2277e-04, 5.0732e-04, 4.1327e-04, 3.3666e-04, 2.7425e-04, 2.2341e-04,\n            1.8199e-04, 1.4825e-04, 1.2077e-04, 9.8381e-05, 8.0143e-05, 6.5286e-05,\n            5.3183e-05, 4.3324e-05, 3.5292e-05, 2.8750e-05, 2.3420e-05, 
1.9078e-05,\n            1.5542e-05, 1.2660e-05, 1.0313e-05, 8.4015e-06, 6.8440e-06, 5.5752e-06,\n            4.5417e-06, 3.6997e-06, 3.0139e-06, 2.4551e-06])\n\n\n\n\n```python\nfreqs_for_each_token = torch.outer(torch.arange(17), freqs)\nfreqs_cis = torch.polar(torch.ones_like(freqs_for_each_token), freqs_for_each_token)\nfreqs_cis.shape\n\n# viewing the row of freqs_cis at index 3 (the fourth token position)\nvalue = freqs_cis[3]\nplt.figure()\nfor i, element in enumerate(value[:17]):\n    plt.plot([0, element.real], [0, element.imag], color='blue', linewidth=1, label=f\"Index: {i}\")\n    plt.annotate(f\"{i}\", xy=(element.real, element.imag), color='red')\nplt.xlabel('Real')\nplt.ylabel('Imaginary')\nplt.title('Plot of one row of freqs_cis')\nplt.show()\n```\n\n\n    \n![png](images\u002Fimplllama3_30_0.png)\n    \n\n\n### now that we have a complex number (the angle change vector) for every token's query element\nwe can convert our queries (the ones we split into pairs) into complex numbers and then multiply them elementwise by freqs_cis to rotate each query based on its position\n\u003Cbr>\nhonestly this is beautiful to think about :)\n\n\n```python\nq_per_token_as_complex_numbers = torch.view_as_complex(q_per_token_split_into_pairs)\nq_per_token_as_complex_numbers.shape\n```\n\n\n\n\n    torch.Size([17, 64])\n\n\n\n\n```python\nq_per_token_as_complex_numbers_rotated = q_per_token_as_complex_numbers * freqs_cis\nq_per_token_as_complex_numbers_rotated.shape\n```\n\n\n\n\n    torch.Size([17, 64])\n\n\n\n### after the rotated vector is obtained\nwe can get the queries back as pairs by viewing the complex numbers as real numbers again\n\n\n```python\nq_per_token_split_into_pairs_rotated = torch.view_as_real(q_per_token_as_complex_numbers_rotated)\nq_per_token_split_into_pairs_rotated.shape\n```\n\n\n\n\n    torch.Size([17, 64, 2])\n\n\n\nthe rotated pairs are now merged, we now have a new query vector (rotated query vector) that is of the shape [17x128] where 17 is the number of tokens and 128 is the dim of the query 
vector\n\n\n```python\nq_per_token_rotated = q_per_token_split_into_pairs_rotated.view(q_per_token.shape)\nq_per_token_rotated.shape\n```\n\n\n\n\n    torch.Size([17, 128])\n\n\n\n# keys (almost the same as queries)\n\u003Cdiv>\n    \u003Cimg src=\"images\u002Fkeys.png\" width=\"600px\"\u002F>\n\u003C\u002Fdiv>\nim lazy as fuck, so im not going to go through the math for keys, the only things you need to keep in mind are:\n\u003Cbr>\n&gt; keys generate key vectors also of dimension 128\n\u003Cbr>\n&gt; keys have only 1\u002F4th as many weights as queries, this is because each set of key weights is shared across 4 heads at a time, to reduce the number of computations needed\n\u003Cbr>\n&gt; keys are also rotated to add positional info, just like queries and for the same reasons \n\n\n```python\nk_layer0 = model[\"layers.0.attention.wk.weight\"]\nk_layer0 = k_layer0.view(n_kv_heads, k_layer0.shape[0] \u002F\u002F n_kv_heads, dim)\nk_layer0.shape\n```\n\n\n\n\n    torch.Size([8, 128, 4096])\n\n\n\n\n```python\nk_layer0_head0 = k_layer0[0]\nk_layer0_head0.shape\n```\n\n\n\n\n    torch.Size([128, 4096])\n\n\n\n\n```python\nk_per_token = torch.matmul(token_embeddings, k_layer0_head0.T)\nk_per_token.shape\n```\n\n\n\n\n    torch.Size([17, 128])\n\n\n\n\n```python\nk_per_token_split_into_pairs = k_per_token.float().view(k_per_token.shape[0], -1, 2)\nk_per_token_split_into_pairs.shape\n```\n\n\n\n\n    torch.Size([17, 64, 2])\n\n\n\n\n```python\nk_per_token_as_complex_numbers = torch.view_as_complex(k_per_token_split_into_pairs)\nk_per_token_as_complex_numbers.shape\n```\n\n\n\n\n    torch.Size([17, 64])\n\n\n\n\n```python\nk_per_token_split_into_pairs_rotated = torch.view_as_real(k_per_token_as_complex_numbers * freqs_cis)\nk_per_token_split_into_pairs_rotated.shape\n```\n\n\n\n\n    torch.Size([17, 64, 2])\n\n\n\n\n```python\nk_per_token_rotated = k_per_token_split_into_pairs_rotated.view(k_per_token.shape)\nk_per_token_rotated.shape\n```\n\n\n\n\n    
torch.Size([17, 128])\n\n\n\n## at this stage we now have the rotated queries and keys for each token. \n\u003Cdiv>\n    \u003Cimg src=\"images\u002Fkeys0.png\" width=\"600px\"\u002F>\n\u003C\u002Fdiv>\nthe query and key matrices are now each of shape [17x128]. \n\n## in the next step we will multiply the query and key matrices\ndoing this will give us a score relating every token to every other token\n\u003Cbr>\nthis score describes how well each token's query relates to each token's key. \nTHIS IS SELF ATTENTION :)\n\u003Cbr>\nthe shape of the attention score matrix (qk_per_token) is [17x17] where 17 is the number of tokens in the prompt\n\n\u003Cdiv>\n    \u003Cimg src=\"images\u002Fqkmatmul.png\" width=\"600px\"\u002F>\n\u003C\u002Fdiv>\n\n\n```python\nqk_per_token = torch.matmul(q_per_token_rotated, k_per_token_rotated.T)\u002F(head_dim)**0.5\nqk_per_token.shape\n```\n\n\n\n\n    torch.Size([17, 17])\n\n\n\n# we now have to mask query key scores\nduring the training process of llama3, the future token qk scores are masked.\n\u003Cbr>\nwhy? 
because during training we only learn to predict tokens using past tokens.\n\u003Cbr>\nas a result, during inference we set the future token scores to -inf, so their attention weights become zero after the softmax.\n\u003Cdiv>\n    \u003Cimg src=\"images\u002Fmask.png\" width=\"600px\"\u002F>\n\u003C\u002Fdiv>\n\n\n```python\ndef display_qk_heatmap(qk_per_token):\n    _, ax = plt.subplots()\n    im = ax.imshow(qk_per_token.to(float).detach(), cmap='viridis')\n    ax.set_xticks(range(len(prompt_split_as_tokens)))\n    ax.set_yticks(range(len(prompt_split_as_tokens)))\n    ax.set_xticklabels(prompt_split_as_tokens)\n    ax.set_yticklabels(prompt_split_as_tokens)\n    ax.figure.colorbar(im, ax=ax)\n    \ndisplay_qk_heatmap(qk_per_token)\n```\n\n\n    \n![png](images\u002Fimplllama3_50_0.png)\n    \n\n\n\n```python\nmask = torch.full((len(tokens), len(tokens)), float(\"-inf\"), device=tokens.device)\nmask = torch.triu(mask, diagonal=1)\nmask\n```\n\n\n\n\n    tensor([[0., -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf],\n            [0., 0., -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf],\n            [0., 0., 0., -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf],\n            [0., 0., 0., 0., -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf],\n            [0., 0., 0., 0., 0., -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf],\n            [0., 0., 0., 0., 0., 0., -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf],\n            [0., 0., 0., 0., 0., 0., 0., -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf],\n            [0., 0., 0., 0., 0., 0., 0., 0., -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf],\n            [0., 0., 0., 0., 0., 0., 0., 0., 0., -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf],\n            [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., -inf, -inf, -inf, -inf, -inf, -inf, -inf],\n            [0., 0., 0., 0., 0., 0., 
0., 0., 0., 0., 0., -inf, -inf, -inf, -inf, -inf, -inf],\n            [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., -inf, -inf, -inf, -inf, -inf],\n            [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., -inf, -inf, -inf, -inf],\n            [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., -inf, -inf, -inf],\n            [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., -inf, -inf],\n            [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., -inf],\n            [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])\n\n\n\n\n```python\nqk_per_token_after_masking = qk_per_token + mask\ndisplay_qk_heatmap(qk_per_token_after_masking)\n```\n\n\n    \n![png](images\u002Fimplllama3_52_0.png)\n    \n\n\n\u003Cdiv>\n    \u003Cimg src=\"images\u002Fsoftmax.png\" width=\"600px\"\u002F>\n\u003C\u002Fdiv>\n\n\n```python\nqk_per_token_after_masking_after_softmax = torch.nn.functional.softmax(qk_per_token_after_masking, dim=1).to(torch.bfloat16)\ndisplay_qk_heatmap(qk_per_token_after_masking_after_softmax)\n```\n\n\n    \n![png](images\u002Fimplllama3_54_0.png)\n    \n\n\n## values (almost the end of attention)\n\n\u003Cdiv>\n    \u003Cimg src=\"images\u002Fvalue.png\" width=\"600px\"\u002F>\n\u003C\u002Fdiv>\nthese scores (0-1) are used to determine how much of the value matrix is used per token\n\u003Cbr>\n&gt; just like keys, value weights are also shared across every 4 attention heads (to save computation)\n\u003Cbr>\n&gt; as a result, the shape of the value weight matrix below is [8x128x4096]\n\n\n\n```python\nv_layer0 = model[\"layers.0.attention.wv.weight\"]\nv_layer0 = v_layer0.view(n_kv_heads, v_layer0.shape[0] \u002F\u002F n_kv_heads, dim)\nv_layer0.shape\n```\n\n\n\n\n    torch.Size([8, 128, 4096])\n\n\n\nthe first layer, first head value weight matrix is given below\n\n\n```python\nv_layer0_head0 = v_layer0[0]\nv_layer0_head0.shape\n```\n\n\n\n\n    torch.Size([128, 4096])\n\n\n\n## value 
vectors\n\u003Cdiv>\n    \u003Cimg src=\"images\u002Fv0.png\" width=\"600px\"\u002F>\n\u003C\u002Fdiv>\nwe now use the value weights to get the attention values per token, this is of size [17x128] where 17 is the number of tokens in the prompt and 128 is the dim of the value vector per token\n\n\n```python\nv_per_token = torch.matmul(token_embeddings, v_layer0_head0.T)\nv_per_token.shape\n```\n\n\n\n\n    torch.Size([17, 128])\n\n\n\n## attention\n\u003Cdiv>\n    \u003Cimg src=\"images\u002Fattention.png\" width=\"600px\"\u002F>\n\u003C\u002Fdiv>\nthe resultant attention vector after multiplying with the values per token is of shape [17x128]\n\n\n```python\nqkv_attention = torch.matmul(qk_per_token_after_masking_after_softmax, v_per_token)\nqkv_attention.shape\n```\n\n\n\n\n    torch.Size([17, 128])\n\n\n\n# multi head attention\n\u003Cdiv>\n    \u003Cimg src=\"images\u002Fheads.png\" width=\"600px\"\u002F>\n\u003C\u002Fdiv>\nWE NOW HAVE THE ATTENTION VALUE OF THE FIRST LAYER AND FIRST HEAD\n\u003Cbr>\nnow im going to run a loop and perform the exact same math as the cells above but for every head in the first layer\n\n\n```python\nqkv_attention_store = []\n\nfor head in range(n_heads):\n    q_layer0_head = q_layer0[head]\n    k_layer0_head = k_layer0[head\u002F\u002F4] # key weights are shared across 4 heads\n    v_layer0_head = v_layer0[head\u002F\u002F4] # value weights are shared across 4 heads\n    q_per_token = torch.matmul(token_embeddings, q_layer0_head.T)\n    k_per_token = torch.matmul(token_embeddings, k_layer0_head.T)\n    v_per_token = torch.matmul(token_embeddings, v_layer0_head.T)\n\n    q_per_token_split_into_pairs = q_per_token.float().view(q_per_token.shape[0], -1, 2)\n    q_per_token_as_complex_numbers = torch.view_as_complex(q_per_token_split_into_pairs)\n    q_per_token_split_into_pairs_rotated = torch.view_as_real(q_per_token_as_complex_numbers * freqs_cis[:len(tokens)])\n    q_per_token_rotated = 
q_per_token_split_into_pairs_rotated.view(q_per_token.shape)\n\n    k_per_token_split_into_pairs = k_per_token.float().view(k_per_token.shape[0], -1, 2)\n    k_per_token_as_complex_numbers = torch.view_as_complex(k_per_token_split_into_pairs)\n    k_per_token_split_into_pairs_rotated = torch.view_as_real(k_per_token_as_complex_numbers * freqs_cis[:len(tokens)])\n    k_per_token_rotated = k_per_token_split_into_pairs_rotated.view(k_per_token.shape)\n\n    qk_per_token = torch.matmul(q_per_token_rotated, k_per_token_rotated.T)\u002F(head_dim)**0.5\n    mask = torch.full((len(tokens), len(tokens)), float(\"-inf\"), device=tokens.device)\n    mask = torch.triu(mask, diagonal=1)\n    qk_per_token_after_masking = qk_per_token + mask\n    qk_per_token_after_masking_after_softmax = torch.nn.functional.softmax(qk_per_token_after_masking, dim=1).to(torch.bfloat16)\n    qkv_attention = torch.matmul(qk_per_token_after_masking_after_softmax, v_per_token)\n    qkv_attention_store.append(qkv_attention)\n\nlen(qkv_attention_store)\n```\n\n\n\n\n    32\n\n\n\n\u003Cdiv>\n    \u003Cimg src=\"images\u002Fstacked.png\" width=\"600px\"\u002F>\n\u003C\u002Fdiv>\nwe now have the qkv_attention matrix for all 32 heads of the first layer, next im going to merge all the attention outputs into one large matrix of size [17x4096]\n\u003Cbr>\nwe are almost at the end :)\n\n\n```python\nstacked_qkv_attention = torch.cat(qkv_attention_store, dim=-1)\nstacked_qkv_attention.shape\n```\n\n\n\n\n    torch.Size([17, 4096])\n\n\n\n# weight matrix, one of the final steps\n\u003Cdiv>\n    \u003Cimg src=\"images\u002Fweightmatrix.png\" width=\"600px\"\u002F>\n\u003C\u002Fdiv>\none of the last things to do for layer 0 attention is to multiply the stacked attention output by the output weight matrix (wo) of the layer\n\n\n```python\nw_layer0 = model[\"layers.0.attention.wo.weight\"]\nw_layer0.shape\n```\n\n\n\n\n    torch.Size([4096, 4096])\n\n\n\n### this is a simple 
linear layer, so we just matmul\n\n\n```python\nembedding_delta = torch.matmul(stacked_qkv_attention, w_layer0.T)\nembedding_delta.shape\n```\n\n\n\n\n    torch.Size([17, 4096])\n\n\n\n\u003Cdiv>\n    \u003Cimg src=\"images\u002Fafterattention.png\" width=\"600px\"\u002F>\n\u003C\u002Fdiv>\nwe now have the change in the embedding value after attention, which should be added to the original token embeddings\n\n\n```python\nembedding_after_edit = token_embeddings_unnormalized + embedding_delta\nembedding_after_edit.shape\n```\n\n\n\n\n    torch.Size([17, 4096])\n\n\n\n## we normalize the updated embedding and then run it through a feed forward neural network\n\u003Cdiv>\n    \u003Cimg src=\"images\u002Fnorm_after.png\" width=\"600px\"\u002F>\n\u003C\u002Fdiv>\n\n\n```python\nembedding_after_edit_normalized = rms_norm(embedding_after_edit, model[\"layers.0.ffn_norm.weight\"])\nembedding_after_edit_normalized.shape\n```\n\n\n\n\n    torch.Size([17, 4096])\n\n\n\n## loading the ff weights and implementing the feed forward network\n\u003Cdiv>\n    \u003Cimg src=\"images\u002Fswiglu.png\" width=\"600px\"\u002F>\n\u003C\u002Fdiv>\nin llama3, they used a SwiGLU feedforward network, this network architecture is really good at adding non linearity when needed by the model.\n\u003Cbr>\nits pretty standard to use this feed forward network architecture in llms these days\n\n\n```python\nw1 = model[\"layers.0.feed_forward.w1.weight\"]\nw2 = model[\"layers.0.feed_forward.w2.weight\"]\nw3 = model[\"layers.0.feed_forward.w3.weight\"]\noutput_after_feedforward = torch.matmul(torch.functional.F.silu(torch.matmul(embedding_after_edit_normalized, w1.T)) * torch.matmul(embedding_after_edit_normalized, w3.T), w2.T)\noutput_after_feedforward.shape\n```\n\n\n\n\n    torch.Size([17, 4096])\n\n\n\n# WE FINALLY HAVE NEW EDITED EMBEDDINGS FOR EACH TOKEN AFTER THE FIRST LAYER\njust 31 more layers to go before we are done (one for loop away)\n\u003Cbr>\nyou can imagine this edited embedding as having 
information about all queries asked on the first layer\n\u003Cbr>\nnow each layer will encode more and more complex queries on the questions asked, until we have an embedding that knows everything about the next token that we need.\n\n\n```python\nlayer_0_embedding = embedding_after_edit+output_after_feedforward\nlayer_0_embedding.shape\n```\n\n\n\n\n    torch.Size([17, 4096])\n\n\n\n# god, everything all at once\n\u003Cdiv>\n    \u003Cimg src=\"images\u002Fgod.png\" width=\"600px\"\u002F>\n\u003C\u002Fdiv>\nyep, this is it. everything we did before, all at once, for every single layer.\n\u003Cbr>\n\n# have fun reading :)\n\n\n```python\nfinal_embedding = token_embeddings_unnormalized\nfor layer in range(n_layers):\n    qkv_attention_store = []\n    layer_embedding_norm = rms_norm(final_embedding, model[f\"layers.{layer}.attention_norm.weight\"])\n    q_layer = model[f\"layers.{layer}.attention.wq.weight\"]\n    q_layer = q_layer.view(n_heads, q_layer.shape[0] \u002F\u002F n_heads, dim)\n    k_layer = model[f\"layers.{layer}.attention.wk.weight\"]\n    k_layer = k_layer.view(n_kv_heads, k_layer.shape[0] \u002F\u002F n_kv_heads, dim)\n    v_layer = model[f\"layers.{layer}.attention.wv.weight\"]\n    v_layer = v_layer.view(n_kv_heads, v_layer.shape[0] \u002F\u002F n_kv_heads, dim)\n    w_layer = model[f\"layers.{layer}.attention.wo.weight\"]\n    for head in range(n_heads):\n        q_layer_head = q_layer[head]\n        k_layer_head = k_layer[head\u002F\u002F4]\n        v_layer_head = v_layer[head\u002F\u002F4]\n        q_per_token = torch.matmul(layer_embedding_norm, q_layer_head.T)\n        k_per_token = torch.matmul(layer_embedding_norm, k_layer_head.T)\n        v_per_token = torch.matmul(layer_embedding_norm, v_layer_head.T)\n        q_per_token_split_into_pairs = q_per_token.float().view(q_per_token.shape[0], -1, 2)\n        q_per_token_as_complex_numbers = torch.view_as_complex(q_per_token_split_into_pairs)\n        q_per_token_split_into_pairs_rotated = 
torch.view_as_real(q_per_token_as_complex_numbers * freqs_cis)\n        q_per_token_rotated = q_per_token_split_into_pairs_rotated.view(q_per_token.shape)\n        k_per_token_split_into_pairs = k_per_token.float().view(k_per_token.shape[0], -1, 2)\n        k_per_token_as_complex_numbers = torch.view_as_complex(k_per_token_split_into_pairs)\n        k_per_token_split_into_pairs_rotated = torch.view_as_real(k_per_token_as_complex_numbers * freqs_cis)\n        k_per_token_rotated = k_per_token_split_into_pairs_rotated.view(k_per_token.shape)\n        qk_per_token = torch.matmul(q_per_token_rotated, k_per_token_rotated.T)\u002F(128)**0.5\n        mask = torch.full((len(token_embeddings_unnormalized), len(token_embeddings_unnormalized)), float(\"-inf\"))\n        mask = torch.triu(mask, diagonal=1)\n        qk_per_token_after_masking = qk_per_token + mask\n        qk_per_token_after_masking_after_softmax = torch.nn.functional.softmax(qk_per_token_after_masking, dim=1).to(torch.bfloat16)\n        qkv_attention = torch.matmul(qk_per_token_after_masking_after_softmax, v_per_token)\n        qkv_attention_store.append(qkv_attention)\n\n    stacked_qkv_attention = torch.cat(qkv_attention_store, dim=-1)\n    w_layer = model[f\"layers.{layer}.attention.wo.weight\"]\n    embedding_delta = torch.matmul(stacked_qkv_attention, w_layer.T)\n    embedding_after_edit = final_embedding + embedding_delta\n    embedding_after_edit_normalized = rms_norm(embedding_after_edit, model[f\"layers.{layer}.ffn_norm.weight\"])\n    w1 = model[f\"layers.{layer}.feed_forward.w1.weight\"]\n    w2 = model[f\"layers.{layer}.feed_forward.w2.weight\"]\n    w3 = model[f\"layers.{layer}.feed_forward.w3.weight\"]\n    output_after_feedforward = torch.matmul(torch.functional.F.silu(torch.matmul(embedding_after_edit_normalized, w1.T)) * torch.matmul(embedding_after_edit_normalized, w3.T), w2.T)\n    final_embedding = embedding_after_edit+output_after_feedforward\n```\n\n# we now have the final embedding, the 
best guess the model could make about the next token\nthe shape of the embedding is the same as regular token embeddings [17x4096] where 17 is the number of tokens and 4096 is the embedding dim\n\u003Cdiv>\n    \u003Cimg src=\"images\u002Flast_norm.png\" width=\"600px\"\u002F>\n\u003C\u002Fdiv>\n\n\n```python\nfinal_embedding = rms_norm(final_embedding, model[\"norm.weight\"])\nfinal_embedding.shape\n```\n\n\n\n\n    torch.Size([17, 4096])\n\n\n\n# finally, lets decode the embedding into the token value\n\u003Cdiv>\n    \u003Cimg src=\"images\u002Ffinallayer.png\" width=\"600px\"\u002F>\n\u003C\u002Fdiv>\nwe will use the output decoder (the output weight matrix) to convert the final embedding into a token\n\n\n```python\nmodel[\"output.weight\"].shape\n```\n\n\n\n\n    torch.Size([128256, 4096])\n\n\n\n# we use the embedding of the last token to predict the next value\nhopefully in our case, 42 :)\nnote: 42 is the answer to \"the answer to the ultimate question of life, the universe, and everything is \", according to the book \"the hitchhiker's guide to the galaxy\", most modern llms would answer with 42 here, which should validate our entire code! wish me luck :)\n\n\n```python\nlogits = torch.matmul(final_embedding[-1], model[\"output.weight\"].T)\nlogits.shape\n```\n\n\n\n\n    torch.Size([128256])\n\n\n\n### the model predicted token number 2983 as the next token, is this the token number for 42?\nIM HYPING YOU UP, this is the last cell of code, hopefully you had fun :)\n\n\n```python\nnext_token = torch.argmax(logits, dim=-1)\nnext_token\n```\n\n\n\n\n    tensor(2983)\n\n\n\n# lets fucking go\n\u003Cdiv>\n    \u003Cimg src=\"images\u002F42.png\" width=\"600px\"\u002F>\n\u003C\u002Fdiv>\n\n\n```python\ntokenizer.decode([next_token.item()])\n```\n\n\n\n\n    '42'\n\n\n\n# thank you, i love you :)\n\nThis is the end. Hopefully you enjoyed reading it!\n\nIf you want to support my work\n\n1. follow me on twitter https:\u002F\u002Ftwitter.com\u002Fnaklecha \n2. 
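or, buy me a coffee [https:\u002F\u002Fwww.buymeacoffee.com\u002Fnaklecha](https:\u002F\u002Fwww.buymeacoffee.com\u002Fnaklecha)\n\nHonestly, if you made it this far you already made my day :)\n\n## what motivates me?\n\nMy friends and I are on a mission - to make research more accessible!\nWe created a research lab called A10 - [AAAAAAAAAA.org](http:\u002F\u002Faaaaaaaaaa.org\u002F)\n\nA10 twitter - https:\u002F\u002Ftwitter.com\u002Faaaaaaaaaaorg\n\nour thesis:\n\u003Cdiv>\n    \u003Cimg src=\"images\u002Fa10.png\" width=\"600px\"\u002F>\n\u003C\u002Fdiv>\n","This project implements the Llama3 model from scratch, step by step, building it one tensor and one matrix multiplication at a time. It uses Jupyter Notebook as the development environment and loads the weights directly from the model file provided by Meta. The project does not include a BPE tokenizer implementation, but it links to Karpathy's clean implementation. It suits researchers and developers who already have a solid understanding of deep learning models, especially those who want to understand the internals of large language models in depth. It is also well suited to scenarios that require custom modification or extension of Llama3's functionality.",2,"2026-06-11 03:48:18","high_star"]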