[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-72185":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":25,"hasPages":25,"topics":26,"createdAt":10,"pushedAt":10,"updatedAt":35,"readmeContent":36,"aiSummary":37,"trendingCount":16,"starSnapshotCount":16,"syncStatus":38,"lastSyncTime":39,"discoverSource":40},72185,"tiny-llm","skyzh\u002Ftiny-llm","skyzh","A course of learning LLM inference serving on Apple Silicon for systems engineers: build a tiny vLLM + Qwen.","https:\u002F\u002Fskyzh.github.io\u002Ftiny-llm\u002F",null,"Python",4266,330,36,7,0,14,46,97,42,29.56,"Apache License 2.0",false,"main",true,[27,28,29,30,31,32,33,34],"course","large-language-model","llm","python","qwen","qwen2","serving","vllm","2026-06-12 02:02:59","# tiny-llm - LLM Serving in a Week\n\n[![CI (main)](https:\u002F\u002Fgithub.com\u002Fskyzh\u002Ftiny-llm\u002Factions\u002Fworkflows\u002Fmain.yml\u002Fbadge.svg)](https:\u002F\u002Fgithub.com\u002Fskyzh\u002Ftiny-llm\u002Factions\u002Fworkflows\u002Fmain.yml)\n\nA course on LLM serving using MLX for system engineers. The codebase\nis solely (almost!) based on MLX array\u002Fmatrix APIs without any high-level neural network APIs, so that we\ncan build the model serving infrastructure from scratch and dig into the optimizations.\n\nThe goal is to learn the techniques behind efficiently serving a large language model (e.g., Qwen3 models).\n\nIn week 1, you will implement the necessary components in Python (only Python!) to use the Qwen3 model to generate responses (e.g., attention, RoPE, etc). 
In week 2, you will implement an inference system that is similar to vLLM but much simpler (e.g., KV cache, continuous batching, flash attention). In week 3, we will cover more advanced topics and how the model interacts with the outside world.\n\nWhy MLX: nowadays it's easier to get a macOS-based local development environment than to set up an NVIDIA GPU.\n\nWhy Qwen3: it keeps the dense decoder architecture small enough for a local MLX course while adding modern details such as QK norm and bfloat16 weights. The official MLX 4-bit model files also make the setup predictable on Apple Silicon.\n\n## Book\n\nThe tiny-llm book is available at [https:\u002F\u002Fskyzh.github.io\u002Ftiny-llm\u002F](https:\u002F\u002Fskyzh.github.io\u002Ftiny-llm\u002F). You can follow the guide and start building.\n\n## Community\n\nYou may join skyzh's Discord server and study with the tiny-llm community.\n\n[![Join skyzh's Discord Server](book\u002Fsrc\u002Fdiscord-badge.svg)](https:\u002F\u002Fskyzh.dev\u002Fjoin\u002Fdiscord)\n\n## Roadmap\n\nWeeks 1 and 2 are complete. 
Week 3 is in progress.\n\n| Week + Chapter | Topic                                                       | Code | Test | Doc |\n| -------------- | ----------------------------------------------------------- | ---- | ---- | --- |\n| 1.1            | Attention                                                   | ✅    | ✅   | ✅  |\n| 1.2            | RoPE                                                        | ✅    | ✅   | ✅  |\n| 1.3            | Grouped Query Attention                                     | ✅    | ✅   | ✅  |\n| 1.4            | RMSNorm and MLP                                             | ✅    | ✅   | ✅  |\n| 1.5            | Load the Model                                              | ✅    | ✅   | ✅  |\n| 1.6            | Generate Responses (aka Decoding)                           | ✅    | ✅   | ✅  |\n| 1.7            | Sampling                                                    | ✅    | ✅   | ✅  |\n| 2.1            | Key-Value Cache                                             | ✅    | ✅   | ✅  |\n| 2.2            | Quantized Matmul and Linear - CPU                           | ✅    | ✅   | ✅  |\n| 2.3            | Quantized Matmul and Linear - GPU                           | ✅    | ✅   | ✅  |\n| 2.4            | Flash Attention 2 - CPU                                     | ✅    | ✅   | ✅  |\n| 2.5            | Flash Attention 2 - GPU                                     | ✅    | ✅   | ✅  |\n| 2.6            | Continuous Batching                                         | ✅    | ✅   | ✅  |\n| 2.7            | Chunked Prefill                                             | ✅    | ✅   | ✅  |\n| 3.1            | Paged Attention - Part 1                                    | ✅    | ✅   | 🚧  |\n| 3.2            | Paged Attention - Part 2                                    | ✅    | ✅   | 🚧  |\n| 3.3            | MoE (Mixture of Experts)                                    | 🚧    | 🚧   | 🚧  |\n| 3.4            | Speculative Decoding                               
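         | 🚧    | ✅   | 🚧  |\n| 3.5            | RAG Pipeline                                                | 🚧    | 🚧   | 🚧  |\n| 3.6            | AI Agent \u002F Tool Calling                                     | 🚧    | 🚧   | 🚧  |\n| 3.7            | Long Context                                                | 🚧    | 🚧   | 🚧  |\n\nOther topics not covered: quantized\u002Fcompressed KV cache, prefix\u002Fprompt cache; sampling, fine-tuning; smaller kernels (softmax, silu, etc.)\n\n## Star History\n\n[![Star History Chart](https:\u002F\u002Fapi.star-history.com\u002Fsvg?repos=skyzh\u002Ftiny-llm&type=Date)](https:\u002F\u002Fwww.star-history.com\u002F#skyzh\u002Ftiny-llm&Date)\n","tiny-llm is a course for systems engineers that teaches how to serve large language model inference (e.g., Qwen3) on Apple Silicon. The project builds the model serving infrastructure from scratch using MLX array\u002Fmatrix APIs and digs into optimization techniques. Core features include implementing key components such as attention, RoPE, and the KV cache, and building an inference system similar to vLLM but simpler. Over three weeks, participants learn the technical details of serving large language models efficiently. It suits developers and researchers who want to understand the LLM inference process and its underlying implementation. The project is written in Python and runs easily on macOS, with no NVIDIA GPU setup required.",2,"2026-06-11 03:40:46","high_star"]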