[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-80652":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":14,"subscribersCount":14,"size":14,"stars1d":14,"stars7d":15,"stars30d":16,"stars90d":14,"forks30d":14,"starsTrendScore":14,"compositeScore":17,"rankGlobal":9,"rankLanguage":9,"license":9,"archived":18,"fork":18,"defaultBranch":19,"hasWiki":20,"hasPages":18,"topics":21,"createdAt":9,"pushedAt":9,"updatedAt":22,"readmeContent":23,"aiSummary":24,"trendingCount":14,"starSnapshotCount":14,"syncStatus":15,"lastSyncTime":25,"discoverSource":26},80652,"gpu-kernel-engineer-from-scratch","vukrosic\u002Fgpu-kernel-engineer-from-scratch","vukrosic","Become GPU kernel engineer step by step.",null,"Python",54,19,50,0,2,4,3.9,false,"main",true,[],"2026-06-12 02:04:05","# GPU Kernel Engineer From Scratch\n\nA 12-month CUDA, Triton, and AI systems course where you build a public GPU\nkernels portfolio one week at a time.\n\nWatch YouTube course here - https:\u002F\u002Fyoutu.be\u002FF5v_7OCwKHs\n\n## Start Here\n\nThis repo is the 1-year roadmap to become a GPU kernel engineer.\n\nIf you do not watch any video, follow this order:\n\n1. Open [Week 01: GPU Mental Model And Baseline](weeks\u002Fweek-01-gpu-mental-model.md)\n2. Do every task in that file from top to bottom.\n3. When Week 01 is done, open [weeks\u002FREADME.md](weeks\u002FREADME.md) and continue in order through Week 10.\n4. Use the scaffolded weekly files in `weeks\u002F` for the rest of the year.\n5. Use [course\u002Fsyllabus.md](course\u002Fsyllabus.md) only as the full map.\n6. Use [course\u002Frecovery-system.md](course\u002Frecovery-system.md) if you fall behind.\n7. 
Use [FINISH_PLAN.md](FINISH_PLAN.md) when you want the repo brought to its\n   finished public-project state.\n\nRun the starter repo:\n\n```bash\npython -m pip install -e \".[dev]\"\npytest\npython examples\u002Freference_bench.py\nmake bootstrap-results\nmake bench\n```\n\n## A To B Path\n\nPoint A:\n\n- you know Python\n- you may use PyTorch\n- you do not yet understand GPU kernels deeply\n- you do not have a GPU-systems portfolio\n\nPoint B:\n\n- you can write CUDA and Triton kernels\n- you can test kernels against trusted baselines\n- you can benchmark and explain performance\n- you can build AI-relevant kernels like softmax, matmul, layer norm, and attention pieces\n- you have a public portfolio repo with results, notes, and interview-ready explanations\n\nHow you get there:\n\n1. Follow one week file at a time in `weeks\u002F`.\n2. Each week, produce one artifact: code, test, benchmark, note, or portfolio section.\n3. Each month, use the fourth week to catch up and package your work.\n4. By Month 12, turn the artifacts into a final capstone and interview story.\n\nDo not try to speedrun the whole roadmap. 
The course works because the skills\ncompound week by week.\n\n## Course Promise\n\nEvery week, you build one GPU systems skill and ship one portfolio artifact.\n\nBy the end, you should be able to demonstrate:\n\n- CUDA kernels, grids, blocks, threads, and warps\n- GPU memory hierarchy and performance bottlenecks\n- correctness testing against CPU, NumPy, or PyTorch references\n- benchmarking, profiling, and performance reports\n- reductions, scans, softmax, layer norm, matmul, and attention-style kernels\n- CUDA and Triton implementations of AI-relevant operations\n- a public repo that can be discussed in ML systems and AI infrastructure interviews\n\n## 12-Month Roadmap\n\nThere are 48 weekly files because 12 months x 4 weeks = 48 weeks.\n\nMonth 1: GPU Foundations\n- Week 01: GPU mental model and baseline\n- Week 02: CUDA setup and vector add\n- Week 03: Tensor shapes, memory layout, indexing\n- Week 04: Elementwise kernel patterns\n\nMonth 2: Memory And Benchmarking\n- Week 05: Memory bandwidth and AXPY\n- Week 06: Coalescing vs strides\n- Week 07: Timing harness and benchmarking\n- Week 08: Reading performance results\n\nMonth 3: Reductions\n- Week 09: Reductions mental model\n- Week 10: Naive reduction kernels\n- Week 11: Block-level reductions with shared memory\n- Week 12: Warp-level reductions\n\nMonth 4: Scans, Atomics, Synchronization\n- Week 13: Synchronization and barriers\n- Week 14: Atomics and contention\n- Week 15: Prefix sum and scan mental model\n- Week 16: Parallel scan implementation\n\nMonth 5: Softmax And Normalization\n- Week 17: Softmax math for kernels\n- Week 18: Fused row-wise softmax\n- Week 19: LayerNorm kernel mental model\n- Week 20: RMSNorm kernel\n\nMonth 6: Matmul Foundations\n- Week 21: Naive matrix multiplication\n- Week 22: Tiled matrix multiplication\n- Week 23: Matmul memory reuse\n- Week 24: Occupancy, registers, and tile size\n\nMonth 7: Triton For AI Kernels\n- Week 25: Triton mental model\n- Week 26: Triton vector add 
and masks\n- Week 27: Triton reductions\n- Week 28: Triton row-wise softmax\n\nMonth 8: Triton Matmul And Tuning\n- Week 29: Triton matmul basics\n- Week 30: Triton matmul performance knobs\n- Week 31: Batched matmul indexing\n- Week 32: Profiling GPU kernels\n\nMonth 9: PyTorch Integration\n- Week 33: PyTorch baselines\n- Week 34: Custom op wrapper\n- Week 35: GPU test matrix\n- Week 36: Debugging GPU kernels\n\nMonth 10: Transformer Kernels\n- Week 37: GELU fusion\n- Week 38: Residual and norm fusion\n- Week 39: Attention scores and masks\n- Week 40: Transformer kernel dataflow\n\nMonth 11: Attention And Inference\n- Week 41: Attention forward pass\n- Week 42: FlashAttention concepts\n- Week 43: KV cache\n- Week 44: Attention capstone plan\n\nMonth 12: Portfolio And Interviews\n- Week 45: Benchmark dashboard\n- Week 46: Interview explanations\n- Week 47: Resume and story\n- Week 48: Final capstone\n\nThe detailed week-by-week plan is in [course\u002Fsyllabus.md](course\u002Fsyllabus.md),\nand the first ten weekly lessons live in [weeks\u002F](weeks\u002F).\n\n## What To Do Each Week\n\nEach rewritten week follows the same shape:\n\n1. Read the current week file.\n2. Study the mental model and code-shaped examples.\n3. Use the matching `results\u002F` file to capture the main takeaway.\n4. Move to the next lesson.\n\nThe weekly file is the source of truth. The syllabus tells you where the course\nis going, but the weekly file tells you what to do today.\n\n## How The Course Prevents Burnout\n\n- Each lesson focuses on one GPU engineering idea.\n- Result notes stay lightweight.\n- Later implementation work builds on the lesson files instead of replacing them.\n- If you fall behind, use [course\u002Frecovery-system.md](course\u002Frecovery-system.md) instead of quitting.\n\nThe rule is simple: correct and finished beats perfect and abandoned.\n\n## Community\n\nThe repo is the free roadmap. 
The community is for feedback, accountability, and\nhelp finishing the work.\n\nJoin here: [Become AI Researcher](https:\u002F\u002Fskool.com\u002Fbecome-ai-researcher-2669\u002Fabout)\n\nInside the community, the goal is to help you:\n\n- stay on pace with the weekly roadmap\n- ask questions when a kernel, benchmark, or setup step breaks\n- get feedback on portfolio notes, benchmark tables, and repo structure\n- join office hours and implementation review sessions\n- compare your work with other builders following the same path\n- turn finished assignments into resume bullets and interview explanations\n\n## Repo Structure\n\n- `course\u002F` contains the full 12-month roadmap, weekly rhythm, and recovery system.\n- `weeks\u002F` contains one follow-it-top-to-bottom file per course week.\n- `assignments\u002F` contains the assignment index and reusable assignment template.\n- `cuda\u002F` contains standalone CUDA C++ starter kernels and their notes.\n- `triton\u002F` contains Triton docs and implementation notes.\n- `triton_kernels\u002F` contains executable Triton Python kernels.\n- `kernels\u002F` organizes AI-kernel topics independent of implementation language.\n- `gputriton\u002F` contains current portable reference implementations.\n- `examples\u002F` contains runnable demos.\n- `tests\u002F` contains correctness checks.\n- `results\u002F` is where benchmark tables and charts should go.\n- `portfolio\u002F` contains resume, interview, and project-packaging material.\n- `creator\u002F` contains channel cadence, content packaging, and publishing workflow.\n- `bonus\u002F10-day-sprint\u002F` contains optional compressed practice material.\n- `FINISH_PLAN.md` describes the path from scaffold to finished 
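project.\n","This project aims to train GPU kernel engineers step by step through a 12-month course covering CUDA, Triton, and AI systems. Its core feature is building one GPU systems skill each week and producing a matching portfolio artifact, such as code, tests, or benchmark reports. Its technical strength is a comprehensive learning path from fundamentals to advanced topics that helps learners understand GPU architecture, the memory hierarchy, and its performance bottlenecks, and implement AI-relevant kernels such as softmax and matrix multiplication. It suits Python developers who want to move into GPU kernel development and anyone interested in high-performance computing, especially those who want to strengthen their competitiveness in machine learning systems and AI infrastructure through hands-on project experience.","2026-06-11 04:01:30","CREATED_QUERY"]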