[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-74166":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":23,"hasPages":23,"topics":25,"createdAt":10,"pushedAt":10,"updatedAt":26,"readmeContent":27,"aiSummary":28,"trendingCount":16,"starSnapshotCount":16,"syncStatus":29,"lastSyncTime":30,"discoverSource":31},74166,"lectures","gpu-mode\u002Flectures","gpu-mode","Material for gpu-mode lectures","https:\u002F\u002Fwww.youtube.com\u002F@GPUMODE",null,"Jupyter Notebook",6163,623,81,4,0,13,42,95,39,39.39,"Apache License 2.0",false,"main",[],"2026-06-12 02:03:23","# Supplementary Material for Lectures\n[![](https:\u002F\u002Fdcbadge.vercel.app\u002Fapi\u002Fserver\u002Fgpumode?style=flat)](https:\u002F\u002Fdiscord.gg\u002Fgpumode)\n\n[YouTube Channel](https:\u002F\u002Fwww.youtube.com\u002F@GPUMODE)\n\nThe PMPP Book: [Programming Massively Parallel Processors: A Hands-on Approach](https:\u002F\u002Fa.co\u002Fd\u002F2S2fVzt) (Amazon link)\n\n\n## Lecture 1: Profiling and Integrating CUDA kernels in PyTorch\n- Speaker: [Mark Saroufim](https:\u002F\u002Ftwitter.com\u002Fmarksaroufim)\n- Notebook and slides in [lecture_001](.\u002Flecture_001\u002F) folder\n\n## Lecture 2: Recap Ch. 1-3 from the PMPP book\n- Speaker: [Andreas Koepf](https:\u002F\u002Ftwitter.com\u002Fneurosp1ke)\n- Slides: The powerpoint file [lecture_002\u002Fcuda_mode_lecture2.pptx](.\u002Flecture_002\u002Fcuda_mode_lecture2.pptx) can be found in the root directory of this repository. 
Alternatively [here](https:\u002F\u002Fdocs.google.com\u002Fpresentation\u002Fd\u002F1deqvEHdqEC4LHUpStO6z3TT77Dt84fNAvTIAxBJgDck\u002Fedit#slide=id.g2b1444253e5_1_75) as Google docs presentation.\n\n## Lecture 3: Getting Started With CUDA\n- Speaker: [Jeremy Howard](https:\u002F\u002Ftwitter.com\u002Fjeremyphoward)\n- Notebook: See the [lecture_003](.\u002Flecture_003\u002F) folder, or run the [Colab version](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F180uk6frvMBeT4tywhhYXmz3PJaCIA_uk?usp=sharing)\n\n## Lecture 4: Intro to Compute and Memory Architecture\n- Speaker: [Thomas Viehmann](https:\u002F\u002Flernapparat.de\u002F)\n- Notebook and slides in the [lecture_004](.\u002Flecture_004\u002F) folder.\n\n## Lecture 5: Going Further with CUDA for Python Programmers\n- Speaker: [Jeremy Howard](https:\u002F\u002Ftwitter.com\u002Fjeremyphoward)\n- Notebook in the [lecture_005](.\u002Flecture_005\u002F) folder.\n\n## Lecture 6: Optimizing PyTorch Optimizers\n- Speaker: [Jane Xu](https:\u002F\u002Fgithub.com\u002Fjaneyx99)\n- [Slides](https:\u002F\u002Fdocs.google.com\u002Fpresentation\u002Fd\u002F13WLCuxXzwu5JRZo0tAfW0hbKHQMvFw4O\u002Fedit#slide=id.p1)\n\n## Lecture 7: Advanced Quantization\n- Speaker: [Charles Hernandez](https:\u002F\u002Fgithub.com\u002FHDCharles)\n- [Slides](https:\u002F\u002Fwww.dropbox.com\u002Fscl\u002Ffi\u002Fhzfx1l267m8gwyhcjvfk4\u002FQuantization-Cuda-vs-Triton.pdf?rlkey=s4j64ivi2kpp2l0uq8xjdwbab&dl=0)\n\n## Lecture 8: CUDA Performance Checklist\n- Speaker: [Mark Saroufim](https:\u002F\u002Fgithub.com\u002Fmsaroufim)\n- Code in the [lecture_008](.\u002Flecture_008\u002F) folder\n- [Slides](https:\u002F\u002Fdocs.google.com\u002Fpresentation\u002Fd\u002F1cvVpf3ChFFiY4Kf25S4e4sPY6Y5uRUO-X-A4nJ7IhFE\u002Fedit?usp=sharing)\n\n## Lecture 9: Reductions\n- Speaker: [Mark Saroufim](https:\u002F\u002Fgithub.com\u002Fmsaroufim)\n- Code in the [lecture_009](.\u002Flecture_009\u002F) folder\n- 
[Slides](https:\u002F\u002Fdocs.google.com\u002Fpresentation\u002Fd\u002F1s8lRU8xuDn-R05p1aSP6P7T5kk9VYnDOCyN5bWKeg3U\u002Fedit?usp=drive_link)\n\n## Lecture 10: Build a Prod Ready CUDA Library\n* Speaker: [Oscar Amoros Huguet](https:\u002F\u002Fgithub.com\u002Fmorousg)\n* [Slides](https:\u002F\u002Fdrive.google.com\u002Fdrive\u002Ffolders\u002F158V8BzGj-IkdXXDAdHPNwUzDLNmr971_?usp=drive_link)\n\n## Lecture 11: Sparsity\n* Speaker: [Jesse Cai](https:\u002F\u002Fgithub.com\u002Fjcaip)\n* [Slides](.\u002Flecture_011\u002Fsparsity.pptx)\n\n## Lecture 12: Flash Attention\n- Speaker: [Thomas Viehmann](https:\u002F\u002Flernapparat.de\u002F)\n- Code in the [lecture_012](.\u002Flecture_012\u002F) folder\n\n## Lecture 13: Ring Attention\n- Speaker: [Andreas Koepf](https:\u002F\u002Ftwitter.com\u002Fneurosp1ke)\n- [Slides](.\u002Flecture_013\u002Fring_attention.pptx)\n\n## Lecture 14: Practitioner's Guide to Triton\n- Date: 2024-04-13, Speaker: [Umer Adil](https:\u002F\u002Ftwitter.com\u002FUmerHAdil)\n- [Notebook](.\u002Flecture_014\u002FA_Practitioners_Guide_to_Triton.ipynb)\n\n## Lecture 15: CUTLASS\n- Speaker: [Eric Auld](https:\u002F\u002Fgithub.com\u002Fericauld)\n\n## Lecture 16: Hands-On Profiling\n- Speaker: [Taylor Robie](https:\u002F\u002Fwww.linkedin.com\u002Fin\u002Ftaylor-robie\u002F)\n\n## Bonus Lecture: CUDA C++ llm.cpp\n- Speaker: [Jake Hemstad & Georgii Evtushenko]()\n- [Slides](https:\u002F\u002Fdrive.google.com\u002Fdrive\u002Ffolders\u002F1T-t0d_u0Xu8w_-1E5kAwmXNfF72x-HTA)\n\n## Lecture 17: GPU Collective Communication (NCCL)\n- Speaker: [Dan Johnson](https:\u002F\u002Fphysbam.stanford.edu\u002F~dansj\u002F)\n- Code in the [lecture_017](.\u002Flecture_017\u002F) folder\n\n## Lecture 18: Fused Kernels\n- Speaker: [Kapil Sharma](https:\u002F\u002Fwww.kapilsharma.dev\u002F)\n- Code in the [lecture_018](.\u002Flecture_018\u002F) folder\n\n## Lecture 19: Data Processing on GPUs\n- Speaker: [Devavret Makkar](https:\u002F\u002Fgithub.com\u002Fdevavret)\n\n## 
Lecture 20: Scan Algorithm\n- Speaker: [Izzat El Haj](https:\u002F\u002Fielhajj.github.io\u002F)\n- [Slides](https:\u002F\u002Fdocs.google.com\u002Fpresentation\u002Fd\u002F1MEMsE5LKi6ush_60hlYu3-cz4DUCFzSL\u002Fedit?usp=sharing&ouid=106222972308395582904&rtpof=true&sd=true)\n\n## Lecture 21: Scan Algorithm Part 2\n- Speaker: [Izzat El Haj](https:\u002F\u002Fielhajj.github.io\u002F)\n- [Slides](https:\u002F\u002Fdocs.google.com\u002Fpresentation\u002Fd\u002F1MEMsE5LKi6ush_60hlYu3-cz4DUCFzSL\u002Fedit?usp=sharing&ouid=106222972308395582904&rtpof=true&sd=true)\n\n## Lecture 22: Hacker's Guide to Speculative Decoding in VLLM\n- Speaker: [Cade Daniel](https:\u002F\u002Fx.com\u002Fcdnamz)\n- [Slides](https:\u002F\u002Fdocs.google.com\u002Fpresentation\u002Fd\u002F1p1xE-EbSAnXpTSiSI0gmy_wdwxN5XaULO3AnCWWoRe4\u002Fedit#slide=id.p)\n\n## Lecture 23: Tensor Cores\n- Speaker: Vijay Thakkar & Pradeep Ramani\n- [Slides](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F18sthk6IUOKbdtFphpm_jZNXoJenbWR8m\u002Fview)\n\n## Lecture 24: Scan at the Speed of Light\n- Speaker: Jake Hemstad & Georgii Evtushenko\n\n## Lecture 25: Speaking Composable Kernel\n- Speaker: Haocong Wang\n- [Slides](.\u002Flecture_025\u002FAMD_ROCm_Speaking_Composable_Kernel_July_20_2024.pdf)\n\n## Lecture 26: SYCL MODE (Intel GPU)\n- Speaker: Patric Zhao\n- [Slides](https:\u002F\u002Fdocs.google.com\u002Fpresentation\u002Fd\u002F1SW4XKomAJhhJSH5-jpZI9Qlwp7TEunbV\u002Fedit?usp=sharing&ouid=106222972308395582904&rtpof=true&sd=true)\n\n## Lecture 27: gpu.cpp\n- Speaker: [Austin Huang](https:\u002F\u002Fx.com\u002Faustinvhuang)\n- [Slides](https:\u002F\u002Fgpucpp-presentation.answer.ai\u002F)\n\n## Lecture 28: Liger Kernel\n- Speaker: [Byron Hsu](https:\u002F\u002Fx.com\u002Fhsu_byron)\n- [Slides](https:\u002F\u002Fdocs.google.com\u002Fpresentation\u002Fd\u002F1CGTV-uKw9crrBo13q1jAzAFCFzlpZFjeL4bnK67pTd8\u002Fedit?usp=sharing)\n- Hands-on  Notebooks\n  1. 
[RMSNorm: Verifying Correctness and Performance](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1CQYhul7MVG5F0gmqTBbx1O1HgolPgF0M?usp=sharing)\n  2. [FusedLinearCrossEntropy: Verifying Memory Reduction](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1Z2QtvaIiLm5MWOs7X6ZPS1MN3hcIJFbj?usp=sharing)\n  3. [Convergence Comparison: Triton Kernel Patched vs. Original Model Layer-by-Layer](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1e52FH0BcE739GZaVp-3_Dv7mc4jF1aif?usp=sharing)\n  4. [Contiguity is the hidden killer](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1llnAdo0hc9FpxYRRnjih0l066NCp7Ylu?usp=sharing)\n  5. [Address int32 overflow](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1WgaU_cmaxVzx8PcdKB5P9yHB6_WyGd4T?usp=sharing)\n\n## Lecture 29: Triton Internals\n- Speaker: [Kapil Sharma](https:\u002F\u002Fwww.kapilsharma.dev\u002F)\n- Code\u002Fpresentation in the [lecture_029](.\u002Flecture_029\u002F) folder\n\n## Lecture 30: Quantized training\n- Speaker: [Thien Tran](https:\u002F\u002Fgithub.com\u002Fgau-nernst)\n- Code\u002Fpresentation in the [lecture_030](.\u002Flecture_030\u002F) folder\n\n## Lecture 31: Beginners Guide to Metal Kernels\n- Speaker: [Nikita Shulga](https:\u002F\u002Fgithub.com\u002Fgau-nernst)\n- Code\u002Fpresentation in the [lecture_031](.\u002Flecture_031\u002F) folder\n\n## Lecture 32: Unsloth - LLM Systems Engineering\n- Speaker: [Daniel Han](https:\u002F\u002Fx.com\u002Fdanielhanchen)\n- [Slides](https:\u002F\u002Fdocs.google.com\u002Fpresentation\u002Fd\u002F1BvgbDwvOY6Uy6jMuNXrmrz_6Km_CBW0f2espqeQaWfc\u002Fedit?usp=sharing)\n\n## Lecture 33: BitBLAS\n- Speaker: [Wang Lei](https:\u002F\u002Fgithub.com\u002FLeiWang1999)\n- Code\u002Fpresentation in the [lecture_033](.\u002Flecture_033\u002F) folder\n\n## Lecture 34: Low Bit Triton Kernels\n- Speaker: [Hicham Badri](https:\u002F\u002Fgithub.com\u002Fmobicham)\n- 
[Slides](https:\u002F\u002Fdocs.google.com\u002Fpresentation\u002Fd\u002F1R9B6RLOlAblyVVFPk9FtAq6MXR1ufj1NaT0bjjib7Vc\u002Fedit)\n\n## Lecture 35: SGLang Performance Optimization\n- Speaker: [Yineng Zhang](https:\u002F\u002Flinkedin.com\u002Fin\u002Fzhyncs)\n- [Slides](https:\u002F\u002Fgithub.com\u002Fzhyncs\u002Flectures\u002Fblob\u002Fmain\u002Flecture_035\u002FSGLang-Performance-Optimization-YinengZhang.pdf)\n\n## Lecture 36: CUTLASS and Flash Attention 3\n- Speaker: [Jay Shah](https:\u002F\u002Fresearch.colfax-intl.com\u002Fblog\u002F)\n- [Slides](lecture_036\u002F)\n\n## Lecture 37: Introduction to SASS & GPU Microarchitecture\n- Speaker: [Arun Demeure](https:\u002F\u002Fgithub.com\u002Fademeure)\n- [Slides](lecture_037\u002F)\n\n## Lecture 38: Lowbit kernels for ARM CPU\n- Speaker: [Scott Roy](https:\u002F\u002Fgithub.com\u002Fmetascroy)\n- [Slides](lecture_038\u002F)\n\n## Lecture 39: TorchTitan\n- Speaker: Mark Saroufim and Tianyu Liu\n\n## Lecture 40: Flash Infer\n- Speaker: [Zihao Ye](https:\u002F\u002Fhomes.cs.washington.edu\u002F~zhye\u002F)\n\n## Lecture 41: CUDA Docs for Humans\n- Speaker: [Charles Frye](https:\u002F\u002Fx.com\u002Fcharles_irl\u002Fstatus\u002F1867306225706447023)\n- [Slides](https:\u002F\u002Fdocs.google.com\u002Fpresentation\u002Fd\u002F15lTG6aqf72Hyk5_lqH7iSrc8aP1ElEYxCxch-tD37PE\u002Fedit#slide=id.g326210b960f_0_42)\n \n## Lecture 42: Mosaic GPU\n- Speaker: [Adam Paszke](https:\u002F\u002Fx.com\u002Fapaszke)\n\n## Lecture 43:\n- Speaker: Erik Schultheis\n- [Slides](lecture_042)\n\n## Lecture 57: CuTE\n- Speaker: Cris Cecka\n- [Slides](lecture_057)\n\n## Lecture 67: NCCL & NVSHMEM\n- Speaker: Jeff Hammond\n- [Slides](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1T8uHhFIeVa_g1oYb_O4d2Ltb8YQly1zK\u002Fview?usp=sharing)\n- [Code](https:\u002F\u002Fgithub.com\u002FParRes\u002FKernels\u002Ftree\u002Fmain\u002FCxx11)\n\n## Lecture 69: Quartet 4 bit training\n- Speakers: Roberto Castro and Andrei Panferov\n- Code: 
https:\u002F\u002Fgithub.com\u002FIST-DASLab\u002FQuartet and https:\u002F\u002Fgithub.com\u002FisT-DASLab\u002Fqutlass\n- [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.14669)\n\n## Lecture 70: Fault tolerant communication collectives\n- Speaker: mike64_t\n- [Slides](https:\u002F\u002Fdocs.google.com\u002Fpresentation\u002Fd\u002F1MKB51lhNOsV-Y_hscSaJk7wZskzxft2pFJQZKyvcMyo\u002Fedit?usp=sharing)\n\n## Lecture 71: [ScaleML Series] FlexOlmo: Open Language Models for Flexible Data Use\n- Speaker: [Sewon Min](https:\u002F\u002Fwww.sewonmin.com)\n- [Slides](lecture_071)\n\n## Lecture 72: [ScaleML Series] Efficient & Effective Long-Context Modeling for Large Language Models\n- Speaker: [Guangxuan Xiao](https:\u002F\u002Fguangxuanx.com)\n- [Slides](lecture_072)\n\n## Lecture 74: [ScaleML Series] Positional Encodings and PaTH Attention\n- Speaker: [Songlin Yang](https:\u002F\u002Fsustcsonglin.github.io)\n- [Slides](lecture_074)\n\n## Lecture 75: [ScaleML Series] GPU Programming Fundamentals + ThunderKittens\n- Speaker 1: William Brandon\n  - [Slides 1](https:\u002F\u002Fdocs.google.com\u002Fpresentation\u002Fd\u002F1ypi4IjEF36PUZGOJSaFxjNzk7BpO61TicdTBBf77oqc\u002F)\n- Speaker 2: [Simran Arora](https:\u002F\u002Farorasimran.com)\n  - [Slides 2](lecture_075)\n\n## Lecture 78: Iris: Multi-GPU Programming in Triton\nSpeakers: Muhammad Awad, Muhammad Osama & Brandon Potter\n- [Slides](lecture_078)\n\n## Lecture 79: Mirage (MPK): Compiling LLMs into Mega Kernels\nSpeakers: Mengdi Wu, Xinhao Cheng\n- [Slides](lecture_079)\n\n## Lecture 84: Numerics and AI\nSpeaker: Paulius Micikevicius\n- [Slides](lecture_084)\n\n## Lecture 86: Introduction to CuTeDSL (for NVIDIA competition)\nSpeaker: Vicki Wang\n- [Slides](lecture_086)\n\n## Lecture 103: Fundamentals of CuTe Layout Algebra and Category-theoretic Interpretation\nSpeaker: Jack Carlisle and Jay Shah\n- [Slides](lecture_103)\n\n## Lecture 104: Gluon: Tile-Based GPU Programming with 
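Low-Level Control\nSpeakers: Peter Bell, Mario Lezcano, Keren Zhou\n- [Slides and notes](lecture_104)\n\n## Lecture 106: HF kernels\n- [Slides](https:\u002F\u002Fdocs.google.com\u002Fpresentation\u002Fd\u002F1RibAIrOJv0BcAx2QjNYHDZCrMfGYifTggtKT6uwv7CY\u002Fedit)\n","gpu-mode\u002Flectures is a project that provides lecture materials on GPU programming. It contains a series of Jupyter Notebooks, slides, and code examples on CUDA and parallel computing, covering topics from basic to advanced, such as CUDA kernel integration, an introduction to memory architecture, and optimization strategies. Its technical features include teaching CUDA programming with Python and the PyTorch framework, and it provides directly runnable Colab versions so that learners can practice. It is suited to developers and students interested in high-performance computing and deep learning acceleration, as a reference resource when learning how to use GPUs to speed up program execution.",2,"2026-06-11 03:49:06","high_star"]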