[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-72877":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":25,"hasPages":23,"topics":26,"createdAt":10,"pushedAt":10,"updatedAt":39,"readmeContent":40,"aiSummary":41,"trendingCount":16,"starSnapshotCount":16,"syncStatus":42,"lastSyncTime":43,"discoverSource":44},72877,"LeetCUDA","xlite-dev\u002FLeetCUDA","xlite-dev","📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉","https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002FLeetCUDA",null,"Cuda",11230,1147,56,1,0,31,96,289,93,119.18,"GNU General Public License v3.0",false,"main",true,[27,28,29,30,31,32,33,34,35,36,37,38],"cuda","cuda-12","cuda-cpp","cuda-demo","cuda-kernel","cuda-kernels","cuda-library","cuda-toolkit","flash-attention","hgemm","learn-cuda","leet-cuda","2026-06-12 04:01:07","\u003Cdiv align=\"center\">\n  \u003Cp align=\"center\">\n    \u003Ch2>📚 LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners 🐑\u003C\u002Fh2>\n    \u003Cimg src='https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Fb2578723-b7a7-4d8f-bcd1-5008947b808a' >\n  \u003C\u002Fp>\n  \u003Cdiv align='center'>\n      \u003Cimg src=https:\u002F\u002Fcdn.rawgit.com\u002Fsindresorhus\u002Fawesome\u002Fd7305f38d29fed78fa85652e3a63e154dd8e8829\u002Fmedia\u002Fbadge.svg >\n      \u003Cimg src=https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLanguage-CUDA-brightgreen.svg >\n      \u003Cimg src=https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fforks\u002Fxlite-dev\u002FLeetCUDA.svg?style=dark >\n      
\u003Cimg src=https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fxlite-dev\u002FLeetCUDA.svg?style=dark >\n      \u003Cimg src=https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-GPLv3.0-turquoise.svg >\n      \u003Ca href=\"https:\u002F\u002Fhellogithub.com\u002Frepository\u002F98348655a96640ca8ddcbc298edc901d\" target=\"_blank\">\u003Cimg src=\"https:\u002F\u002Fapi.hellogithub.com\u002Fv1\u002Fwidgets\u002Frecommend.svg?rid=98348655a96640ca8ddcbc298edc901d&claim_uid=ofSCbzTmdeQk3FD&theme=small\" alt=\"Featured｜HelloGitHub\" \u002F>\u003C\u002Fa>\n  \u003C\u002Fdiv>\n\u003C\u002Fdiv>\n\n📚 **LeetCUDA**: It includes **Tensor\u002FCUDA Cores, TF32\u002FF16\u002FBF16\u002FF8**, [📖200+ CUDA Kernels🔥](#cuda-kernel) with PyTorch, [📖100+ LLM\u002FCUDA🔥](#my-blogs-part-1) blogs, [📖HGEMM⚡️](.\u002Fkernels\u002Fhgemm) which can achieve `98%~100%` TFLOPS of **cuBLAS**, and [📖flash-attn⚡️](.\u002Fkernels\u002Fflash-attn) using Tensor Cores with pure MMA PTX. ♥️ Please consider to leave a ⭐️ Star to support me, my bro ~ ♥️\n\n\u003Cdiv align=\"center\">\n  \u003Cp align=\"center\">\n    \u003Ca href=\"#contribute\">🔥🔥 PR Welcome: Add Your Kernel to LeetCUDA! Let's make it Awesome together! 
🎉🎉\u003C\u002Fa> \u003Cbr>\n    \u003Ca href=https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002FLeetCUDA\u002Fgraphs\u002Fcontributors > \u003Cimg src=https:\u002F\u002Fopencollective.com\u002Fleetcuda\u002Fcontributors.svg height=40px > \u003C\u002Fa>\n  \u003C\u002Fp>\n\u003C\u002Fdiv>\n\n## ©️Citations🎉🎉\n\n```BibTeX\n@misc{LeetCUDA@2025,\n  title={LeetCUDA: A Modern CUDA Learn Notes with PyTorch for Beginners},\n  url={https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002FLeetCUDA.git},\n  note={Open-source software available at https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002FLeetCUDA.git},\n  author={DefTruth and Many Others},\n  year={2025}\n}\n```\n\n\n## 📖 News 🔥🔥\n\u003Cdiv id=\"news\">\u003C\u002Fdiv>\n\n- [2026\u002F03]: Cache-DiT **[🎉v1.3.0](https:\u002F\u002Fgithub.com\u002Fvipshop\u002Fcache-dit)** release is ready. Major updates include: [Ring](https:\u002F\u002Fcache-dit.readthedocs.io\u002Fen\u002Flatest\u002Fuser_guide\u002FCONTEXT_PARALLEL) Attention with [batched P2P](https:\u002F\u002Fcache-dit.readthedocs.io\u002Fen\u002Flatest\u002Fuser_guide\u002FCONTEXT_PARALLEL), [USP](https:\u002F\u002Fcache-dit.readthedocs.io\u002Fen\u002Flatest\u002Fuser_guide\u002FCONTEXT_PARALLEL\u002F) (Hybrid Ring and Ulysses), Hybrid 2D and 3D Parallelism (💥[USP + TP](https:\u002F\u002Fcache-dit.readthedocs.io\u002Fen\u002Flatest\u002Fuser_guide\u002FHYBRID_PARALLEL\u002F)), and reduced VAE-P comm overhead.\n\n![arch](https:\u002F\u002Fgithub.com\u002Fvipshop\u002Fcache-dit\u002Fraw\u002Fmain\u002Fassets\u002Farch_v2.png)\n\n- [2026\u002F04]: **[🤖ffpa-attn](https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002Fffpa-attn.git)** is released! 
Yet another Faster Flash Prefill Attention with O(1)🎉SRAM complexity for large headdim, **1.8x~3x↑**🎉 vs SDPA EA: [📈L20 ~1.9x↑🎉](https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002Fffpa-attn?tab=readme-ov-file#L1-bench-l20), [📈A30 ~1.8x↑🎉](https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002Fffpa-attn?tab=readme-ov-file#L1-bench-a30), [📈4090 ~2.1x↑🎉](https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002Fffpa-attn?tab=readme-ov-file#L1-bench-4090). Currently, FFPA supports self-attention, cross-attention, grouped\u002Fmulti-query attention, and causal attention with large headdim (D=320~1024), while the standard FlashAttention-2 only supports headdim \u003C= 256.\n\n\u003Cdiv align='center'>\n\u003Cimg height=\"320px\" alt=\"image\" src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Fed30185b-2e11-4293-832f-43e9003d6ad9\" \u002F>\n\u003C\u002Fdiv>\n\n- [2024\u002F12]: **[⚡️HGEMM](https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002FHGEMM.git)** is released! Write HGEMM from scratch using Tensor Cores with the **WMMA, MMA and CuTe** APIs, achieving peak🎉 performance.\n\n## 📖 Contents\n\u003Cdiv id=\"contents\">\u003C\u002Fdiv>\n\n- [📖 HGEMM-MMA 🎉🎉](#HGEMM-bench)\n- [📖 FlashAttention-MMA 🎉🎉](#fa-mma-bench)\n  - [📚 Split KV (Basic, FA-1)](#mma-split-kv)\n  - [📚 Split Q (Faster, FA-2)](#mma-split-q)\n  - [📚 Split Q + Shared KV](#mma-share-kv)\n  - [📚 Split Q + Shared QKV](#mma-share-qkv)\n  - [📚 Split Q + QK Tiling](#mma-tiling-qk)\n  - [📚 Split Q + QKV Tiling](#mma-tiling-qkv)\n- [📖 200+ CUDA Kernels 🔥🔥](#cuda-kernel)\n  - [📚 Easy ⭐️](#cuda-kernel-easy-medium)\n  - [📚 Medium ⭐️⭐️](#cuda-kernel-easy-medium)\n  - [📚 Hard ⭐️⭐️⭐️](#cuda-kernel-hard)\n  - [📚 Hard+ ⭐️⭐️⭐️⭐️](#cuda-kernel-hard-plus)\n  - [📚 Hard++ ⭐⭐⭐️⭐️⭐️](#cuda-kernel-hard-plus)\n  - [📚 Triton ⭐⭐⭐️](#triton-kernel)\n  - [📚 CUTLASS ⭐⭐⭐️](#cutlass-kernel)\n- [📖 100+ LLM\u002FCUDA Blogs 🔥](#my-blogs-part-1)\n- [📖 How to Contribute 👀👇](#contribute)\n\n\n## 📖 HGEMM Benchmark 🎉🎉\n\n\u003Cdiv 
id=\"HGEMM-bench\">\u003C\u002Fdiv>\n\nCurrently, on NVIDIA L20, RTX 4090 and RTX 3080 Laptop, compared with cuBLAS's default Tensor Cores algorithm, the `HGEMM (WMMA\u002FMMA\u002FCuTe)` in this repo (`blue`🔵) can achieve `98%~100%` of its (`orange`🟠) performance. Please check [toy-hgemm library⚡️⚡️](.\u002Fkernels\u002Fhgemm) or [HGEMM⚡️⚡️](https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002FHGEMM) repo for more details.\n\n![toy-hgemm-library](https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F962bda14-b494-4423-b8eb-775da9f5503d)\n\n|📚Feature |📚Feature |📚Feature |📚Feature|\n|:---:|:---:|:---:|:---:|\n|✔️CUDA\u002F**Tensor Cores**|✔️Loop over K|✔️Tile Block(BMxBK)|✔️Tile Threads(T 8x8)|\n|✔️WMMA(m16n16k16)|✔️MMA(m16n8k16)|✔️Pack LDST(128 bits)|✔️SMEM Padding|\n|✔️Copy Async|✔️Tile MMAs|✔️Tile Warps|✔️**Multi Stages(2~4)**|\n|✔️Register Double Buffers|✔️**Block Swizzle**|✔️**Warp Swizzle**|✔️**SMEM Swizzle**(CuTe\u002FMMA)|\n|✔️Collective Store(Shfl)|✔️Layout NN|✔️Layout TN|✔️SGEMM FP32\u002FTF32|\n\n## 📖 FA2-MMA Benchmark 🎉🎉\n\n\u003Cdiv id=\"fa-mma-bench\">\u003C\u002Fdiv>\n\nI have also implemented **FlashAttention-2** using pure MMA PTX instructions, which supports features such as Multi-Stages, Tile MMA, Tile Warp, Shared KV SMEM, **Fully Shared QKV SMEM**, **Prefetch Q s2r**, **Prefetch K\u002FV g2s**, **QKV Fine-grained Tiling**, Collective Store, etc. 
Please refer to [flash-attn⚡️⚡️](.\u002Fkernels\u002Fflash-attn) for more details.\n\n![flash-attn-mma](https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F6f66796d-44d5-4ec1-b224-af997bd152b2)\n\n|📚Feature |📚Feature |📚Feature |📚Feature|\n|:---:|:---:|:---:|:---:|\n|✔️Tensor Cores|✔️Loop over N\u002FD |✔️Tile Block(Br, Bc)|✔️MMA(m16n8k16)|\n|✔️Pack LDST(128 bits)|✔️SMEM **Swizzle**\u002FPadding |✔️Copy Async|✔️Tile MMAs|\n|✔️Tile Warps|✔️Multi Stages(1\u002F2)|✔️Collective Store(Shfl)|✔️**Split KV\u002FQ**|\n|✔️**Shared QKV** SMEM|✔️**Prefetch Q** s2r|✔️**Prefetch KV** g2s|✔️**QKV Fine-grained Tiling**|\n\nCurrently, for small-scale attention `(B\u003C=4, H \u003C=48, SeqLen \u003C= 8192, D \u003C= 64)` it can run faster than FA2\u002FSDPA on some devices. For example, on NVIDIA RTX 3080 Laptop, the [📚 Split Q + Fully Shared QKV SMEM](#mma-share-qkv) method can achieve **55 TFLOPS (D=64)**, which is **~1.5x** 🎉 faster than FA2. On NVIDIA L20, the 🤖[ffpa-attn](https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002Fffpa-attn) method can achieve **104 TFLOPS (D=512)**, which is **~1.8x** 🎉 faster than SDPA (EFFICIENT ATTENTION). However, for large-scale attention, there remains a performance gap. 
Stay tuned for updates ~ (MMA Acc F16\u002FF32, softmax Acc F32 vs FA2 MMA\u002Fsoftmax Acc F32, 👇Benchmark)\n\n|Algorithm| (B,H,N,D) | RTX 3080 Laptop | L20 | RTX 4090 |\n|:---:|:---:|:---:|:---:|:---:|\n|FlashAttention-2|(1,8,8192,64)|37 TFLOPS|100 TFLOPS|145 TFLOPS|\n|share-qkv+stage2|(1,8,8192,64)|**55 TFLOPS**|99 TFLOPS|**221 TFLOPS**|\n|FlashAttention-2|(1,48,8192,64)|37 TFLOPS|109 TFLOPS|163 TFLOPS|\n|share-qkv+stage2|(1,48,8192,64)|**48 TFLOPS**|107 TFLOPS|**224 TFLOPS**|\n|SDPA(EFFICIENT ATTENTION)|(1,48,8192,512)|16 TFLOPS|58 TFLOPS|85 TFLOPS|\n|🤖[ffpa-attn](https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002Fffpa-attn)|(1,48,8192,512)|**39 TFLOPS**|**104 TFLOPS**|**200 TFLOPS**|\n|Precision Errors vs FA2\u002FSDPA| \u002F | max: \u003C ~1e-3 | min: ~0.0 | mean: \u003C ~1e-5 |\n\nBoth the `Split KV` and `Split Q` implementations are available in [flash-attn⚡️⚡️](.\u002Fkernels\u002Fflash-attn) for performance comparison. The `Split KV` method, which splits all of Q, K and V across MMA (Warps), is slower than the `Split Q` method, which splits only Q across MMA (Warps) and keeps KV accessible to all MMA (Warps).\n\n- 📚 Split KV (Basic, FlashAttention-1)\n\u003Cdiv id=\"mma-split-kv\">\u003C\u002Fdiv>\n\n```C++\n\u002F\u002F Split QKV across MMA(Warps) using naive matmul MMA&Warp tiling policy.\n\u002F\u002F case: The layout of 8 MMA(2x4)  [after] kWarpTileSeqLenQxkWarpTileSeqLenK(2x2) -> 32x2,32x2=64x64:\n\u002F\u002F |  [64,64]  |    warp_KV 0    |    warp_KV 1    |    warp_KV 2    |    warp_KV 3    |\n\u002F\u002F | warp_QP 0 |-- MMA 0,MMA 0 --|-- MMA 2,MMA 2 --|-- MMA 4,MMA 4 --|-- MMA 6,MMA 6 --|\n\u002F\u002F | warp_QP 0 |-- MMA 0,MMA 0 --|-- MMA 2,MMA 2 --|-- MMA 4,MMA 4 --|-- MMA 6,MMA 6 --|\n\u002F\u002F | warp_QP 1 |-- MMA 1,MMA 1 --|-- MMA 3,MMA 3 --|-- MMA 5,MMA 5 --|-- MMA 7,MMA 7 --|\n\u002F\u002F | warp_QP 1 |-- MMA 1,MMA 1 --|-- MMA 3,MMA 3 --|-- MMA 5,MMA 5 --|-- MMA 7,MMA 7 --|\n__global__ void \u002F\u002F Q, K, V, O -> [B, H, N, 
D]\nflash_attn_mma_stages_split_kv_kernel(half* Q, half* K, half* V, half* O, ...);\n```\n\n- 📚 Split Q (Faster, FlashAttention-2)\n\u003Cdiv id=\"mma-split-q\">\u003C\u002Fdiv>\n\n```C++\n\u002F\u002F Split Q across MMA(Warps) and keep KV accessible to all MMA(Warps),\n\u002F\u002F in order to reduce the communication between warps via smem and warp shuffle.\n\u002F\u002F case: MMA = m16n8k16, Br=16x4=64, Bc=8x8=64, layout: 4 warps\n\u002F\u002F |   64x64   |      warp_KV 0       |\n\u002F\u002F | warp_QP 0 | MMA 0 ... MMA 0 (x8) |\n\u002F\u002F | warp_QP 1 | MMA 1 ... MMA 1 (x8) |\n\u002F\u002F | warp_QP 2 | MMA 2 ... MMA 2 (x8) |\n\u002F\u002F | warp_QP 3 | MMA 3 ... MMA 3 (x8) |\n__global__ void \u002F\u002F Q, K, V, O -> [B, H, N, D]\nflash_attn_mma_stages_split_q_kernel(half* Q, half* K, half* V, half* O, ...);\n```\n\n- 📚 Split Q + Shared KV SMEM (**1\u002F2 SRAM** vs FA2)\n\u003Cdiv id=\"mma-share-kv\">\u003C\u002Fdiv>\n\n```C++\n\u002F\u002F K and V share the same shared memory, which improves block occupancy.\n__global__ void \u002F\u002F Q, K, V, O -> [B, H, N, D]\nflash_attn_mma_stages_split_q_shared_kv_kernel(half* Q, half* K, half* V, half* O, ...);\n```\n\n- 📚 Split Q + Fully Shared QKV SMEM (**1\u002F4 SRAM** vs FA2)\n\n\u003Cdiv id=\"mma-share-qkv\">\u003C\u002Fdiv>\n\n```C++\n\u002F\u002F Q, K and V fully share the same shared memory and Q is prefetched s2r, which improves\n\u002F\u002F block occupancy and reduces Q SMEM IO access.\n__global__ void \u002F\u002F Q, K, V, O -> [B, H, N, D]\nflash_attn_mma_stages_split_q_shared_qkv_kernel(half* Q, half* K, half* V, half* O, ...);\n```\n\n- 📚 Split Q + QK Fine-grained Tiling (**O(16xd) SRAM** vs FA2 **O(4xBrxd) SRAM**, `Headdim -> 1024`)\n\n\u003Cdiv id=\"mma-tiling-qk\">\u003C\u002Fdiv>\n\n```C++\n\u002F\u002F Fine-grained tiling at the MMA level for Q@K^T results in a constant SRAM usage of\n\u002F\u002F 64 * kMmaAtomK for Q and K. 
For V, the SRAM complexity is O(kMmaAtomK * d), leading to\n\u002F\u002F an overall SRAM complexity of O(kMmaAtomK * d). Consequently, this approach allows us to\n\u002F\u002F extend D (head dimension) up to 1024.\n__global__ void \u002F\u002F Q, K, V, O -> [B, H, N, D]\nflash_attn_mma_stages_split_q_tiling_qk_kernel(half* Q, half* K, half* V, half* O, ...);\n```\n\n- 📚 Split Q + Fully QKV Fine-grained Tiling (**O(2xBrx16)~O(1) SRAM** vs FA2 **O(4xBrxd) SRAM**)\n\n\u003Cdiv id=\"mma-tiling-qkv\">\u003C\u002Fdiv>\n\n```C++\n\u002F\u002F Fine-grained tiling at the MMA level for all Q@K^T and P@V results in a constant SRAM usage of\n\u002F\u002F Br * 16 or Bc * 16 for Q, K, V, leading to an overall SRAM complexity of O(Br * 16). Consequently,\n\u002F\u002F this approach allows us to run faster than SDPA w or w\u002Fo MMA Acc F32.\n__global__ void \u002F\u002F Q, K, V, O -> [B, H, N, D]\nflash_attn_mma_stages_split_q_tiling_qkv_kernel(half* Q, half* K, half* V, half* O, ...);\n```\n\n💡NOTE: [📚Split Q + Fully QKV Fine-grained Tiling](#mma-tiling-qkv) has been refactored into 🤖[ffpa-attn](https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002Fffpa-attn).\n\n## 📖 200+ CUDA Kernels 🔥🔥 (Easy -> Hard++) ([©️back👆🏻](#contents))\n\n\u003Cdiv id=\"cuda-kernel\">\u003C\u002Fdiv>\n\nThe kernels listed here will guide you through a step-by-step progression, ranging from easy to very challenging topics. The **workflow** for each topic will be as follows: custom **CUDA kernel** implementation -> PyTorch **Python bindings** -> Run tests. 👉TIPS: `*` = Tensor Cores (WMMA, MMA, CuTe), otherwise, CUDA Cores; `\u002F` = not supported; `✔️` = supported; `❔` = TODO. 
Contents are listed as follows:\n\n- [📚 Easy ⭐️](#cuda-kernel-easy-medium)\n- [📚 Medium ⭐️⭐️](#cuda-kernel-easy-medium)\n- [📚 Hard ⭐️⭐️⭐️](#cuda-kernel-hard)\n- [📚 Hard+ ⭐️⭐️⭐️⭐️](#cuda-kernel-hard-plus)\n- [📚 Hard++ ⭐⭐⭐️⭐️⭐️](#cuda-kernel-hard-plus)\n- [📚 Triton ⭐⭐⭐️](#triton-kernel)\n- [📚 CUTLASS ⭐⭐⭐️](#cutlass-kernel)\n\n[📚 Easy](#cuda-kernel-easy-medium) and [📚 Medium](#cuda-kernel-easy-medium) sections cover operations such as `element-wise, mat_trans, warp\u002Fblock reduce, nms, relu, gelu, swish, layer-norm, rms-norm, online-softmax, dot-prod, embedding` and basic usage for `FP32`, `FP16`, `BF16` and `FP8` . [📚 Hard](#cuda-kernel-hard), [📚 Hard+](#cuda-kernel-hard-plus) and [📚 Hard++](#cuda-kernel-hard-plus) sections delve deeper into advanced topics, primarily focusing on operations like `sgemv, sgemm, hgemv, hgemm and flash-attention`. These sections also provide numerous kernels implemented using Tensor Cores with pure MMA PTX.\n\n### 📚 Easy ⭐️ & Medium ⭐️⭐️  ([©️back👆🏻](#cuda-kernel))\n\u003Cdiv id=\"cuda-kernel-easy-medium\">\u003C\u002Fdiv>\n\n|📖 CUDA Kernel| 📖 Elem DType| 📖 Acc DType| 📖 Docs | 📖 Level |\n|:---|:---|:---|:---|:---|\n| ✔️ [elementwise_f32](.\u002Fkernels\u002Felementwise\u002Felementwise.cu)|f32|\u002F|[link](.\u002Fkernels\u002Felementwise\u002F)|⭐️|\n| ✔️ [elementwise_f32x4](.\u002Fkernels\u002Felementwise\u002Felementwise.cu)|f32|\u002F|[link](.\u002Fkernels\u002Felementwise\u002F)|⭐️|\n| ✔️ [elementwise_f16](.\u002Fkernels\u002Felementwise\u002Felementwise.cu)|f16|\u002F|[link](.\u002Fkernels\u002Felementwise\u002F)|⭐️|\n| ✔️ [elementwise_f16x2](.\u002Fkernels\u002Felementwise\u002Felementwise.cu)|f16|\u002F|[link](.\u002Fkernels\u002Felementwise\u002F)|⭐️|\n| ✔️ [elementwise_f16x8](.\u002Fkernels\u002Felementwise\u002Felementwise.cu)|f16|\u002F|[link](.\u002Fkernels\u002Felementwise\u002F)|⭐️|\n| ✔️ 
[elementwise_f16x8_pack](.\u002Fkernels\u002Felementwise\u002Felementwise.cu)|f16|\u002F|[link](.\u002Fkernels\u002Felementwise\u002F)|⭐️⭐️|\n| ✔️ [histogram_i32](.\u002Fkernels\u002Fhistogram\u002Fhistogram.cu)|i32|\u002F|[link](.\u002Fkernels\u002Fhistogram\u002F)|⭐️|\n| ✔️ [histogram_i32x4](.\u002Fkernels\u002Fhistogram\u002Fhistogram.cu)|i32|\u002F|[link](.\u002Fkernels\u002Fhistogram\u002F)|⭐️|\n| ✔️ [sigmoid_f32](.\u002Fkernels\u002Fsigmoid\u002Fsigmoid.cu)|f32|\u002F|[link](.\u002Fkernels\u002Fsigmoid\u002F)|⭐️|\n| ✔️ [sigmoid_f32x4](.\u002Fkernels\u002Fsigmoid\u002Fsigmoid.cu)|f32|\u002F|[link](.\u002Fkernels\u002Fsigmoid\u002F)|⭐️|\n| ✔️ [sigmoid_f16](.\u002Fkernels\u002Fsigmoid\u002Fsigmoid.cu)|f16|\u002F|[link](.\u002Fkernels\u002Fsigmoid\u002F)|⭐️|\n| ✔️ [sigmoid_f16x2](.\u002Fkernels\u002Fsigmoid\u002Fsigmoid.cu)|f16|\u002F|[link](.\u002Fkernels\u002Fsigmoid\u002F)|⭐️|\n| ✔️ [sigmoid_f16x8](.\u002Fkernels\u002Fsigmoid\u002Fsigmoid.cu)|f16|\u002F|[link](.\u002Fkernels\u002Fsigmoid\u002F)|⭐️|\n| ✔️ [sigmoid_f16x8_pack](.\u002Fkernels\u002Fsigmoid\u002Fsigmoid.cu)|f16|\u002F|[link](.\u002Fkernels\u002Fsigmoid\u002F)|⭐️⭐️|\n| ✔️ [relu_f32](.\u002Fkernels\u002Frelu\u002Frelu.cu)|f32|\u002F|[link](.\u002Fkernels\u002Frelu\u002F)|⭐️|\n| ✔️ [relu_f32x4](.\u002Fkernels\u002Frelu\u002Frelu.cu)|f32|\u002F|[link](.\u002Fkernels\u002Frelu\u002F)|⭐️|\n| ✔️ [relu_f16](.\u002Fkernels\u002Frelu\u002Frelu.cu)|f16|\u002F|[link](.\u002Fkernels\u002Frelu\u002F)|⭐️|\n| ✔️ [relu_f16x2](.\u002Fkernels\u002Frelu\u002Frelu.cu)|f16|\u002F|[link](.\u002Fkernels\u002Frelu\u002F)|⭐️|\n| ✔️ [relu_f16x8](.\u002Fkernels\u002Frelu\u002Frelu.cu)|f16|\u002F|[link](.\u002Fkernels\u002Frelu\u002F)|⭐️|\n| ✔️ [relu_f16x8_pack](.\u002Fkernels\u002Frelu\u002Frelu.cu)|f16|\u002F|[link](.\u002Fkernels\u002Frelu\u002F)|⭐️⭐️|\n| ✔️ [elu_f32](.\u002Fkernels\u002Felu\u002Felu.cu)|f32|\u002F|[link](.\u002Fkernels\u002Felu\u002F)|⭐️|\n| ✔️ 
[elu_f32x4](.\u002Fkernels\u002Felu\u002Felu.cu)|f32|\u002F|[link](.\u002Fkernels\u002Felu\u002F)|⭐️|\n| ✔️ [elu_f16](.\u002Fkernels\u002Felu\u002Felu.cu)|f16|\u002F|[link](.\u002Fkernels\u002Felu\u002F)|⭐️|\n| ✔️ [elu_f16x2](.\u002Fkernels\u002Felu\u002Felu.cu)|f16|\u002F|[link](.\u002Fkernels\u002Felu\u002F)|⭐️|\n| ✔️ [elu_f16x8](.\u002Fkernels\u002Felu\u002Felu.cu)|f16|\u002F|[link](.\u002Fkernels\u002Felu\u002F)|⭐️|\n| ✔️ [elu_f16x8_pack](.\u002Fkernels\u002Felu\u002Felu.cu)|f16|\u002F|[link](.\u002Fkernels\u002Felu\u002F)|⭐️⭐️|\n| ✔️ [gelu_f32](.\u002Fkernels\u002Fgelu\u002Fgelu.cu)|f32|\u002F|[link](.\u002Fkernels\u002Fgelu\u002F)|⭐️|\n| ✔️ [gelu_f32x4](.\u002Fkernels\u002Fgelu\u002Fgelu.cu)|f32|\u002F|[link](.\u002Fkernels\u002Fgelu\u002F)|⭐️|\n| ✔️ [gelu_f16](.\u002Fkernels\u002Fgelu\u002Fgelu.cu)|f16|\u002F|[link](.\u002Fkernels\u002Fgelu\u002F)|⭐️|\n| ✔️ [gelu_f16x2](.\u002Fkernels\u002Fgelu\u002Fgelu.cu)|f16|\u002F|[link](.\u002Fkernels\u002Fgelu\u002F)|⭐️|\n| ✔️ [gelu_f16x8](.\u002Fkernels\u002Fgelu\u002Fgelu.cu)|f16|\u002F|[link](.\u002Fkernels\u002Fgelu\u002F)|⭐️|\n| ✔️ [gelu_f16x8_pack](.\u002Fkernels\u002Fgelu\u002Fgelu.cu)|f16|\u002F|[link](.\u002Fkernels\u002Fgelu\u002F)|⭐️⭐️|\n| ✔️ [swish_f32](.\u002Fkernels\u002Fswish\u002Fswish.cu)|f32|\u002F|[link](.\u002Fkernels\u002Fswish\u002F)|⭐️|\n| ✔️ [swish_f32x4](.\u002Fkernels\u002Fswish\u002Fswish.cu)|f32|\u002F|[link](.\u002Fkernels\u002Fswish\u002F)|⭐️|\n| ✔️ [swish_f16](.\u002Fkernels\u002Fswish\u002Fswish.cu)|f16|\u002F|[link](.\u002Fkernels\u002Fswish\u002F)|⭐️|\n| ✔️ [swish_f16x2](.\u002Fkernels\u002Fswish\u002Fswish.cu)|f16|\u002F|[link](.\u002Fkernels\u002Fswish\u002F)|⭐️|\n| ✔️ [swish_f16x8](.\u002Fkernels\u002Fswish\u002Fswish.cu)|f16|\u002F|[link](.\u002Fkernels\u002Fswish\u002F)|⭐️|\n| ✔️ [swish_f16x8_pack](.\u002Fkernels\u002Fswish\u002Fswish.cu)|f16|\u002F|[link](.\u002Fkernels\u002Fswish\u002F)|⭐️⭐️|\n| ✔️ 
[hardswish_f32](.\u002Fkernels\u002Fhardswish\u002Fhardswish.cu)|f32|\u002F|[link](.\u002Fkernels\u002Fhardswish\u002F)|⭐️|\n| ✔️ [hardswish_f32x4](.\u002Fkernels\u002Fhardswish\u002Fhardswish.cu)|f32|\u002F|[link](.\u002Fkernels\u002Fhardswish\u002F)|⭐️|\n| ✔️ [hardswish_f16](.\u002Fkernels\u002Fhardswish\u002Fhardswish.cu)|f16|\u002F|[link](.\u002Fkernels\u002Fhardswish\u002F)|⭐️|\n| ✔️ [hardswish_f16x2](.\u002Fkernels\u002Fhardswish\u002Fhardswish.cu)|f16|\u002F|[link](.\u002Fkernels\u002Fhardswish\u002F)|⭐️|\n| ✔️ [hardswish_f16x8](.\u002Fkernels\u002Fhardswish\u002Fhardswish.cu)|f16|\u002F|[link](.\u002Fkernels\u002Fhardswish\u002F)|⭐️|\n| ✔️ [hardswish_f16x8_pack](.\u002Fkernels\u002Fhardswish\u002Fhardswish.cu)|f16|\u002F|[link](.\u002Fkernels\u002Fhardswish\u002F)|⭐️⭐️|\n| ✔️ [hardshrink_f32](.\u002Fkernels\u002Fhardshrink\u002Fhardshrink.cu)|f32|\u002F|[link](.\u002Fkernels\u002Fhardshrink\u002F)|⭐️|\n| ✔️ [hardshrink_f32x4](.\u002Fkernels\u002Fhardshrink\u002Fhardshrink.cu)|f32|\u002F|[link](.\u002Fkernels\u002Fhardshrink\u002F)|⭐️|\n| ✔️ [hardshrink_f16](.\u002Fkernels\u002Fhardshrink\u002Fhardshrink.cu)|f16|\u002F|[link](.\u002Fkernels\u002Fhardshrink\u002F)|⭐️|\n| ✔️ [hardshrink_f16x2](.\u002Fkernels\u002Fhardshrink\u002Fhardshrink.cu)|f16|\u002F|[link](.\u002Fkernels\u002Fhardshrink\u002F)|⭐️|\n| ✔️ [hardshrink_f16x8](.\u002Fkernels\u002Fhardshrink\u002Fhardshrink.cu)|f16|\u002F|[link](.\u002Fkernels\u002Fhardshrink\u002F)|⭐️|\n| ✔️ [hardshrink_f16x8_pack](.\u002Fkernels\u002Fhardshrink\u002Fhardshrink.cu)|f16|\u002F|[link](.\u002Fkernels\u002Fhardshrink\u002F)|⭐️⭐️|\n| ✔️ [embedding_f32](.\u002Fkernels\u002Fembedding\u002Fembedding.cu)|f32|\u002F|[link](.\u002Fkernels\u002Fembedding\u002F)|⭐️|\n| ✔️ [embedding_f32x4](.\u002Fkernels\u002Fembedding\u002Fembedding.cu)|f32|\u002F|[link](.\u002Fkernels\u002Fembedding\u002F)|⭐️|\n| ✔️ 
[embedding_f32x4_pack](.\u002Fkernels\u002Fembedding\u002Fembedding.cu)|f32|\u002F|[link](.\u002Fkernels\u002Fembedding\u002F)|⭐️|\n| ✔️ [embedding_f16](.\u002Fkernels\u002Fembedding\u002Fembedding.cu)|f16|\u002F|[link](.\u002Fkernels\u002Fembedding\u002F)|⭐️|\n| ✔️ [embedding_f16x2](.\u002Fkernels\u002Fembedding\u002Fembedding.cu)|f16|\u002F|[link](.\u002Fkernels\u002Fembedding\u002F)|⭐️|\n| ✔️ [embedding_f16x8](.\u002Fkernels\u002Fembedding\u002Fembedding.cu)|f16|\u002F|[link](.\u002Fkernels\u002Fembedding\u002F)|⭐️|\n| ✔️ [embedding_f16x8_pack](.\u002Fkernels\u002Fembedding\u002Fembedding.cu)|f16|\u002F|[link](.\u002Fkernels\u002Fembedding\u002F)|⭐️⭐️|\n| ✔️ [mat_trans_f32_col2row{2d}](.\u002Fkernels\u002Fmat-transpose\u002Fmat_transpose.cu)|f32|\u002F|[link](.\u002Fkernels\u002Fmat-transpose\u002F)|⭐️|\n| ✔️ [mat_trans_f32_row2col{2d}](.\u002Fkernels\u002Fmat-transpose\u002Fmat_transpose.cu)|f32|\u002F|[link](.\u002Fkernels\u002Fmat-transpose\u002F)|⭐️|\n| ✔️ [mat_trans_f32_diagonal2d](.\u002Fkernels\u002Fmat-transpose\u002Fmat_transpose.cu)|f32|\u002F|[link](.\u002Fkernels\u002Fmat-transpose\u002F)|⭐️⭐️|\n| ✔️ [mat_trans_f32x4_col2row{2d}](.\u002Fkernels\u002Fmat-transpose\u002Fmat_transpose.cu)|f32|\u002F|[link](.\u002Fkernels\u002Fmat-transpose\u002F)|⭐️⭐️|\n| ✔️ [mat_trans_f32x4_row2col{2d}](.\u002Fkernels\u002Fmat-transpose\u002Fmat_transpose.cu)|f32|\u002F|[link](.\u002Fkernels\u002Fmat-transpose\u002F)|⭐️⭐️|\n| ✔️ [mat_trans_cute](.\u002Fkernels\u002Fmat-transpose\u002Fmat_transpose_cute.cu)|f32|\u002F|[link](.\u002Fkernels\u002Fmat-transpose\u002F)|⭐️⭐️|\n| ✔️ [warp_reduce_{all}](.\u002Fkernels\u002Freduce\u002Fblock_all_reduce.cu)|all|all|[link](.\u002Fkernels\u002Freduce\u002F)|⭐️⭐️|\n| ✔️ [block_all_reduce_f32_f32](.\u002Fkernels\u002Freduce\u002Fblock_all_reduce.cu)|f32|f32|[link](.\u002Fkernels\u002Freduce\u002F)|⭐️⭐️|\n| ✔️ 
[block_all_reduce_f32x4_f32](.\u002Fkernels\u002Freduce\u002Fblock_all_reduce.cu)|f32|f32|[link](.\u002Fkernels\u002Freduce\u002F)|⭐️⭐️|\n| ✔️ [block_all_reduce_f16_f16](.\u002Fkernels\u002Freduce\u002Fblock_all_reduce.cu)|f16|f16|[link](.\u002Fkernels\u002Freduce\u002F)|⭐️⭐️|\n| ✔️ [block_all_reduce_f16_f32](.\u002Fkernels\u002Freduce\u002Fblock_all_reduce.cu)|f16|f32|[link](.\u002Fkernels\u002Freduce\u002F)|⭐️⭐️|\n| ✔️ [block_all_reduce_f16x2_f16](.\u002Fkernels\u002Freduce\u002Fblock_all_reduce.cu)|f16|f16|[link](.\u002Fkernels\u002Freduce\u002F)|⭐️⭐️|\n| ✔️ [block_all_reduce_f16x2_f32](.\u002Fkernels\u002Freduce\u002Fblock_all_reduce.cu)|f16|f32|[link](.\u002Fkernels\u002Freduce\u002F)|⭐️⭐️|\n| ✔️ [block_all_reduce_f16x8_pack_f16](.\u002Fkernels\u002Freduce\u002Fblock_all_reduce.cu)|f16|f16|[link](.\u002Fkernels\u002Freduce\u002F)|⭐️⭐️|\n| ✔️ [block_all_reduce_f16x8_pack_f32](.\u002Fkernels\u002Freduce\u002Fblock_all_reduce.cu)|f16|f32|[link](.\u002Fkernels\u002Freduce\u002F)|⭐️⭐️|\n| ✔️ [block_all_reduce_bf16_bf16](.\u002Fkernels\u002Freduce\u002Fblock_all_reduce.cu)|bf16|bf16|[link](.\u002Fkernels\u002Freduce\u002F)|⭐️⭐️|\n| ✔️ [block_all_reduce_bf16_f32](.\u002Fkernels\u002Freduce\u002Fblock_all_reduce.cu)|bf16|f32|[link](.\u002Fkernels\u002Freduce\u002F)|⭐️⭐️|\n| ✔️ [block_all_reduce_bf16x2_bf16](.\u002Fkernels\u002Freduce\u002Fblock_all_reduce.cu)|bf16|bf16|[link](.\u002Fkernels\u002Freduce\u002F)|⭐️⭐️|\n| ✔️ [block_all_reduce_bf16x2_f32](.\u002Fkernels\u002Freduce\u002Fblock_all_reduce.cu)|bf16|f32|[link](.\u002Fkernels\u002Freduce\u002F)|⭐️⭐️|\n| ✔️ [block_all_reduce_bf16x8_pack_bf16](.\u002Fkernels\u002Freduce\u002Fblock_all_reduce.cu)|bf16|bf16|[link](.\u002Fkernels\u002Freduce\u002F)|⭐️⭐️|\n| ✔️ [block_all_reduce_bf16x8_pack_f32](.\u002Fkernels\u002Freduce\u002Fblock_all_reduce.cu)|bf16|f32|[link](.\u002Fkernels\u002Freduce\u002F)|⭐️⭐️|\n| ✔️ 
[block_all_reduce_fp8_e4m3_f16](.\u002Fkernels\u002Freduce\u002Fblock_all_reduce.cu)|fp8_e4m3|f16|[link](.\u002Fkernels\u002Freduce\u002F)|⭐️⭐️⭐️|\n| ✔️ [block_all_reduce_fp8_e5m2_f16](.\u002Fkernels\u002Freduce\u002Fblock_all_reduce.cu)|fp8_e5m2|f16|[link](.\u002Fkernels\u002Freduce\u002F)|⭐️⭐️⭐️|\n| ✔️ [block_all_reduce_fp8_e4m3x16_pack_f16](.\u002Fkernels\u002Freduce\u002Fblock_all_reduce.cu)|fp8_e4m3|f16|[link](.\u002Fkernels\u002Freduce\u002F)|⭐️⭐️⭐️|\n| ✔️ [block_all_reduce_fp8_e5m2x16_pack_f16](.\u002Fkernels\u002Freduce\u002Fblock_all_reduce.cu)|fp8_e5m2|f16|[link](.\u002Fkernels\u002Freduce\u002F)|⭐️⭐️⭐️|\n| ✔️ [block_all_reduce_i8_i32](.\u002Fkernels\u002Freduce\u002Fblock_all_reduce.cu)|i8|i32|[link](.\u002Fkernels\u002Freduce\u002F)|⭐️⭐️|\n| ✔️ [block_all_reduce_i8x16_pack_i32](.\u002Fkernels\u002Freduce\u002Fblock_all_reduce.cu)|i8|i32|[link](.\u002Fkernels\u002Freduce\u002F)|⭐️⭐️|\n| ✔️ [dot_product_f32](.\u002Fkernels\u002Fdot-product\u002Fdot_product.cu)|f32|f32|[link](.\u002Fkernels\u002Fdot-product\u002F)|⭐️⭐️|\n| ✔️ [dot_product_f32x4](.\u002Fkernels\u002Fdot-product\u002Fdot_product.cu)|f32|f32|[link](.\u002Fkernels\u002Fdot-product\u002F)|⭐️⭐️|\n| ✔️ [dot_product_f16_f32](.\u002Fkernels\u002Fdot-product\u002Fdot_product.cu)|f16|f32|[link](.\u002Fkernels\u002Fdot-product\u002F)|⭐️⭐️|\n| ✔️ [dot_product_f16x2_f32](.\u002Fkernels\u002Fdot-product\u002Fdot_product.cu)|f16|f32|[link](.\u002Fkernels\u002Fdot-product\u002F)|⭐️⭐️|\n| ✔️ [dot_product_f16x8_pack_f32](.\u002Fkernels\u002Fdot-product\u002Fdot_product.cu)|f16|f32|[link](.\u002Fkernels\u002Fdot-product\u002F)|⭐️⭐️|\n| ✔️ [softmax_f32_per_tok](.\u002Fkernels\u002Fsoftmax\u002Fsoftmax.cu)|f32|f32|[link](.\u002Fkernels\u002Fsoftmax\u002F)|⭐️⭐️|\n| ✔️ [softmax_f32x4_per_tok](.\u002Fkernels\u002Fsoftmax\u002Fsoftmax.cu)|f32|f32|[link](.\u002Fkernels\u002Fsoftmax\u002F)|⭐️⭐️|\n| ✔️ 
[safe_softmax_f32_per_tok](.\u002Fkernels\u002Fsoftmax\u002Fsoftmax.cu)|f32|f32|[link](.\u002Fkernels\u002Fsoftmax\u002F)|⭐️⭐️|\n| ✔️ [safe_softmax_f32x4_per_tok](.\u002Fkernels\u002Fsoftmax\u002Fsoftmax.cu)|f32|f32|[link](.\u002Fkernels\u002Fsoftmax\u002F)|⭐️⭐️|\n| ✔️ [safe_softmax_f16_f32_per_tok](.\u002Fkernels\u002Fsoftmax\u002Fsoftmax.cu)|f16|f32|[link](.\u002Fkernels\u002Fsoftmax\u002F)|⭐️⭐️|\n| ✔️ [safe_softmax_f16x2_f32_per_tok](.\u002Fkernels\u002Fsoftmax\u002Fsoftmax.cu)|f16|f32|[link](.\u002Fkernels\u002Fsoftmax\u002F)|⭐️⭐️|\n| ✔️ [safe_softmax_f16x8_pack_f32_per_tok](.\u002Fkernels\u002Fsoftmax\u002Fsoftmax.cu)|f16|f32|[link](.\u002Fkernels\u002Fsoftmax\u002F)|⭐️⭐️|\n| ✔️ [online_safe_softmax_f32_per_token](.\u002Fkernels\u002Fsoftmax\u002Fsoftmax.cu)|f32|f32|[link](.\u002Fkernels\u002Fsoftmax\u002F)|⭐️⭐️|\n| ✔️ [online_safe_softmax_f32x4_pack_per_tok](.\u002Fkernels\u002Fsoftmax\u002Fsoftmax.cu)|f32|f32|[link](.\u002Fkernels\u002Fsoftmax\u002F)|⭐️⭐️|\n| ✔️ [rope_f32](.\u002Fkernels\u002Frope\u002Frope.cu)|f32|f32|[link](.\u002Fkernels\u002Frope\u002F)|⭐️⭐️|\n| ✔️ [rope_f32x4_pack](.\u002Fkernels\u002Frope\u002Frope.cu)|f32|f32|[link](.\u002Fkernels\u002Frope\u002F)|⭐️⭐️|\n| ✔️ [layer_norm_f32](.\u002Fkernels\u002Flayer-norm\u002Flayer_norm.cu)|f32|f32|[link](.\u002Fkernels\u002Flayer-norm\u002F)|⭐️⭐️|\n| ✔️ [layer_norm_f32x4](.\u002Fkernels\u002Flayer-norm\u002Flayer_norm.cu)|f32|f32|[link](.\u002Fkernels\u002Flayer-norm\u002F)|⭐️⭐️|\n| ✔️ [layer_norm_f16_f16](.\u002Fkernels\u002Flayer-norm\u002Flayer_norm.cu)|f16|f16|[link](.\u002Fkernels\u002Flayer-norm\u002F)|⭐️⭐️|\n| ✔️ [layer_norm_f16x2_f16](.\u002Fkernels\u002Flayer-norm\u002Flayer_norm.cu)|f16|f16|[link](.\u002Fkernels\u002Flayer-norm\u002F)|⭐️⭐️|\n| ✔️ [layer_norm_f16x8_f16](.\u002Fkernels\u002Flayer-norm\u002Flayer_norm.cu)|f16|f16|[link](.\u002Fkernels\u002Flayer-norm\u002F)|⭐️⭐️|\n| ✔️ 
[layer_norm_f16x8_pack_f16](.\u002Fkernels\u002Flayer-norm\u002Flayer_norm.cu)|f16|f16|[link](.\u002Fkernels\u002Flayer-norm\u002F)|⭐️⭐️|\n| ✔️ [layer_norm_f16x8_pack_f32](.\u002Fkernels\u002Flayer-norm\u002Flayer_norm.cu)|f16|f32|[link](.\u002Fkernels\u002Flayer-norm\u002F)|⭐️⭐️|\n| ✔️ [layer_norm_f16_f32](.\u002Fkernels\u002Flayer-norm\u002Flayer_norm.cu)|f16|f32|[link](.\u002Fkernels\u002Flayer-norm\u002F)|⭐️⭐️|\n| ✔️ [rms_norm_f32](.\u002Fkernels\u002Frms-norm\u002Frms_norm.cu)|f32|f32|[link](.\u002Fkernels\u002Frms-norm\u002F)|⭐️⭐️|\n| ✔️ [rms_norm_f32x4](.\u002Fkernels\u002Frms-norm\u002Frms_norm.cu)|f32|f32|[link](.\u002Fkernels\u002Frms-norm\u002F)|⭐️⭐️|\n| ✔️ [rms_norm_f16_f16](.\u002Fkernels\u002Frms-norm\u002Frms_norm.cu)|f16|f16|[link](.\u002Fkernels\u002Frms-norm\u002F)|⭐️⭐️|\n| ✔️ [rms_norm_f16x2_f16](.\u002Fkernels\u002Frms-norm\u002Frms_norm.cu)|f16|f16|[link](.\u002Fkernels\u002Frms-norm\u002F)|⭐️⭐️|\n| ✔️ [rms_norm_f16x8_f16](.\u002Fkernels\u002Frms-norm\u002Frms_norm.cu)|f16|f16|[link](.\u002Fkernels\u002Frms-norm\u002F)|⭐️⭐️|\n| ✔️ [rms_norm_f16x8_f32](.\u002Fkernels\u002Frms-norm\u002Frms_norm.cu)|f16|f32|[link](.\u002Fkernels\u002Frms-norm\u002F)|⭐️⭐️|\n| ✔️ [rms_norm_f16x8_pack_f16](.\u002Fkernels\u002Frms-norm\u002Frms_norm.cu)|f16|f16|[link](.\u002Fkernels\u002Frms-norm\u002F)|⭐️⭐️|\n| ✔️ [rms_norm_f16x8_pack_f32](.\u002Fkernels\u002Frms-norm\u002Frms_norm.cu)|f16|f32|[link](.\u002Fkernels\u002Frms-norm\u002F)|⭐️⭐️|\n| ✔️ [rms_norm_f16_f32](.\u002Fkernels\u002Frms-norm\u002Frms_norm.cu)|f16|f32|[link](.\u002Fkernels\u002Frms-norm\u002F)|⭐️⭐️|\n| ✔️ [nms_f32](.\u002Fkernels\u002Fnms\u002Fnms.cu)|f32|\u002F|[link](.\u002Fkernels\u002Fnms)|⭐️⭐️|\n| ✔️ [merge_attn_states](.\u002Fkernels\u002Fopenai-triton\u002Fmerge-attn-states\u002Fcuda_merge_attn_states.cu)|f16\u002Fbf16\u002Ff32|f32|[link](.\u002Fkernels\u002Fopenai-triton\u002Fmerge-attn-states)|⭐️⭐️|\n| ✔️ [notes v1(deprecated)](.\u002Fkernels\u002Fnotes-v1.cu)|f32|f32|\u002F|⭐️⭐️|\n| ✔️ 
[How to use nsys\u002Fncu(timeline\u002Fptx\u002Fsass)](.\u002Fkernels\u002Fnvidia-nsight\u002F)|\u002F|\u002F|[link](.\u002Fkernels\u002Fnvidia-nsight\u002F)|⭐️⭐️|\n\n### 📚 Hard ⭐⭐⭐️ ([©️back👆🏻](#cuda-kernel))\n\n\u003Cdiv id=\"cuda-kernel-hard\">\u003C\u002Fdiv>\n\n|📖 CUDA Kernel| 📖 Elem DType| 📖 Acc DType| 📖 Docs | 📖 Level |\n|:---|:---|:---|:---|:---|\n| ✔️ [sgemv_k32_f32](.\u002Fkernels\u002Fsgemv\u002Fsgemv.cu)|f32|f32|[link](.\u002Fkernels\u002Fsgemv\u002F)|⭐️⭐️⭐️|\n| ✔️ [sgemv_k128_f32x4](.\u002Fkernels\u002Fsgemv\u002Fsgemv.cu)|f32|f32|[link](.\u002Fkernels\u002Fsgemv\u002F)|⭐️⭐️⭐️|\n| ✔️ [sgemv_k16_f32](.\u002Fkernels\u002Fsgemv\u002Fsgemv.cu)|f32|f32|[link](.\u002Fkernels\u002Fsgemv\u002F)|⭐️⭐️⭐️|\n| ✔️ [hgemv_k32_f16](.\u002Fkernels\u002Fhgemv\u002Fhgemv.cu)|f16|f16|[link](.\u002Fkernels\u002Fhgemv\u002F)|⭐️⭐️⭐️|\n| ✔️ [hgemv_k128_f16x4](.\u002Fkernels\u002Fhgemv\u002Fhgemv.cu)|f16|f16|[link](.\u002Fkernels\u002Fhgemv\u002F)|⭐️⭐️⭐️|\n| ✔️ [hgemv_k16_f16](.\u002Fkernels\u002Fhgemv\u002Fhgemv.cu)|f16|f16|[link](.\u002Fkernels\u002Fhgemv\u002F)|⭐️⭐️⭐️|\n| ✔️ [sgemm_naive_f32](.\u002Fkernels\u002Fsgemm\u002Fsgemm.cu)|f32|f32|[link](.\u002Fkernels\u002Fsgemm\u002F)|⭐️⭐️|\n| ✔️ [sgemm_sliced_k_f32](.\u002Fkernels\u002Fsgemm\u002Fsgemm.cu)|f32|f32|[link](.\u002Fkernels\u002Fsgemm\u002F)|⭐️⭐️⭐️|\n| ✔️ [sgemm_t_8x8_sliced_k_f32x4](.\u002Fkernels\u002Fsgemm\u002Fsgemm.cu)|f32|f32|[link](.\u002Fkernels\u002Fsgemm\u002F)|⭐️⭐️⭐️|\n| ✔️ [sgemm_t_8x8_sliced_k...bcf](.\u002Fkernels\u002Fsgemm\u002Fsgemm.cu)|f32|f32|[link](.\u002Fkernels\u002Fsgemm\u002F)|⭐️⭐️⭐️|\n| ✔️ [sgemm_t_8x8_sliced_k...dbuf](.\u002Fkernels\u002Fsgemm\u002Fsgemm.cu)|f32|f32|[link](.\u002Fkernels\u002Fsgemm\u002F)|⭐️⭐️⭐️|\n| ✔️ [sgemm_t_8x8_sliced_k16...dbuf](.\u002Fkernels\u002Fsgemm\u002Fsgemm_async.cu)|f32|f32|[link](.\u002Fkernels\u002Fsgemm\u002F)|⭐️⭐️⭐️|\n| ✔️ 
[sgemm_t_8x8_sliced_k16...async](.\u002Fkernels\u002Fsgemm\u002Fsgemm_async.cu)|f32|f32|[link](.\u002Fkernels\u002Fsgemm\u002F)|⭐️⭐️⭐️|\n| ✔️ [sgemm_wmma_m16n16k8...stages*](.\u002Fkernels\u002Fsgemm\u002Fsgemm_wmma_tf32_stage.cu)|tf32|f32|[link](.\u002Fkernels\u002Fsgemm\u002F)|⭐️⭐️⭐️|\n| ✔️ [sgemm_wmma_m16n16k8...swizzle*](.\u002Fkernels\u002Fsgemm\u002Fsgemm_wmma_tf32_stage.cu)|tf32|f32|[link](.\u002Fkernels\u002Fsgemm\u002F)|⭐️⭐️⭐️|\n| ✔️ [hgemm_naive_f16](.\u002Fkernels\u002Fhgemm\u002Fnaive\u002Fhgemm.cu)|f16|f16|[link](.\u002Fkernels\u002Fhgemm\u002F)|⭐️⭐️|\n| ✔️ [hgemm_sliced_k_f16](.\u002Fkernels\u002Fhgemm\u002Fnaive\u002Fhgemm.cu)|f16|f16|[link](.\u002Fkernels\u002Fhgemm\u002F)|⭐️⭐️⭐️|\n| ✔️ [hgemm_t_8x8_sliced_k_f16x4](.\u002Fkernels\u002Fhgemm\u002Fhgemm.cu)|f16|f16|[link](.\u002Fkernels\u002Fhgemm\u002F)|⭐️⭐️⭐️|\n| ✔️ [hgemm_t_8x8_sliced_k_f16x4_pack](.\u002Fkernels\u002Fhgemm\u002Fnaive\u002Fhgemm.cu)|f16|f16|[link](.\u002Fkernels\u002Fhgemm\u002F)|⭐️⭐️⭐️|\n| ✔️ [hgemm_t_8x8_sliced_k_f16x8_pack](.\u002Fkernels\u002Fhgemm\u002Fnaive\u002Fhgemm.cu)|f16|f16|[link](.\u002Fkernels\u002Fhgemm\u002F)|⭐️⭐️⭐️|\n| ✔️ [hgemm_t_8x8_sliced_k...dbuf](.\u002Fkernels\u002Fhgemm\u002Fnaive\u002Fhgemm.cu)|f16|f16|[link](.\u002Fkernels\u002Fhgemm\u002F)|⭐️⭐️⭐️|\n| ✔️ [hgemm_t_8\u002F16x8...k16\u002F32...dbuf](.\u002Fkernels\u002Fhgemm\u002Fnaive\u002Fhgemm_async.cu)|f16|f16|[link](.\u002Fkernels\u002Fhgemm\u002F)|⭐️⭐️⭐️|\n| ✔️ [hgemm_t_8\u002F16x8...k16\u002F32...async](.\u002Fkernels\u002Fhgemm\u002Fnaive\u002Fhgemm_async.cu)|f16|f16|[link](.\u002Fkernels\u002Fhgemm\u002F)|⭐️⭐️⭐️|\n| ✔️ [hgemm_wmma_m16n16k16...naive*](.\u002Fkernels\u002Fhgemm\u002Fwmma\u002Fhgemm_wmma.cu)|f16|f16|[link](.\u002Fkernels\u002Fhgemm\u002F)|⭐️⭐️⭐️|\n| ✔️ [hgemm_wmma_m16n16k16...mma4x2*](.\u002Fkernels\u002Fhgemm\u002Fwmma\u002Fhgemm_wmma.cu)|f16|f16|[link](.\u002Fkernels\u002Fhgemm\u002F)|⭐️⭐️⭐️|\n| ✔️ 
[hgemm_wmma_m16n16k16...mma4x4*](.\u002Fkernels\u002Fhgemm\u002Fwmma\u002Fhgemm_wmma.cu)|f16|f16|[link](.\u002Fkernels\u002Fhgemm\u002F)|⭐️⭐️⭐️|\n| ✔️ [hgemm_wmma_m16n16k16...dbuf*](.\u002Fkernels\u002Fhgemm\u002Fwmma\u002Fhgemm_wmma.cu)|f16|f16|[link](.\u002Fkernels\u002Fhgemm\u002F)|⭐️⭐️⭐️|\n| ✔️ [hgemm_wmma_m32n8k16....dbuf*](.\u002Fkernels\u002Fhgemm\u002Fwmma\u002Fhgemm_wmma.cu)|f16|f16|[link](.\u002Fkernels\u002Fhgemm\u002F)|⭐️⭐️⭐️|\n| ✔️ [hgemm_wmma_m16n16k16...stages*](.\u002Fkernels\u002Fhgemm\u002Fwmma\u002Fhgemm_wmma_stage.cu)|f16|f16|[link](.\u002Fkernels\u002Fhgemm\u002F)|⭐️⭐️⭐️|\n| ✔️ [hgemm_wmma_m16n16k16...swizzle*](.\u002Fkernels\u002Fhgemm\u002Fwmma\u002Fhgemm_wmma_stage.cu)|f16|f16|[link](.\u002Fkernels\u002Fhgemm\u002F)|⭐️⭐️⭐️|\n| ✔️ [hgemm_mma_m16n8k16...naive*](.\u002Fkernels\u002Fhgemm\u002Fmma\u002Fbasic\u002Fhgemm_mma.cu)|f16|f16|[link](.\u002Fkernels\u002Fhgemm\u002F)|⭐️⭐️⭐️|\n| ✔️ [hgemm_mma_m16n8k16...mma2x4*](.\u002Fkernels\u002Fhgemm\u002Fmma\u002Fbasic\u002Fhgemm_mma.cu)|f16|f16|[link](.\u002Fkernels\u002Fhgemm\u002F)|⭐️⭐️⭐️|\n| ✔️ [hgemm_mma_m16n8k16...stages*](.\u002Fkernels\u002Fhgemm\u002Fmma\u002Fbasic\u002Fhgemm_mma_stage.cu)|f16|f16|[link](.\u002Fkernels\u002Fhgemm\u002F)|⭐️⭐️⭐️|\n| ✔️ [hgemm_mma_m16n8k16...swizzle*](.\u002Fkernels\u002Fhgemm\u002Fmma\u002Fbasic\u002Fhgemm_mma_stage.cu)|f16|f16|[link](.\u002Fkernels\u002Fhgemm\u002F)|⭐️⭐️⭐️|\n| ✔️ [hgemm_mma_m16n8k16...swizzle{smem}*](.\u002Fkernels\u002Fhgemm\u002Fmma\u002Fswizzle\u002Fhgemm_mma_stage_swizzle.cu)|f16|f16|[link](.\u002Fkernels\u002Fhgemm\u002F)|⭐️⭐️⭐️|\n| ✔️ [hgemm_mma_m16n8k16...swizzle{tn}{smem}*](.\u002Fkernels\u002Fhgemm\u002Fmma\u002Fswizzle\u002Fhgemm_mma_stage_tn_swizzle_x4.cu)|f16|f16|[link](.\u002Fkernels\u002Fhgemm\u002F)|⭐️⭐️⭐️|\n| ✔️ [hgemm_mma_stages_swizzle{smem}...cute*](.\u002Fkernels\u002Fhgemm\u002Fcutlass\u002Fhgemm_mma_stage_tn_cute.cu)|f16|f16|[link](.\u002Fkernels\u002Fhgemm\u002F)|⭐️⭐️⭐️|\n| ✔️ 
[hgemm_mma_cublas*](.\u002Fkernels\u002Fhgemm\u002Fcublas\u002Fhgemm_cublas.cu)|f16|f16|[link](.\u002Fkernels\u002Fhgemm\u002F)|⭐️⭐️|\n| ✔️ [hgemm_wgmma_m64n128k16...tma{ws}{tn}*](.\u002Fkernels\u002Fhgemm\u002Fwgmma\u002Fhgemm_wgmma_fp16acc_stages_tn.cu)|f16|f16|[link](.\u002Fkernels\u002Fhgemm\u002F)|⭐️⭐️⭐️|\n| ✔️ [hgemm_wgmma_m64n128k16_fp32...tma*](.\u002Fkernels\u002Fhgemm\u002Fwgmma\u002Fhgemm_wgmma_fp32acc_stages_tn.cu)|f16|f32|[link](.\u002Fkernels\u002Fhgemm\u002F)|⭐️⭐️⭐️|\n\n### 📚 Hard+ ⭐️⭐️⭐️⭐️ & Hard++ ⭐️⭐️⭐️⭐️⭐️ ([©️back👆🏻](#cuda-kernel))\n\n- 📚 FlashAttention-2 MMA (MMA Acc F32\u002FF16, swizzle, QKV smem share, fine-grained tiling, etc.🎉)\n\n\u003Cdiv id=\"cuda-kernel-hard-plus\">\u003C\u002Fdiv>\n\n|📖 CUDA Kernel| 📖 Elem DType| 📖 Acc DType| 📖 Docs | 📖 Level |\n|:---|:---|:---|:---|:---|\n| ✔️ [flash_attn_cute(naive)](.\u002Fkernels\u002Fflash-attn\u002Fcutlass\u002Fflash_attn_cute.cu)|f16|f32|[link](.\u002Fkernels\u002Fflash-attn\u002F)|⭐️⭐️⭐️|\n| ✔️ [How to implement MMA smem swizzle*](.\u002Fkernels\u002Fswizzle\u002Fmma_simple_swizzle.cu)|f16|f16|[link](.\u002Fkernels\u002Fswizzle)|⭐️⭐️⭐️|\n| ✔️ [flash_attn_mma_stages_split_kv*](.\u002Fkernels\u002Fflash-attn\u002Fmma\u002Fbasic\u002Fflash_attn_mma_split_kv.cu)|f16|f16|[link](.\u002Fkernels\u002Fflash-attn)|⭐️⭐️⭐️⭐️|\n| ✔️ [flash_attn_mma_stages_split_q*](.\u002Fkernels\u002Fflash-attn\u002Fmma\u002Fbasic\u002Fflash_attn_mma_split_q.cu)|f16|f16|[link](.\u002Fkernels\u002Fflash-attn)|⭐️⭐️⭐️⭐️|\n| ✔️ [flash_attn_mma_stages...shared_kv*](.\u002Fkernels\u002Fflash-attn\u002Fmma\u002Fbasic\u002Fflash_attn_mma_share_kv.cu)|f16|f16|[link](.\u002Fkernels\u002Fflash-attn)|⭐️⭐️⭐️⭐️|\n| ✔️ [flash_attn_mma_stages...shared_qkv*](.\u002Fkernels\u002Fflash-attn\u002Fmma\u002Fbasic\u002Fflash_attn_mma_share_qkv.cu)|f16|f16|[link](.\u002Fkernels\u002Fflash-attn)|⭐️⭐️⭐️⭐️|\n| ✔️ 
[flash_attn_mma_stages...tiling_qk*](.\u002Fkernels\u002Fflash-attn\u002Fmma\u002Fbasic\u002Fflash_attn_mma_tiling_qk.cu)|f16|f16|[link](.\u002Fkernels\u002Fflash-attn)|⭐️⭐️⭐️⭐️|\n| ✔️ [flash_attn_mma_stages...tiling_qkv*](.\u002Fkernels\u002Fflash-attn\u002Fmma\u002Fbasic\u002Fflash_attn_mma_tiling_qkv.cu)|f16|f16|[link](.\u002Fkernels\u002Fflash-attn)|⭐️⭐️⭐️⭐️|\n| ✔️ [flash_attn_mma_stages...shared_kv{f32}*](.\u002Fkernels\u002Fflash-attn\u002Fmma\u002Fbasic\u002Fflash_attn_mma_share_kv_F32F16F16F32.cu)|f16|f32|[link](.\u002Fkernels\u002Fflash-attn)|⭐️⭐️⭐️⭐️|\n| ✔️ [flash_attn_mma_stages...shared_qkv{f32}*](.\u002Fkernels\u002Fflash-attn\u002Fmma\u002Fbasic\u002Fflash_attn_mma_share_qkv_F32F16F16F32.cu)|f16|f32|[link](.\u002Fkernels\u002Fflash-attn)|⭐️⭐️⭐️⭐️|\n| ✔️ [flash_attn_mma_stages...tiling_qk{f32}*](.\u002Fkernels\u002Fflash-attn\u002Fmma\u002Fbasic\u002Fflash_attn_mma_tiling_qk_F32F16F16F32.cu)|f16|f32|[link](.\u002Fkernels\u002Fflash-attn)|⭐️⭐️⭐️⭐️|\n| ✔️ [flash_attn_mma_stages...tiling_qkv{f32}*](.\u002Fkernels\u002Fflash-attn\u002Fmma\u002Fbasic\u002Fflash_attn_mma_tiling_qkv_F32F16F16F32.cu)|f16|f32|[link](.\u002Fkernels\u002Fflash-attn)|⭐️⭐️⭐️⭐️|\n| ✔️ [flash_attn_mma...shared_kv{f32}{rr}*](.\u002Fkernels\u002Fflash-attn\u002Fmma\u002Fothers\u002Fflash_attn_mma_share_kv_F32F16F16F32_rr.cu)|f16|f32|[link](.\u002Fkernels\u002Fflash-attn)|⭐️⭐️⭐️⭐️|\n| ✔️ [flash_attn_mma...shared_qkv{f32}{rr}*](.\u002Fkernels\u002Fflash-attn\u002Fmma\u002Fothers\u002Fflash_attn_mma_share_qkv_F32F16F16F32_rr.cu)|f16|f32|[link](.\u002Fkernels\u002Fflash-attn)|⭐️⭐️⭐️⭐️|\n| ✔️ [flash_attn_mma...shared_kv_swizzle{q}*](.\u002Fkernels\u002Fflash-attn\u002Fmma\u002Fswizzle\u002Fflash_attn_mma_share_kv_swizzle_q.cu)|f16|f16|[link](.\u002Fkernels\u002Fflash-attn)|⭐️⭐️⭐️⭐️|\n| ✔️ [flash_attn_mma...shared_kv_swizzle{qk}*](.\u002Fkernels\u002Fflash-attn\u002Fmma\u002Fswizzle\u002Fflash_attn_mma_share_kv_swizzle_qk.cu)|f16|f16|[link](.\u002Fkernels\u002Fflash-attn)|⭐️⭐️⭐️⭐️|\n| ✔️ 
[flash_attn_mma...shared_kv_swizzle{qkv}*](.\u002Fkernels\u002Fflash-attn\u002Fmma\u002Fswizzle\u002Fflash_attn_mma_share_kv_swizzle_qkv.cu)|f16|f16|[link](.\u002Fkernels\u002Fflash-attn)|⭐️⭐️⭐️⭐️|\n| ✔️ [flash_attn_mma...shared_qkv_swizzle{q}*](.\u002Fkernels\u002Fflash-attn\u002Fmma\u002Fswizzle\u002Fflash_attn_mma_share_qkv_swizzle_q.cu)|f16|f16|[link](.\u002Fkernels\u002Fflash-attn)|⭐️⭐️⭐️⭐️|\n| ✔️ [flash_attn_mma...shared_qkv_swizzle{qk}*](.\u002Fkernels\u002Fflash-attn\u002Fmma\u002Fswizzle\u002Fflash_attn_mma_share_qkv_swizzle_qk.cu)|f16|f16|[link](.\u002Fkernels\u002Fflash-attn)|⭐️⭐️⭐️⭐️|\n| ✔️ [flash_attn_mma...shared_qkv_swizzle{qkv}*](.\u002Fkernels\u002Fflash-attn\u002Fmma\u002Fswizzle\u002Fflash_attn_mma_share_qkv_swizzle_qkv.cu)|f16|f16|[link](.\u002Fkernels\u002Fflash-attn)|⭐️⭐️⭐️⭐️|\n| ✔️ [flash_attn_mma...tiling_qk_swizzle{q}*](.\u002Fkernels\u002Fflash-attn\u002Fmma\u002Fswizzle\u002Fflash_attn_mma_tiling_qk_swizzle_q.cu)|f16|f16|[link](.\u002Fkernels\u002Fflash-attn)|⭐️⭐️⭐️⭐️|\n| ✔️ [flash_attn_mma...tiling_qk_swizzle{qk}*](.\u002Fkernels\u002Fflash-attn\u002Fmma\u002Fswizzle\u002Fflash_attn_mma_tiling_qk_swizzle_qk.cu)|f16|f16|[link](.\u002Fkernels\u002Fflash-attn)|⭐️⭐️⭐️⭐️|\n| ✔️ [flash_attn_mma...tiling_qk_swizzle{qkv}*](.\u002Fkernels\u002Fflash-attn\u002Fmma\u002Fswizzle\u002Fflash_attn_mma_tiling_qk_swizzle_qkv.cu)|f16|f16|[link](.\u002Fkernels\u002Fflash-attn)|⭐️⭐️⭐️⭐️|\n| ✔️ [flash_attn_mma...tiling_qkv_swizzle{q}*](.\u002Fkernels\u002Fflash-attn\u002Fmma\u002Fswizzle\u002Fflash_attn_mma_tiling_qkv_swizzle_q.cu)|f16|f16|[link](.\u002Fkernels\u002Fflash-attn)|⭐️⭐️⭐️⭐️|\n| ✔️ [flash_attn_mma...tiling_qkv_swizzle{qk}*](.\u002Fkernels\u002Fflash-attn\u002Fmma\u002Fswizzle\u002Fflash_attn_mma_tiling_qkv_swizzle_qk.cu)|f16|f16|[link](.\u002Fkernels\u002Fflash-attn)|⭐️⭐️⭐️⭐️|\n| ✔️ 
[flash_attn_mma...tiling_qkv_swizzle{qkv}*](.\u002Fkernels\u002Fflash-attn\u002Fmma\u002Fswizzle\u002Fflash_attn_mma_tiling_qkv_swizzle_qkv.cu)|f16|f16|[link](.\u002Fkernels\u002Fflash-attn)|⭐️⭐️⭐️⭐️|\n| ✔️ [flash_attn...tiling_qkv_swizzle{q}{f32}*](.\u002Fkernels\u002Fflash-attn\u002Fmma\u002Fswizzle\u002Fflash_attn_mma_tiling_qkv_swizzle_q_F32F16F16F32.cu)|f16|f32|[link](.\u002Fkernels\u002Fflash-attn)|⭐️⭐️⭐️⭐️|\n| ✔️ [flash_attn...tiling_qkv_swizzle{qk}{f32}*](.\u002Fkernels\u002Fflash-attn\u002Fmma\u002Fswizzle\u002Fflash_attn_mma_tiling_qkv_swizzle_qk_F32F16F16F32.cu)|f16|f32|[link](.\u002Fkernels\u002Fflash-attn)|⭐️⭐️⭐️⭐️|\n| ✔️ [flash_attn...tiling_qkv_swizzle{qkv}{f32}*](.\u002Fkernels\u002Fflash-attn\u002Fmma\u002Fswizzle\u002Fflash_attn_mma_tiling_qkv_swizzle_qkv_F32F16F16F32.cu)|f16|f32|[link](.\u002Fkernels\u002Fflash-attn)|⭐️⭐️⭐️⭐️|\n\n💡NOTE: **rr**: reduces register usage (for `d>128`); **f32**: MMA accumulates in FP32, otherwise in FP16. The softmax Acc dtype is always FP32 for high precision; **swizzle**: currently, only smem swizzle is supported for MMA.\n\n- 📚 FFPA Attention MMA (**1.8x~3x**🎉faster than SDPA EA for D > 256, which FA2 does not support)\n\n|📖 CUDA Kernel| 📖 Elem DType| 📖 Acc DType| 📖 Docs | 📖 Level |\n|:---|:---|:---|:---|:---|\n| ✔️ [ffpa_mma_stages_split_q_L1_F16F16F16](https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002Fffpa-attn\u002Fblob\u002Fmain\u002Fcsrc\u002Fcuffpa\u002Fffpa_attn_F16F16F16_L1.cu)|f16|f16|[link](https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002Fffpa-attn)|⭐️⭐️⭐️⭐️|\n| ✔️ [ffpa_mma_stages_split_q_L1_F16F16F32](https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002Fffpa-attn\u002Fblob\u002Fmain\u002Fcsrc\u002Fcuffpa\u002Fffpa_attn_F16F16F32_L1.cu)|f16|f32|[link](https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002Fffpa-attn)|⭐️⭐️⭐️⭐️|\n| ✔️ [ffpa_mma_stages_split_q_L1_mixed_acc](https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002Fffpa-attn\u002Fblob\u002Fmain\u002Fcsrc\u002Fcuffpa\u002Fffpa_attn_F16F16F32_L1.cu)|f16|QK f32, PV 
f16|[link](https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002Fffpa-attn)|⭐️⭐️⭐️⭐️|\n| ⚠️ [ffpa_mma_stages_split_q_L2_F16F16F16](https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002Fffpa-attn\u002Fblob\u002Fmain\u002Fcsrc\u002Fcuffpa\u002Fffpa_attn_F16F16F16_L2.cu)|f16|f16|[link](https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002Fffpa-attn)|⭐️⭐️⭐️⭐️|\n| ⚠️ [ffpa_mma_stages_split_q_L2_F16F16F32](https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002Fffpa-attn\u002Fblob\u002Fmain\u002Fcsrc\u002Fcuffpa\u002Fffpa_attn_F16F16F32_L2.cu)|f16|f32|[link](https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002Fffpa-attn)|⭐️⭐️⭐️⭐️|\n| ⚠️ [ffpa_mma_stages_split_q_L2_mixed_acc](https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002Fffpa-attn\u002Fblob\u002Fmain\u002Fcsrc\u002Fcuffpa\u002Fffpa_attn_F16F16F32_L2.cu)|f16|QK f32, PV f16|[link](https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002Fffpa-attn)|⭐️⭐️⭐️⭐️|\n| ⚠️ [ffpa_mma_stages_split_q_L3_F16F16F16](https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002Fffpa-attn\u002Fblob\u002Fmain\u002Fcsrc\u002Fcuffpa\u002Fffpa_attn_F16F16F16_L3.cu)|f16|f16|[link](https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002Fffpa-attn)|⭐️⭐️⭐️⭐️|\n| ⚠️ [ffpa_mma_stages_split_q_L3_F16F16F32](https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002Fffpa-attn\u002Fblob\u002Fmain\u002Fcsrc\u002Fcuffpa\u002Fffpa_attn_F16F16F32_L3.cu)|f16|f32|[link](https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002Fffpa-attn)|⭐️⭐️⭐️⭐️|\n| ⚠️ [ffpa_mma_stages_split_q_L3_mixed_acc](https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002Fffpa-attn\u002Fblob\u002Fmain\u002Fcsrc\u002Fcuffpa\u002Fffpa_attn_F16F16F32_L3.cu)|f16|QK f32, PV f16|[link](https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002Fffpa-attn)|⭐️⭐️⭐️⭐️|\n\n💡NOTE: 🤖[ffpa-attn](https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002Fffpa-attn): 📚FFPA - Yet another Faster Flash Prefill Attention with O(1)🎉SRAM complexity for headdim > 256, **1.8x~3x**🎉faster than SDPA EA: [📈L20 
~1.9x↑🎉](https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002Fffpa-attn?tab=readme-ov-file#L1-bench-l20), [📈 A30 ~1.8x↑🎉](https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002Fffpa-attn?tab=readme-ov-file#L1-bench-a30), [📈3080 ~2.9x↑🎉](https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002Fffpa-attn?tab=readme-ov-file#L1-bench-3080), [📈4090 ~2.1x↑🎉](https:\u002F\u002Fgithub.com\u002Fxlite-dev\u002Fffpa-attn?tab=readme-ov-file#L1-bench-4090).\n\n### 📚 Triton Kernel (OpenAI Triton) ⭐️⭐️⭐️ ([©️back👆🏻](#cuda-kernel))\n\n\u003Cdiv id=\"triton-kernel\">\u003C\u002Fdiv>\n\n|📖 Triton Kernel| 📖 Elem DType| 📖 Acc DType| 📖 Docs | 📖 Level |\n|:---|:---|:---|:---|:---|\n| ✔️ [triton_vector_add_kernel](.\u002Fkernels\u002Fopenai-triton\u002Fvector-add\u002F)|all|all|[link](.\u002Fkernels\u002Fopenai-triton\u002Fvector-add\u002F)|⭐️⭐️|\n| ✔️ [triton_fused_softmax(multi-stages)](.\u002Fkernels\u002Fopenai-triton\u002Ffused-softmax\u002F)|f16\u002Fbf16\u002Ff32|f32|[link](.\u002Fkernels\u002Fopenai-triton\u002Ffused-softmax\u002F)|⭐️⭐️⭐️|\n| ✔️ [triton_fused_layer_norm(forward-pass)](.\u002Fkernels\u002Fopenai-triton\u002Flayer-norm\u002F)|f16\u002Fbf16\u002Ff32|f32|[link](.\u002Fkernels\u002Fopenai-triton\u002Flayer-norm\u002F)|⭐️⭐️⭐️|\n| ✔️ [triton_fused_layer_norm(backward-pass)](.\u002Fkernels\u002Fopenai-triton\u002Flayer-norm\u002F)|f16\u002Fbf16\u002Ff32|f32|[link](.\u002Fkernels\u002Fopenai-triton\u002Flayer-norm\u002F)|⭐️⭐️⭐️|\n| ✔️ [triton_merge_attn_states_kernel(w\u002F CUDA)](.\u002Fkernels\u002Fopenai-triton\u002Fmerge-attn-states\u002F)|f16\u002Fbf16\u002Ff32|f32|[link](.\u002Fkernels\u002Fopenai-triton\u002Fmerge-attn-states\u002F)|⭐️⭐️⭐️|\n\n### 📚 CUTLASS\u002FCuTe Kernel ⭐️⭐️⭐️ ([©️back👆🏻](#cuda-kernel))\n\n\u003Cdiv id=\"cutlass-kernel\">\u003C\u002Fdiv>\n\n|📖 CUTLASS\u002FCuTe Kernel| 📖 Elem DType| 📖 Acc DType| 📖 Docs | 📖 Level |\n|:---|:---|:---|:---|:---|\n| ✔️ 
[mat_transpose_cute](.\u002Fkernels\u002Fmat-transpose\u002Fmat_transpose_cute.cu)|f32|\u002F|[link](.\u002Fkernels\u002Fmat-transpose\u002F)|⭐️⭐️|\n| ✔️ [flash_attn_cute(naive)](.\u002Fkernels\u002Fflash-attn\u002Fcutlass\u002Fflash_attn_cute.cu)|f16|f32|[link](.\u002Fkernels\u002Fflash-attn\u002F)|⭐️⭐️⭐️|\n| ✔️ [hgemv_f16_cute_kernel](.\u002Fkernels\u002Fhgemv\u002Fhgemv_cute.cu)|f16|f16|[link](.\u002Fkernels\u002Fhgemv\u002F)|⭐️⭐️⭐️|\n| ✔️ [hgemv_f16x8_cute_kernel](.\u002Fkernels\u002Fhgemv\u002Fhgemv_cute.cu)|f16|f16|[link](.\u002Fkernels\u002Fhgemv\u002F)|⭐️⭐️⭐️|\n| ✔️ [hgemv_tensor_core_cute_kernel](.\u002Fkernels\u002Fhgemv\u002Fhgemv_cute.cu)|f16|f16|[link](.\u002Fkernels\u002Fhgemv\u002F)|⭐️⭐️⭐️|\n| ✔️ [hgemm_mma_stages_swizzle{smem}...cute*](.\u002Fkernels\u002Fhgemm\u002Fcutlass\u002Fhgemm_mma_stage_tn_cute.cu)|f16|f16|[link](.\u002Fkernels\u002Fhgemm\u002F)|⭐️⭐️⭐️|\n| ✔️ [ws_hgemm_naive_cute_kernel](.\u002Fkernels\u002Fws-hgemm\u002Fnaive_ws_hgemm_sm8x.cu)|f16|f16|[link](.\u002Fkernels\u002Fws-hgemm\u002F)|⭐️⭐️⭐️|\n\n## 📖 100+ High-Performance Computing and Distributed Systems Tech Blogs\n\n\u003Cdiv id=\"my-blogs-part-1\">\u003C\u002Fdiv>\n\n### 📚 High-Performance Computing and Distributed Systems: Personal Tech Column ([©️back👆🏻](#contents))\n\n|📖 Type-Title|📖 Author| 📖 Recommended |\n|:---|:---|:---|\n| [[Diffusion推理]📖简短的2025年总结，写在Cache-DiT v1.2.1之际](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F2001692370358539662)|@DefTruth|⭐️⭐️|\n| [[Diffusion推理]📖CacheDiT支持Z-Image分布式推理和缓存加速​​](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F1978490962742374735)|@DefTruth|⭐️⭐️|\n| [[Diffusion推理]📖cache-dit支持FLUX.2分布式推理和Cache](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F1977698505834379041)|@DefTruth|⭐️⭐️|\n| [[Diffusion推理]📖Cache加速-FoCa公式理解记录](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F1952056591068144338)|@DefTruth|⭐️⭐️⭐|\n| [[Diffusion推理]📖cache-dit: BlockAdapter支持HunyuanImage-2.1 Cache加速!](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F1950849526400263083)|@DefTruth|⭐️⭐️⭐|\n| [[Diffusion推理]📖cache-dit + Qwen-Image-Lightning 实现 3.5 steps 
推理!](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F1948696529180295613)|@DefTruth|⭐️⭐️⭐|\n| [[Diffusion推理]📖cache-dit: Wan2.2-MoE 2.4x 推理加速!](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F1943976514321380955)|@DefTruth|⭐️⭐️⭐|\n| [[Diffusion推理]📖cache-dit: Qwen-Image-Edit 2x 无损加速!](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F1941503245764792443)|@DefTruth|⭐️⭐️⭐|\n| [[Diffusion推理]📖cache-dit: Qwen-Image 1.5x 无损加速!](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F1938547315221705644)|@DefTruth|⭐️⭐️⭐|\n| [[Diffusion推理]📖Cache加速-TaylorSeer算法简析](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F1937477466475197176)|@DefTruth|⭐️⭐️⭐|\n| [[Diffusion推理]📖DiT推理加速综述: Caching](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F711223667)|@DefTruth|⭐️⭐️⭐|\n| [[Triton编程][基础]📖Triton极简入门: Triton Vector Add](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F1902778199261291694)|@DefTruth|⭐️⭐️⭐|\n| [[Triton编程][基础]📖Triton Fused Softmax Kernel详解: 从Python源码到PTX](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F1899562146477609112)|@DefTruth|⭐️⭐️⭐|\n| [[Triton编程][基础]📖vLLM Triton Merge Attention States Kernel详解](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F1904937907703243110)|@DefTruth|⭐️⭐️⭐|\n| [[Triton编程][进阶]📖vLLM Prefix Prefill Triton Kernel图解](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F695799736)|@DefTruth|⭐️⭐️⭐️|\n| [[张量\u002F序列并行]📖序列并行: BPT、Ring-Attention及Striped-Attention笔记](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F6456708235)|@DefTruth|⭐️⭐️⭐|\n| [[vLLM实践][算子]📖vLLM算子开发流程：”保姆级“详细记录](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F1892966682634473987)|@DefTruth|⭐️⭐️⭐|\n| [[vLLM实践][万字]📖vLLM + DeepSeek-R1 671B 多机部署及修Bug笔记](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F29950052712)|@DefTruth|⭐️⭐️⭐|\n| [[Attention优化]📖FFPA(Split-D): FA2无限HeadDim扩展，2x↑🎉 vs SDPA EA](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F13975660308)|@DefTruth|⭐️⭐️⭐️|\n| [[CUDA基础][开篇]📖LeetCUDA: v3.0 
大升级-面试刷题不迷路](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F19862356369)|@DefTruth|⭐️⭐️⭐⭐️|\n| [[分布式训推][张量\u002F序列并行]📖图解DeepSpeed-Ulysses&Megatron-LM TP\u002FSP](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F5750410146)|@DefTruth|⭐️⭐️|\n| [[VLM推理优化][InternVL系列]📖InternLM2\u002F...\u002FInternVL1.5系列笔记: 核心点解析](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F702481058)|@DefTruth|⭐️⭐️|\n| [[LLM推理优化][TensorRT-LLM][5w字]📖TensorRT-LLM部署调优-指北](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F699333691)|@DefTruth|⭐️⭐️⭐️|\n| [[LLM推理优化][KV Cache优化]📖GQA\u002FYOCO\u002FCLA\u002FMLKV: 层内和层间KV Cache共享](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F697311739)|@DefTruth|⭐️⭐️|\n| [[LLM推理优化][Prefill优化][万字]📖图解vLLM Automatic Prefix Caching: TTFT优化](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F693556044)|@DefTruth|⭐️⭐️⭐️|\n| [[LLM推理优化][Attention优化]📖图解:从Online-Softmax到FlashAttention V1\u002FV2\u002FV3](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F668888063)|@DefTruth|⭐️⭐️⭐️|\n| [[LLM推理优化][Decoding优化]📖原理&图解FlashDecoding\u002FFlashDecoding++](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F696075602)|@DefTruth|⭐️⭐️|\n| [[VLM推理优化][LLaVA系列]📖CLIP\u002FLLaVA\u002FLLaVA1.5\u002FVILA笔记: 核心点解析](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F683137074)|@DefTruth|⭐️⭐️|\n| [[LLM推理优化][Attention优化][万字]📖TensorRT MHA\u002FMyelin vs FlashAttention-2](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F678873216)|@DefTruth|⭐️⭐️⭐️|\n| [[LLM推理优化][PTX汇编]📖CUDA 12 PTX汇编: PRMT指令详解-通用模式](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F660630414)|@DefTruth|⭐️|\n| [[LLM推理优化][PTX汇编]📖CUDA 12 PTX汇编: LOP3指令详解](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F659741469)|@DefTruth|⭐️|\n| [[LLM推理优化][CUDA][3w字]📖高频面试题汇总-大模型手撕CUDA](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F678903537)|@DefTruth|⭐️⭐️⭐️|\n| [[LLM推理优化][Weight Only]📖WINT8\u002F4-(00): 通俗易懂讲解-快速反量化算法](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F657072856)|@DefTruth|⭐️⭐️|\n| [[LLM推理优化][Weight Only]📖WINT8\u002F4-(01): 
PRMT指令详解及FT源码解析](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F657070837)|@DefTruth|⭐️⭐️|\n| [[LLM推理优化][Weight Only]📖WINT8\u002F4-(02): 快速反量化之INT8转BF16](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F657073159)|@DefTruth|⭐️⭐️|\n| [[LLM推理优化][Weight Only]📖WINT8\u002F4-(03): LOP3指令详解及INT4转FP16\u002FBF16](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F657073857)|@DefTruth|⭐️⭐️|\n| [[LLM推理优化][LLM Infra整理]📖100+篇: 大模型推理各方向新发展整理](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F693680304)|@DefTruth|⭐️⭐️|\n| [[LLM推理优化][LLM Infra整理]📖30+篇: LLM推理论文集-500页PDF](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F669777159)|@DefTruth|⭐️⭐️|\n| [[LLM推理优化][LLM Infra整理]📖FlashDecoding++: 比FlashDecoding还要快！](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F665022589)|@DefTruth|⭐️|\n| [[LLM推理优化][LLM Infra整理]📖TensorRT-LLM开源，TensorRT 9.1也来了](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F662361469)|@DefTruth|⭐️|\n| [[LLM推理优化][LLM Infra整理]📖20+篇: LLM推理论文集-300页PDF](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F658091768)|@DefTruth|⭐️⭐️|\n| [[LLM推理优化][LLM Infra整理]📖PagedAttention论文新鲜出炉](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F617015570)|@DefTruth|⭐️|\n| [[推理部署][CV\u002FNLP]📖FastDeploy三行代码搞定150+ CV、NLP模型部署](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F581326442)|@DefTruth|⭐️|\n| [[推理部署][CV]📖如何在lite.ai.toolkit(3.6k+ stars)中增加您的模型？](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F523876625)|@DefTruth|⭐️⭐️|\n| [[推理部署][CV]📖美团 YOLOv6 ORT\u002FMNN\u002FTNN\u002FNCNN C++推理部署](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F533643238)|@DefTruth|⭐️⭐️|\n| [[推理部署][ONNX]📖ONNX推理加速技术文档-杂记](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F524023964)|@DefTruth|⭐️|\n| [[推理部署][TensorFlow]📖Mac源码编译TensorFlow C++指北](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F524013615)|@DefTruth|⭐️|\n| [[推理部署][CV]📖1Mb!头部姿态估计: FSANet，一个小而美的模型(C++)](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F447364201)|@DefTruth|⭐️|\n| 
[[推理部署][CV]📖opencv+ffmpeg编译打包全解指南](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F472115312)|@DefTruth|⭐️⭐️|\n| [[推理部署][CV]📖RobustVideoMatting视频抠图静态ONNX模型转换](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F459088407)|@DefTruth|⭐️|\n| [[推理部署][CV]📖190Kb!SSRNet年龄检测详细解读（含C++工程）](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F462762797)|@DefTruth|⭐️|\n| [[推理部署][CV]📖MGMatting(CVPR2021)人像抠图C++应用记录](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F464732042)|@DefTruth|⭐️|\n| [[推理部署][CV]📖超准确人脸检测(带关键点)YOLO5Face C++工程详细记录](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F461878005)|@DefTruth|⭐️⭐️|\n| [[推理部署][ORT]📖解决: ONNXRuntime(Python) GPU 部署配置记录](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F457484536)|@DefTruth|⭐️|\n| [[推理部署][CV]📖记录SCRFD(CVPR2021)人脸检测C++工程化(含docker镜像)](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F455165568)|@DefTruth|⭐️⭐️|\n| [[推理部署][NCNN]📖野路子：记录一个解决onnx转ncnn时op不支持的trick](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F451446147)|@DefTruth|⭐️|\n| [[推理部署][CV]📖升级版NanoDet-Plus MNN\u002FTNN\u002FNCNN\u002FORT C++工程记录](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F450586647)|@DefTruth|⭐️⭐️|\n| [[推理部署][CV]📖超轻量级NanoDet MNN\u002FTNN\u002FNCNN\u002FORT C++工程记录](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F443419387)|@DefTruth|⭐️|\n| [[推理部署][CV]📖详细记录MGMatting之MNN、TNN和ORT C++移植](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F442949027)|@DefTruth|⭐️⭐️|\n| [[推理部署][CV]📖YOLOX NCNN\u002FMNN\u002FTNN\u002FONNXRuntime C++工程简记](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F447364122)|@DefTruth|⭐️|\n| [[推理部署][TNN]📖手动修改YoloX的tnnproto记录-TNN](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F425668734)|@DefTruth|⭐️|\n| [[推理部署][ORT]📖全网最详细 ONNXRuntime C++\u002FJava\u002FPython 资料！](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F414317269)|@DefTruth|⭐️|\n| [[推理部署][CV]📖RobustVideoMatting: C++工程化记录-实现篇](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F413280488)|@DefTruth|⭐️⭐️|\n| [[推理部署][CV]📖RobustVideoMatting: 
C++工程化记录-应用篇](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F412491918)|@DefTruth|⭐️⭐️|\n| [[推理部署][ORT]📖ONNXRuntime C++ CMake 工程分析及编译](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F411887386)|@DefTruth|⭐️⭐️|\n| [[推理部署][ORT]📖如何使用ORT C++ API处理NCHW和NHWC输入？](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F524230808)|@DefTruth|⭐️|\n| [[推理部署][TNN]📖tnn-convert搭建简记-YOLOP转TNN](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F431418709)|@DefTruth|⭐️|\n| [[推理部署][CV]📖YOLOP ONNXRuntime C++工程化记录](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F411651933)|@DefTruth|⭐️⭐️|\n| [[推理部署][NCNN]📖超有用NCNN参考资料整理](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F449765328)|@DefTruth|⭐️|\n| [[推理部署][MNN]📖超有用MNN参考资料整理](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F449761992)|@DefTruth|⭐️|\n| [[推理部署][TNN]📖超有用TNN参考资料整理](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F449769615)|@DefTruth|⭐️|\n| [[推理部署][ONNX]📖超有用ONNX参考资料整理](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F449773663)|@DefTruth|⭐️|\n| [[推理部署][ONNX]📖超有用ONNX模型结构参考资料整理](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F449775926)|@DefTruth|⭐️|\n| [[推理部署][OpenCV-DNN]📖超有用OpenCV-DNN参考资料整理](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F449778377)|@DefTruth|⭐️|\n| [[推理部署][Tensorflow]📖超有用Tensorflow C++工程化知识点](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F449788027)|@DefTruth|⭐️|\n| [[推理部署][模型转换]📖深度学习模型转换资料整理](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F449759361)|@DefTruth|⭐️|\n| [[技术随笔][C++][CMake]📖超有用CMake参考资料整理](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F449779892)|@DefTruth|⭐️⭐️|\n| [[技术随笔][C++][3W字]📖静态链接和静态库实践指北-原理篇](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F595527528)|@DefTruth|⭐️⭐️⭐️|\n| [[技术随笔][C++]📖Mac下C++内存检查指北(Valgrind VS Asan)](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F508470880)|@DefTruth|⭐️|\n| [[技术随笔][CV]📖torchlm: 人脸关键点检测库](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F467211561)|@DefTruth|⭐️⭐️|\n| [[技术随笔][ML]📖《统计学习方法-李航: 
笔记-从原理到实现-基于R》](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F684885595)|@DefTruth|⭐️⭐️|\n| [[技术随笔][Git]📖如何优雅地git clone和git submodule？](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F639136221)|@DefTruth|⭐️|\n| [[技术随笔][3D]📖人脸重建3D参考资料整理](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F524034741)|@DefTruth|⭐️|\n| [[技术随笔][3D]📖BlendShapes参考资料整理](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F524036145)|@DefTruth|⭐️|\n| [[技术随笔][3D]📖从源码安装Pytorch3D详细记录及学习资料](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F512347464)|@DefTruth|⭐️|\n| [[技术随笔][ML]📖200页:《统计学习方法：李航》笔记 -从原理到实现](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F461520847)|@DefTruth|⭐️⭐️|\n\n### 📚 High-Performance Computing and Distributed Systems: Recommended Tech Blogs ([©️back👆🏻](#contents))\n\n\u003Cdiv id=\"other-blogs\">\u003C\u002Fdiv>\n\n💡NOTE: This section collects some articles I personally like. PRs recommending more great articles are welcome!\n\n|📖 Type-Title|📖 Author| 📖 Recommended |\n|:---|:---|:---|\n| [[cute系列详解][入门]📖cutlass cute 101](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F660379052)|@朱小霖|⭐️⭐️⭐️|\n| [[cute系列详解][入门]📖CUTLASS 2.x & CUTLASS 3.x Intro 学习笔记](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F710516489)|@BBuf|⭐️⭐️⭐️|\n| [[cute系列详解][入门]📖写给大家看的 CuTe 教程：tiled copy](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F1930389542784964333)|@竹熙佳处|⭐️⭐️⭐️|\n| [[cute系列详解][入门]📖写给大家看的 CuTe 教程：tiled mma](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F1937145378446226159)|@竹熙佳处|⭐️⭐️⭐️|\n| [[cute系列详解][入门]📖写给大家看的 CuTe 教程：Layout Compose & Inverse](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F1962625273636845008)|@竹熙佳处|⭐️⭐️⭐️|\n| [[cute系列详解][入门]📖写给大家看的 CuTe 教程: Layout Product & Divide](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F1971945267294111573)|@竹熙佳处|⭐️⭐️⭐️|\n| [[cute系列详解][入门]📖写给大家看的 CuTe 教程：TMA Copy](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F2003198909405763007)|@竹熙佳处|⭐️⭐️⭐️|\n| [[cute系列详解][入门]📖写给进阶开发的 CuTe 笔记：permutationMNK 参数](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F1973526710105419953)|@竹熙佳处|⭐️⭐️⭐️|\n| [[cute系列详解][Layout]📖cute 之 
Layout](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F661182311)|@reed|⭐️⭐️⭐️|\n| [[cute系列详解][Layout]📖cute Layout 的代数和几何解释](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F662089556)|@reed|⭐️⭐️⭐️|\n| [[cute系列详解][Tensor]📖cute 之 Tensor](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F663093816)|@reed|⭐️⭐️⭐️|\n| [[cute系列详解][MMA]📖cute 之 MMA抽象](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F663092747)|@reed|⭐️⭐️⭐️|\n| [[cute系列详解][Copy]📖cute 之 Copy抽象](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F666232173)|@reed|⭐️⭐️⭐️|\n| [[cute系列详解][Swizzle]📖cute 之 Swizzle](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F671419093)|@reed|⭐️⭐️⭐️|\n| [[cute系列详解][Swizzle]📖cute Swizzle细谈](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F684250988)|@进击的Killua|⭐️⭐️⭐️|\n| [[cute系列详解][Swizzle]📖cutlass swizzle机制解析（一）](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F710337546)|@Titus|⭐️⭐️⭐️|\n| [[cute系列详解][Swizzle]📖cutlass swizzle机制解析（二）](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F711398930)|@Titus|⭐️⭐️⭐️|\n| [[cute系列详解][Swizzle]📖CUDA避免smem bank conflict的swizzle机制解析](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F4746910252)|@frankshi|⭐️⭐️⭐️|\n| [[cute系列详解][Swizzle]📖布局代数实战：Swizzle自动推导](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F1941306442683515068)|@melonedo|⭐️⭐️⭐️|\n| [[cute系列详解][GEMM]📖cute 之 简单GEMM实现](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F667521327)|@reed|⭐️⭐️⭐️|\n| [[cute系列详解][GEMM]📖cute 之 GEMM流水线](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F665082713)|@reed|⭐️⭐️⭐️|\n| [[cute系列详解][GEMM]📖cute 之 高效GEMM实现](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F675308830)|@reed|⭐️⭐️⭐️|\n| [[cute系列详解][GEMM]📖GEMM流水线: single\u002Fmulti-stage、pipeline](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F712451053)|@Titus|⭐️⭐️⭐️|\n| [[cute系列详解][GEMM]📖GEMM细节分析(一): ldmatrix的选择](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F702818267)|@Anonymous|⭐️⭐️⭐️|\n| [[cute系列详解][GEMM]📖GEMM细节分析(二): 
TiledCopy与cp.async](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F703560147)|@Anonymous|⭐️⭐️⭐️|\n| [[cute系列详解][GEMM]📖GEMM细节分析(三): Swizzle\u003CB,M,S>参数取值](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F713713957)|@Anonymous|⭐️⭐️⭐️|\n| [[cute系列详解][实践]📖Hopper Mixed GEMM的CUTLASS实现笔记](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F714378343)|@BBuf|⭐️⭐️⭐️|\n| [[cute系列详解][实践]📖CUTLASS CuTe实战(一): 基础](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F690703999)|@进击的Killua|⭐️⭐️⭐️|\n| [[cute系列详解][实践]📖CUTLASS CuTe实战(二): 应用](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F692078624)|@进击的Killua|⭐️⭐️⭐️|\n| [[cute系列详解][实践]📖FlashAttention fp8实现（ada架构)](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F712314257)|@shengying.wei|⭐️⭐️⭐️|\n| [[cute系列详解][实践]📖FlashAttention 笔记: tiny-flash-attention解读](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F708867810)|@shengying.wei|⭐️⭐️⭐️|\n| [[cute系列详解][实践]📖使用cutlass cute复现flash attention](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F696323042)|@66RING|⭐️⭐️⭐️|\n| [[cutlass教程][入门]📖cutlass 基本认知](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F677616101)|@JoeNomad|⭐️⭐️⭐️|\n| [[cutlass教程][入门]📖cutlass 软件架构](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F678915618)|@JoeNomad|⭐️⭐️⭐️|\n| [[cutlass教程][入门]📖CUTLASS 基础介绍](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F671324125)|@进击的Killua|⭐️⭐️⭐️|\n| [[cutlass教程][入门]📖乱谈CUTLASS GTC2020 SLIDES](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F674693873)|@zzk again|⭐️⭐️⭐️|\n| [[cutlass教程][深入]📖cutlass block swizzle 和 tile iterator](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F679929705)|@JoeNomad|⭐️⭐️⭐️|\n| [[cutlass教程][深入]📖cutlass bank conflict free的smem layout](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F681966685)|@JoeNomad|⭐️⭐️⭐️|\n| [[cutlass教程][深入]📖cutlass 多级流水线](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F687397095)|@JoeNomad|⭐️⭐️⭐️|\n| [[GPU指令集架构][精解]📖NVidia GPU指令集架构-前言](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F686198447)|@reed|⭐️⭐️⭐️|\n| [[GPU指令集架构][精解]📖NVidia 
GPU指令集架构-寄存器](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F688616037)|@reed|⭐️⭐️⭐️|\n| [[GPU指令集架构][精解]📖NVidia GPU指令集架构-Load和Cache](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F692445145)|@reed|⭐️⭐️⭐️|\n| [[GPU指令集架构][精解]📖NVidia GPU指令集架构-浮点运算](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F695667044)|@reed|⭐️⭐️⭐️|\n| [[GPU指令集架构][精解]📖NVidia GPU指令集架构-整数运算](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F700921948)|@reed|⭐️⭐️⭐️|\n| [[GPU指令集架构][精解]📖NVidia GPU指令集架构-比特和逻辑操作](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F712356884)|@reed|⭐️⭐️⭐️|\n| [[GPU指令集架构][精解]📖NVidia GPU指令集架构-Warp级和Uniform操作](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F712357647)|@reed|⭐️⭐️⭐️|\n| [[CUDA优化][入门]📖CUDA 入门的正确姿势：how-to-optimize-gemm](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F478846788)|@白牛|⭐️⭐️⭐️|\n| [[CUDA优化][入门]📖CUDA（一）：CUDA 编程基础](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F645330027)|@紫气东来|⭐️⭐️⭐️|\n| [[CUDA优化][入门]📖CUDA（二）：GPU的内存体系及其优化指南](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F654027980)|@紫气东来|⭐️⭐️⭐️|\n| [[CUDA优化][实践]📖CUDA（三）：通用矩阵乘法：从入门到熟练](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F657632577)|@紫气东来|⭐️⭐️⭐️|\n| [[CUDA优化][实践]📖ops(1)：LayerNorm 算子的 CUDA 实现与优化](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F694974164)|@紫气东来|⭐️⭐️⭐️|\n| [[CUDA优化][实践]📖ops(2)：SoftMax算子的 CUDA 实现](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F695307283)|@紫气东来|⭐️⭐️⭐️|\n| [[CUDA优化][实践]📖ops(3)：Cross Entropy 的 CUDA 实现](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F695594396)|@紫气东来|⭐️⭐️⭐️|\n| [[CUDA优化][实践]📖ops(4)：AdamW 优化器的 CUDA 实现](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F695611950)|@紫气东来|⭐️⭐️⭐️|\n| [[CUDA优化][实践]📖ops(5)：激活函数与残差连接的 CUDA 实现](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F695703671)|@紫气东来|⭐️⭐️⭐️|\n| [[CUDA优化][实践]📖ops(6)：embedding 层与 LM head 层的 CUDA 实现](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F695785781)|@紫气东来|⭐️⭐️⭐️|\n| [[CUDA优化][实践]📖ops(7)：self-attention 的 CUDA 实现及优化 
(上)](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F695898274)|@紫气东来|⭐️⭐️⭐️|\n| [[CUDA优化][实践]📖ops(8)：self-attention 的 CUDA 实现及优化 (下)](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F696197013)|@紫气东来|⭐️⭐️⭐️|\n| [[CUDA优化][实践]📖CUDA（四）：使用 CUDA 实现 Transformer 结构](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F694416583)|@紫气东来|⭐️⭐️⭐️|\n| [[CUDA优化][Copy]📖Async Copy及Memory Barrier指令的功能与实现](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F685168850)|@Frank Wang|⭐️⭐️⭐️|\n| [[CUDA优化][GEMV]📖深入浅出GPU优化系列：gemv优化](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F494144694)|@有了琦琦的棍子|⭐️⭐️⭐️|\n| [[CUDA优化][实践]📖CUDA element-wise 算子详解](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F1888630735520391519)|@懒蚂蚁呀不嘿|⭐️⭐️⭐️|\n| [[CUDA优化][实践]📖CUDA transpose 算子详解](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F1899760505733756129)|@懒蚂蚁呀不嘿|⭐️⭐️⭐️|\n| [[CUDA优化][实践]📖CUDA reduce 算子详解](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F1905661893739283464)|@懒蚂蚁呀不嘿|⭐️⭐️⭐️|\n| [[CUDA优化][实践]📖CUDA GEMM 算子详解](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F1910636263666610461)|@懒蚂蚁呀不嘿|⭐️⭐️⭐️|\n| [[Tensor Cores]📖Nvidia Tensor Core初探](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F620185229)|@木子知|⭐️⭐️⭐️|\n| [[Tensor Cores]📖Nvidia Tensor Core-WMMA API编程入门](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F620766588)|@木子知|⭐️⭐️⭐️|\n| [[Tensor Cores]📖Nvidia Tensor Core-MMA PTX编程入门](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F621855199)|@木子知|⭐️⭐️⭐️|\n| [[Tensor Cores]📖CUDA Ampere Tensor Core HGEMM 矩阵乘法优化](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F555339335)|@nicholaswilde|⭐️⭐️⭐️|\n| [[GPU通信架构][精解]📖NVIDIA GPGPU（四）- 通信架构](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F680262016)|@Bruce|⭐️⭐️⭐️|\n| [[torch.compile][原理]📖Torch.compile流程解析: 介绍](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F9418379234)|@StarCap|⭐️⭐️⭐️|\n| [[torch.compile][原理]📖Torch.compile流程解析: TorchDynamo](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F9640728231)|@StarCap|⭐️⭐️⭐️|\n| [[torch.compile][原理]📖Torch.compile流程解析: 
AOTAutograd](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F9997263922)|@StarCap|⭐️⭐️⭐️|\n| [[torch.compile][原理]📖Torch.compile流程解析: TorchInductor](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F11224299472)|@StarCap|⭐️⭐️⭐️|\n| [[torch.compile][原理]📖Torch.compile流程解析: 算子融合](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F21053905491)|@StarCap|⭐️⭐️⭐️|\n| [[torch.compile][实践]📖Torch.compile使用指南](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F620163218)|@jhang|⭐️⭐️⭐️|\n| [[torch.compile][实践]📖Torch.compile详细示例解析教程](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F855291863)|@Bbuf|⭐️⭐️⭐️|\n| [[torch.compile][原理]📖一文搞懂TorchDynamo原理](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F630933479)|@吾乃阿尔法|⭐️⭐️⭐️|\n| [[torch.compile][原理]📖理解torch.compile基本原理和使用方式](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F12712224407)|@俯仰|⭐️⭐️⭐️|\n\n## ©️License ([©️back👆🏻](#contents))\n\n\u003Cdiv id=\"License\">\u003C\u002Fdiv>\n\nGNU General Public License v3.0\n\n## 🎉Contribute ([©️back👆🏻](#contents))\n\n\u003Cdiv id=\"contribute\">\u003C\u002Fdiv>\n\nHow to contribute? 
Star this repo or check [🌤🌤CONTRIBUTE🎉🎉](.\u002FCONTRIBUTE.md).\n\n\u003Cdiv align='center'>\n\u003Ca href=\"https:\u002F\u002Fstar-history.com\u002F#xlite-dev\u002FLeetCUDA&Date\">\n \u003Cpicture>\n   \u003Csource media=\"(prefers-color-scheme: dark)\" srcset=\"https:\u002F\u002Fapi.star-history.com\u002Fsvg?repos=xlite-dev\u002FLeetCUDA&type=Date&theme=dark\" \u002F>\n   \u003Csource media=\"(prefers-color-scheme: light)\" srcset=\"https:\u002F\u002Fapi.star-history.com\u002Fsvg?repos=xlite-dev\u002FLeetCUDA&type=Date\" \u002F>\n   \u003Cimg width=400 height=300 alt=\"Star History Chart\" src=\"https:\u002F\u002Fapi.star-history.com\u002Fsvg?repos=xlite-dev\u002FLeetCUDA&type=Date\" \u002F>\n \u003C\u002Fpicture>\n\u003C\u002Fa>\n\u003C\u002Fdiv>\n\n## 📖 References ([©️back👆🏻](#contents))\n\u003Cdiv id=\"ref\">\u003C\u002Fdiv>\n\n- [flash-attention-minimal](https:\u002F\u002Fgithub.com\u002Ftspeterkim\u002Fflash-attention-minimal)\n- [tiny-flash-attention](https:\u002F\u002Fgithub.com\u002F66RING\u002Ftiny-flash-attention)\n- [cute-gemm](https:\u002F\u002Fgithub.com\u002Freed-lau\u002Fcute-gemm)\n- [cutlass_flash_atten_fp8](https:\u002F\u002Fgithub.com\u002Fweishengying\u002Fcutlass_flash_atten_fp8)\n- [cuda_learning](https:\u002F\u002Fgithub.com\u002Fifromeast\u002Fcuda_learning)\n- [cuda_hgemm](https:\u002F\u002Fgithub.com\u002FBruce-Lee-LY\u002Fcuda_hgemm)\n- [cuda-tensorcore-hgemm](https:\u002F\u002Fgithub.com\u002Fnicolaswilde\u002Fcuda-tensorcore-hgemm)\n- [How_to_optimize_in_GPU](https:\u002F\u002Fgithub.com\u002FLiu-xiandong\u002FHow_to_optimize_in_GPU\u002Ftree\u002Fmaster\u002Fsgemv)\n- [how-to-optim-algorithm-in-cuda](https:\u002F\u002Fgithub.com\u002FBBuf\u002Fhow-to-optim-algorithm-in-cuda)\n- [cute_gemm](https:\u002F\u002Fgithub.com\u002Fweishengying\u002Fcute_gemm)\n- [cutlass](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcutlass)\n","LeetCUDA is a collection of modern CUDA learning notes for beginners, used together with PyTorch. The project includes more than 200 CUDA kernel examples, Tensor Core support, an HGEMM implementation (reaching 
98%~100% of cuBLAS performance), and a Flash Attention implementation built on Tensor Cores. Detailed code and documentation help users quickly pick up CUDA programming techniques and their application to deep learning. LeetCUDA suits researchers and developers who want a deeper understanding of GPU parallel computing and want to use NVIDIA hardware to accelerate training and inference of machine learning models.",2,"2026-06-11 03:43:50","high_star"]