[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-1527":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":16,"stars7d":17,"stars30d":18,"stars90d":15,"forks30d":15,"starsTrendScore":19,"compositeScore":20,"rankGlobal":9,"rankLanguage":9,"license":21,"archived":22,"fork":22,"defaultBranch":23,"hasWiki":24,"hasPages":22,"topics":25,"createdAt":9,"pushedAt":9,"updatedAt":26,"readmeContent":27,"aiSummary":28,"trendingCount":15,"starSnapshotCount":15,"syncStatus":29,"lastSyncTime":30,"discoverSource":31},1527,"sass-king","florianmattana\u002Fsass-king","florianmattana","Reverse engineering NVIDIA SASS  instruction dictionary, kernel audits and pattern recognition across GPU architectures.",null,"Cuda",293,14,7,6,0,17,23,74,51,82.43,"Apache License 2.0",false,"main",true,[],"2026-06-12 04:00:10","\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002Fsass-king-logo.svg\" alt=\"SASS King logo\" width=\"220\">\n\u003C\u002Fp>\n\n\u003Ch1 align=\"center\">SASS King\u003C\u002Fh1>\n\n\u003Cp align=\"center\">\n  \u003Cstrong>Reverse engineering NVIDIA SASS from controlled kernels to production audits.\u003C\u002Fstrong>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  \u003Ca href=\"knowledge\u002FREADME.md\">Knowledge base\u003C\u002Fa> ·\n  \u003Ca href=\"knowledge\u002FSASS_INSTRUCTIONS_SM120.md\">SM120 instruction glossary\u003C\u002Fa> ·\n  \u003Ca href=\"knowledge\u002Fencoding\u002F\">Encoding notes\u003C\u002Fa> ·\n  \u003Ca href=\"tensor_cores\u002FREADME.md\">Tensor-core chapters\u003C\u002Fa> ·\n  \u003Ca href=\"CONTRIBUTING.md\">Contributing\u003C\u002Fa>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  \u003Cimg alt=\"Architecture\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Farchitecture-SM120%20%2F%20SM120a-0b6d55\">\n  \u003Cimg 
alt=\"Status\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fstatus-research%20knowledge%20base-f4c95d\">\n  \u003Cimg alt=\"License\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Flicense-Apache--2.0-blue\">\n\u003C\u002Fp>\n\nSASS King is a systematic reverse-engineering project for NVIDIA SASS, the native GPU instruction set emitted inside compiled CUDA binaries. The project starts with SM120 \u002F SM120a consumer Blackwell hardware and expands toward a full cross-architecture ISA and pattern library over time.\n\nThe goal is practical: help a kernel engineer open a SASS dump, recognize compiler patterns, identify performance-relevant structures, and connect the binary back to source-level optimization decisions.\n\n## Why It Exists\n\nThe last broad public SASS reverse-engineering work comparable in spirit was Jia et al. on Volta and Turing in 2018. Ampere, Hopper, and Blackwell have changed the instruction mix substantially: async copy paths, tensor-core families, matrix load\u002Fstore instructions, sparse and scaled MMA forms, and new uniform-register flows.\n\nSASS King fills that gap by combining controlled micro-kernels, raw SASS reading, runtime probes, and production-kernel audits.\n\n## Current State\n\n| Area | Status | Where |\n|---|---|---|\n| SM120 teaching kernels | Complete through kernels 01-12 | `01_vector_add\u002F` to `12_register_spill\u002F` |\n| Tensor-core studies | Complete through Kernel 25 | `tensor_cores\u002F` |\n| Global findings | Active source of truth | `knowledge\u002FFINDINGS.md` |\n| SM120 instruction glossary | Active, evidence-backed | `knowledge\u002FSASS_INSTRUCTIONS_SM120.md` |\n| Encoding pilots | Started with `LDSM`, `STSM`, `QMMA` | `knowledge\u002Fencoding\u002F` |\n| Pattern library | Next phase | `patterns\u002F` |\n| Production audits | Planned | `production\u002F` |\n\n## Start Here\n\n- New to the project: read the [knowledge base index](knowledge\u002FREADME.md).\n- Want the current 
instruction map: read [SASS instructions on SM120 \u002F SM120a](knowledge\u002FSASS_INSTRUCTIONS_SM120.md).\n- Want the raw source of truth: read [findings](knowledge\u002FFINDINGS.md).\n- Want tensor-core evidence: start with [tensor-core chapters](tensor_cores\u002FREADME.md).\n- Want to contribute dumps or corrections: read [contributing](CONTRIBUTING.md).\n\nFull context for the first public writeup: [Part 1 - Reading NVIDIA SASS from First Principles](https:\u002F\u002Fflorianmattana.com\u002Fp\u002Freading-nvidia-sass-from-first-principles).\n\n## Methodology\n\n**Controlled variation.** Two kernels differ by exactly one variable: dtype, operand order, unroll factor, memory layout, or compilation target. The SASS diff isolates the compiler decision.\n\n**Strict claim tags.** Every technical claim uses a tag:\n\n| Tag | Meaning |\n|---|---|\n| `[OBS]` | Directly observed in a dump, log, runtime output, or profile. |\n| `[INF]` | Inferred from observed evidence. |\n| `[HYP]` | Plausible but not confirmed. |\n| `[RES]` | A prior hypothesis resolved by later evidence. |\n| `[GAP]` | Open question documented explicitly. |\n\n**Top-down and bottom-up together.** Micro-kernels isolate individual instructions and compiler decisions. Production-like kernels show which patterns matter in real code.\n\n## What Is Covered\n\nThe first pass focuses on the SM120 tensor-core and memory pipeline:\n\n- `HMMA`, `QMMA`, `OMMA`\n- `LDSM`, `STSM`\n- `LDGSTS`, `LDGDEPBAR`, `DEPBAR`\n- `LDG`, `STG`, `LDS`, `STS`, `REDG`\n- `BRA`, `EXIT`, `BSSY`, `BSYNC`, `WARPSYNC`\n- `SHFL`, `VOTE`, `REDUX`\n- uniform-register flow: `S2UR`, `R2UR`, `UMOV`, `ULEA`, `LDCU`\n\nThe project does not pretend the ISA is complete yet. 
The public glossary tracks what is observed and explained; deeper pages under `knowledge\u002Fencoding\u002F` track families with enough evidence for matcher-style documentation.\n\n## Roadmap\n\n### Phase 1 - Teaching Kernels\n\nKernels 01-12 establish baseline SASS concepts: FMA fusion, scoreboard behavior, loop lowering, shared memory, global memory, warp primitives, slow-path math, and local-memory spills.\n\n### Phase 2 - Tensor-Core And SM120 Coverage\n\nKernels 13-25 cover the current SM120 tensor-core path:\n\n| Kernel | Topic |\n|---|---|\n| 13 | HMMA baseline, register allocation, accumulator chaining |\n| 14 | QMMA FP8 \u002F FP6 \u002F FP4 baseline |\n| 15 | Narrow MMA variants |\n| 16 | FP4 peak and block-scaled OMMA\u002FQMMA |\n| 17 | LDSM and matrix-load behavior |\n| 18 | Pipelined MMA tile and async copy staging |\n| 19 | Sparse MMA metadata |\n| 20 | Control flow and back-edge detection |\n| 21 | Divergence and reconvergence |\n| 22 | STSM matrix-store behavior |\n| 23 | FP4 \u002F FP6 fragment layout probes |\n| 24 | Production mini-GEMM audit |\n| 25 | STSM epilogue layout and storeback semantics |\n\n### Phase 3 - Pattern Library\n\nFormalize recurring structures into reusable signatures:\n\n- `LDGSTS -> DEPBAR -> LDSM -> MMA`\n- chained `HMMA` \u002F `QMMA` \u002F `OMMA`\n- `STSM -> BAR -> LDS -> STG`\n- warp reductions and cross-lane collectives\n- register-spill signatures\n- scalar and uniform control-flow patterns\n\n### Phase 4 - Production Audits\n\nApply the pattern library to real kernels from libraries such as FlashAttention, CUTLASS, xFormers, Transformer Engine, FlashInfer, llama.cpp \u002F ggml, tinygrad, and related projects. 
The goal is representative coverage by algorithmic pattern, not one markdown file per kernel.\n\n### Phase 5 - Audit Tool\n\nBuild a pipeline that takes a cubin, detects known patterns, and emits an optimization-oriented report.\n\n### Phase 6 - Cross-Architecture\n\nReplay the methodology on additional targets:\n\n| Arch | Representative GPU | Why |\n|---|---|---|\n| SM80 | A100 | Datacenter Ampere baseline |\n| SM86 | RTX 3090 | Consumer Ampere corpus |\n| SM89 | RTX 4090 | Common consumer inference card |\n| SM90a | H100 | TMA, WGMMA, warp specialization, clusters |\n| SM100a | B200 | tcgen05.mma, TMEM |\n| SM120 | RTX 5070 Ti \u002F 5090 | Consumer Blackwell starting point |\n\n## Repository Map\n\n```text\n.\n├── 01_vector_add\u002F ... 12_register_spill\u002F   # Phase 1 teaching kernels\n├── tensor_cores\u002F                           # Phase 2 tensor-core studies\n├── knowledge\u002F                              # Findings, glossary, encoding notes\n│   ├── FINDINGS.md\n│   ├── SASS_INSTRUCTIONS_SM120.md\n│   └── encoding\u002F\n├── patterns\u002F                               # Coming: formal pattern library\n├── production\u002F                             # Coming: production-kernel audits\n└── guide\u002F                                  # SASS reading guide material\n```\n\nEach chapter folder contains source kernels, compiled artifacts when relevant, SASS dumps when they are part of the validated evidence set, and a `conclusion\u003CN>.md` writeup.\n\n## Tooling\n\n- `cuobjdump --dump-sass` for raw disassembly.\n- `gpuasm.com` for scoreboards, stalls, pressure, and dependency arrows.\n- Nsight Compute for profiling and stall attribution.\n- `%clock` microbenchmarks for instruction latency probes.\n- `nvcc -Xptxas -v` for register and spill metadata.\n\n## Related Work\n\n- [redplait\u002Fdenvdis](https:\u002F\u002Fgithub.com\u002Fredplait\u002Fdenvdis) for opcode tables, latency extraction, scheduling analysis, and cubin patching.\n- 
[kuterdinel.com\u002Fnv_isa](https:\u002F\u002Fkuterdinel.com\u002Fnv_isa\u002F) for a fuzzed ISA specification.\n- Jia et al. 2018, \"Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking.\"\n- NVIDIA `cuda-binary-utilities` documentation.\n\nSASS King operates at the algorithmic pattern layer: recognizing how compiled kernels are structured and connecting those structures to source-level optimization decisions.\n\n## Contributing\n\nContributions are welcome, especially:\n\n- raw SASS dumps from hardware not directly available here;\n- controlled kernel studies that isolate one compiler decision;\n- corrections to existing observations;\n- new production-kernel pattern proposals;\n- cross-architecture comparisons.\n\nSee [CONTRIBUTING.md](CONTRIBUTING.md) for the expected metadata and writing standard.\n\n## Author\n\nFlorian Mattana. [florianmattana.com](https:\u002F\u002Fflorianmattana.com)\n","SASS King is a reverse-engineering project for the NVIDIA SASS instruction set that aims to decode and understand the native GPU instructions inside compiled CUDA binaries. Through controlled kernels, direct SASS reading, runtime probes, and production-kernel audits, it gradually builds a cross-architecture instruction set and pattern library, and it currently focuses on the SM120\u002FSM120a architecture. Its core technical features include a detailed instruction glossary, encoding notes, and tensor-core study chapters, giving developers a knowledge base that runs from basic to advanced material. SASS King is well suited to researchers and engineers who need to understand GPU compiler behavior in depth, optimize performance-critical structures, or do low-level debugging.",2,"2026-06-11 02:44:29","CREATED_QUERY"]