[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-1863":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":16,"stars7d":16,"stars30d":17,"stars90d":15,"forks30d":15,"starsTrendScore":18,"compositeScore":19,"rankGlobal":9,"rankLanguage":9,"license":20,"archived":21,"fork":21,"defaultBranch":22,"hasWiki":23,"hasPages":21,"topics":24,"createdAt":9,"pushedAt":9,"updatedAt":25,"readmeContent":26,"aiSummary":27,"trendingCount":15,"starSnapshotCount":15,"syncStatus":28,"lastSyncTime":29,"discoverSource":30},1863,"cuda-optimized-skill","KernelFlow-ops\u002Fcuda-optimized-skill","KernelFlow-ops","A CUDA kernel optimization toolkit for validation, benchmarking, Nsight Compute profiling, bottleneck analysis, and iterative tuning. It helps improve custom GPU operators with reproducible workflows and evidence-based performance comparison.",null,"Python",177,17,152,1,0,3,24,9,53.67,"MIT License",false,"main",true,[],"2026-06-12 04:00:11","# cuda-kernel-optimizer\n\n**English** | [简体中文](README.zh-CN.md)\n\nA Claude skill that iteratively optimizes a CUDA \u002F CUTLASS \u002F Triton kernel against a Python reference, using `nsight-compute` (`ncu`) as the source of evidence for each optimization decision.\n\nThis is a **skill package**, not a standalone tool. Claude reads `SKILL.md` and drives the loop. The scripts under `scripts\u002F` handle the deterministic parts (environment detection, profiling, benchmarking, state).\n\n---\n\n![alt text](asset\u002Fv2_en_arch.png)\n\n## Usage\n```text\nUse this prompt in the agent:\n@cuda-kernel-optimizer use this skill to optimize \"the operator you want to optimize\" for N iterations.\n```\n\n## What's new in V2\n\nV2 upgrades the loop from \"try-and-log\" into \"try–attribute–verify–learn\". 
Four mechanisms are added on top of V1; everything below reflects V2 behavior:\n\n- **Roofline-driven axis budget** — instead of V1's fixed 1-method-per-axis, V2 computes per-iteration compute\u002Fmemory\u002Flatency gaps (Δc, Δm, Δl) and splits the 3-method budget proportionally (per-axis cap = 2). When all three gaps fall below 0.15 the loop early-stops with `near_peak: true`.\n- **Branch-and-Select exploration** — each iteration generates K branch candidates (default K=4) sharing the same methods but varying tile size, pipeline stages, warp count, and implementation variants. The fastest correct branch wins as champion; the rest are archived in `frontier`.\n- **Ablation-based attribution** — after the champion is picked, each method is ablated one at a time. `attribution(m) = ms_without_m − ms_champion` gives a per-method causal contribution instead of a single packed verdict.\n- **SASS instruction-level verification** — `cuobjdump --dump-sass` is grepped against a signature table (`sass_signatures.json`) to confirm each claimed optimization actually appears in the compiled machine code.\n\nThese together change method classification from two buckets (effective \u002F ineffective) to three: `effective_methods` (SASS ✓ and attribution > noise), `ineffective_methods` (SASS ✓ but attribution ≤ noise), and `implementation_failed_methods` (SASS ✗).\n\n## What you need\n\nOn the host where Claude runs:\n\n- A CUDA GPU with working drivers (`nvidia-smi` works)\n- `nvcc` in `$PATH` (for CUDA \u002F CUTLASS backends)\n- `ncu` in `$PATH` with permission to read perf counters — without it, the skill degrades to code-static reasoning only, which is significantly weaker\n- `cuobjdump` in `$PATH` (ships with the CUDA toolkit) — needed for V2's SASS verification step\n- Python 3.10+ with `torch` (CUDA build), `triton` if you want the Triton backend\n- For CUTLASS kernels: `$CUTLASS_PATH` or `$CUTLASS_INCLUDE_DIR` pointing at a tree with both `cutlass\u002F` and `cute\u002F` 
headers\n\n`benchmark.py` (the generic operator benchmark driver) is bundled at `scripts\u002Fbenchmark.py` — no separate installation needed.\n\n### `ncu` permission gotcha\n\nOn most cloud and container setups, profiling-counter access is disabled. You'll see it as `can_read_counters: false` in `env.json`. Fixes (pick one):\n\n- Run the host as root, or\n- Add `options nvidia NVreg_RestrictProfilingToAdminUsers=0` to `\u002Fetc\u002Fmodprobe.d\u002Fnvidia.conf` and reboot, or\n- For docker: `--cap-add=SYS_ADMIN` (Nsight docs recommend this)\n\n## What you give Claude\n\n1. **Baseline kernel file** — `gemm.cu` (CUDA\u002FCUTLASS) or `gemm.py` (Triton)\n2. **Reference file** — `ref.py` exposing `reference(**kwargs)` and optional `atol` \u002F `rtol`\n3. **Dims** — the scalar args the signature takes (e.g. `M=4096 N=4096 K=4096`)\n4. **Path to `benchmark.py`** — already bundled under `scripts\u002Fbenchmark.py`; `orchestrate.py` defaults to it. Pass `--benchmark \u003Cpath>` only if you have a custom version.\n5. 
Optional: iteration count `N` (default 3), `ncu_num` per-axis top-K (default 5), noise threshold (default 2%), **branches per iteration `K` (default 4, via `--branches`)**\n\n## What you get back\n\nA sibling directory of your baseline, `run_YYYYMMDD_HHMMSS\u002F`, containing:\n\n```text\nrun_YYYYMMDD_HHMMSS\u002F\n├── state.json                   # global state, re-readable across sessions\n│                                #   V2 adds: branches, implementation_failed_methods,\n│                                #            roofline_history, frontier\n├── env.json                     # GPU \u002F nvcc \u002F ncu \u002F CUTLASS snapshot\n├── baseline\u002F\n│   ├── \u003Cbaseline>               # copied verbatim\n│   └── bench.json               # seed timing + correctness\n├── iterv1\u002F\n│   ├── roofline.json            # Δc \u002F Δm \u002F Δl + per-axis budget allocation\n│   ├── methods.json             # methods picked under the budget (trigger_strength included)\n│   ├── analysis.md              # ncu metrics + CoT + risk notes\n│   ├── best_input.ncu-rep       # profile of what went IN\n│   ├── branches\u002F                # K branch candidates (same methods, different hyperparams)\n│   │   ├── b0\u002Fkernel.{cu,py} + bench.json\n│   │   ├── b1\u002F…\n│   │   └── …\n│   ├── kernel.{cu,py}           # champion kernel (fastest correct branch)\n│   ├── kernel.ncu-rep           # profile of the champion\n│   ├── ncu_top.json             # top-K metrics per axis (what Claude sees)\n│   ├── sass_check.json          # per-method SASS signature verification\n│   ├── ablations\u002F               # leave-one-out ablation runs\n│   │   ├── no_\u003Cmethod_a>\u002Fkernel.{cu,py} + bench.json\n│   │   └── …\n│   ├── attribution.json         # per-method causal contribution (ms)\n│   └── bench.json\n├── iterv2\u002F …\n├── iterv3\u002F …\n└── summary.md                   # headline speedup, timeline, bottleneck drift, retrospective\n```\n\n## Manual invocation\n\nYou 
don't need to drive the loop by hand — that's Claude's job — but for debugging the skill itself:\n\n```bash\n# 0 + 0b + 1 + 2 + 3a-for-iter1\npython scripts\u002Forchestrate.py setup \\\n  --baseline   .\u002Fgemm.cu \\\n  --ref        .\u002Fref.py \\\n  --iterations 3 \\\n  --ncu-num    5 \\\n  --branches   4 \\\n  --dims       '{\"M\":4096,\"N\":4096,\"K\":4096}'\n  # --benchmark defaults to scripts\u002Fbenchmark.py (bundled)\n\n# --- (Claude writes iterv1\u002Fkernel.cu + iterv1\u002Fmethods.json + iterv1\u002Fanalysis.md\n#      + K branch candidates under iterv1\u002Fbranches\u002F) ---\n\n# 3d + 3f + 3a-for-iter2 for iter 1\n# close-iter now also runs: branch selection → SASS check → ablation → state update\npython scripts\u002Forchestrate.py close-iter \\\n  --run-dir   run_20260418_143022 \\\n  --iter      1\n  # --benchmark defaults to scripts\u002Fbenchmark.py (bundled)\n\n# (repeat code-gen + close-iter for iter 2 and iter 3)\n\n# 4\npython scripts\u002Forchestrate.py finalize --run-dir run_20260418_143022\n```\n\nEach script is independently invocable (`--help` on any of them); `orchestrate.py` is just a convenience wrapper.\n\n## Repo layout\n\n```text\ncuda-kernel-optimizer\u002F\n├── SKILL.md                         # the skill — Claude reads this\n├── README.md                        # you are here\n├── scripts\u002F\n│   ├── benchmark.py                 # bundled benchmark driver (from project)\n│   ├── check_env.py                 # detect GPU \u002F nvcc \u002F ncu \u002F cuobjdump \u002F CUTLASS \u002F libs\n│   ├── preflight.py                 # validate baseline + ref contract\n│   ├── state.py                     # the ONLY writer of state.json\n│   ├── validate_methods.py          # priority-compliance gate (called by state.py)\n│   ├── run_iteration.py             # calls benchmark.py, captures results\n│   ├── profile_ncu.py               # runs ncu, extracts top-K per axis\n│   ├── roofline.py                  # [V2] compute 
Δc\u002FΔm\u002FΔl, allocate axis budget, near_peak check\n│   ├── branch_explore.py            # [V2] compile + benchmark K branches, elect champion, update frontier\n│   ├── ablate.py                    # [V2] leave-one-out ablation, emit per-method attribution\n│   ├── sass_check.py                # [V2] cuobjdump → grep signatures → per-method SASS verdict\n│   ├── summarize.py                 # renders summary.md (V2: includes bottleneck drift table)\n│   └── orchestrate.py               # end-to-end CLI (setup\u002Fclose-iter\u002Ffinalize)\n├── references\u002F\n│   ├── ncu_metrics_guide.md         # bottleneck → optimization mapping\n│   ├── optimization_catalog.md      # priority-ordered catalog (Claude reads)\n│   ├── method_registry.json         # machine-readable mirror (validator reads)\n│   └── sass_signatures.json         # [V2] method → expected SASS instruction signatures\n├── templates\u002F\n│   ├── iteration_report.md          # analysis.md skeleton Claude fills in\n│   └── methods.schema.json          # schema for methods.json (V2: adds trigger_strength)\n└── examples\u002F\n    └── walkthrough.md               # annotated example session\n```\n\n## How Claude uses this\n\nWhen a user says \"optimize `gemm.cu`\", Claude:\n\n1. reads `SKILL.md`\n2. calls `orchestrate.py setup` (which runs env check → preflight → init → seed baseline → first profile)\n3. reads `iterv1\u002Fncu_top.json` and the current best kernel source\n4. **runs `roofline.py` to get Δc \u002F Δm \u002F Δl and the per-axis method budget (total = 3, per-axis cap = 2); if `near_peak: true`, the loop ends here**\n5. consults `references\u002Foptimization_catalog.md` + `references\u002Fncu_metrics_guide.md`\n6. picks methods **under the axis budget** (budget-aware scan: skip axis if budget=0, pick top-N by `trigger_strength` if budget=2), writes them + reasoning to `iterv1\u002Fmethods.json` and `iterv1\u002Fanalysis.md`\n7. 
writes **K branch candidates** to `iterv1\u002Fbranches\u002Fb{0..K-1}\u002Fkernel.\u003Cext>` — same methods, different hyperparameters (tile \u002F stages \u002F warps \u002F impl variants)\n8. calls `orchestrate.py close-iter --iter 1`, which internally:\n   - runs `branch_explore.py` → compiles + benchmarks all branches, elects the fastest correct one as champion (copied to `iterv1\u002Fkernel.\u003Cext>`), archives the rest in `frontier`\n   - profiles the champion with `ncu` → `iterv1\u002Fkernel.ncu-rep`\n   - runs `sass_check.py` → `iterv1\u002Fsass_check.json`\n   - runs `ablate.py` → `iterv1\u002Fattribution.json`\n   - updates state: each method lands in one of `effective_methods` \u002F `ineffective_methods` \u002F `implementation_failed_methods` based on SASS ✓\u002F✗ × attribution > noise\n9. on correctness failure (all K branches fail): inspects `bench.json.correctness` + `bench.stderr.txt`, rewrites the kernel, retries (up to 3×)\n10. on success: `best_file` advances if faster; `roofline_history` is appended\n11. loops back to step 3 for the next iteration\n12. calls `orchestrate.py finalize` and writes a retrospective into `summary.md` — including the bottleneck drift table sourced from `roofline_history`\n\nSee `examples\u002Fwalkthrough.md` for a full example and `SKILL.md` for the formal procedure.\n\n## Limits and honest caveats\n\n- **Ceiling**: if your reference is already cuBLAS \u002F cuDNN \u002F cuBLASLt, meaningful wins require algorithmic changes (split-K, stream-K, fused epilogues, mixed precision) that Claude may or may not find in a 3-iteration budget. Large speedups are easier when the baseline is hand-rolled.\n- **Noise**: kernels running under ~50 μs are dominated by launch overhead. The skill's default 2% noise threshold helps, but if your dims are tiny, raise `--repeat` or the dimensions. 
Ablation attribution uses the same threshold — sub-noise contributions are classified as `ineffective_methods`.\n- **Triton + `@triton.autotune`**: autotuning under `ncu` is slow and can time out. Either pre-bake a single config before profiling, or set `--launch-count 1` and increase warmup.\n- **ncu CSV column names**: older `ncu` (\u003C 2022.1) emits `\"Metric Value\"` with different capitalization\u002Funits; `profile_ncu.py` is tolerant but if you see all zeros check the `.ncu.log` file in the iteration directory.\n- **Branch cost**: with K=4 and ablation, each iteration compiles up to K + (num_methods) kernels. On a fresh build this can be slow; lower `--branches` if wall-clock matters more than exploration.\n- **SASS signatures are heuristic**: `sass_signatures.json` greps for instruction patterns, not full semantic equivalence. A method can pass the grep but still be implemented suboptimally — attribution is what catches that.\n- **Retries are bounded**: after 3 correctness failures on one iteration, the skill moves on and records the attempt as failed rather than looping forever. A kernel that can't be made correct after 3 tries usually has a conceptual issue that needs human review.\n\n## Example result\n\nUsing the Batch Normalization problem from Tensara as an example, this project demonstrates a substantial performance improvement from a baseline implementation to an optimized kernel. After submission to the A100-80GB environment, the solution passed 4\u002F4 test cases successfully. The average runtime dropped from 82.94 ms to 439.13 μs, while throughput increased dramatically from 2.52 GFLOPS to 476.20 GFLOPS. It is worth noting that most development and tuning were carried out locally on an RTX 3060, so local measurements cannot fully reflect the upper-bound performance achievable on an A100. 
Therefore, the final benchmark results should be based on the platform’s A100 evaluation, which better highlights the impact of careful kernel optimization and implementation details.\n\n![alt text](asset\u002FTensara_baseline.png)\n\n![alt text](asset\u002FTensara_best.png)\n\n\n## License \u002F attribution\n\nThis skill is independent of and does not redistribute CUTLASS, Triton, or Nsight Compute. You need to install those separately.\n\n## Star History\n\n\u003Ca href=\"https:\u002F\u002Fwww.star-history.com\u002F?repos=KernelFlow-ops%2Fcuda-optimized-skill&type=date&legend=top-left\">\n \u003Cpicture>\n   \u003Csource media=\"(prefers-color-scheme: dark)\" srcset=\"https:\u002F\u002Fapi.star-history.com\u002Fchart?repos=KernelFlow-ops\u002Fcuda-optimized-skill&type=date&theme=dark&legend=top-left\" \u002F>\n   \u003Csource media=\"(prefers-color-scheme: light)\" srcset=\"https:\u002F\u002Fapi.star-history.com\u002Fchart?repos=KernelFlow-ops\u002Fcuda-optimized-skill&type=date&legend=top-left\" \u002F>\n   \u003Cimg alt=\"Star History Chart\" src=\"https:\u002F\u002Fapi.star-history.com\u002Fchart?repos=KernelFlow-ops\u002Fcuda-optimized-skill&type=date&legend=top-left\" \u002F>\n \u003C\u002Fpicture>\n\u003C\u002Fa>\n\n","cuda-optimized-skill is a CUDA kernel optimization toolkit for validation, benchmarking, Nsight Compute profiling, bottleneck analysis, and iterative tuning. It helps improve custom GPU operators through reproducible workflows and evidence-based performance comparison. The project is written in Python and uses `nsight-compute` (`ncu`) as the evidence for each optimization decision. Its core mechanisms include Roofline-driven axis budget allocation, Branch-and-Select exploration, ablation-based attribution, and SASS instruction-level verification, which together enable finer-grained method classification and evaluation of optimization results. It suits workloads that need deep optimization of CUDA\u002FCUTLASS\u002FTriton kernels and aim for peak performance, such as high-performance computing and deep learning model acceleration.",2,"2026-06-11 02:46:32","CREATED_QUERY"]