[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-76167":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":8,"htmlUrl":8,"language":9,"languages":8,"totalLinesOfCode":8,"stars":10,"forks":11,"watchers":12,"openIssues":13,"contributorsCount":14,"subscribersCount":14,"size":14,"stars1d":14,"stars7d":15,"stars30d":16,"stars90d":14,"forks30d":14,"starsTrendScore":14,"compositeScore":17,"rankGlobal":8,"rankLanguage":8,"license":8,"archived":18,"fork":18,"defaultBranch":19,"hasWiki":20,"hasPages":18,"topics":21,"createdAt":8,"pushedAt":8,"updatedAt":22,"readmeContent":23,"aiSummary":24,"trendingCount":14,"starSnapshotCount":14,"syncStatus":25,"lastSyncTime":26,"discoverSource":27},76167,"kernel-pilot","BBuf\u002Fkernel-pilot","BBuf",null,"Python",165,28,97,1,0,4,56,4.39,false,"main",true,[],"2026-06-12 02:03:40","\u003Cdiv align=\"center\">\n\n# KernelPilot\n\n**An autonomous Humanize-powered GPU kernel optimization loop with peer\nevidence routes, Nsight Compute report skills, and clean standalone benchmark\nrepos.**\n\n[![GitHub stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FBBuf\u002Fkernel-pilot?style=social)](https:\u002F\u002Fgithub.com\u002FBBuf\u002Fkernel-pilot\u002Fstargazers)\n[![GitHub forks](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fforks\u002FBBuf\u002Fkernel-pilot?style=social)](https:\u002F\u002Fgithub.com\u002FBBuf\u002Fkernel-pilot\u002Fforks)\n[![Last commit](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Flast-commit\u002FBBuf\u002Fkernel-pilot?style=flat-square)](https:\u002F\u002Fgithub.com\u002FBBuf\u002Fkernel-pilot\u002Fcommits\u002Fmain)\n[![PR evidence](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPR_evidence-3660-2ea44f?style=flat-square)](knowledge\u002Fevidence\u002Fpull-bundles\u002F)\n[![Knowledge 
cutoff](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcutoff-2026--05--16-8250df?style=flat-square)](knowledge\u002Fdata\u002Frefresh-cutoff.yaml)\n\n\u003C\u002Fdiv>\n\nKernelPilot is for serious CUDA kernel tuning runs where the important facts\nare easy to lose: which upstream PR inspired a candidate, which shape regressed,\nwhat Nsight Compute actually said, which evidence changed the next edit, and\nwhether the candidate belongs in a framework repo or a clean experiment.\n\nThe project packages three cooperating skills:\n\n| Skill | Role |\n| --- | --- |\n| [`humanize-kernel-agent-loop`](humanize\u002Fskills\u002Fhumanize-kernel-agent-loop\u002F) | Turns kernel definition `K`, reference `R`, and workload distribution `W` into task-acceptance pairs, a standalone optimization repo, autonomous research\u002Fiteration\u002Fautotuning, correctness tests, benchmarks, ledgers, dispatcher, tuning decisions, and review-gated iteration. |\n| [`kernel-knowledge`](knowledge\u002FSKILL.md) | Kernel evidence acquisition through peer routes: local PR diffs, cloned external source-map repos, and live web\u002Fofficial\u002Fupstream source research. |\n| [`ncu-report`](humanize\u002Fskills\u002Fncu-report\u002F) | Converts Nsight Compute reports into a reproducible profile digest: metrics, source counters, PM sampling, PTX\u002FSASS hotspots, bottleneck diagnosis, and exactly one next kernel edit. |\n\nTogether they make an optimization loop that can work from a simple request:\n\n```text\n[$humanize-kernel-agent-loop] Optimize SGLang's GEMM path for M=64, N=2048, K=2048, fp16, bias=true, and beat the current SGLang baseline by at least 10%.\n```\n\nThe loop decides how to plan, when to query knowledge, what to profile, how to\nrecord lineage, how to scan the workload distribution, and when to ask the\nHumanize review gate whether another round is needed. 
The human should specify\nthe target when it is ambiguous; the loop owns the rest.\n\n## Why Use It\n\n- **Peer evidence routes.** The agent can use local PR diffs, cloned upstream\n  source-map repositories, and live web\u002Fofficial\u002Fupstream research as equal\n  ways to gather kernel evidence.\n- **Standalone by default.** Candidate kernels do not pollute SGLang, vLLM,\n  PyTorch, or other large framework repos. The loop creates an isolated repo\n  with bindings, tests, benchmarks, ledgers, lineage, and profile artifacts.\n  The standalone repo is where implementation artifacts, provenance, and\n  measurements live.\n- **Evidence-driven profiling.** The loop decides when `ncu-report` is worth\n  running, then uses it to move from vague labels like \"memory-bound\" toward\n  measured bottlenecks and one concrete next edit.\n- **Evidence-backed edits.** The agent draws on local upstream PR diffs, cloned\n  source-map repositories, and live web\u002Fofficial\u002Fupstream source research as\n  peer evidence routes, widening the search inside a route or cross-checking\n  against another route before letting a thin match shape the kernel.\n- **Review-gated iteration.** Humanize RLCR keeps the loop from declaring\n  victory too early; default loop budget is 84 iterations unless configured\n  otherwise.\n- **Shape-aware tuning.** The loop treats benchmark cases as a workload\n  distribution, builds a performance map, and emits dispatcher\u002Ftuning decisions\n  when different regimes need different kernels or configurations.\n\n## Kernel Agent Loop\n\n```mermaid\nflowchart LR\n    K[Kernel definition K] --> P[Plan P = task and AC pairs]\n    R[Correctness reference R] --> P\n    W[Workload distribution W] --> P\n    P --> S[Clean standalone repo]\n\n    subgraph R0[Stage 1: Research]\n        KW[kernel-knowledge \u002F evidence routes]\n        B[Baseline and repo inspection]\n        RD[Research digest and recipes]\n        KW --> RD\n        B --> RD\n    
end\n\n    subgraph I0[Stage 2: Iterate]\n        T[Writer executes task t_i]\n        E[Inspect, edit, compile, test, benchmark, profile]\n        V{Reviewer checks evidence vs ac_i}\n        T --> E --> V\n        V -->|blocked feedback| T\n    end\n\n    subgraph A0[Stage 3: Autotune]\n        PM[Performance map over W]\n        D[Shape-aware dispatcher]\n        TD[Tuning decisions]\n        PM --> D --> TD\n    end\n\n    S --> RD --> T\n    V -->|pass| PM\n    E -->|profile evidence needed| NCU[ncu-report \u002F Nsight Compute]\n    NCU --> T\n    E -->|prior art needed| KW\n    TD --> O[Final kernels, dispatcher, correctness\u002Fbenchmark matrix, fallback paths, unsupported regimes]\n```\n\nThe writer agent is not hardcoded. In Codex it can be Codex; in Claude Code it\ncan be Claude. The review backend and model come from Humanize configuration.\nUnlike the paper's in-repository version, KernelPilot keeps implementation\nartifacts in a clean standalone repo unless the user explicitly asks for an\nin-place framework patch.\n\n## Kernel Requests\n\nA useful request names the kernel definition, correctness reference, workload\ndistribution, target hardware, scope, benchmark method, and performance target.\nKernelPilot turns that into a task-acceptance plan, an isolated implementation\nworkspace, repeatable measurements, profiler evidence, lineage, performance\nmap, dispatcher\u002Ftuning decisions, and Humanize review rounds.\n\nExisting implementations, PR diffs, live upstream sources, official docs, and\nprofile reports are working materials for the loop. When external source or\ndesign evidence materially influences a candidate, the standalone repo records\nthe provenance, license or notice requirements, and the optimization delta.\n\n## Knowledge Base\n\nThe knowledge base lives in [`knowledge\u002F`](knowledge\u002F). 
It is a local skill root\nand does not need a global environment variable for normal query use.\n\nCurrent snapshot:\n\n| Corpus layer | Contents |\n| --- | --- |\n| PR evidence | 3,660 merged CUDA\u002FTriton\u002FCuTe\u002FCUTLASS-related PR pages and bundles from 14 upstream repos (SGLang, vLLM, TensorRT-LLM, PyTorch, FlashAttention, FlashInfer, CUTLASS\u002FCuTe, CCCL, Triton, DeepGEMM, ThunderKittens, TileLang, QuACK, DeepSeek TileKernels), Jan 2024 through May 16 2026. |\n| External source map | `knowledge\u002Findex.json` points at the **complementary** code repositories not in the PR corpus (NVIDIA developer samples, Colfax research kernels, simveit micro-tutorials) for live clone\u002Fsearch workflows. |\n| Candidate ledgers | 14 include\u002Fdefer ledgers for PR ingestion. Dropped PRs are not kept as per-PR rows. |\n\nPrimary organization:\n\n```text\nknowledge\u002F\n|-- SKILL.md\n|-- README.md\n|-- scripts\u002F\n|   |-- query.py\n|   |-- get_page.py\n|   |-- fetch-pr-evidence.py\n|   `-- validate.py\n|-- sources\u002F\n|   `-- prs\u002F\n|-- evidence\u002F\n|   `-- pull-bundles\u002F\n|-- candidates\u002F\n`-- data\u002F\n```\n\nThe important rule is **no local summaries as evidence**. The supported routes\nare local PR diffs, cloned source-map repositories, and live web\u002Fofficial\u002F\nupstream source research. There is no local wiki\u002Fdoc\u002Fblog\u002Fcontest fallback.\n\n`knowledge\u002Findex.json` is kept as an external source map over the\ncomplementary repositories not covered by the PR corpus. Working with it is a\ntwo-step flow: clone the referenced repos with `scripts\u002Fclone-index-repos.py`,\nthen grep them with `scripts\u002Fsearch-index-repos.py`. 
The search script enforces\nthe clone step, so the clone is the only gate.\n\n## Query Examples\n\nRun knowledge tools from the knowledge root:\n\n```bash\ncd knowledge\npython3 scripts\u002Fquery.py \"tcgen05\" --architecture B200 --limit 10\npython3 scripts\u002Fsearch-pr-diffs.py tcgen05 tmem --any --limit 200\npython3 scripts\u002Fquery.py --repo pytorch\u002Fpytorch --compact\npython3 scripts\u002Fget_page.py pr-pytorch-157241\npython3 scripts\u002Fclone-index-repos.py\npython3 scripts\u002Fsearch-index-repos.py tma swizzle transpose\npython3 scripts\u002Fvalidate.py\n```\n\n## ncu-report\n\n`ncu-report` standardizes the profiling part of the loop. It creates a digest\nthat compares a candidate to a baseline or parent version and ends with one\nspecific edit to try next.\n\nTypical capture:\n\n```bash\nmkdir -p profile-artifacts\u002Fv000_baseline\nncu --target-processes all \\\n    --kernel-name regex:\"\u003Ckernel-name-pattern>\" \\\n    --launch-skip 5 --launch-count 1 \\\n    --set full --import-source on \\\n    --section SpeedOfLight \\\n    --section SchedulerStats \\\n    --section WarpStateStats \\\n    --section Occupancy \\\n    --section LaunchStats \\\n    --section MemoryWorkloadAnalysis \\\n    --section SourceCounters \\\n    -o profile-artifacts\u002Fv000_baseline\u002Freport \\\n    python benchmarks\u002F\u003Cbench>.py --shape \u003Cshape> --dtype \u003Cdtype>\n\nncu --import profile-artifacts\u002Fv000_baseline\u002Freport.ncu-rep \\\n    --page raw --csv > profile-artifacts\u002Fv000_baseline\u002Fraw.csv\nncu --import profile-artifacts\u002Fv000_baseline\u002Freport.ncu-rep \\\n    --page details > profile-artifacts\u002Fv000_baseline\u002Fdetails.txt\n```\n\nThe skill inspects SpeedOfLight, scheduler stats, warp state stalls, occupancy,\nlaunch stats, memory workload, source counters, PM sampling when the installed\nNCU exposes it, and when relevant PTX\u002FSASS dumps from `cuobjdump` or\n`nvdisasm`.\n\n## Install\n\nKernelPilot is 
not Codex-only. It can be used from Claude Code, Codex, or Kimi.\n\n### Claude Code\n\nInstall the Humanize plugin from this repository, then expose the KernelPilot\nknowledge skill to Claude Code:\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FBBuf\u002Fkernel-pilot.git\ncd kernel-pilot\n\n# Add the KernelPilot marketplace and install its Humanize plugin.\nhumanize\u002Fscripts\u002Finstall-skills-claude.sh\n```\n\nThe installer adds the KernelPilot marketplace, installs\n`humanize@KernelPilot`, exposes `knowledge\u002F` as the `kernel-knowledge` skill,\ninstalls the knowledge query dependency, hydrates Claude Code's installed skill\ncache with absolute `HUMANIZE_RUNTIME_ROOT` and `KERNELPILOT_ROOT` paths, and\nfails if those placeholders remain. Restart Claude Code after installing, then\nconfirm the plugin and skills are visible:\n\n```bash\nclaude plugin list\nclaude plugin details humanize@KernelPilot\n```\n\nInside Claude Code, you should see commands such as\n`\u002Fhumanize:start-rlcr-loop` and skills such as `humanize-kernel-agent-loop`,\n`kernel-knowledge`, and `ncu-report`. For a one-session local checkout without\ninstalling the marketplace, start Claude Code with:\n\n```bash\nclaude --plugin-dir \u002Fpath\u002Fto\u002Fkernel-pilot\u002Fhumanize \\\n  --add-dir \u002Fpath\u002Fto\u002Fkernel-pilot\n```\n\n### Codex\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FBBuf\u002Fkernel-pilot.git\ncd kernel-pilot\nhumanize\u002Fscripts\u002Finstall-skills-codex.sh\n```\n\nGeneric installer:\n\n```bash\ncd kernel-pilot\nhumanize\u002Fscripts\u002Finstall-skill.sh --target codex\n```\n\nThe installer hydrates `{{KERNELPILOT_ROOT}}` into installed skills and\nvalidates that the root contains `knowledge\u002FSKILL.md` and\n`knowledge\u002Fevidence\u002Fpull-bundles\u002F`. 
If the knowledge base is missing, install\nfails instead of producing a broken skill.\n\n### Kimi\n\nFor Kimi-oriented setups, use:\n\n```bash\ncd kernel-pilot\nhumanize\u002Fscripts\u002Finstall-skills-kimi.sh\n```\n\nAfter installation, restart the agent session and check that these skills are\navailable:\n\n```text\nhumanize-kernel-agent-loop\nkernel-knowledge\nncu-report\n```\n\nIf Humanize reports that hooks need review, approve the Stop hook in the client\nUI before relying on review-gated loop exits.\n\n## Prompt Card\n\nKernel optimization:\n\n```text\n[$humanize-kernel-agent-loop] Optimize SGLang's int8_scaled_mm kernel on H100 for M=64, N=2048, K=2048, out_dtype=fp16, bias=true. Keep the work in a clean standalone repo, compare correctness and latency against the current SGLang baseline, and beat that baseline by at least 10% p50 latency on this focused case.\n```\n\nKeep the prompt focused on the target kernel, environment, correctness checks,\nbenchmark, and performance target.\n\nExample result from this shape:\n\n| Shape | Candidate | SGLang baseline | Result |\n| --- | ---: | ---: | ---: |\n| `M=64, N=2048, K=2048, fp16+bias` | `0.015184 ms` p50 | `0.017888 ms` p50 | `15.12%` faster |\n\nThe stop hook summary should make the round outcome and review decision easy to\ninspect:\n\n![Humanize stop hook summary](docs\u002Fassets\u002Fhumanize-stop-hook-summary.png)\n\nThe optimization ledger should make selected versions and rejected follow-ups\neasy to scan:\n\n![KernelPilot optimization ledger](docs\u002Fassets\u002Fkernelpilot-optimization-ledger.png)\n\n## Maintenance\n\nValidate the knowledge base:\n\n```bash\ncd knowledge\npip install -r requirements.txt\npython3 scripts\u002Fvalidate.py\n```\n\nMaterialize missing PR evidence bundles during corpus maintenance:\n\n```bash\ncd knowledge\npython3 scripts\u002Ffetch-pr-evidence.py --repo pytorch\u002Fpytorch --max-files 16\npython3 scripts\u002Fvalidate.py\n```\n\nRun Humanize tests after changing 
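skills:\n\n```bash\ncd humanize\ntests\u002Frun-all-tests.sh\n```\n\n## Star History\n\n[![Star History Chart](https:\u002F\u002Fapi.star-history.com\u002Fsvg?repos=BBuf\u002Fkernel-pilot&type=Date)](https:\u002F\u002Fwww.star-history.com\u002F#BBuf\u002Fkernel-pilot&Date)\n\n## Related\n\n- [Humanize](https:\u002F\u002Fgithub.com\u002FPolyArch\u002Fhumanize): the RLCR runtime that\n  KernelPilot specializes for GPU kernel optimization.\n- [AI-Infra-Auto-Driven-SKILLS](https:\u002F\u002Fgithub.com\u002FBBuf\u002FAI-Infra-Auto-Driven-SKILLS):\n  broader serving, profiling, SGLang, incident, and model optimization skills.\n","KernelPilot is an automated tool for CUDA kernel tuning that combines a Humanize-powered GPU kernel optimization loop, peer evidence routes, and Nsight Compute report skills. Its core features include autonomous research\u002Fiteration\u002Fautotuning, correctness tests, benchmarks, ledgers, and dispatcher decisions, and it converts Nsight Compute reports into reproducible profile digests. The project is especially suited to CUDA kernel development that needs careful tuning and verification, particularly where important facts are easy to lose (such as which upstream PR inspired a candidate or which shapes regressed). It also gathers kernel evidence from local PR diffs, cloned external source-map repositories, and live web\u002Fofficial\u002Fupstream sources, and it keeps candidate kernels out of large framework repos by providing a clean standalone research environment.",2,"2026-06-06 03:55:49","CREATED_QUERY"]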