[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-83873":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":8,"htmlUrl":8,"language":9,"languages":8,"totalLinesOfCode":8,"stars":10,"forks":11,"watchers":12,"openIssues":13,"contributorsCount":14,"subscribersCount":14,"size":14,"stars1d":13,"stars7d":15,"stars30d":15,"stars90d":14,"forks30d":14,"starsTrendScore":16,"compositeScore":17,"rankGlobal":8,"rankLanguage":8,"license":8,"archived":18,"fork":18,"defaultBranch":19,"hasWiki":20,"hasPages":18,"topics":21,"createdAt":8,"pushedAt":8,"updatedAt":22,"readmeContent":23,"aiSummary":8,"trendingCount":14,"starSnapshotCount":14,"syncStatus":24,"lastSyncTime":25,"discoverSource":26},83873,"KDA-Pilot","BBuf\u002FKDA-Pilot","BBuf",null,"Python",177,29,165,1,0,12,7,47.63,false,"main",true,[],"2026-06-12 04:01:42","\u003Cdiv align=\"center\">\n\n# KDA-Pilot\n\n**Evidence-first autonomous GPU-kernel optimization campaigns for SGLang.**\n\nKDA-Pilot turns real serving-framework kernels into reproducible optimization\ntasks: frozen production shapes, copied upstream baselines, symmetric\nbenchmarks, correctness gates, Nsight Compute evidence, KernelWiki references,\nand RLCR-style agent iteration in one place.\n\n[![GitHub stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FBBuf\u002FKDA-Pilot?style=social)](https:\u002F\u002Fgithub.com\u002FBBuf\u002FKDA-Pilot\u002Fstargazers)\n[![GitHub forks](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fforks\u002FBBuf\u002FKDA-Pilot?style=social)](https:\u002F\u002Fgithub.com\u002FBBuf\u002FKDA-Pilot\u002Fforks)\n[![Last commit](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Flast-commit\u002FBBuf\u002FKDA-Pilot?style=flat-square)](https:\u002F\u002Fgithub.com\u002FBBuf\u002FKDA-Pilot\u002Fcommits\u002Fmain)\n[![B200 diffusion](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FB200_diffusion-7_kernel_tasks-2ea44f?style=flat-square)](#b200-diffusion-results)\n[![AI Infra Skills](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fsibling-AI--Infra--Auto--Driven--SKILLS-2f80ed?style=flat-square)](https:\u002F\u002Fgithub.com\u002FBBuf\u002FAI-Infra-Auto-Driven-SKILLS)\n\n\u003C\u002Fdiv>\n\nMost AI kernel demos optimize a snippet. KDA-Pilot optimizes the parts that\nactually show up in SGLang diffusion and LLM serving workflows, then keeps the\nevidence needed to tell whether the agent really improved the production path.\n\nIf you care about autonomous CUDA\u002FTriton\u002FCuTe-DSL optimization that can be\nreplayed, reviewed, and compared against real framework baselines, this is the\nrepo to watch.\n\n## Why It Matters\n\n- **Real workloads, not toy shapes.** Diffusion tasks were built from 20 real\n  SGLang diffusion models and collapsed into per-kernel multi-shape workloads.\n- **Wall-time metrics.** The headline numbers include Python, dispatch,\n  wrappers, kernel launch, and `cuda.synchronize()` overhead, not just isolated\n  device time.\n- **No reward-hacking path.** Baseline and candidate use matching local ABIs;\n  the task does not monkey-patch or import SGLang at runtime.\n- **Knowledge-guided iteration.** Tasks can pull from `KernelWiki` and\n  `ncu-report-skill`, so prior Blackwell\u002FHopper kernel work and NCU bottleneck\n  evidence become part of the optimization loop.\n- **Agent loop with review.** Candidate promotion is tied to correctness gates,\n  run logs, and code review rather than \"one fast row wins\".\n\n## B200 Diffusion Results\n\nThese are wall geomean speedups against the corresponding SGLang\u002FTriton\u002FCuTe-DSL\nbaselines on B200. The measurements include dispatch and synchronization\noverheads, so they are closer to what a user sees from the public kernel path.\n\n| Kernel task | B200 wall geomean | Representative wins |\n| --- | ---: | --- |\n| `qknorm_rope` | 1.1341x | large rows 1.145-1.279x |\n| `norm_infer` | 1.3523x | RMS small 1.634-1.641x |\n| `rotary_embedding` | 1.4912x | HunyuanVideo 2.087x; LTX2 1.133-1.622x |\n| `cutedsl_norm_tanh_mul_add` | 1.4953x | v1 1.602-1.625x |\n| `cutedsl_norm_scale_shift` | 1.3201x | Hunyuan 1.388-1.516x; JoyAI 1.477-1.495x |\n| `fuse_scale_shift` | 2.7499x | small broadcast rows 7.365-7.891x |\n| `group_norm_silu` | 2.3118x | small\u002Fmid C rows 1.369-4.982x; NC rows up to 3.648x |\n\n## KernelWiki-Guided Highlights\n\n| Kernel | KernelWiki \u002F reference | Key techniques |\n| --- | --- | --- |\n| `qknorm_rope` | **TensorRT-LLM PR-13052\u002F11869 DiT QKNorm+RoPE; SGLang PR-15141\u002F19059\u002F21440\u002F21654 fused QKNorm\u002FRoPE; memory-bound pattern** | Shared RoPE staging, Q\u002FK reuse, staged path only for large rows |\n| `norm_infer` | **KernelWiki memory-bound\u002Fvectorized-loads\u002Fregister-budgeting; vLLM PR-31828 SM100 RMSNorm opt-in path** | Warp-row RMS, tiled persistent RMS, 8B\u002F16B vector paths |\n| `rotary_embedding` | **SGLang PR-24411 LTX2 split RoPE; vLLM PR-21126\u002F30729 FlashInfer RoPE routing; vectorized-loads** | 128-bit vector I\u002FO, cos\u002Fsin hoisting, LTX2 block matching |\n| `cutedsl_norm_tanh_mul_add` | **KernelWiki memory-bound\u002Fvectorized-loads\u002Fregister-budgeting; NCU long-scoreboard and launch-bounds evidence** | Hoisted row-invariant math, launch-bounds tuning, exact `tanhf` |\n| `cutedsl_norm_scale_shift` | **SGLang PR-14717 CuTe-DSL norm\u002Fscale\u002Fshift fusion; vectorized-loads; register-budgeting** | Operand-class dispatch, 16B\u002F32B vectors, two-pass variance |\n| `fuse_scale_shift` | **SGLang PR-14717 fused norm\u002Fscale\u002Fshift family; vectorized-loads; cache-policy; memory-bound pattern** | Rowgrid\u002Fflatvec\u002Fexact-C paths, cache hints, one-pass reduction |\n| `group_norm_silu` | **SGLang PR-22814\u002F23148\u002F23938 GroupNorm+SiLU; memory-bound pattern; vectorized-loads** | Split-group stats, generation counters, channels-last transpose |\n\nThe companion write-up records the benchmark interpretation, kernel-specific\noptimization paths, KernelWiki\u002Freference links, and AKO4X comparison:\n[KDA-Pilot optimizing SGLang Diffusion Kernel](https:\u002F\u002Fgithub.com\u002FBBuf\u002Fhow-to-optim-algorithm-in-cuda\u002Fblob\u002Fmain\u002Flarge-language-model\u002Fsglang\u002FKDA-Pilot%20%E4%BC%98%E5%8C%96%20SGLang%20Diffusion%20Kernel%20%E6%95%88%E6%9E%9C%E4%B8%8E%E7%BB%8F%E9%AA%8C.md).\n\n## What Is Inside\n\n```text\ndiffusion\u002F    SGLang diffusion-operator kernel tasks.\n              Each task owns a copied baseline, optimized solution, benchmark,\n              correctness contract, run logs, and result ledger.\n\nllm\u002F          SGLang autoregressive-model kernel-workflow campaign.\n              Serve priority models on B200\u002FH200, benchmark low\u002Fmid\u002Fhigh\n              concurrency, profile forward passes, and turn >=1% non-attention\n              kernels into optimization task cards.\n\nexternal\u002F     Optional shared knowledge submodules.\n              KernelWiki\u002F         Blackwell\u002FHopper kernel design references\n              ncu-report-skill\u002F   Nsight Compute profiling\u002Freport helper\n```\n\nStart with:\n\n- [`diffusion\u002FREADME.md`](diffusion\u002FREADME.md) for standalone diffusion kernel\n  tasks and benchmark rules.\n- [`llm\u002FREADME.md`](llm\u002FREADME.md) for the LLM kernel-workflow campaign.\n- [`diffusion\u002Fdocs\u002Fstandalone_diffusion_benchmark.md`](diffusion\u002Fdocs\u002Fstandalone_diffusion_benchmark.md)\n  for the baseline\u002Fcandidate benchmark contract.\n- [`diffusion\u002Fdocs\u002Fdiffusion_kernel_rules.md`](diffusion\u002Fdocs\u002Fdiffusion_kernel_rules.md)\n  for correctness, fallback, and promotion guardrails.\n\n## Task Lifecycle\n\nEvery diffusion kernel task follows the same shape:\n\n```text\nprompt.md       task card for the agent\nconfig.toml     benchmark\u002Fbuild defaults\nbaseline\u002F       copied upstream SGLang baseline source\nsolution\u002F       optimized candidate source\nbench\u002F          standalone benchmark and correctness harness\ndocs\u002F           run logs, profile notes, source notes, decision ledger\n```\n\nThe important rule is symmetry: the agent must compare the copied baseline and\ncandidate through matching local interfaces, fixed workload rows, preallocated\noutputs, CUDA-event timing, interleaved A\u002FB sampling, strict correctness checks,\nand full provenance.\n\n## Run A Task\n\nClone submodules when you want the optional knowledge references:\n\n```bash\ngit submodule update --init --recursive\n```\n\nLaunch a task from the repo root:\n\n```bash\ndiffusion\u002Fscripts\u002Flaunch_kernels\u002Fk03_b200_diffusion_qknorm_rope__multi_shape.sh\n```\n\nUseful environment switches:\n\n```bash\nKDA_NO_CLAUDE=1                 # prepare the worktree without launching an agent\nKDA_BASE_BRANCH=\u003Cref>           # launch from a specific committed ref\nKDA_BASH_BIN=\u002Fopt\u002Fhomebrew\u002Fbin\u002Fbash\n```\n\nmacOS `\u002Fbin\u002Fbash` 3.2 is rejected by the launcher because nested Humanize\u002FCodex\nhooks rely on modern Bash behavior.\n\n## Current Campaigns\n\n- **Diffusion kernels:** qk norm + RoPE, norm inference, rotary embedding,\n  fused scale\u002Fshift, group norm + SiLU, CuTe-DSL norm\u002Ftanh\u002Fmul\u002Fadd, and\n  CuTe-DSL norm\u002Fscale\u002Fshift across B200 and H200 task folders.\n- **LLM kernel workflow:** model-level serving commands, benchmark sweeps,\n  torch profiler traces, and kernel inventories for future optimization tasks.\n- **Open frontier:** compute-bound kernels such as FA4\u002FMHA and GEMM-like paths\n  remain harder; this repo keeps the failed and partial attempts visible so the\n  next loop can start from evidence instead of folklore.\n\n## Star History\n\n[![Star History Chart](https:\u002F\u002Fapi.star-history.com\u002Fsvg?repos=BBuf\u002FKDA-Pilot&type=Date)](https:\u002F\u002Fstar-history.com\u002F#BBuf\u002FKDA-Pilot&Date)\n",2,"2026-06-11 04:11:43","CREATED_QUERY"]