[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-74244":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":25,"hasPages":23,"topics":26,"createdAt":10,"pushedAt":10,"updatedAt":33,"readmeContent":34,"aiSummary":35,"trendingCount":16,"starSnapshotCount":16,"syncStatus":36,"lastSyncTime":37,"discoverSource":38},74244,"autokernel","RightNow-AI\u002Fautokernel","RightNow-AI","Autoresearch for GPU kernels. Give it any PyTorch model, go to sleep, wake up to optimized Triton kernels.","https:\u002F\u002Fwww.rightnowai.co\u002Fforge",null,"Python",1402,141,12,8,0,10,16,47,30,87.16,"MIT License",false,"main",true,[27,28,29,30,31,32],"autoresearch","cuda","gpu","kernel-optimization","pytorch","triton","2026-06-12 04:01:14","# AutoKernel\n\n[![Discord](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDiscord-Join%20us-5865F2?logo=discord&logoColor=white)](https:\u002F\u002Fdiscord.gg\u002FUfEyc72t)\n\n**Autoresearch for GPU kernels.** Give it any PyTorch model, go to sleep, wake up to optimized Triton or CUDA C++ kernels.\n\n![AutoKernel Progress](progress.png)\n\nInspired by [@karpathy\u002Fautoresearch](https:\u002F\u002Fgithub.com\u002Fkarpathy\u002Fautoresearch) -- which demonstrated autonomous AI agents for LLM training research. AutoKernel applies the same philosophy to GPU kernel optimization: agent modifies one file, runs a fixed evaluation, keeps or reverts, repeats forever.\n\n## How It Works\n\nGive AutoKernel any PyTorch model. It will:\n\n1. **Profile** the model to find which GPU kernels are bottlenecks\n2. 
**Extract** each bottleneck as a standalone Triton or CUDA C++ kernel\n3. **Optimize** each kernel autonomously (edit, benchmark, keep\u002Frevert -- forever)\n4. **Verify** end-to-end correctness and report the total speedup\n\nThe agent reads `program.md` -- the \"research org code\" -- which contains comprehensive instructions for autonomous operation. It edits `kernel.py` one kernel at a time, runs `bench.py` (fixed benchmark with 5-stage correctness checks + roofline analysis), and either keeps or reverts the change. The orchestrator decides when to move to the next kernel using Amdahl's law.\n\nEach experiment takes ~90 seconds. That's ~40 experiments\u002Fhour, ~320 overnight, across all kernels.\n\n## Quick Start\n\n**Requirements:** NVIDIA GPU (tested on H100\u002FA100\u002FRTX 4090), Python 3.10+, [uv](https:\u002F\u002Fdocs.astral.sh\u002Fuv\u002F).\n\n```bash\n# Install uv (if you don't have it)\ncurl -LsSf https:\u002F\u002Fastral.sh\u002Fuv\u002Finstall.sh | sh\n\n# Clone and setup\ngit clone https:\u002F\u002Fgithub.com\u002FRightNow-AI\u002Fautokernel.git\ncd autokernel\nuv sync\n\n# One-time setup: test data + baselines\nuv run prepare.py\n\n# Profile a model (ships with GPT-2, LLaMA, BERT -- no transformers needed)\nuv run profile.py --model models\u002Fllama_7b.py --class-name LlamaModel \\\n --input-shape 1,512 --dtype float16\n\n# Extract top bottleneck kernels\nuv run extract.py --top 5\n\n# Verify benchmark works\nuv run bench.py\n```\n\n## Running the Agent\n\nSpin up Claude, Codex, or any coding agent in this directory:\n\n```\nRead program.md and let's kick off a new experiment. Start with setup.\n```\n\nThe agent will:\n1. Profile your model and present the optimization plan\n2. Create a branch (e.g., `autokernel\u002Fmar10-llama7b`)\n3. Optimize each bottleneck kernel in priority order\n4. 
Verify end-to-end correctness and report total speedup\n\n`program.md` is intentionally comprehensive so the agent can run 10+ hours without getting stuck. It includes a 6-tier optimization playbook, decision framework, crash handling, and Amdahl's law reasoning.\n\n## The Pipeline\n\n```\n                 profile.py              extract.py           bench.py (loop)         verify.py\nAny PyTorch  ──>  Rank kernels  ──>  Generate baseline  ──>  Optimize each  ──>  End-to-end\n   model          by GPU time       Triton\u002FCUDA kernels     kernel (agent)       verification\n```\n\n| Tool | What it does |\n|------|-------------|\n| `profile.py` | Profiles any PyTorch model with `torch.profiler`, ranks kernels by GPU time, classifies as compute\u002Fmemory-bound |\n| `extract.py` | Extracts top-N bottleneck kernels into standalone Triton or CUDA C++ kernel files (`--backend triton\\|cuda`) |\n| `orchestrate.py` | Multi-kernel scheduler: decides which kernel to optimize next using Amdahl's law, tracks aggregate progress |\n| `bench.py` | Fixed benchmark: 5-stage correctness (smoke, shape sweep, numerical stability, determinism, edge cases) + performance + roofline |\n| `verify.py` | Plugs optimized kernels back into the model, checks end-to-end correctness, reports total speedup |\n\n## Supported Kernels\n\n9 kernel types covering the core operations of modern deep learning:\n\n| Kernel | Description | Key Metric |\n|--------|-------------|------------|\n| **matmul** | Dense matrix multiplication (M x K) @ (K x N) | TFLOPS |\n| **softmax** | Row-parallel numerically stable softmax | GB\u002Fs |\n| **layernorm** | Layer normalization with affine transform | GB\u002Fs |\n| **rmsnorm** | RMS normalization (LLaMA-style) | GB\u002Fs |\n| **flash_attention** | Scaled dot-product attention with causal masking | TFLOPS |\n| **fused_mlp** | SwiGLU-style fused MLP (gate + up + down) | TFLOPS |\n| **cross_entropy** | Fused cross entropy loss | GB\u002Fs |\n| **rotary_embedding** 
| Rotary position embeddings (RoPE) | GB\u002Fs |\n| **reduce** | Parallel reduction (sum) | GB\u002Fs |\n\nEach has a PyTorch reference in `reference.py`, a starter Triton kernel in `kernels\u002F`, and a starter CUDA C++ kernel in `kernels\u002Fcuda\u002F`.\n\n## Example Models\n\nSelf-contained model definitions ship with AutoKernel (no `transformers` library needed):\n\n| Model | File | Params | Usage |\n|-------|------|--------|-------|\n| GPT-2 Small | `models\u002Fgpt2.py` | 124M | `--class-name GPT2 --input-shape 1,1024` |\n| LLaMA (compact) | `models\u002Fllama_7b.py` | 160M | `--class-name LlamaModel --input-shape 1,512` |\n| LLaMA 7B | `models\u002Fllama_7b.py` | 7B | `--class-name LlamaModel7B --input-shape 1,2048` |\n| BERT-base | `models\u002Fbert_base.py` | 110M | `--class-name BertModel --input-shape 8,512` |\n| Custom | `models\u002Fcustom.py` | -- | Template for your own model |\n\nFor HuggingFace models (`uv sync --extra models`):\n\n```bash\nuv run profile.py --module transformers --class-name AutoModelForCausalLM \\\n --pretrained meta-llama\u002FLlama-2-7b-hf --input-shape 1,2048 --dtype float16\n```\n\n## KernelBench Integration\n\nAutoKernel integrates with [KernelBench](https:\u002F\u002Fgithub.com\u002FScalingIntelligence\u002FKernelBench),\nthe standard benchmark for evaluating AI-generated GPU kernels (250+ problems across 4 difficulty\nlevels). 
While most KernelBench evaluations use one-shot LLM generation, AutoKernel runs\n**50-300+ iterative refinement experiments per problem** -- systematically exploring the\noptimization space instead of guessing.\n\n```bash\n# Install KernelBench dependencies\nuv sync --extra kernelbench\n\n# Fetch Level 1 problems from HuggingFace\nuv run kernelbench\u002Fbridge.py fetch --source hf --level 1\n\n# Set up a specific problem for optimization\nuv run kernelbench\u002Fbridge.py setup --level 1 --problem 1 --source hf\n\n# Evaluate (correctness + speedup vs PyTorch reference)\nuv run kernelbench\u002Fbench_kb.py\n\n# Batch score an entire level (computes fast_p metric)\nuv run kernelbench\u002Fscorer.py --level 1\n```\n\nThe agent reads `kernelbench\u002Fprogram_kb.md` for KernelBench-specific optimization instructions:\nhow to write `ModelNew` classes, when to use CUDA C++ vs Triton, fusion strategies per problem\nlevel, and the edit-bench-keep\u002Frevert loop adapted for the KernelBench `fast_p` metric.\n\n| Tool | What it does |\n|------|-------------|\n| `kernelbench\u002Fbridge.py` | Loads problems from HuggingFace or local repo, caches them, generates starter `kernel.py` |\n| `kernelbench\u002Fbench_kb.py` | Evaluates `ModelNew` vs `Model`: 5-trial correctness + CUDA event timing + stability + determinism |\n| `kernelbench\u002Fscorer.py` | Batch evaluation across a level, computes `fast_p` at thresholds (1.0x, 1.5x, 2.0x, 3.0x, 5.0x) |\n| `kernelbench\u002Fprogram_kb.md` | Agent instructions for KernelBench mode |\n\n## HuggingFace Kernels Export\n\nExport optimized kernels to the [HuggingFace Hub](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Fkernels\u002Fen\u002Findex)\nfor easy distribution. 
Users can then load your kernels with a single line:\n\n```python\nfrom kernels import get_kernel\nmodule = get_kernel(\"your-username\u002Fkernel-name\")\n```\n\n```bash\n# Export an optimized CUDA kernel\nuv run export_hf.py --name my_matmul\n\n# Upload to Hub (requires `pip install kernels` and `huggingface-cli login`)\ncd workspace\u002Fhf_export\u002Fmy_matmul\nkernels upload . --repo_id your-username\u002Fmy_matmul\n```\n\n## Project Structure\n\n```\nautokernel\u002F\n  kernel.py             the file the agent modifies (one kernel at a time)\n  program.md            agent instructions -- the \"research org code\"\n\n  bench.py              fixed benchmark + 5-stage correctness harness\n  reference.py          PyTorch reference implementations (ground truth)\n  prepare.py            one-time setup: test data, baselines\n\n  profile.py            profile any PyTorch model, rank kernels by GPU time\n  extract.py            extract bottleneck kernels into workspace\u002F\n  orchestrate.py        multi-kernel scheduler (Amdahl's law)\n  verify.py             end-to-end model verification + speedup report\n  export_hf.py          export optimized kernels to HuggingFace Kernels format\n  analysis.py           experiment visualization (generates progress.png)\n\n  kernels\u002F              starter Triton kernels (9 types)\n  kernels\u002Fcuda\u002F         starter CUDA C++ kernels (9 types, tensor core accelerated)\n  kernelbench\u002F          KernelBench integration (bridge, eval harness, scorer)\n  models\u002F               self-contained model definitions (GPT-2, LLaMA, BERT)\n  workspace\u002F            runtime artifacts (gitignored)\n```\n\n## Design Choices\n\n**Dual backend: Triton + CUDA C++.** Triton for fast iteration (Python-like syntax, compiles in seconds). CUDA C++ for maximum performance (direct access to tensor cores via `wmma`, PTX intrinsics, shared memory bank-conflict-free layouts). 
Triton regularly reaches 80-95% of cuBLAS; CUDA C++ can match or exceed it. Both backends share the same `kernel_fn()` interface -- `bench.py` runs identically on either.\n\n**Correctness first.** The benchmark checks kernel output against PyTorch before measuring performance. A fast but wrong kernel is immediately reverted. This prevents the agent from \"optimizing\" by producing garbage.\n\n**Amdahl's law orchestration.** The orchestrator prioritizes by impact. A 1.5x speedup on a 60% kernel (1.25x end-to-end) beats a 3x speedup on a 5% kernel (1.03x end-to-end). It moves on when diminishing returns set in.\n\n**Single file to modify.** The agent only touches `kernel.py`. Scope stays manageable, diffs reviewable, reverts clean.\n\n**TSV logging.** Results go to a plain `results.tsv` file. Human-readable, git-friendly, trivially parseable, no infrastructure.\n\n## Results Format\n\nEvery experiment is logged to `results.tsv` (tab-separated):\n\n| Column | Description |\n|--------|-------------|\n| `experiment` | Sequential experiment number (0 = baseline) |\n| `tag` | Short identifier |\n| `kernel_type` | Which kernel (e.g., `matmul`) |\n| `throughput_tflops` | Measured throughput (higher is better) |\n| `latency_us` | Execution time in microseconds |\n| `pct_peak` | Percentage of GPU theoretical peak |\n| `speedup_vs_pytorch` | Speedup vs PyTorch\u002FcuBLAS |\n| `correctness` | PASS, FAIL, TIMEOUT, or CRASH |\n| `peak_vram_mb` | Peak GPU memory usage |\n| `description` | What was tried |\n\n## Credits\n\nThis project is **autoresearch for GPU kernels** -- directly inspired by Andrej Karpathy's [autoresearch](https:\u002F\u002Fgithub.com\u002Fkarpathy\u002Fautoresearch), the original experiment in autonomous AI research agents for LLM training. Karpathy showed that an AI agent can run hundreds of experiments overnight, methodically exploring a search space and logging every result. 
AutoKernel applies that same loop -- agent edits one file, runs a fixed evaluation, keeps or reverts -- to the domain of GPU kernel optimization with Triton and native CUDA C++.\n\n**KernelBench** integration is based on the work of Simon Guo, Sean Resta, et al. at Stanford's Scaling Intelligence Lab. Their paper [\"KernelBench: Can LLMs Write GPU Kernels?\"](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.10517) (2025) established the standard benchmark for evaluating AI-generated GPU kernels. AutoKernel extends this by applying iterative optimization (300+ experiments per problem) instead of one-shot generation. KernelBench dataset and evaluation protocol: [ScalingIntelligence\u002FKernelBench](https:\u002F\u002Fgithub.com\u002FScalingIntelligence\u002FKernelBench).\n\nBuilt by [RightNow AI](https:\u002F\u002Fwww.rightnowai.co). For enterprise GPU optimization, check out [RightNow Enterprise](https:\u002F\u002Fwww.rightnowai.co\u002Fforge).\n\n## Changelog\n\n### v1.3.0\n- AMD ROCm GPU support: MI300X, MI325X, MI350X, MI355X detection and specs (thanks [@andyluo7](https:\u002F\u002Fgithub.com\u002Fandyluo7))\n- Fixed `verify.py` SyntaxError on Python 3.13+\n- Fixed CUDA flash_attention ignoring `sm_scale` parameter\n- Fixed CUDA cross_entropy returning wrong dtype\n- Fixed Triton rotary_embedding broadcasting truncation\n- Fixed Triton reduce output shape for non-last-dim reductions\n\n### v1.2.0\n- Enhanced profiler: `--export-trace`, `--memory-snapshot`, `--torch-compile-log` flags\n- HuggingFace Kernels export via `export_hf.py`\n\n### v1.1.0\n- Native CUDA C++ backend with 9 starter kernels (tensor cores, warp intrinsics, shared memory tiling)\n- KernelBench integration (250+ standardized GPU kernel problems)\n- `--backend triton|cuda` flag for `extract.py`\n\n### v1.0.0\n- Initial release: Triton kernel optimization pipeline with 5-stage correctness harness\n\nSee [CHANGELOG.md](CHANGELOG.md) for full details.\n\n## License\n\nMIT\n","AutoKernel is a tool for automatically optimizing GPU 
kernels: it takes any PyTorch model and generates optimized Triton or CUDA C++ kernels. Its core functions are model profiling, extraction and independent optimization of bottleneck kernels, and final end-to-end verification with a speedup report. It uses an autonomous agent that repeatedly edits, tests, and keeps or reverts changes to optimize continuously. It suits workloads that need deep learning models to run more efficiently on GPUs, particularly for researchers and developers who want to reduce manual tuning. It supports GPUs such as the NVIDIA H100\u002FA100\u002FRTX 4090 and Python 3.10+.",2,"2026-06-11 03:49:39","high_star"]