[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-78061":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":8,"htmlUrl":8,"language":9,"languages":8,"totalLinesOfCode":8,"stars":10,"forks":11,"watchers":12,"openIssues":13,"contributorsCount":13,"subscribersCount":13,"size":13,"stars1d":13,"stars7d":13,"stars30d":14,"stars90d":13,"forks30d":13,"starsTrendScore":13,"compositeScore":15,"rankGlobal":8,"rankLanguage":8,"license":16,"archived":17,"fork":17,"defaultBranch":18,"hasWiki":19,"hasPages":17,"topics":20,"createdAt":8,"pushedAt":8,"updatedAt":21,"readmeContent":22,"aiSummary":23,"trendingCount":13,"starSnapshotCount":13,"syncStatus":24,"lastSyncTime":25,"discoverSource":26},78061,"triton-llm-kernel-lab","zengxiao-he\u002Ftriton-llm-kernel-lab","zengxiao-he",null,"Python",132,1,28,0,109,0.9,"MIT License",false,"main",true,[],"2026-06-12 02:03:45","# Triton LLM Inference Kernel Lab\n\nSmall GPU kernel lab for LLM inference primitives in Python, Triton, PyTorch,\nand CUDA. The repo focuses on readable kernels and reproducible validation:\nrow-wise softmax, FP16 GEMM, and a FlashAttention-style fused attention forward\nkernel using tiled online softmax.\n\nThis is a cleaned-up reconstruction of earlier kernel experiments. 
Benchmark\nnumbers are intentionally generated by the local harness instead of being baked\ninto the README, because GPU model, driver, Triton version, and tensor shapes\nchange the results materially.\n\n## Kernels\n\n- `row_softmax`: one Triton program per row, numerically stable max-subtract\n  softmax, intended for bandwidth-bound reductions where a row fits in SRAM.\n- `fp16_matmul`: tiled FP16 GEMM with FP32 accumulation, block-level `tl.dot`,\n  grouped program ordering for better L2 reuse, and configurable\n  `BLOCK_M\u002FBLOCK_N\u002FBLOCK_K`.\n- `flash_attention_forward`: fused attention forward for prefill and simple\n  decode-style shapes, using tiled QK and PV blocks plus the online softmax\n  recurrence to avoid materializing the full attention matrix.\n\n## Project Layout\n\n```text\nsrc\u002Ftriton_llm_kernel_lab\u002F\n  bench.py              # CLI benchmark harness\n  configs.py            # kernel configs and LLM-like benchmark shapes\n  reference.py          # PyTorch reference implementations\n  runtime.py            # CUDA\u002FTriton availability checks\n  kernels\u002F\n    attention.py        # FlashAttention-style fused attention forward\n    gemm.py             # FP16 GEMM kernel\n    softmax.py          # row-wise fused softmax kernel\ntests\u002F\n  test_references.py    # CPU-safe reference and config checks\n  test_gpu_kernels.py   # CUDA\u002FTriton correctness checks, skipped otherwise\ndocs\u002F\n  profiling.md          # Nsight Compute workflow and metrics\n  tradeoffs.md          # prefill vs decode kernel selection notes\n```\n\n## Install\n\nUse a Linux environment with an NVIDIA GPU for the Triton kernels.\n\n```bash\npython -m venv .venv\nsource .venv\u002Fbin\u002Factivate\npip install -U pip\npip install -e \".[gpu,dev]\"\n```\n\nOn a CPU-only machine, install only the test\u002Fdev path:\n\n```bash\npip install -e \".[dev]\"\npytest tests\u002Ftest_references.py\n```\n\n## Correctness\n\nThe GPU tests compare every 
custom kernel against a PyTorch reference and report\nthe max absolute error. On a CUDA machine:\n\n```bash\npytest tests\u002Ftest_gpu_kernels.py -q\n```\n\nThe test coverage includes:\n\n- softmax rows with non-power-of-two column counts\n- FP16 GEMM with masked edge tiles\n- causal and non-causal fused attention forward\n\n## Benchmarking\n\nThe harness uses 50 warmup iterations and 200 timed iterations by default. It\nprints latency, estimated TFLOPS, estimated memory bandwidth, and max error.\n\n```bash\npython -m triton_llm_kernel_lab.bench --kernel all\npython -m triton_llm_kernel_lab.bench --kernel attention --warmup 50 --iters 200 --csv results\u002Fattention.csv\n```\n\nRepresentative shape groups are defined in `configs.py`:\n\n- prefill: longer query\u002Fkey lengths where QK and PV dominate arithmetic\n- decode: short query length with long KV cache where memory traffic dominates\n- GEMM: common projection and MLP matrix sizes\n- softmax: row lengths that stress SRAM fit and reduction behavior\n\n## Profiling\n\nUse Nsight Compute for detailed GPU metrics:\n\n```bash\nbash scripts\u002Fprofile_ncu.sh attention\n```\n\nThe profiling notes in `docs\u002Fprofiling.md` track the metrics that matter for\nthis lab: achieved occupancy, memory throughput, L2 hit rate, warp stalls,\ntensor core utilization, and DRAM read\u002Fwrite transactions.\n\n## Open-Source References\n\nThe implementation style was informed by the public Triton tutorials and\nFlashAttention papers, but the code in this repository is written as a compact\nteaching\u002Flab version rather than a copy of those tutorials.\n\n- Triton fused softmax tutorial:\n  https:\u002F\u002Ftriton-lang.org\u002Fmain\u002Fgetting-started\u002Ftutorials\u002F02-fused-softmax.html\n- Triton matrix multiplication tutorial:\n  https:\u002F\u002Fgithub.com\u002Ftriton-lang\u002Ftriton\u002Fblob\u002Fmain\u002Fpython\u002Ftutorials\u002F03-matrix-multiplication.py\n- Triton fused attention tutorial:\n  
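https:\u002F\u002Ftriton-lang.org\u002Fmain\u002Fgetting-started\u002Ftutorials\u002F06-fused-attention.html\n- FlashAttention:\n  https:\u002F\u002Farxiv.org\u002Fabs\u002F2205.14135\n- FlashAttention-2:\n  https:\u002F\u002Ftridao.me\u002Fpublications\u002Fflash2\u002Fflash2.pdf\n","This project is a small GPU kernel lab for the building blocks of large language model inference, written in Python, Triton, PyTorch, and CUDA. Its core features are row-wise softmax, FP16 GEMM, and a FlashAttention-style fused attention forward kernel that uses tiled online softmax, with the goal of providing readable kernel implementations and reproducible results. It suits work that optimizes or studies key operations in LLM inference, and is especially useful for exploring performance differences across GPU hardware configurations. The project is clearly structured and ships detailed benchmark scripts and validation methods, so developers can get started quickly and adjust parameters to their own needs.",2,"2026-06-11 03:56:25","CREATED_QUERY"]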