[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-77613":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":14,"subscribersCount":14,"size":14,"stars1d":14,"stars7d":15,"stars30d":16,"stars90d":14,"forks30d":14,"starsTrendScore":14,"compositeScore":17,"rankGlobal":9,"rankLanguage":9,"license":9,"archived":18,"fork":18,"defaultBranch":19,"hasWiki":20,"hasPages":18,"topics":21,"createdAt":9,"pushedAt":9,"updatedAt":22,"readmeContent":23,"aiSummary":24,"trendingCount":14,"starSnapshotCount":14,"syncStatus":25,"lastSyncTime":26,"discoverSource":27},77613,"tilelang-cuda-skills","sablin39\u002Ftilelang-cuda-skills","sablin39","Skills for writing tilelang and debugging with CUDA toolkits. ",null,"Python",123,5,103,0,1,17,2.33,false,"main",true,[],"2026-06-12 02:03:43","# GPU Development Skills\n\nAgent skills for GPU kernel development — writing, debugging, profiling, and optimizing CUDA and [TileLang](https:\u002F\u002Fgithub.com\u002Ftile-ai\u002Ftilelang) GPU kernels.\n\nThe `.vscode` directory is intentionally tracked as a reference to configure highlighting for humans.\n\n## TileLang Skills\n\nTileLang skills created by Claude Opus 4.6 (probably) in Claude Code with the `skill-creator` plugin. Based on TileLang `v0.1.9` docs and examples, validated on RTX PRO 6000 Blackwell GPUs (sm_120) with CUDA 13.1 and PyTorch 2.11.\n\n| Skill | Description | Key Topics |\n|-------|-------------|------------|\n| [writing-tilelang-kernels](skills\u002Ftilelang\u002Fwriting-tilelang-kernels\u002FSKILL.md) | Write TileLang GPU kernels from scratch or by adapting patterns | Kernel anatomy, templates (GEMM, elementwise, reduction), memory scopes, T.copy\u002FT.gemm, dynamic shapes |\n| [debugging-tilelang-programs](skills\u002Ftilelang\u002Fdebugging-tilelang-programs\u002FSKILL.md) | Diagnose and fix errors in TileLang programs | Failure taxonomy, T.print, AutoDD, compute-sanitizer, numerical drift, race detection |\n| [profiling-tilelang-programs](skills\u002Ftilelang\u002Fprofiling-tilelang-programs\u002FSKILL.md) | Benchmark and profile TileLang kernels | do_bench backends, TFLOPS\u002Fbandwidth, ncu bottleneck diagnosis (pipe utilization, warp stalls), roofline |\n| [torch-profiling-tilelang-programs](skills\u002Ftilelang\u002Ftorch-profiling-tilelang-programs\u002FSKILL.md) | Lightweight torch.profiler alternative to ncu\u002Fnsys for TileLang | torch.profiler setup, key_averages, chrome trace, roofline classification (IO \u002F CUDA-core \u002F tensor-core), launch overhead, memory profiling |\n| [optimizing-tilelang-programs](skills\u002Ftilelang\u002Foptimizing-tilelang-programs\u002FSKILL.md) | Optimize TileLang kernels for performance | Tile sizes, pipeline stages, threads, AutoTuner, epilogue fusion, swizzle, ncu-guided tuning |\n| [testing-fwd-bwd-kernels](skills\u002Ftilelang\u002Ftesting-fwd-bwd-kernels\u002FSKILL.md) | Test kernels with forward and backward passes | torch.autograd.Function, compare_backward (not gradcheck), mixed-precision, atomicAdd, attention bwd |\n\n### Workflow\n\nThe skills form a natural progression:\n\n```\nwriting → debugging → profiling → optimizing\n                          ↓           ↓\n            torch-profiling   testing-fwd-bwd (for differentiable ops)\n```\n\n1. **Write** a kernel using templates from the writing skill\n2. **Debug** if it fails to compile or produces wrong results\n3. **Profile** to measure baseline performance — use `profiling-tilelang-programs` for the ncu\u002Fdo_bench workflow, or `torch-profiling-tilelang-programs` when ncu\u002Fnsys aren't available and you want a quick `torch.profiler`-based pass\n4. **Optimize** to improve performance\n5. **Test fwd+bwd** if the kernel needs gradients\n\n## CUDA Skills\n\nGeneral-purpose CUDA development skill for debugging, profiling, and optimizing GPU kernels — independent of any specific framework. Originally from [`technillogue\u002Fptx-isa-markdown`](https:\u002F\u002Fgithub.com\u002Ftechnillogue\u002Fptx-isa-markdown). Includes scraped PTX ISA 9.1, CUDA Runtime API 13.1, and CUDA Driver API 13.1 documentation (640+ markdown files) for grep-based lookup.\n\n| Skill | Description | Key Topics |\n|-------|-------------|------------|\n| [cuda-programming](skills\u002Fcuda_skill\u002FSKILL.md) | Debug, profile, and optimize CUDA kernels | compute-sanitizer, cuda-gdb, ncu\u002Fnsys profiling, NVTX, PTX ISA, coalescing, bank conflicts, inline PTX |","该项目提供了一套用于编写和调试TileLang及CUDA GPU内核的技能指南。核心功能包括从零开始编写GPU内核、诊断与修复程序错误、性能基准测试与分析、优化以及前向后向传递测试等，覆盖了从基础到高级的完整开发流程。技术特点上，项目利用了CUDA工具包，并且支持通过PyTorch进行轻量级性能分析。适合需要深入理解和改进GPU并行计算效率的研究人员或开发者使用，在深度学习模型加速、高性能计算等领域尤为适用。",2,"2026-06-11 03:55:38","CREATED_QUERY"]