[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-706":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":16,"stars7d":17,"stars30d":18,"stars90d":15,"forks30d":15,"starsTrendScore":19,"compositeScore":20,"rankGlobal":9,"rankLanguage":9,"license":21,"archived":22,"fork":22,"defaultBranch":23,"hasWiki":22,"hasPages":22,"topics":24,"createdAt":9,"pushedAt":9,"updatedAt":25,"readmeContent":26,"aiSummary":27,"trendingCount":15,"starSnapshotCount":15,"syncStatus":28,"lastSyncTime":29,"discoverSource":30},706,"TileKernels","deepseek-ai\u002FTileKernels","deepseek-ai","A kernel library written in tilelang",null,"Python",1581,137,11,6,0,4,17,83,12,19.42,"MIT License",false,"main",[],"2026-06-12 02:00:17","# Tile Kernels\n\nOptimized GPU kernels for LLM operations, built with [TileLang](https:\u002F\u002Fgithub.com\u002Ftile-ai\u002Ftilelang). TileLang is a domain-specific language for expressing high-performance GPU kernels in Python, featuring easy migration, agile development, and automatic optimization.\n\nMost kernels in this project approach the limit of hardware performance regarding the compute intensity and memory bandwidth. Some of them have already been used in internal training and inference scenarios. 
However, they do not represent best practices and we are actively working on improving the code quality and documentation.\n\n## Features\n\n- **Gating** — Top-k expert selection and scoring for Mixture of Experts routing\n- **MoE Routing** — Token-to-expert mapping, fused expansion\u002Freduction and weight normalization\n- **Quantization** — Per-token, per-block, and per-channel FP8\u002FFP4\u002FE5M6 casting with fused SwiGLU+quantization ops\n- **Transpose** — Batched transpose operations\n- **Engram** — Engram gating kernels with fused RMSNorm, forward\u002Fbackward passes and weight gradient reduction\n- **Manifold HyperConnection** — Hyper-connection kernels including Sinkhorn normalization and mix splitting\u002Fapplication\n- **Modeling** — High-level `torch.autograd.Function` wrappers composing low-level kernels into trainable layers (engram gate, mHC pipeline)\n\n## Requirements\n\n- Python 3.10 or higher\n- PyTorch 2.10 or higher\n- TileLang 0.1.9 or higher\n- NVIDIA SM90 or SM100 architecture GPU\n- CUDA Toolkit 13.1 or higher\n\n## Installation\n\n### Install a local development version\n\n```bash\npip install -e \".[dev]\"\n```\n\n### Install a release version\n\n```bash\npip install tile-kernels\n```\n\n## Testing\n\nTests using pytest:\n\n### Test single test file\n\n```bash\npytest tests\u002Ftranspose\u002Ftest_transpose.py -n 4 # Correctness only with 4 workers\npytest tests\u002Ftranspose\u002Ftest_transpose.py --run-benchmark # Correctness + Benchmarking\n```\n\n### Pressure test\n\n```bash\nTK_FULL_TEST=1 pytest -n 4 --count 2\n```\n\n## Project Structure\n\n```txt\ntile_kernels\u002F\n├── moe\u002F        # Mixture of Experts routing related kernels\n├── quant\u002F      # FP8\u002FFP4\u002FE5M6 quantization\n├── transpose\u002F  # Batched transpose\n├── engram\u002F     # Engram gating kernels\n├── mhc\u002F        # Manifold HyperConnection kernels\n├── modeling\u002F   # High-level autograd modeling layers (engram, mHC)\n├── torch\u002F   
   # PyTorch reference implementations\n└── testing\u002F    # Test and benchmark utilities\n```\n\n## Acknowledgement\n\nThis project is built on [TileLang](https:\u002F\u002Fgithub.com\u002Ftile-ai\u002Ftilelang). Thanks and respect to the developers!\n\n## License\n\nThis code repository is released under [the MIT License](LICENSE).\n\n## Citation\n\n```bibtex\n@misc{tilekernels,\n      title={TileKernels},\n      author={Xiangwen Wang, Chenhao Xu, Huanqi Cao, Rui Tian, Weilin Zhao, Kuai Yu and Chenggang Zhao},\n      year={2026},\n      publisher = {GitHub},\n      howpublished = {\\url{https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FTileKernels}},\n}\n```\n","TileKernels is a GPU kernel library written in TileLang and optimized for large language model (LLM) operations. The project provides high-performance kernels for core functionality including gating, MoE routing, quantization, transpose, Engram gating, and hyper-connections, and supports FP8\u002FFP4\u002FE5M6 quantization and fused operations. These kernels approach the limits of hardware performance in terms of compute intensity and memory bandwidth. The library suits workloads that need efficient LLM training and inference, especially performance-critical environments. Developers can compose the low-level kernels into trainable layers through PyTorch autograd function wrappers, which simplifies model construction.",2,"2026-06-11 02:38:48","CREATED_QUERY"]