[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-72058":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":23,"hasPages":25,"topics":26,"createdAt":10,"pushedAt":10,"updatedAt":27,"readmeContent":28,"aiSummary":29,"trendingCount":16,"starSnapshotCount":16,"syncStatus":30,"lastSyncTime":31,"discoverSource":32},72058,"tilelang","tile-ai\u002Ftilelang","tile-ai"," Domain-specific language designed to streamline the development of high-performance GPU\u002FCPU\u002FAccelerators kernels","https:\u002F\u002Ftilelang.com\u002F",null,"Python",6473,597,40,81,0,37,87,308,111,39.33,"Other",false,"main",true,[],"2026-06-12 02:02:58","\u003Cimg src=.\u002Fimages\u002Flogo-row.svg \u002F>\n\n\u003Cdiv align=\"center\">\n\n# Tile Language\n[![PyPI version](https:\u002F\u002Fbadge.fury.io\u002Fpy\u002Ftilelang.svg)](https:\u002F\u002Fbadge.fury.io\u002Fpy\u002Ftilelang)\n[![Ask DeepWiki](https:\u002F\u002Fdeepwiki.com\u002Fbadge.svg)](https:\u002F\u002Fdeepwiki.com\u002Ftile-ai\u002Ftilelang)\n[![Discord](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDiscord-%235865F2.svg?logo=discord&logoColor=white)](https:\u002F\u002Fdiscord.gg\u002FTUrHyJnKPG)\n[![Puzzles](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🧩_Learn-TileLang_Puzzles-blueviolet)](https:\u002F\u002Fgithub.com\u002Ftile-ai\u002Ftilelang-puzzles)\n\u003C\u002Fdiv>\n\nTile Language (**tile-lang**) is a concise domain-specific language designed to streamline the development of high-performance GPU\u002FCPU kernels (e.g., GEMM, Dequant GEMM, FlashAttention, LinearAttention). 
By employing a Pythonic syntax with an underlying compiler infrastructure on top of [TVM](https:\u002F\u002Ftvm.apache.org\u002F), tile-lang allows developers to focus on productivity without sacrificing the low-level optimizations necessary for state-of-the-art performance.\n\n\u003Cimg src=.\u002Fimages\u002FMatmulExample.png \u002F>\n\n## Latest News\n- 02\u002F02\u002F2026 🧩: Check out [TileLang Puzzles](https:\u002F\u002Fgithub.com\u002Ftile-ai\u002Ftilelang-puzzles), a fun and interactive way to learn TileLang programming with 10 progressively harder puzzles!\n- 12\u002F18\u002F2025 🚀: Added [CuTeDSL backend](https:\u002F\u002Fgithub.com\u002Ftile-ai\u002Ftilelang\u002Fpull\u002F1421) support, enabling compilation to NVIDIA CUTLASS CuTe DSL! Join us in building and optimizing this exciting new backend: [Issue #1454](https:\u002F\u002Fgithub.com\u002Ftile-ai\u002Ftilelang\u002Fissues\u002F1454).\n- 12\u002F17\u002F2025 🔬: Integrated the [Z3 theorem prover](https:\u002F\u002Fgithub.com\u002Ftile-ai\u002Ftilelang\u002Fpull\u002F1367) into the TVM Arith Analyzer, bringing SMT-based symbolic reasoning for enhanced optimizations and automatic correctness verification!\n- 10\u002F31\u002F2025 🔧: Migrated to [apache-tvm-ffi](https:\u002F\u002Fgithub.com\u002Ftile-ai\u002Ftilelang\u002Fpull\u002F1108), significantly reducing CPU overhead!\n- 10\u002F30\u002F2025 📦: We have released v0.1.6.post2, which is the last version compatible with Python 3.8.\n- 10\u002F07\u002F2025 🍎: Added Apple Metal Device support, check out [Pull Request #799](https:\u002F\u002Fgithub.com\u002Ftile-ai\u002Ftilelang\u002Fpull\u002F799) for details.\n- 09\u002F29\u002F2025 🎉: Thrilled to announce that AscendC and Ascend NPU IR backends targeting Huawei Ascend chips are now supported!\nCheck out the preview here:\n🔗 [link](https:\u002F\u002Fgithub.com\u002Ftile-ai\u002Ftilelang-ascend).\nThis includes implementations across two 
branches:\n[ascendc_pto](https:\u002F\u002Fgithub.com\u002Ftile-ai\u002Ftilelang-ascend) and\n[npuir](https:\u002F\u002Fgithub.com\u002Ftile-ai\u002Ftilelang-ascend\u002Ftree\u002Fnpuir).\nFeel free to explore and share your feedback!\n- 07\u002F04\u002F2025 🚀: Introduced `T.gemm_sp` for 2:4 sparse tensor core support, check out [Pull Request #526](https:\u002F\u002Fgithub.com\u002Ftile-ai\u002Ftilelang\u002Fpull\u002F526) for details.\n- 06\u002F05\u002F2025 ✨: Added [NVRTC Backend](https:\u002F\u002Fgithub.com\u002Ftile-ai\u002Ftilelang\u002Fpull\u002F461) to significantly reduce compilation time for cute templates!\n- 04\u002F14\u002F2025 🚀: Added high-performance FlashMLA implementation for AMD MI300X, achieving performance parity with hand-optimized assembly kernels of Aiter! See [example_mla_amd](.\u002Fexamples\u002Fdeepseek_mla\u002Famd\u002FREADME.md) for details.\n- 03\u002F03\u002F2025 🚀: Added high-performance MLA Decoding support using only 80 lines of Python code, achieving performance on par with FlashMLA on H100 (see [example_mla_decode.py](.\u002Fexamples\u002Fdeepseek_mla\u002Fexample_mla_decode.py))! 
We also provide [documentation](.\u002Fexamples\u002Fdeepseek_mla\u002FREADME.md) explaining how TileLang achieves this.\n- 02\u002F15\u002F2025 ✨: Added WebGPU Codegen support, see [Pull Request #86](https:\u002F\u002Fgithub.com\u002Ftile-ai\u002Ftilelang\u002Fpull\u002F86)!\n- 02\u002F12\u002F2025 ✨: Excited to announce the release of [v0.1.0](https:\u002F\u002Fgithub.com\u002Ftile-ai\u002Ftilelang\u002Freleases\u002Ftag\u002Fv0.1.0)!\n- 02\u002F10\u002F2025 🚀: Added debug tools for TileLang: `T.print` for printing variables\u002Fbuffers ([docs](https:\u002F\u002Ftilelang.com\u002Ftutorials\u002Fdebug_tools_for_tilelang.html)) and a memory layout plotter ([examples\u002Fplot_layout](.\u002Fexamples\u002Fplot_layout)).\n- 01\u002F20\u002F2025 ✨: We are excited to announce that tile-lang, a DSL for high-performance AI workloads, is now open source and available to the public!\n\n## Tested Devices\nAlthough tile-lang aims to be portable across a range of devices, it has been specifically tested and validated on the following: for NVIDIA GPUs, the H100 (with Auto TMA\u002FWGMMA support), A100, V100, RTX 4090, RTX 3090, and RTX A6000; for AMD GPUs, the MI250 (with Auto MatrixCore support) and the MI300X (with Async Copy support).\n\n## OP Implementation Examples\n**tile-lang** provides the building blocks to implement a wide variety of operators. 
Some examples include:\n\n- [Matrix Multiplication](.\u002Fexamples\u002Fgemm\u002F)\n- [Dequantization GEMM](.\u002Fexamples\u002Fdequantize_gemm\u002F)\n- [Flash Attention](.\u002Fexamples\u002Fflash_attention\u002F)\n- [Flash Linear Attention](.\u002Fexamples\u002Flinear_attention\u002F)\n- [Flash MLA Decoding](.\u002Fexamples\u002Fdeepseek_mla\u002F)\n- [Native Sparse Attention](.\u002Fexamples\u002Fdeepseek_nsa\u002F)\n\nWithin the `examples` directory, you will also find more complex kernels, such as convolutions and forward\u002Fbackward passes for FlashAttention. More operators will be added over time.\n\n## Benchmark Summary\n\nTileLang achieves strong performance across a variety of computational patterns. Comprehensive benchmark scripts and settings are available at [tilelang-benchmark](https:\u002F\u002Fgithub.com\u002Ftile-ai\u002Ftilelang-benchmark). Below are selected results:\n\n- MLA Decoding Performance on H100\n\n  \u003Cdiv style=\"display: flex; gap: 10px; justify-content: center;\">\n    \u003Cdiv style=\"flex: 1;\">\n      \u003Cimg src=\".\u002Fexamples\u002Fdeepseek_mla\u002Ffigures\u002Fbs64_float16.png\" alt=\"mla decode performance bs64 on H100\" width=\"100%\" \u002F>\n    \u003C\u002Fdiv>\n    \u003Cdiv style=\"flex: 1;\">\n      \u003Cimg src=\".\u002Fexamples\u002Fdeepseek_mla\u002Ffigures\u002Fbs128_float16.png\" alt=\"mla decode performance bs128 on H100\" width=\"100%\" \u002F>\n    \u003C\u002Fdiv>\n  \u003C\u002Fdiv>\n\n- Flash Attention Performance on H100\n\n  \u003Cdiv align=\"center\">    \u003Cimg src=\".\u002Fimages\u002Fmha_performance_h100.png\" alt=\"operator performance on H100\" width=80% \u002F>\n  \u003C\u002Fdiv>\n\n- Matmul Performance on GPUs (RTX 4090, A100, H100, MI300X)\n\n  \u003Cdiv>\n    \u003Cimg src=\".\u002Fimages\u002Fop_benchmark_consistent_gemm_fp16.png\" alt=\"gemm fp16 performance on GPUs\" \u002F>\n  \u003C\u002Fdiv>\n\n- Dequantize Matmul Performance on 
A100\n\n  \u003Cdiv>\n    \u003Cimg src=\".\u002Fimages\u002Fop_benchmark_a100_wq_gemv.png\" alt=\"dequantize gemv performance on A100\" \u002F>\n  \u003C\u002Fdiv>\n\n## Installation\n### Method 1: Install with Pip\n\nThe quickest way to get started is to install the latest release from PyPI:\n\n```bash\npip install tilelang\n```\n\nAlternatively, you can install directly from the GitHub repository:\n\n```bash\npip install git+https:\u002F\u002Fgithub.com\u002Ftile-ai\u002Ftilelang\n```\n\nOr install locally:\n\n```bash\n# install required system dependencies\nsudo apt-get update\nsudo apt-get install -y python3-setuptools gcc libtinfo-dev zlib1g-dev build-essential cmake libedit-dev libxml2-dev\n\npip install -e . -v # remove -e option if you don't want to install in editable mode, -v for verbose output\n```\n\n### Method 2: Build from Source\nWe currently provide three ways to install **tile-lang** from source:\n- [Install from Source (using your own TVM installation)](.\u002Fdocs\u002Fget_started\u002FInstallation.md#method-1-install-from-source-using-your-own-tvm-installation)\n- [Install from Source (using the bundled TVM submodule)](.\u002Fdocs\u002Fget_started\u002FInstallation.md#method-2-install-from-source-using-the-bundled-tvm-submodule)\n- [Install Using the Provided Script](.\u002Fdocs\u002Fget_started\u002FInstallation.md#method-3-install-using-the-provided-script)\n\n### Method 3: Install with Nightly Version\n\nFor users who want access to the latest features and improvements before official releases, we provide nightly builds of **tile-lang**.\n\n```bash\npip install tilelang -f https:\u002F\u002Ftile-ai.github.io\u002Fwhl\u002Fnightly\n# or pip install tilelang --find-links https:\u002F\u002Ftile-ai.github.io\u002Fwhl\u002Fnightly\n```\n\n> **Note:** Nightly builds contain the most recent code changes but may be less stable than official releases. 
They're ideal for testing new features or if you need a specific bugfix that hasn't been released yet.\n\n## Quick Start\n\nIn this section, you'll learn how to write and execute a straightforward GEMM (matrix multiplication) kernel using tile-lang, followed by techniques for layout optimizations, pipelining, and L2-cache–friendly swizzling.\n\n### GEMM Example with Annotations (Layout, L2 Cache Swizzling, and Pipelining, etc.)\n\nBelow is an example that demonstrates more advanced features: layout annotation, parallelized copy, and swizzle for improved L2 cache locality. This snippet shows how to adapt your kernel to maximize performance on complex hardware.\n\n```python\nimport torch\nimport tilelang\nimport tilelang.language as T\n\n\n# @tilelang.jit(target=\"cuda\")\n# target can currently be \"cuda\", \"hip\", or \"cpu\".\n# if not specified, it will be inferred from the input tensors at compile time\n@tilelang.jit\ndef matmul_relu(\n    A, B,\n    block_M: int = 64,\n    block_N: int = 64,\n    block_K: int = 64,\n    dtype: T.dtype = T.float16,\n    accum_dtype: T.dtype = T.float32,\n):\n    # declare compile-time shape constants\n    M, N, K = T.const('M, N, K')\n\n    # annotate input tensor shapes\n    A: T.Tensor[[M, K], dtype]\n    B: T.Tensor[[K, N], dtype]\n\n    # allocate output tensor\n    C = T.empty([M, N], dtype)\n\n    with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M), threads=128) as (bx, by):\n        A_shared = T.alloc_shared((block_M, block_K), dtype)\n        B_shared = T.alloc_shared((block_K, block_N), dtype)\n        C_local = T.alloc_fragment((block_M, block_N), accum_dtype)\n\n        # Enable rasterization for better L2 cache locality (Optional)\n        # T.use_swizzle(panel_size=10, enable=True)\n\n        # Clear local accumulation\n        T.clear(C_local)\n\n        for ko in T.Pipelined(T.ceildiv(K, block_K), num_stages=3):\n            # Copy tile of A\n            # T.copy is syntactic sugar for a parallelized copy\n            T.copy(A[by * block_M, ko * block_K], A_shared)\n\n       
     # Copy tile of B\n            T.copy(B[ko * block_K, bx * block_N], B_shared)\n\n            # Perform a tile-level GEMM on the shared buffers\n            # Currently dispatched to CuTe\u002FHIP on NVIDIA\u002FAMD GPUs\n            T.gemm(A_shared, B_shared, C_local)\n\n        # Apply ReLU element-wise\n        for i, j in T.Parallel(block_M, block_N):\n            C_local[i, j] = T.max(C_local[i, j], 0)\n\n        # Copy result back to global memory\n        T.copy(C_local, C[by * block_M, bx * block_N])\n\n    # You can write multiple CUDA kernels in one function; they execute sequentially\n    # with T.Kernel(...) as ...\n\n    # Return the tensor; you can also return multiple tensors\n    return C\n\n\nM, N, K = 1024, 1024, 1024\n\na = torch.randn(M, K, device=\"cuda\", dtype=torch.float16)\nb = torch.randn(K, N, device=\"cuda\", dtype=torch.float16)\nc_ref = torch.relu(a @ b)\n\n# Call the kernel\nc = matmul_relu(a, b)\ntorch.testing.assert_close(c, c_ref, rtol=1e-2, atol=1e-2)\n\n# Call the kernel with overridden compilation constants\nc = matmul_relu(a, b, block_M=128, block_N=128, block_K=64)\ntorch.testing.assert_close(c, c_ref, rtol=1e-2, atol=1e-2)\n\n# Retrieve the compiled kernel\nkernel = matmul_relu.compile(a, b) # use torch.Tensor\nkernel = matmul_relu.compile(      # use T.Tensor as placeholder\n  T.Tensor((M, K), T.float16),\n  T.Tensor((K, N), T.float16)\n)\nkernel = matmul_relu.compile(      # directly specify the shape constants\n  M=M, N=N, K=K,\n  block_M=128, block_N=128, block_K=64\n)\nprint(kernel.get_kernel_source())\nc = kernel(a, b)\n\n# Profile the kernel latency\nprofiler = kernel.get_profiler(tensor_supply_type=tilelang.TensorSupplyType.Normal)\nlatency = profiler.do_bench()\nprint(f\"Latency: {latency} ms\")\n```\n\n### Dive Deep into TileLang Beyond GEMM\n\nIn addition to GEMM, we provide a variety of examples to showcase the versatility and power of TileLang, including:\n\n- [Dequantize 
GEMM](.\u002Fexamples\u002Fdequantize_gemm\u002F): Achieve high-performance dequantization through **fine-grained control over per-thread operations**, with many features now adopted as default behaviors in [BitBLAS](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FBitBLAS), which uses layout transformations and intrinsics to accelerate dequantize GEMM.\n- [FlashAttention](.\u002Fexamples\u002Fflash_attention\u002F): Enable cross-operator fusion with simple and intuitive syntax; an auto-tuning example is also provided.\n- [LinearAttention](.\u002Fexamples\u002Flinear_attention\u002F): Examples include RetNet and Mamba implementations.\n- [Convolution](.\u002Fexamples\u002Fconvolution\u002F): Implementations of convolution with IM2Col.\n\n## Upcoming Features\n\nCheck our [tilelang v0.2.0 release plan](https:\u002F\u002Fgithub.com\u002Ftile-ai\u002Ftilelang\u002Fissues\u002F79) for upcoming features.\n\n---\n\nTileLang is now used in the [BitBLAS](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FBitBLAS) and [AttentionEngine](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FAttentionEngine) projects.\n\n## Join the Discussion\n\nJoin our Discord community for discussions, support, and collaboration!\n\n[![Join our Discord](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDiscord-Join%20Us-blue?logo=discord&style=for-the-badge)](https:\u002F\u002Fdiscord.gg\u002FTUrHyJnKPG)\n\n## Acknowledgments\n\nWe would like to express our gratitude to the [TVM](https:\u002F\u002Fgithub.com\u002Fapache\u002Ftvm) community for their invaluable contributions. The initial version of this project was mainly developed by [LeiWang1999](https:\u002F\u002Fgithub.com\u002FLeiWang1999), [chengyupku](https:\u002F\u002Fgithub.com\u002Fchengyupku) and [nox-410](https:\u002F\u002Fgithub.com\u002Fnox-410) with supervision from Prof. [Zhi Yang](https:\u002F\u002Fyangzhihome.github.io) at Peking University. 
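Part of this work was carried out during an internship at Microsoft Research, where Dr. Lingxiao Ma, Dr. Yuqing Xia, Dr. Jilong Xue, and Dr. Fan Yang offered valuable advice and support. We deeply appreciate their mentorship and contributions.\n","Tile Language is a domain-specific language designed to simplify the development of high-performance GPU\u002FCPU kernels. By using Python syntax on top of the TVM compiler infrastructure, it lets developers focus on productivity without sacrificing the low-level optimizations needed for top-tier performance. Its core capabilities include support for algorithms such as GEMM, Dequant GEMM, and FlashAttention, and it integrates the Z3 theorem prover to enhance optimizations and automatically verify correctness. In addition, Tile Language supports multiple backends, such as NVIDIA CUTLASS CuTe DSL, Apple Metal devices, and Huawei Ascend chips, making it suitable for scenarios that require efficient cross-platform execution of compute-intensive tasks, such as deep learning model training and inference and scientific computing.",2,"2026-06-11 03:40:10","high_star"]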