[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-11706":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":9,"totalLinesOfCode":9,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":9,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":9,"rankLanguage":9,"license":9,"archived":22,"fork":22,"defaultBranch":23,"hasWiki":22,"hasPages":22,"topics":9,"createdAt":9,"pushedAt":9,"updatedAt":24,"readmeContent":25,"aiSummary":26,"trendingCount":16,"starSnapshotCount":16,"syncStatus":27,"lastSyncTime":28,"discoverSource":29},11706,"FlashMLA","deepseek-ai\u002FFlashMLA","deepseek-ai","FlashMLA: Efficient Multi-head Latent Attention Kernels",null,"https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FFlashMLA","C++",12697,1057,110,66,0,6,21,55,18,44.07,false,"main","2026-06-12 02:02:33","# FlashMLA\n\n## Introduction\n\nFlashMLA is DeepSeek's library of optimized attention kernels, powering the [DeepSeek-V3](https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FDeepSeek-V3) and [DeepSeek-V3.2-Exp](https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FDeepSeek-V3.2-Exp) models. 
This repository contains the following implementations:\n\n**Sparse Attention Kernels**\n\n*These kernels power DeepSeek Sparse Attention (DSA), as introduced in [this paper](https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FDeepSeek-V3.2-Exp).*\n\n- Token-level sparse attention for the prefill stage\n- Token-level sparse attention for the decoding stage, with FP8 KV cache\n\n**Dense Attention Kernels**\n\n- Dense attention for the prefill stage\n- Dense attention for the decoding stage\n\n## News\n\n- **2025.09.29 Release of Sparse Attention Kernels**: With the launch of [DeepSeek-V3.2](https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FDeepSeek-V3.2-Exp), we are releasing the corresponding token-level sparse attention kernels. These kernels power the model's DeepSeek Sparse Attention (DSA) and achieve up to 640 TFlops during prefilling and 410 TFlops during decoding. We have also published a deep-dive blog post on our new FP8 sparse decoding kernel. Check it out [here](docs\u002F20250929-hopper-fp8-sparse-deep-dive.md).\n- **2025.08.01 Kernels for MHA on SM100**: Thanks to [NVIDIA's PR](https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FFlashMLA\u002Fpull\u002F76) for MHA forward \u002F backward kernels on SM100!\n- **2025.04.22 Deep-Dive Blog**: We'd love to share the technical details behind the new FlashMLA kernel! Check out our deep-dive write-up [here](docs\u002F20250422-new-kernel-deep-dive.md).\n- **2025.04.22 Performance Update**: We're excited to announce the new release of FlashMLA, which delivers a 5% to 15% performance improvement for compute-bound workloads, achieving up to 660 TFlops on NVIDIA H800 SXM5 GPUs. The interface of the new version is fully compatible with the old one. Simply upgrade to the new version for an immediate performance boost! 
🚀🚀🚀\n\n## Performance\n\n#### Test & benchmark MLA decoding (Sparse & Dense):\n\n```bash\npython tests\u002Ftest_flash_mla_dense_decoding.py\npython tests\u002Ftest_flash_mla_sparse_decoding.py\n```\n\nThe dense MLA decoding kernel achieves up to 3000 GB\u002Fs in memory-bound configuration and 660 TFLOPS in computation-bound configuration on H800 SXM5 with CUDA 12.8. The token-level sparse MLA decoding kernel (which uses an FP8 KV cache while performing the matrix multiplication in bfloat16) achieves 410 TFLOPS in compute-bound configuration on H800 SXM5 with CUDA 12.8, and achieves up to 350 TFlops on B200 (which is not really optimized yet).\n\n#### Test & benchmark MHA prefill (Dense):\n\n```bash\npython tests\u002Ftest_fmha_sm100.py\n```\n\nIt achieves up to 1460 TFlops in forward and 1000 TFlops in backward computation on B200, as reported by NVIDIA.\n\n#### Test & benchmark MLA prefill (Sparse):\n\n```bash\npython tests\u002Ftest_flash_mla_sparse_prefill.py\n```\n\nIt achieves up to 640 TFlops in forward computation on H800 SXM5 with CUDA 12.8, and achieves up to 1450 TFlops on B200, CUDA 12.9.\n\n## Requirements\n\n- SM90 \u002F SM100 (See the support matrix below)\n- CUDA 12.8 and above (CUDA 12.9+ is required for SM100 kernels)\n- PyTorch 2.0 and above\n\nSupport matrix:\n\n| Kernel | GPU Architecture | MLA Mode [2] | KVCache Format |\n| :---: | :---: | :---: | :---: |\n| Dense Decoding | SM90 | MQA | BF16 |\n| Sparse Decoding | SM90 & SM100 | MQA | FP8 [1] |\n| Dense Prefill | SM100 | MHA |  |\n| Sparse Prefill | SM90 & SM100 | MQA |  |\n\n[1]: For more details on using FP8 KV cache, see documents below.\n\n[2]: Here \"MLA Mode\" refers to the mode used for MLA calculation. MQA stands for Multi-Query Attention mode (i.e. `head_dim_k` =  576 with `head_dim_v` = 512), while MHA stands for Multi-Head Attention mode (i.e. `head_dim_k` = 192 \u002F 128 with `head_dim_v` = 128). 
For a detailed explanation of these modes, please refer to the appendix of [DeepSeek V3.2's Paper](https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FDeepSeek-V3.2-Exp).\n\n## Installation\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FFlashMLA.git flash-mla\ncd flash-mla\ngit submodule update --init --recursive\npip install -v .\n```\n\n## Usage\n\n### MLA Decoding\n\nTo use the MLA decoding kernels, call get_mla_metadata once before the decoding loop to get the tile scheduler metadata. Then, call flash_mla_with_kvcache in each decoding step. For example:\n\n```python\nfrom flash_mla import get_mla_metadata, flash_mla_with_kvcache\n\ntile_scheduler_metadata, num_splits = get_mla_metadata(\n    cache_seqlens,\n    s_q * h_q \u002F\u002F h_kv,\n    h_kv,\n    h_q,\n    is_fp8,\n    topk,\n)\n\nfor i in range(num_layers):\n    ...\n    o_i, lse_i = flash_mla_with_kvcache(\n        q_i, kvcache_i, block_table, cache_seqlens, dv,\n        tile_scheduler_metadata, num_splits,\n        is_causal, is_fp8_kvcache, indices,\n    )\n    ...\n```\n\nWhere\n\n- `s_q` is the number of q tokens per q sequence. If MTP (speculative decoding) is disabled, it should be 1.\n- `h_kv` is the number of key-value heads.\n- `h_q` is the number of query heads.\n\n**FP8 KV Cache:**\nIf `is_fp8_kvcache` is set to `True`, the kernel reads the KV cache in the \"FP8 with scale\" format (described below). It dequantizes the cache to bfloat16 and performs attention computation in bfloat16. The output is also in bfloat16.\n\nIn the \"FP8 with scale\" format, each token's KV cache is 656 Bytes, structured as:\n-   **First 512 bytes:** The \"quantized NoPE\" part, containing 512 `float8_e4m3` values.\n-   **Next 16 bytes:** Scale factors, containing 4 `float32` values. The first `float32` is the scale for the first 128 `float8_e4m3` values, the second for the next 128, and so on.\n-   **Last 128 bytes:** The \"RoPE\" part, containing 64 `bfloat16` values. 
This part is not quantized for accuracy.\n\nSee `tests\u002Fquant.py` for quantization and dequantization details.\n\n**Sparse Attention (`indices` tensor):**\nThe `indices` tensor (if provided) enables token-level sparse attention by instructing the kernel to compute attention only for specified tokens.\n\n-   **Shape:** `indices` should be a 3D tensor of shape `(batch_size, seq_len_q, topk)`.\n-   **Format:** `indices_in_kvcache[i][j][k] = (the index of the page block where token t resides) * page_block_size + (the offset of token t within the page block)`, where `t` is the k-th token for the j-th query sequence in the i-th batch. Since the index of the page block has already been encoded into `indices_in_kvcache`, the kernel does not require the `block_table` parameter.\n-   **Invalid entries:** Set invalid indices to `-1`.\n\n**Return Values:**\nThe kernel returns `(out, lse)`, where:\n-   `out` is the attention result.\n-   `lse` is the log-sum-exp value of the attention scores for each query head.\n\nSee `tests\u002Ftest_flash_mla_decoding.py` for a complete example.\n\n### Sparse MLA Prefill\n\nFor the sparse MLA prefill kernel, call `flash_mla_sparse_fwd` directly with the following parameters:\n-   `q`: Query tensor of shape `[s_q, h_q, d_qk]`\n-   `kv`: Key-Value tensor of shape `[s_kv, h_kv, d_qk]`\n-   `indices`: Indices tensor of shape `[s_q, h_kv, topk]`\n-   `sm_scale`: A scalar value\n\n**Note on batching:** This kernel does not support a batch dimension. For multi-batch inference, reshape the input tensors and adjust the `indices` parameter to simulate batch processing.\n\n**Invalid indices:** Set invalid entries in `indices` to `-1` or any number `>= s_kv`.\n\n**Return Values and Equivalent PyTorch Code:**\nThe kernel returns `(out, max_logits, lse)`. 
This is equivalent to the following PyTorch reference (a sketch that assumes every entry of `indices` is valid):\n\n```python\nimport math\nimport torch\n\ndef sparse_mla_prefill_reference(Q, kv, indices, sm_scale):\n    # Q: [s_q, h_q, d_qk], bfloat16\n    # kv: [s_kv, h_kv, d_qk], bfloat16\n    # indices: [s_q, h_kv, topk], int32 (assumed here to contain only valid entries)\n\n    kv = kv.squeeze(1)  # [s_kv, d_qk], h_kv must be 1\n    indices = indices.squeeze(1).long()  # [s_q, topk]\n    focused_kv = kv[indices]  # For the i-th sequence (s_q), the corresponding KV tokens are selected from the KV cache based on indices[i, :]. This operation results in a tensor of shape [s_q, topk, d_qk].\n\n    P = (Q @ focused_kv.transpose(-1, -2)) * sm_scale * math.log2(math.e)  # [s_q, h_q, topk]\n    max_logits = P.max(dim=-1).values  # [s_q, h_q]\n    # Base-2 log-sum-exp: both the exponentiation and the logarithm are base-2.\n    lse = torch.logsumexp(P * math.log(2), dim=-1) * math.log2(math.e)  # [s_q, h_q]\n    S = torch.exp2(P - lse.unsqueeze(-1))  # [s_q, h_q, topk]\n    out = S @ focused_kv  # [s_q, h_q, d_qk]\n\n    return (out, max_logits, lse)\n```\n\nSee `tests\u002Ftest_flash_mla_prefill.py` for a complete example.\n\n### Dense MHA Prefill\n\nThis kernel implements the standard dense Multi-Head Attention (MHA) forward and backward operations. It can be called using:\n-   `flash_attn_varlen_func`\n-   `flash_attn_varlen_qkvpacked_func`\n-   `flash_attn_varlen_kvpacked_func`\n\nThe usage is similar to the `flash_attn` package. 
See `tests\u002Ftest_fmha_sm100.py` for a complete example.\n\n## Acknowledgement\n\nFlashMLA is inspired by [FlashAttention 2&3](https:\u002F\u002Fgithub.com\u002Fdao-AILab\u002Fflash-attention\u002F) and [cutlass](https:\u002F\u002Fgithub.com\u002Fnvidia\u002Fcutlass) projects.\n\n## Community Support\n\n### MetaX\nFor MetaX GPUs, visit the official website: [MetaX](https:\u002F\u002Fwww.metax-tech.com).\n\nThe corresponding FlashMLA version can be found at: [MetaX-MACA\u002FFlashMLA](https:\u002F\u002Fgithub.com\u002FMetaX-MACA\u002FFlashMLA)\n\n\n### Moore Threads\nFor the Moore Threads GPU, visit the official website: [Moore Threads](https:\u002F\u002Fwww.mthreads.com\u002F).\n\nThe corresponding FlashMLA version is available on GitHub: [MooreThreads\u002FMT-flashMLA](https:\u002F\u002Fgithub.com\u002FMooreThreads\u002FMT-flashMLA).\n\n\n### Hygon DCU\nFor the Hygon DCU, visit the official website: [Hygon Developer](https:\u002F\u002Fdeveloper.sourcefind.cn\u002F).\n\nThe corresponding FlashMLA version is available here: [OpenDAS\u002FMLAttention](https:\u002F\u002Fdeveloper.sourcefind.cn\u002Fcodes\u002FOpenDAS\u002FMLAttention).\n\n\n### Intellifusion\nFor the Intellifusion NNP, visit the official website: [Intellifusion](https:\u002F\u002Fwww.intellif.com).\n\nThe corresponding FlashMLA version is available on Gitee: [Intellifusion\u002Ftyllm](https:\u002F\u002Fgitee.com\u002FIntellifusion_2025\u002Ftyllm\u002Fblob\u002Fmaster\u002Fpython\u002Ftylang\u002Fflash_mla.py).\n\n\n### Iluvatar Corex\nFor Iluvatar Corex GPUs, visit the official website: [Iluvatar Corex](https:\u002F\u002Fwww.iluvatar.com).\n\nThe corresponding FlashMLA version is available on GitHub: [Deep-Spark\u002FFlashMLA](https:\u002F\u002Fgithub.com\u002FDeep-Spark\u002FFlashMLA\u002Ftree\u002Filuvatar_flashmla)\n\n\n### AMD Instinct\nFor AMD Instinct GPUs, visit the official website: [AMD 
Instinct](https:\u002F\u002Fwww.amd.com\u002Fen\u002Fproducts\u002Faccelerators\u002Finstinct.html).\n\nThe corresponding FlashMLA version can be found at: [AITER\u002FMLA](https:\u002F\u002Fgithub.com\u002FROCm\u002Faiter\u002Fblob\u002Fmain\u002Faiter\u002Fmla.py)\n\n## Citation\n\n```bibtex\n@misc{flashmla2025,\n      title={FlashMLA: Efficient Multi-head Latent Attention Kernels},\n      author={Jiashi Li, Shengyu Liu},\n      year={2025},\n      publisher = {GitHub},\n      howpublished = {\\url{https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FFlashMLA}},\n}\n```\n","FlashMLA is a library of optimized attention kernels developed by DeepSeek that powers the DeepSeek-V3 and DeepSeek-V3.2-Exp models. The project consists mainly of sparse and dense attention kernels: the sparse kernels provide efficient token-level attention for the prefill and decoding stages and introduce an FP8 KV cache, while the dense kernels are optimized for the prefill and decoding stages. By exploiting the compute capability of NVIDIA GPUs, FlashMLA reaches up to 1460 TFlops in specific configurations (dense MHA prefill on B200), making it well suited to highly parallel, latency-sensitive training and inference of large language models.",2,"2026-06-11 03:32:23","trending"]