[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-1011":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":14,"stars30d":18,"stars90d":16,"forks30d":16,"starsTrendScore":19,"compositeScore":20,"rankGlobal":10,"rankLanguage":10,"license":21,"archived":22,"fork":22,"defaultBranch":23,"hasWiki":22,"hasPages":22,"topics":24,"createdAt":10,"pushedAt":10,"updatedAt":25,"readmeContent":26,"aiSummary":27,"trendingCount":16,"starSnapshotCount":16,"syncStatus":14,"lastSyncTime":28,"discoverSource":29},1011,"FlashKDA","MoonshotAI\u002FFlashKDA","MoonshotAI","FlashKDA: high-performance Kimi Delta Attention kernels","",null,"Cuda",448,38,2,4,0,1,27,3,50.47,"MIT License",false,"master",[],"2026-06-12 04:00:07","# FlashKDA\n\nFlashKDA: Flash Kimi Delta Attention — high-performance KDA kernels built on CUTLASS\n\n## News\n\n- **2026-04-22** — Deep-Dive Blog: the design decisions behind FlashKDA v1, read it [here](docs\u002F20260420-flashkda-v1-deep-dive.md).\n\n## Requirements\n- SM90 and above\n- CUDA 12.9 and above\n- PyTorch 2.4 and above\n\n## Installation\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FMoonshotAI\u002FFlashKDA.git flash-kda\ncd flash-kda\ngit submodule update --init --recursive\npip install -v .\n```\n\n## Using FlashKDA as an FLA backend\n\nOnce installed, FlashKDA is auto-dispatched from `flash-linear-attention`'s `chunk_kda`. See [fla-org\u002Fflash-linear-attention#852](https:\u002F\u002Fgithub.com\u002Ffla-org\u002Fflash-linear-attention\u002Fpull\u002F852) for integration details.\n\n**Requirements**\n\n1. Install `flash-linear-attention >= 0.5.0`:\n   ```bash\n   pip install -U flash-linear-attention\n   ```\n2. 
Call `chunk_kda` under `torch.inference_mode()`:\n   ```python\n   import torch\n   from fla.ops.kda import chunk_kda\n\n   with torch.inference_mode():\n       out, final_state = chunk_kda(\n           q=q, k=k, v=v, g=g, beta=beta,\n           scale=scale,\n           initial_state=h0,\n           output_final_state=True,\n           use_gate_in_kernel=True,\n           use_qk_l2norm_in_kernel=True,\n           use_beta_sigmoid_in_kernel=True,\n           safe_gate=True,\n           A_log=A_log, dt_bias=dt_bias,\n           lower_bound=lower_bound,\n           transpose_state_layout=True,\n           cu_seqlens=cu_seqlens,\n       )\n   ```\n\n**Opt out:** set `FLA_FLASH_KDA=0` to fall back to the Triton path.\n\n**Debug dispatch:** add `logging.basicConfig(level=logging.INFO)` to see `[FLA Backend] kda.chunk_kda -> flashkda` on hit, or `... rejected: \u003Creason>` on miss.\n\n## Performance\n\nSee [BENCHMARK_H20.md](BENCHMARK_H20.md).\n\n## Tests\n\n```bash\nbash tests\u002Ftest.sh\n```\n\n- `tests\u002Ftest_fwd.py` — correctness tests (exact match against the torch reference; compared with `flash-linear-attention`)\n\n\n## Kernel API\n\n### `flash_kda.fwd`\n\n```python\nflash_kda.fwd(q, k, v, g, beta, scale, out, A_log, dt_bias, lower_bound,\n              initial_state=None, final_state=None, cu_seqlens=None)\n```\n\n**Parameters:**\n\n| Parameter | Dtype | Shape | Description |\n|---|---|---|---|\n| `q` | bf16 | `[B, T, H, K]` | Query |\n| `k` | bf16 | `[B, T, H, K]` | Key |\n| `v` | bf16 | `[B, T, H, V]` | Value |\n| `g` | bf16 | `[B, T, H, K]` | Gate before activation |\n| `beta` | bf16 | `[B, T, H]` | Beta logits (pre-activation; sigmoid applied internally) |\n| `scale` | float | scalar | Scaling factor |\n| `out` | bf16 | `[B, T, H, V]` | Output tensor |\n| `A_log` | fp32 | `[H]` | Log-gate parameter |\n| `dt_bias` | fp32 | `[H, K]` | Gate bias |\n| `lower_bound` | float | scalar | Gate lower bound (in the range -5.0 to 0) |\n| `initial_state` | 
bf16\u002Ffp32\u002FNone | `[B, H, V, K]` or `[N, H, V, K]` | (optional) Initial recurrent state |\n| `final_state` | bf16\u002Ffp32\u002FNone | `[B, H, V, K]` or `[N, H, V, K]` | (optional, output) Final recurrent state |\n| `cu_seqlens` | int64 | `[N+1]` | (optional) Cumulative sequence lengths for variable-length batching |\n\n- Currently requires `K = V = 128`.\n- `initial_state` \u002F `final_state` accept `None` (stateless), bf16, or fp32 tensors. When both are provided, their dtypes must match.\n- When `cu_seqlens` is provided, `B` must be 1, `T` is the total length across all sequences, and `initial_state` \u002F `final_state` have shape `[N, H, V, K]`.\n- When `cu_seqlens` is `None`, each batch element is treated as an independent sequence, and the state shape is `[B, H, V, K]`.\n\n## Development\n\nTo set up IntelliSense (clangd) for the CUDA\u002FC++ sources, run:\n\n```bash\nbash setup_clangd.sh\n```\n\nThis generates a `.clangd` file with the correct repository paths and installs the global clangd `config.yaml` to `~\u002F.config\u002Fclangd\u002F`.\n\n## Citation\n\n```bibtex\n@misc{flashkda2026,\n      title={FlashKDA: Flash Kimi Delta Attention},\n      author={Yutian Chen, Zhiyuan Li, Yucheng Wang, Ming Wei},\n      year={2026},\n      publisher = {GitHub},\n      howpublished = {\\url{https:\u002F\u002Fgithub.com\u002FMoonshotAI\u002FFlashKDA}},\n}\n```\n","FlashKDA is a high-performance Kimi Delta Attention kernel library built on CUTLASS. It mainly provides optimized KDA compute kernels, supports the bfloat16 data type, and achieves efficient parallel computation through CUDA. The project is particularly suited to applications that need to process large-scale attention mechanisms quickly, such as accelerating inference in deep learning models. Its core features include automatic dispatch, a high-precision gating mechanism, and support for initial and final states. To use FlashKDA, users must meet specific hardware requirements (SM90 or newer architectures) and software requirements (CUDA 12.9 and PyTorch 2.4 or later). In addition, FlashKDA can serve as a backend for the `flash-linear-attention` library and is enabled automatically under the appropriate configuration.","2026-06-11 02:41:03","CREATED_QUERY"]