[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-979":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":23,"hasPages":23,"topics":25,"createdAt":10,"pushedAt":10,"updatedAt":26,"readmeContent":27,"aiSummary":28,"trendingCount":16,"starSnapshotCount":16,"syncStatus":17,"lastSyncTime":29,"discoverSource":30},979,"FlashQLA","QwenLM\u002FFlashQLA","QwenLM","high-performance linear attention kernel library built on TileLang","",null,"Python",533,44,3,4,0,2,11,57,6,64.16,"MIT License",false,"main",[],"2026-06-12 04:00:06","\u003Cp align=\"center\">\n    \u003Cimg src=\"https:\u002F\u002Fqianwen-res.oss-cn-beijing.aliyuncs.com\u002Fflashqla\u002Fflashqla.png\" width=\"1000\"\u002F>\n\u003Cp>\n\n\u003Cp align=\"center\">|&nbsp&nbsp 📜 \u003Ca href=\"https:\u002F\u002Fqwen.ai\u002Fblog?id=flashqla\">Blog\u003C\u002Fa>&nbsp&nbsp |\u003C\u002Fp>\n\n## Introduction\n\nFlashQLA is a high-performance linear attention kernel library built on [TileLang](https:\u002F\u002Fgithub.com\u002Ftile-ai\u002Ftilelang). FlashQLA applies **reasonable operator fusion and performance optimization** to the forward and backward passes of GDN Chunked Prefill, achieving **2-3× forward speedup** and **2× backward speedup** over the FLA Triton kernel across multiple scenarios on NVIDIA Hopper. The efficiency gains are particularly pronounced in pretraining scenarios and edge-side agentic inference.\n\nKey features:\n\n1.**Gate-driven automatic intra-card context parallelism**. By exploiting the exponential decay property of the GDN gate, FlashQLA automatically enables intra-card CP under TP, long-sequence, and small-head-count settings, improving GPU SM utilization.\n\n2.**Hardware-friendly algebraic reformulation**. We reformulate the forward and backward flows of GDN Chunked Prefill to a certain extent, effectively reducing Tensor Core, CUDA Core, and SFU overhead without sacrificing numerical precision.\n\n3.**TileLang fused warp-specialized kernels**. Rather than following the step-by-step decomposition into independent kernels, nor fusing the entire computation flow into a single kernel, we take CP and backward requirements into account, use TileLang to build several key fused kernels, and manually implement warpgroup specialization to overlap data movement, Tensor Core computation, and CUDA Core computation.\n\n## Requirements\n\n- SM90 or above\n- CUDA 12.8 or above\n- PyTorch 2.8 or above\n\n## Installation\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FQwenLM\u002FFlashQLA.git\ncd FlashQLA\npip install -v .\n```\n\n## Usage\n\n### High-level API\n\n```python\nimport torch\nfrom flash_qla import chunk_gated_delta_rule\n\no, final_state = chunk_gated_delta_rule(\n    q=q,          # [B, T, H_q, K]\n    k=k,          # [B, T, H_q, K]\n    v=v,          # [B, T, H_v, V]\n    g=g,          # [B, T, H_v]\n    beta=beta,    # [B, T, H_v]\n    scale=scale,\n    initial_state=initial_state,   # optional, [B, H_v, K, V]\n    output_final_state=True,\n    cu_seqlens=cu_seqlens,         # optional, for variable-length sequences\n)\n```\n\n### Low-level API\n\nFor separate forward and backward calls:\n\n```python\nfrom flash_qla import chunk_gated_delta_rule_fwd, chunk_gated_delta_rule_bwd\n\n# Forward\ng, A, o, h, final_state = chunk_gated_delta_rule_fwd(\n    q, k, v, g, beta, scale=scale, initial_state=h0, cu_seqlens=cu_seqlens\n)\n\n# Backward\ndq, dk, dv, db, dg, dh0 = chunk_gated_delta_rule_bwd(\n    q, k, v, g, beta, A, do, dht=dht, scale=scale, initial_state=h0, cu_seqlens=cu_seqlens\n)\n```\n\n## Tests\n\n```bash\n# require flash linear attention for comparison\npip install flash_linear_attention==0.5.0\n\ncd tests\npython test_gdr.py --set develop\npython test_gdr.py --set varlen --num-heads 32\npython test_gdr.py --set profile --num-heads 32\npython test_gdr.py --set product --ref-dtype float32 --num-heads 32\n```\n\n## Benchmark\n\nWe benchmarked FlashQLA against the FLA Triton and FlashInfer baseline (FLA 0.5.0, Triton 3.5.1, FlashInfer 0.6.9, TileLang 0.1.8) on the head configurations used by the Qwen3.5 \u002F Qwen3.6 family h_k,v \\in {64, 48, 32, 24, 16, 8}, corresponding to TP1 through TP8.\n\n\u003Cp align=\"center\">\n    \u003Cimg src=\"https:\u002F\u002Fqianwen-res.oss-cn-beijing.aliyuncs.com\u002Fflashqla\u002Ffwd_bwd_latency_comparison.png\" width=\"1000\"\u002F>\n\u003Cp>\n\nSpecifically, the forward (FWD) benchmarks measure single-kernel latency for different models and TP settings under varying batch lengths, while the backward (BWD) benchmarks examine the relationship between total token count within a batch and latency during a single update step.\n\nMore detail in [benchmark_results_H200.txt](.\u002Fbenchmark\u002Fbenchmark_results_H200.txt).\n\n```bash\n# require flash linear attention and flashinfer for comparison\npip install flash_linear_attention==0.5.0 flashinfer-python==0.6.9\n\ncd benchmark\npython bench_gated_delta_rule.py\n```\n\n## Acknowledge\n\nFlashQLA is inspired by [Flash Linear Attention](https:\u002F\u002Fgithub.com\u002Ffla-org\u002Fflash-linear-attention), [TileLang](https:\u002F\u002Fgithub.com\u002Ftile-ai\u002Ftilelang) and [FlashInfer](https:\u002F\u002Fgithub.com\u002Fflashinfer-ai\u002Fflashinfer\u002F) projects.\n\n## License\n\nFlashQLA is released under the MIT License.\n\n## Citation\n\n```bibtex\n@misc{flashqla2025,\n    title={FlashQLA: Flash Qwen Linear Attention},\n    author={Zhang, Chengruidong and Lin, Xi and Jiang, Huiqiang and Wang, Zekun and Li, Xiao and Cao, Yizhong and Zhuang, Bohan and Men, Rui and Zhang, Jianwei and Zheng, Bo and Lin, Junyang and Liu, Dayiheng and Zhou, Jingren},\n    year={2026},\n    publisher={GitHub},\n    howpublished={\\url{https:\u002F\u002Fgithub.com\u002FQwenLM\u002FFlashQLA}},\n}\n```\n","FlashQLA 是一个基于 TileLang 构建的高性能线性注意力内核库。它通过合理的算子融合和性能优化，实现了在 NVIDIA Hopper 架构上 GDN Chunked Prefill 的前向和后向传播过程中比 FLA Triton 内核快 2-3 倍的速度提升。该项目的核心功能包括门驱动的自动单卡上下文并行、硬件友好的代数重构以及使用 TileLang 构建的融合 warp 特化内核，这些特点有助于提高 GPU SM 利用率，减少 Tensor Core、CUDA Core 和 SFU 开销，并实现数据移动与计算的有效重叠。FlashQLA 适用于预训练场景及边缘端代理推理等需要高效处理长序列数据的情况。","2026-06-11 02:40:39","CREATED_QUERY"]