[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-71186":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":25,"hasPages":23,"topics":26,"createdAt":10,"pushedAt":10,"updatedAt":31,"readmeContent":32,"aiSummary":33,"trendingCount":16,"starSnapshotCount":16,"syncStatus":34,"lastSyncTime":35,"discoverSource":36},71186,"flash-linear-attention","fla-org\u002Fflash-linear-attention","fla-org","🚀 Efficient implementations for emerging model architectures","https:\u002F\u002Fgithub.com\u002Ffla-org\u002Fflash-linear-attention",null,"Python",5204,549,33,38,0,14,41,131,42,114.22,"MIT License",false,"main",true,[27,28,29,30],"large-language-models","machine-learning-systems","natural-language-processing","sequence-modeling","2026-06-12 04:00:59","\u003Cdiv align=\"center\">\n\n\u003Cimg width=\"50%\" alt=\"Flash Linear Attention\" src=\"images\u002Flogo.png\">\n\n[![hf_model](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F-Models-gray.svg?logo=huggingface&style=flat-square)](https:\u002F\u002Fhuggingface.co\u002Ffla-hub) [![Discord](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDiscord-%235865F2.svg?&logo=discord&logoColor=white&style=flat-square)](https:\u002F\u002Fdiscord.gg\u002FvDaJTmKNcS)\n\n\u003C\u002Fdiv>\n\n\u003Cp>\n  💥 Flash Linear Attention brings together hardware-efficient building blocks, training-ready layers, and components for modern sequence models, spanning linear attention, sparse attention, state space models, and hybrid LLM architectures. All implementations are platform-agnostic and verified on NVIDIA, AMD, and Intel hardware. 
Pull requests are welcome!\n\u003C\u002Fp>\n\n--------\n\n* [News](#news)\n* [Models](#models)\n* [Installation](#installation)\n* [Usage](#usage)\n  * [Token Mixing](#token-mixing)\n  * [Fused Modules](#fused-modules)\n  * [Generation](#generation)\n  * [Hybrid Models](#hybrid-models)\n* [Training](#training)\n* [Evaluation](#evaluation)\n* [Benchmarks](#benchmarks)\n* [Citation](#citation)\n* [Star History](#star-history)\n* [Acknowledgements](#acknowledgements)\n\n## News\n\n- [2026-05] 🦅 Add Raven implementation to `fla` ([repo](https:\u002F\u002Fgithub.com\u002Fgoombalab\u002Fraven)).\n- [2026-05] 🚀 Add [YOCO](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.05254) (You Only Cache Once) implementation to `fla`.\n- [2026-05] ⚡ Add fused [AttnRes](fla\u002Fops\u002Fattnres) support to `fla` ([paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.15031)).\n- [2026-04] 🐍 Add Mamba3 implementation to `fla` ([paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.15569)).\n- [2026-04] 🧱 Add [MoBA](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.13189) (Mixture of Block Attention) implementation to `fla`, with [FlashMoBA](https:\u002F\u002Fgithub.com\u002Fmit-han-lab\u002Fflash-moba) backend support.\n- [2026-04] 🧱 Add [TileLang](https:\u002F\u002Fgithub.com\u002Ftile-ai\u002Ftilelang) backend support for selected kernels.\n- [2026-04] 🎯 Add [GPT-OSS](https:\u002F\u002Fopenai.com\u002Findex\u002Fintroducing-gpt-oss\u002F)-style attention sink support to `fla`'s attention kernels.\n- [2026-03] 🚀 Add [Context Parallel](fla\u002Fops\u002Fcp\u002FREADME.md) support for KDA and GDN, enabling efficient distributed training across sequence dimension.\n- [2025-10] 🌘 Add Kimi Delta Attention (KDA) implementation to `fla` ([paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.26692)).\n- [2025-09] 🌲 Add DeltaFormer implementation to `fla` ([paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.19488v1)).\n- [2025-09] 🐻 Thrilled to announce that 
[GDN](fla\u002Fops\u002Fgated_delta_rule) has been integrated into Qwen3-Next. Check out their [blog post](https:\u002F\u002Fqwen.ai\u002Fblog?id=4074cca80393150c248e508aa62983f9cb7d27cd&from=research.latest-advancements-list) for more info!\n- [2025-08] 🌲 Add Log-Linear Attention implementation to `fla` ([paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.04761)).\n- [2025-08] 🎓 Add MoM implementation to `fla` ([paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.13685)).\n- [2025-07] 🐳 Add MLA implementation to `fla` ([paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.04434)).\n- [2025-07] 🛣️ Add PaTH Attention implementation to `fla` ([paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.16381)).\n- [2025-06] 🎉 Add MesaNet implementation to `fla` ([paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.05233)).\n- [2025-06] 🐍 Add Comba implementation to `fla` ([paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.02475)).\n\n\u003Cdetails>\n\u003Csummary>Older news\u003C\u002Fsummary>\n\n- [2025-05] 🎉 Add Rodimus&ast; implementation to `fla` ([paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.06577)).\n- [2025-04] 🎉 Add DeltaProduct implementation to `fla` ([paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.10297)).\n- [2025-04] 🎉 Add FoX implementation to `fla` ([paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.02130)).\n- [2025-03] ~~We have changed the default `initializer_range` to the magic 🐳 0.006~~ The `initializer_range` was rolled back to the default value of 0.02. For actual training, we recommend trying both.\n- [2025-02] 🐳 Add NSA implementations to `fla`. See kernels [here](fla\u002Fops\u002Fnsa).\n- [2025-01] 🔥 We are migrating to `torchtitan`-based training framework. 
Check out the [flame](https:\u002F\u002Fgithub.com\u002Ffla-org\u002Fflame) repo for more details.\n- [2025-01] 🦅 Add RWKV7 implementations (both kernels and models) to `fla`.\n- [2024-12] Add `flash-bidirectional-attention` to `fla-org` ([repo](https:\u002F\u002Fgithub.com\u002Ffla-org\u002Fflash-bidirectional-linear-attention)).\n- [2024-12] 🎉 Add Gated DeltaNet implementation to `fla` ([paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.06464)).\n- [2024-12] 🚀 `fla` now officially supports kernels with variable-length inputs.\n- [2024-11] The inputs are now switched from head-first to seq-first format.\n- [2024-11] 💥 `fla` now provides a flexible way for training hybrid models.\n- [2024-10] 🔥 Announcing `flame`, a minimal and scalable framework for training `fla` models. Check out the details [here](https:\u002F\u002Fgithub.com\u002Ffla-org\u002Fflame).\n- [2024-09] `fla` now includes a fused linear and cross-entropy layer, significantly reducing memory usage during training.\n- [2024-09] 🎉 Add GSA implementation to `fla` ([paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.07146)).\n- [2024-05] 🎉 Add DeltaNet implementation to `fla` ([paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2102.11174)).\n- [2024-05] 💥 `fla` v0.1: a variety of subquadratic kernels\u002Flayers\u002Fmodels integrated (RetNet\u002FGLA\u002FMamba\u002FHGRN\u002FHGRN2\u002FRWKV6, etc., see [Models](#models)).\n- [2023-12] 💥 Launch `fla`, offering a collection of implementations for state-of-the-art linear attention models.\n\n\u003C\u002Fdetails>\n\n## Models\n\n| Year | Model                | Paper                                                                                                                                         | Code                                                                                            |                                                                                                       |\n| :--- | :------------------- | 
:-------------------------------------------------------------------------------------------------------------------------------------------- | :---------------------------------------------------------------------------------------------- | :---------------------------------------------------------------------------------------------------: |\n| 2022 | ABC                  | [ABC: Attention with Bounded-memory Control](https:\u002F\u002Farxiv.org\u002Fabs\u002F2110.02488)                                                                |                                                                                                 |         [fla](https:\u002F\u002Fgithub.com\u002Ffla-org\u002Fflash-linear-attention\u002Fblob\u002Fmain\u002Ffla\u002Flayers\u002Fabc.py)          |\n| 2023 | RetNet               | [Retentive network: a successor to transformer for large language models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2307.08621)                                   | [official](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Ftorchscale\u002Ftree\u002Fmain)                                   | [fla](https:\u002F\u002Fgithub.com\u002Ffla-org\u002Fflash-linear-attention\u002Fblob\u002Fmain\u002Ffla\u002Flayers\u002Fmultiscale_retention.py) |\n| 2023 | HGRN                 | [Hierarchically Gated Recurrent Neural Network for Sequence Modeling](https:\u002F\u002Fopenreview.net\u002Fforum?id=P1TCHxJwLB)                             | [official](https:\u002F\u002Fgithub.com\u002FOpenNLPLab\u002FHGRN)                                                  |         [fla](https:\u002F\u002Fgithub.com\u002Ffla-org\u002Fflash-linear-attention\u002Fblob\u002Fmain\u002Ffla\u002Flayers\u002Fhgrn.py)         |\n| 2024 | GLA                  | [Gated Linear Attention Transformers with Hardware-Efficient Training](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.06635)                                      | 
[official](https:\u002F\u002Fgithub.com\u002Fberlino\u002Fgated_linear_attention)                                   |         [fla](https:\u002F\u002Fgithub.com\u002Ffla-org\u002Fflash-linear-attention\u002Fblob\u002Fmain\u002Ffla\u002Flayers\u002Fgla.py)          |\n| 2024 | Based                | [Simple linear attention language models balance the recall-throughput tradeoff](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.18668)                            | [official](https:\u002F\u002Fgithub.com\u002FHazyResearch\u002Fbased)                                               |        [fla](https:\u002F\u002Fgithub.com\u002Ffla-org\u002Fflash-linear-attention\u002Fblob\u002Fmain\u002Ffla\u002Flayers\u002Fbased.py)         |\n| 2024 | Rebased              | [Linear Transformers with Learnable Kernel Functions are Better In-Context Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.10644)                          | [official](https:\u002F\u002Fgithub.com\u002Fcorl-team\u002Frebased\u002F)                                               |       [fla](https:\u002F\u002Fgithub.com\u002Ffla-org\u002Fflash-linear-attention\u002Fblob\u002Fmain\u002Ffla\u002Flayers\u002Frebased.py)        |\n| 2024 | DeltaNet             | [Parallelizing Linear Transformers with Delta Rule over Sequence Length](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.06484)                                    | [official](https:\u002F\u002Fgithub.com\u002Ffla-org\u002Fflash-linear-attention\u002Fblob\u002Fmain\u002Ffla\u002Flayers\u002Fdelta_net.py) |      [fla](https:\u002F\u002Fgithub.com\u002Ffla-org\u002Fflash-linear-attention\u002Fblob\u002Fmain\u002Ffla\u002Flayers\u002Fdelta_net.py)       |\n| 2024 | HGRN2                | [HGRN2: Gated Linear RNNs with State Expansion](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.07904)                                                             | [official](https:\u002F\u002Fgithub.com\u002FOpenNLPLab\u002FHGRN2)                                               
  |        [fla](https:\u002F\u002Fgithub.com\u002Ffla-org\u002Fflash-linear-attention\u002Fblob\u002Fmain\u002Ffla\u002Flayers\u002Fhgrn2.py)         |\n| 2024 | RWKV6                | [Eagle and Finch: RWKV with Matrix-Valued States and Dynamic Recurrence](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.05892)                                    | [official](https:\u002F\u002Fgithub.com\u002FRWKV\u002FRWKV-LM)                                                     |        [fla](https:\u002F\u002Fgithub.com\u002Ffla-org\u002Fflash-linear-attention\u002Fblob\u002Fmain\u002Ffla\u002Flayers\u002Frwkv6.py)         |\n| 2024 | LightNet             | [You Only Scan Once: Efficient Multi-dimension Sequential Modeling with LightNet](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.21022)                           | [official](https:\u002F\u002Fgithub.com\u002FOpenNLPLab\u002FLightNet)                                              |       [fla](https:\u002F\u002Fgithub.com\u002Ffla-org\u002Fflash-linear-attention\u002Fblob\u002Fmain\u002Ffla\u002Flayers\u002Flightnet.py)       |\n| 2024 | YOCO                 | [You Only Cache Once: Decoder-Decoder Architectures for Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.05254)                                    | [official](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Funilm\u002Ftree\u002Fmaster\u002FYOCO)                                 |          [fla](https:\u002F\u002Fgithub.com\u002Ffla-org\u002Fflash-linear-attention\u002Fblob\u002Fmain\u002Ffla\u002Fmodels\u002Fyoco)           |\n| 2024 | Mamba2               | [Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.21060) | [official](https:\u002F\u002Fgithub.com\u002Fstate-spaces\u002Fmamba)                                               |         
[fla](https:\u002F\u002Fgithub.com\u002Ffla-org\u002Fflash-linear-attention\u002Fblob\u002Fmain\u002Ffla\u002Fmodels\u002Fmamba2)          |\n| 2024 | GSA                  | [Gated Slot Attention for Efficient Linear-Time Sequence Modeling](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.07146)                                          | [official](https:\u002F\u002Fgithub.com\u002Ffla-org\u002Fflash-linear-attention\u002Ftree\u002Fmain\u002Ffla\u002Fmodels\u002Fgsa)          |           [fla](https:\u002F\u002Fgithub.com\u002Ffla-org\u002Fflash-linear-attention\u002Ftree\u002Fmain\u002Ffla\u002Fmodels\u002Fgsa)           |\n| 2024 | MLA                  | [DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.04434)                        | [official](https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FDeepSeek-V2)                                          |         [fla](https:\u002F\u002Fgithub.com\u002Ffla-org\u002Fflash-linear-attention\u002Fblob\u002Fmain\u002Ffla\u002Flayers\u002Fmla.py)          |\n| 2025 | Samba                | [Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.07522)                 | [official](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FSamba)                                                  |          [fla](https:\u002F\u002Fgithub.com\u002Ffla-org\u002Fflash-linear-attention\u002Fblob\u002Fmain\u002Ffla\u002Fmodels\u002Fsamba)          |\n| 2025 | Gated DeltaNet       | [Gated Delta Networks: Improving Mamba2 with Delta Rule](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.06464)                                                    | [official](https:\u002F\u002Fgithub.com\u002FNVlabs\u002FGatedDeltaNet)                                             |      
[fla](https:\u002F\u002Fgithub.com\u002Ffla-org\u002Fflash-linear-attention\u002Ftree\u002Fmain\u002Ffla\u002Fops\u002Fgated_delta_rule)      |\n| 2025 | RWKV7                | [RWKV-7 \"Goose\" with Expressive Dynamic State Evolution](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.14456)                                                    | [official](https:\u002F\u002Fgithub.com\u002FBlinkDL\u002FRWKV-LM\u002Ftree\u002Fmain\u002FRWKV-v7)                                |           [fla](https:\u002F\u002Fgithub.com\u002Ffla-org\u002Fflash-linear-attention\u002Ftree\u002Fmain\u002Ffla\u002Fops\u002Frwkv7)            |\n| 2025 | NSA                  | [Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.11089)                         |                                                                                                 |            [fla](https:\u002F\u002Fgithub.com\u002Ffla-org\u002Fflash-linear-attention\u002Ftree\u002Fmain\u002Ffla\u002Fops\u002Fnsa)             |\n| 2025 | FoX                  | [Forgetting Transformer: Softmax Attention with a Forget Gate](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.02130)                                              | [official](https:\u002F\u002Fgithub.com\u002Fzhixuan-lin\u002Fforgetting-transformer)                               |      [fla](https:\u002F\u002Fgithub.com\u002Ffla-org\u002Fflash-linear-attention\u002Ftree\u002Fmain\u002Ffla\u002Fops\u002Fforgetting_attn)       |\n| 2025 | DeltaProduct         | [DeltaProduct: Improving State-Tracking in Linear RNNs via Householder Products](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.10297)                            |                                                                                                 |  [fla](https:\u002F\u002Fgithub.com\u002Ffla-org\u002Fflash-linear-attention\u002Fblob\u002Fmain\u002Ffla\u002Flayers\u002Fgated_deltaproduct.py)  |\n| 2025 | Rodimus&ast; 
        | [Rodimus*: Breaking the Accuracy-Efficiency Trade-Off with Efficient Attentions](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.06577)                            | [official](https:\u002F\u002Fgithub.com\u002Fcodefuse-ai\u002Frodimus)                                              |       [fla](https:\u002F\u002Fgithub.com\u002Ffla-org\u002Fflash-linear-attention\u002Fblob\u002Fmain\u002Ffla\u002Flayers\u002Frodimus.py)        |\n| 2025 | MesaNet              | [MesaNet: Sequence Modeling by Locally Optimal Test-Time Training](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.05233)                                          |                                                                                                 |       [fla](https:\u002F\u002Fgithub.com\u002Ffla-org\u002Fflash-linear-attention\u002Fblob\u002Fmain\u002Ffla\u002Flayers\u002Fmesa_net.py)       |\n| 2025 | Comba                | [Comba: Improving Bilinear RNNs with Closed-loop Control](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.02475)                                                   | [official](https:\u002F\u002Fgithub.com\u002FAwesomeSeq\u002FComba-triton)                                          |        [fla](https:\u002F\u002Fgithub.com\u002Ffla-org\u002Fflash-linear-attention\u002Fblob\u002Fmain\u002Ffla\u002Flayers\u002Fcomba.py)         |\n| 2025 | PaTH                 | [PaTH Attention: Position Encoding via Accumulating Householder Transformations](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.16381)                            |                                                                                                 |      [fla](https:\u002F\u002Fgithub.com\u002Ffla-org\u002Fflash-linear-attention\u002Fblob\u002Fmain\u002Ffla\u002Flayers\u002Fpath_attn.py)       |\n| 2025 | MoM                  | [MoM: Linear Sequence Modeling with Mixture-of-Memories](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.13685)                                                    | 
[official](https:\u002F\u002Fgithub.com\u002FOpenSparseLLMs\u002FMoM)                                               |         [fla](https:\u002F\u002Fgithub.com\u002Ffla-org\u002Fflash-linear-attention\u002Fblob\u002Fmain\u002Ffla\u002Flayers\u002Fmom.py)          |\n| 2025 | Log-Linear Attention | [Log-Linear Attention](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.04761)                                                                                      | [official](https:\u002F\u002Fgithub.com\u002FHanGuo97\u002Flog-linear-attention)                                    |      [fla](https:\u002F\u002Fgithub.com\u002Ffla-org\u002Fflash-linear-attention\u002Ftree\u002Fmain\u002Ffla\u002Fops\u002Flog_linear_attn)       |\n| 2025 | DeltaFormer          | [Understanding Transformer from the Perspective of Associative Memory](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.19488v1)                                    |                                                                                                 |     [fla](https:\u002F\u002Fgithub.com\u002Ffla-org\u002Fflash-linear-attention\u002Fblob\u002Fmain\u002Ffla\u002Flayers\u002Fdeltaformer.py)      |\n| 2025 | KDA                  | [Kimi Linear: An Expressive, Efficient Attention Architecture](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.26692)                                              |                                                                                                 |            [fla](https:\u002F\u002Fgithub.com\u002Ffla-org\u002Fflash-linear-attention\u002Ftree\u002Fmain\u002Ffla\u002Fops\u002Fkda)             |\n| 2025 | MoBA                 | [MoBA: Mixture of Block Attention for Long-Context LLMs](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.13189)                                                    | [official](https:\u002F\u002Fgithub.com\u002FMoonshotAI\u002FMoBA)                                                  |         
[fla](https:\u002F\u002Fgithub.com\u002Ffla-org\u002Fflash-linear-attention\u002Fblob\u002Fmain\u002Ffla\u002Flayers\u002Fmoba.py)         |\n| 2026 | Mamba3               | [Mamba-3: Improved Sequence Modeling using State Space Principles](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.15569)                                          | [official](https:\u002F\u002Fgithub.com\u002Fstate-spaces\u002Fmamba)                                               |         [fla](https:\u002F\u002Fgithub.com\u002Ffla-org\u002Fflash-linear-attention\u002Fblob\u002Fmain\u002Ffla\u002Fmodels\u002Fmamba3)          |\n| 2026 | Raven                | [Raven: High-Recall Sequence Modeling with Sparse Memory Routing](https:\u002F\u002Fgithub.com\u002Fgoombalab\u002Fraven\u002Fblob\u002Fmain\u002Fraven.pdf)                     | [official](https:\u002F\u002Fgithub.com\u002Fgoombalab\u002Fraven)                                                  |          [fla](https:\u002F\u002Fgithub.com\u002Ffla-org\u002Fflash-linear-attention\u002Ftree\u002Fmain\u002Ffla\u002Fmodels\u002Fraven)          |\n\n## Installation\n\n[![nvidia-h100-ci](https:\u002F\u002Fgithub.com\u002Ffla-org\u002Fflash-linear-attention\u002Factions\u002Fworkflows\u002Fnvidia-h100.yml\u002Fbadge.svg?branch=main&event=push)](https:\u002F\u002Fgithub.com\u002Ffla-org\u002Fflash-linear-attention\u002Factions\u002Fworkflows\u002Fnvidia-h100.yml)\n\nThe following requirements should be satisfied\n- [PyTorch](https:\u002F\u002Fpytorch.org\u002F) >= 2.7.0\n- [Triton](https:\u002F\u002Fgithub.com\u002Ftriton-lang\u002Ftriton) >= 3.3 (or nightly version, see [FAQs](FAQs.md))\n- [einops](https:\u002F\u002Feinops.rocks\u002F)\n- [transformers](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftransformers) >=4.45.0\n- [datasets](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets) >=3.3.0\n\nStarting from v0.3.2, the packages published on PyPI are `fla-core` and `flash-linear-attention`. 
The former contains all of our custom kernels and depends only on PyTorch, Triton, and einops. The latter extends the former with `fla\u002Flayers` and `fla\u002Fmodels`, and additionally depends on transformers. We also provide Triton implementations for conv1d operations, so causal-conv1d is not required.\n\nYou can install `fla` with pip:\n```sh\npip install flash-linear-attention\n```\n\n`fla` is under active development, so to get the latest features and updates you can instead install the package from source. Note that installing from git uses the default mode, so you need to uninstall both `fla-core` and `flash-linear-attention` first:\n```sh\n# uninstall both packages first to ensure a successful upgrade\npip uninstall fla-core flash-linear-attention -y && pip install -U git+https:\u002F\u002Fgithub.com\u002Ffla-org\u002Fflash-linear-attention\n```\nAlternatively, manage `fla` as a git submodule:\n```sh\ngit submodule add https:\u002F\u002Fgithub.com\u002Ffla-org\u002Fflash-linear-attention.git 3rdparty\u002Fflash-linear-attention\nln -s 3rdparty\u002Fflash-linear-attention\u002Ffla fla\n```\n\n> [!NOTE]\n> For AMD GPUs, make sure to install the [Triton ROCm backend](https:\u002F\u002Fgithub.com\u002Ftriton-lang\u002Ftriton). For Intel GPUs, use the [Triton XPU backend](https:\u002F\u002Fgithub.com\u002Fintel\u002Fintel-xpu-backend-for-triton). 
See [FAQs](FAQs.md) for more details.\n\nIf you have installed `triton-nightly` and a pre-release version of `torch`, use the following command:\n```sh\npip install einops ninja datasets transformers numpy\n# uninstall both packages first to ensure a successful upgrade\npip uninstall fla-core flash-linear-attention -y && pip install -U --no-use-pep517 git+https:\u002F\u002Fgithub.com\u002Ffla-org\u002Fflash-linear-attention --no-deps\n```\n\n\n## Usage\n\n### Token Mixing\n\nWe provide \"token mixing\" linear attention layers in `fla.layers`.\nYou can use them to replace the standard multi-head attention layer in your model.\nFor example:\n```py\n>>> import torch\n>>> from fla.layers import MultiScaleRetention\n>>> batch_size, num_heads, seq_len, hidden_size = 32, 4, 2048, 1024\n>>> device, dtype = 'cuda:0', torch.bfloat16\n>>> retnet = MultiScaleRetention(hidden_size=hidden_size, num_heads=num_heads).to(device=device, dtype=dtype)\n>>> x = torch.randn(batch_size, seq_len, hidden_size).to(device=device, dtype=dtype)\n>>> y, *_ = retnet(x)\n>>> y.shape\ntorch.Size([32, 2048, 1024])\n```\n\nWe also provide model implementations that are compatible with the 🤗 Transformers library.\nHere's an example of how to initialize a GLA model from the default config in `fla`:\n\n```py\n>>> from fla.models import GLAConfig\n>>> from transformers import AutoModelForCausalLM\n>>> config = GLAConfig()\n>>> model = AutoModelForCausalLM.from_config(config)\n```\n\n\u003Cdetails>\n\u003Csummary>Click to expand config and model structure\u003C\u002Fsummary>\n\n```py\n>>> config\nGLAConfig {\n  \"attn\": null,\n  \"attn_mode\": \"chunk\",\n  \"bos_token_id\": 1,\n  \"clamp_min\": null,\n  \"conv_size\": 4,\n  \"elementwise_affine\": true,\n  \"eos_token_id\": 2,\n  \"expand_k\": 0.5,\n  \"expand_v\": 1,\n  \"feature_map\": null,\n  \"fuse_cross_entropy\": true,\n  \"fuse_norm\": true,\n  \"fuse_swiglu\": true,\n  \"hidden_act\": 
\"swish\",\n  \"hidden_ratio\": 4,\n  \"hidden_size\": 2048,\n  \"initializer_range\": 0.02,\n  \"intermediate_size\": null,\n  \"max_position_embeddings\": 2048,\n  \"model_type\": \"gla\",\n  \"norm_eps\": 1e-06,\n  \"num_heads\": 4,\n  \"num_hidden_layers\": 24,\n  \"num_kv_heads\": null,\n  \"tie_word_embeddings\": false,\n  \"transformers_version\": \"4.50.1\",\n  \"use_cache\": true,\n  \"use_gk\": true,\n  \"use_gv\": false,\n  \"use_output_gate\": true,\n  \"use_short_conv\": false,\n  \"vocab_size\": 32000\n}\n\n>>> model\nGLAForCausalLM(\n  (model): GLAModel(\n    (embeddings): Embedding(32000, 2048)\n    (layers): ModuleList(\n      (0-23): 24 x GLABlock(\n        (attn_norm): RMSNorm(2048, eps=1e-06)\n        (attn): GatedLinearAttention(\n          (q_proj): Linear(in_features=2048, out_features=1024, bias=False)\n          (k_proj): Linear(in_features=2048, out_features=1024, bias=False)\n          (v_proj): Linear(in_features=2048, out_features=2048, bias=False)\n          (g_proj): Linear(in_features=2048, out_features=2048, bias=False)\n          (gk_proj): Sequential(\n            (0): Linear(in_features=2048, out_features=16, bias=False)\n            (1): Linear(in_features=16, out_features=1024, bias=True)\n          )\n          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)\n          (g_norm_swish_gate): FusedRMSNormGated(512, eps=1e-06, activation=swish)\n        )\n        (mlp_norm): RMSNorm(2048, eps=1e-06)\n        (mlp): GatedMLP(\n          (gate_proj): Linear(in_features=2048, out_features=5632, bias=False)\n          (up_proj): Linear(in_features=2048, out_features=5632, bias=False)\n          (down_proj): Linear(in_features=5632, out_features=2048, bias=False)\n          (swiglu_linear): SwiGLULinear()\n        )\n      )\n    )\n    (norm): RMSNorm(2048, eps=1e-06)\n  )\n  (lm_head): Linear(in_features=2048, out_features=32000, bias=False)\n)\n```\n\n\u003C\u002Fdetails>\n\n### Fused Modules\n\nWe offer a 
collection of fused modules in `fla.modules` to facilitate faster training:\n\n* [`Rotary Embedding`](fla\u002Fmodules\u002Frotary.py): rotary positional embeddings as adopted by the Llama architecture, a.k.a. Transformer++.\n* [`Norm Layers`](fla\u002Fmodules\u002Flayernorm.py):\n  * `RMSNorm`, `LayerNorm` and `GroupNorm`\n  * `RMSNormLinear`, `LayerNormLinear` and `GroupNormLinear`, which reduce the memory used by intermediate tensors.\n* [`Norm Layers with Gating`](fla\u002Fmodules\u002Ffused_norm_gate.py): combine norm layers with element-wise sigmoid or swish gating, as used by RetNet\u002FGLA.\n* [`Cross Entropy`](fla\u002Fmodules\u002Ffused_cross_entropy.py): faster Triton implementation of cross entropy loss.\n* [`Linear Cross Entropy`](fla\u002Fmodules\u002Ffused_linear_cross_entropy.py): fused linear layer and cross entropy loss to avoid materializing large logits tensors. Also refer to implementations by [mgmalek](https:\u002F\u002Fgithub.com\u002Fmgmalek\u002Fefficient_cross_entropy) and [Liger-Kernel](https:\u002F\u002Fgithub.com\u002Flinkedin\u002FLiger-Kernel\u002Fblob\u002Fmain\u002Fsrc\u002Fliger_kernel\u002Fops\u002Ffused_linear_cross_entropy.py).\n* [`Linear KL Divergence`](fla\u002Fmodules\u002Ffused_kl_div.py): fused linear layer and KL divergence loss, in the same vein as the fused CE loss.\n\n> [!IMPORTANT]\n> Set `fuse_linear_cross_entropy` in the model configuration to enable or disable the fused linear cross entropy loss.\n>\n> This fused implementation is more memory-efficient but may reduce numerical precision. 
Due to this trade-off, it is disabled by default.\n> If you enable this feature and encounter training instability (e.g., loss divergence), we recommend disabling it to see if the issue is resolved.\n\n### Generation\n\nOnce a model has been pretrained, you can generate text with it through the 🤗 text generation APIs.\nA generation example follows:\n```py\n>>> import fla\n>>> from transformers import AutoModelForCausalLM, AutoTokenizer\n>>> name = 'fla-hub\u002Fgla-1.3B-100B'\n>>> tokenizer = AutoTokenizer.from_pretrained(name)\n>>> model = AutoModelForCausalLM.from_pretrained(name).cuda()\n>>> input_prompt = \"Power goes with permanence. Impermanence is impotence. And rotation is castration.\"\n>>> input_ids = tokenizer(input_prompt, return_tensors=\"pt\").input_ids.cuda()\n>>> outputs = model.generate(input_ids, max_length=64)\n>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]\n```\n\nWe also provide a simple script [here](benchmarks\u002Fbenchmark_generation.py) for benchmarking generation speed.\nRun it with:\n```sh\n$ python -m benchmarks.benchmark_generation \\\n  --path 'fla-hub\u002Fgla-1.3B-100B' \\\n  --repetition_penalty 2. \\\n  --prompt=\"Hello everyone, I'm Songlin Yang\"\n\nPrompt:\nHello everyone, I'm Songlin Yang\nGenerated:\nHello everyone, I'm Songlin Yang.\nI am a 20 year old girl from China who is currently studying in the United States of America for my Master degree and also working as an English teacher at school here on campus since last summer (1st semester). 
My main goal to be able do well with this course so that we can have\n\nPrompt length: 10, generation length: 64\nTotal prompt processing + decoding time: 4593ms\n```\n\nAll of the pretrained models currently available can be found in [`fla-hub`](https:\u002F\u002Fhuggingface.co\u002Ffla-hub).\n```py\n>>> from huggingface_hub import list_models\n>>> for model in list_models(author='fla-hub'): print(model.id)\n```\n\n### Hybrid Models\n\n`fla` provides a flexible method to incorporate standard attention layers into existing linear attention models.\nThis is easily achieved by specifying the `attn` argument in the model configuration.\n\nFor example, to create a 2-layer Samba model with one Mamba layer followed by one local attention layer, using a sliding window size of 2048:\n\n```py\n>>> from fla.models import SambaConfig\n>>> from transformers import AutoModelForCausalLM\n>>> config = SambaConfig(num_hidden_layers=2)\n>>> config.attn = {\n  'layers': [1],\n  'num_heads': 18,\n  'num_kv_heads': 18,\n  'qkv_bias': False,\n  'rope_theta': 10000.,\n  'window_size': 2048\n}\n>>> model = AutoModelForCausalLM.from_config(config)\n```\n\n\u003Cdetails>\n\u003Csummary>Click to expand config and model structure\u003C\u002Fsummary>\n\n```py\n>>> config\nSambaConfig {\n  \"attn\": {\n    \"layers\": [\n      1\n    ],\n    \"num_heads\": 18,\n    \"num_kv_heads\": 18,\n    \"qkv_bias\": false,\n    \"rope_theta\": 10000.0,\n    \"window_size\": 2048\n  },\n  \"bos_token_id\": 1,\n  \"conv_kernel\": 4,\n  \"eos_token_id\": 2,\n  \"expand\": 2,\n  \"fuse_cross_entropy\": true,\n  \"fuse_norm\": true,\n  \"fuse_swiglu\": true,\n  \"hidden_act\": \"swish\",\n  \"hidden_ratio\": 4,\n  \"hidden_size\": 2304,\n  \"initializer_range\": 0.02,\n  \"intermediate_size\": 4608,\n  \"max_position_embeddings\": 2048,\n  \"model_type\": \"samba\",\n  \"norm_eps\": 1e-05,\n  \"num_hidden_layers\": 2,\n  \"pad_token_id\": 0,\n  \"rescale_prenorm_residual\": false,\n  \"residual_in_fp32\": 
false,\n  \"state_size\": 16,\n  \"tie_word_embeddings\": false,\n  \"time_step_floor\": 0.0001,\n  \"time_step_init_scheme\": \"random\",\n  \"time_step_max\": 0.1,\n  \"time_step_min\": 0.001,\n  \"time_step_rank\": 144,\n  \"time_step_scale\": 1.0,\n  \"transformers_version\": \"4.50.1\",\n  \"use_bias\": false,\n  \"use_cache\": true,\n  \"use_conv_bias\": true,\n  \"vocab_size\": 32000\n}\n\n>>> model\nSambaForCausalLM(\n  (backbone): SambaModel(\n    (embeddings): Embedding(32000, 2304)\n    (layers): ModuleList(\n      (0): SambaBlock(\n        (mixer_norm): RMSNorm(2304, eps=1e-05)\n        (mixer): Mamba(\n          (conv1d): Conv1d(4608, 4608, kernel_size=(4,), stride=(1,), padding=(3,), groups=4608)\n          (in_proj): Linear(in_features=2304, out_features=9216, bias=False)\n          (x_proj): Linear(in_features=4608, out_features=176, bias=False)\n          (dt_proj): Linear(in_features=144, out_features=4608, bias=True)\n          (out_proj): Linear(in_features=4608, out_features=2304, bias=False)\n        )\n        (mlp_norm): RMSNorm(2304, eps=1e-05)\n        (mlp): GatedMLP(\n          (gate_proj): Linear(in_features=2304, out_features=6144, bias=False)\n          (up_proj): Linear(in_features=2304, out_features=6144, bias=False)\n          (down_proj): Linear(in_features=6144, out_features=2304, bias=False)\n          (swiglu_linear): SwiGLULinear()\n        )\n      )\n      (1): SambaBlock(\n        (mixer_norm): RMSNorm(2304, eps=1e-05)\n        (mixer): Attention(\n          (q_proj): Linear(in_features=2304, out_features=2304, bias=False)\n          (k_proj): Linear(in_features=2304, out_features=2304, bias=False)\n          (v_proj): Linear(in_features=2304, out_features=2304, bias=False)\n          (o_proj): Linear(in_features=2304, out_features=2304, bias=False)\n          (rotary): RotaryEmbedding(dim=128, base=10000.0, interleaved=False, pos_idx_in_fp32=True)\n        )\n        (mlp_norm): RMSNorm(2304, eps=1e-05)\n        (mlp): 
GatedMLP(\n          (gate_proj): Linear(in_features=2304, out_features=6144, bias=False)\n          (up_proj): Linear(in_features=2304, out_features=6144, bias=False)\n          (down_proj): Linear(in_features=6144, out_features=2304, bias=False)\n          (swiglu_linear): SwiGLULinear()\n        )\n      )\n    )\n    (norm_f): RMSNorm(2304, eps=1e-05)\n  )\n  (lm_head): Linear(in_features=2304, out_features=32000, bias=False)\n)\n```\n\n\u003C\u002Fdetails>\n\nDuring inference, you **DO NOT** need to revise anything for generation!\nThe model will produce output as-is, without any need for additional configurations or modifications.\n\n## Training\n\nWe provide a minimal framework called [🔥 `flame`](https:\u002F\u002Fgithub.com\u002Ffla-org\u002Fflame) built on top of `torchtitan`, for efficient training of `fla` models.\n\nCheck out [the GLA example](https:\u002F\u002Fgithub.com\u002Ffla-org\u002Fflash-linear-attention\u002Fblob\u002Fmain\u002Fexamples\u002Ftraining.md) for more details.\n\n## Evaluation\n\nThe [lm-evaluation-harness](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness) library allows you to easily perform (zero-shot) model evaluations.\nFollow the steps below to use this library:\n\n1. Install `lm_eval` following [their instructions](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Fblob\u002Fmain\u002FREADME.md).\n\n2. 
Run the evaluation with:\n```sh\n$ MODEL='fla-hub\u002Fgla-1.3B-100B'\n$ python -m evals.harness --model hf \\\n    --model_args pretrained=$MODEL,dtype=bfloat16 \\\n    --tasks wikitext,lambada_openai,piqa,hellaswag,winogrande,arc_easy,arc_challenge,boolq,sciq,copa,openbookqa \\\n    --batch_size 64 \\\n    --num_fewshot 0 \\\n    --device cuda \\\n    --show_config\n```\n\nWe've made `fla` compatible with hf-style evaluations, so you can call [evals.harness](evals\u002Fharness.py) to run them.\nRunning the command above will provide the task results reported in the GLA paper.\n\n3. Multi-GPU Evaluation with Hugging Face accelerate 🚀\n\nTo perform data-parallel evaluation (where each GPU loads a separate full copy of the model), use the accelerate launcher as follows:\n```sh\n$ MODEL='fla-hub\u002Fgla-1.3B-100B'\n$ accelerate launch -m evals.harness --model hf \\\n    --model_args pretrained=$MODEL,dtype=bfloat16,trust_remote_code=True \\\n    --tasks wikitext,lambada_openai,piqa,hellaswag,winogrande,arc_easy,arc_challenge,boolq,sciq,copa,openbookqa \\\n    --batch_size 64 \\\n    --num_fewshot 0 \\\n    --device cuda \\\n    --show_config \\\n    --trust_remote_code\n```\n\n4. 📏 RULER Benchmark suite\n\nThe RULER benchmarks are commonly used for evaluating model performance on long-context tasks.\nYou can evaluate `fla` models on RULER directly using `lm-evaluation-harness`. 
RULER is only available in relatively recent versions of `lm-evaluation-harness`, so make sure you have the latest version installed:\n\n```sh\ngit clone --depth 1 https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\ncd lm-evaluation-harness\npip install -e .\n```\n\nThen, install the necessary dependencies for RULER:\n```sh\npip install \"lm_eval[ruler]\"\n```\nand run the evaluation (e.g., with 32k contexts):\n```sh\n$ accelerate launch -m evals.harness \\\n    --output_path $OUTPUT \\\n    --tasks niah_single_1,niah_single_2,niah_single_3,niah_multikey_1,niah_multikey_2,niah_multikey_3,niah_multiquery,niah_multivalue,ruler_vt,ruler_cwe,ruler_fwe,ruler_qa_hotpot,ruler_qa_squad \\\n    --model_args pretrained=$MODEL,dtype=bfloat16,max_length=32768,trust_remote_code=True \\\n    --metadata='{\"max_seq_lengths\":[4096,8192,16384,32768]}' \\\n    --batch_size 2 \\\n    --show_config \\\n    --trust_remote_code\n```\n\nIf a single GPU can't hold a full copy of the model, please refer to [this link](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness?tab=readme-ov-file#multi-gpu-evaluation-with-hugging-face-accelerate) for FSDP settings.\n\n> [!TIP]\n> If you are using `lm-evaluation-harness` as an external library and find that (almost) no tasks are available, run the following before calling `lm_eval.evaluate()` or `lm_eval.simple_evaluate()` to load the library's stock tasks:\n> ```py\n> >>> from lm_eval.tasks import TaskManager; TaskManager().initialize_tasks()\n> ```\n\n## Benchmarks\n\nWe compare our Triton-based implementations (`chunk_retention`, `chunk_gla`, `chunk_gdn`) with CUDA-based FlashAttention2 across various shape configurations.\nThese tests were conducted on a single NVIDIA GB200 GPU (CUDA 12.9, PyTorch 2.9.0).\n\n```sh\n# you might have to first install `fla` via `pip install -e .` to enable its import\n$ python -m benchmarks.ops.run --op chunk_retention chunk_gla chunk_gdn 
flash_attn\n=================================================================================\n  Machine: NVIDIA GB200 | CUDA 12.9 | PyTorch 2.9.0+cu129.msh\n=================================================================================\n  fwd        B      T    H    D  op                            main[0a484709](ms)\n          -----------------------------------------------------------------------\n             1   8192   96  128  chunk_retention                            0.787\n                                 chunk_gla                                  1.765\n                                 chunk_gdn                                  1.265\n                                 flash_attn                                 3.753\n          -----------------------------------------------------------------------\n             2  16384   16  128  chunk_retention                            0.792\n                                 chunk_gla                                  1.445\n                                 chunk_gdn                                  1.029\n                                 flash_attn                                 5.035\n          -----------------------------------------------------------------------\n             4   2048   16  128  chunk_retention                            0.559\n                                 chunk_gla                                  0.514\n                                 chunk_gdn                                  0.753\n                                 flash_attn                                 0.346\n          -----------------------------------------------------------------------\n             4   4096   64  128  chunk_retention                            0.997\n                                 chunk_gla                                  2.251\n                                 chunk_gdn                                  1.581\n                                 flash_attn                                 2.560\n          
-----------------------------------------------------------------------\n             8   1024    8   64  chunk_retention                            0.425\n                                 chunk_gla                                  0.358\n                                 chunk_gdn                                  0.631\n                                 flash_attn                                 0.157\n          -----------------------------------------------------------------------\n             8   2048   32  256  chunk_retention                            1.174\n                                 chunk_gla                                  2.897\n                                 chunk_gdn                                  1.831\n                                 flash_attn                                 1.408\n=================================================================================\n  fwdbwd     B      T    H    D  op                            main[0a484709](ms)\n          -----------------------------------------------------------------------\n             1   8192   96  128  chunk_retention                            2.618\n                                 chunk_gla                                  7.670\n                                 chunk_gdn                                  4.738\n                                 flash_attn                                15.371\n          -----------------------------------------------------------------------\n             2  16384   16  128  chunk_retention                            2.122\n                                 chunk_gla                                  5.984\n                                 chunk_gdn                                  3.616\n                                 flash_attn                                19.960\n          -----------------------------------------------------------------------\n             4   2048   16  128  chunk_retention                            1.047\n                  
               chunk_gla                                  1.434\n                                 chunk_gdn                                  2.085\n                                 flash_attn                                 0.902\n          -----------------------------------------------------------------------\n             4   4096   64  128  chunk_retention                            3.459\n                                 chunk_gla                                 10.216\n                                 chunk_gdn                                  5.964\n                                 flash_attn                                10.815\n          -----------------------------------------------------------------------\n             8   1024    8   64  chunk_retention                            0.898\n                                 chunk_gla                                  1.707\n                                 chunk_gdn                                  1.974\n                                 flash_attn                                 0.477\n          -----------------------------------------------------------------------\n             8   2048   32  256  chunk_retention                           51.103\n                                 chunk_gla                                 13.797\n                                 chunk_gdn                                  8.644\n                                 flash_attn                                 6.748\n=================================================================================\n```\n\n\n## Citation\nIf you find this repository helpful, please cite our work:\n```bib\n@software{yang2024fla,\n  title  = {FLA: A Triton-Based Library for Hardware-Efficient Implementations of Linear Attention Mechanism},\n  author = {Yang, Songlin and Zhang, Yu},\n  url    = {https:\u002F\u002Fgithub.com\u002Ffla-org\u002Fflash-linear-attention},\n  month  = jan,\n  year   = {2024}\n}\n\n@misc{chen2026attnres,\n  title         = 
{Attention Residuals},\n  author        = {Chen, Guangyu  and Zhang, Yu  and Su, Jianlin  and Xu, Weixin  and Pan, Siyuan  and Wang, Yaoyu  and Wang, Yucheng  and Chen, Guanduo  and Yin, Bohong  and Chen, Yutian  and Yan, Junjie  and Wei, Ming  and Zhang, Y.  and Meng, Fanqing  and Hong, Chao  and Xie, Xiaotong  and Liu, Shaowei  and Lu, Enzhe  and Tai, Yunpeng  and Chen, Yanru  and Men, Xin  and Guo, Haiqing  and Charles, Y.  and Lu, Haoyu  and Sui, Lin  and Zhu, Jinguo  and Zhou, Zaida  and He, Weiran  and Huang, Weixiao  and Xu, Xinran  and Wang, Yuzhi  and Lai, Guokun  and Du, Yulun  and Wu, Yuxin  and Yang, Zhilin  and Zhou, Xinyu},\n  year          = {2026},\n  eprint        = {2603.15031},\n  archiveprefix = {arXiv},\n  primaryclass  = {cs.CL}\n}\n\n@misc{zhang2025kda,\n  title         = {Kimi Linear: An Expressive, Efficient Attention Architecture},\n  author        = {Zhang, Yu  and Lin, Zongyu  and Yao, Xingcheng  and Hu, Jiaxi  and Meng, Fanqing  and Liu, Chengyin  and Men, Xin  and Yang, Songlin  and Li, Zhiyuan  and Li, Wentao  and Lu, Enzhe  and Liu, Weizhou  and Chen, Yanru  and Xu, Weixin  and Yu, Longhui  and Wang, Yejie  and Fan, Yu  and Zhong, Longguang  and Yuan, Enming  and Zhang, Dehao  and Zhang, Yizhi  and T. Liu, Y.  
and Wang, Haiming  and Fang, Shengjun  and He, Weiran  and Liu, Shaowei  and Li, Yiwei  and Su, Jianlin  and Qiu, Jiezhong  and Pang, Bo  and Yan, Junjie  and Jiang, Zhejun  and Huang, Weixiao  and Yin, Bohong  and You, Jiacheng  and Wei, Chu  and Wang, Zhengtao  and Hong, Chao  and Chen, Yutian  and Chen, Guanduo  and Wang, Yucheng  and Zheng, Huabin  and Wang, Feng  and Liu, Yibo  and Dong, Mengnan  and Zhang, Zheng  and Pan, Siyuan  and Wu, Wenhao  and Wu, Yuhao  and Guan, Longyu  and Tao, Jiawen  and Fu, Guohong  and Xu, Xinran  and Wang, Yuzhi  and Lai, Guokun  and Wu, Yuxin  and Zhou, Xinyu  and Yang, Zhilin  and Du, Yulun},\n  year          = {2025},\n  eprint        = {2510.26692},\n  archivePrefix = {arXiv},\n  primaryClass  = {cs.CL}\n}\n\n@inproceedings{yang2025path,\n  title     = {PaTH Attention: Position Encoding via Accumulating Householder Transformations},\n  author    = {Yang, Songlin  and Shen, Yikang and Wen, Kaiyue and Tan, Shawn  and Mishra, Mayank  and Ren, Liliang  and Panda, Rameswar  and Kim, Yoon},\n  booktitle = {Proceedings of NeurIPS},\n  year      = {2025}\n}\n\n@inproceedings{yang2024gdn,\n  title     = {Gated Delta Networks: Improving Mamba2 with Delta Rule},\n  author    = {Yang, Songlin  and Kautz, Jan  and Hatamizadeh, Ali},\n  booktitle = {Proceedings of ICLR},\n  year      = {2025}\n}\n\n@inproceedings{yang2024deltanet,\n  title     = {Parallelizing Linear Transformers with the Delta Rule over Sequence Length},\n  author    = {Yang, Songlin and Wang, Bailin and Zhang, Yu and Shen, Yikang and Kim, Yoon},\n  booktitle = {Proceedings of NeurIPS},\n  year      = {2024}\n}\n\n@inproceedings{zhang2024gsa,\n  title     = {Gated Slot Attention for Efficient Linear-Time Sequence Modeling},\n  author    = {Zhang, Yu and Yang, Songlin and Zhu, Ruijie and Zhang, Yue and Cui, Leyang and Wang, Yiqiao and Wang, Bolun and Shi, Freda and Wang, Bailin and Bi, Wei and Zhou, Peng and Fu, Guohong},\n  booktitle = {Proceedings of NeurIPS},\n  year   
   = {2024}\n}\n\n@inproceedings{qin2024hgrn2,\n  title     = {HGRN2: Gated Linear RNNs with State Expansion},\n  author    = {Qin, Zhen and Yang, Songlin and Sun, Weixuan and Shen, Xuyang and Li, Dong and Sun, Weigao and Zhong, Yiran},\n  booktitle = {Proceedings of COLM},\n  year      = {2024}\n}\n\n@inproceedings{yang2024gla,\n  title     = {Gated Linear Attention Transformers with Hardware-Efficient Training},\n  author    = {Yang, Songlin and Wang, Bailin and Shen, Yikang and Panda, Rameswar and Kim, Yoon},\n  booktitle = {Proceedings of ICML},\n  year      = {2024}\n}\n```\n\n## Star History\n\n[![Stargazers repo roster for @fla-org\u002Fflash-linear-attention](https:\u002F\u002Fbytecrank.com\u002Fnastyox\u002Freporoster\u002Fphp\u002FstargazersSVG.php?user=fla-org&repo=flash-linear-attention)](https:\u002F\u002Fgithub.com\u002Ffla-org\u002Fflash-linear-attention\u002Fstargazers)\n\n[![Star History Chart](https:\u002F\u002Fapi.star-history.com\u002Fsvg?repos=fla-org\u002Fflash-linear-attention&type=Date)](https:\u002F\u002Fstar-history.com\u002F#fla-org\u002Fflash-linear-attention&Date)\n\n## Acknowledgements\n\nWe extend our gratitude to [Bitdeer](https:\u002F\u002Fwww.bitdeer.com\u002F) and [Moonshot AI](https:\u002F\u002Fwww.moonshot.ai\u002F) for their support in maintaining and powering our project infrastructure.\n","Flash Linear Attention is a project that provides efficient implementations of emerging model architectures, focusing on key components of modern sequence models such as linear attention, sparse attention, state space models, and hybrid large language models. Written in Python, it offers hardware-efficient building blocks and training-ready layers, and has been verified on NVIDIA, AMD, and Intel hardware for cross-platform compatibility. It is well suited to applications that need to process long sequences efficiently, such as text generation and understanding in natural language processing and the optimization of machine learning systems. The project is also updated continuously with the latest research and techniques and supports a range of advanced attention mechanisms and model architectures, giving developers a powerful toolkit for exploring and applying them.",2,"2026-06-11 03:36:28","high_star"]