[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-78163":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":25,"hasPages":23,"topics":26,"createdAt":10,"pushedAt":10,"updatedAt":27,"readmeContent":28,"aiSummary":29,"trendingCount":16,"starSnapshotCount":16,"syncStatus":14,"lastSyncTime":30,"discoverSource":31},78163,"GatedDeltaNet-2","NVlabs\u002FGatedDeltaNet-2","NVlabs","Official PyTorch Implementation of Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention","",null,"Python",207,16,2,1,0,5,9,130,15,68.19,"Other",false,"main",true,[],"2026-06-12 04:01:23","# 🔺 Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention\n\nOfficial PyTorch implementation of **Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention**.\n\n[![Star on GitHub](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FNVlabs\u002FGatedDeltaNet-2.svg?style=social)](https:\u002F\u002Fgithub.com\u002FNVlabs\u002FGatedDeltaNet-2\u002Fstargazers)\n\n[Ali Hatamizadeh](https:\u002F\u002Fahatamiz.github.io),\n[Yejin Choi](https:\u002F\u002Fhomes.cs.washington.edu\u002F~yejin\u002F), and\n[Jan Kautz](https:\u002F\u002Fjankautz.com\u002F).\n\n\n## 🌟 Why Gated DeltaNet-2?\n\nLinear attention compresses an unbounded KV cache into a fixed-size recurrent state. The hard part is not just *what to forget*, but *how to edit* this compressed memory without scrambling existing associations. 
Prior delta-rule models (Gated DeltaNet, Kimi Delta Attention) tie *erasing* and *writing* to a single scalar gate — even though they act on different axes of the state.\n\n**Gated DeltaNet-2** decouples these two roles:\n\n- ✂️ **Channel-wise Erase Gate `b_t`** — selects which *key-side* coordinates of the decayed state are read and removed\n- ✍️ **Channel-wise Write Gate `w_t`** — selects which *value-side* coordinates of the new content are committed\n- 🌀 **Channel-wise Decay** — inherited from KDA for fine-grained global forgetting\n- 🔁 **Strict Generalization** — recovers KDA when both gates collapse to the same scalar, and Gated DeltaNet when the decay also collapses\n- ⚡ **Hardware-efficient Training** — fast-weight WY chunkwise algorithm with gate-aware backward, fused in Triton\n\n\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002Fgdn2_figure.png\" width=\"100%\" \u002F>\n\u003C\u002Fp>\n\n\n## 📐 The Gated Delta Rule-2\n\nGiven an erase gate `b_t ∈ [0,1]^{d_k}`, a write gate `w_t ∈ [0,1]^{d_v}`, and channel-wise decay `D_t = Diag(α_t)`, the recurrent state evolves as:\n\n```\nS_t = (I − k_t (b_t ⊙ k_t)ᵀ) D_t S_{t−1}  +  k_t (w_t ⊙ v_t)ᵀ\n```\n\nCompared with KDA, the right factor of the rank-one erase becomes channel-selective on the *key* axis, and the write term becomes channel-selective on the *value* axis. The two decisions no longer share a single scalar.\n\n\n## 📊 Results\n\nWe train all models at **1.3B parameters on 100B tokens of FineWeb-Edu**, matched in recurrent state size, and compare against Mamba-2, Gated DeltaNet, KDA, and Mamba-3 (SISO and MIMO).\n\n### Language Modeling and Commonsense Reasoning\n\nGated DeltaNet-2 achieves the best average across both **recurrent-only** and **hybrid** settings:\n\n| Model | Wiki ppl ↓ | LMB ppl ↓ | LMB acc ↑ | Avg. 
acc ↑ |\n|---|---|---|---|---|\n| **Recurrent** | | | | |\n| Mamba-2 | 16.79 | 12.38 | 45.24 | 51.82 |\n| Gated DeltaNet | 16.40 | 11.89 | 49.62 | 52.07 |\n| KDA | 16.81 | 11.68 | 48.13 | 52.28 |\n| Mamba-3 (MIMO) | 16.45 | 11.66 | 47.82 | 52.39 |\n| **Gated DeltaNet-2** | **15.90** | **11.41** | 48.09 | **53.11** |\n| **Hybrid (+ SWA)** | | | | |\n| Transformer | 19.22 | 13.72 | 48.32 | 50.86 |\n| Gated DeltaNet | 16.00 | 10.82 | 48.71 | 52.25 |\n| KDA | 16.01 | 10.66 | 49.21 | 52.68 |\n| Mamba-3 (MIMO) | 15.81 | 10.92 | 49.82 | 52.72 |\n| **Gated DeltaNet-2** | **15.62** | **10.43** | **50.90** | **53.97** |\n\n### Long-context Retrieval (RULER)\n\nGated DeltaNet-2 is strongest where memory editing matters most — particularly the interference-heavy multi-key needle-in-a-haystack settings:\n\n| Model | S-NIAH-2 @4K | S-NIAH-3 @2K | MK-NIAH-1 @4K |\n|---|---|---|---|\n| **Recurrent** | | | |\n| Gated DeltaNet | 87.2 | 54.2 | 27.8 |\n| KDA | 89.0 | 63.2 | 28.0 |\n| Mamba-3 (MIMO) | 64.2 | 72.4 | 18.0 |\n| **Gated DeltaNet-2** | **93.0** | **89.8** | **37.8** |\n| **Hybrid** | | | |\n| Gated DeltaNet | 57.3 | 91.2 | 44.8 |\n| KDA | 56.0 | 93.4 | 40.4 |\n| Mamba-3 (MIMO) | 53.0 | 98.4 | 46.6 |\n| **Gated DeltaNet-2** | **57.9** | **99.0** | **48.0** |\n\n### Real-world Retrieval\n\nAcross SWDE, SQuAD, FDA, TriviaQA, NQ, and DROP, Gated DeltaNet-2 leads the recurrent and hybrid frontier:\n\n| Setting | Mamba-2 | GDN | KDA | Mamba-3 (MIMO) | **GDN-2** |\n|---|---|---|---|---|---|\n| Recurrent avg. | 26.84 | 28.09 | 28.67 | 28.35 | **29.88** |\n| Hybrid avg. 
| 39.74 | 39.11 | 40.14 | 40.11 | **42.28** |\n\n### Throughput\n\nGated DeltaNet-2 retains near-flat scaling with sequence length on a single H100 (training, hybrid 1.3B), with only a small constant overhead over KDA for the added channel-wise gates.\n\n\n## 🔧 What's New in the Update Rule\n\n| Method | Decay | Erase | Write |\n|---|---|---|---|\n| Mamba-2 | scalar | — | scalar |\n| Gated DeltaNet | scalar | scalar `β_t` | scalar `β_t` |\n| KDA | **channel-wise** | scalar `β_t` | scalar `β_t` |\n| **Gated DeltaNet-2** | **channel-wise** | **channel-wise `b_t`** | **channel-wise `w_t`** |\n\nAblations confirm both gates contribute, with the **erase gate `b_t` accounting for most of the gain** — consistent with its role in selectively protecting or revising key-side associations in the recurrent state.\n\n\n## 📢 Latest Updates\n\n- `05\u002F21\u002F2026`: 🔥 **Code Release**: Train your own Gated DeltaNet-2 on FineWeb-Edu\n- Watch this space for more exciting updates!\n\n\n## 🚀 Getting Started\n\n### Training Your Model\n\nLaunch your training with our streamlined command:\n\n```bash\npython ..\u002Fpretrain.py \\\n--train_data_dir ${TRAIN_DATA} \\\n--val_data_dir ${VALIDATION_DATA} \\\n--output_root ${SAVE_DIR} \\\n--exp_name ${NAME} \\\n--model_name ${MODEL} \\\n--train_config ${CONFIG} \\\n--eval_iters ${EVAL_ITERS} \\\n--learning_rate ${LR} \\\n--micro_batch_size ${MICRO_BATCH_SIZE}\n```\n\n💡 **Pro Tip**: Add `--interactive_job --debug` for interactive debugging sessions!\n\n### Default Recipe\n\nWe train 1.3B-parameter models on 100B tokens of FineWeb-Edu with:\n\n- AdamW, peak LR `4e-4`, weight decay `0.1`, gradient clip `1.0`\n- Cosine schedule with 1B-token warmup\n- Global batch size `0.5M` tokens, sequence length `4K`\n- Hybrid models use a `2K` sliding-window attention size\n- 16 heads, `d_k = d_v = 128`, matched recurrent state size against Mamba-2\u002F3 baselines\n\n\n## 📜 License\n\nCopyright © 2026, NVIDIA Corporation. 
All rights reserved.\n\nLicensed under the NVIDIA Source Code License-NC. See [LICENSE](LICENSE) for details.\n\n\n## 🙏 Acknowledgements\n\nBuilt on the shoulders of giants:\n- [Gated DeltaNet](https:\u002F\u002Fgithub.com\u002FNVlabs\u002FGatedDeltaNet)\n- [Kimi Delta Attention](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.26692)\n- [Flash Linear Attention](https:\u002F\u002Fgithub.com\u002Ffla-org\u002Fflash-linear-attention)\n- [Samba](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FSamba)\n- [LiTGPT](https:\u002F\u002Fgithub.com\u002FLightning-AI\u002Flitgpt)\n\n\n## 📖 Citation\n\nIf you find this work useful, please consider citing:\n\n```bibtex\n@article{hatamizadeh2026gdn2,\n  title   = {Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention},\n  author  = {Hatamizadeh, Ali and Choi, Yejin and Kautz, Jan},\n  journal = {arXiv preprint},\n  year    = {2026}\n}\n```\n\n\n## ⭐ Support Us\n\nIf you find this work useful, please consider:\n- Starring the repository\n- Citing our paper\n- Contributing to the codebase\n\nJoin us in pushing the boundaries of linear attention! 🚀\n\n\n## Star History\n\n[![Stargazers repo roster for @NVlabs\u002FGatedDeltaNet-2](https:\u002F\u002Fbytecrank.com\u002Fnastyox\u002Freporoster\u002Fphp\u002FstargazersSVG.php?user=NVlabs&repo=GatedDeltaNet-2)](https:\u002F\u002Fgithub.com\u002FNVlabs\u002FGatedDeltaNet-2\u002Fstargazers)\n\n\n[![Star History Chart](https:\u002F\u002Fapi.star-history.com\u002Fsvg?repos=NVlabs\u002FGatedDeltaNet-2&type=Date)](https:\u002F\u002Fstar-history.com\u002F#NVlabs\u002FGatedDeltaNet-2&Date)\n","Gated DeltaNet-2 is the official PyTorch implementation of a linear attention model that aims to improve performance by decoupling the erase and write operations. Its core features are channel-wise erase and write gates, together with the channel-wise decay mechanism inherited from KDA, which allow finer-grained memory management. The project also provides a hardware-efficient training algorithm that uses Triton for the fast-weight computation. It targets language modeling and commonsense reasoning tasks that require efficient processing of long sequences, where Gated DeltaNet-2 outperforms other models such as Mamba-2, Gated DeltaNet, and KDA, with notable results on large-scale text datasets.","2026-06-11 03:56:30","CREATED_QUERY"]