[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-75105":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":8,"htmlUrl":8,"language":8,"languages":8,"totalLinesOfCode":8,"stars":9,"forks":10,"watchers":11,"openIssues":12,"contributorsCount":13,"subscribersCount":13,"size":13,"stars1d":14,"stars7d":15,"stars30d":16,"stars90d":13,"forks30d":13,"starsTrendScore":12,"compositeScore":17,"rankGlobal":8,"rankLanguage":8,"license":8,"archived":18,"fork":18,"defaultBranch":19,"hasWiki":18,"hasPages":18,"topics":20,"createdAt":8,"pushedAt":8,"updatedAt":21,"readmeContent":22,"aiSummary":23,"trendingCount":13,"starSnapshotCount":13,"syncStatus":24,"lastSyncTime":25,"discoverSource":26},75105,"Attention-Residuals","MoonshotAI\u002FAttention-Residuals","MoonshotAI",null,3302,187,28,12,0,4,9,45,28.82,false,"master",[],"2026-06-12 02:03:32","\u003Cdiv align=\"center\">\n\u003Ch2 align=\"center\">\n  \u003Cb>\n    \u003Cspan>━━━━━━━━━━━━━━━━━━━━━━━━━━━\u003C\u002Fspan>\n    \u003Cbr\u002F>\n    \u003Cimg src=\"assets\u002Flogo.png\" height=\"16\" width=\"16\" style=\"display: inline-block; vertical-align: middle; margin: 2px;\"> Attention Residuals\n    \u003Cbr\u002F>\n    \u003Cspan>━━━━━━━━━━━━━━━━━━━━━━━━━━━\u003C\u002Fspan>\n    \u003Cbr\u002F>\n  \u003C\u002Fb>\n\u003C\u002Fh2>\n\u003C\u002Fdiv>\n\n\u003Cp align=\"center\">\n  \u003Ca href=\"Attention_Residuals.pdf\">Paper\u003C\u002Fa> &nbsp;|&nbsp;\n  \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.15031\">arXiv\u003C\u002Fa> &nbsp;|&nbsp;\n  \u003Ca href=\"#overview\">Overview\u003C\u002Fa> &nbsp;|&nbsp;\n  \u003Ca href=\"#results\">Results\u003C\u002Fa> &nbsp;|&nbsp;\n  \u003Ca href=\"#citation\">Citation\u003C\u002Fa>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002Foverview.png\" width=\"800\" \u002F>\n\u003C\u002Fp>\n\u003Cp align=\"center\">\u003Cem>\n  (a) Standard residuals with uniform additive accumulation.\n  (b) Full AttnRes: each 
layer attends over all previous outputs.\n  (c) Block AttnRes: layers are grouped into blocks, reducing memory from O(Ld) to O(Nd).\n\u003C\u002Fem>\u003C\u002Fp>\n\n---\n\nThis is the official repository for **Attention Residuals (AttnRes)**, a drop-in replacement for standard residual connections in Transformers that enables each layer to *selectively* aggregate earlier representations via learned, input-dependent attention over depth.\n\n## Overview\n\nStandard residual connections accumulate all layer outputs with fixed unit weights. As depth grows, this uniform aggregation dilutes each layer's contribution and causes hidden-state magnitudes to grow unboundedly — a well-known problem with PreNorm.\n\n**AttnRes** replaces this fixed accumulation with softmax attention over preceding layer outputs:\n\n$$\\mathbf{h}_l = \\sum_{i=0}^{l-1} \\alpha_{i \\to l} \\cdot \\mathbf{v}_i$$\n\nwhere the weights $\\alpha_{i \\to l}$ are computed via a single learned pseudo-query $\\mathbf{w}_l \\in \\mathbb{R}^d$ per layer. This gives every layer selective, content-aware access to all earlier representations.\n\n### Block AttnRes\n\nFull AttnRes is straightforward but requires O(Ld) memory at scale. **Block AttnRes** partitions layers into N blocks, accumulates within each block via standard residuals, and applies attention only over block-level representations. 
With ~8 blocks, it recovers most of Full AttnRes's gains while serving as a practical drop-in replacement with marginal overhead.\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>PyTorch-style pseudocode\u003C\u002Fb>\u003C\u002Fsummary>\n\n```python\ndef block_attn_res(blocks: list[Tensor], partial_block: Tensor, proj: Linear, norm: RMSNorm) -> Tensor:\n    \"\"\"\n    Inter-block attention: attend over block reps + partial sum.\n    blocks:\n        N tensors of shape [B, T, D]: completed block representations for each previous block\n    partial_block:\n        [B, T, D]:    intra-block partial sum (b_n^i)\n    \"\"\"\n    V = torch.stack(blocks + [partial_block])  # [N+1, B, T, D]\n    K = norm(V)\n    logits = torch.einsum('d, n b t d -> n b t', proj.weight.squeeze(), K)\n    h = torch.einsum('n b t, n b t d -> b t d', logits.softmax(0), V)\n    return h\n\ndef forward(self, blocks: list[Tensor], hidden_states: Tensor) -> tuple[list[Tensor], Tensor]:\n    partial_block = hidden_states\n    # apply block attnres before attn\n    # blocks already include token embedding\n    h = block_attn_res(blocks, partial_block, self.attn_res_proj, self.attn_res_norm)\n\n    # if reaches block boundary, start new block\n    # block_size counts ATTN + MLP; each transformer layer has 2\n    if self.layer_number % (self.block_size \u002F\u002F 2) == 0:\n        blocks.append(partial_block)\n        partial_block = None\n\n    # self-attention layer\n    attn_out = self.attn(self.attn_norm(h))\n    partial_block = partial_block + attn_out if partial_block is not None else attn_out\n\n    # apply block attnres before MLP\n    h = block_attn_res(blocks, partial_block, self.mlp_res_proj, self.mlp_res_norm)\n\n    # MLP layer\n    mlp_out = self.mlp(self.mlp_norm(h))\n    partial_block = partial_block + mlp_out\n\n    return blocks, partial_block\n```\n\n\u003C\u002Fdetails>\n\n## Results\n\n### Scaling Laws\n\nAttnRes consistently outperforms the baseline across all compute budgets. 
Block AttnRes matches the loss of a baseline trained with **1.25x more compute**.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002Fscaling_law.png\" width=\"420\" \u002F>\n\u003C\u002Fp>\n\n### Downstream Performance (Kimi Linear 48B \u002F 3B activated, 1.4T tokens)\n\n| Category | Benchmark | Baseline | AttnRes |\n|:---|:---|:---:|:---:|\n| General | MMLU | 73.5 | **74.6** |\n| | GPQA-Diamond | 36.9 | **44.4** |\n| | BBH | 76.3 | **78.0** |\n| | TriviaQA | 69.9 | **71.8** |\n| Math & Code | Math | 53.5 | **57.1** |\n| | HumanEval | 59.1 | **62.2** |\n| | MBPP | 72.0 | **73.9** |\n| Chinese | CMMLU | 82.0 | **82.9** |\n| | C-Eval | 79.6 | **82.5** |\n\nAttnRes improves across the board, with the largest gains on multi-step reasoning (+7.5 on GPQA-Diamond) and code generation (+3.1 on HumanEval).\n\n### Training Dynamics\n\nAttnRes mitigates PreNorm dilution: output magnitudes remain bounded across depth and gradient norms distribute more uniformly across layers.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002Ftraining_dynamics.png\" width=\"800\" \u002F>\n\u003C\u002Fp>\n\n## Citation\n\nIf you found our work useful, please cite\n\n```bib\n@misc{chen2026attnres,\n  title         = {Attention Residuals},\n  author        = {Kimi Team  and Chen, Guangyu  and Zhang, Yu  and Su, Jianlin  and Xu, Weixin  and Pan, Siyuan  and Wang, Yaoyu  and Wang, Yucheng  and Chen, Guanduo  and Yin, Bohong  and Chen, Yutian  and Yan, Junjie  and Wei, Ming  and Zhang, Y.  and Meng, Fanqing  and Hong, Chao  and Xie, Xiaotong  and Liu, Shaowei  and Lu, Enzhe  and Tai, Yunpeng  and Chen, Yanru  and Men, Xin  and Guo, Haiqing  and Charles, Y.  
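and Lu, Haoyu  and Sui, Lin  and Zhu, Jinguo  and Zhou, Zaida  and He, Weiran  and Huang, Weixiao  and Xu, Xinran  and Wang, Yuzhi  and Lai, Guokun  and Du, Yulun  and Wu, Yuxin  and Yang, Zhilin  and Zhou, Xinyu},\n  year          = {2026},\n  archiveprefix = {arXiv},\n  eprint        = {2603.15031},\n  primaryclass  = {cs.CL}\n}\n```\n","The Attention Residuals (AttnRes) project provides an alternative to the standard residual connections in Transformer models, selectively aggregating the representations of earlier layers through a learned, input-dependent attention mechanism. Its core feature is replacing fixed-weight accumulation with softmax attention, so that each layer can selectively access information from all preceding layers based on content. In addition, the Block AttnRes variant partitions layers into blocks and applies attention across blocks, reducing the memory requirement to O(Nd) while retaining the performance gains. The project is suited to natural language processing and computer vision tasks that need better training stability and efficiency in deep Transformer networks.",2,"2026-06-11 03:52:21","high_star"]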