[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-80983":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":12,"contributorsCount":14,"subscribersCount":14,"size":14,"stars1d":15,"stars7d":16,"stars30d":16,"stars90d":14,"forks30d":14,"starsTrendScore":17,"compositeScore":18,"rankGlobal":9,"rankLanguage":9,"license":9,"archived":19,"fork":19,"defaultBranch":20,"hasWiki":19,"hasPages":19,"topics":21,"createdAt":9,"pushedAt":9,"updatedAt":22,"readmeContent":23,"aiSummary":24,"trendingCount":14,"starSnapshotCount":14,"syncStatus":25,"lastSyncTime":26,"discoverSource":27},80983,"delta-attention-residuals-code","wdlctc\u002Fdelta-attention-residuals-code","wdlctc","Delta Attention Residuals - supplementary code and pretrained models",null,"Python",35,1,30,0,4,5,12,0.9,false,"main",[],"2026-06-12 02:04:09","# Delta Attention Residuals\n\nOfficial code for **\"Delta Attention Residuals: Per-Sublayer Sources for Cross-Layer Information Flow\"**.\n\nCheng Luo, Zefan Cai, Junjie Hu\n\n[[Paper]](https:\u002F\u002Fgithub.com\u002Fwdlctc\u002Fdelta-attention-residuals-arxiv)\n\n## Overview\n\nDelta Attention Residuals replace cumulative hidden states with per-sublayer deltas as routing sources for cross-layer connectivity. 
The key insight: routing over *what changed* rather than *what accumulated* yields 3x sharper routing and consistently better perplexity across all tested scales (220M--8B).\n\nTwo variants:\n- **Delta AttnRes**: per-sublayer deltas (2L sources), best quality\n- **Delta Block**: block-level deltas (~L\u002FB sources), practical default with minimal overhead\n\n## Repository Structure\n\n```\nAttention-Residuals\u002F\n  modeling_qwen3_attnres.py   # Core model: Qwen3 + Delta Attention Residuals\ntrain_scratch.py              # From-scratch training (DDP, up to ~1B)\ntrain_scratch_fsdp.py         # From-scratch training (FSDP, 7B+)\ntrain_finetune.py             # Fine-tuning pretrained models\neval_downstream.py            # Downstream evaluation (lm-eval-harness)\nrun_8b_delta_block.sh         # Launch script for 8B training\n```\n\n## Quick Start\n\n### Requirements\n\n```bash\npip install torch transformers datasets wandb\n```\n\n### Training from scratch (220M--1B, DDP)\n\n```bash\n# Baseline\ntorchrun --standalone --nproc_per_node=8 train_scratch.py --mode baseline\n\n# Delta Block (recommended)\ntorchrun --standalone --nproc_per_node=8 train_scratch.py --mode delta_block --compile_model\n\n# Delta AttnRes (per-sublayer)\ntorchrun --standalone --nproc_per_node=8 train_scratch.py --mode delta --compile_model\n```\n\n### Training from scratch (7B+, FSDP)\n\n```bash\ntorchrun --standalone --nproc_per_node=8 train_scratch_fsdp.py \\\n    --mode delta_block \\\n    --hidden_size 4096 --num_layers 36 --num_heads 32 --num_kv_heads 8 \\\n    --intermediate_size 12288 \\\n    --batch_size 4 --grad_accum 2 \\\n    --compile_model --shard_grad_op \\\n    --steps 10000\n```\n\n### Fine-tuning pretrained models\n\n```bash\ntorchrun --standalone --nproc_per_node=4 train_finetune.py \\\n    --base_model Qwen\u002FQwen3-0.6B \\\n    --mode delta_block \\\n    --lr 5e-5 --lr_attnres 5e-3 \\\n    --steps 20000\n```\n\n## Results & W&B Runs\n\nThe exact W&B run for every paper 
experiment is listed in [`WANDB_RUNS.md`](.\u002FWANDB_RUNS.md) (training\u002Fvalidation curves, configs, and system metrics). Project: \u003Chttps:\u002F\u002Fwandb.ai\u002Fwdlctc_abr\u002Fattention-residual-h100>.\n\n### From-Scratch Training (10K steps, FineWeb-Edu)\n\n| Scale | Baseline | AttnRes | Delta Block | Delta AttnRes |\n|-------|----------|---------|-------------|---------------|\n| 220M  | 38.71    | 37.39   | **37.08**   | **36.83**     |\n| 533M  | 32.00    | 31.75   | **31.16**   | **31.05**     |\n| 1044M | 29.70    | 31.76   | **29.19**   | **29.13**     |\n\n### Fine-tuning Qwen3-0.6B (downstream avg accuracy)\n\n| Baseline FT | AttnRes | Delta Block |\n|-------------|---------|-------------|\n| 55.0%       | 54.1%   | **55.6%**   |\n\n## Citation\n\n```bibtex\n@article{luo2026delta,\n  title={Delta Attention Residuals},\n  author={Luo, Cheng and Cai, Zefan and Hu, Junjie},\n  year={2026}\n}\n```\n\n## License\n\nMIT\n","This project provides the official code and pretrained models for Delta Attention Residuals, an improved mechanism for cross-layer information flow that uses per-sublayer deltas, rather than cumulative hidden states, as routing sources to achieve sharper routing. It offers two variants: Delta AttnRes (per-sublayer deltas, best quality) and Delta Block (block-level deltas, the default option with minimal overhead). It is implemented in Python and supports from-scratch training, fine-tuning, and downstream evaluation, using PyTorch with distributed data parallel (DDP) or fully sharded data parallel (FSDP) training. It is aimed at researchers and developers who want to improve large language model performance, and it performs well across the tested scales, from 220M to 8B parameters.",2,"2026-06-11 04:03:04","CREATED_QUERY"]