[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-80901":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":15,"stars7d":16,"stars30d":16,"stars90d":15,"forks30d":15,"starsTrendScore":15,"compositeScore":17,"rankGlobal":10,"rankLanguage":10,"license":18,"archived":19,"fork":19,"defaultBranch":20,"hasWiki":21,"hasPages":19,"topics":22,"createdAt":10,"pushedAt":10,"updatedAt":25,"readmeContent":26,"aiSummary":27,"trendingCount":15,"starSnapshotCount":15,"syncStatus":28,"lastSyncTime":29,"discoverSource":30},80901,"LT2","chili-lab\u002FLT2","chili-lab","Official Codebase: LT2: Linear-Time Looped Transformers.","https:\u002F\u002Fcharlesdddd.github.io\u002Flt2\u002F",null,"Python",38,1,34,0,4,0.9,"BSD 3-Clause \"New\" or \"Revised\" License",false,"main",true,[23,24],"architecture","language-model","2026-06-12 02:04:08","\u003Cdiv align=\"center\">\n\n# LT2: Linear-Time Looped Transformers\n\nA family of **looped Transformers** with **subquadratic token mixers**:\nlinear, sparse, and hybrid attention, unified within a single architecture.\n\n\u003C!-- \u003Cp align=\"center\">\n \u003Cimg src=\"lingua_overview.svg\" width=\"78%\"\u002F>\n\u003C\u002Fp> -->\n\n\u003C\u002Fdiv>\n\n> Official codebase accompanying the paper **\"LT2: Linear-Time Looped Transformers.\"**\n> Built upon the [Meta Lingua](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Flingua) pre-training framework.\n\n---\n\n## 1. 
Architecture\n\nLT2 replaces the multi-head attention sub-layer of a standard Looped Transformer with a\n**subquadratic token mixer**, so that each shared block becomes\n\n$$F_\\ell(h) \\;=\\; h' + \\mathrm{FFN}_\\ell(h'), \\qquad h' \\;=\\; h + \\mathrm{Mixer}_\\ell(h)$$\n\nwhere `Mixer` may be any linear-attention, sparse-attention, or hybrid primitive. Looping\nreuses the same parameters `T` times in succession, so a block of `n_layers` attains an\neffective depth of `n_layers × T`.\n\n- **LT2-linear** — DPLR linear-attention mixers (GDN, KDA, RWKV7). Loop iterations turn rank-1 state updates into rank-`T` updates.\n- **LT2-sparse** — sliding-window, NSA, or DSA attention. A per-loop window of size `w` becomes an effective receptive field of `T·w`.\n- **LT2-hybrid (Full+GDN)** — interleaves a small fraction of full attention with GDN; surpasses the standard looped transformer at ~2.7× decode speedup, establishing a new Pareto frontier.\n\nSee the paper for the full theoretical analysis and experimental results.\n\n---\n\n## 2. 
Repository Layout\n\n```\nLT2\u002F\n ├─ lingua\u002F             Core training library (forked from Meta Lingua)\n │   ├─ transformer.py     Reference Transformer block\n │   ├─ data.py            Pre-training dataloader\n │   ├─ distributed.py     FSDP \u002F TP \u002F compile wrappers\n │   ├─ checkpoint.py      Distributed checkpointing\n │   ├─ optim.py           Optimizer + LR scheduler\n │   └─ stool.py           SLURM launcher\n ├─ apps\u002FLT2\u002F           LT2 application code\n │   ├─ transformer.py     LT2 model (linear \u002F sparse \u002F hybrid mixers)\n │   ├─ train.py           Training entry point\n │   ├─ eval.py            LM-Evaluation-Harness wrapper\n │   ├─ generate.py        Inference \u002F generation\n │   ├─ benchmark_prefill.py\n │   ├─ configs\u002F\n │   │   ├─ 600M\u002F          0.6B-parameter pre-training recipes\n │   │   └─ 1B\u002F            1.3B-parameter pre-training recipes\n │   ├─ kernel\u002F            Custom Triton \u002F CUDA kernels\n │   ├─ scripts\u002F           Helper scripts\n │   └─ slurm\u002F             Example SLURM job files\n ├─ setup\u002F              Environment + data preparation\n ├─ tokenizer\u002F          Tokenizer files (downloaded)\n └─ requirements.txt\n```\n\n---\n\n## 3. Quick Start\n\nThe following commands launch a SLURM job that creates a Conda environment for the codebase.\nEnvironment creation takes around five minutes (excluding downloads).\n\n```bash\ngit clone \u003CTHIS_REPO_URL>\ncd LT2\n\nbash setup\u002Fcreate_env.sh\n# or, if you have access to a SLURM cluster\nsbatch setup\u002Fcreate_env.sh\n```\n\nOnce that is done, activate the environment:\n\n```bash\nconda activate lingua_\u003Cdate>\n```\n\nUse the provided script to download and prepare data from HuggingFace\n(`fineweb_edu`, `fineweb_edu_10bt`, or `dclm_baseline_1.0`). 
The command below downloads\n`fineweb_edu` and prepares it for training in `.\u002Fdata`, specifying the memory `terashuf`\n(the shuffling tool) is allowed to use. By default `nchunks=32`; if you train on fewer than\n32 GPUs, set `nchunks` to 1 or to the number of GPUs you have\n([details](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Flingua\u002Fissues\u002F55#issuecomment-2483643076)).\n\n```bash\npython setup\u002Fdownload_prepare_hf_data.py fineweb_edu \u003CMEMORY> \\\n    --data_dir .\u002Fdata --seed 42 --nchunks \u003CNCHUNKS>\n```\n\nDownload the tokenizer (Llama 3):\n\n```bash\npython setup\u002Fdownload_tokenizer.py llama3 \u003CSAVE_PATH> --api_key \u003CHUGGINGFACE_TOKEN>\n```\n\nNow launch a quick debug job to verify the setup. The provided configurations are\ntemplates — edit `dump_dir`, `data.root_dir`, `data.tokenizer.path`, etc., for your environment.\n\n```bash\n# stool = SLURM tool\npython -m lingua.stool script=apps.LT2.train \\\n    config=apps\u002FLT2\u002Fconfigs\u002F600M\u002Fdebug.yaml \\\n    nodes=1 partition=\u003Cpartition>\n\n# Or launch locally with torchrun\ntorchrun --nproc-per-node 8 -m apps.LT2.train \\\n    config=apps\u002FLT2\u002Fconfigs\u002F600M\u002Fdebug.yaml\n\n# Or on a single GPU\npython -m apps.LT2.train config=apps\u002FLT2\u002Fconfigs\u002F600M\u002Fdebug.yaml\n```\n\nIf a `stool` job crashes, it may be relaunched directly:\n\n```bash\nsbatch path\u002Fto\u002Fdump_dir\u002Fsubmit.slurm\n```\n\n---\n\n## 4. LT2 Configuration\n\n### Model Fields\n\nThe LT2 model is configured through the `model:` section of the YAML config. Key fields:\n\n| Field | Description |\n|---|---|\n| `n_layers` | Number of *physical* (parameter-sharing) layers in the looped block. |\n| `loop_count` | Number of loop iterations `T`. Effective depth is `n_layers × loop_count`. |\n| `mixer` | Token-mixer family: `full`, `window`, `gdn`, `kda`, `mamba2`, `hgrn2`, `deltanet`, `retnet`, `nsa`, `dsa`. 
|\n| `attention_pattern` | Depth-level hybrid layout (e.g. `\"4:1\"` for 4 mixer + 1 full attention). |\n| `attn_impl` | `\"fmha\"`, `\"flex_attention\"`, or `\"sdpa\"`. Sliding-window requires `fmha` or `flex_attention`. |\n| `default_sliding_window` | Window size for sparse \u002F sliding-window mixers. |\n| `use_residual` | Learned zero-init per-iteration residual gate across loop iterations. |\n| `gated_attn` | SDPA output gate that suppresses the attention-sink sawtooth (§ 3.4 of the paper). |\n\nExample fragment:\n\n```yaml\nmodel:\n  dim: 2048\n  n_layers: 20\n  loop_count: 4              # T = 4, effective depth = 80\n  mixer: \"gdn\"\n  attention_pattern: \"4:1\"   # 4 GDN layers : 1 full-attention layer\n  attn_impl: \"fmha\"\n  default_sliding_window: 2048\n  use_residual: true\n  gated_attn: true\n```\n\n### Recipes Included\n\n`apps\u002FLT2\u002Fconfigs\u002F` provides reference pre-training recipes for the experiments in the paper:\n\n| Scale | Config | Description |\n|:---:|---|---|\n| 0.6B | `600M\u002Fdebug.yaml` | Fast smoke test (single GPU). |\n| 0.6B | `600M\u002Flooped_pure_full_600M.yaml` | Looped Transformer baseline (full attention). |\n| 0.6B | `600M\u002Flooped_pure_{gdn,kda,mamba2,deltanet,retnet,hgrn2}_600M.yaml` | LT2-linear single-mixer ablations. |\n| 0.6B | `600M\u002Flooped_pure_{window,nsa,dsa}_600M.yaml` | LT2-sparse single-mixer ablations. |\n| 0.6B | `600M\u002Flooped_hybrid_gdn_4to1_600M.yaml` | **LT2-hybrid (Full+GDN)**, 4:1 depth interleave. |\n| 0.6B | `600M\u002Flooped_hybrid_bookend_600M.yaml` | LT2-hybrid (Full+GDN), bookend pattern. |\n| 0.6B | `600M\u002Flooped_hybrid_128_256_512_full_600M.yaml` | Loop-level hybrid (fine → coarse). |\n| 1.3B | `1B\u002Flooped_pure_{full,gdn,kda,...}_1B.yaml` | 1.3B single-mixer recipes. |\n| 1.3B | `1B\u002Flooped_hybrid_gdn_4to1_1B.yaml` | **LT2-hybrid (Full+GDN)** at 1.3B — flagship recipe. 
|\n\nAll recipes train on **FineWeb-Edu** at sequence length 4096 for ~100B tokens (255k steps),\nusing `T = 4` loops by default.\n\n### Training\n\n```bash\n# Single-node debug\ntorchrun --nproc-per-node 8 -m apps.LT2.train \\\n    config=apps\u002FLT2\u002Fconfigs\u002F600M\u002Fdebug.yaml\n\n# Multi-node via stool\npython -m lingua.stool script=apps.LT2.train \\\n    config=apps\u002FLT2\u002Fconfigs\u002F1B\u002Flooped_hybrid_gdn_4to1_1B.yaml \\\n    nodes=8 partition=\u003Cyour_partition>\n\n# Override fields on the command line\ntorchrun --nproc-per-node 8 -m apps.LT2.train \\\n    config=apps\u002FLT2\u002Fconfigs\u002F600M\u002Flooped_hybrid_gdn_4to1_600M.yaml \\\n    model.loop_count=4 \\\n    model.attention_pattern=\"4:1\" \\\n    model.default_sliding_window=2048\n```\n\n### Generation & Evaluation\n\n```bash\n# Free-form generation from a checkpoint\npython -m apps.LT2.generate ckpt=\u002Fpath\u002Fto\u002Fcheckpoint \\\n    max_gen_len=256 temperature=0.7\n\n# Zero-shot evaluation on LM-Evaluation-Harness suites\ntorchrun --nproc-per-node 8 -m apps.LT2.eval \\\n    config=apps\u002FLT2\u002Fconfigs\u002F600M\u002Feval.yaml \\\n    ckpt_dir=\u002Fpath\u002Fto\u002Fcheckpoint\n```\n\n### Long-Context Efficiency Benchmark\n\n`benchmark_prefill.py` measures prefill \u002F decode throughput across sequence lengths:\n\n```bash\npython -m apps.LT2.benchmark_prefill \\\n    --config apps\u002FLT2\u002Fconfigs\u002F1B\u002Flooped_hybrid_gdn_4to1_1B.yaml \\\n    --seq-len 8192\n```\n\nA reference SLURM array script is provided in `apps\u002FLT2\u002Fslurm\u002Fbenchmark_prefill_array.slurm`.\n\n---\n\n## 5. 
Configuration System\n\nAll scripts use [OmegaConf](https:\u002F\u002Fomegaconf.readthedocs.io\u002F) and accept dot-list overrides:\n\n```bash\npython -m apps.LT2.train config=apps\u002FLT2\u002Fconfigs\u002F600M\u002Fdebug.yaml \\\n    model.dim=1024 \\\n    optim.lr=2e-4 \\\n    name=my_run\n```\n\nResolution order: *dataclass defaults → values from `config=...` YAML → command-line overrides.*\n\nA typical `TrainArgs` YAML:\n\n```yaml\ndump_dir: \u002Fpath\u002Fto\u002Fdumpdir\nname: \"lt2_hybrid_gdn_4to1_1B\"\nsteps: 255000\nseed: 777\n\noptim:\n  lr: 3e-4\n  warmup: 5000\n  lr_min_ratio: 1e-6\n  clip: 1.0\n\ndistributed:\n  fsdp_type: full_shard\n  compile: true\n\nmodel:\n  dim: 2048\n  n_layers: 20\n  loop_count: 4\n  mixer: \"gdn\"\n  attention_pattern: \"4:1\"\n\ndata:\n  root_dir: .\u002Fdata\u002Ffineweb_edu\n  batch_size: 4\n  seq_len: 4096\n  tokenizer:\n    name: tiktoken\n    path: .\u002Ftokenizer\u002Foriginal\u002Ftokenizer.model\n```\n\n---\n\n## 6. Dump Directory\n\n```\nexample_dump_dir\u002F\n ├─ checkpoints\u002F\n │   └─ 0000007000\u002F        DCP-format checkpoint + train state\n ├─ code\u002F                  Snapshot of code at launch time\n ├─ logs\u002F                  Per-GPU stdout\u002Fstderr\n ├─ profiling\u002F             Memory + CPU\u002FCUDA traces\n ├─ base_config.yaml\n ├─ metrics.jsonl\n └─ submit.slurm\n```\n\nCheckpoints are stored in `.distcp` format and may be converted to standard PyTorch checkpoints\nvia `torch.distributed.checkpoint.format_utils.dcp_to_torch_save`.\n\n---\n\n## 7. Citation\n\nIf this codebase is useful in your work, please cite the LT2 paper:\n\n```bibtex\n@misc{deng2026lt2lineartimeloopedtransformers,\n      title={LT2: Linear-Time Looped Transformers}, \n      author={Chunyuan Deng and Yizhe Zhang and Rui-Jie Zhu and Yuanyuan Xu and Jiarui Liu and T. S. 
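Eugene Ng and Hanjie Chen},\n      year={2026},\n      eprint={2605.20670},\n      archivePrefix={arXiv},\n      primaryClass={cs.LG},\n      url={https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.20670}, \n}\n```\n","LT2 is a project that implements linear-time looped Transformers, aiming to optimize the standard Transformer architecture with subquadratic token mixers (linear, sparse, and hybrid attention mechanisms). Its core idea is to replace the multi-head attention sub-layer with a more efficient token mixer and to reuse the same parameters several times through looping, which raises the model's effective depth without significantly increasing computational cost. LT2 suits applications that must process sequence data efficiently under limited compute, such as text generation and understanding tasks in natural language processing. The project is built on the Meta Lingua pre-training framework and provides detailed documentation along with an easily extensible research foundation.",2,"2026-06-06 04:03:55","CREATED_QUERY"]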