[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-80672":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":15,"stars7d":16,"stars30d":17,"stars90d":15,"forks30d":15,"starsTrendScore":15,"compositeScore":18,"rankGlobal":9,"rankLanguage":9,"license":9,"archived":19,"fork":19,"defaultBranch":20,"hasWiki":19,"hasPages":19,"topics":21,"createdAt":9,"pushedAt":9,"updatedAt":22,"readmeContent":23,"aiSummary":24,"trendingCount":15,"starSnapshotCount":15,"syncStatus":25,"lastSyncTime":26,"discoverSource":27},80672,"lighthouse-attention","ighoshsubho\u002Flighthouse-attention","ighoshsubho","Long Context Pre-Training with Lighthouse Attention",null,"Python",55,12,48,1,0,4,7,3.34,false,"main",[],"2026-06-12 02:04:05","# Lighthouse Attention\n\n**Paper:** [*Long Context Pre-Training with Lighthouse Attention*](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2605.06554v1)\n\n**Original implementation** of *Lighthouse Attention*: a\nselection-based hierarchical attention mechanism for training large language\nmodels at very long context. This is the codebase used to produce all\nresults in the paper.\n\nThis repository ships Lighthouse as a single patch on top of\n[pytorch\u002Ftorchtitan][upstream] plus the two Lighthouse-specific\nsource files. 
The patch wires in selection, three scorer variants\n(`norm`, `dilated`, `gla`), and an optional context-parallel (CP) path,\nwith the scorer chosen per-config &mdash; no edits to `model.py` required.\n\n[upstream]: https:\u002F\u002Fgithub.com\u002Fpytorch\u002Ftorchtitan\n\n## Layout\n\n```\nlighthouse-attention\u002F\n├── README.md                       this file\n├── requirements.txt                pinned versions\n├── lighthouse-attention.patch      one patch, applies on torchtitan @ 61c25f8d\n├── src\u002F\n│   ├── lighthouse_selection.py     drop into torchtitan\u002Fmodels\u002Fllama3\u002Fmodel\u002F\n│   └── lighthouse_selection_cuda.py\n└── configs\u002F\n    ├── topk\u002F      vary top-K  (1536, 2048, 3072, 4096, 6144) at p=4, L=3\n    ├── pool\u002F      vary pool   (p=2, 4, 8)                    at k=1536, L=3\n    ├── levels\u002F    vary levels (L=3, 4, 5)                    at k=1536, p=2\n    ├── scorer\u002F    norm | dilated | gla                       at k=2048, p=4, L=3\n    └── cp\u002F        CP=2 \u002F DP=4 demo                           at k=1536, p=4, L=3 (norm)\n```\n\n## Tested versions\n\n```\ntorch          2.11.0+cu128\nCUDA           12.8\ncuDNN          9.19.0\nGPU            NVIDIA B200 (sm_100)\nupstream sha   61c25f8d   (pytorch\u002Ftorchtitan @ main)\n```\n\n## Apply\n\n1. Clone the upstream torchtitan and check out the tested commit:\n\n   ```bash\n   git clone https:\u002F\u002Fgithub.com\u002Fpytorch\u002Ftorchtitan.git\n   cd torchtitan\n   git checkout 61c25f8d\n   ```\n\n2. Drop in the two Lighthouse source files (the patch does not carry these):\n\n   ```bash\n   cp \u002Fpath\u002Fto\u002Flighthouse-attention\u002Fsrc\u002Flighthouse_selection.py      torchtitan\u002Fmodels\u002Fllama3\u002Fmodel\u002F\n   cp \u002Fpath\u002Fto\u002Flighthouse-attention\u002Fsrc\u002Flighthouse_selection_cuda.py torchtitan\u002Fmodels\u002Fllama3\u002Fmodel\u002F\n   ```\n\n3. 
Apply the patch:\n\n   ```bash\n   git apply \u002Fpath\u002Fto\u002Flighthouse-attention\u002Flighthouse-attention.patch\n   ```\n\n4. Install the requirements (Python 3.13, CUDA 12.8 toolkit on the host):\n\n   ```bash\n   python3.13 -m venv .venv && source .venv\u002Fbin\u002Factivate\n   pip install -r \u002Fpath\u002Fto\u002Flighthouse-attention\u002Frequirements.txt\n   pip install -e . --no-deps\n   ```\n\n   `requirements.txt` already pins the PyTorch CUDA-12.8 stable index\n   (`https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu128`) via `--extra-index-url`, so\n   `torch==2.11.0+cu128` resolves without any extra flags.\n\n   `flash-linear-attention` is only needed if you select\n   `lighthouse_scorer = \"gla\"`. For `norm` (default) or `dilated`, you can\n   leave that line out.\n\n## What the patch changes\n\n| File                                          | Hunk |\n|-----------------------------------------------|------|\n| `torchtitan\u002Fmodels\u002Fllama3\u002Fmodel\u002Fargs.py`      | Adds `dilation`, `hidden_dim`, `use_selection_lighthouse`, `use_lighthouse_cp`, `lighthouse_num_levels`, `lighthouse_pooling_factor`, `lighthouse_topk`, `lighthouse_scorer`, `lighthouse_full_attn_layers` to `TransformerModelArgs`. |\n| `torchtitan\u002Fmodels\u002Fllama3\u002Fmodel\u002Fmodel.py`     | `_build_lighthouse_scorer(...)` dispatches on `lighthouse_scorer ∈ {norm, dilated, gla}` and refuses non-`norm` under CP. Wires the gate projection (`wg`) for the GLA path. FFN now honors explicit `hidden_dim` when set. |\n| `torchtitan\u002Fmodels\u002Fllama3\u002F__init__.py`        | Registers ~26 Lighthouse ablation flavors (`ablation_270m_lighthouse_topk*_*`) covering the (k, p, L) grid in the paper, plus dim-matched dense (`*_sdpa`) flavors for the SDPA-resume stage. 
|\n| `torchtitan\u002Fmodels\u002Fllama3\u002Finfra\u002Fparallelize.py` | `apply_compile` now uses `compile_config.fullgraph` so the Lighthouse path can compile each `TransformerBlock` with graph breaks allowed (`@torch.compiler.disable` on the scorers requires this). |\n| `torchtitan\u002Fdistributed\u002Futils.py`             | `create_context_parallel_ctx(..., enable_load_balance=True)` knob so the CP path can opt out of load-balancing. |\n| `torchtitan\u002Fhf_datasets\u002Ftext_datasets.py`     | Registers a `c4_local` dataset entry for an on-disk C4 mirror. |\n| `torchtitan\u002Ftrain.py`                         | When CP is enabled and the model has Lighthouse-CP modules, calls `set_cp_info(rank, world_size, cp_group)` once and threads `enable_load_balance=is_lighthouse_cp` through the CP context. |\n| `torchtitan\u002Fconfig\u002Fjob_config.py`             | Adds `fullgraph: bool = False` to the `Compile` dataclass. |\n\nThe two new files (`lighthouse_selection.py`, `lighthouse_selection_cuda.py`) live in `src\u002F` and are copied in step 2 above.\n\n## Selecting the scorer\n\nThe default scorer is `norm`. To switch, set `lighthouse_scorer` on the\nflavor in `torchtitan\u002Fmodels\u002Fllama3\u002F__init__.py`:\n\n```python\n\"my_dilated_run\": TransformerModelArgs(\n    dim=1024, n_layers=30, hidden_dim=1536, n_heads=8, n_kv_heads=8,\n    rope_theta=10000,\n    use_selection_lighthouse=True,\n    lighthouse_num_levels=3,\n    lighthouse_pooling_factor=4,\n    lighthouse_topk=2048,\n    lighthouse_scorer=\"dilated\",       # \u003C-- override here  (or \"norm\" \u002F \"gla\")\n    dilation=4,\n    lighthouse_full_attn_layers=[0, 1, 28, 29],\n),\n```\n\nThen point your toml at the new flavor:\n\n```toml\n[model]\nflavor = \"my_dilated_run\"\n```\n\nThe CP path explicitly refuses anything other than `norm` at construction\ntime:\n\n```\nValueError: lighthouse_scorer='dilated' is not supported under context\nparallelism. 
The CP path was validated only for 'norm'; ...\n```\n\n## Running a config\n\nThe configs in `configs\u002F` use placeholder paths (`\u003CDUMP_FOLDER>`,\n`\u003CHF_ASSETS_PATH>`, `\u003CCHECKPOINT_FOLDER>`). Replace them in place or via\n`sed` before launching:\n\n```bash\ncd torchtitan\nsed -e 's|\u003CDUMP_FOLDER>|\u002Fscratch\u002Fruns\u002Ftopk1536|' \\\n    -e 's|\u003CHF_ASSETS_PATH>|\u002Fscratch\u002Ftokenizer\u002Fbytes|' \\\n    -e 's|\u003CCHECKPOINT_FOLDER>|\u002Fscratch\u002Fckpts\u002Ftopk1536|' \\\n    \u002Fpath\u002Fto\u002Flighthouse-attention\u002Fconfigs\u002Ftopk\u002Ftopk1536.toml \\\n    > \u002Ftmp\u002Frun.toml\ntorchrun --nproc-per-node 8 .\u002Ftorchtitan\u002Ftrain.py --job.config_file \u002Ftmp\u002Frun.toml\n```\n\nEach config sets `[training] steps = 10000` to match the Stage-1 Lighthouse\nphase from the paper. For the SDPA-resume continuation, point a second toml\nat the same `[checkpoint] folder` with `[training] steps = 16000` and the\ndim-matched dense flavor (`ablation_270m_topk*_pool*_lvl*_sdpa`) that the\npatch registers alongside each lighthouse flavor.\n\n## Context-parallel run\n\n```bash\ntorchrun --nnodes 1 --nproc-per-node 8 .\u002Ftorchtitan\u002Ftrain.py \\\n    --job.config_file \u002Fpath\u002Fto\u002Flighthouse-attention\u002Fconfigs\u002Fcp\u002Fnorm_cp2_dp4.toml\n```\n\nThe toml sets `context_parallel_degree = 2`. The patch wires\n`set_cp_info(...)` automatically the first time `forward_backward_step`\nruns under CP, and the train loop uses `enable_load_balance=True` for the\nring-attention path while the Lighthouse selection runs shard-locally.\n","Lighthouse Attention is a selection-based hierarchical attention mechanism for pre-training large language models at very long context. Its core features are three scorer variants (norm, dilated, and gla) and an optional context-parallel path. It is applied as a single patch on top of pytorch\u002Ftorchtitan, with no need to modify the main model file. It suits natural language processing tasks that involve long text inputs, particularly research and development work that has limited compute but still aims to improve model performance.",2,"2026-06-11 04:01:35","CREATED_QUERY"]