[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-82305":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":14,"stars7d":16,"stars30d":17,"stars90d":15,"forks30d":15,"starsTrendScore":12,"compositeScore":18,"rankGlobal":9,"rankLanguage":9,"license":19,"archived":20,"fork":20,"defaultBranch":21,"hasWiki":22,"hasPages":20,"topics":23,"createdAt":9,"pushedAt":9,"updatedAt":24,"readmeContent":25,"aiSummary":26,"trendingCount":15,"starSnapshotCount":15,"syncStatus":27,"lastSyncTime":28,"discoverSource":29},82305,"Parallax","Yifei-Zuo\u002FParallax","Yifei-Zuo","Official repository for Parallax (Parameterized Local Linear Attention)",null,"Python",60,5,29,1,0,6,27,50.03,"MIT License",false,"main",true,[],"2026-06-12 04:01:37","# Parallax: Parameterized Local Linear Attention\n\n[![arXiv](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2605.29157-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.29157)\n[![HF Papers](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FHuggingFace-Papers-FFD21E.svg)](https:\u002F\u002Fhuggingface.co\u002Fpapers\u002F2605.29157)\n[![License](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Flicense-MIT-green.svg)](LICENSE)\n\nThis repository provides the official implementation of Parallax from the following paper:\n\n> **Parallax: Parameterized Local Linear Attention.**\u003C\u002Fbr>\n> Yifei Zuo, Dhruv Pai, Zhichen Zeng, Alec Dewulf, Shuming Hu, and Zhaoran Wang.\n> arXiv preprint, 2026.\n\nParallax is an upgrade to Softmax Attention. It is a scalable form of Local Linear Attention (LLA), a mechanism with provable theoretical advantages over Softmax Attention (see [FlashLLA](https:\u002F\u002Fgithub.com\u002FYifei-Zuo\u002FFlashLLA) for the LLA kernels). 
Parallax and LLA are **not** linear complexity attention mechanisms. They share the computational structure of Softmax Attention and require KV cache for decoding. Optimizations such as sliding window and block-sparsity are structurally compatible with Parallax.\n\n## Install\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FYifei-Zuo\u002FParallax.git\ncd Parallax\n\nuv sync\n\n# Or with pip:\npip install torch==2.9.1 --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu126\npip install -e .\n```\n\nFor the bench harness:\n\n```bash\nuv sync --extra bench\n```\n\n## Quickstart\n\n> Note: our current kernels are developed and tested on NVIDIA Hopper GPUs.\n> A reference PyTorch implementation is provided in `parallax\u002Freference.py` for correctness verification and as a starting point for custom implementations on other hardware.\n\n### Training (Triton)\n\n```python\nimport torch\nfrom parallax import parallax_func\n\nB, H, L, D = 2, 8, 1024, 128\nq = torch.randn(B, H, L, D, device=\"cuda\", dtype=torch.bfloat16, requires_grad=True)\nr = torch.randn(B, H, L, D, device=\"cuda\", dtype=torch.bfloat16, requires_grad=True)\nk = torch.randn(B, H, L, D, device=\"cuda\", dtype=torch.bfloat16, requires_grad=True)\nv = torch.randn(B, H, L, D, device=\"cuda\", dtype=torch.bfloat16, requires_grad=True)\n\no = parallax_func(q, r, k, v) # (B, H, L, D), causal\no.float().pow(2).mean().backward()\n```\n\n### Decoding (CuTeDSL)\n\n```python\nimport math\nimport torch\nfrom parallax import parallax_decode\n\nB, H, D = 4, 8, 128\nkv_len = 4096\nq = torch.randn(B, 1, H, D, device=\"cuda\", dtype=torch.bfloat16)\nr = torch.randn_like(q)\nk = torch.randn(B, kv_len, H, D, device=\"cuda\", dtype=torch.bfloat16)\nv = torch.randn_like(k)\n\no = parallax_decode(q, r, k, v, qk_scale=1.0 \u002F math.sqrt(D)) # (B, 1, H, D)\n```\n\n## Benchmark\n\n`scripts\u002Fbench_decode.py` benchmarks the decode kernel against FA2 and\nFA3 with combined speed + precision 
reporting:\n\n```bash\npython scripts\u002Fbench_decode.py                       # example sweep\npython scripts\u002Fbench_decode.py --include-fa3         # add the FA3 column\npython scripts\u002Fbench_decode.py --parallax-grid \\\n                               --csv runs\u002Fbench.csv  # 216-shape grid, save to CSV\n```\n\nThe numbers below are measured on a single NVIDIA H200 SXM (132 SMs)\nwith bf16 inputs and head dimension `D = 128`. Latency is the q50\nover a CUDA-graph replay sweep (`q05` and `q95` are within ±1% on\nevery row). Accuracy is the worst per-element relative error against\nthe fp32 torch reference (`parallax.parallax_reference`).\n\n**Small batch (B = 1, H = 8, D = 128)**\n\n| L | FA2 (µs) | FA3 (µs) | Parallax (µs) | Parallax max-rel-err |\n|---:|---:|---:|---:|---:|\n|   512 |  8.38 | 10.64 | **5.79** | 2.1e-3 |\n|  1024 |  9.45 |  9.10 | **6.48** | 4.0e-3 |\n|  4096 | 17.07 | 11.90 | **8.61** | 2.0e-3 |\n| 16384 | 29.82 | 24.46 | **21.53** | 2.7e-3 |\n\n**Large batch (B = 32, H = 8, D = 128)**\n\n| L | FA2 (µs) | FA3 (µs) | Parallax (µs) | Parallax max-rel-err |\n|---:|---:|---:|---:|---:|\n|   512 |   27.73 |   **23.48** |    24.02 | 3.6e-3 |\n|  1024 |   99.73 |   **39.16** |    39.55 | 3.4e-3 |\n|  4096 |  384.90 |    281.64  | **279.96** | 3.6e-3 |\n| 16384 | 1574.94 |   1096.76  | **1094.37** | 3.2e-3 |\n\nReproduce the small-batch table with:\n\n```bash\npython scripts\u002Fbench_decode.py --include-fa3 \\\n    --shape 1,512,8,128  --shape 1,1024,8,128 \\\n    --shape 1,4096,8,128 --shape 1,16384,8,128 \\\n    --warmup 100 --iters 50 --trials 20\n```\n\nReproduce the large-batch table with:\n\n```bash\npython scripts\u002Fbench_decode.py --include-fa3 \\\n    --shape 32,512,8,128  --shape 32,1024,8,128 \\\n    --shape 32,4096,8,128 --shape 32,16384,8,128 \\\n    --warmup 100 --iters 50 --trials 20\n```\n\n## Citation\n\n```bibtex\n@misc{zuo2026parallaxparameterizedlocallinear,\n      title={Parallax: Parameterized Local Linear 
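Attention for Language Modeling}, \n      author={Yifei Zuo and Dhruv Pai and Zhichen Zeng and Alec Dewulf and Shuming Hu and Zhaoran Wang},\n      year={2026},\n      eprint={2605.29157},\n      archivePrefix={arXiv},\n      primaryClass={cs.LG},\n      url={https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.29157}, \n}\n```\n\n## License\n\nMIT. See [LICENSE](LICENSE).\n","Parallax is a project that implements Parameterized Local Linear Attention, a mechanism intended to improve on standard Softmax Attention. It shares the computational structure of Softmax Attention and requires a KV cache for decoding; it is not a linear-complexity attention mechanism. Optimizations such as sliding window and block sparsity are structurally compatible with Parallax. The project targets workloads that need higher processing speed while preserving model accuracy, such as large-scale text generation, machine translation, and other natural language processing tasks. The code is written in Python, and the current kernels are developed and tested on NVIDIA Hopper GPUs.",2,"2026-06-11 04:08:18","CREATED_QUERY"]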