[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-72576":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":23,"hasPages":23,"topics":25,"createdAt":10,"pushedAt":10,"updatedAt":33,"readmeContent":34,"aiSummary":35,"trendingCount":16,"starSnapshotCount":16,"syncStatus":36,"lastSyncTime":37,"discoverSource":38},72576,"MoBA","MoonshotAI\u002FMoBA","MoonshotAI","MoBA: Mixture of Block Attention for Long-Context LLMs","",null,"Python",2128,151,30,10,0,5,6,11,15,72.65,"MIT License",false,"master",[26,27,28,29,30,31,32],"flash-attention","llm","llm-serving","llm-training","moe","pytorch","transformer","2026-06-12 04:01:06","\u003Cp align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.13189\">\u003Cimg width=\"80%\" src=\"figures\u002Fbanner.png\">\u003C\u002Fa>\n\u003C\u002Fp>\n\n# MoBA: Mixture of Block Attention for Long-Context LLMs\n\n\u003Cp align=\"center\">\n  \u003Ca href=\"MoBA_Tech_Report.pdf\">\u003Cimg src=\"figures\u002Flogo.png\" height=\"16\" width=\"16\" style=\"vertical-align:middle\">\u003Cb> Full Report\u003C\u002Fb>\u003C\u002Fa>\n\u003C\u002Fp>\n\n🚀 Introducing **MoBA --- Mixture of Block Attention**\n\n* **Trainable Block Sparse Attention**: The full context is divided into blocks, where each query token learns to attend to the most relevant KV blocks, enabling efficient processing of long sequences.\n* **Parameter-less Gating Mechanism**: A novel parameter-less top-k gating mechanism is introduced to select the most relevant blocks for each query token, ensuring that the model focuses only on the most 
informative blocks.\n* **Seamless Transition between Full and Sparse Attention**: MoBA is designed to be a flexible substitute for full attention, allowing seamless transitions between full and sparse attention modes.\n\n\u003Cp align=\"center\">\n  \u003Cimg width=\"40%\" src=\"figures\u002Frunning_example.png\" style=\"display:inline-block; margin-right:2%\">\n  \u003Cimg width=\"40%\" src=\"figures\u002Fmoba_with_flash_attn.png\" style=\"display:inline-block\">\n\u003C\u002Fp>\n\n> **Note**: MoBA requires continued training of existing models to achieve its acceleration benefits. It is not a drop-in sparse attention solution that can be directly applied to pretrained models without additional training.\n\n## Abstract\nScaling the effective context length is essential for advancing large language models (LLMs) toward artificial general intelligence (AGI). However, the quadratic increase in computational complexity inherent in traditional attention mechanisms presents a prohibitive overhead. Existing approaches either impose strongly biased structures, such as sink or window attention which are task-specific, or radically modify the attention mechanism into linear approximations, whose performance in complex reasoning tasks remains inadequately explored.\n\nIn this work, we propose a solution that adheres to the **“less structure”** principle, allowing the model to autonomously determine where to attend, rather than introducing predefined biases. We introduce Mixture of Block Attention (MoBA), an innovative approach that applies the principles of Mixture of Experts (MoE) to the attention mechanism. This novel architecture demonstrates superior performance on long-context tasks while offering a key advantage: the ability to seamlessly transition between full and sparse attention, enhancing efficiency without the risk of compromising performance. 
MoBA has already been deployed to support Kimi’s long-context requests and demonstrates significant advancements in efficient attention computation for LLMs.\n\nOur code is available at [MoonshotAI\u002FMoBA](https:\u002F\u002Fgithub.com\u002FMoonshotAI\u002FMoBA).\n\n\u003Cp align=\"center\">\n  \u003Cimg width=\"40%\" src=\"figures\u002Fcomputation_time.png\" style=\"display:inline-block; margin-right:2%\">\n\u003C\u002Fp>\n\n### Evaluation with 1M context length\n\n\u003Cp align=\"center\">\n  \u003Cimg width=\"80%\" src=\"figures\u002Fneedle-in-a-haystack.png\">\n\u003C\u002Fp>\n\n\n\n\n## Environment Setup\n**Note that the current kernel implementations rely on `flash-attn==2.6.3` and `torch >= 2.1.0`.**\n\n```bash\nconda create -n moba python=3.10\nconda activate moba\npip install .\n```\n\n## Quick Start\nWe provide a transformers-friendly implementation of MoBA.\n\nUse `--attn` to choose the attention backend: `moba` or `moba_naive`.\n\n```bash\npython3 examples\u002Fllama.py --model meta-llama\u002FLlama-3.1-8B --attn moba\n```\n\n### Implementation Details\n- **moba_naive**: A naive implementation based on attention masks. It is designed to help you understand how MoBA selects the corresponding chunks. You can save and visualize the attention masks to see the block selection process.\n- **moba_efficient**: Our production-ready implementation optimized for performance. It achieves up to 40x speedup compared to moba_naive (tested with 32K sequence length, 1 attention head, MoBA Block 2048 and MoBA Topk 3). 
We recommend using this version for practical applications.\n\n\n## Unit Tests\n```bash\npytest tests\u002Ftest_moba_attn.py\n```\n\n## References\n* Llama Implementation: [huggingface\u002Ftransformers](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftransformers)\n* Flash Attention: [Dao-AILab\u002Fflash-attention](https:\u002F\u002Fgithub.com\u002FDao-AILab\u002Fflash-attention)\n\n\n\n## Citation\nIf you find MoBA useful or want to use it in your projects, please cite our paper:\n```\n@article{lu2025mobamixtureblockattention,\n  author = {Enzhe Lu and Zhejun Jiang and Jingyuan Liu and Yulun Du and Tao Jiang and Chao Hong and Shaowei Liu and Weiran He and Enming Yuan and Yuzhi Wang and Zhiqi Huang and Huan Yuan and Suting Xu and Xinran Xu and Guokun Lai and Yanru Chen and Huabin Zheng and Junjie Yan and Jianlin Su and Yuxin Wu and Yutao Zhang and Zhilin Yang and Xinyu Zhou and Mingxing Zhang and Jiezhong Qiu},\n  title = {MoBA: Mixture of Block Attention for Long-Context LLMs},\n  journal={arXiv preprint arXiv:2502.13189},\n  year={2025}\n}\n```\n","The MoBA project aims to improve the efficiency of long-context large language models through a Mixture of Block Attention mechanism. Its core features include trainable block sparse attention, which divides the full context into blocks and lets each query token learn to attend to the most relevant key-value blocks; a parameter-less selection mechanism that ensures the model focuses only on the most informative blocks; and the ability to switch smoothly between full and sparse attention. These properties make MoBA well suited to applications that need to process long text sequences efficiently, such as document understanding and dialogue systems. Implemented in PyTorch with support for Flash Attention acceleration, the project offers researchers and developers a flexible and efficient way to improve LLM performance on complex reasoning tasks. Note that existing pretrained models require additional fine-tuning to take full advantage of MoBA.",2,"2026-06-11 03:42:39","high_star"]