[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-2590":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":23,"hasPages":23,"topics":25,"createdAt":10,"pushedAt":10,"updatedAt":26,"readmeContent":27,"aiSummary":28,"trendingCount":16,"starSnapshotCount":16,"syncStatus":29,"lastSyncTime":30,"discoverSource":31},2590,"mamba","state-spaces\u002Fmamba","state-spaces","Mamba SSM architecture","",null,"Python",18424,1753,122,517,0,7,57,194,39,113.73,"Apache License 2.0",false,"main",[],"2026-06-12 04:00:14","# Mamba\n\n![Mamba](assets\u002Fselection.png \"Selective State Space\")\n> **Mamba: Linear-Time Sequence Modeling with Selective State Spaces**\\\n> Albert Gu*, Tri Dao*\\\n> Paper: https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.00752\n\n![Mamba-2](assets\u002Fssd_algorithm.png \"State Space Dual Model\")\n> **Transformers are SSMs: Generalized Models and Efficient Algorithms**\\\n>     **Through Structured State Space Duality**\\\n> Tri Dao*, Albert Gu*\\\n> Paper: https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.21060\n\n![Mamba-3](assets\u002Fmamba3.png \"Inference-first State Space Model\")\n> **Mamba-3: Improved Sequence Modeling using State Space Principles**\\\n>     **Through Structured State Space Duality**\\\n> Aakash Lahoti*, Kevin Y. Li*, Berlin Chen*, Caitlin Wang*, Aviv Bick, J. 
Zico Kolter, Tri Dao†, Albert Gu†\\\n> Paper: https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.15569\n\n## About\n\nMamba is a new state space model architecture showing promising performance on information-dense data such as language modeling, where previous subquadratic models fall short of Transformers.\nIt is based on the line of progress on [structured state space models](https:\u002F\u002Fgithub.com\u002Fstate-spaces\u002Fs4),\nwith an efficient hardware-aware design and implementation in the spirit of [FlashAttention](https:\u002F\u002Fgithub.com\u002FDao-AILab\u002Fflash-attention).\n\n## Installation\n\nInstall PyTorch first, then:\n- [Optional] `pip install 'causal-conv1d>=1.4.0' --no-build-isolation`: an efficient implementation of a simple causal Conv1d layer used inside the Mamba block. The quotes keep the shell from treating `>` as an output redirection.\n- `pip install mamba-ssm --no-build-isolation`: the core Mamba package.\n- `pip install 'mamba-ssm[causal-conv1d]' --no-build-isolation`: the core Mamba package together with causal-conv1d.\n\n`--no-build-isolation` is required so that pip uses your existing CUDA-enabled PyTorch instead of installing torch-cpu in an isolated build environment.\n\nNOTE: To use Mamba-3, please install from source: `MAMBA_FORCE_BUILD=TRUE pip install --no-cache-dir --force-reinstall git+https:\u002F\u002Fgithub.com\u002Fstate-spaces\u002Fmamba.git --no-build-isolation`.\n\nOther requirements:\n- Linux\n- NVIDIA GPU\n- PyTorch 1.12+\n- CUDA 11.6+\n\nFor AMD cards, see additional prerequisites below.\n\n## Usage\n\nWe expose several levels of interface to the Mamba model.\n\n### Selective SSM\n\nMamba is based on a selective SSM layer, which is the focus of the paper (Section 3; Algorithm 2).\n\nSource: [ops\u002Fselective_scan_interface.py](mamba_ssm\u002Fops\u002Fselective_scan_interface.py).\n\n### Mamba Block\n\nThe main module of this repository is the Mamba architecture block wrapping the selective SSM.\n\nSource: 
[modules\u002Fmamba_simple.py](mamba_ssm\u002Fmodules\u002Fmamba_simple.py).\n\nUsage:\n``` python\nimport torch\nfrom mamba_ssm import Mamba\n\nbatch, length, dim = 2, 64, 16\nx = torch.randn(batch, length, dim).to(\"cuda\")\nmodel = Mamba(\n    # This module uses roughly 3 * expand * d_model^2 parameters\n    d_model=dim, # Model dimension d_model\n    d_state=16,  # SSM state expansion factor\n    d_conv=4,    # Local convolution width\n    expand=2,    # Block expansion factor\n).to(\"cuda\")\ny = model(x)\nassert y.shape == x.shape\n```\n\n### Mamba-2\n\nThe Mamba-2 block is implemented at [modules\u002Fmamba2.py](mamba_ssm\u002Fmodules\u002Fmamba2.py).\n\nA simpler version is at [modules\u002Fmamba2_simple.py](mamba_ssm\u002Fmodules\u002Fmamba2_simple.py)\n\nThe usage is similar to Mamba(-1):\n``` python\nfrom mamba_ssm import Mamba2\nmodel = Mamba2(\n    # This module uses roughly 3 * expand * d_model^2 parameters\n    d_model=dim, # Model dimension d_model\n    d_state=64,  # SSM state expansion factor, typically 64 or 128\n    d_conv=4,    # Local convolution width\n    expand=2,    # Block expansion factor\n).to(\"cuda\")\ny = model(x)\nassert y.shape == x.shape\n```\n\n#### SSD\n\nA minimal version of the inner SSD module (Listing 1 from the Mamba-2 paper) with conversion between \"discrete\" and \"continuous\" SSM versions\nis at [modules\u002Fssd_minimal.py](mamba_ssm\u002Fmodules\u002Fssd_minimal.py).\n\n### Mamba-3\n\nThe Mamba-3 block is implemented at [modules\u002Fmamba3.py](mamba_ssm\u002Fmodules\u002Fmamba3.py).\n\nThe usage is as follows:\n``` python\nfrom mamba_ssm import Mamba3\nbatch, length, dim = 2, 2048, 768\nx = torch.randn(batch, length, dim).to(torch.bfloat16).to(\"cuda\")\nmodel = Mamba3(\n    # This module uses roughly 6 * d_model^2 parameters\n    d_model=dim, # Model dimension d_model\n    d_state=128,  # SSM state size\n    headdim=64, # SSM headdim\n    is_mimo=True, # Use MIMO mode\n    mimo_rank=4, # MIMO rank when 
is_mimo=True\n    chunk_size=16, # 64\u002Fmimo_rank if x is in bf16, else 32\u002Fmimo_rank\n    is_outproj_norm=False, # Additional post SSM norm\n    dtype=torch.bfloat16,\n).to(\"cuda\")\ny = model(x)\nassert y.shape == x.shape\n```\n\n### Mamba Language Model\n\nFinally, we provide an example of a complete language model: a deep sequence model backbone (with repeating Mamba blocks) + language model head.\n\nSource: [models\u002Fmixer_seq_simple.py](mamba_ssm\u002Fmodels\u002Fmixer_seq_simple.py).\n\nThis is an example of how to integrate Mamba into an end-to-end neural network.\nThis example is used in the generation scripts below.\n\n\n## Pretrained Models\n\nPretrained models are uploaded to\n[Hugging Face](https:\u002F\u002Fhuggingface.co\u002Fstate-spaces): `mamba-130m`, `mamba-370m`,\n`mamba-790m`, `mamba-1.4b`, `mamba-2.8b`, `mamba2-130m`, `mamba2-370m`,\n`mamba2-780m`, `mamba2-1.3b`, `mamba2-2.7b`, `transformerpp-2.7b`, `mamba2attn-2.7b`, trained on 300B tokens on the Pile, as well as `mamba-2.8b-slimpj`\n(trained on 600B tokens on the SlimPajama dataset).\n\n\nThe models will be autodownloaded by the generation script below.\n\nThese models were trained on the [Pile](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FEleutherAI\u002Fpile), and follow the standard model dimensions described by GPT-3 and followed by many open source models:\n\n| Parameters | Layers | Model dim. 
| \n|------------|--------|------------|\n| 130M       | 24     | 768        |\n| 370M       | 48     | 1024       |\n| 790M       | 48     | 1536       |\n| 1.4B       | 48     | 2048       |\n| 2.8B       | 64     | 2560       |\n\n(The layer count of Mamba doubles that of a Transformer with similar size, as two Mamba blocks are needed for each \"layer\" (MHA block + MLP block) of a Transformer.)\n\nNote: these are base models trained only for 300B tokens, without any form of downstream modification (instruction tuning, etc.).\nPerformance is expected to be comparable or better than other architectures trained on similar data, but not to match larger or fine-tuned models.\n\n\n## Evaluations\n\nTo run zero-shot evaluations of models (corresponding to Table 3 of the paper),\nwe use the\n[lm-evaluation-harness](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness)\nlibrary.\n\n1. Install `lm-evaluation-harness` by `pip install lm-eval==0.4.2`.\n2. Run evaluation with (more documentation at the [lm-evaluation-harness](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Ftree\u002Fbig-refactor) repo):\n``` sh\nlm_eval --model mamba_ssm --model_args pretrained=state-spaces\u002Fmamba-130m --tasks lambada_openai,hellaswag,piqa,arc_easy,arc_challenge,winogrande,openbookqa --device cuda --batch_size 256\npython evals\u002Flm_harness_eval.py --model hf --model_args pretrained=EleutherAI\u002Fpythia-160m --tasks lambada_openai,hellaswag,piqa,arc_easy,arc_challenge,winogrande --device cuda --batch_size 64\n```\n\nTo reproduce the results on the `mamba-2.8b-slimpj` model reported in the blogposts:\n``` sh\nlm_eval --model mamba_ssm --model_args pretrained=state-spaces\u002Fmamba-2.8b-slimpj --tasks boolq,piqa,hellaswag,winogrande,arc_easy,arc_challenge,openbookqa,race,truthfulqa_mc2 --device cuda --batch_size 256\nlm_eval --model mamba_ssm --model_args pretrained=state-spaces\u002Fmamba-2.8b-slimpj --tasks mmlu --num_fewshot 5 
--device cuda --batch_size 256\n```\n\nTo run evaluations on Mamba-2 models, simply replace the model names:\n``` sh\nlm_eval --model mamba_ssm --model_args pretrained=state-spaces\u002Fmamba2-2.7b --tasks lambada_openai,hellaswag,piqa,arc_easy,arc_challenge,winogrande,openbookqa --device cuda --batch_size 256\nlm_eval --model mamba_ssm --model_args pretrained=state-spaces\u002Ftransformerpp-2.7b --tasks lambada_openai,hellaswag,piqa,arc_easy,arc_challenge,winogrande,openbookqa --device cuda --batch_size 256\nlm_eval --model mamba_ssm --model_args pretrained=state-spaces\u002Fmamba2attn-2.7b --tasks lambada_openai,hellaswag,piqa,arc_easy,arc_challenge,winogrande,openbookqa --device cuda --batch_size 256\n```\n\nNote that the result of each task might differ from reported values by 0.1-0.3 due to noise in the evaluation process.\n\n## Inference\n\nThe script [benchmarks\u002Fbenchmark_generation_mamba_simple.py](benchmarks\u002Fbenchmark_generation_mamba_simple.py)\n1. autoloads a model from the Hugging Face Hub,\n2. generates completions of a user-specified prompt,\n3. benchmarks the inference speed of this generation.\n\nOther configurable options include the top-p (nucleus sampling) probability, and the softmax temperature.\n\n### Examples\n\nTo test generation latency (e.g. 
batch size = 1) with different sampling strategies:\n\n``` sh\npython benchmarks\u002Fbenchmark_generation_mamba_simple.py --model-name \"state-spaces\u002Fmamba-2.8b\" --prompt \"My cat wrote all this CUDA code for a new language model and\" --topp 0.9 --temperature 0.7 --repetition-penalty 1.2\npython benchmarks\u002Fbenchmark_generation_mamba_simple.py --model-name \"EleutherAI\u002Fpythia-2.8b\" --prompt \"My cat wrote all this CUDA code for a new language model and\" --topp 0.9 --temperature 0.7 --repetition-penalty 1.2\npython benchmarks\u002Fbenchmark_generation_mamba_simple.py --model-name \"state-spaces\u002Fmamba-2.8b\" --prompt \"My cat wrote all this CUDA code for a new language model and\" --minp 0.05 --topk 0 --temperature 0.7 --repetition-penalty 1.2\n```\n\nTo test generation throughput with random prompts (e.g. large batch size):\n``` sh\npython benchmarks\u002Fbenchmark_generation_mamba_simple.py --model-name \"state-spaces\u002Fmamba-2.8b\" --batch 64\npython benchmarks\u002Fbenchmark_generation_mamba_simple.py --model-name \"EleutherAI\u002Fpythia-2.8b\" --batch 64\n```\n\nWith Mamba-2, you just need to change the model name:\n``` sh\npython benchmarks\u002Fbenchmark_generation_mamba_simple.py --model-name \"state-spaces\u002Fmamba2-2.7b\" --prompt \"My cat wrote all this CUDA code for a new language model and\" --topp 0.9 --temperature 0.7 --repetition-penalty 1.2\n```\n\n\n## Troubleshooting\n\n### Precision\nOur models were trained using PyTorch [AMP](https:\u002F\u002Fpytorch.org\u002Fdocs\u002Fstable\u002Famp.html) for mixed precision. AMP keeps model parameters in float32 and casts to half precision when necessary.\nOn the other hand, other frameworks like DeepSpeed store parameters in float16 and upcasts when necessary (e.g. for optimizer accumulation).\n\nWe've observed that higher precision for the main model parameters may be necessary, because SSMs are sensitive to their recurrent dynamics. 
If you experience instabilities,\nfirst try a framework that stores parameters in fp32 (such as AMP).\n\n### Initialization\nSome parts of the model have initializations inherited from prior work on S4 models.\nFor [example](https:\u002F\u002Fgithub.com\u002Fstate-spaces\u002Fmamba\u002Fblob\u002Ff0affcf69f06d1d06cef018ff640bf080a11c421\u002Fmamba_ssm\u002Fmodules\u002Fmamba_simple.py#L102), the $\\Delta$ parameter is given a targeted range by initializing the bias of its linear projection.\nHowever, some frameworks run post-initialization hooks (e.g. setting all bias terms in `nn.Linear` modules to zero), which would overwrite this initialization.\nIf this is the case, you may have to add custom logic that is specific to the training framework (e.g. this [line](https:\u002F\u002Fgithub.com\u002Fstate-spaces\u002Fmamba\u002Fblob\u002Ff0affcf69f06d1d06cef018ff640bf080a11c421\u002Fmamba_ssm\u002Fmodules\u002Fmamba_simple.py#L104) turns off re-initialization in our trainer, but would be a no-op in any other framework).\n\n## Additional Prerequisites for AMD cards\n\n### Patching ROCm\n\nIf you are on ROCm 6.0, run the following steps to avoid errors during compilation. This is not required for ROCm 6.1 onwards.\n\n1. Locate your ROCm installation directory. This is typically found at `\u002Fopt\u002Frocm\u002F`, but may vary depending on your installation.\n\n2. Apply the patch. 
Run with `sudo` if you encounter permission issues.\n   ```bash\n    patch \u002Fopt\u002Frocm\u002Finclude\u002Fhip\u002Famd_detail\u002Famd_hip_bf16.h \u003C rocm_patch\u002Frocm6_0.patch \n   ```\n\n\n## Citation\n\nIf you use this codebase, or otherwise find our work valuable, please cite Mamba:\n```\n@article{mamba,\n  title={Mamba: Linear-Time Sequence Modeling with Selective State Spaces},\n  author={Gu, Albert and Dao, Tri},\n  journal={arXiv preprint arXiv:2312.00752},\n  year={2023}\n}\n\n@inproceedings{mamba2,\n  title={Transformers are {SSM}s: Generalized Models and Efficient Algorithms Through Structured State Space Duality},\n  author={Dao, Tri and Gu, Albert},\n  booktitle={International Conference on Machine Learning (ICML)},\n  year={2024}\n}\n\n@misc{lahoti2026mamba3improvedsequencemodeling,\n      title={Mamba-3: Improved Sequence Modeling using State Space Principles}, \n      author={Aakash Lahoti and Kevin Y. Li and Berlin Chen and Caitlin Wang and Aviv Bick and J. Zico Kolter and Tri Dao and Albert Gu},\n      year={2026},\n      eprint={2603.15569},\n      archivePrefix={arXiv},\n      primaryClass={cs.LG},\n      url={https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.15569}, \n}\n```\n","Mamba is a new state space model architecture that aims to improve performance on information-dense data such as language modeling. Its core features include an efficient, hardware-aware design and implementation based on structured state space models, using selective state spaces to perform sequence modeling in linear time. On the technical side, Mamba adopts efficient computation in the spirit of FlashAttention and supports a causal convolution layer to optimize processing inside the model block. The project is particularly suited to scenarios that need high-performance sequence modeling where traditional subquadratic models fall short, such as long-text generation or understanding tasks in natural language processing. Installation and use depend on a specific hardware and software environment, including PyTorch and an NVIDIA GPU.",2,"2026-06-11 02:50:26","top_language"]