[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-72415":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":16,"stars7d":17,"stars30d":18,"stars90d":16,"forks30d":16,"starsTrendScore":16,"compositeScore":19,"rankGlobal":10,"rankLanguage":10,"license":20,"archived":21,"fork":21,"defaultBranch":22,"hasWiki":23,"hasPages":21,"topics":24,"createdAt":10,"pushedAt":10,"updatedAt":25,"readmeContent":26,"aiSummary":27,"trendingCount":16,"starSnapshotCount":16,"syncStatus":17,"lastSyncTime":28,"discoverSource":29},72415,"MambaOut","yuweihao\u002FMambaOut","yuweihao","MambaOut: Do We Really Need Mamba for Vision? (CVPR 2025)","",null,"Python",2697,49,8,241,0,2,11,27.1,"Apache License 2.0",false,"main",true,[],"2026-06-12 02:03:03","# [MambaOut: Do We Really Need Mamba for Vision?](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.07992) (CVPR 2025)\n\n\u003Cp align=\"center\">\n\u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.07992\" alt=\"arXiv\">\n    \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2405.07992-b31b1b.svg?style=flat\" \u002F>\u003C\u002Fa>\n\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fwhyu\u002FMambaOut\" alt=\"Hugging Face Spaces\">\n    \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%A4%97%20Hugging%20Face-Spaces-blue\" \u002F>\u003C\u002Fa>\n\u003Ca href=\"https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1DTJRsPczV0pOwmFhEjSWyI2NqQoR_u-K?usp=sharing\" alt=\"Colab\">\n    \u003Cimg src=\"https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg\" \u002F>\u003C\u002Fa>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\u003Cem>In memory of Kobe Bryant\u003C\u002Fem>\u003C\u002Fp>\n\n> \"What can I say, Mamba out.\" — *Kobe Bryant, NBA 
farewell speech, 2016*\n\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Fraw.githubusercontent.com\u002Fyuweihao\u002Fmisc\u002Fmaster\u002FMambaOut\u002Fmamba_out.png\" width=\"400\"> \u003Cbr>\n\u003Csmall>Image credit: https:\u002F\u002Fwww.ebay.ca\u002Fitm\u002F264973452480\u003C\u002Fsmall>\n\u003C\u002Fp>\n\n\nThis is a PyTorch implementation of MambaOut proposed by our paper \"[MambaOut: Do We Really Need Mamba for Vision?](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.07992)\". \n\n## Updates\n* 22 October 2024: Huge thanks to Ross [@rwightman](https:\u002F\u002Fgithub.com\u002Frwightman) for integrating MambaOut into [pytorch-image-models](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fpytorch-image-models) (timm) and developing the mambaout_rw model series. The impressive mambaout_base_plus_rw model (102M params), pretrained solely on ImageNet-12k, \"*is matching or passing accuracy levels of ImageNet-22k pretrained ConvNeXt-Large (~200M params), it's not far from the best 22k trained ViT-Large (DeiT-III, ~300M params)*\". Please see Ross's [article](https:\u002F\u002Fhuggingface.co\u002Fblog\u002Frwightman\u002Fmambaout) for more details.\n\n* 20 May 2024: As suggested by [Issue #5](https:\u002F\u002Fgithub.com\u002Fyuweihao\u002FMambaOut\u002Fissues\u002F5#issuecomment-2119555019), we release **MambaOut-Kobe** model version with **24** Gated CNN blocks, achieving **8**0.0% accuracy on ImageNet. MambaOut-Kobe outperforms ViT-S by 0.2% accuracy with only 41% parameters and 33% FLOPs. See [Models](#models).\n\n* 18 May 2024: Add a [tutorial](https:\u002F\u002Fgithub.com\u002Fyuweihao\u002FMambaOut\u002Fissues\u002F210) on counting Transformer FLOPs (Equation 6 in the paper).\n---\n\n![MambaOut first figure](https:\u002F\u002Fraw.githubusercontent.com\u002Fyuweihao\u002Fmisc\u002Fmaster\u002FMambaOut\u002Fmambaout_first_figure.png)\nFigure 1: (a) Architecture of Gated CNN and Mamba blocks (omitting Normalization and shortcut). 
The Mamba block extends the Gated CNN with an additional state space model (SSM). As will be conceptually discussed in Section 3, SSM is not necessary for image classification on ImageNet. To empirically verify this claim, we stack Gated CNN blocks to build a series of models named MambaOut. (b) MambaOut outperforms visual Mamba models, e.g., Vision Mamba, VMamba and PlainMamba, on ImageNet image classification. \n\n\u003Cbr>\n\n![MambaOut second figure](https:\u002F\u002Fraw.githubusercontent.com\u002Fyuweihao\u002Fmisc\u002Fmaster\u002FMambaOut\u002Fmambaout_second_figure.png)\nFigure 2: The mechanism illustration of causal attention and RNN-like models from a memory perspective, where $x_i$ denotes the input token of the $i$-th step. (a) Causal attention stores all previous tokens' keys $k$ and values $v$ as memory. The memory is updated by continuously adding the current token's key and value, so the memory is lossless, but the downside is that the computational complexity of integrating old memory and current tokens increases as the sequence lengthens. Therefore attention can effectively manage short sequences but may encounter difficulties with longer ones. (b) In contrast, RNN-like models compress previous tokens into a fixed-size hidden state $h$, which serves as the memory. This fixed size means that RNN memory is inherently lossy and cannot directly compete with the lossless memory capacity of attention models. Nonetheless, **RNN-like models can demonstrate distinct advantages in processing long sequences, as the complexity of merging old memory with current input remains constant, regardless of sequence length.**\n\n\u003Cbr>\n\n![MambaOut third figure](https:\u002F\u002Fraw.githubusercontent.com\u002Fyuweihao\u002Fmisc\u002Fmaster\u002FMambaOut\u002Fmambaout_third_figure.png)\nFigure 3: (a) Two modes of token mixing. 
For a total of $T$ tokens, the fully-visible mode allows token $t$ to aggregate inputs from all tokens, i.e., $ \\left\\{ x_i \\right\\}_{i=1}^{T} $, to compute its output $y_t$. In contrast, the causal mode restricts token $t$ to only aggregate inputs from preceding and current tokens $ \\left\\{ x_i \\right\\}_{i=1}^{t} $. By default, attention operates in fully-visible mode but can be adjusted to causal mode with causal attention masks. RNN-like models, such as Mamba's SSM, inherently operate in causal mode due to their recurrent nature. (b) **We modify the ViT's attention from fully-visible to causal mode and observe performance drop on ImageNet, which indicates causal mixing is unnecessary for understanding tasks.**\n\n\n\n## Requirements\nPyTorch and timm 0.6.11 (`pip install timm==0.6.11`).\n\nData preparation: ImageNet with the following folder structure, you can extract ImageNet by this [script](https:\u002F\u002Fgist.github.com\u002FBIGBALLON\u002F8a71d225eff18d88e469e6ea9b39cef4).\n\n```\n│imagenet\u002F\n├──train\u002F\n│  ├── n01440764\n│  │   ├── n01440764_10026.JPEG\n│  │   ├── n01440764_10027.JPEG\n│  │   ├── ......\n│  ├── ......\n├──val\u002F\n│  ├── n01440764\n│  │   ├── ILSVRC2012_val_00000293.JPEG\n│  │   ├── ILSVRC2012_val_00002138.JPEG\n│  │   ├── ......\n│  ├── ......\n```\n\n\n## Models\n### MambaOut trained on ImageNet\n| Model | Resolution | Params | MACs | Top1 Acc | Log |\n| :---     |   :---:    |  :---: |  :---:  |  :---:  |  :---:  |\n| [mambaout_femto](https:\u002F\u002Fgithub.com\u002Fyuweihao\u002FMambaOut\u002Freleases\u002Fdownload\u002Fmodel\u002Fmambaout_femto.pth) | 224 | 7.3M | 1.2G | 78.9 | [log](https:\u002F\u002Fgithub.com\u002Fyuweihao\u002FMambaOut\u002Freleases\u002Fdownload\u002Fmodel\u002Fmambaout_femto.csv) |\n| [mambaout_kobe](https:\u002F\u002Fgithub.com\u002Fyuweihao\u002FMambaOut\u002Freleases\u002Fdownload\u002Fmodel\u002Fmambaout_kobe.pth)\\* | 224 | 9.1M | 1.5G | 80.0 | 
[log](https:\u002F\u002Fgithub.com\u002Fyuweihao\u002FMambaOut\u002Freleases\u002Fdownload\u002Fmodel\u002Fmambaout_kobe.csv) |\n| [mambaout_tiny](https:\u002F\u002Fgithub.com\u002Fyuweihao\u002FMambaOut\u002Freleases\u002Fdownload\u002Fmodel\u002Fmambaout_tiny.pth) | 224 | 26.5M | 4.5G | 82.7 | [log](https:\u002F\u002Fgithub.com\u002Fyuweihao\u002FMambaOut\u002Freleases\u002Fdownload\u002Fmodel\u002Fmambaout_tiny.csv) |\n| [mambaout_small](https:\u002F\u002Fgithub.com\u002Fyuweihao\u002FMambaOut\u002Freleases\u002Fdownload\u002Fmodel\u002Fmambaout_small.pth) | 224 | 48.5M | 9.0G | 84.1 | [log](https:\u002F\u002Fgithub.com\u002Fyuweihao\u002FMambaOut\u002Freleases\u002Fdownload\u002Fmodel\u002Fmambaout_small.csv) |\n| [mambaout_base](https:\u002F\u002Fgithub.com\u002Fyuweihao\u002FMambaOut\u002Freleases\u002Fdownload\u002Fmodel\u002Fmambaout_base.pth) | 224 | 84.8M | 15.8G | 84.2 | [log](https:\u002F\u002Fgithub.com\u002Fyuweihao\u002FMambaOut\u002Freleases\u002Fdownload\u002Fmodel\u002Fmambaout_base.csv) |\n\n\\* [Kobe Memorial Version](https:\u002F\u002Fgithub.com\u002Fyuweihao\u002FMambaOut\u002Fissues\u002F5#issuecomment-2119555019) with 24 Gated CNN blocks. \n\n#### Usage\nWe also provide a Colab notebook which runs the steps to perform inference with MambaOut: [![Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1DTJRsPczV0pOwmFhEjSWyI2NqQoR_u-K?usp=sharing).\n\n## Gradio demo\nA web demo is shown at [![Hugging Face Spaces](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fwhyu\u002FMambaOut). You can also easily run gradio demo locally. 
Besides PyTorch and timm==0.6.11, please install gradio by `pip install gradio`, then run\n```bash\npython gradio_demo\u002Fapp.py\n```\n\n## Validation\n\nTo evaluate models, run:\n\n```bash\nMODEL=mambaout_tiny\npython3 validate.py \u002Fpath\u002Fto\u002Fimagenet  --model $MODEL -b 128 \\\n  --pretrained\n```\n\n## Train\nWe use a batch size of 4096 by default and show how to train models with 8 GPUs. For multi-node training, adjust `--grad-accum-steps` according to your setup.\n\n\n```bash\nDATA_PATH=\u002Fpath\u002Fto\u002Fimagenet\nCODE_PATH=\u002Fpath\u002Fto\u002Fcode\u002FMambaOut # modify code path here\n\n\nALL_BATCH_SIZE=4096\nNUM_GPU=8\nGRAD_ACCUM_STEPS=4 # Adjust according to your GPU numbers and memory size.\nlet BATCH_SIZE=ALL_BATCH_SIZE\u002FNUM_GPU\u002FGRAD_ACCUM_STEPS\n\n\nMODEL=mambaout_tiny \nDROP_PATH=0.2\n\n\ncd $CODE_PATH && sh distributed_train.sh $NUM_GPU $DATA_PATH \\\n--model $MODEL --opt adamw --lr 4e-3 --warmup-epochs 20 \\\n-b $BATCH_SIZE --grad-accum-steps $GRAD_ACCUM_STEPS \\\n--drop-path $DROP_PATH # --native-amp # can also use --native-amp or --amp to accelerate training\n```\nTraining scripts of other models are shown in [scripts](\u002Fscripts\u002F).\n\n\n## Tutorial on counting Transformer FLOPs\nThis [tutorial](https:\u002F\u002Fgithub.com\u002Fyuweihao\u002FMambaOut\u002Fissues\u002F210) shows how to count Transformer FLOPs (Equation 6 in the paper). Feedback is welcome, and I will continue to improve it.\n\n\n## Bibtex\n```\n@inproceedings{yu2025mambaout,\n  title={MambaOut: Do We Really Need Mamba for Vision?},\n  author={Yu, Weihao and Wang, Xinchao},\n  booktitle={Proceedings of the IEEE\u002FCVF Conference on Computer Vision and Pattern Recognition},\n  year={2025}\n}\n```\n\n## Acknowledgment\nWeihao was partly supported by Snap Research Fellowship, Google TPU Research Cloud (TRC), and Google Cloud Research Credits program. 
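We thank Dongze Lian, Qiuhong Shen, Xingyi Yang, and Gongfan Fang for valuable discussions.\n\nOur implementation is based on [pytorch-image-models](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fpytorch-image-models), [poolformer](https:\u002F\u002Fgithub.com\u002Fsail-sg\u002Fpoolformer), [ConvNeXt](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FConvNeXt), [metaformer](https:\u002F\u002Fgithub.com\u002Fsail-sg\u002Fmetaformer) and [inceptionnext](https:\u002F\u002Fgithub.com\u002Fsail-sg\u002Finceptionnext).\n","MambaOut is a computer vision model implemented in PyTorch that examines whether the complex Mamba architecture is really needed for vision tasks. Its core idea is to build a series of simplified models from Gated CNN blocks that match or even exceed the ImageNet classification accuracy of existing, more complex models (such as ViT-S) with fewer parameters and less computation. Technical features include an efficient Gated CNN design and an optimized method for computing Transformer FLOPs. MambaOut is especially suited to applications that need efficient performance while keeping high accuracy, such as image recognition on mobile devices or in edge computing environments.","2026-06-11 03:41:58","high_star"]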