[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-72560":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":23,"hasPages":23,"topics":25,"createdAt":10,"pushedAt":10,"updatedAt":29,"readmeContent":30,"aiSummary":31,"trendingCount":16,"starSnapshotCount":16,"syncStatus":17,"lastSyncTime":32,"discoverSource":33},72560,"VLA-Adapter","OpenHelix-Team\u002FVLA-Adapter","OpenHelix-Team","VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model","https:\u002F\u002Fvla-adapter.github.io\u002F",null,"Python",2200,199,30,32,0,2,10,38,6,71.7,"MIT License",false,"main",[26,27,28],"embodied-ai","robotics","vision-language-action-model","2026-06-12 04:01:06","\u003Cdiv align=\"center\">\n  \u003Cimg src=\"figure\u002FLOGO2.png\" width=\"70%\" style=\"vertical-align:-7px;\" \u002F>\n\n\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPaper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2509.09372) [![Hugging Face Collection](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FModels-fcd022?style=for-the-badge&logo=huggingface&logoColor=white)](https:\u002F\u002Fhuggingface.co\u002FVLA-Adapter) [![Twitter](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FAK-%23000000.svg?style=for-the-badge&logo=x&logoColor=white)](https:\u002F\u002Fx.com\u002F_akhaliq\u002Fstatus\u002F1966610780838621241) 
[![WeChat](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FWeChat--Group-07C160?style=for-the-badge&logo=wechat&logoColor=white)](https:\u002F\u002Fgithub.com\u002FOpenHelix-Team\u002FVLA-Adapter\u002Fissues\u002F1)\n\n\u003C\u002Fdiv>\n\n### The official implementation of **VLA-Adapter**.\n\u003Cbr\u002F>\n\n\u003Cdiv id=\"top\" align=\"center\">\n\u003Cp align=\"center\">\n\u003Cimg src=figure\u002FFramework.png width=90% \u002F>\n\u003C\u002Fp>\n\u003C\u002Fdiv>\n\n> **📝 Paper: https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.09372**\u003Cbr\u002F>\n> **🌍 Project page: https:\u002F\u002Fvla-adapter.github.io\u002F**\u003Cbr\u002F>\n> **🤗 HuggingFace: https:\u002F\u002Fhuggingface.co\u002FVLA-Adapter**\u003Cbr\u002F>\n> **GitHub: https:\u002F\u002Fgithub.com\u002FOpenHelix-Team\u002FVLA-Adapter**\n\n\u003Cbr\u002F>\n\n## :loudspeaker: News!\n- **[2026\u002F03\u002F16]** We added **real-world ALOHA deployment** support, verified on [Cobot Magic](https:\u002F\u002Fglobal.agilex.ai\u002Fproducts\u002Fcobot-magic). See [`experiments\u002Frobot\u002Faloha\u002F`](experiments\u002Frobot\u002Faloha\u002F) for details.\n- **[2025\u002F09\u002F22]** We released our code! An enhanced **Pro** version is also released (this version conforms to the pipeline in the original paper, but is optimized in implementation). Everyone is welcome to use it!🎉\n- **[2025\u002F09\u002F13]** Our paper won 🥇**first place** in the [daily list](https:\u002F\u002Fhuggingface.co\u002Fpapers\u002Fdate\u002F2025-09-12), 🥈**second place** in the [weekly list](https:\u002F\u002Fhuggingface.co\u002Fpapers\u002Fweek\u002F2025-W37), and 🥉**third place** in the [Monthly list](https:\u002F\u002Fhuggingface.co\u002Fpapers\u002Fmonth\u002F2025-09) in HF! ⭐\n- **[2025\u002F09\u002F13]** Our paper was listed in the [Trending Paper](https:\u002F\u002Fhuggingface.co\u002Fpapers\u002Ftrending) in HF! 
⭐\n- **[2025\u002F09\u002F12]** We released the original version of the VLA-Adapter for four LIBERO models on [HuggingFace](https:\u002F\u002Fhuggingface.co\u002FVLA-Adapter).\n- **[2025\u002F09\u002F11]** We released our paper on [ArXiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.09372).\n\n\u003Cbr\u002F>\n\n## :black_nib: TODO List\u003Ca name=\"todo\">\u003C\u002Fa>\n\n- [x]  Release **checkpoints** for reproduction.\n- [x]  Release [VLA-Adapter v2 paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.09372).\n- [ ]  A more **powerful version**, **VLA-Adapter++**, and a detailed **technical report** 📝 will be released soon.\u003Cbr\u002F>\n- [x]  **ALOHA real-world deployment** on [Cobot Magic](https:\u002F\u002Fglobal.agilex.ai\u002Fproducts\u002Fcobot-magic) — training, server-client inference, and evaluation ([details](experiments\u002Frobot\u002Faloha\u002F)).\u003Cbr\u002F>\n- [ ]  Continue to update the code to adapt to various **real-world systems** deployments, including the configuration of our paper, Franka, UR-5, and AGILE Piper.\u003Cbr\u002F>\n- [ ]  It will soon be compatible with **various foundation models**, including but not limited to [VPP](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.14803), [π0.5](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.16054).\u003Cbr\u002F>\n- [ ]  We will update the **diffusion transformers** and **flow matching** policy networks in the future, and the results will be updated in the subsequent VLA-Adapter++ technical report.\n- [ ]  We will also update and give more experiments on **Frozen backbone**.\n- [ ]  We will expand its **generalization** further in the future. Work is in progress! So please stay tuned!\n- [ ]  **RL post-training** is also in progress. 
Interested researchers are welcome to join us in building this foundation!\n- [ ]  **The dual-system compatibility** of VLA-Adapter is under exploration!\n\n\n\u003Cbr\u002F>\n\n## 🌟 Table of Contents\n\n- [:rocket: Quick Start](#rocket-quick-start) \n  - [Conda Environment of VLA-Adapter](#conda-environment-of-vla-adapter)\n  - [Install Dependencies](#install-dependencies)\n- [:pencil: Data Preparation](#pencil-data-preparation) \n  - [LIBERO Benchmark](#libero-benchmark)\n  - [CALVIN Benchmark](#calvin-benchmark)\n  - [:video_game: Our Dependencies](#video_game-our-dependencies)\n  - [:pushpin: Benchmark Location](#pushpin-benchmark-location)\n- [⚓ VLM backbone](#vlm)\n- [:fire: Training for Different Configurations](#fire-training-for-different-configurations) &emsp; => Provides **training configurations** for GPUs ranging from **10GB** to **80GB** of VRAM.\n  - [:books: Related File for Training](#books-related-file-for-training)\n  - [:ledger: How to Train on Extremely Limited VRAM GPUs](#ledger-how-to-train-on-extremely-limited-vram-gpus) &emsp; => A card with 10GB-12GB *(e.g. NVIDIA GeForce RTX 2080Ti, 3060, 3080, 4070, 4080, and 5070)*\n  - [:ledger: How to Train on Low VRAM GPUs](#ledger-how-to-train-on-low-vram-gpus) &emsp; => A card with 24GB *(e.g. NVIDIA GeForce RTX 3090 and 4090)*\n  - [:ledger: How to Train on Larger VRAM GPUs](#ledger-how-to-train-on-larger-vram-gpus) &emsp; => A Consumer GPU with 32GB *(e.g. NVIDIA GeForce RTX 5090)* &emsp; A Professional-Grade GPU with 40GB-48GB *(e.g. NVIDIA A100-40GB, A800-40GB, L20, and RTX A6000).*\n  - [:ledger: How to Train on Sufficient VRAM GPUs](#ledger-how-to-train-on-sufficient-vram-gpus) &emsp; => Professional-Grade GPUs with ≥80GB *(e.g. 
NVIDIA A100-80GB, A800-80GB, H100, H800, H20-NVLink, and GB200).*\n- [:mechanical_arm: Inference](#mechanical_arm-inference)\n  - [:books: Related File for Inference](#books-related-file-for-inference)\n  - [🤗 Checkpoint of VLA-Adapter](#ckpts)\n  - [:notebook: How to Eval](#evals)\n- [🌈 Success Rate Comparison](#results)\n- [📝 Citation](#cite)\n- [:heart: Acknowledgment](#heart-acknowledgment)\n\n\u003Cbr\u002F>\n\n## :rocket: Quick Start\n\n\n### Conda Environment of VLA-Adapter\n\n```bash\n# Create and activate conda environment\nconda create -n vla-adapter python=3.10.16 -y\nconda activate vla-adapter\n```\n\n### Install Dependencies\n\n```bash\n# Install PyTorch\n# Use a command specific to your machine: https:\u002F\u002Fpytorch.org\u002Fget-started\u002Flocally\u002F\npip install torch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0\n\n# Clone vla-adapter repo and pip install to download dependencies\ngit clone https:\u002F\u002Fgithub.com\u002FOpenHelix-Team\u002FVLA-Adapter.git\ncd VLA-Adapter\npip install -e .\n\npip install packaging ninja\nninja --version; echo $?  # Verify Ninja --> should return exit code \"0\"\n\n# Install Flash Attention 2 for training (https:\u002F\u002Fgithub.com\u002FDao-AILab\u002Fflash-attention)\npip install \"flash-attn==2.5.5\" --no-build-isolation\n# If you run into difficulty, try `pip cache remove flash_attn` first, or visit the\n# website to download it. (https:\u002F\u002Fgithub.com\u002FDao-AILab\u002Fflash-attention\u002Freleases\u002Ftag\u002Fv2.5.5)\n# You can download the corresponding `.whl` file according to the cuda version of `nvidia-smi`,\n# and then run `pip install flash_attn-2.5.5+cuXX...whl` to install it. 
\n# We use the `flash_attn-2.5.5+cu122torch2.2cxx11abiFALSE-cp310-cp310-linux_x86_64.whl` file.\n```\n\n\u003Cbr\u002F>\n\u003Cbr\u002F>\n\n\n## :pencil: Data Preparation\n\n### LIBERO Benchmark\n\n- **(Optional)**\n\nClone and install the [LIBERO repo](https:\u002F\u002Fgithub.com\u002FLifelong-Robot-Learning\u002FLIBERO) and required packages:\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FLifelong-Robot-Learning\u002FLIBERO.git\npip install -e LIBERO\npip install -r experiments\u002Frobot\u002Flibero\u002Flibero_requirements.txt  # From vla-adapter base dir\n```\n\nTo download the [LIBERO datasets](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fopenvla\u002Fmodified_libero_rlds) that we used in our fine-tuning experiments, run the command below. This will download the `Spatial`, `Object`, `Goal`, and `Long` datasets in `RLDS` format, i.e., `libero_spatial_no_noops`, `libero_object_no_noops`, `libero_goal_no_noops`, `libero_10_no_noops`. (`\"_no_noops\"` stands for no no-op actions, i.e., training samples with near-zero actions are filtered out). These datasets require `~10GB` of disk space in total. If needed, see details on how to download the original non-RLDS datasets [here](https:\u002F\u002Fgithub.com\u002Fopenvla\u002Fopenvla?tab=readme-ov-file#libero-setup). You can use these to fine-tune Prismatic-VLMs (built on Qwen2.5-0.5B) or other VLMs.\n\n```bash\ngit clone git@hf.co:datasets\u002Fopenvla\u002Fmodified_libero_rlds\n```\n\n🌟 Attention! A dataset downloaded this way needs the ``modified_`` word removed from its folder name to match the path in [:pushpin: Benchmark Location](#pushpin-benchmark-location)!\n\nWhen using LIBERO, you may get an error message like `AttributeError: 'NoneType' object has no attribute 'eglQueryString'`. 
You can use:\n\n```bash\nsudo apt-get update\nsudo apt-get install libgl1-mesa-dev libegl1-mesa-dev libgles2-mesa-dev libglew-dev\n```\n\n### CALVIN Benchmark\n\n- **(Optional)**\n\n```bash\ngit clone --recurse-submodules https:\u002F\u002Fgithub.com\u002Fmees\u002Fcalvin.git\nexport CALVIN_ROOT=$(pwd)\u002Fcalvin\ncd $CALVIN_ROOT\n\n# Installation of `pyhash` may fail on some machines. If it fails, you can solve it by lowering the `setuptools` version: `pip install setuptools==57.5.0`\nsh install.sh\n```\n\nTo download the [CALVIN ABC→D datasets](https:\u002F\u002Fgithub.com\u002Fmees\u002Fcalvin\u002Ftree\u002Fmain\u002Fdataset) that we used in our fine-tuning experiments, run the command below. \n\n```bash\ncd $CALVIN_ROOT\u002Fdataset\nsh download_data.sh ABC\n```\n\nIf you want to download the RLDS format, you can visit [here](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fzhouhongyi\u002Fcalvin_abc_rlds) to download it. This dataset requires `~50GB` of memory.\n\nWhen using CALVIN, you may get an error message like `AttributeError: 'NoneType' object has no attribute 'eglQueryString'`. You can use:\n\n```bash\nsudo apt-get update\nsudo apt-get install libgl1-mesa-dev libegl1-mesa-dev libgles2-mesa-dev libglew-dev\n```\n\n\n### :video_game: Our Dependencies \n\n- **(including LIBERO and CALVIN)**\n\nAt this point, the environment is fully installed. If you want to confirm whether the environment is correct, you can see the `our_envs.txt` file we released.\n\n\n### :pushpin: Benchmark Location\n\nThe downloaded dataset can be placed in the `\u002Fdata` folder. 
The overall directory structure is as follows:\n\n```\n·\n├── data\n·   ├── libero\n    │   ├── libero_10_no_noops\n    │   │   └── 1.0.0  (It contains some json files and 32 tfrecord files)\n    │   ├── libero_goal_no_noops\n    │   │   └── 1.0.0  (It contains some json files and 16 tfrecord files)\n    │   ├── libero_object_no_noops\n    │   │   └── 1.0.0  (It contains some json files and 32 tfrecord files)\n    │   ├── libero_spatial_no_noops\n    │   │   └── 1.0.0  (It contains some json files and 16 tfrecord files)\n    │\n    ├── calvin_abc\n    │   └── 1.0.0  (It contains some json files, 512 train tfrecord files, and 32 valid tfrecord files)\n    │\n    └── other benchmarks ...\n```\n\n\u003Cbr\u002F>\n\u003Cbr\u002F>\n\n## ⚓ VLM backbone \u003Ca name=\"vlm\">\u003C\u002Fa>\nWe use the `Prismatic-VLMs` architecture. Since the file is large, please download it from [here](https:\u002F\u002Fhuggingface.co\u002FStanford-ILIAD\u002Fprism-qwen25-extra-dinosiglip-224px-0_5b). Then put it in the `\u002Fpretrained_models` folder. The file structure is:\n\n```\n·\n├── pretrained_models\n·   ├── configs\n    └── prism-qwen25-extra-dinosiglip-224px-0_5b\n```\n\n\n\u003Cbr\u002F>\n\u003Cbr\u002F>\n\n## :fire: Training for Different Configurations\n\n**We provide different training configurations for different users. You can choose the configuration suitable for training based on your GPU card type.**\n\n### :books: Related File for Training\n* `vla-scripts\u002Ffinetune.py`: VLA fine-tuning script\n\n\n### :ledger: How to Train on Extremely Limited VRAM GPUs\n\n***=> Extremely Limited VRAM (A card with 10GB-12GB) (e.g. NVIDIA GeForce RTX 2080Ti, 3060, 3080, 4070, 4080, and 5070).***\n\n>***About `batch_size`, `lora_rank`, `grad_accumulation_steps`, and `max_steps`.***\n\nIf your resources are extremely limited, you can set `--batch_size 1` and `--lora_rank 64`, it only requires `9.6GB` of VRAM. 
Of course, `batch size = 1` makes gradient updates sensitive to outlier samples, so loss convergence will be unstable. In this case, you can raise the `grad_accumulation_steps` parameter to simulate a larger batch. For example, `--batch_size 1` with `--grad_accumulation_steps 8` has a similar effect to `--batch_size 8`, but training will be slower. Note that you can't use the [OpenVLA-OFT](https:\u002F\u002Fgithub.com\u002Fmoojink\u002Fopenvla-oft) model on a card with `10GB`, because even with `batch size = 1` it requires `25GB` of VRAM. Fortunately, you can use VLA-Adapter. However, because the `batch size` is still small, you may need to increase `--max_steps` to achieve the performance reported in the paper.\n\n>***About `vlm_path`.***\n\nThe VLM in the VLA-Adapter uses the Prismatic-VLMs architecture, with the LLM backbone being `Qwen2.5-0.5B`. You can download it from https:\u002F\u002Fhuggingface.co\u002FStanford-ILIAD\u002Fprism-qwen25-extra-dinosiglip-224px-0_5b and place it in `\u002Fpretrained_models\u002Fprism-qwen25-extra-dinosiglip-224px-0_5b`.\n\n>***About `data_name`.***\n\nLaunch the fine-tuning script with the vla-adapter configuration below. It can run in the background, and the running progress can be seen in the `\u002Flogs` folder. You can replace `libero_spatial_no_noops` with `libero_object_no_noops`, `libero_goal_no_noops`, or `libero_10_no_noops`. If you are using the `CALVIN` benchmark, you need to delete `\u002Flibero` from `--data_root_dir` and replace `libero_spatial_no_noops` with `calvin_abc`.\n\n>***About `use_pro_version`.***\n\nIn addition, we recently released an enhanced version `Pro` of the VLA-Adapter. While its framework remains consistent with the original paper, it has been enhanced in the implementation, resulting in significantly improved performance. **Therefore, we strongly recommend using the Pro version!** The `Pro` version's `Policy` size is `207MB`, and training speed is virtually unchanged. 
The `original version` is nearly `1GB` smaller than the `pro version`, requiring only `8.6GB` of VRAM. You can choose whether to use the `Pro` version by setting the `use_pro_version` parameter, i.e., the `Pro` version is `--use_pro_version True`.\n\n ```bash\ndata_name=libero_spatial_no_noops\ncurrent_time=$(date +%Y%m%d_%H%M%S)  # Timestamp for the run ID and log file name (any format works)\n\nCUDA_VISIBLE_DEVICES=0 torchrun --standalone --nnodes 1 --nproc-per-node 1 vla-scripts\u002Ffinetune.py \\\n--vlm_path pretrained_models\u002Fprism-qwen25-extra-dinosiglip-224px-0_5b \\\n--config_file_path pretrained_models\u002Fconfigs \\\n--data_root_dir data\u002Flibero \\\n--dataset_name $data_name \\\n--run_root_dir outputs \\\n--use_film False \\\n--num_images_in_input 2 \\\n--use_proprio True \\\n--use_lora True \\\n--use_fz False \\\n--use_minivlm True \\\n--image_aug True \\\n--num_steps_before_decay 400000 \\\n--max_steps 400005 \\\n--save_freq 5000 \\\n--save_latest_checkpoint_only False \\\n--merge_lora_during_training True \\\n--batch_size 1 \\\n--grad_accumulation_steps 8 \\\n--learning_rate 2e-4 \\\n--lora_rank 64 \\\n--use_pro_version True \\\n--wandb_entity \"YOUR_WANDB_ENTITY\" \\\n--wandb_project \"$data_name\" \\\n--run_id_note VLA-Adapter--libero_spatial_no_noops--$current_time \\\n> logs\u002FVLA-Adapter--libero_spatial_no_noops--$current_time.log 2>&1 &\n```\n\nPlease note that the obtained models will be stored in the `\u002Foutputs` folder. Each model takes up nearly `3GB` of disk space, so you need to reserve enough space. We strongly recommend that you get our trained model from [VLA-Adapter HuggingFace](https:\u002F\u002Fhuggingface.co\u002FVLA-Adapter) and place it in this folder for inference.\n\n\u003Cbr\u002F>\n\n### :ledger: How to Train on Low VRAM GPUs\n\n***=> Low VRAM (A card with 24GB) (e.g. NVIDIA GeForce RTX 3090 and 4090).***\n\n>***About `batch_size`, `lora_rank`, `grad_accumulation_steps`, and `max_steps`.***\n\nIf you have such a device, you can increase the `batch size` and `lora rank`: `--batch_size 4` and `--lora_rank 64`. 
This only takes nearly `20GB` of VRAM. The rank is consistent with the one in our paper. Note that you can't use the [OpenVLA-OFT](https:\u002F\u002Fgithub.com\u002Fmoojink\u002Fopenvla-oft) model on a card with `24GB`, because even with `batch size = 1` it requires `25GB` of VRAM. Fortunately, you can use VLA-Adapter. However, because the `batch size` is still small, you may need to increase `--max_steps` to achieve the performance reported in the paper.\n\n>***About `vlm_path`.***\n\nThe VLM in the VLA-Adapter uses the Prismatic-VLMs architecture, with the LLM backbone being `Qwen2.5-0.5B`. You can download it from https:\u002F\u002Fhuggingface.co\u002FStanford-ILIAD\u002Fprism-qwen25-extra-dinosiglip-224px-0_5b and place it in `\u002Fpretrained_models\u002Fprism-qwen25-extra-dinosiglip-224px-0_5b`.\n\n>***About `data_name`.***\n\nLaunch the fine-tuning script with the vla-adapter configuration below. It can run in the background, and the running progress can be seen in the `\u002Flogs` folder. You can replace `libero_spatial_no_noops` with `libero_object_no_noops`, `libero_goal_no_noops`, or `libero_10_no_noops`. If you are using the `CALVIN` benchmark, you need to delete `\u002Flibero` from `--data_root_dir` and replace `libero_spatial_no_noops` with `calvin_abc`.\n\n>***About `use_pro_version`.***\n\nIn addition, we recently released an enhanced version `Pro` of the VLA-Adapter. While its framework remains consistent with the original paper, it has been enhanced in the implementation, resulting in significantly improved performance. **Therefore, we strongly recommend using the Pro version!** The `Pro` version's `Policy` size is `207MB`, and training speed is virtually unchanged. The `original version` is nearly `1GB` smaller than the `pro version` (1 batch), requiring only `17.6GB` of VRAM. 
You can choose whether to use the `Pro` version by setting the `use_pro_version` parameter, i.e., the `Pro` version is `--use_pro_version True`.\n\n\n ```bash\ndata_name=libero_spatial_no_noops\ncurrent_time=$(date +%Y%m%d_%H%M%S)  # Timestamp for the run ID and log file name (any format works)\n\nCUDA_VISIBLE_DEVICES=0 torchrun --standalone --nnodes 1 --nproc-per-node 1 vla-scripts\u002Ffinetune.py \\\n--vlm_path pretrained_models\u002Fprism-qwen25-extra-dinosiglip-224px-0_5b \\\n--config_file_path pretrained_models\u002Fconfigs \\\n--data_root_dir data\u002Flibero \\\n--dataset_name $data_name \\\n--run_root_dir outputs \\\n--use_film False \\\n--num_images_in_input 2 \\\n--use_proprio True \\\n--use_lora True \\\n--use_fz False \\\n--use_minivlm True \\\n--image_aug True \\\n--num_steps_before_decay 200000 \\\n--max_steps 200005 \\\n--save_freq 5000 \\\n--save_latest_checkpoint_only False \\\n--merge_lora_during_training True \\\n--batch_size 4 \\\n--grad_accumulation_steps 4 \\\n--learning_rate 2e-4 \\\n--lora_rank 64 \\\n--use_pro_version True \\\n--wandb_entity \"YOUR_WANDB_ENTITY\" \\\n--wandb_project \"$data_name\" \\\n--run_id_note VLA-Adapter--libero_spatial_no_noops--$current_time \\\n> logs\u002FVLA-Adapter--libero_spatial_no_noops--$current_time.log 2>&1 &\n```\n\nPlease note that the obtained models will be stored in the `\u002Foutputs` folder. Each model takes up nearly `3GB` of disk space, so you need to reserve enough space. We strongly recommend that you get our trained model from [VLA-Adapter HuggingFace](https:\u002F\u002Fhuggingface.co\u002FVLA-Adapter) and place it in this folder for inference.\n\n\n\n\u003Cbr\u002F>\n\n### :ledger: How to Train on Larger VRAM GPUs\n\n***=> A Consumer GPU with 32GB (e.g. NVIDIA GeForce RTX 5090) \u003Cbr\u002F> => A Professional-Grade GPU with 40GB-48GB (e.g. 
NVIDIA A100-40GB, A800-40GB, L20, and RTX A6000).***\n\n\n>***About `batch_size`, `lora_rank`, `grad_accumulation_steps`, and `max_steps`.***\n\nIf you have such a device, you can increase the `batch size` and `lora rank`: `--batch_size 8` and `--lora_rank 64`. This only takes nearly `29GB` of VRAM. \n\n>***About `vlm_path`.***\n\nThe VLM in the VLA-Adapter uses the Prismatic-VLMs architecture, with the LLM backbone being `Qwen2.5-0.5B`. You can download it from https:\u002F\u002Fhuggingface.co\u002FStanford-ILIAD\u002Fprism-qwen25-extra-dinosiglip-224px-0_5b and place it in `\u002Fpretrained_models\u002Fprism-qwen25-extra-dinosiglip-224px-0_5b`.\n\n>***About `data_name`.***\n\nLaunch the fine-tuning script with the vla-adapter configuration below. It can run in the background, and the running progress can be seen in the `\u002Flogs` folder. You can replace `libero_spatial_no_noops` with `libero_object_no_noops`, `libero_goal_no_noops`, or `libero_10_no_noops`. If you are using the `CALVIN` benchmark, you need to delete `\u002Flibero` from `--data_root_dir` and replace `libero_spatial_no_noops` with `calvin_abc`.\n\nWith this configuration, you can reproduce our paper's result on the `LIBERO-Object` benchmark, a `99.2%` success rate, in just `8 hours`. The `LIBERO-Spatial` benchmark requires approximately 10 hours of training. However, the `LIBERO-Long` benchmark takes longer because its tasks are longer and more difficult, requiring more training steps to achieve superior performance.\n\n>***About `use_pro_version`.***\n\nIn addition, we recently released an enhanced version `Pro` of the VLA-Adapter. While its framework remains consistent with the original paper, it has been enhanced in the implementation, resulting in significantly improved performance. **Therefore, we strongly recommend using the Pro version!** The `Pro` version's `Policy` size is `207MB`, and training speed is virtually unchanged. 
The `original version` is nearly `1GB` smaller than the `pro version` (1 batch). You can choose whether to use the `Pro` version by setting the `use_pro_version` parameter, i.e., the `Pro` version is `--use_pro_version True`.\n\n ```bash\ndata_name=libero_spatial_no_noops\ncurrent_time=$(date +%Y%m%d_%H%M%S)  # Timestamp for the run ID and log file name (any format works)\n\nCUDA_VISIBLE_DEVICES=0 torchrun --standalone --nnodes 1 --nproc-per-node 1 vla-scripts\u002Ffinetune.py \\\n--vlm_path pretrained_models\u002Fprism-qwen25-extra-dinosiglip-224px-0_5b \\\n--config_file_path pretrained_models\u002Fconfigs \\\n--data_root_dir data\u002Flibero \\\n--dataset_name $data_name \\\n--run_root_dir outputs \\\n--use_film False \\\n--num_images_in_input 2 \\\n--use_proprio True \\\n--use_lora True \\\n--use_fz False \\\n--use_minivlm True \\\n--image_aug True \\\n--num_steps_before_decay 200000 \\\n--max_steps 200005 \\\n--save_freq 5000 \\\n--save_latest_checkpoint_only False \\\n--merge_lora_during_training True \\\n--batch_size 8 \\\n--grad_accumulation_steps 2 \\\n--learning_rate 2e-4 \\\n--lora_rank 64 \\\n--use_pro_version True \\\n--wandb_entity \"YOUR_WANDB_ENTITY\" \\\n--wandb_project \"$data_name\" \\\n--run_id_note VLA-Adapter--libero_spatial_no_noops--$current_time \\\n> logs\u002FVLA-Adapter--libero_spatial_no_noops--$current_time.log 2>&1 &\n```\n\nPlease note that the obtained models will be stored in the `\u002Foutputs` folder. Each model takes up nearly `3GB` of disk space, so you need to reserve enough space. We strongly recommend that you get our trained model from [VLA-Adapter HuggingFace](https:\u002F\u002Fhuggingface.co\u002FVLA-Adapter) and place it in this folder for inference.\n\n\n\n\u003Cbr\u002F>\n\n### :ledger: How to Train on Sufficient VRAM GPUs\n\n***=> Professional-Grade GPUs with ≥80GB (e.g. 
NVIDIA A100-80GB, A800-80GB, H100, H800, H20-NVLink, and GB200).***\n\n>***About `batch_size`, `lora_rank`, `grad_accumulation_steps`, and `max_steps`.***\n\nYou can train on 1 to 8 GPUs by listing the GPU IDs in `CUDA_VISIBLE_DEVICES` and setting `--nproc-per-node` to the number of GPUs. In our paper, we use 4×H100 GPUs for training. In this configuration, training on the four suites of the LIBERO benchmark takes about five hours for `Spatial`, less than one hour for `Object`, three hours for `Goal`, and half a day for `Long`; the `CALVIN` benchmark takes about eight hours.\n\n>***About `vlm_path`.***\n\nThe VLM in the VLA-Adapter uses the Prismatic-VLMs architecture, with the LLM backbone being `Qwen2.5-0.5B`. You can download it from https:\u002F\u002Fhuggingface.co\u002FStanford-ILIAD\u002Fprism-qwen25-extra-dinosiglip-224px-0_5b and place it in `\u002Fpretrained_models\u002Fprism-qwen25-extra-dinosiglip-224px-0_5b`.\n\n>***About `data_name`.***\n\nLaunch the fine-tuning script with the vla-adapter configuration below. It can run in the background, and the running progress can be seen in the `\u002Flogs` folder. You can replace `libero_spatial_no_noops` with `libero_object_no_noops`, `libero_goal_no_noops`, or `libero_10_no_noops`. If you are using the `CALVIN` benchmark, you need to delete `\u002Flibero` from `--data_root_dir` and replace `libero_spatial_no_noops` with `calvin_abc`.\n\n\n>***About `use_pro_version`.***\n\nIn addition, we recently released an enhanced version `Pro` of the VLA-Adapter. While its framework remains consistent with the original paper, it has been enhanced in the implementation, resulting in significantly improved performance. **Therefore, we strongly recommend using the Pro version!** The `Pro` version's `Policy` size is `207MB`, and training speed is virtually unchanged. The `original version` is nearly `1GB` smaller than the `pro version` (1 batch). 
You can choose whether to use the `Pro` version by setting the `use_pro_version` parameter, i.e., the `Pro` version is `--use_pro_version True`.\n\n```bash\ndata_name=libero_spatial_no_noops\ncurrent_time=$(date +%Y%m%d_%H%M%S)  # Timestamp for the run ID and log file name (any format works)\n\nCUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --standalone --nnodes 1 --nproc-per-node 4 vla-scripts\u002Ffinetune.py \\\n--vlm_path pretrained_models\u002Fprism-qwen25-extra-dinosiglip-224px-0_5b \\\n--config_file_path pretrained_models\u002Fconfigs \\\n--data_root_dir data\u002Flibero \\\n--dataset_name $data_name \\\n--run_root_dir outputs \\\n--use_film False \\\n--num_images_in_input 2 \\\n--use_proprio True \\\n--use_lora True \\\n--use_fz False \\\n--use_minivlm True \\\n--image_aug True \\\n--num_steps_before_decay 150000 \\\n--max_steps 150005 \\\n--save_freq 5000 \\\n--save_latest_checkpoint_only False \\\n--merge_lora_during_training True \\\n--batch_size 16 \\\n--grad_accumulation_steps 1 \\\n--learning_rate 2e-4 \\\n--lora_rank 64 \\\n--use_pro_version True \\\n--wandb_entity \"YOUR_WANDB_ENTITY\" \\\n--wandb_project \"$data_name\" \\\n--run_id_note VLA-Adapter--spatial--$current_time \\\n> logs\u002FVLA-Adapter--spatial--$current_time.log 2>&1 &\n```\n\nPlease note that the obtained models will be stored in the `\u002Foutputs` folder. Each model takes up nearly `3GB` of disk space, so you need to reserve enough space. 
We strongly recommend that you get our trained model from [VLA-Adapter HuggingFace](https:\u002F\u002Fhuggingface.co\u002FVLA-Adapter) and place it in this folder for inference.\n\n## :mechanical_arm: Inference\n\n### :books: Related File for Inference\n* `experiments\u002Frobot\u002Flibero\u002F`: LIBERO eval files\n  * `run_libero_eval.py`: LIBERO eval script\n  * `libero_utils.py`: LIBERO eval utils\n* `experiments\u002Frobot\u002F`: General eval utils files\n  * `openvla_utils.py`: VLA-specific eval utils\n  * `robot_utils.py`: Other eval utils\n\n\u003Cbr\u002F>\n\n### 🤗 Checkpoint of VLA-Adapter \u003Ca name=\"ckpts\">\u003C\u002Fa>\nWe fine-tuned `Qwen2.5-0.5B` with our adapter bridge paradigm on four LIBERO task suites independently: `LIBERO-Spatial`, `LIBERO-Object`, `LIBERO-Goal`, and `LIBERO-Long`. \nThe four VLA-Adapter checkpoints for LIBERO are available on Hugging Face:\n* [VLA-Adapter\u002FLIBERO-Spatial](https:\u002F\u002Fhuggingface.co\u002FVLA-Adapter\u002FLIBERO-Spatial) \n* [VLA-Adapter\u002FLIBERO-Object](https:\u002F\u002Fhuggingface.co\u002FVLA-Adapter\u002FLIBERO-Object)\n* [VLA-Adapter\u002FLIBERO-Goal](https:\u002F\u002Fhuggingface.co\u002FVLA-Adapter\u002FLIBERO-Goal)\n* [VLA-Adapter\u002FLIBERO-Long](https:\u002F\u002Fhuggingface.co\u002FVLA-Adapter\u002FLIBERO-Long)\n\nIn addition, we provide a `Pro` version, trained on `4*H100` GPUs with `--batch_size 16`, `--lora_rank 64`, and `--max_steps 100000`. 
The Pro checkpoints are:\n\n* [VLA-Adapter\u002FLIBERO-Spatial-Pro](https:\u002F\u002Fhuggingface.co\u002FVLA-Adapter\u002FLIBERO-Spatial-Pro) `(97.8 -> 99.6)`\n* [VLA-Adapter\u002FLIBERO-Object-Pro](https:\u002F\u002Fhuggingface.co\u002FVLA-Adapter\u002FLIBERO-Object-Pro) `(99.2 -> 99.6)`\n* [VLA-Adapter\u002FLIBERO-Goal-Pro](https:\u002F\u002Fhuggingface.co\u002FVLA-Adapter\u002FLIBERO-Goal-Pro) `(97.2 -> 98.2)`\n* [VLA-Adapter\u002FLIBERO-Long-Pro](https:\u002F\u002Fhuggingface.co\u002FVLA-Adapter\u002FLIBERO-Long-Pro) `(95.0 -> 96.4)`\n* [VLA-Adapter\u002FCALVIN-ABC-Pro](https:\u002F\u002Fhuggingface.co\u002FVLA-Adapter\u002FCALVIN-ABC-Pro) `(4.42 -> 4.50)`\n\nThese files need to be placed in the `\u002Foutputs` folder. If you trained your own models, they will also be stored here. The subsequent eval code will call the model in this folder for inference.\n\n\n\u003Cbr\u002F>\n\n\n### :notebook: How to Eval \u003Ca name=\"evals\">\u003C\u002Fa>\n\n**We strongly recommend that you use our open source `Pro` version of the model, which has stronger performance.** To start evaluations with one of these checkpoints, run one of the commands below. Each will automatically download the appropriate checkpoint listed above. If you want to use the original version of the model, you only need to set the `--use_pro_version` parameter to `False` and pass the original version of the model to the `--pretrained_checkpoint` parameter. Finally, the inference results will be saved in the `\u002Feval_logs` folder, and the inference videos will be saved in the `\u002Frollouts\u002Fvla-adapter` folder. 
\n\n\n```bash\n# Launch LIBERO-Spatial-Pro evals (Background running)\nCUDA_VISIBLE_DEVICES=0 python experiments\u002Frobot\u002Flibero\u002Frun_libero_eval.py \\\n  --use_proprio True \\\n  --num_images_in_input 2 \\\n  --use_film False \\\n  --pretrained_checkpoint outputs\u002FLIBERO-Spatial-Pro \\\n  --task_suite_name libero_spatial \\\n  --use_pro_version True \\\n  > eval_logs\u002FSpatial--chkpt.log 2>&1 &\n\n\n# Launch LIBERO-Object-Pro evals (Background running)\nCUDA_VISIBLE_DEVICES=0 python experiments\u002Frobot\u002Flibero\u002Frun_libero_eval.py \\\n  --use_proprio True \\\n  --num_images_in_input 2 \\\n  --use_film False \\\n  --pretrained_checkpoint outputs\u002FLIBERO-Object-Pro \\\n  --task_suite_name libero_object \\\n  --use_pro_version True \\\n  > eval_logs\u002FObject--chkpt.log 2>&1 &\n\n\n# Launch LIBERO-Goal-Pro evals (Background running)\nCUDA_VISIBLE_DEVICES=0 python experiments\u002Frobot\u002Flibero\u002Frun_libero_eval.py \\\n  --use_proprio True \\\n  --num_images_in_input 2 \\\n  --use_film False \\\n  --pretrained_checkpoint outputs\u002FLIBERO-Goal-Pro \\\n  --task_suite_name libero_goal \\\n  --use_pro_version True \\\n  > eval_logs\u002FGoal--chkpt.log 2>&1 &\n\n\n# Launch LIBERO-Long-Pro (LIBERO-10) evals (Background running)\nCUDA_VISIBLE_DEVICES=0 python experiments\u002Frobot\u002Flibero\u002Frun_libero_eval.py \\\n  --use_proprio True \\\n  --num_images_in_input 2 \\\n  --use_film False \\\n  --pretrained_checkpoint outputs\u002FLIBERO-Long-Pro \\\n  --task_suite_name libero_10 \\\n  --use_pro_version True \\\n  > eval_logs\u002FLong--chkpt.log 2>&1 &\n\n\n# Launch CALVIN ABC→D-Pro evals (Background running)\nCUDA_VISIBLE_DEVICES=0 python vla-scripts\u002Fevaluate_calvin.py \\\n  --pretrained_checkpoint outputs\u002FCALVIN-ABC-Pro \\\n  > eval_logs\u002FCALVIN--ABC.log 2>&1 &\n```\n\nIf you want to get the inference **throughput**, you can measure it in the `run_libero_eval.py` file. 
You can add `start = time.time()` before and `end = time.time()` after `lines 334--345`, then compute `end - start`. This difference is the time it takes to generate `8 chunks`, which gives you the inference throughput. We measured it multiple times and obtained an average value of `0.036s`.\n\n\u003Cbr\u002F>\n\n## 🌈 Success Rate Comparison \u003Ca name=\"results\">\u003C\u002Fa>\n\nAll our results were obtained on an `H100` GPU. You can find the inference `log` files alongside the models released on [HF](https:\u002F\u002Fhuggingface.co\u002FVLA-Adapter). By default, the evaluation script runs 500 trials in LIBERO (10 tasks x 50 episodes each) and 1,000 task sequences in CALVIN. Use the same card for training and inference whenever possible. **Note that results may vary slightly if you use a different GPU than the H100.** This phenomenon is also mentioned in the OpenVLA-OFT readme file.\n\n### Performance on LIBERO benchmark. \n\n\u003Cb>\u003Ci>XX\u003C\u002Fi>\u003C\u002Fb> represents the best performance, \u003Cb>XX\u003C\u002Fb> represents the second best performance, and \u003Ci>\u003Cu>XX*\u003C\u002Fu>\u003C\u002Fi> represents the third best performance.\n\n\u003Ctable>\n  \u003Ctr>\n   \u003Ctd>\u003Cstrong>LIBERO\u003C\u002Fstrong>\u003C\u002Ftd>  \u003Ctd>\u003Cstrong>Methods\u003C\u002Fstrong>\u003C\u002Ftd>\n   \u003Ctd>\u003Cstrong>Scale\u003C\u002Fstrong>\u003C\u002Ftd>  \u003Ctd>\u003Cstrong>Spatial\u003C\u002Fstrong>\u003C\u002Ftd>\n   \u003Ctd>\u003Cstrong>Object\u003C\u002Fstrong>\u003C\u002Ftd>  \u003Ctd>\u003Cstrong>Goal\u003C\u002Fstrong>\u003C\u002Ftd>\n   \u003Ctd>\u003Cstrong>Long\u003C\u002Fstrong>\u003C\u002Ftd>  \u003Ctd>\u003Cstrong>Avg.\u003C\u002Fstrong>\u003C\u002Ftd>\n  \u003C\u002Ftr>\n\n  \u003Ctr>\u003Ctd rowspan=\"10\">Large-scale\u003C\u002Ftd>\u003Ctd>FlowVLA (Zhong et al., 2025)\u003C\u002Ftd>\n   
\u003Ctd>8.5B\u003C\u002Ftd>\u003Ctd>93.2\u003C\u002Ftd>\u003Ctd>95.0\u003C\u002Ftd>\u003Ctd>91.6\u003C\u002Ftd>\u003Ctd>72.6\u003C\u002Ftd>\u003Ctd>88.1\u003C\u002Ftd>\u003C\u002Ftr>\n\n  \u003Ctr>\u003Ctd>UnifiedVLA (Wang et al., 2025)\u003C\u002Ftd>\n   \u003Ctd>8.5B\u003C\u002Ftd>\u003Ctd>95.4\u003C\u002Ftd>\u003Ctd>\u003Ci>\u003Cu>98.8*\u003C\u002Fu>\u003C\u002Fi>\u003C\u002Ftd>\u003Ctd> 93.6 \u003C\u002Ftd>\u003Ctd>94.0 \u003C\u002Ftd>\u003Ctd>95.5\u003C\u002Ftd>\u003C\u002Ftr>\n\n  \u003Ctr>\u003Ctd>OpenVLA (Kim et al., 2024)\u003C\u002Ftd>\n   \u003Ctd>7B\u003C\u002Ftd>\u003Ctd>84.7\u003C\u002Ftd>\u003Ctd>88.4\u003C\u002Ftd>\u003Ctd>79.2\u003C\u002Ftd>\u003Ctd>53.7\u003C\u002Ftd>\u003Ctd>76.5\u003C\u002Ftd>\u003C\u002Ftr>\n\n  \u003Ctr>\u003Ctd>OpenVLA-OFT (Kim et al., 2025)\u003C\u002Ftd>\n   \u003Ctd>7B\u003C\u002Ftd>\u003Ctd>\u003Ci>\u003Cu>97.6*\u003C\u002Fu>\u003C\u002Fi>\u003C\u002Ftd>\u003Ctd>98.4\u003C\u002Ftd>\u003Ctd>\u003Cb>97.9\u003C\u002Fb>\u003C\u002Ftd>\u003Ctd>\u003Ci>\u003Cu>94.5*\u003C\u002Fu>\u003C\u002Fi>\u003C\u002Ftd>\u003Ctd>\u003Ci>\u003Cu>97.1*\u003C\u002Fu>\u003C\u002Fi>\u003C\u002Ftd>\u003C\u002Ftr>\n\n  \u003Ctr>\u003Ctd>UniVLA (Bu et al., 2025)\u003C\u002Ftd>\n   \u003Ctd>7B\u003C\u002Ftd>\u003Ctd>96.5\u003C\u002Ftd>\u003Ctd> 96.8\u003C\u002Ftd>\u003Ctd> 95.6 \u003C\u002Ftd>\u003Ctd>92.0 \u003C\u002Ftd>\u003Ctd>95.2\u003C\u002Ftd>\u003C\u002Ftr>\n\n  \u003Ctr>\u003Ctd>CoT-VLA (Zhao et al., 2025)\u003C\u002Ftd>\n   \u003Ctd>7B\u003C\u002Ftd>\u003Ctd>87.5 \u003C\u002Ftd>\u003Ctd>91.6 \u003C\u002Ftd>\u003Ctd>87.6\u003C\u002Ftd>\u003Ctd> 69.0\u003C\u002Ftd>\u003Ctd> 81.1\u003C\u002Ftd>\u003C\u002Ftr>\n\n  \u003Ctr>\u003Ctd>WorldVLA (Cen et al., 2025)\u003C\u002Ftd>\n   \u003Ctd>7B\u003C\u002Ftd>\u003Ctd>87.6\u003C\u002Ftd>\u003Ctd> 96.2\u003C\u002Ftd>\u003Ctd> 83.4\u003C\u002Ftd>\u003Ctd> 60.0\u003C\u002Ftd>\u003Ctd> 81.8\u003C\u002Ftd>\u003C\u002Ftr>\n\n  \u003Ctr>\u003Ctd>TraceVLA (Zheng et al., 2025)\u003C\u002Ftd>\n   
\u003Ctd>7B\u003C\u002Ftd>\u003Ctd>84.6\u003C\u002Ftd>\u003Ctd> 85.2\u003C\u002Ftd>\u003Ctd> 75.1\u003C\u002Ftd>\u003Ctd> 54.1\u003C\u002Ftd>\u003Ctd> 74.8\u003C\u002Ftd>\u003C\u002Ftr>\n\n  \u003Ctr>\u003Ctd>MolmoAct (Lee et al., 2025)\u003C\u002Ftd>\n   \u003Ctd>7B\u003C\u002Ftd>\u003Ctd>87.0\u003C\u002Ftd>\u003Ctd> 95.4 \u003C\u002Ftd>\u003Ctd>87.6\u003C\u002Ftd>\u003Ctd> 77.2 \u003C\u002Ftd>\u003Ctd>86.6\u003C\u002Ftd>\u003C\u002Ftr>\n\n  \u003Ctr>\u003Ctd>ThinkAct (Huang et al., 2025)\u003C\u002Ftd>\n   \u003Ctd>7B\u003C\u002Ftd>\u003Ctd>88.3 \u003C\u002Ftd>\u003Ctd>91.4\u003C\u002Ftd>\u003Ctd> 87.1\u003C\u002Ftd>\u003Ctd> 70.9\u003C\u002Ftd>\u003Ctd> 84.4\u003C\u002Ftd>\u003C\u002Ftr>\n\n  \u003Ctr>\u003Ctd rowspan=\"7\">Small-scale\u003C\u002Ftd>\u003Ctd>4D-VLA (Zhang et al., 2025)\u003C\u002Ftd>\n   \u003Ctd>4B\u003C\u002Ftd>\u003Ctd>88.9\u003C\u002Ftd>\u003Ctd> 95.2\u003C\u002Ftd>\u003Ctd> 90.9\u003C\u002Ftd>\u003Ctd> 79.1 \u003C\u002Ftd>\u003Ctd>88.6\u003C\u002Ftd>\u003C\u002Ftr>\n\n  \u003Ctr>\u003Ctd>SpatialVLA (Qu et al., 2025)\u003C\u002Ftd>\n   \u003Ctd>4B\u003C\u002Ftd>\u003Ctd>88.2\u003C\u002Ftd>\u003Ctd> 89.9\u003C\u002Ftd>\u003Ctd> 78.6\u003C\u002Ftd>\u003Ctd> 55.5 \u003C\u002Ftd>\u003Ctd>78.1\u003C\u002Ftd>\u003C\u002Ftr>\n\n  \u003Ctr>\u003Ctd>π0 (Black et al., 2024)\u003C\u002Ftd>\n   \u003Ctd>3B\u003C\u002Ftd>\u003Ctd>96.8\u003C\u002Ftd>\u003Ctd>\u003Ci>\u003Cu>98.8*\u003C\u002Fu>\u003C\u002Fi>\u003C\u002Ftd>\u003Ctd>95.8\u003C\u002Ftd>\u003Ctd> 85.2\u003C\u002Ftd>\u003Ctd> 94.2\u003C\u002Ftd>\u003C\u002Ftr>\n\n  \u003Ctr>\u003Ctd>π0-FAST (Pertsch et al., 2025)\u003C\u002Ftd>\n   \u003Ctd>3B\u003C\u002Ftd>\u003Ctd>96.4\u003C\u002Ftd>\u003Ctd> 96.8 \u003C\u002Ftd>\u003Ctd>88.6\u003C\u002Ftd>\u003Ctd> 60.2\u003C\u002Ftd>\u003Ctd> 85.5\u003C\u002Ftd>\u003C\u002Ftr>\n\n  \u003Ctr>\u003Ctd>NORA (Hung et al., 2025)\u003C\u002Ftd>\n   \u003Ctd>3B\u003C\u002Ftd>\u003Ctd>92.2 \u003C\u002Ftd>\u003Ctd>95.4 
\u003C\u002Ftd>\u003Ctd>89.4\u003C\u002Ftd>\u003Ctd> 74.6 \u003C\u002Ftd>\u003Ctd>87.9\u003C\u002Ftd>\u003C\u002Ftr>\n\n  \u003Ctr>\u003Ctd>SmolVLA (Shukor et al., 2025)\u003C\u002Ftd>\n   \u003Ctd>2.2B\u003C\u002Ftd>\u003Ctd>93.0\u003C\u002Ftd>\u003Ctd> 94.0 \u003C\u002Ftd>\u003Ctd>91.0\u003C\u002Ftd>\u003Ctd> 77.0 \u003C\u002Ftd>\u003Ctd>88.8\u003C\u002Ftd>\u003C\u002Ftr>\n\n  \u003Ctr>\u003Ctd>GR00T N1 (NVIDIA et al., 2025)\u003C\u002Ftd>\n   \u003Ctd>2B\u003C\u002Ftd>\u003Ctd>94.4\u003C\u002Ftd>\u003Ctd> 97.6 \u003C\u002Ftd>\u003Ctd>93.0 \u003C\u002Ftd>\u003Ctd>90.6\u003C\u002Ftd>\u003Ctd> 93.9\u003C\u002Ftd>\u003C\u002Ftr>\n\n  \u003Ctr>\u003Ctd rowspan=\"5\">Tiny-scale\u003C\u002Ftd>\u003Ctd>Seer (Tian et al., 2025)\u003C\u002Ftd>\n   \u003Ctd>0.57B\u003C\u002Ftd>\u003Ctd>-\u003C\u002Ftd>\u003Ctd> - \u003C\u002Ftd>\u003Ctd>- \u003C\u002Ftd>\u003Ctd>78.7\u003C\u002Ftd>\u003Ctd> 78.7\u003C\u002Ftd>\u003C\u002Ftr>\n\n  \u003Ctr>\u003Ctd>VLA-OS (Gao et al., 2025)\u003C\u002Ftd>\n   \u003Ctd>0.5B\u003C\u002Ftd>\u003Ctd>87.0 \u003C\u002Ftd>\u003Ctd>96.5\u003C\u002Ftd>\u003Ctd> 92.7 \u003C\u002Ftd>\u003Ctd>66.0\u003C\u002Ftd>\u003Ctd> 85.6\u003C\u002Ftd>\u003C\u002Ftr>\n\n  \u003Ctr>\u003Ctd>Diffusion Policy (Chi et al., 2023)\u003C\u002Ftd>\n   \u003Ctd>-\u003C\u002Ftd>\u003Ctd>78.3\u003C\u002Ftd>\u003Ctd> 92.5\u003C\u002Ftd>\u003Ctd> 68.3 \u003C\u002Ftd>\u003Ctd>50.5 \u003C\u002Ftd>\u003Ctd>72.4\u003C\u002Ftd>\u003C\u002Ftr>\n\n  \u003Ctr>\u003Ctd>\u003Cb>VLA-Adapter (Ours)\u003C\u002Fb>\u003C\u002Ftd>\n   \u003Ctd>\u003Cb>0.5B\u003C\u002Fb>\u003C\u002Ftd>\u003Ctd>\u003Cb>97.8\u003C\u002Fb>\u003C\u002Ftd>\u003Ctd>\u003Cb>99.2\u003C\u002Fb>\u003C\u002Ftd>\u003Ctd>\u003Ci>\u003Cu>97.2*\u003C\u002Fu>\u003C\u002Fi>\u003C\u002Ftd>\u003Ctd> \u003Cb>95.0 \u003C\u002Fb>\u003C\u002Ftd>\u003Ctd>\u003Cb>97.3\u003C\u002Fb>\u003C\u002Ftd>\u003C\u002Ftr>\n\n  \u003Ctr>\u003Ctd>\u003Cb>VLA-Adapter-Pro (Ours)\u003C\u002Fb>\u003C\u002Ftd>\n   
\u003Ctd>\u003Cb>0.5B\u003C\u002Fb>\u003C\u002Ftd>\u003Ctd>\u003Cb>\u003Ci>99.6\u003C\u002Fi>\u003C\u002Fb>\u003C\u002Ftd>\u003Ctd>\u003Cb>\u003Ci>99.6\u003C\u002Fi>\u003C\u002Fb> \u003C\u002Ftd>\u003Ctd>\u003Cb>\u003Ci>98.2\u003C\u002Fi>\u003C\u002Fb>\u003C\u002Ftd>\u003Ctd>\u003Cb>\u003Ci>96.4\u003C\u002Fi>\u003C\u002Fb>\u003C\u002Ftd>\u003Ctd>\u003Cb>\u003Ci>98.5\u003C\u002Fi>\u003C\u002Fb>\u003C\u002Ftd>\u003C\u002Ftr>\n  \n\u003C\u002Ftable>\n\n### Performance on CALVIN ABC→D benchmark. \n\n\u003Cb>\u003Ci>XX\u003C\u002Fi>\u003C\u002Fb> represents the best performance, \u003Cb>XX\u003C\u002Fb> represents the second best performance, and \u003Ci>\u003Cu>XX*\u003C\u002Fu>\u003C\u002Fi> represents the third best performance.\n\n\u003Ctable>\n  \u003Ctr>\n   \u003Ctd>\u003Cstrong>CALVIN\u003C\u002Fstrong>\u003C\u002Ftd>  \u003Ctd>\u003Cstrong>Methods\u003C\u002Fstrong>\u003C\u002Ftd>\n   \u003Ctd>\u003Cstrong>Scale\u003C\u002Fstrong>\u003C\u002Ftd>  \u003Ctd>\u003Cstrong>1\u003C\u002Fstrong>\u003C\u002Ftd>\n   \u003Ctd>\u003Cstrong>2\u003C\u002Fstrong>\u003C\u002Ftd>  \u003Ctd>\u003Cstrong>3\u003C\u002Fstrong>\u003C\u002Ftd>\n   \u003Ctd>\u003Cstrong>4\u003C\u002Fstrong>\u003C\u002Ftd>  \u003Ctd>\u003Cstrong>5\u003C\u002Fstrong>\u003C\u002Ftd> \u003Ctd>\u003Cstrong>Avg. 
len\u003C\u002Fstrong>\u003C\u002Ftd>\n  \u003C\u002Ftr>\n\n  \u003Ctr>\u003Ctd rowspan=\"8\">Large-scale\u003C\u002Ftd>\u003Ctd>UniVLA (Bu et al., 2025)\u003C\u002Ftd>\u003Ctd>7B\u003C\u002Ftd>\u003Ctd>95.5\u003C\u002Ftd>\u003Ctd>85.8\u003C\u002Ftd>\u003Ctd>75.4\u003C\u002Ftd>\u003Ctd>66.9\u003C\u002Ftd>\u003Ctd>56.5\u003C\u002Ftd>\u003Ctd>3.80\u003C\u002Ftd>\u003C\u002Ftr>\n\n  \u003Ctr>\u003Ctd>OpenVLA (Kim et al., 2024)\u003C\u002Ftd>\u003Ctd>7B\u003C\u002Ftd>\u003Ctd>91.3\u003C\u002Ftd>\u003Ctd>77.8\u003C\u002Ftd>\u003Ctd>62.0\u003C\u002Ftd>\u003Ctd>52.1\u003C\u002Ftd>\u003Ctd>43.5\u003C\u002Ftd>\u003Ctd>3.27\u003C\u002Ftd>\u003C\u002Ftr>\n\n  \u003Ctr>\u003Ctd>OpenVLA-OFT (Kim et al., 2025)\u003C\u002Ftd>\u003Ctd>7B\u003C\u002Ftd>\u003Ctd>96.3\u003C\u002Ftd>\u003Ctd>89.1\u003C\u002Ftd>\u003Ctd>82.4\u003C\u002Ftd>\u003Ctd>75.8\u003C\u002Ftd>\u003Ctd>66.5\u003C\u002Ftd>\u003Ctd>4.10\u003C\u002Ftd>\u003C\u002Ftr>\n\n  \u003Ctr>\u003Ctd>VLAS (Zhao et al., 2025b)\u003C\u002Ftd>\u003Ctd>7B\u003C\u002Ftd>\u003Ctd>87.2\u003C\u002Ftd>\u003Ctd>64.2\u003C\u002Ftd>\u003Ctd>40.9\u003C\u002Ftd>\u003Ctd>28.1\u003C\u002Ftd>\u003Ctd>19.6\u003C\u002Ftd>\u003Ctd>2.40\u003C\u002Ftd>\u003C\u002Ftr>\n\n  \u003Ctr>\u003Ctd>LCB (Shentu et al., 2024)\u003C\u002Ftd>\u003Ctd>7B\u003C\u002Ftd>\u003Ctd>73.6\u003C\u002Ftd>\u003Ctd>50.2\u003C\u002Ftd>\u003Ctd>28.5\u003C\u002Ftd>\u003Ctd>16.0\u003C\u002Ftd>\u003Ctd>9.9\u003C\u002Ftd>\u003Ctd>1.78\u003C\u002Ftd>\u003C\u002Ftr>\n\n  \u003Ctr>\u003Ctd>RoboDual (Bu et al., 2024a)\u003C\u002Ftd>\u003Ctd>7B\u003C\u002Ftd>\u003Ctd>94.4\u003C\u002Ftd>\u003Ctd>82.7\u003C\u002Ftd>\u003Ctd>72.1\u003C\u002Ftd>\u003Ctd>62.4\u003C\u002Ftd>\u003Ctd>54.4\u003C\u002Ftd>\u003Ctd>3.66\u003C\u002Ftd>\u003C\u002Ftr>\n\n  \u003Ctr>\u003Ctd>OpenHelix (Cui et al., 2025)\u003C\u002Ftd>\u003Ctd>7B\u003C\u002Ftd>\u003Ctd>\u003Ci>\u003Cu>97.1*\u003C\u002Fu>\u003C\u002Fi>\u003C\u002Ftd>\u003Ctd>91.4 
\u003C\u002Ftd>\u003Ctd>82.8\u003C\u002Ftd>\u003Ctd> 72.6\u003C\u002Ftd>\u003Ctd> 64.1 \u003C\u002Ftd>\u003Ctd>4.08\u003C\u002Ftd>\u003C\u002Ftr>\n\n  \u003Ctr>\u003Ctd>ReconVLA (Song et al., 2025c)  \u003C\u002Ftd>\u003Ctd> 7B\u003C\u002Ftd>\u003Ctd> 95.6 \u003C\u002Ftd>\u003Ctd>87.6 \u003C\u002Ftd>\u003Ctd>76.9\u003C\u002Ftd>\u003Ctd> 69.3\u003C\u002Ftd>\u003Ctd> 64.1 \u003C\u002Ftd>\u003Ctd>3.95\u003C\u002Ftd>\u003C\u002Ftr>\n\n  \u003Ctr>\u003Ctd rowspan=\"4\">Small-scale\u003C\u002Ftd>\u003Ctd>DeeR (Yue et al., 2024) \u003C\u002Ftd>\u003Ctd> 3B\u003C\u002Ftd>\u003Ctd> 86.2\u003C\u002Ftd>\u003Ctd> 70.1 \u003C\u002Ftd>\u003Ctd>51.8\u003C\u002Ftd>\u003Ctd> 41.5\u003C\u002Ftd>\u003Ctd> 30.4 \u003C\u002Ftd>\u003Ctd>2.82\u003C\u002Ftd>\u003C\u002Ftr>\n\n  \u003Ctr>\u003Ctd>RoboFlamingo (Li et al., 2024b) \u003C\u002Ftd>\u003Ctd> 3B\u003C\u002Ftd>\u003Ctd> 82.4 \u003C\u002Ftd>\u003Ctd>61.9\u003C\u002Ftd>\u003Ctd> 46.6 \u003C\u002Ftd>\u003Ctd>33.1\u003C\u002Ftd>\u003Ctd> 23.5\u003C\u002Ftd>\u003Ctd> 2.48\u003C\u002Ftd>\u003C\u002Ftr>\n\n  \u003Ctr>\u003Ctd>VPP (Hu et al., 2025)\u003C\u002Ftd>\u003Ctd>  1.5B\u003C\u002Ftd>\u003Ctd>  95.7\u003C\u002Ftd>\u003Ctd>  91.2\u003C\u002Ftd>\u003Ctd>  \u003Ci>\u003Cu>86.3*\u003C\u002Fu>\u003C\u002Fi>\u003C\u002Ftd>\u003Ctd>  \u003Ci>\u003Cu>81.0*\u003C\u002Fu>\u003C\u002Fi>\u003C\u002Ftd>\u003Ctd>  \u003Ci>\u003Cu>75.0*\u003C\u002Fu>\u003C\u002Fi>\u003C\u002Ftd>\u003Ctd>  \u003Ci>\u003Cu>4.33*\u003C\u002Fu>\u003C\u002Fi>\u003C\u002Ftd>\u003C\u002Ftr>\n\n  \u003Ctr>\u003Ctd>SuSIE (Black et al., 2024)\u003C\u002Ftd>\u003Ctd>1.3B\u003C\u002Ftd>\u003Ctd> 87.0\u003C\u002Ftd>\u003Ctd> 69.0\u003C\u002Ftd>\u003Ctd> 49.0 \u003C\u002Ftd>\u003Ctd>38.0\u003C\u002Ftd>\u003Ctd> 26.0\u003C\u002Ftd>\u003Ctd> 2.69\u003C\u002Ftd>\u003C\u002Ftr>\n\n  \u003Ctr>\u003Ctd rowspan=\"5\">Tiny-scale\u003C\u002Ftd>\u003Ctd>Seer-Large (Tian et al., 2025)\u003C\u002Ftd>\u003Ctd>0.57B\u003C\u002Ftd>\u003Ctd> 96.3 
\u003C\u002Ftd>\u003Ctd>\u003Ci>\u003Cu>91.6*\u003C\u002Fu>\u003C\u002Fi>\u003C\u002Ftd>\u003Ctd> 86.1 \u003C\u002Ftd>\u003Ctd>80.3 \u003C\u002Ftd>\u003Ctd>74.0\u003C\u002Ftd>\u003Ctd> 4.28\u003C\u002Ftd>\u003C\u002Ftr>\n\n  \u003Ctr>\u003Ctd>MoDE (Reuss et al., 2025) \u003C\u002Ftd>\u003Ctd> 0.44B \u003C\u002Ftd>\u003Ctd>96.2\u003C\u002Ftd>\u003Ctd> 88.9\u003C\u002Ftd>\u003Ctd> 81.1\u003C\u002Ftd>\u003Ctd> 71.8 \u003C\u002Ftd>\u003Ctd>63.5 \u003C\u002Ftd>\u003Ctd>4.01\u003C\u002Ftd>\u003C\u002Ftr>\n\n  \u003Ctr>\u003Ctd>Seer (Tian et al., 2025) \u003C\u002Ftd>\u003Ctd> 0.32B\u003C\u002Ftd>\u003Ctd> 94.4 \u003C\u002Ftd>\u003Ctd>87.2 \u003C\u002Ftd>\u003Ctd>79.9 \u003C\u002Ftd>\u003Ctd>72.2 \u003C\u002Ftd>\u003Ctd>64.3\u003C\u002Ftd>\u003Ctd> 3.98\u003C\u002Ftd>\u003C\u002Ftr>\n\n  \u003Ctr>\u003Ctd>\u003Cb>VLA-Adapter (Ours)\u003C\u002Fb>\u003C\u002Ftd>\n   \u003Ctd>\u003Cb>0.5B\u003C\u002Fb>\u003C\u002Ftd>\u003Ctd>\u003Cb>\u003Ci>99.1\u003C\u002Fi>\u003C\u002Fb> \u003C\u002Ftd>\u003Ctd>\u003Cb>94.6\u003C\u002Fb> \u003C\u002Ftd>\u003Ctd>\u003Cb>88.8\u003C\u002Fb>\u003C\u002Ftd>\u003Ctd> \u003Cb>82.8\u003C\u002Fb> \u003C\u002Ftd>\u003Ctd>\u003Cb>76.5\u003C\u002Fb> \u003C\u002Ftd>\u003Ctd>\u003Cb>4.42\u003C\u002Fb>\u003C\u002Ftd>\u003C\u002Ftr>\n\n  \u003Ctr>\u003Ctd>\u003Cb>VLA-Adapter-Pro (Ours)\u003C\u002Fb>\u003C\u002Ftd>\n   \u003Ctd>\u003Cb>0.5B\u003C\u002Fb>\u003C\u002Ftd>\u003Ctd>\u003Cb>98.5\u003C\u002Fb>\u003C\u002Ftd>\u003Ctd>\u003Cb>\u003Ci>95.0\u003C\u002Fi>\u003C\u002Fb> \u003C\u002Ftd>\u003Ctd>\u003Cb>\u003Ci>90.5\u003C\u002Fi>\u003C\u002Fb>\u003C\u002Ftd>\u003Ctd>\u003Cb>\u003Ci>85.3\u003C\u002Fi>\u003C\u002Fb>\u003C\u002Ftd>\u003Ctd>\u003Cb>\u003Ci>80.0\u003C\u002Fi>\u003C\u002Fb>\u003C\u002Ftd>\u003Ctd>\u003Cb>\u003Ci>4.50\u003C\u002Fi>\u003C\u002Fb>\u003C\u002Ftd>\u003C\u002Ftr>\n  \n\u003C\u002Ftable>\n\n\n\u003Cbr\u002F>\n\n\n## 📝 Citation \u003Ca name=\"cite\">\u003C\u002Fa>\n\n### 🫶 If you feel that this paper, models, or codes are helpful, 
please cite our paper. Thanks for your support of VLA-Adapter!\n\n```bibtex\n@article{wang2025vlaadapter,\n  author={Wang, Yihao and Ding, Pengxiang and Li, Lingxiao and Cui, Can and Ge, Zirui and Tong, Xinyang and Song, Wenxuan and Zhao, Han and Zhao, Wei and Hou, Pengxu and Huang, Siteng and Tang, Yifan and Wang, Wenhui and Zhang, Ru and Liu, Jianyi and Wang, Donglin},\n  title={VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model},\n  journal={arXiv preprint arXiv:2509.09372},\n  year={2025}\n}\n```\n\n## :heart: Acknowledgment\n\nWe thank [OpenVLA-OFT](https:\u002F\u002Fgithub.com\u002Fmoojink\u002Fopenvla-oft), [MiniVLA](https:\u002F\u002Fgithub.com\u002FStanford-ILIAD\u002Fopenvla-mini), and [RoboDual](https:\u002F\u002Fgithub.com\u002FOpenDriveLab\u002FRoboDual) for their open-source work!\n\n## 🌟 Star History\n\n\u003Ca href=\"https:\u002F\u002Fwww.star-history.com\u002F#OpenHelix-Team\u002FVLA-Adapter&Date\">\n  \u003Cimg src=\"https:\u002F\u002Fapi.star-history.com\u002Fsvg?repos=OpenHelix-Team\u002FVLA-Adapter&type=Date\" width=\"400\" height=\"250\" \u002F>\n\u003C\u002Fa>\n\n","VLA-Adapter is an effective paradigm for tiny-scale vision-language-action models. By combining vision, language, and action data, the project achieves efficient multimodal processing and is particularly suited to robotics and embodied AI. Its core features include adaptation support for multiple base models, an optimized implementation, and deployment verification on real robot platforms. On the technical side, VLA-Adapter is developed in Python, and several versions have been released for researchers and developers. It suits applications that require efficient fusion of vision-language-action data, such as service robots and automation systems.","2026-06-11 03:42:34","high_star"]