[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-84132":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":8,"htmlUrl":8,"language":9,"languages":8,"totalLinesOfCode":8,"stars":10,"forks":11,"watchers":12,"openIssues":13,"contributorsCount":13,"subscribersCount":13,"size":13,"stars1d":13,"stars7d":13,"stars30d":13,"stars90d":13,"forks30d":13,"starsTrendScore":13,"compositeScore":14,"rankGlobal":8,"rankLanguage":8,"license":15,"archived":16,"fork":16,"defaultBranch":17,"hasWiki":18,"hasPages":16,"topics":19,"createdAt":8,"pushedAt":8,"updatedAt":20,"readmeContent":21,"aiSummary":8,"trendingCount":13,"starSnapshotCount":13,"syncStatus":22,"lastSyncTime":23,"discoverSource":24},84132,"ACoT-VLA-WM","alexantaluo0\u002FACoT-VLA-WM","alexantaluo0",null,"Python",103,4,1,0,2.1,"Apache License 2.0",false,"main",true,[],"2026-06-12 02:04:38","# ACOT-VLA-WM: Precise Robotic Subgoal Generation and Execution\n\n[![Project Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject-Page-blue.svg)](https:\u002F\u002Fwww.acotvla-wm.xyz)\n[![arXiv](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2601.11404-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2601.11404v2)\n[![License: CC BY 4.0](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-CC%20BY%204.0-lightgrey.svg)](https:\u002F\u002Fcreativecommons.org\u002Flicenses\u002Fby\u002F4.0\u002F)\n[![License: MIT](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-MIT-green.svg)](https:\u002F\u002Fopensource.org\u002Flicenses\u002FMIT)\n\n**ACOT-VLA-WM** is an improved variant of [**ACoT-VLA**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2601.11404v2) that integrates a **Predictive World Model** into the Action Chain-of-Thought framework, enabling precise robotic subgoal generation and execution on long-horizon, high-precision manipulation tasks.\n\n> 📖 Full technical details, demos, and evaluation videos: 
**[acotvla-wm.xyz](https:\u002F\u002Fwww.acotvla-wm.xyz)**\n\n---\n\n## Overview\n\nFoundational robot models need not only semantic comprehension of task objectives, but also concrete instantiations of intermediate subgoals throughout training and inference. **ACOT-VLA-WM** addresses this by deeply fusing the main ACoT-VLA model with a predictive world model: the world model forecasts multi-view future frames and embeds them as visual subgoals into the main model's training, significantly improving physical fault tolerance and action precision.\n\n### Key Improvements over ACoT-VLA\n\n* **World Model Integration:** A finetuned predictive world model (BAGEL-based) generates multi-view future subgoal images that are embedded into ACoT-VLA training prompts.\n* **Mixed Subgoal Sampling (75% \u002F 12.5% \u002F 12.5%):**\n  - **75%** — uniformly sample a future frame 0–4 seconds ahead, improving robustness to execution delay and speed perturbations.\n  - **12.5%** — use the sub-step terminal frame as subgoal.\n  - **12.5%** — use world-model-generated future frames as subgoal.\n* **Robust Long-Horizon Execution:** On 5 industrial manipulation scenarios (10 rollouts each), baseline ACoT-VLA achieves **80%** overall success rate, while ACOT-VLA-WM reaches **100%**, including challenging tasks such as picking up a barcode scanner and scanning 5 QR codes on a marble table.\n\nCore training logic lives in `src\u002Fopenpi\u002Ftraining\u002Fsubgoal_dataset.py` and `src\u002Fopenpi\u002Ftraining\u002Fsampler.py`, which dynamically schedule world-model predictions and real future frames during training.\n\n---\n\n## Get Started\n\n### 1. Installation\n\nWe utilize **uv** to manage the Python environment.\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FAgibotTech\u002FACOT-VLA-WM.git\ncd ACOT-VLA-WM\ngit submodule update --init --recursive\nGIT_LFS_SKIP_SMUDGE=1 uv sync\nGIT_LFS_SKIP_SMUDGE=1 uv pip install -e .\n```\n\n### 2. 
G2 Robot: Video Data → Images (>2× Faster Model Training)\n\nConvert LeRobot video datasets to parquet-embedded image datasets for faster training I\u002FO.\n\n```bash\nexport HF_LEROBOT_HOME=\u002Fdata\u002Fdataset\u002FRobotdataset\u002FRobotdataset\u002FG2_Robot\u002Fphone_packaging\n\nuv run python scripts\u002Fpredecode_lerobot_videos_to_images.py \\\n  --input-repo-id video_data \\\n  --output-repo-id images_data \\\n  --image-format jpeg \\\n  --jpeg-quality 98 \\\n  --num-workers 16 \\\n  --overwrite\n```\n\n### 3. Compute Normalization Statistics\n\n```bash\nuv run scripts\u002Fcompute_norm_stats.py \\\n  --config-name acot-vla-wm-task \\\n  --robot-action-dim=24\n```\n\n### 4. Model Training\n\n```bash\n# Load image-based dataset for training\nexport XLA_PYTHON_CLIENT_MEM_FRACTION=0.6\n\nCUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 uv run scripts\u002Ftrain.py acot-vla-wm-task \\\n  --exp-name=id02 \\\n  --data.use-parquet-images \\\n  --batch-size 64 \\\n  --num-workers 16 \\\n  --overwrite\n```\n\n### 5. Model Deployment\n\n```bash\nexport CUDA_VISIBLE_DEVICES=4,5,6,7\nexport XLA_PYTHON_CLIENT_MEM_FRACTION=0.5\n\nGIT_LFS_SKIP_SMUDGE=1 uv run python scripts\u002Fserve_policy.py \\\n  --env G2SIM \\\n  --port 8067 \\\n  policy:checkpoint \\\n  --policy.config acot-vla-wm-task \\\n  --policy.dir \"\u002Fdata\u002Fcode\u002FACoT-VLA\u002Fcheckpoints\u002Facot-vla-wm-task\u002Fid01\u002F20000\"\n```\n\n---\n\n\n## How to Train World Model?\n\nACOT-VLA-WM relies on a finetuned predictive world model to generate multi-view future subgoal images. For world model training and finetuning, see the companion repository:\n\n**[BAGEL-WM](https:\u002F\u002Fgithub.com\u002Falexantaluo0\u002FBAGEL-WM)**\n\n\n\n## Acknowledgements\n\nThis repo is built upon the [OpenPI](https:\u002F\u002Fgithub.com\u002FPhysical-Intelligence\u002Fopenpi) framework and the [ACoT-VLA](https:\u002F\u002Farxiv.org\u002Fabs\u002F2601.11404v2) codebase. 
We sincerely thank the authors for their contributions to the community.\n",2,"2026-06-11 04:12:22","CREATED_QUERY"]