[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-80126":3},{"id":4,"name":5,"fullName":6,"owner":5,"repo":5,"description":7,"homepage":8,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":12,"subscribersCount":12,"size":12,"stars1d":14,"stars7d":15,"stars30d":16,"stars90d":12,"forks30d":12,"starsTrendScore":17,"compositeScore":18,"rankGlobal":9,"rankLanguage":9,"license":19,"archived":20,"fork":20,"defaultBranch":21,"hasWiki":20,"hasPages":20,"topics":22,"createdAt":9,"pushedAt":9,"updatedAt":23,"readmeContent":24,"aiSummary":25,"trendingCount":12,"starSnapshotCount":12,"syncStatus":15,"lastSyncTime":26,"discoverSource":27},80126,"GuidedVLA","GuidedVLA\u002FGuidedVLA","[RSS 2026] GuidedVLA: Specifying Task-Relevant Factors via Plug-and-Play Action Attention Specialization","https:\u002F\u002Fguidedvla.github.io\u002Fproject_page\u002F",null,"Python",62,0,54,1,2,7,3,43.7,"Apache License 2.0",false,"main",[],"2026-06-12 04:01:26","# GuidedVLA: Specifying Task-Relevant Factors via Plug-and-Play Action Attention Specialization\n\n\u003Cdiv align=\"center\">\n\n\u003Cp>\nXiaosong Jia\u003Csup>&#42;&#8224;,1,2\u003C\u002Fsup>, Bowen Yang\u003Csup>&#42;,3\u003C\u002Fsup>, Zuhao Ge\u003Csup>&#42;,1,2\u003C\u002Fsup>, Xian Nie\u003Csup>&#42;,3\u003C\u002Fsup>, Yuchen Zhou\u003Csup>&#42;,1,2\u003C\u002Fsup>, Cunxin Fan\u003Csup>&#42;&#8224;,3\u003C\u002Fsup>, Yufeng Li\u003Csup>3\u003C\u002Fsup>, Yilin Chai\u003Csup>3\u003C\u002Fsup>, Chao Jing\u003Csup>1,2\u003C\u002Fsup>, Zijian Liang\u003Csup>3\u003C\u002Fsup>, Qingwen Bu\u003Csup>4\u003C\u002Fsup>, Haidong Cao\u003Csup>1,2\u003C\u002Fsup>, Chao Wu\u003Csup>1,2\u003C\u002Fsup>, Qifeng Li\u003Csup>3\u003C\u002Fsup>, Zhenjie Yang\u003Csup>3\u003C\u002Fsup>, Chenhe Zhang\u003Csup>1,2\u003C\u002Fsup>, Hongyang Li\u003Csup>4\u003C\u002Fsup>, Zuxuan Wu\u003Csup>&#9993;,1,2\u003C\u002Fsup>, Junchi 
Yan\u003Csup>&#9993;,3\u003C\u002Fsup>, Yu-Gang Jiang\u003Csup>&#9993;,1,2\u003C\u002Fsup>\n\u003C\u002Fp>\n\n\u003Cp>\n\u003Csup>1\u003C\u002Fsup>Institute of Trustworthy Embodied AI (TEAI), Fudan University &nbsp;\n\u003Csup>2\u003C\u002Fsup>Shanghai Key Laboratory of Multimodal Embodied AI &nbsp;\n\u003Csup>3\u003C\u002Fsup>Shanghai Jiao Tong University &nbsp;\n\u003Csup>4\u003C\u002Fsup>OpenDriveLab, The University of Hong Kong\n\u003C\u002Fp>\n\n\u003Cp>\u003Csup>&#42;\u003C\u002Fsup> Core Contributors &nbsp;&nbsp; \u003Csup>&#8224;\u003C\u002Fsup> Project Lead &nbsp;&nbsp; \u003Csup>&#9993;\u003C\u002Fsup> Corresponding Authors\u003C\u002Fp>\n\n[[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.12369) &nbsp;|&nbsp; [[Project Page]](https:\u002F\u002Fguidedvla.github.io\u002Fproject_page\u002F) &nbsp;|&nbsp; [[Code]](https:\u002F\u002Fgithub.com\u002FGuidedVLA\u002FGuidedVLA) &nbsp;|&nbsp; [[Checkpoint]](https:\u002F\u002Fhuggingface.co\u002Fybwowen\u002Fpi0-libero-object-depth-skill) &nbsp;|&nbsp; [[Dataset]](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fybwowen\u002Flibero) &nbsp;|&nbsp; [[Citation]](#citation)\n\n\u003C\u002Fdiv>\n\n---\n\n**GuidedVLA** is a VLA paradigm where the action decoder is explicitly guided to capture task-relevant information (object grounding, spatial geometry, and temporal skill logic) through per-head attention specialization. 
Instead of relying on end-to-end supervision to implicitly learn such features, GuidedVLA repurposes dedicated attention heads to specialize in distinct task-relevant factors, supervised by manually defined auxiliary signals.\n\nThis repository extends [openpi](https:\u002F\u002Fgithub.com\u002FPhysical-Intelligence\u002Fopenpi) (π₀ \u002F π₀.₅) with the GuidedVLA framework and a full PyTorch training pipeline.\n\n## Release Status\n\n- [x] Release code\n- [x] Release LIBERO training dataset: [ybwowen\u002Flibero](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fybwowen\u002Flibero)\n- [x] Release LIBERO checkpoint: [ybwowen\u002Fpi0-libero-object-depth-skill](https:\u002F\u002Fhuggingface.co\u002Fybwowen\u002Fpi0-libero-object-depth-skill)\n- [ ] Release RoboTwin training dataset\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"docs\u002Ffigures\u002Fguidedvla-teaser.png\" alt=\"GuidedVLA teaser\" width=\"95%\"\u002F>\n\u003C\u002Fp>\n\n## Key Results\n\n**LIBERO-Plus** (robustness benchmark across 7 perturbation dimensions):\n\n| Model | Spatial | Object | Goal | Long | **Total** |\n|---|---|---|---|---|---|\n| π₀ baseline | 77.7 | 74.1 | 61.4 | 60.1 | 68.2 |\n| w\u002F object head | 80.6 | **82.5** | 67.1 | 64.0 | 73.4 |\n| w\u002F skill head | 79.8 | 78.9 | 68.9 | 62.7 | 72.5 |\n| w\u002F depth head | 81.4 | 79.0 | 65.4 | 61.8 | 71.7 |\n| **GuidedVLA (ours)** | **84.0** | 80.9 | **70.8** | **66.2** | **75.4** |\n\n**RoboTwin 2.0** (8 manipulation tasks, out-of-domain): π₀ **77.38%** → GuidedVLA **90.63%**\n\n**Real-world** (ALOHA AgileX + PSI-Bot RealMan, 6 household\u002Flab tasks):\n\n| Generalization | Base Policy | GuidedVLA |\n|---|---|---|\n| In-domain | 55.8% | **75.8%** |\n| Scene | 44.2% | **67.5%** |\n| Lighting | 57.5% | **79.2%** |\n\n---\n\n## Method Overview\n\nGuidedVLA treats the action decoder not as a monolithic learner, but as an **assembly of functionally specialized components**. 
Attention heads are supervised by task-specific auxiliary signals:\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"docs\u002Ffigures\u002Fguidedvla-model-structure.png\" alt=\"GuidedVLA model structure\" width=\"95%\"\u002F>\n\u003C\u002Fp>\n\n1. **Object Head** (Visual Grounding): Guides a subset of heads H_obj to align their attention maps with ground-truth object-region masks via a weighted negative log-likelihood loss L_object. Forces action tokens to attend to the relevant objects and suppress distractors.\n\n2. **Skill Head** (Temporal Logic): Designates heads H_skill to classify the current sub-skill or task phase from their output features, supervised by a KL-divergence loss L_skill against soft skill labels. Captures long-horizon temporal structure without requiring hard skill boundaries.\n\n3. **Depth Head** (Geometry Perception): Injects 3D spatial cues from a frozen depth encoder ([Depth Anything V3](https:\u002F\u002Fgithub.com\u002FDepthAnything\u002FDepth-Anything-V3)) as additional keys and values for a subset of heads H_depth. 
Structural constraint; no loss term required.\n\n**Plug-and-Play via ControlNet Adapter**: Specialized heads are introduced as a lightweight control branch fused into the pretrained backbone via a zero-initialized projection (ZeroConv), matching the ControlNet residual strategy:\n\n```\nAttn_final(x) = Attn_main(x) + ZeroConv(Attn_specified(x))\n```\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"docs\u002Ffigures\u002Fcontrolnet_style_adapter.png\" alt=\"ControlNet-style adapter\" width=\"70%\"\u002F>\n\u003C\u002Fp>\n\nThe branch starts with zero contribution and gradually learns to inject factor-specific biases, preserving pretrained capabilities throughout training.\n\n---\n\n## Installation\n\nClone the repo with submodules:\n\n```bash\ngit clone --recurse-submodules https:\u002F\u002Fgithub.com\u002FGuidedVLA\u002FGuidedVLA.git\ncd GuidedVLA\n\n# Or if already cloned:\ngit submodule update --init --recursive\n```\n\nWe use [uv](https:\u002F\u002Fdocs.astral.sh\u002Fuv\u002F) to manage Python dependencies:\n\n```bash\nGIT_LFS_SKIP_SMUDGE=1 uv sync\nGIT_LFS_SKIP_SMUDGE=1 uv pip install -e .\n```\n\nApply the necessary patches to the `transformers` library (required for AdaRMS, activation precision, and KV-cache control):\n\n```bash\ncp -r .\u002Fsrc\u002Fopenpi\u002Fmodels_pytorch\u002Ftransformers_replace\u002F* .venv\u002Flib\u002Fpython3.11\u002Fsite-packages\u002Ftransformers\u002F\n```\n\n> **Note**: With the default uv hardlink mode, this permanently patches the transformers cache. To fully undo: `uv cache clean transformers`.\n\n### Depth Encoder Setup\n\nGuidedVLA uses [Depth Anything V3 Small](https:\u002F\u002Fgithub.com\u002FDepthAnything\u002FDepth-Anything-V3) as the frozen depth encoder. 
Download the DA3-SMALL checkpoint and set `depth_model_name` to the local checkpoint path in your config:\n\n```python\ndepth_model_name = \"path\u002Fto\u002Fda3-small\"  # local checkpoint path\n```\n\n---\n\n## Checkpoints\n\nThe released GuidedVLA checkpoint is hosted on Hugging Face:\n\n| Model | Config | Checkpoint |\n|---|---|---|\n| GuidedVLA LIBERO object + depth + skill | `pi0_libero_object_depth_skill` | [`ybwowen\u002Fpi0-libero-object-depth-skill`](https:\u002F\u002Fhuggingface.co\u002Fybwowen\u002Fpi0-libero-object-depth-skill) |\n\nThe checkpoint contains `model.safetensors` and normalization statistics under\n`assets\u002Fybwowen\u002Flibero\u002Fnorm_stats.json`.\n\nGuidedVLA is built on top of the π₀ \u002F π₀.₅ base models from Physical Intelligence:\n\n| Model | Checkpoint |\n|---|---|\n| π₀ base | `gs:\u002F\u002Fopenpi-assets\u002Fcheckpoints\u002Fpi0_base` |\n| π₀.₅ base | `gs:\u002F\u002Fopenpi-assets\u002Fcheckpoints\u002Fpi05_base` |\n\nConvert a JAX checkpoint to PyTorch format before training:\n\n```bash\nuv run examples\u002Fconvert_jax_model_to_pytorch.py \\\n    --checkpoint_dir \u002Fpath\u002Fto\u002Fjax\u002Fcheckpoint \\\n    --config_name pi0_libero \\\n    --output_path \u002Fpath\u002Fto\u002Fpytorch\u002Fcheckpoint \\\n    --precision float32\n```\n\n---\n\n## Training\n\n### 1. Prepare your dataset\n\nFor GuidedVLA LIBERO training, we provide the released LeRobot-format dataset on Hugging Face:\n[`ybwowen\u002Flibero`](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fybwowen\u002Flibero). 
The default\nGuidedVLA LIBERO configs in [src\u002Fopenpi\u002Ftraining\u002Fconfig.py](src\u002Fopenpi\u002Ftraining\u002Fconfig.py)\nuse this dataset, including the released checkpoint config\n`pi0_libero_object_depth_skill`.\n\nIf you use your own data, convert it to [LeRobot](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Flerobot) format.\n\nFor GuidedVLA's auxiliary supervisions, your dataset should additionally include:\n- **Object head**: `agentview_attention_object_mask` and `wrist_attention_object_mask`\n- **Skill head**: `observation.skill_id` for online soft skill label construction\n- **Depth head**: RGB images (depth is computed on-the-fly by the frozen encoder)\n\nIf you use a different dataset schema, update the data transforms in\n[src\u002Fopenpi\u002Ftraining\u002Fconfig.py](src\u002Fopenpi\u002Ftraining\u002Fconfig.py) so the object and\nskill targets are repacked into the keys consumed by the PyTorch trainer.\n\n### 2. Configure your training run\n\nEdit your config in [src\u002Fopenpi\u002Ftraining\u002Fconfig.py](src\u002Fopenpi\u002Ftraining\u002Fconfig.py). 
The key GuidedVLA configs are:\n\n| Config | Description |\n|---|---|\n| `pi0_libero_object_depth_skill` | Full GuidedVLA: object + depth + skill heads |\n| `pi0_libero_object` | Object head only |\n| `pi0_libero_depth` | Depth head only |\n| `pi0_libero_skill` | Skill head only |\n| `pi0_libero` | π₀ baseline (no guided heads) |\n\nKey config fields in `Pi0Config`:\n\n```python\n# ControlNet-style attention branch\ncontrol_attention_enabled: bool = True\ncontrol_attention_target: str = \"expert\"   # \"expert\", \"paligemma\", or \"both\"\ncontrol_attention_num_heads: int | None = 8  # heads in control branch\ncontrol_attention_use_headwise_gate: bool = True\n\n# Depth\nuse_depth: bool = True\ndepth_model_name: str = \"path\u002Fto\u002Fda3-small\"\nguided_layer_indices: list = [9, 10, 11, 12]\ndepth_head_indices: list = [4, 5]\n\n# Skill\nuse_skill_loss: bool = True\nskill_num_classes: int = 4  # 3 effective skills + 1 null\u002Fbackground class\nskill_head_indices: list = [6, 7]\n```\n\nFor the default LIBERO full GuidedVLA config, the auxiliary loss weights are:\n\n```python\nobject_loss_weight: float = 0.001\nskill_loss_weight: float = 0.001\n```\n\n### 3. Compute normalization statistics\n\n```bash\nuv run scripts\u002Fcompute_norm_stats.py --config-name pi0_libero_object_depth_skill\n```\n\n### 4. 
Launch training\n\n```bash\n# Single GPU\nuv run scripts\u002Ftrain_pytorch.py pi0_libero_object_depth_skill \\\n    --exp_name my_run --save_interval 2000\n\n# Multi-GPU (single node)\nuv run torchrun --standalone --nnodes=1 --nproc_per_node=8 \\\n    scripts\u002Ftrain_pytorch.py pi0_libero_object_depth_skill \\\n    --exp_name my_run --save_interval 2000\n\n# Resume from latest checkpoint\nuv run scripts\u002Ftrain_pytorch.py pi0_libero_object_depth_skill \\\n    --exp_name my_run --resume\n\n# Multi-node (e.g., 2 nodes × 8 GPUs)\nuv run torchrun \\\n    --nnodes=2 --nproc_per_node=8 \\\n    --node_rank=\u003Crank> --master_addr=\u003Cip> --master_port=\u003Cport> \\\n    scripts\u002Ftrain_pytorch.py pi0_libero_object_depth_skill \\\n    --exp_name my_run --save_interval 2000\n```\n\nCheckpoints are saved to `.\u002Fcheckpoints\u002F\u003Cconfig_name>\u002F\u003Cexp_name>\u002F`.\n\n### Precision\n\nGuidedVLA trains with **float32 master weights** and `torch.autocast(bfloat16)` for mixed-precision computation.\n\n---\n\n## Evaluation\n\n### LIBERO-Plus\n\nThe current LIBERO-Plus workflow uses:\n- `scripts\u002Fserve_policy.py` for the PyTorch policy server\n- `examples\u002Flibero_plus\u002Fmain.py` for a single evaluation job\n- `examples\u002Flibero_plus\u002Feval_libero_plus.py` for multi-GPU \u002F multi-process batch evaluation\n\n#### 1. 
Prepare the LIBERO-Plus simulator environment\n\n`examples\u002Flibero_plus\u002Fmain.py` runs inside the LIBERO-Plus simulator environment, which is separate from the main training environment:\n\n```bash\nuv venv --python 3.8 examples\u002Flibero_plus\u002F.venv\nsource examples\u002Flibero_plus\u002F.venv\u002Fbin\u002Factivate\n\nuv pip sync examples\u002Flibero_plus\u002Frequirements.txt third_party\u002FLIBERO-plus\u002Frequirements.txt \\\n    --extra-index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu113 \\\n    --index-strategy=unsafe-best-match\n\nuv pip install -e packages\u002Fopenpi-client\nuv pip install -e third_party\u002FLIBERO-plus\nuv pip install -r third_party\u002FLIBERO-plus\u002Fextra_requirements.txt\n\nexport PYTHONPATH=$PYTHONPATH:$(pwd)\u002Fthird_party\u002FLIBERO-plus\n```\n\n#### 2. Launch the policy server\n\nIn one terminal, start the checkpoint server from the repo root:\n\n```bash\nCUDA_VISIBLE_DEVICES=0 uv run --no-sync scripts\u002Fserve_policy.py \\\n    --env LIBERO \\\n    --port 8000 \\\n    policy:checkpoint \\\n    --policy.config pi0_libero_object_depth_skill \\\n    --policy.dir hf:\u002F\u002Fmodels\u002Fybwowen\u002Fpi0-libero-object-depth-skill\n```\n\n#### 3. 
Run a single LIBERO-Plus job\n\nIn a second terminal, use the simulator environment to evaluate one suite or one perturbation category:\n\n```bash\nsource examples\u002Flibero_plus\u002F.venv\u002Fbin\u002Factivate\nexport PYTHONPATH=$PYTHONPATH:$(pwd)\u002Fthird_party\u002FLIBERO-plus\n\npython examples\u002Flibero_plus\u002Fmain.py \\\n    --host 127.0.0.1 \\\n    --port 8000 \\\n    --task-suite-name libero_object \\\n    --category \"Objects Layout\" \\\n    --num-trials-per-task 1 \\\n    --video-out-path data\u002Flibero_plus\u002Fvideos \\\n    --results-json-path data\u002Flibero_plus\u002Flibero_object.json\n```\n\nUseful `main.py` arguments:\n- `--task-suite-name`: `libero_spatial`, `libero_object`, `libero_goal`, `libero_10`, or `all`\n- `--category`: one of `Objects Layout`, `Camera Viewpoints`, `Robot Initial States`, `Language Instructions`, `Light Conditions`, `Background Textures`, `Sensor Noise`\n- `--task-ids`: e.g. `0`, `0,3,7`, or `10-19`\n- `--replan-steps`: action chunk size requested from the server\n- `--results-json-path`: rolling JSON summary; when `--category` is set, the category suffix is appended automatically\n\n#### 4. 
Run parallel evaluation across GPUs\n\n`examples\u002Flibero_plus\u002Feval_libero_plus.py` starts one policy server per GPU and dispatches evaluation jobs automatically:\n\n```bash\n.venv\u002Fbin\u002Fpython examples\u002Flibero_plus\u002Feval_libero_plus.py \\\n    --checkpoint-dir hf:\u002F\u002Fmodels\u002Fybwowen\u002Fpi0-libero-object-depth-skill \\\n    --policy-config pi0_libero_object_depth_skill \\\n    --gpu-ids 0,1,2,3 \\\n    --client-python examples\u002Flibero_plus\u002F.venv\u002Fbin\u002Fpython \\\n    --libero-plus-path third_party\u002FLIBERO-plus\n```\n\nUseful `eval_libero_plus.py` arguments:\n- `--task-suites`: comma-separated suites, default is `libero_spatial,libero_object,libero_goal,libero_10`\n- `--categories`: comma-separated perturbation categories\n- `--task-ids`: restrict to a subset of tasks\n- `--num-trials-per-task`: number of rollouts per task\n\nOutputs are written under:\n- `data\u002Flibero_plus\u002F` for JSON results and rollout videos\n- `logs\u002Flibero_plus\u002F` for per-worker logs\n\n### RoboTwin 2.0\n\nEvaluation scripts for RoboTwin 2.0 are provided in `examples\u002Frobotwin\u002F`.\nInitialize RoboTwin with `git submodule update --init --recursive third_party\u002FRoboTwin` before running the RoboTwin pipeline.\nRoboTwin evaluation follows the same policy-server workflow: serve a checkpoint with `scripts\u002Fserve_policy.py`, then run `examples\u002Frobotwin\u002Fmain.py` for the target task.\n\nFor the full RoboTwin setup, data conversion, training configs, and evaluation flow, see [examples\u002Frobotwin\u002FREADME.md](examples\u002Frobotwin\u002FREADME.md).\n\n---\n\n## Citation\n\nIf you find this work useful, please cite:\n\n```bibtex\n@misc{jia2026guidedvla,\n  title         = {GuidedVLA: Specifying Task-Relevant Factors via Plug-and-Play Action Attention Specialization},\n  author        = {Xiaosong Jia and Bowen Yang and Zuhao Ge and Xian Nie and Yuchen Zhou and Cunxin Fan and Yufeng Li and Yilin Chai 
and Chao Jing and Zijian Liang and Qingwen Bu and Haidong Cao and Chao Wu and Qifeng Li and Zhenjie Yang and Chenhe Zhang and Hongyang Li and Zuxuan Wu and Junchi Yan and Yu-Gang Jiang},\n  year          = {2026},\n  eprint        = {2605.12369},\n  archivePrefix = {arXiv},\n  primaryClass  = {cs.RO},\n  url           = {https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.12369},\n}\n```\n\n---\n\n## Acknowledgements\n\nGuidedVLA is built as an extension of [openpi](https:\u002F\u002Fgithub.com\u002FPhysical-Intelligence\u002Fopenpi) by Physical Intelligence. We thank the openpi team for open-sourcing their codebase and pretrained models.\n\n---\n\n## License and Third-Party Notices\n\nThe GuidedVLA source code in this repository is released under the\n[Apache License 2.0](LICENSE), unless a file states otherwise.\n\nGemma and PaliGemma related components and model weights are subject to the\nGemma terms in [LICENSE_GEMMA.txt](LICENSE_GEMMA.txt). The repository also uses\nthird-party projects through submodules and dependencies, including openpi,\nDepth Anything V3, LIBERO, LIBERO-Plus, RoboTwin, and ALOHA. 
Those projects\nremain governed by their own licenses and model or dataset terms.\n\n---\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>openpi Documentation (original)\u003C\u002Fb>\u003C\u002Fsummary>\n\nopenpi holds open-source models and packages for robotics, published by the [Physical Intelligence team](https:\u002F\u002Fwww.physicalintelligence.company\u002F).\n\nCurrently, this repo contains three types of models:\n- the [π₀ model](https:\u002F\u002Fwww.physicalintelligence.company\u002Fblog\u002Fpi0), a flow-based vision-language-action model (VLA).\n- the [π₀-FAST model](https:\u002F\u002Fwww.physicalintelligence.company\u002Fresearch\u002Ffast), an autoregressive VLA, based on the FAST action tokenizer.\n- the [π₀.₅ model](https:\u002F\u002Fwww.physicalintelligence.company\u002Fblog\u002Fpi05), an upgraded version of π₀ with better open-world generalization.\n\n### Updates\n\n- [Sept 2025] PyTorch support released in openpi.\n- [Sept 2025] π₀.₅ released with better open-world generalization.\n- [Jun 2025] [Instructions](examples\u002Fdroid\u002FREADME_train.md) for training on the full [DROID dataset](https:\u002F\u002Fdroid-dataset.github.io\u002F).\n\n### Base Model Checkpoints\n\n| Model | Checkpoint |\n|---|---|\n| π₀ | `gs:\u002F\u002Fopenpi-assets\u002Fcheckpoints\u002Fpi0_base` |\n| π₀-FAST | `gs:\u002F\u002Fopenpi-assets\u002Fcheckpoints\u002Fpi0_fast_base` |\n| π₀.₅ | `gs:\u002F\u002Fopenpi-assets\u002Fcheckpoints\u002Fpi05_base` |\n\n### Fine-Tuned Checkpoints\n\n| Model | Checkpoint |\n|---|---|\n| π₀-FAST-DROID | `gs:\u002F\u002Fopenpi-assets\u002Fcheckpoints\u002Fpi0_fast_droid` |\n| π₀-DROID | `gs:\u002F\u002Fopenpi-assets\u002Fcheckpoints\u002Fpi0_droid` |\n| π₀-ALOHA-towel | `gs:\u002F\u002Fopenpi-assets\u002Fcheckpoints\u002Fpi0_aloha_towel` |\n| π₀-ALOHA-tupperware | `gs:\u002F\u002Fopenpi-assets\u002Fcheckpoints\u002Fpi0_aloha_tupperware` |\n| π₀-ALOHA-pen-uncap | `gs:\u002F\u002Fopenpi-assets\u002Fcheckpoints\u002Fpi0_aloha_pen_uncap` |\n| 
π₀.₅-LIBERO | `gs:\u002F\u002Fopenpi-assets\u002Fcheckpoints\u002Fpi05_libero` |\n| π₀.₅-DROID | `gs:\u002F\u002Fopenpi-assets\u002Fcheckpoints\u002Fpi05_droid` |\n\nCheckpoints are automatically downloaded and cached in `~\u002F.cache\u002Fopenpi`. Override with `OPENPI_DATA_HOME`.\n\n### Running Inference\n\n```python\nfrom openpi.training import config as _config\nfrom openpi.policies import policy_config\nfrom openpi.shared import download\n\nconfig = _config.get_config(\"pi05_droid\")\ncheckpoint_dir = download.maybe_download(\"gs:\u002F\u002Fopenpi-assets\u002Fcheckpoints\u002Fpi05_droid\")\npolicy = policy_config.create_trained_policy(config, checkpoint_dir)\n\naction_chunk = policy.infer({\n    \"observation\u002Fexterior_image_1_left\": ...,\n    \"observation\u002Fwrist_image_left\": ...,\n    \"prompt\": \"pick up the fork\",\n})[\"actions\"]\n```\n\nFor step-by-step examples: [DROID](examples\u002Fdroid\u002FREADME.md) | [ALOHA](examples\u002Faloha_real\u002FREADME.md) | [Remote Inference](docs\u002Fremote_inference.md)\n\n### Troubleshooting\n\n| Issue | Resolution |\n|---|---|\n| `uv sync` fails | Remove `.venv` and retry. Update uv: `uv self update`. |\n| Out of GPU memory | Use multi-GPU DDP (`--nproc_per_node=N`) or reduce batch size. |\n| Missing norm stats | Run `scripts\u002Fcompute_norm_stats.py --config-name \u003Cname>` first. |\n| Dataset download fails | Check internet \u002F HuggingFace login: `huggingface-cli login`. |\n| CUDA errors | Try uninstalling system CUDA; uv installs the correct version. |\n| Diverging training loss | Check `norm_stats.json` for near-zero `std`\u002F`q01`\u002F`q99` values; adjust manually. |\n\n\u003C\u002Fdetails>\n","GuidedVLA is a vision-language-action (VLA) paradigm that specifies task-relevant factors through plug-and-play action attention specialization. Its core idea is to use dedicated attention heads to capture key information such as object grounding, spatial geometry, and temporal skill logic, rather than relying on end-to-end supervision to learn these features implicitly. Technically, it provides a full PyTorch training pipeline and extends the openpi project. It suits scenarios where robots need greater robustness and accuracy when performing specific tasks in complex environments, such as household service robots or industrial automation.","2026-06-11 03:59:21","CREATED_QUERY"]