[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-74279":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":23,"hasPages":23,"topics":25,"createdAt":10,"pushedAt":10,"updatedAt":26,"readmeContent":27,"aiSummary":28,"trendingCount":16,"starSnapshotCount":16,"syncStatus":29,"lastSyncTime":30,"discoverSource":31},74279,"lingbot-va","Robbyant\u002Flingbot-va","Robbyant","[RSS 2026] Causal video-action world model for generalist robot control","https:\u002F\u002Ftechnology.robbyant.com\u002Flingbot-va",null,"Python",1318,110,12,44,0,23,58,162,69,104.14,"Apache License 2.0",false,"main",[],"2026-06-12 04:01:14","\u003Ch1 align=\"center\">LingBot-VA: Causal World Modeling for Robot Control\u003C\u002Fh1>\n\n\u003Cp align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2601.21998\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fstatic\u002Fv1?label=Paper&message=PDF&color=red&logo=arxiv\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Ftechnology.robbyant.com\u002Flingbot-va\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject-Website-blue\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Frobbyant\u002Flingbot-va\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fstatic\u002Fv1?label=%F0%9F%A4%97%20Model&message=HuggingFace&color=orange\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fmodelscope.cn\u002Fcollections\u002FRobbyant\u002FLingBot-VA\">\u003Cimg 
src=\"https:\u002F\u002Fimg.shields.io\u002Fstatic\u002Fv1?label=%F0%9F%A4%96%20Model&message=ModelScope&color=purple\">\u003C\u002Fa>\n  \u003Ca href=\"LICENSE.txt\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-Apache--2.0-green\">\u003C\u002Fa>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002Fteaser_v3.png\" width=\"100%\">\n\u003C\u002Fp>\n\n\n\nhttps:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Fcec7b7a6-953b-4fa4-8f1a-47efc1fce547\n\n\n\n\n## Table of Contents\n\n- [News](#-news)\n- [Model Download](#-model-download)\n- [Quick Start](#️-quick-start)\n  - [Installation](#installation)\n  - [attn_mode Configuration](#️-important-attn_mode-configuration)\n  - [Deploying LingBot-VA for Inference](#deploying-lingbot-va-for-inference)\n    - [Evaluation on RoboTwin-2.0](#evaluation-on-robotwin-20)\n    - [Evaluation on LIBERO](#evaluation-on-libero)\n    - [Run Image to Video-Action Generation](#run-image-to-video-action-generation)\n  - [Post-Training LingBot-VA](#post-training-lingbot-va)\n    - [Data Preparation](#data-preparation)\n    - [Custom Dataset Preparation](#custom-dataset-preparation)\n    - [Training](#training)\n- [Performance](#-performance)\n  - [Simulation Evaluation](#simulation-evaluation)\n  - [Real-world Deployment](#real-world-deployment)\n- [License](#-license)\n- [Citation](#citation)\n- [Acknowledgments](#-acknowledgments)\n\n---\n\n## 💫 Meet **LingBot-VA**!  We've built an AR diffusion framework for simultaneous world modeling and action! 
🤖✨\n\n**LingBot-VA** focuses on:\n- **Autoregressive Video-Action World Modeling**: Architecturally unifies visual dynamics prediction and action inference within a single interleaved sequence while maintaining their conceptual distinction.\n- **High-efficiency Execution**: A dual-stream mixture-of-transformers (MoT) architecture with asynchronous execution and a KV cache.\n- **Long-Horizon Performance and Generalization**: Substantial improvements in sample efficiency, long-horizon success rates, and generalization to novel scenes.\n\n# 🚀 News\n- **[2026-04-24]** Weights post-trained on **LIBERO-LONG** released! (**IMPORTANT**: Ensure that `va_libero_cfg.action_snr_shift`, `va_libero_cfg.used_action_channel_ids` and `va_libero_cfg.norm_stat` in [`wan_va\u002Fconfigs\u002Fva_libero_cfg.py`](wan_va\u002Fconfigs\u002Fva_libero_cfg.py) are synchronized with the latest version of the repository.)\n- **[2026-04-08]** Post-training and inference code for the **LIBERO** dataset is now available!\n- **[2026-02-17]** Post-training code and dataset released! Supports fine-tuning LingBot-VA on custom robotic manipulation datasets.\n- **[2026-01-29]** Weights and code for the shared-backbone version released! 
Please stay tuned for our separated version!\n\n\n\n\n---\n\n\n\n# 📦 Model Download\n- **Pretrained Checkpoints for Post-Training**\n\n| Model Name | Huggingface Repository | ModelScope Repository  | Description |\n| :--- | :--- | :--- | :--- |\n| lingbot-va-base &nbsp; | [🤗 robbyant\u002Flingbot-va-base &nbsp;](https:\u002F\u002Fhuggingface.co\u002Frobbyant\u002Flingbot-va-base) | [🤖 Robbyant\u002Flingbot-va-base &nbsp;](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FRobbyant\u002Flingbot-va-base)  | LingBot-VA w\u002F shared backbone|\n| lingbot-va-posttrain-robotwin &nbsp; | [🤗 robbyant\u002Flingbot-va-posttrain-robotwin &nbsp;](https:\u002F\u002Fhuggingface.co\u002Frobbyant\u002Flingbot-va-posttrain-robotwin) | [🤖 Robbyant\u002Flingbot-va-posttrain-robotwin &nbsp;](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FRobbyant\u002Flingbot-va-posttrain-robotwin)  | LingBot-VA-Posttrain-Robotwin w\u002F shared backbone|\n| lingbot-va-posttrain-libero-long &nbsp; | [🤗 robbyant\u002Flingbot-va-posttrain-libero-long &nbsp;](https:\u002F\u002Fhuggingface.co\u002Frobbyant\u002Flingbot-va-posttrain-libero-long) | [🤖 Robbyant\u002Flingbot-va-posttrain-libero-long &nbsp;](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FRobbyant\u002Flingbot-va-posttrain-libero-long)  | LingBot-VA-Posttrain-LIBERO-LONG w\u002F shared backbone|\n\n- **Post-Training Dataset**\n\n| Dataset Name | Huggingface Repository | ModelScope Repository | Description |\n| :--- | :--- | :--- | :--- |\n| robotwin-clean-and-aug-lerobot &nbsp; | [🤗 robbyant\u002Frobotwin-clean-and-aug-lerobot](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Frobbyant\u002Frobotwin-clean-and-aug-lerobot) | [🤖 Robbyant\u002Frobotwin-clean-and-aug-lerobot](https:\u002F\u002Fmodelscope.cn\u002Fdatasets\u002FRobbyant\u002Frobotwin-clean-and-aug-lerobot) | Cleaned & augmented RoboTwin dataset in LeRobot format for post-training |\n| libero-long-lerobot &nbsp; | [🤗 
robbyant\u002Flibero-long-lerobot](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Frobbyant\u002Flibero-long-lerobot) | [🤖 Robbyant\u002Flibero-long-lerobot](https:\u002F\u002Fmodelscope.cn\u002Fdatasets\u002FRobbyant\u002Flibero-long-lerobot) | LIBERO-Long dataset in LeRobot format for post-training |\n\n---\n\n# 🛠️ Quick Start\n\n## Installation\n**Requirements**\n- Python == 3.10.16\n- PyTorch == 2.9.0\n- CUDA 12.6\n\n```bash\npip install torch==2.9.0 torchvision==0.24.0 torchaudio==2.9.0 --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu126\npip install websockets einops diffusers==0.36.0 transformers==4.55.2 accelerate msgpack opencv-python matplotlib ftfy easydict\npip install flash-attn --no-build-isolation\n```\n\n\n## ⚠️ Important: `attn_mode` Configuration\n\n> **You MUST change the `attn_mode` setting depending on whether you are training or running inference.**\n> Since LingBot-VA is loaded via `from_pretrained`, this parameter is read from the model folder's **`transformer\u002Fconfig.json`**.\n> You need to **manually edit** this file before launching.\n>\n> | Mode | `attn_mode` value | Notes |\n> |---|---|---|\n> | **Training** | `\"flex\"` | Required for training. **Will not work** for inference. |\n> | **Inference \u002F Evaluation** | `\"torch\"` or `\"flashattn\"` | Required for inference. `\"flex\"` will cause errors at eval time. |\n>\n> **How to change:** Open `\u003Cyour-model-path>\u002Ftransformer\u002Fconfig.json`, find the `\"attn_mode\"` field, and set it to the appropriate value.\n\n---\n\n## Deploying LingBot-VA for Inference\nLingBot-VA supports both standalone execution and a server-client architecture that separates the model environment from the simulation environment. 
By isolating dependencies, the design avoids package clashes and supports distributed inference on GPUs, clusters, and other devices.\n\n\u003C!-- ### Standalone  Inference\n```python\npython inference.py\n```\nThis processes the example data from `examples\u002F0\u002F` and saves visualizations to `result\u002F`. -->\n\n### Evaluation on RoboTwin-2.0\n\n**Preparing the Environment**\n\nYou can follow the official instructions from the original RoboTwin-2.0 repository:  \n[https:\u002F\u002Frobotwin-platform.github.io\u002Fdoc\u002Fusage\u002Frobotwin-install.html](https:\u002F\u002Frobotwin-platform.github.io\u002Fdoc\u002Fusage\u002Frobotwin-install.html)\n\n\nIn summary:\n\n1. Install Vulkan dependencies:\n   ```bash\n   sudo apt install libvulkan1 mesa-vulkan-drivers vulkan-tools\n   ```\n\n2. Clone the RoboTwin repository:\n   ```bash\n   git clone https:\u002F\u002Fgithub.com\u002FRoboTwin-Platform\u002FRoboTwin.git && cd RoboTwin && git checkout 2eeec322\n   ```\n\n3. Modify `script\u002Frequirements.txt` with the following content:\n   ```txt\n   transforms3d==0.4.2\n   sapien==3.0.0b1\n   scipy==1.10.1\n   mplib==0.2.1\n   gymnasium==0.29.1\n   trimesh==4.4.3\n   open3d==0.18.0\n   imageio==2.34.2\n   pydantic\n   zarr\n   openai\n   huggingface_hub==0.36.2\n   h5py\n   # For Description Generation\n   azure==4.0.0\n   azure-ai-inference\n   pyglet\u003C2\n   wandb\n   moviepy\n   imageio\n   termcolor\n   av\n   matplotlib\n   ffmpeg\n   ```\n\n4. Modify line 8 of `script\u002F_install.sh`:\n   ```bash\n   pip install \"git+https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fpytorch3d.git@stable\" --no-build-isolation\n   ```\n\n5. Install dependencies:\n   ```bash\n   bash script\u002F_install.sh\n   ```\n\n6. 
Download assets:\n   ```bash\n   bash script\u002F_download_assets.sh\n   ```\n\n **Deploying the Inference Server**\n```bash\n# single GPU\nbash evaluation\u002Frobotwin\u002Flaunch_server.sh\n\n# multi-GPU\nbash evaluation\u002Frobotwin\u002Flaunch_server_multigpus.sh\n```\n\n **Executing the Inference Client**\n```bash\n# single GPU\ntask_name=\"adjust_bottle\";\nsave_root=\"results\u002F\";\nbash evaluation\u002Frobotwin\u002Flaunch_client.sh ${save_root} ${task_name}\n\n# multi-GPU\nsave_root=\"results\u002F\"\ntask_group_id=0;\nbash evaluation\u002Frobotwin\u002Flaunch_client_multigpus.sh ${save_root} ${task_group_id}\n```\n\nExperiment results are saved in `\u002Fpath\u002Fto\u002Fyour\u002FRoboTwin\u002F${save_root}`. An `eval_result` folder is also generated; it is native RoboTwin output with the same contents as the results folder and can be safely ignored.\nThe inference server and client must be deployed on the same machine. For the multi-GPU client, we padded the original 50 tasks to 56 via duplication and partitioned them into 7 groups of 8 to align with the 8-GPU configuration of our inference node. Set `task_group_id` (0-6) to select a particular group for inference. 
For detailed grouping configurations, please refer to `evaluation\u002Frobotwin\u002Flaunch_client_multigpus.sh`.\n\n> **GPU Memory Requirements**: Approximately **24GB VRAM** for single-GPU RoboTwin evaluation with offload mode enabled (VAE and text_encoder offloaded to CPU).\n\n\n### Evaluation on LIBERO\nFollow the official instructions to install LIBERO, then launch the server and client:\n\n\n```bash\n# server\nbash evaluation\u002Flibero\u002Flaunch_server.sh\n\n# client\nbash evaluation\u002Flibero\u002Flaunch_client.sh\n```\n\n### Run Image to Video-Action Generation\n\nWe also provide a script for image to video-action generation:\n\n```bash\nNGPU=1 CONFIG_NAME='robotwin_i2av' bash script\u002Frun_launch_va_server_sync.sh\n```\n\n> **GPU Memory Requirements**: Approximately **18GB VRAM** for single-GPU i2av inference with offload mode enabled (VAE and text_encoder offloaded to CPU).\n\n\n## Post-Training LingBot-VA\n\nWe support post-training (fine-tuning) LingBot-VA on custom robotic manipulation datasets. The training pipeline uses FSDP for distributed training and integrates with [LeRobot](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Flerobot) dataset format.\n\n### Additional Dependencies\n\nOn top of the base installation, post-training requires:\n\n```bash\npip install lerobot==0.3.3 scipy wandb --no-deps\n```\n\n### Data Preparation\n\nDownload the post-training dataset from HuggingFace:\n\n```bash\nhuggingface-cli download --repo-type dataset robbyant\u002Frobotwin-clean-and-aug-lerobot --local-dir \u002Fpath\u002Fto\u002Fyour\u002Fdataset\n```\n\n### Custom Dataset Preparation\n\nIf you want to fine-tune LingBot-VA on your own robotic manipulation data, follow these steps:\n\n#### Example Dataset\n\nWe provide a converted example dataset based on data from [Issue #29](https:\u002F\u002Fgithub.com\u002FRobbyant\u002Flingbot-va\u002Fissues\u002F29). This dataset has been converted into the expected format and is fully supported for training. 
You can download it to understand the required data structure:\n\n- **Download**: [Example Dataset](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1D52nK4ZOJmWBXKv1nWrLb9YBwq8nKa_b\u002Fview?usp=sharing)\n\nThis example can serve as a reference for converting your own robotic manipulation data into the proper format.\n\n#### Data Pipeline Overview\n\nWhen preparing your custom dataset, the data goes through the following processing pipeline:\n\n1. **Raw Data** → Convert to LeRobot format (with metadata and video files)\n2. **Add Action Segmentation** → Add `action_config` to `episodes.jsonl`\n3. **Extract Latents** → Process videos through VAE according to video specifications\n4. **Dataset Loading** → Load processed data with proper action dimensions for training\n\nThe final data should conform to these specifications:\n\n**Action Format:**\n- Output dimension: **30 dimensions**, structured as follows:\n  - Left arm EEF (end-effector): 7 dimensions\n  - Right arm EEF (end-effector): 7 dimensions\n  - Left arm joints: 7 dimensions\n  - Right arm joints: 7 dimensions\n  - Left arm gripper: 1 dimension\n  - Right arm gripper: 1 dimension\n- In your dataset class loader, map your robot's action dimensions to this standard 30-dimensional format. Missing dimensions are padded with **0**.\n\n**Video Format:**\n- During VAE latent extraction, resize videos to **~256 × 256 pixels** and downsample to **5-15 fps** as a reference (adjust based on your task requirements).\n\n#### Implementation Steps\n\n**Step 1: Convert your data to LeRobot format**\n\nFollow the official [LeRobot dataset documentation](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Flerobot\u002Ftree\u002Fv0.3.3) to convert your raw data (e.g., HDF5, video files, etc.) into the standard LeRobot dataset format. 
Ensure that each episode contains the required observation videos, actions, and metadata.\n\n**Step 2: Add `action_config` field to `episodes.jsonl`**\n\nAfter converting to LeRobot format, you need to modify the `meta\u002Fepisodes.jsonl` file to add an `action_config` field to each line. This field describes the temporal segmentation and natural language description of the robot's actions within each episode.\n\nEach line in `episodes.jsonl` should be a JSON object in the following format (pretty-printed here for readability; in the file, each object occupies a single line):\n\n```json\n{\n  \"episode_index\": 0,\n  \"tasks\": [\"task description\"],\n  \"length\": 450,\n  \"action_config\": [\n    {\n      \"start_frame\": 0,\n      \"end_frame\": 450,\n      \"action_text\": \"Natural language description of the robot action in this segment.\"\n    }\n  ]\n}\n```\n\n- `start_frame` \u002F `end_frame`: The frame range (0-indexed) of the action segment within the episode.\n- `action_text`: A natural language description of what the robot does in this segment.\n\nFor episodes with a single continuous action, `start_frame` should be `0` and `end_frame` should equal the episode `length`. You can also define multiple segments per episode if your data contains sequential sub-tasks.\n\n**Step 3: Extract video latents with Wan2.2 VAE**\n\nLingBot-VA operates on video latent representations rather than raw pixels. You need to extract the latent features using the Wan2.2 VAE encoder and place them under the converted LeRobot dataset directory. 
Please refer to the [Wan-Video documentation](https:\u002F\u002Fgithub.com\u002FWan-Video) for instructions on how to run the VAE encoder.\n\nThe extracted latent files should be placed under `latents\u002F` in your dataset directory, mirroring the structure of `videos\u002F`:\n\n```\nyour_dataset\u002F\n├── videos\u002F\n│   └── chunk-000\u002F\n│       └── observation.images.cam_high\u002F\n│           ├── episode_000000.mp4\n│           └── ...\n├── latents\u002F\n│   └── chunk-000\u002F\n│       └── observation.images.cam_high\u002F\n│           ├── episode_000000_0_450.pth    # named as episode_{index}_{start_frame}_{end_frame}.pth\n│           └── ...\n└── meta\u002F\n    └── episodes.jsonl\n```\n\nEach `.pth` file is a dictionary containing the following fields:\n\n| Key | Type | Description |\n| :--- | :--- | :--- |\n| `latent` | `Tensor [N, C]` (bfloat16) | Flattened VAE latent features (e.g., shape `[latent_num_frames * latent_height * latent_width, C]`) |\n| `latent_num_frames` | `int` | Number of temporal frames in the latent space |\n| `latent_height` | `int` | Spatial height in the latent space |\n| `latent_width` | `int` | Spatial width in the latent space |\n| `video_num_frames` | `int` | Number of frames in the (sampled) source video |\n| `video_height` | `int` | Original video height in pixels |\n| `video_width` | `int` | Original video width in pixels |\n| `text_emb` | `Tensor [L, D]` (bfloat16) | Text embedding of the action description (encoded by Wan2.2 text encoder) |\n| `text` | `str` | The raw action description text |\n| `frame_ids` | `list[int]` | Sampled frame indices from the original episode (at target fps) |\n| `start_frame` | `int` | Start frame index matching `action_config` in `episodes.jsonl` |\n| `end_frame` | `int` | End frame index matching `action_config` in `episodes.jsonl` |\n| `fps` | `int` | Target sampling fps used for latent extraction |\n| `ori_fps` | `int` | Original fps of the episode data |\n\nThe latent file naming 
convention `episode_{index}_{start_frame}_{end_frame}.pth` corresponds to the `action_config` segments defined in `episodes.jsonl`. For example, episode 0 with a segment `\"start_frame\": 0, \"end_frame\": 450` produces a latent file named `episode_000000_0_450.pth`.\n\n### Training\n\n```bash\n# RoboTwin\nNGPU=8 CONFIG_NAME='robotwin_train' bash script\u002Frun_va_posttrain.sh\n\n# LIBERO\nNGPU=8 CONFIG_NAME='libero_train' bash script\u002Frun_va_posttrain.sh\n```\n\nFor better training performance, use a larger global batch size (e.g., 32 or 64). If you have limited GPU resources, you can increase `gradient_accumulation_steps` to achieve a larger effective batch size.\n\n\n---\n\n# 📊 Performance\n\nWe evaluate our model on both simulation benchmarks and real-world scenarios, and achieve state-of-the-art performance.\n\n## Simulation Evaluation\n\n- **RoboTwin 2.0**\n\nWe are the first to push the average RoboTwin 2.0 success rate above 90%!\n\u003Ctable style=\"border-collapse: collapse; width: auto; font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, Arial, sans-serif; font-size: 13px; line-height: 1.2;\">\n\u003C!-- Metric notes -->\n  \u003Cp style=\"font-size: 12px; color: #666; margin-bottom: 5px;\">* All metrics are reported in percentage (%). 
Higher values are \u003Cb>bolded\u003C\u002Fb>.\u003C\u002Fp>\n  \u003Cthead>\n    \u003Ctr style=\"border-top: 2px solid black; border-bottom: 1px solid black;\">\n      \u003Cth align=\"left\" style=\"padding: 6px 12px; white-space: nowrap;\">Method (Average 50 Tasks)\u003C\u002Fth>\n      \u003Cth align=\"center\" style=\"padding: 6px 12px;\">Easy SR (%)\u003C\u002Fth>\n      \u003Cth align=\"center\" style=\"padding: 6px 12px;\">Hard SR (%)\u003C\u002Fth>\n    \u003C\u002Ftr>\n  \u003C\u002Fthead>\n  \u003Ctbody>\n    \u003Ctr>\n      \u003Ctd style=\"padding: 4px 12px; white-space: nowrap;\">X-VLA\u003C\u002Ftd>\n      \u003Ctd align=\"center\">72.9\u003C\u002Ftd>\n      \u003Ctd align=\"center\">72.8\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd style=\"padding: 4px 12px; white-space: nowrap;\">&pi;\u003Csub>0\u003C\u002Fsub>\u003C\u002Ftd>\n      \u003Ctd align=\"center\">65.9\u003C\u002Ftd>\n      \u003Ctd align=\"center\">58.4\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd style=\"padding: 4px 12px; white-space: nowrap;\">&pi;\u003Csub>0.5\u003C\u002Fsub>\u003C\u002Ftd>\n      \u003Ctd align=\"center\">82.7\u003C\u002Ftd>\n      \u003Ctd align=\"center\">76.8\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd style=\"padding: 4px 12px; white-space: nowrap;\">Motus\u003C\u002Ftd>\n      \u003Ctd align=\"center\">\u003Cu>88.7\u003C\u002Fu>\u003C\u002Ftd>\n      \u003Ctd align=\"center\">\u003Cu>87.0\u003C\u002Fu>\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr style=\"border-top: 1px solid black; border-bottom: 2px solid black;\">\n      \u003Ctd style=\"padding: 6px 12px; white-space: nowrap;\">\u003Cb>LingBot-VA (Ours)\u003C\u002Fb>\u003C\u002Ftd>\n      \u003Ctd align=\"center\">\u003Cb>92.9\u003C\u002Fb> \u003Csmall>(+4.2)\u003C\u002Fsmall>\u003C\u002Ftd>\n      \u003Ctd align=\"center\">\u003Cb>91.6\u003C\u002Fb> \u003Csmall>(+4.6)\u003C\u002Fsmall>\u003C\u002Ftd>\n    \u003C\u002Ftr>\n  
\u003C\u002Ftbody>\n\u003C\u002Ftable>\n\n\n- **LIBERO**\n\n\u003Ctable style=\"border-collapse: collapse; width: auto; font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, Arial, sans-serif; font-size: 13px; line-height: 1.2;\">\n\u003C!-- 指标说明 -->\n  \u003Cp style=\"font-size: 12px; color: #666; margin-bottom: 5px;\">* All metrics are reported in percentage (%). Higher values are \u003Cb>bolded\u003C\u002Fb>.\u003C\u002Fp>\n  \u003Cthead>\n    \u003Ctr style=\"border-top: 2px solid black; border-bottom: 1px solid black;\">\n      \u003Cth align=\"left\" style=\"padding: 6px 10px; border-right: 1px solid black; white-space: nowrap;\">Methods\u003C\u002Fth>\n      \u003Cth align=\"center\" style=\"padding: 6px 8px;\">Spatial\u003C\u002Fth>\n      \u003Cth align=\"center\" style=\"padding: 6px 8px;\">Object\u003C\u002Fth>\n      \u003Cth align=\"center\" style=\"padding: 6px 8px;\">Goal\u003C\u002Fth>\n      \u003Cth align=\"center\" style=\"padding: 6px 8px;\">Long\u003C\u002Fth>\n      \u003Cth align=\"center\" style=\"padding: 6px 8px;\">Avg\u003C\u002Fth>\n    \u003C\u002Ftr>\n  \u003C\u002Fthead>\n  \u003Ctbody>\n    \u003Ctr>\n      \u003Ctd style=\"padding: 4px 10px; border-right: 1px solid black; white-space: nowrap;\">&pi;\u003Csub>0\u003C\u002Fsub>\u003C\u002Ftd>\n      \u003Ctd align=\"center\">96.8\u003C\u002Ftd>\u003Ctd align=\"center\">98.8\u003C\u002Ftd>\u003Ctd align=\"center\">95.8\u003C\u002Ftd>\u003Ctd align=\"center\">85.2\u003C\u002Ftd>\u003Ctd align=\"center\">94.1\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd style=\"padding: 4px 10px; border-right: 1px solid black; white-space: nowrap;\">&pi;\u003Csub>0.5\u003C\u002Fsub>\u003C\u002Ftd>\n      \u003Ctd align=\"center\">98.8\u003C\u002Ftd>\u003Ctd align=\"center\">98.2\u003C\u002Ftd>\u003Ctd align=\"center\">98.0\u003C\u002Ftd>\u003Ctd align=\"center\">92.4\u003C\u002Ftd>\u003Ctd align=\"center\">96.9\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      
\u003Ctd style=\"padding: 4px 10px; border-right: 1px solid black; white-space: nowrap;\">OpenVLA\u003C\u002Ftd>\n      \u003Ctd align=\"center\">84.7\u003C\u002Ftd>\u003Ctd align=\"center\">88.4\u003C\u002Ftd>\u003Ctd align=\"center\">79.2\u003C\u002Ftd>\u003Ctd align=\"center\">53.7\u003C\u002Ftd>\u003Ctd align=\"center\">76.5\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd style=\"padding: 4px 10px; border-right: 1px solid black; white-space: nowrap;\">X-VLA\u003C\u002Ftd>\n      \u003Ctd align=\"center\">98.2\u003C\u002Ftd>\u003Ctd align=\"center\">98.6\u003C\u002Ftd>\u003Ctd align=\"center\">97.8\u003C\u002Ftd>\u003Ctd align=\"center\">97.6\u003C\u002Ftd>\u003Ctd align=\"center\">98.1\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr style=\"border-top: 1.5px solid black; border-bottom: 2px solid black;\">\n      \u003Ctd style=\"padding: 5px 10px; border-right: 1px solid black; white-space: nowrap;\">\u003Cb>LingBot-VA (Ours)\u003C\u002Fb>\u003C\u002Ftd>\n      \u003Ctd align=\"center\">\u003Cb>98.5 &plusmn; 0.3\u003C\u002Fb>\u003C\u002Ftd>\n      \u003Ctd align=\"center\">\u003Cb>99.6 &plusmn; 0.3\u003C\u002Fb>\u003C\u002Ftd>\n      \u003Ctd align=\"center\">\u003Cb>97.2 &plusmn; 0.2\u003C\u002Fb>\u003C\u002Ftd>\n      \u003Ctd align=\"center\">\u003Cb>98.5 &plusmn; 0.5\u003C\u002Fb>\u003C\u002Ftd>\n      \u003Ctd align=\"center\">\u003Cb>98.5\u003C\u002Fb>\u003C\u002Ftd>\n    \u003C\u002Ftr>\n  \u003C\u002Ftbody>\n\u003C\u002Ftable>\n\n\n\n&nbsp;\n\n## Real-world Deployment\n\nWe evaluate six manipulation tasks across three categories: long-horizon tasks (Make Breakfast, Pick Screws), precision tasks (Insert Tube, Unpack Delivery), and deformable & articulated object manipulation (Fold Clothes, Fold Pants). 
Our method achieves state-of-the-art performance on both metrics (Progress Score and Success Rate) with \u003Cb>only 50 trials\u003C\u002Fb> per task, outperforming the strong baseline &pi;\u003Csub>0.5\u003C\u002Fsub> on 11 of the 12 task-metric pairs.\n\n\u003Cdiv style=\"text-align: left; font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, Arial, sans-serif; line-height: 1.6;\">\n\n  \u003C!-- Part 1: PS description -->\n  \u003Cdiv style=\"margin-bottom: 5px;\">\u003Cstrong>Progress Score (PS):\u003C\u002Fstrong> The average score across all trials divided by the maximum possible score, expressed as a percentage:\u003C\u002Fdiv>\n\n  PS = Average_Progress \u002F Max_Steps &times; 100%\n\n  \u003C!-- Part 2: SR description -->\n  \u003Cdiv style=\"margin-bottom: 5px;\">\u003Cstrong>Success Rate (SR):\u003C\u002Fstrong> The number of successful trials divided by the total number of trials, expressed as a percentage:\u003C\u002Fdiv>\n\n  SR = Successful_Trials \u002F N &times; 100%\n\n\u003C\u002Fdiv>\n\n\n\n\u003Cdiv style=\"font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, Arial, sans-serif;\">\n  \u003C!-- Metric notes -->\n  \u003Cp style=\"font-size: 12px; color: #666; margin-bottom: 5px;\">* All metrics are reported in percentage (%). 
Higher values are \u003Cb>bolded\u003C\u002Fb>.\u003C\u002Fp>\n  \n  \u003Ctable style=\"border-collapse: collapse; width: auto; font-size: 13px; line-height: 1.2;\">\n    \u003Cthead>\n      \u003Ctr style=\"border-top: 2px solid black;\">\n        \u003Cth rowspan=\"2\" align=\"left\" style=\"padding: 4px 10px; border-bottom: 1px solid black; white-space: nowrap;\">\u003Cb>Task\u003C\u002Fb>\u003C\u002Fth>\n        \u003Cth colspan=\"2\" style=\"padding: 4px 10px; border-bottom: 1px solid black;\">Make Breakfast\u003C\u002Fth>\n        \u003Cth colspan=\"2\" style=\"padding: 4px 10px; border-bottom: 1px solid black;\">Pick Screws\u003C\u002Fth>\n        \u003Cth colspan=\"2\" style=\"padding: 4px 10px; border-bottom: 1px solid black;\">Insert Tube\u003C\u002Fth>\n        \u003Cth colspan=\"2\" style=\"padding: 4px 10px; border-bottom: 1px solid black;\">Unpack Delivery\u003C\u002Fth>\n        \u003Cth colspan=\"2\" style=\"padding: 4px 10px; border-bottom: 1px solid black;\">Fold Clothes\u003C\u002Fth>\n        \u003Cth colspan=\"2\" style=\"padding: 4px 10px; border-bottom: 1px solid black;\">Fold Pants\u003C\u002Fth>\n      \u003C\u002Ftr>\n      \u003Ctr style=\"border-bottom: 1px solid black;\">\n        \u003Cth style=\"padding: 4px 8px;\">PS\u003C\u002Fth>\n        \u003Cth style=\"padding: 4px 8px;\">SR\u003C\u002Fth>\n        \u003Cth style=\"padding: 4px 8px;\">PS\u003C\u002Fth>\n        \u003Cth style=\"padding: 4px 8px;\">SR\u003C\u002Fth>\n        \u003Cth style=\"padding: 4px 8px;\">PS\u003C\u002Fth>\n        \u003Cth style=\"padding: 4px 8px;\">SR\u003C\u002Fth>\n        \u003Cth style=\"padding: 4px 8px;\">PS\u003C\u002Fth>\n        \u003Cth style=\"padding: 4px 8px;\">SR\u003C\u002Fth>\n        \u003Cth style=\"padding: 4px 8px;\">PS\u003C\u002Fth>\n        \u003Cth style=\"padding: 4px 8px;\">SR\u003C\u002Fth>\n        \u003Cth style=\"padding: 4px 8px;\">PS\u003C\u002Fth>\n        \u003Cth style=\"padding: 4px 8px;\">SR\u003C\u002Fth>\n      
\u003C\u002Ftr>\n    \u003C\u002Fthead>\n    \u003Ctbody>\n      \u003Ctr>\n        \u003Ctd style=\"padding: 6px 10px; white-space: nowrap;\">&pi;\u003Csub>0.5\u003C\u002Fsub>\u003C\u002Ftd>\n        \u003Ctd align=\"center\">73.0\u003C\u002Ftd>\u003Ctd align=\"center\">70.0\u003C\u002Ftd>\n        \u003Ctd align=\"center\">74.0\u003C\u002Ftd>\u003Ctd align=\"center\">50.0\u003C\u002Ftd>\n        \u003Ctd align=\"center\">79.2\u003C\u002Ftd>\u003Ctd align=\"center\">30.0\u003C\u002Ftd>\n        \u003Ctd align=\"center\">73.0\u003C\u002Ftd>\u003Ctd align=\"center\">25.0\u003C\u002Ftd>\n        \u003Ctd align=\"center\">\u003Cb>62.9\u003C\u002Fb>\u003C\u002Ftd>\u003Ctd align=\"center\">30.0\u003C\u002Ftd>\n        \u003Ctd align=\"center\">30.0\u003C\u002Ftd>\u003Ctd align=\"center\">30.0\u003C\u002Ftd>\n      \u003C\u002Ftr>\n      \u003Ctr style=\"border-bottom: 2px solid black;\">\n        \u003Ctd style=\"padding: 6px 10px; white-space: nowrap;\">\u003Cb>LingBot-VA (Ours)\u003C\u002Fb>\u003C\u002Ftd>\n        \u003Ctd align=\"center\">\u003Cb>97.0\u003C\u002Fb>\u003C\u002Ftd>\u003Ctd align=\"center\">\u003Cb>75.0\u003C\u002Fb>\u003C\u002Ftd>\n        \u003Ctd align=\"center\">\u003Cb>82.5\u003C\u002Fb>\u003C\u002Ftd>\u003Ctd align=\"center\">\u003Cb>70.0\u003C\u002Fb>\u003C\u002Ftd>\n        \u003Ctd align=\"center\">\u003Cb>85.8\u003C\u002Fb>\u003C\u002Ftd>\u003Ctd align=\"center\">\u003Cb>40.0\u003C\u002Fb>\u003C\u002Ftd>\n        \u003Ctd align=\"center\">\u003Cb>84.5\u003C\u002Fb>\u003C\u002Ftd>\u003Ctd align=\"center\">\u003Cb>65.0\u003C\u002Fb>\u003C\u002Ftd>\n        \u003Ctd align=\"center\">48.8\u003C\u002Ftd>\u003Ctd align=\"center\">\u003Cb>35.0\u003C\u002Fb>\u003C\u002Ftd>\n        \u003Ctd align=\"center\">\u003Cb>76.7\u003C\u002Fb>\u003C\u002Ftd>\u003Ctd align=\"center\">\u003Cb>70.0\u003C\u002Fb>\u003C\u002Ftd>\n      \u003C\u002Ftr>\n    \u003C\u002Ftbody>\n  \u003C\u002Ftable>\n\u003C\u002Fdiv>\n\n\n# 🪪 License\n\nThis project is released under 
the Apache License 2.0. See the [LICENSE](LICENSE.txt) file for details.\n\n# 📚Citation\n\n```bibtex\n@article{lingbot-va2026,\n  title={Causal World Modeling for Robot Control},\n  author={Li, Lin and Zhang, Qihang and Luo, Yiming and Yang, Shuai and Wang, Ruilin and Han, Fei and Yu, Mingrui and Gao, Zelin and Xue, Nan and Zhu, Xing and Shen, Yujun and Xu, Yinghao},\n  journal={arXiv preprint arXiv:2601.21998},\n  year={2026}\n}\n```\n\n# 🧩 Acknowledgments\n\nThis work builds upon several excellent open-source projects:\n\n- [Wan-Video](https:\u002F\u002Fgithub.com\u002FWan-Video) - Vision transformer backbone\n- [MoT](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FMixture-of-Transformers) - Mixture-of-Transformers architecture\n- The broader open-source computer vision and robotics communities\n\n---\n\nFor questions, discussions, or collaborations:\n\n- **Issues**: Open an [issue](https:\u002F\u002Fgithub.com\u002Frobbyant\u002Flingbot-va\u002Fissues) on GitHub\n- **Email**: Contact Dr. [Qihang Zhang](https:\u002F\u002Fzqh0253.github.io\u002F) (liuhuan.zqh@antgroup.com) or Dr. [Lin Li](https:\u002F\u002Flilin-hitcrt.github.io\u002F) (fengchang.ll@antgroup.com) \n","LingBot-VA is a causal video-action world model for generalist robot control. The project unifies visual dynamics prediction and action inference within an autoregressive framework while keeping the two conceptually distinct. Its core technical features include an efficient dual-stream mixture-of-transformers (MoT) architecture with asynchronous execution and KV caching, which improves sample efficiency, long-horizon task success rates, and generalization to novel scenes. It suits robot applications that require high precision and long-horizon operation, such as automated production lines and service robots.",2,"2026-06-11 03:49:47","high_star"]