[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-1368":3},{"id":4,"name":5,"fullName":6,"owner":5,"repo":5,"description":7,"homepage":8,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":13,"contributorsCount":14,"subscribersCount":14,"size":14,"stars1d":13,"stars7d":15,"stars30d":16,"stars90d":14,"forks30d":14,"starsTrendScore":17,"compositeScore":18,"rankGlobal":9,"rankLanguage":9,"license":19,"archived":20,"fork":20,"defaultBranch":21,"hasWiki":20,"hasPages":20,"topics":22,"createdAt":9,"pushedAt":9,"updatedAt":28,"readmeContent":29,"aiSummary":30,"trendingCount":14,"starSnapshotCount":14,"syncStatus":13,"lastSyncTime":31,"discoverSource":32},1368,"mimic-video","mimic-video\u002Fmimic-video","Video-Action Models for Generalizable Robot Control Beyond VLAs","https:\u002F\u002Fmimic-video.github.io\u002F",null,"Python",266,25,2,0,5,31,6,53.84,"Apache License 2.0",false,"main",[23,24,25,26,27],"robot-learning","robotics","vam","video-action-models","vla","2026-06-12 04:00:09","# mimic-video: Video-Action Models for Generalizable Robot Control Beyond VLAs\n\n\u003Cp align=\"center\">\n    \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.15692\">\u003Cstrong>Paper\u003C\u002Fstrong>\u003C\u002Fa>\n    &nbsp;·&nbsp;\n    \u003Ca href=\"https:\u002F\u002Fmimic-video.github.io\">\u003Cstrong>Website\u003C\u002Fstrong>\u003C\u002Fa>\n    &nbsp;·&nbsp;\n    \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fjonpai\u002Fmimic-video\">\u003Cstrong>Checkpoints\u003C\u002Fstrong>\u003C\u002Fa>\n\u003C\u002Fp>\n\n## Introduction\n\n\u003Cp align=\"center\">\n    \u003Cimg src=\"assets\u002Ffigure_one.png\" alt=\"Figure One\">\n\u003C\u002Fp>\n\nmimic-video extracts generalist language-conditioned robot policies (Video-Action Models \u002F VAMs) from pretrained video models by conditioning small action decoders on the video backbones' latent representations. 
By drawing on the video model's knowledge of real-world dynamics and behaviors, performant action decoders can be learned efficiently and without updating the video model. With decoupled flow times for video and for actions, inference can run efficiently with a single video model forward pass per action chunk.\n\nWe instantiate our approach with the lightweight 2B Cosmos-Predict2 video model and release trained checkpoints for Bridge and LIBERO.\n\n## Repository Overview\n\nWe provide our data ([DATA.md](DATA.md)), modeling ([MODEL.md](MODEL.md)), and evaluation ([EVAL.md](EVAL.md)) code. See the respective Markdown files for details.\n\n```\nmimic-video\n├── data_preprocessing  # data preprocessing\n├── eval                # evaluation\n└── model               # dataloading, model architecture, training, and inference\n```\n\n## Environment Setup and Downloading Checkpoints\n\n1. Create uv environment.\n```bash\ncurl -LsSf https:\u002F\u002Fastral.sh\u002Fuv\u002Finstall.sh | sh\ncd model\nuv sync --extra cu126\nsource .venv\u002Fbin\u002Factivate\n```\n\n2. (Optional) Download trained bridge or libero checkpoints.\n```bash\nhf auth login\n\npython scripts\u002Fdownload_checkpoints.py\n```\nIf you want to run guardrails (default enabled), additionally request access to and download `nvidia\u002FCosmos-Guardrail1` and `meta-llama\u002FLlama-Guard-3-8B` to the checkpoints directory.\n\n## Training\n\nYou can find an overview of the training repository (which is built on the [Cosmos-Predict2](https:\u002F\u002Fgithub.com\u002Fnvidia-cosmos\u002Fcosmos-predict2) repo) in [MODEL.md](.\u002FMODEL.md). A quickstart for training your own models is given below.\n\nMulti-node and multi-GPU configuration is handled through [torchrun](https:\u002F\u002Fdocs.pytorch.org\u002Fdocs\u002Fstable\u002Felastic\u002Frun.html).\n\n### Video Model Finetuning\n\nThis assumes you have downloaded at least the text encoder, video tokenizer, and `v2w_pretrained_cosmos`.\n\n1. 
Extract videos and language instructions.\n   1. Choose a `\u002Fpath\u002Fto\u002Fdataset\u002F`.\n   2. Populate `\u002Fpath\u002Fto\u002Fdataset\u002Fvideo\u002F` with `ep.mp4` and `\u002Fpath\u002Fto\u002Fdataset\u002Fmetas\u002F` with `ep.txt` files. Example scripts for bridge and libero are provided in [data_preprocessing\u002Fvideo](.\u002Fdata_preprocessing\u002Fvideo\u002F).\n2. Precompute language embeddings in `\u002Fpath\u002Fto\u002Fdataset\u002Ft5_xxl\u002F`.\n```bash\ncd data_preprocessing\u002Fvideo\u002F\npython get_t5_embeddings.py --dataset_path \u002Fpath\u002Fto\u002Fdataset\u002F\n```\n3. Create video finetuning config.\n   1. Add your dataset to `train_datasets` in [data_video.py](.\u002Fmodel\u002Fcosmos_predict2\u002Fconfigs\u002Fdefaults\u002Fdata_video.py) (line 24).\n   2. Add your experiment hyperparameters to [video2world.py](.\u002Fmodel\u002Fcosmos_predict2\u002Fconfigs\u002Fexperiment\u002Fvideo2world.py).\n4. Start training with [torchrun](https:\u002F\u002Fdocs.pytorch.org\u002Fdocs\u002Fstable\u002Felastic\u002Frun.html). The experiment name is defined in [video2world.py](.\u002Fmodel\u002Fcosmos_predict2\u002Fconfigs\u002Fexperiment\u002Fvideo2world.py) from the step before.\n```bash\ntorchrun -m scripts.train --config=cosmos_predict2\u002Fconfigs\u002Fconfig.py -- experiment=...\n```\n\n### Action Decoder Pretraining\n\nThis assumes you have downloaded the text encoder, video tokenizer, and the video backbone you would like to train an action decoder for.\n\n#### Bridge\n\n1. Download raw data and unzip.\n```bash\naria2c -x 16 -s 16 -c \"https:\u002F\u002Frail.eecs.berkeley.edu\u002Fdatasets\u002Fbridge_release\u002Fdata\u002Fdemos_8_17.zip\"\n7z x demos_8_17.zip -obridge\u002F\n# todo: maybe untar that one file? still don't know what it is. have to look inside.\n```\n2. 
Convert to zarr.\n```bash\ncd data_preprocessing\u002Faction\u002F\npython process_bridge.py --raw-dir ..\u002F..\u002Fbridge\u002Fraw --output-dir \u002Fpath\u002Fto\u002Fdata\u002Fbridge\u002F\n```\n3. Precompute language embeddings.\n```bash\npython precompute_t5.py --dataset-path \u002Fpath\u002Fto\u002Fdata\u002Fbridge\u002F\n```\n4. Create training config.\n   1. Adapt dataset.data_dir in [bridge.yaml](.\u002Fmodel\u002Fcosmos_predict2\u002Fconfigs\u002Fdataloading\u002Fdataset\u002Fbridge.yaml) to point to the directory containing the data you want to train on. See [DATA.md](.\u002FDATA.md) for details on the data config structure.\n   2. Choose training hyperparameters (cross-attention layer, learning rate, batch size) and the video model checkpoint in [experiment\u002Fworld2action.py](.\u002Fmodel\u002Fcosmos_predict2\u002Fconfigs\u002Fexperiment\u002Fworld2action.py). To use the same hyperparameters as the pretrained checkpoints you can select the correct configuration via the experiment name without changing code.\n5. Start training with [torchrun](https:\u002F\u002Fdocs.pytorch.org\u002Fdocs\u002Fstable\u002Felastic\u002Frun.html). The experiment name is defined in [world2action.py](.\u002Fmodel\u002Fcosmos_predict2\u002Fconfigs\u002Fexperiment\u002Fworld2action.py) from the step before.\n```bash\ncd ..\u002F..\u002Fmodel\ntorchrun -m scripts.train --config=cosmos_predict2\u002Fconfigs\u002Fconfig.py -- experiment=...\n```\n\n#### LIBERO\n\n1. Follow the [LIBERO dependency installation](#install-dependencies-1) steps.\n2. Download the official datasets.\n```bash\ncd LIBERO\npython benchmark_scripts\u002Fdownload_libero_datasets.py --use-huggingface\n```\n3. 
Regenerate h5 recordings (filter success, filter no-op, rotate image, re-render at higher resolution).\n```bash\ncd ..\u002F..\u002F..\u002Fdata_preprocessing\u002Faction\u002F\nPYTHONPATH=..\u002F..\u002Feval\u002Flibero\u002FLIBERO\u002F python regenerate_libero.py --in-dir \u002Fpath\u002Fto\u002Flibero\u002Fdatasets\u002F --out-dir \u002Fpath\u002Fto\u002Flibero\u002Fregenerated_datasets\u002F\n```\n4. Convert to zarr.\n```bash\npython process_libero.py --input-dir \u002Fpath\u002Fto\u002Flibero\u002Fregenerated_datasets\u002F --output-dir \u002Fpath\u002Fto\u002Fdata\u002F\n```\n5. Precompute language embeddings.\n```bash\npython precompute_t5.py --dataset-path \u002Fpath\u002Fto\u002Fdata\u002Flibero_*\n```\n6. Create training config.\n   1. Adapt dataset.data_dir in [libero.yaml](.\u002Fmodel\u002Fcosmos_predict2\u002Fconfigs\u002Fdataloading\u002Fdataset\u002Flibero.yaml) to point to the directory containing the data you want to train on. See [DATA.md](.\u002FDATA.md) for details on the data config structure. If you want to train on different LIBERO subsets, you might want to set it from the top level of the [data config](.\u002Fmodel\u002Fcosmos_predict2\u002Fconfigs\u002Fdataloading\u002Flibero.yaml) (and have several of those).\n   2. Choose training hyperparameters (cross-attention layer, learning rate, batch size) and the video model checkpoint in [experiment\u002Fworld2action.py](.\u002Fmodel\u002Fcosmos_predict2\u002Fconfigs\u002Fexperiment\u002Fworld2action.py). To use the same hyperparameters as the pretrained checkpoints you can select the correct configuration via the experiment name without changing code.\n7. Start training with [torchrun](https:\u002F\u002Fdocs.pytorch.org\u002Fdocs\u002Fstable\u002Felastic\u002Frun.html). 
The experiment name is defined in [world2action.py](.\u002Fmodel\u002Fcosmos_predict2\u002Fconfigs\u002Fexperiment\u002Fworld2action.py) from the step before.\n```bash\ncd ..\u002F..\u002Fmodel\ntorchrun -m scripts.train --config=cosmos_predict2\u002Fconfigs\u002Fconfig.py -- experiment=...\n```\n\n## Evaluation\n\nWe have integrated [vanilla SIMPLER-Bridge](.\u002Feval\u002Fbridge\u002FSimplerEnv\u002Fsimpler_env\u002Fmain_inference.py), [human-in-the-loop SIMPLER-Bridge](.\u002Feval\u002Fbridge\u002FSimplerEnv\u002Fsimpler_env\u002Fmain_inference_hil.py) (for ground-truth future video generation), and [vanilla LIBERO](.\u002Feval\u002Flibero\u002Frun.py) evals in this repo. To reproduce the sim results with our checkpoints, follow these quick steps:\n\n### SIMPLER Bridge\n\n#### Install dependencies\n\n```bash\nsudo apt install libvulkan1\n\ncd eval\u002Fbridge\n\nuv pip install -r SimplerEnv\u002Frequirements.txt\nuv pip install -e SimplerEnv\u002FManiSkill2_real2sim\nuv pip install -e SimplerEnv\n```\n\n#### Run evaluation\n\nThis assumes you have the checkpoints from [Environment Setup and Downloading Checkpoints](#environment-setup-and-downloading-checkpoints).\n\n##### Normal policy eval\n\n```bash\n# Adapt the `GPUS` list (line 1) to which GPUs to parallelize over (now: 0-7).\n# Adapt line 8 for how many evals can run in parallel per GPU (now: 2).\n# Fill in `checkpoint_dir` with the path to the checkpoint directory (line 33).\nbash eval.sh\n```\n\n##### Human-in-the-loop evaluation (oracle study)\n\nFor this one, you have to sit down and teleop. 
The policy will get the ground-truth future video from the teleop, add noise, and then decode actions.\n\n```bash\n# Fill in `checkpoint_dir` with the path to the checkpoint directory (line 1).\nbash eval_hil.sh\n```\n\n### LIBERO\n\n#### Install dependencies\n\n```bash\ncd eval\u002Flibero\n\nuv pip install -r LIBERO\u002Frequirements.txt\nuv pip install -e LIBERO\n```\n\n#### Run evaluation\n\nThis also assumes you have the checkpoints from [Environment Setup and Downloading Checkpoints](#environment-setup-and-downloading-checkpoints).\n\n```bash\n# Adapt the `GPUS` list (line 1) to which GPUs to parallelize over (now: 0-7).\n# Adapt line 8 for how many evals can run in parallel per GPU (now: 2).\n# Fill in `checkpoint_dir` with the path to the checkpoint directory (line 29).\nbash eval.sh\n```\n\n# License\n\n```\nCopyright 2026 mimic-video authors and mimic robotics AG\n\nLicensed under the Apache License, Version 2.0 (the \"License\");\nyou may not use this repository except in compliance with the License.\nYou may obtain a copy of the License at\n\n      http:\u002F\u002Fwww.apache.org\u002Flicenses\u002FLICENSE-2.0\n\nUnless required by applicable law or agreed to in writing, software\ndistributed under the License is distributed on an \"AS IS\" BASIS,\nWITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\nSee the License for the specific language governing permissions and\nlimitations under the License.\n```\n\n# BibTeX\n\n```bibtex\n@misc{pai2025mimicvideo,\n      title={mimic-video: Video-Action Models for Generalizable Robot Control Beyond VLAs}, \n      author={Jonas Pai and Liam Achenbach and Victoriano Montesinos and Benedek Forrai and Oier Mees and Elvis Nava},\n      year={2025},\n      eprint={2512.15692},\n      archivePrefix={arXiv},\n      primaryClass={cs.RO},\n      url={https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.15692}, \n}\n```\n","The mimic-video project extracts generalist language-conditioned robot policies (Video-Action Models \u002F 
VAMs) from pretrained video models to achieve robot control beyond conventional vision-language-action models (VLAs). The project draws on the video model's understanding of real-world dynamics and behaviors to learn performant action decoders efficiently, without updating the video model. With decoupled flow times for video and actions, efficient inference requires only a single video model forward pass. Built on the lightweight 2B Cosmos-Predict2 video model, mimic-video provides trained checkpoints for the Bridge and LIBERO datasets. It suits applications that need to turn video understanding into robot manipulation commands flexibly and efficiently, such as automated warehousing and service robots.","2026-06-11 02:43:19","CREATED_QUERY"]