[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-2214":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":8,"htmlUrl":8,"language":9,"languages":8,"totalLinesOfCode":8,"stars":10,"forks":11,"watchers":12,"openIssues":13,"contributorsCount":14,"subscribersCount":14,"size":14,"stars1d":12,"stars7d":13,"stars30d":15,"stars90d":14,"forks30d":14,"starsTrendScore":13,"compositeScore":16,"rankGlobal":8,"rankLanguage":8,"license":17,"archived":18,"fork":18,"defaultBranch":19,"hasWiki":18,"hasPages":20,"topics":21,"createdAt":8,"pushedAt":8,"updatedAt":22,"readmeContent":23,"aiSummary":24,"trendingCount":14,"starSnapshotCount":14,"syncStatus":25,"lastSyncTime":26,"discoverSource":27},2214,"LARYBench","meituan-longcat\u002FLARYBench","meituan-longcat",null,"Python",149,8,1,3,0,23,45.66,"MIT License",false,"main",true,[],"2026-06-12 04:00:13","# LARY — A Latent Action Representation Yielding Benchmark for Generalizable Vision-to-Action Alignment\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002Flary.jpg\" alt=\"LARYBench\" width=\"100%\">\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Fmeituan-longcat.github.io\u002FLARYBench\u002F\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject-Page-blue?style=flat-square&logo=github\" alt=\"Project Page\">\u003C\u002Fa>\n  &nbsp;\n  \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2604.11689\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-Paper-red?style=flat-square&logo=arxiv\" alt=\"arXiv\">\u003C\u002Fa>\n  &nbsp;\n  \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fmeituan-longcat\u002FLARYBench\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🤗-HuggingFace-yellow?style=flat-square\" alt=\"HuggingFace\">\u003C\u002Fa>\n  &nbsp;\n  \u003Ca href=\"https:\u002F\u002Fmodelscope.cn\u002Fdatasets\u002Fmeituan-longcat\u002FLARYBench\">\u003Cimg 
src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FModelScope-ModelHub-blue\" alt=\"ModelScope\">\u003C\u002Fa>\n  &nbsp;\n  \u003Ca href=\"https:\u002F\u002Fdiscord.gg\u002FEXsG52D8SW\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDiscord-Join%20Chat-5865F2?style=flat-square&logo=discord&logoColor=white\">\u003C\u002Fa>\n  &nbsp;\n  \u003Ca href=\"https:\u002F\u002Fopensource.org\u002Flicenses\u002FMIT\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-MIT-green?style=flat-square\" alt=\"MIT License\">\u003C\u002Fa>\n\u003C\u002Fp>\n\n**LARY** is a unified evaluation framework for **latent action representations**.\nGiven any model that produces latent action representations (LAMs or visual encoders), LARY provides three complementary evaluation pipelines:\n\n| Pipeline | Task |\n|---|---|\n| **`get_latent_action`** | Extract latent action representations from videos or image pairs |\n| **`classification`** | Probe how well latent actions capture *action semantics* (action-type recognition) |\n| **`regression`** | Probe how well latent actions can *decode physical robot actions* (action regression) |\n\n---\n\n## News\n- **[2026-05-01]** LARYBench now supports SigLIP2, relative-action regression evaluation (`target = action_tgt - action_src`), and a fast dataset integrity checker. Happy Labor Day!\n- **[2026-04-27]** We have open-sourced all datasets on [HuggingFace](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fmeituan-longcat\u002FLARYBench).\n- **[2026-04-21]** We release the general LAMs trained in ablation studies, [LAPA-DINOv3](https:\u002F\u002Fhuggingface.co\u002FAGI-Eval\u002FLAPA-DINOv3) and  [LAPA-DINOv2](https:\u002F\u002Fhuggingface.co\u002FAGI-Eval\u002FLAPA-DINOv2). Even though these models are still rough experimental prototypes, with clear flaws in both training data and methods, we’re sharing them anyway to help push latent action research forward together. 
Have fun~\n- **[2026-04-15]** We release partial training datasets due to the license limitation.\n- **[2026-04-13]** We release the code, text annotations, and partial validation datasets. Training datasets are coming soon.\n\n## Release Checklist\n\n- [x] Code\n- [x] Text annotations [link](https:\u002F\u002Fgithub.com\u002Fmeituan-longcat\u002FLARYBench\u002Ftree\u002Fmain\u002Fdata)\n- [x] Partial Validation datasets \n- [x] Partial Training datasets\n- [x] Full datasets\n\n---\n\n## Table of Contents\n\n1. [Overview](#overview)\n2. [Contributions](#contributions)\n3. [Environment Setup](#environment-setup)\n4. [Data Preparation](#data-preparation)\n5. [Quick Start](#quick-start)\n6. [Relative-Action Regression](#relative-action-regression)\n7. [Supported Models](#supported-models)\n8. [Adding a Custom Model](#adding-a-custom-model)\n9. [Supported Datasets](#supported-datasets)\n10. [Evaluation Outputs](#evaluation-outputs)\n\n---\n\n## Overview\n\nWhile the shortage of explicit action data limits Vision-Language-Action (VLA) models, human action videos offer a scalable yet unlabeled data source. A critical challenge in utilizing large-scale human video datasets lies in transforming visual signals into ontology-independent representations, known as latent actions. However, the capacity of latent action representation to derive robust control from visual observations has yet to be rigorously evaluated.\n\nWe introduce the Latent Action Representation Yielding (LARY) Benchmark, a unified framework for evaluating latent action representations on both high-level semantic actions (*what to do*) and low-level robotic control (*how to do*). The comprehensively curated dataset encompasses over one million videos (1,000 hours) spanning 151 action categories, alongside 620K image pairs and 595K motion trajectories across diverse embodiments and environments. 
Our experiments reveal two crucial insights: (i) General visual foundation models, trained without any action supervision, consistently outperform specialized embodied LAMs. (ii) Latent-based visual space is fundamentally better aligned to physical action space than pixel-based space. These results suggest that general visual representations inherently encode action-relevant knowledge for physical control, and that semantic-level abstraction serves as a fundamentally more effective pathway from vision to action than pixel-level reconstruction.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002Fframework.png\" alt=\"LARYBench Overview\" width=\"100%\">\n\u003C\u002Fp>\n\n## Contributions\n\n- **LARYBench**: We introduce LARYBench, a comprehensive benchmark that first decouples the evaluation of latent action representations from downstream policy performance. LARYBench probes representations along two complementary dimensions — high-level semantic action (*what to do*) encoding and the low-level physical dynamics required for robotic control (*how to do it*) — enabling direct, standardized measurement of representation quality itself.\n\n- **Large-Scale Data Engine**: To support rigorous evaluation, we develop an automated data engine to re-segment and re-annotate a large-scale corpus, yielding 1.2M videos, 620K image pairs, and 595K trajectories across 151 action categories and 11 robotic embodiments, covering both human and robotic agents from egocentric and exocentric perspectives in simulated and real-world environments.\n\n- **Key Findings**: Through systematic evaluation of 11 models, we reveal two consistent findings: (i) action-relevant features can emerge from large-scale visual pre-training without explicit action supervision, and (ii) latent-based feature spaces tend to align with robotic control better than pixel-based ones. 
These results suggest that future VLA systems may benefit more from leveraging general visual representations than from learning action spaces solely on scarce robotic data.\n\n---\n\n## Environment Setup\n\nUse `larybench` as the base environment.\n\n```bash\nconda create -n larybench python=3.10 -y\nconda activate larybench\npip install torch torchvision --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu118\npip install -r requirements.txt\n```\n\nSome model families keep their original dependencies and should be configured from their upstream projects when you evaluate them:\n\n| Model family | Environment guidance |\n|---|---|\n| `dinov2`, `dinov3`, `siglip2`, `dinov2-origin`, `dinov3-origin`, `siglip2-origin`, `lapa`, `magvit2`, `univla`, `flux2`, `wan2-2` | Use `larybench` |\n| `vjepa2` | Follow [facebookresearch\u002Fvjepa2](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fvjepa2) and activate your `vjepa2` env |\n| `villa-x` | Follow [microsoft\u002Fvilla-x](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Fvilla-x) and set `VILLA_X_DIR` |\n\nConfigure paths in `env.sh`, then source it before running commands. 
Example:\n\n```bash\nLARY_ROOT=\u002Fyour_name\u002Fcode\u002FLARYBench\nLARY_LOG_DIR=\u002Fyour_data_disk\u002FLARYBench\u002Flogs\nDATA_DIR=\u002Fyour_data_disk\u002FLARYBench\u002Fdata\nMODEL_DIR=\u002Fyour_data_disk\u002FLARYBench\u002Fmodels\nLARY_LA_DIR=\u002Fyour_data_disk\u002FLARYBench\u002Flatent_actions\nDINO_V2_PATH=\u002Fyour_data_disk\u002FLARYBench\u002Fmodels\u002FDINOv2\nDINO_V3_PATH=\u002Fyour_data_disk\u002FLARYBench\u002Fmodels\u002FDINOv3\nSIGLIP2_PATH=\u002Fyour_data_disk\u002FLARYBench\u002Fmodels\u002FSigLIP2\nsource env.sh\n```\n\n## Data Preparation\n\nThe dataset root should be `DATA_DIR`:\n\n```text\n\u002Fyour_data_disk\u002FLARYBench\u002Fdata\u002F\n├── classification\u002F\n│   ├── AgiBotWorld-Beta\u002F\n│   ├── Ego4D\u002F\n│   ├── EgoDex\u002F\n│   ├── EPIC-KITCHENS\u002F\n│   ├── HoloAssist\u002F\n│   ├── SSv2\u002F\n│   └── TACO\u002F\n├── regression\u002F\n│   ├── agibot_45\u002F\n│   ├── calvin\u002F{train_stride5,val_stride5}\u002F\n│   ├── robocoin_10\u002F\n│   └── vlabench\u002F\n└── regression_relative\u002F        # optional; generated for relative-action regression\n```\n\nMetadata CSVs are committed in this repository under `data\u002F`. They store relative paths and are resolved against `DATA_DIR` at runtime.\n\nData setup flow:\n\n1. Download the LARYBench archives from [HuggingFace](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fmeituan-longcat\u002FLARYBench) or [ModelScope](https:\u002F\u002Fmodelscope.cn\u002Fdatasets\u002Fmeituan-longcat\u002FLARYBench).\n2. Extract them so the folder layout matches the example above.\n3. Download SSv2 and EgoDex separately, then generate their clipped videos. 
This is required for the `Human_1st` classification task.\n\n```bash\npython utils\u002Fprepare_ssv2_egodex.py \\\n  --ssv2-root \u002Fpath\u002Fto\u002F20bn-something-something-v2 \\\n  --egodex-root \u002Fpath\u002Fto\u002FEgoDex \\\n  --output-dir $DATA_DIR\u002Fclassification \\\n  --workers 16\n```\n\nFor AgibotBeta and RoboCOIN regression, compute per-robot action normalization statistics from the train split before training absolute-action probes. The outputs are read automatically by `lary.cli regress`.\n\n```bash\npython utils\u002Fcompute_robot_action_stats.py \\\n  --dataset agibotbeta \\\n  --data-root $DATA_DIR\n\npython utils\u002Fcompute_robot_action_stats.py \\\n  --dataset robocoin \\\n  --data-root $DATA_DIR\n```\n\nThis writes `DATA_DIR\u002Fregression\u002Fagibot_45\u002Fagibotbeta_stats.json` and `DATA_DIR\u002Fregression\u002Frobocoin_10\u002Frobocoin_stats.json`.\n\nTo check whether images, videos, and regression `.npy` files exist and can be opened, run the integrity checker. Use `--groups` to scan only selected datasets.\n\n```bash\npython utils\u002Fcheck_dataset_integrity.py \\\n  --data-root $DATA_DIR \\\n  --output dataset_integrity_report.txt \\\n  --workers 64 \\\n  --timeout 3\n\npython utils\u002Fcheck_dataset_integrity.py \\\n  --data-root $DATA_DIR \\\n  --groups CALVIN \\\n  --output dataset_integrity_calvin.txt \\\n  --workers 64 \\\n  --timeout 3\n```\n\n## Quick Start\n\nThe evaluation has two steps:\n\n1. Extract latent actions from image pairs or videos with the model you want to evaluate.\n2. Train a lightweight probe on the extracted latent actions: classification for action semantics, or regression for physical actions.\n\nGPU defaults are explicit in the examples below. `extract` is single-GPU by default (`--gpus 0`); pass a comma-separated list (`--gpus 0,1,2,3,4,5,6,7`) to start one extraction partition per GPU and merge partition CSVs automatically. 
`classify` defaults to `--gpus 0,1,2,3,4,5,6,7`, so pass `--gpus 0` for single-card probing. `regress` follows `CUDA_VISIBLE_DEVICES`; if it is unset, the CLI assumes `0,1,2,3,4,5,6,7`, so set it explicitly for single-card runs.\n\n### Example A: Regression on CALVIN with LAPA-DINOv2\n\n```bash\nconda activate larybench\nsource env.sh\n\n# Step 1: extract latent actions from image pairs.\n# --model: dinov3, siglip2, magvit2, lapa, univla, villa-x, dinov2-origin, dinov3-origin, siglip2-origin.\n# --stride: calvin=5, vlabench=5, agibotbeta=45, robocoin=10.\n# --split: calvin\u002Fvlabench use train,val; agibotbeta\u002Frobocoin use seen_train,seen_val.\n# Single GPU by default; use --gpus 0,1,2,3 for multi-GPU partitioned extraction.\npython -m lary.cli extract \\\n  --model dinov2 \\\n  --dataset calvin \\\n  --split train \\\n  --mode image \\\n  --stride 5 \\\n  --gpus 0,1,2,3\n\npython -m lary.cli extract \\\n  --model dinov2 \\\n  --dataset calvin \\\n  --split val \\\n  --mode image \\\n  --stride 5 \\\n  --gpus 0,1,2,3\n\n# Step 2: train the regression probe.\n# Uses CUDA_VISIBLE_DEVICES; one visible GPU means single-card training, multiple visible GPUs use accelerate.\n# Keep --dataset and --stride consistent with the extracted CSV names.\nCUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m lary.cli regress \\\n  --model dinov2 \\\n  --dataset calvin \\\n  --stride 5 \\\n  --model-type mlp\n```\n\nThe extraction step writes latent-action `.npz` files under `$LARY_LA_DIR` and CSVs such as `data\u002Ftrain_la_calvin_5_dinov2.csv`. 
Regression logs and metrics are written under `$LARY_LOG_DIR\u002Fregression\u002F`.\n\n### Example B: Classification on Robot_1st with LAPA-DINOv2\n\n```bash\nconda activate larybench\nsource env.sh\n\n# Step 1: extract latent actions from videos.\n# --model: dinov3, siglip2, magvit2, lapa, univla, villa-x, dinov2-origin, dinov3-origin, siglip2-origin.\n# --dataset: robot_1st, human_1st\n\npython -m lary.cli extract \\\n  --model dinov2 \\\n  --dataset robot_1st \\\n  --split train \\\n  --mode video \\\n  --gpus 0,1,2,3\n\npython -m lary.cli extract \\\n  --model dinov2 \\\n  --dataset robot_1st \\\n  --split val \\\n  --mode video \\\n  --gpus 0,1,2,3\n\n# Step 2: train the classification probe.\n# Robot_1st has 54 classes; Human_1st has 123 classes.\n# --dim: dinov2=1024, dinov3=1024, siglip2=768, magvit2=18,\n#        lapa=1024, univla=128, villa-x=32,\n#        dinov2-origin=1024, dinov3-origin=1024, vjepa2=1024,\n#        siglip2-origin=768, wan2-2=48, flux2=128\npython -m lary.cli classify \\\n  --model dinov2 \\\n  --dataset robot_1st \\\n  --dim 1024 \\\n  --classes 54 \\\n  --gpus 0,1,2,3\n```\n\nClassification outputs are written under `$LARY_LOG_DIR\u002Fclassification\u002F`.\n\n## Relative-Action Regression\n\nAbsolute regression predicts the absolute action chunk. Relative regression predicts the motion between the two frames (`target = action_tgt - action_src`). 
Generate non-overwriting relative-action files first:\n\n```bash\npython utils\u002Fprepare_relative_actions.py \\\n  --dataset calvin \\\n  --input-root $DATA_DIR \\\n  --output-root $DATA_DIR \\\n  --csv data\u002Ftrain_la_calvin_5_dinov2.csv \\\n  --csv data\u002Fval_la_calvin_5_dinov2.csv \\\n  --workers 32\n```\n\nThis creates `DATA_DIR\u002Fregression_relative\u002F...` and writes relative-action mean\u002Fstd statistics, for example `DATA_DIR\u002Fregression_relative\u002Fcalvin\u002Frelative_action_stats_calvin.json`.\n\nRun relative regression with the same latent-action CSVs:\n\n```bash\nCUDA_VISIBLE_DEVICES=0 python -m lary.cli regress \\\n  --model dinov2 \\\n  --dataset calvin \\\n  --stride 5 \\\n  --model-type mlp \\\n  --action-mode relative\n```\n\n## Supported Models\n\n| Model key | What it extracts | Environment |\n|---|---|---|\n| `dinov2` | LAPA-DINOv2 latent actions | `larybench` |\n| `dinov3` | LAPA-DINOv3 latent actions | `larybench` |\n| `siglip2` | LAPA-SigLIP2 latent actions | `larybench` |\n| `magvit2` | Open-MAGVIT2 based latent actions | `larybench`; set `MAGVIT2_CONFIG_PATH` and `MAGVIT2_TOKENIZER_PATH` |\n| `dinov2-origin` | Raw DINOv2 visual features | `larybench` |\n| `dinov3-origin` | Raw DINOv3 visual features | `larybench` |\n| `siglip2-origin` | Raw SigLIP2 visual features | `larybench` |\n| `lapa` | LAPA \u002F LAQ latent actions | `larybench` |\n| `univla` | UniVLA latent actions | `larybench`; set `UNIVLA_CKPT_PATH` |\n| `villa-x` | villa-X latent actions | upstream villa-X env |\n| `flux2` | FLUX.2 VAE features | `larybench`; set `AE_MODEL_PATH` |\n| `vjepa2` | V-JEPA2 video features | upstream `vjepa2` env |\n| `wan2-2` | Wan2.2 VAE features | upstream `wan` env |\n\n## Adding a Custom Model\n\nLARYBench only needs your model to convert a video or image pair into a numeric `tokens` array saved in each latent-action `.npz` file.\n\n1. 
Add model-specific imports in [get_latent_action\u002Fdynamics.py](get_latent_action\u002Fdynamics.py), guarded by `USE_MODEL` if the dependency is optional.\n\n```python\nenv_model = os.environ.get(\"USE_MODEL\")\nif env_model == \"my-model\":\n    from my_project import MyModel\n```\n\n2. Register the model loader in `get_dynamic_tokenizer(model)`.\n\n```python\nelif model == \"my-model\":\n    dynamics = MyModel.from_pretrained(os.environ[\"MY_MODEL_CKPT\"]).cuda()\n```\n\n3. Add the forward branch in `get_latent_action(x, tokenizer, model_name)` and return either `(tokens, indices)` or `tokens`. Classification and regression use `tokens`; `tokens.shape[-1]` is the `--dim` value for classification.\n\n```python\nelif model_name == \"my-model\":\n    tokens = tokenizer(x)          # expected shape: (B, ..., D)\n    indices = np.array([])\n```\n\n4. If the model needs a different input format, add a matching branch in [lary\u002Fextract.py](lary\u002Fextract.py) for dataset preprocessing and batch execution. Reuse existing branches such as `dinov2-origin`, `vjepa2`, or `wan2-2` as templates.\n\n5. Set any required environment variables in `env.sh`, then run:\n\n```bash\npython -m lary.cli extract \\\n  --model my-model \\\n  --dataset calvin \\\n  --split train\u002Fval \\\n  --mode image \\\n  --stride 5 \\\n  --gpus 0\n```\n\nAfter extraction creates `data\u002Fval_la_\u003Cdataset>_\u003Cstride>_my-model.csv` or `data\u002Fval_la_\u003Cdataset>_my-model.csv`, the existing `classify` and `regress` commands can evaluate it without model-specific changes.\n\n## Supported Datasets\n\n### Classification Datasets\n\n| Dataset key | Splits | Input mode | Notes |\n|---|---|---|---|\n| `human_1st` | `train`, `val` | video | 123-class. Including EgoDex, SSv2, Ego4D, HoloAssist, EPIC-KITCHENS, TACO |\n| `robot_1st` | `train`, `val` | video | 54-class. 
Made by AgiBotWorld-Beta |\n\n### Regression Datasets\n\n| Dataset key | Splits | Stride |\n|---|---|---|\n| `calvin` | `train`, `val` | 5 |\n| `vlabench` | `train`, `val` | 5 |\n| `vlabench_15` | `train`, `val` | 15 |\n| `vlabench_30` | `train`, `val` | 30 |\n| `agibotbeta` | `seen_train`, `seen_val` | 45 |\n| `robocoin` | `seen_train`, `seen_val` | 10 |\n\n## Evaluation Outputs\n\nExtraction creates `.npz` latent actions under `$LARY_LA_DIR` and a metadata CSV under this repository's `data\u002F` directory. Classification writes checkpoints, logs, confusion matrices, and class metrics under `$LARY_LOG_DIR\u002Fclassification\u002F`. Regression writes checkpoints, best-result CSVs, and trajectory visualizations under `$LARY_LOG_DIR\u002Fregression\u002F`.\n\n---\n\n## Citation\n\nIf you find this work useful, please cite:\n\n```bibtex\n@misc{nie2026larylatentactionrepresentation,\n      title={LARY: A Latent Action Representation Yielding Benchmark for Generalizable Vision-to-Action Alignment}, \n      author={Dujun Nie and Fengjiao Chen and Qi Lv and Jun Kuang and Xiaoyu Li and Xuezhi Cao and Xunliang Cai},\n      year={2026},\n      eprint={2604.11689},\n      archivePrefix={arXiv},\n      primaryClass={cs.CV},\n      url={https:\u002F\u002Farxiv.org\u002Fabs\u002F2604.11689}, \n}\n```\n\n---\n\n## Data Statements\n\nLARYBench is built upon the following publicly available datasets. 
We gratefully acknowledge the efforts of their creators and ask users to comply with each dataset's respective license and terms of use.\n\n| Dataset | Link |\n|---|---|\n| EgoDex | [github.com\u002Fapple\u002Fml-egodex](https:\u002F\u002Fgithub.com\u002Fapple\u002Fml-egodex) |\n| Something-Something V2 | [something-something-v2](https:\u002F\u002Fwww.qualcomm.com\u002Fdeveloper\u002Fsoftware\u002Fsomething-something-v-2-dataset) |\n| Ego4D | [github.com\u002Ffacebookresearch\u002FEgo4d](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FEgo4d) |\n| HoloAssist | [holoassist.github.io](https:\u002F\u002Fholoassist.github.io\u002F) |\n| EPIC-KITCHENS | [epic-kitchens.github.io](https:\u002F\u002Fepic-kitchens.github.io\u002F) |\n| TACO | [taco2024.github.io](https:\u002F\u002Ftaco2024.github.io\u002F) |\n| AgiBotWorld-Beta | [github.com\u002FOpenDriveLab\u002FAgiBot-World](https:\u002F\u002Fgithub.com\u002FOpenDriveLab\u002FAgiBot-World) |\n| LIBERO | [github.com\u002FLifelong-Robot-Learning\u002FLIBERO](https:\u002F\u002Fgithub.com\u002FLifelong-Robot-Learning\u002FLIBERO) |\n| RoboCOIN | [github.com\u002FFlagOpen\u002FRoboCOIN](https:\u002F\u002Fgithub.com\u002FFlagOpen\u002FRoboCOIN) |\n| VLABench | [github.com\u002FOpenMOSS\u002FVLABench](https:\u002F\u002Fgithub.com\u002FOpenMOSS\u002FVLABench) |\n| CALVIN | [github.com\u002Fmees\u002Fcalvin](https:\u002F\u002Fgithub.com\u002Fmees\u002Fcalvin) |\n\n\n## License\n\nThe code and tools in this repository are released under the [MIT License](LICENSE).\n\nHowever, this dataset is derived from multiple third-party datasets, each governed by its own license. 
**The overall dataset is subject to the most restrictive terms among all included sources.** Users must comply with the respective licenses for each subset.\n\n### Dataset License Summary\n\n| Dataset | License |\n|---|---|\n| EPIC-KITCHENS | [CC BY-NC 4.0](https:\u002F\u002Fcreativecommons.org\u002Flicenses\u002Fby-nc\u002F4.0\u002F) |\n| TACO | [CC BY 4.0](https:\u002F\u002Fcreativecommons.org\u002Flicenses\u002Fby\u002F4.0\u002F) |\n| AgiBotWorld-Beta | [CC BY-NC-SA 4.0](https:\u002F\u002Fcreativecommons.org\u002Flicenses\u002Fby-nc-sa\u002F4.0\u002F) |\n| Ego4D, HoloAssist, LIBERO, RoboCOIN, VLABench, CALVIN | [MIT](https:\u002F\u002Fopensource.org\u002Flicenses\u002FMIT) |\n\n### Important Notices\n\n- **Non-commercial use only**: Subsets derived from EPIC-KITCHENS and AgiBotWorld-Beta are restricted to **non-commercial research and educational purposes only**, due to the NC (NonCommercial) clauses in their respective licenses.\n\n- **ShareAlike**: The AgiBotWorld-Beta-derived subset is subject to the **SA (ShareAlike)** clause. Any redistribution of this subset must use the same [CC BY-NC-SA 4.0](https:\u002F\u002Fcreativecommons.org\u002Flicenses\u002Fby-nc-sa\u002F4.0\u002F) license.\n\n- **Attribution required**: All subsets derived from Creative Commons-licensed sources require proper attribution to the original dataset authors.\n\n### Usage Recommendation\n\nIf you intend to use this dataset for **commercial purposes**, please use only the subsets released under MIT or CC BY 4.0 licenses (i.e., TACO and the MIT-licensed datasets listed in the table above). 
The remaining subsets are **strictly non-commercial**.\n\nFor any questions regarding licensing, please refer to the original dataset sources or contact the respective dataset authors.\n\n---\n\n## Acknowledgements\n\nWe thank the following open-source projects for their contributions:\n\n- [V-JEPA2](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fvjepa2)\n- [UniVLA](https:\u002F\u002Fgithub.com\u002FOpenDriveLab\u002FUniVLA)\n- [Wan2.2](https:\u002F\u002Fgithub.com\u002FWan-Video\u002FWan2.2)\n- [flux2](https:\u002F\u002Fgithub.com\u002Fblack-forest-labs\u002Fflux2)\n- [villa-x](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Fvilla-x)\n- [Open-MAGVIT2](https:\u002F\u002Fgithub.com\u002Ftencentarc\u002Fseed-voken)\n- [SigLIP2](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Fbig_vision\u002Ftree\u002Fmain)\n- [DINOv2](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fdinov2)\n- [DINOv3](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fdinov3)\n\n## Support\nPlease contact us at \u003Ca href=\"mailto:longcat-team@meituan.com\">longcat-team@meituan.com\u003C\u002Fa> or join our WeChat Group if you have any questions.\n\n#### WeChat Group\n\u003Cimg src=\"https:\u002F\u002Fraw.githubusercontent.com\u002Fmeituan-longcat\u002FLongCat-Flash-Chat\u002Fmain\u002Fwechat-assets\u002FWechat.png\" width=\"200px\">\n","LARYBench is a unified framework for evaluating latent action representations. For models that produce latent action representations, it provides three complementary evaluation pipelines: extraction of latent action representations from videos or image pairs, a classification task that tests how well action semantics are captured, and a regression task that tests how well physical robot actions can be decoded. The project is written in Python and has a modular design that is easy to extend and customize. It is suited to evaluating model performance in settings such as vision-to-action alignment research, robot learning, and automated control. The MIT license makes the tool easy to adopt in both academia and industry.",2,"2026-06-11 02:48:55","CREATED_QUERY"]