[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-1361":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":16,"stars7d":17,"stars30d":18,"stars90d":15,"forks30d":15,"starsTrendScore":19,"compositeScore":20,"rankGlobal":9,"rankLanguage":9,"license":21,"archived":22,"fork":22,"defaultBranch":23,"hasWiki":22,"hasPages":22,"topics":24,"createdAt":9,"pushedAt":9,"updatedAt":25,"readmeContent":26,"aiSummary":27,"trendingCount":15,"starSnapshotCount":15,"syncStatus":13,"lastSyncTime":28,"discoverSource":29},1361,"DiT4DiT","Mondo-Robotics\u002FDiT4DiT","Mondo-Robotics","This is the official code repo for DiT4DiT, a Vision-Action-Model (VAM) framework that combines video generation model with flow-matching-based action prediction for generalizable robotic manipulation.",null,"Python",335,15,2,9,0,18,27,88,54,85.91,"MIT License",false,"main",[],"2026-06-12 04:00:09","\u003Cdiv align=\"center\">\n\n  \u003Cimg src=\"media\u002Flogo.svg\" width=\"480\" alt=\"DiT4DiT\">\n\n  \u003Ch2>Jointly Modeling Video Dynamics and Actions for Generalizable Robot Control\u003C\u002Fh2>\n\n\u003C\u002Fdiv>\n\n\u003Cdiv align=\"center\">\n\n[![arXiv](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2603.10448-FF5500.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.10448)\n[![Project Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject-Page-FF8C00.svg)](https:\u002F\u002Fdit4dit.github.io\u002F)\n[![License](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-MIT-FF8C00.svg)](LICENSE)\n\n\u003C\u002Fdiv>\n\n\u003Cdiv align=\"center\">\n\n[Teli Ma](https:\u002F\u002Fteleema.github.io\u002F)\u003Csup>1,2\u003C\u002Fsup> &nbsp;&nbsp;\n[Jia Zheng](https:\u002F\u002Fjiaazheng.github.io\u002F)\u003Csup>1,2\u003C\u002Fsup> &nbsp;&nbsp;\n[Zifan 
Wang](https:\u002F\u002Fscholar.google.com\u002Fcitations?user=GaJXZ-UAAAAJ&hl=en)\u003Csup>1,2\u003C\u002Fsup> &nbsp;&nbsp;\n[Chunli Jiang](https:\u002F\u002Fscholar.google.com\u002Fcitations?user=nvzF-RMAAAAJ&hl=en)\u003Csup>1\u003C\u002Fsup> &nbsp;&nbsp;\nAndy Cui\u003Csup>1\u003C\u002Fsup> &nbsp;&nbsp;\n[Junwei Liang](https:\u002F\u002Fjunweiliang.me\u002Findex.html)\u003Csup>2,3,\\*\u003C\u002Fsup> &nbsp;&nbsp;\n[Shuo Yang](https:\u002F\u002Fshuoyangrobotics.github.io\u002F)\u003Csup>1,\\*\u003C\u002Fsup>\n\n\u003Csup>1\u003C\u002Fsup>Mondo Robotics &nbsp;&nbsp; \u003Csup>2\u003C\u002Fsup>HKUST(GZ) &nbsp;&nbsp; \u003Csup>3\u003C\u002Fsup>HKUST &nbsp;&nbsp; \u003Csup>\\*\u003C\u002Fsup>Corresponding author\n\n\u003C\u002Fdiv>\n\n---\n\nDiT4DiT is a \u003Cb>\u003Cspan style=\"color: #FF8C00;\">Vision-Action-Model (VAM)\u003C\u002Fspan>\u003C\u002Fb> framework that combines video generation transformers with flow-matching-based action prediction for generalizable robotic manipulation. It supports both tabletop and whole-body control for manipulation tasks. 
Notably, DiT4DiT stands as the \u003Cb>first\u003C\u002Fb> efficient VAM to achieve real-time whole-body control of humanoid robots.\n\n\n\n## News\n\n- **[2026-04-15]** Initial release of DiT4DiT with training, evaluation, and deployment code.\n- **[2026-03-11]** We release the [arXiv paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.10448).\n\n\n### Whole-Body Control (all 1x speed & autonomous)\n\n\u003Cdiv align=\"center\">\n\u003Ctable>\n\u003Ctr>\n\u003Ctd align=\"center\" colspan=\"2\">\u003Cb>Shelf Organization\u003C\u002Fb>\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003Ctr>\n\u003Ctd align=\"center\" colspan=\"2\">\u003Cimg src=\"media\u002Fshelf_organization.webp\" width=\"800\">\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003Ctr>\n\u003Ctd align=\"center\" colspan=\"2\">\u003Cb>Relocate Chair\u003C\u002Fb>\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003Ctr>\n\u003Ctd align=\"center\" colspan=\"2\">\u003Cimg src=\"media\u002Frelocate_chair.webp\" width=\"800\">\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003Ctr>\n\u003Ctd align=\"center\" colspan=\"2\">\u003Cb>Assembly Line Work\u003C\u002Fb>\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003Ctr>\n\u003Ctd align=\"center\" colspan=\"2\">\u003Cimg src=\"media\u002Fassembly_line.webp\" width=\"800\">\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003C\u002Ftable>\n\u003C\u002Fdiv>\n\n### Tabletop Manipulation (all 1x speed, 1 policy for all tasks)\n\n\u003Cdiv align=\"center\">\n\u003Ctable>\n\u003Ctr>\n\u003Ctd align=\"center\">\u003Cb>Stack Cups\u003C\u002Fb>\u003C\u002Ftd>\n\u003Ctd align=\"center\">\u003Cb>Drawer Interaction\u003C\u002Fb>\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003Ctr>\n\u003Ctd align=\"center\">\u003Cimg src=\"media\u002Fcups.gif\" width=\"400\">\u003C\u002Ftd>\n\u003Ctd align=\"center\">\u003Cimg src=\"media\u002Fdrawer.gif\" width=\"400\">\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003Ctr>\n\u003Ctd align=\"center\">\u003Cb>Pick and Place\u003C\u002Fb>\u003C\u002Ftd>\n\u003Ctd align=\"center\">\u003Cb>Arrange 
Flower\u003C\u002Fb>\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003Ctr>\n\u003Ctd align=\"center\">\u003Cimg src=\"media\u002Feggplant.gif\" width=\"400\">\u003C\u002Ftd>\n\u003Ctd align=\"center\">\u003Cimg src=\"media\u002Fflower.gif\" width=\"400\">\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003Ctr>\n\u003Ctd align=\"center\">\u003Cb>Move Spoon\u003C\u002Fb>\u003C\u002Ftd>\n\u003Ctd align=\"center\">\u003Cb>Insert Plate\u003C\u002Fb>\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003Ctr>\n\u003Ctd align=\"center\">\u003Cimg src=\"media\u002Fspoon.gif\" width=\"400\">\u003C\u002Ftd>\n\u003Ctd align=\"center\">\u003Cimg src=\"media\u002Fplate.gif\" width=\"400\">\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003Ctr>\n\u003Ctd align=\"center\">\u003Cb>Box Packing\u003C\u002Fb>\u003C\u002Ftd>\n\u003Ctd align=\"center\">\u003Cb>Twist Cap\u003C\u002Fb>\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003Ctr>\n\u003Ctd align=\"center\">\u003Cimg src=\"media\u002Fpacking.gif\" width=\"400\">\u003C\u002Ftd>\n\u003Ctd align=\"center\">\u003Cimg src=\"media\u002Ftwist_cap.gif\" width=\"400\">\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003C\u002Ftable>\n\u003C\u002Fdiv>\n\n\n## Table of Contents\n\n- [News](#news)\n- [TODOs](#todos)\n- [Project Structure](#project-structure)\n- [Installation](#installation)\n- [Quick Start](#quick-start)\n  - [Simulation](#simulation)\n  - [Real Robot](#real-robot)\n- [Acknowledgements](#acknowledgements)\n- [License](#license)\n\n\n\n## TODOs\n\n- [ ] Release teleoperation, training and deployment code for Unitree G1 tabletop tasks.\n- [ ] Release teleoperation, training and deployment code for Unitree G1 whole-body control tasks.\n\n## Project Structure\n\n```\nDiT4DiT\u002F\n├── DiT4DiT\u002F                    # Core package\n│   ├── config\u002F                 # Configurations\n│   │   ├── deepseeds\u002F          # DeepSpeed configs\n│   │   ├── robocasa\u002F           # RoboCasa experiment configs\n│   │   └── real_robot\u002F         # Real robot configs\n│   ├── dataloader\u002F            
 # Dataset loading (LeRobot)\n│   ├── model\u002F                  # Model architecture\n│   │   ├── framework\u002F          # DiT4DiT framework\n│   │   └── modules\u002F            # Backbone & action model\n│   └── training\u002F               # Training scripts & utilities\n├── deployment\u002F                 # WebSocket-based model server\n├── docs\u002F                       # Documentation\n├── examples\u002F\n│   ├── Robocasa_tabletop\u002F      # RoboCasa simulation example\n│   │   ├── train_files\u002F        # Training scripts\n│   │   └── eval_files\u002F         # Evaluation & simulation\n│   └── Real_G1\u002F                # Real Unitree G1 example\n│       ├── train_files\u002F        # Training scripts\n│       └── eval_files\u002F         # Evaluation\n└── requirements.txt\n```\n\n## Installation\n\n### Prerequisites\n\n- Python >= 3.10\n- CUDA 12.4+\n- \\>8x GPUs recommended for training\n\n### Setup\n\n```bash\n# Clone the repository\ngit clone https:\u002F\u002Fgithub.com\u002FMondo-Robotics\u002FDiT4DiT.git\ncd DiT4DiT\n\n# Create conda environment\nconda create -n dit4dit python=3.10 -y\nconda activate dit4dit\n\n# Install PyTorch (CUDA 12.8 recommended)\npip install torch==2.7.0 torchvision==0.22.0 torchaudio==2.7.0 --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu128\n\n# Install dependencies\npip install -r requirements.txt\n\n# Install the package\npip install -e .\n```\n\n### Download Pretrained Backbone\n\nDownload the Cosmos-Predict2.5-2B model from Hugging Face:\n\n```bash\nhuggingface-cli download nvidia\u002FCosmos-Predict2.5-2B --revision diffusers\u002Fbase\u002Fpost-trained --local-dir \u002Fpath\u002Fto\u002FCosmos-Predict2.5-2B\n```\n\n## Model Zoo\n\nWe release pretrained checkpoints to facilitate reproduction.\n\n### Available Checkpoints\n\n| Model | Description | Dataset | Success Rate | Link |\n| --- | --- | --- | --- | --- |\n| **DiT4DiT-LIBERO** | DiT4DiT for LIBERO benchmark | LIBERO | 98.6 | [🤗 
Hugging Face](https:\u002F\u002Fhuggingface.co\u002Fmondo-robotics\u002Fdit4dit-model\u002Ftree\u002Fmain\u002Fdit4dit_libero) |\n| **DiT4DiT-RoboCasa-GR1** | DiT4DiT for RoboCasa-GR1 tabletop tasks | RoboCasa-GR1 | 56.7 | [🤗 Hugging Face](https:\u002F\u002Fhuggingface.co\u002Fmondo-robotics\u002Fdit4dit-model\u002Ftree\u002Fmain\u002Fdit4dit_robocasa_gr1) |\n\n> **Note:** More checkpoints will be released soon. Stay tuned!\n\n## Quick Start\n\n### Simulation\n\n- **LIBERO**: See the full training and evaluation guide [here](docs\u002Flibero.md).\n- **RoboCasa-GR1 Tabletop**: See the full training and evaluation guide [here](docs\u002Frobocasa_tabletop.md).\n\n### Real Robot\n\nComing soon.\n\n## Results\n\n### LIBERO Benchmark\n\n| Task Suite | Success Rate |\n|------------|-------------|\n| LIBERO-Spatial | 98.6 |\n| LIBERO-Object | 100.0 |\n| LIBERO-Goal | 99.2 |\n| LIBERO-10 | 96.6 |\n| **Average** | **98.6** |\n\n### Robocasa-GR1 Benchmark\n\nThe following results are obtained using the default training parameters described in [Configure Training](#configure-training). We report three independent evaluation runs of the same checkpoint to demonstrate reproducibility. 
The model consistently achieves an average success rate above 56% across all runs.\n\n| Task | Run 1 | Run 2 | Run 3 |\n|------|-------|-------|-------|\n| BottleToCabinetClose | 50.0 | 72.0 | 68.0 |\n| CanToDrawerClose | 80.0 | 80.0 | 82.0 |\n| CupToDrawerClose | 50.0 | 34.0 | 50.0 |\n| MilkToMicrowaveClose | 58.0 | 60.0 | 38.0 |\n| PotatoToMicrowaveClose | 40.0 | 40.0 | 36.0 |\n| WineToCabinetClose | 60.0 | 48.0 | 60.0 |\n| FromCuttingboardToBasket | 54.0 | 48.0 | 46.0 |\n| FromCuttingboardToCardboardbox | 50.0 | 60.0 | 48.0 |\n| FromCuttingboardToPan | 80.0 | 74.0 | 78.0 |\n| FromCuttingboardToPot | 52.0 | 46.0 | 66.0 |\n| FromCuttingboardToTieredbasket | 44.0 | 54.0 | 50.0 |\n| FromPlacematToBasket | 58.0 | 40.0 | 44.0 |\n| FromPlacematToBowl | 64.0 | 66.0 | 72.0 |\n| FromPlacematToPlate | 66.0 | 62.0 | 64.0 |\n| FromPlacematToTieredshelf | 44.0 | 48.0 | 40.0 |\n| FromPlateToBowl | 64.0 | 74.0 | 54.0 |\n| FromPlateToCardboardbox | 50.0 | 54.0 | 52.0 |\n| FromPlateToPan | 58.0 | 68.0 | 70.0 |\n| FromPlateToPlate | 62.0 | 64.0 | 72.0 |\n| FromTrayToCardboardbox | 52.0 | 50.0 | 60.0 |\n| FromTrayToPlate | 64.0 | 64.0 | 58.0 |\n| FromTrayToPot | 68.0 | 70.0 | 66.0 |\n| FromTrayToTieredbasket | 50.0 | 46.0 | 50.0 |\n| FromTrayToTieredshelf | 42.0 | 36.0 | 28.0 |\n| **Average** | **56.7** | **56.6** | **56.3** |\n\n## Citation\n\nIf you find this work useful, please consider citing:\n\n```bibtex\n@article{ma2026dit4dit,\n  title={DiT4DiT: Jointly Modeling Video Dynamics and Actions for Generalizable Robot Control},\n  author={Ma, Teli and Zheng, Jia and Wang, Zifan and Jiang, Chunli and Cui, Andy and Liang, Junwei and Yang, Shuo},\n  journal={arXiv preprint arXiv:2603.10448},\n  year={2026}\n}\n```\n\n## License\n\nThis project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.\n\n## Acknowledgements\n\nThis project builds upon:\n- [StarVLA](https:\u002F\u002Fgithub.com\u002FstarVLA\u002FstarVLA)\n- 
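[Cosmos-Predict2.5](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FCosmos) by NVIDIA\n- [GR00T](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FIsaac-GR00T) by NVIDIA\n- [RoboCasa](https:\u002F\u002Fgithub.com\u002Frobocasa\u002Frobocasa)\n- [LeRobot](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Flerobot) by Hugging Face\n\n\n\n","DiT4DiT is a Vision-Action-Model (VAM) framework that combines a video generation model with flow-matching-based action prediction for generalizable robotic manipulation. Its core capabilities pair video generation transformers with action prediction to support both tabletop and whole-body control tasks, and it performs particularly well at real-time whole-body control of humanoid robots. The project is written in Python and is the first VAM framework to achieve this efficiently. It suits robotic applications that demand high flexibility and adaptability, such as automated tasks in complex environments like warehouse organization, furniture relocation, and assembly line work.","2026-06-11 02:43:15","CREATED_QUERY"]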