[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-82284":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":9,"languages":9,"totalLinesOfCode":9,"stars":10,"forks":11,"watchers":12,"openIssues":13,"contributorsCount":14,"subscribersCount":14,"size":14,"stars1d":15,"stars7d":16,"stars30d":17,"stars90d":14,"forks30d":14,"starsTrendScore":18,"compositeScore":19,"rankGlobal":9,"rankLanguage":9,"license":9,"archived":20,"fork":20,"defaultBranch":21,"hasWiki":20,"hasPages":20,"topics":22,"createdAt":9,"pushedAt":9,"updatedAt":23,"readmeContent":24,"aiSummary":25,"trendingCount":14,"starSnapshotCount":14,"syncStatus":26,"lastSyncTime":27,"discoverSource":28},82284,"Qwen-VLA","QwenLM\u002FQwen-VLA","QwenLM","The official repository of Qwen-VLA",null,560,20,22,9,0,13,69,328,56,94.97,false,"main",[],"2026-06-12 04:01:37","\u003Cdiv align=\"center\">\n\n\u003Cimg src=\"assets\u002Fqwen-logo.png\" alt=\"Qwen-VLA\" width=\"260\"\u002F>\n\n\u003Ch1 style=\"border: none;\">Qwen-VLA\u003C\u002Fh1>\n\n\u003Cp>\u003Cb>Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments\u003C\u002Fb>\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  \u003Cb>Qwen Team\u003C\u002Fb>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n    \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.30280\">📑 Technical Report\u003C\u002Fa> |\n    \u003Ca href=\"https:\u002F\u002Fqwen.ai\u002Fblog?id=qwenvla\">📖 Blog\u003C\u002Fa> |\n    \u003Ca href=\"https:\u002F\u002Fqianwen-res.oss-accelerate.aliyuncs.com\u002FQwen-VLA\u002Fdemo.mp4\">🖥️ Demo\u003C\u002Fa>\n\u003C\u002Fp>\n\n\n\u003C\u002Fdiv>\n\nWelcome to the official repository of **Qwen-VLA**. 
Here you can find official information about Qwen-VLA and post questions in [Issues](https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen-VLA\u002Fissues).\n\n\n## 🎬 Demo\n\n\n\u003Cdiv align=\"center\">\n  \u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F7521d371-a1d5-4743-928d-aa3b5ce7374e\" width=\"100%\" controls>\u003C\u002Fvideo>\n\u003C\u002Fdiv>\n\n\n## 💡 Introduction\n\n\u003Cdiv align=\"center\">\n  \u003Cimg src=\"assets\u002Fqwenvla_overview.png\" alt=\"Qwen-VLA Overview\" width=\"90%\"\u002F>\n\u003C\u002Fdiv>\n\n\u003Cbr>\n\n**Qwen-VLA** is a unified vision-language-action generalist model built upon **Qwen3.5-4B** (vision-language backbone) and a **1.15B DiT flow-matching action decoder**. It casts manipulation, navigation, and trajectory prediction into a shared action-and-trajectory prediction framework. This lets a single model learn from heterogeneous embodied data across tasks, environments, and robot embodiments via embodiment-aware prompt conditioning, with no per-platform output heads needed.\n\nA unified Qwen-VLA generalist **matches or outperforms task-specific specialists** fine-tuned independently per benchmark across multiple simulation and real-world evaluations, pushing embodied intelligence from \"skill specialists\" toward \"generalist actors.\"\n\n### ✨ Key Highlights\n\n- **🏆 One Generalist Beats Specialists.** A unified model matches or outperforms per-benchmark specialists across multiple simulation and real-world evaluations.\n\n- **🔗 Unified Action-and-Trajectory Framework.** Manipulation, navigation, egocentric action modeling, and trajectory prediction share one action-and-trajectory prediction space.\n\n- **🤖 Embodiment-Aware Prompt Conditioning.** One set of weights serves multiple platforms; switching embodiments requires only changing a text prompt.\n\n- **📈 Progressive Training Recipe.** The recipe spans large-scale action pretraining, multimodal continued 
pretraining, supervised fine-tuning, and reinforcement learning, bridging the gap between discrete vision-language tokens and continuous action trajectories.\n\n- **🌍 Strong Real-World OOD Generalization.** Large-scale embodied pretraining enables robust generalization to unseen conditions in real-world deployment, significantly outperforming specialist baselines.\n\n## 🏆 Benchmarks\n\n### Manipulation & Navigation\n\nAs a **unified generalist policy**, Qwen-VLA is trained once on all embodiments jointly and evaluated across all platforms without per-benchmark adaptation, simultaneously handling both manipulation and navigation.\n\n| Model | LIBERO | RoboCasa-GR1 | Simpler-WidowX | RoboTwin-Easy | RoboTwin-Hard | R2R OS | R2R SR | RxR SR |\n| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |\n| **Qwen-VLA-Base** | 90.8 | 40.4 | 64.3 | 64.3 | 66.4 | 61.7 | 53.8 | 55.1 |\n| **Qwen-VLA-Instruct** | **97.9** | **56.7** | **73.7** | **86.1** | **87.2** | **69.0** | **57.5** | **59.6** |\n\n\n**Out-of-Distribution Generalization**\n\n| Model | SimplerEnv-OOD SR (%) | DOMINO SR (%) | DOMINO MS (%) |\n| :--- | :---: | :---: | :---: |\n| **Qwen-VLA-Base** | 25.3 | 21.1 | 37.4 |\n| **Qwen-VLA-Instruct** | **32.0** | **26.6** | **39.5** |\n\n> SimplerEnv-OOD: fine-tuned solely on simple pick-and-place, evaluated on unseen spatial and visual tasks.\n> DOMINO: zero-shot evaluation on dynamic manipulation with moving objects, no dynamic training data used.\n\n### Real-World Results\n\nOn the ALOHA bimanual platform, GR00T N1.6 and &pi;\u003Csub>0.5\u003C\u002Fsub> are **per-task specialist** models fine-tuned independently, while **Qwen-VLA is a unified all-in-one generalist** that handles all tasks, embodiments, and modalities within one unified model.\n\n**In-Domain Performance (Success Rate %)**\n\n| Model | Pick & Place | Table Cleaning | Bowl Stacking | Bowl Pick & Place | Towel Folding | Fine-grained | **Avg** |\n| :--- | :---: | :---: | :---: | :---: 
| :---: | :---: | :---: |\n| GR00T N1.6 | 30.8 | 38.5 | 53.8 | 19.2 | 19.2 | 10.3 | 28.6 |\n| &pi;\u003Csub>0.5\u003C\u002Fsub> | 73.1 | 84.6 | 88.5 | 69.2 | **80.8** | 33.3 | 71.6 |\n| Qwen-VLA-aloha (w\u002Fo pretrain) | 30.8 | 53.8 | 61.5 | 64.1 | 50.0 | 30.8 | 48.5 |\n| **Qwen-VLA-aloha (w\u002F pretrain)** | **96.2** | **92.3** | **98.7** | **87.2** | 65.4 | **61.5** | **83.6** |\n\n**OOD Performance (Success Rate %)**\n\n| Model | Color | Instance | Position | Background | Instruction | **Avg** |\n| :--- | :---: | :---: | :---: | :---: | :---: | :---: |\n| GR00T N1.6 | 46.2 | 38.5 | 3.8 | 19.2 | 19.2 | 25.4 |\n| &pi;\u003Csub>0.5\u003C\u002Fsub> | 57.7 | 61.5 | 19.2 | 26.9 | 42.3 | 41.5 |\n| Qwen-VLA-aloha (w\u002Fo pretrain) | 42.3 | 30.8 | 34.6 | 30.8 | 42.3 | 36.2 |\n| **Qwen-VLA-aloha (w\u002F pretrain)** | **88.5** | **76.9** | **53.8** | **80.8** | **84.6** | **76.9** |\n\n\n\n## 📜 Citation\n\nIf you find our work helpful, please consider citing it.\n\n```bibtex\n@misc{qwenvla,\n      title={Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments}, \n      author={Qiuyue Wang and Mingsheng Li and Jian Guan and Jinhui Ye and Sicheng Xie and Yitao Liu and Junhao Chen and Zhixuan Liang and Jie Zhang and Xintong Hu and Xuhong Huang and Pei Lin and Junyang Lin and Dayiheng Liu and Shuai Bai and Jingren Zhou and Jiazhao Zhang and Haoqi Yuan and Gengze Zhou and Hang Yin and Ye Wang and Yiyang Huang and Zixing Lei and Wujian Peng and Delin Chen and Yingming Zheng and Jingyang Fan and Xianwei Zhuang and Xin Zhou and Haoyang Li and Anzhe Chen and Tong Zhang and Xuejing Liu and Yuchong Sun and Ruizhe Chen and Zhaohai Li and Chenxu Lü and Zhibo Yang and Tao Yu and Xionghui Chen},\n      year={2026},\n      eprint={2605.30280},\n      archivePrefix={arXiv},\n      primaryClass={cs.RO},\n      url={https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.30280}, \n}\n```\n\n","Qwen-VLA is a unified vision-language-action generalist model built on a Qwen3.5-4B 
vision-language backbone and a 1.15B DiT flow-matching action decoder. Its core capability is to cast manipulation, navigation, and trajectory prediction into a shared action-and-trajectory prediction framework, learning from heterogeneous data across tasks, environments, and robot embodiments through embodiment-aware prompt conditioning, without a separate output head for each platform. The model suits scenarios that require efficient learning and adaptation across different tasks, environments, and robot embodiments, such as robot control and autonomous navigation. Qwen-VLA also follows a progressive training recipe that runs from large-scale action pretraining through reinforcement learning, and it shows strong real-world generalization, outperforming task-specific models under unseen conditions.",2,"2026-06-11 04:08:16","CREATED_QUERY"]