[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-82875":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":23,"hasPages":23,"topics":25,"createdAt":10,"pushedAt":10,"updatedAt":33,"readmeContent":34,"aiSummary":35,"trendingCount":16,"starSnapshotCount":16,"syncStatus":36,"lastSyncTime":37,"discoverSource":38},82875,"VLM3","facebookresearch\u002FVLM3","facebookresearch","Official implementation of the paper \"VLM³: Vision Language Models Are Native 3D Learners\".","https:\u002F\u002Farxiv.org\u002Fpdf\u002F2605.30561",null,"Jupyter Notebook",274,9,7,1,0,51,75,144,153,93,"Other",false,"main",[26,27,28,29,30,31,32],"3d-foundation-model","camera-pose-estimation","depth-estimation","image-matching","large-language-models","object-level-3d","vlms","2026-06-12 04:01:39","# [VLM³: Vision Language Models Are Native 3D Learners](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2605.30561)\n\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2605.30561-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.30561)\n\n| Model      |                                               Coming Soon!                                                |\n|:----:|:-------------------------------------------------------------------------------------------------:|\n\n\u003Cdiv align=center>\n\u003Cimg width=100% src=\".\u002Fmedia\u002Fteaser.svg\"\u002F>\n\u003C\u002Fdiv>\n\n\n## Summary\n\nWe show that **standard VLMs** are native 3D learners. 
We propose VLM³, which, **without complex data augmentations or any architecture\u002Floss change**, enables standard VLMs to:\n- Surpass SpatialRGPT on object-level 3D understanding (both qualitatively and quantitatively on SpatialRGPT-bench), without using extra encoders;\n- Match [UnidepthV2](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.20110) and [Moge-2](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2507.02546) on metric depth estimation, improving the accuracy of [DepthLM](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FDepthLM_Official) from 0.84 to 0.9;\n- Surpass [DKM](https:\u002F\u002Fopenaccess.thecvf.com\u002Fcontent\u002FCVPR2023\u002Fpapers\u002FEdstedt_DKM_Dense_Kernelized_Feature_Matching_for_Geometry_Estimation_CVPR_2023_paper.pdf) and [RoMa](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.15404) on pixel correspondence estimation;\n- Match [DepthAnything3](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.10647) and surpass [VGGT](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2503.11651) on camera pose estimation.\n\nVLM³ opens up a new paradigm for simple and scalable 3D learning. 
Now you don't need to spend a year designing:\n- complex models with different backbones, prediction heads, and routings;\n- complex losses for different prediction heads, and balancing weights for different losses;\n- complex data augmentations such as image cropping, rotation, translation, appearance augmentation, etc.\n\n**All you need to do is collect data, and scale the training with a standard VLM!**\n\n\nOur findings provide a new perspective on what is and is not necessary for 3D vision:\n- Large models, task-specific architectures, losses, data augmentations, and even the regression formulation that underlies most SOTA 3D expert vision models are not necessary for effective 3D learning.\n- A generalist foundation model (VLM) with a unified output domain (text), combined with data scaling, is sufficient.\n\n\n## Method Overview\n\nGiven the input images, VLM³ first resizes them so that the focal length is the same for all input images (e.g., 1000 pixels). This resolves camera ambiguity without adding extra VLM encoders\u002Fmodules. To refer to an object or pixel, VLM³ simply uses text with the pixel range normalized (e.g., [0, 2000) or [0, 1000)) for both horizontal and vertical axes. This requires no architecture change or marker rendering, and makes VLM³ much more flexible and scalable. 
Standard VLM architectures and text-based training (SFT) are used to train the model.\n\n\u003Cdiv align=center>\n\u003Cimg width=100% src=\".\u002Fmedia\u002Fpipeline.svg\"\u002F>\n\u003C\u002Fdiv>\n\n## Results\n\n\u003Cdiv align=center>\n\u003Cimg width=100% src=\".\u002Fmedia\u002Fvisualizations.svg\"\u002F>\n\u003C\u002Fdiv>\n\n\u003Cdiv align=center>\n\u003Cimg width=100% src=\".\u002Fmedia\u002Ftable1.png\"\u002F>\n\u003C\u002Fdiv>\n\n\u003Cdiv align=center>\n\u003Cimg width=100% src=\".\u002Fmedia\u002Ftable2.png\"\u002F>\n\u003C\u002Fdiv>\n\n\n## Contact\nZhipeng Cai, Meta Inc, homepage: https:\u002F\u002Fzhipengcai.github.io\u002F, email: czptc2h at gmail dot com.\n\n# Quickstart\n\nInstall `transformers` to run inference with our model.\n\n    pip install \"transformers>=5.4.0\"\n\nSince VLM³ keeps the architecture of the base model (Qwen3-VL-4B), it can be called for inference in the same way as the original VLM.\n\n\n### Using 🤗 Transformers to Chat\n\nThe following snippet shows how to use the chat model with `transformers`:\n\n```python\nfrom transformers import AutoModelForImageTextToText, AutoProcessor\n\n# default: Load the model on the available device(s)\nmodel = AutoModelForImageTextToText.from_pretrained(\n    \"facebook\u002FVLM3-depth\", dtype=\"auto\", device_map=\"auto\"\n)\n\nprocessor = AutoProcessor.from_pretrained(\"facebook\u002FVLM3-depth\")\n\n# Query point in normalised coordinates (placeholder values; replace with your own)\nnorm_x, norm_y = 1000, 1000\n\nmessages = [\n    {\n        \"role\": \"user\",\n        \"content\": [\n            {\n                \"type\": \"image\",\n                \"image\": \"https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FVLM3\u002Fblob\u002Fmain\u002Fsample_data\u002Fdepth.jpeg\",\n            },\n            {\n                \"type\": \"text\",\n                \"text\": (\n                    f\"Given this image, how far is the point at coordinates ({norm_x}, {norm_y}) \"\n                    \"from the camera? 
The coordinates are in normalised [0, 2000] format relative \"\n                    \"to image width and height. Output the thinking process in \u003Cthink> \u003C\u002Fthink> \"\n                    \"and final answer (the meter number only, without the unit) in \u003Canswer> \u003C\u002Fanswer> tags.\"\n                ),\n            },\n        ],\n    }\n]\n\n# Preparation for inference\ninputs = processor.apply_chat_template(\n    messages,\n    tokenize=True,\n    add_generation_prompt=True,\n    return_dict=True,\n    return_tensors=\"pt\"\n)\ninputs = inputs.to(model.device)\n\n# Inference: Generation of the output\ngenerated_ids = model.generate(**inputs, max_new_tokens=128)\ngenerated_ids_trimmed = [\n    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)\n]\noutput_text = processor.batch_decode(\n    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False\n)\nprint(output_text)\n```\n\nSee the [Cookbook](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FVLM3\u002Fblob\u002Fmain\u002Finference.ipynb) for detailed examples of using different checkpoints for different tasks.\n\n## Citation\n\n    @article{cai2026vlm3,\n        title={VLM³: Vision Language Models Are Native 3D Learners},\n        author={Cai, Zhipeng and Liu, Zhuang and Xiong, Yunyang and Liu, Zechun and Chandra, Vikas and Shi, Yangyang},\n        journal={arXiv preprint arXiv:2605.30561},\n        year={2026},\n    }\n\n## Related projects\n\nThis work is largely motivated by our previous project [DepthLM](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FDepthLM_Official).\n\n    @article{cai2025depthlm,\n        title={DepthLM: Metric Depth from Vision Language Models},\n        author={Cai, Zhipeng and Yeh, Ching-Feng and Hu, Xu and Liu, Zhuang and Meyer, Gregory and Lei, Xinjie and Zhao, Changsheng and Li, Shang-Wen and Chandra, Vikas and Shi, Yangyang},\n        journal={arXiv preprint 
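arXiv:2509.25413},\n        year={2025},\n    }\n\n## License\nVLM³ is released under the FAIR CC-BY-NC license, as found in the LICENSE file.\n","The VLM³ project demonstrates that standard vision language models (VLMs) are native 3D learners. Using simple data processing, with no changes to the model architecture or loss function, it enables standard VLMs to match or surpass the state of the art on multiple 3D tasks, including object-level 3D understanding, depth estimation, pixel correspondence estimation, and camera pose estimation. Its core technical feature is achieving efficient 3D learning through a unified output domain (text) and large-scale data training, without complex model design, task-specific architectures, complex data augmentations, or task-specific loss functions. VLM³ suits 3D vision applications that need rapid development and good scalability, such as robot navigation and virtual reality environment construction, and greatly simplifies the 3D learning pipeline.",2,"2026-06-11 04:09:29","CREATED_QUERY"]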