[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-77427":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":14,"stars7d":16,"stars30d":17,"stars90d":15,"forks30d":15,"starsTrendScore":18,"compositeScore":19,"rankGlobal":9,"rankLanguage":9,"license":9,"archived":20,"fork":20,"defaultBranch":21,"hasWiki":22,"hasPages":20,"topics":23,"createdAt":9,"pushedAt":9,"updatedAt":24,"readmeContent":25,"aiSummary":26,"trendingCount":15,"starSnapshotCount":15,"syncStatus":16,"lastSyncTime":27,"discoverSource":28},77427,"DepthVLM","hanxunyu\u002FDepthVLM","hanxunyu","🔥 Official code repository for \"Unlocking Dense Metric Depth Estimation in VLMs\"",null,"Python",128,6,3,1,0,2,71,4,2.54,false,"main",true,[],"2026-06-12 02:03:43","\u003Ch1 align=\"center\">\n  \u003Cimg src=\"assets\u002Flogo.png\" height=\"48\" alt=\"DepthVLM Logo\" align=\"absmiddle\">\n  &nbsp;Unlocking Dense Metric Depth Estimation in VLMs\n\u003C\u002Fh1>\n\n\u003Cp align=\"center\">\n    \u003Ca href=\"https:\u002F\u002Fhanxunyu.github.io\u002F\" target=\"_blank\">Hanxun Yu\u003Csup>1,2*\u003C\u002Fsup>\u003C\u002Fa>,\n    \u003Ca href=\"https:\u002F\u002Fopenreview.net\u002Fprofile?id=%7EXuan_Qu1\" target=\"_blank\">Xuan Qu\u003Csup>1,2*\u003C\u002Fsup>\u003C\u002Fa>,\n    \u003Ca href=\"https:\u002F\u002Fw-ted.github.io\u002F\" target=\"_blank\">Yuxin Wang\u003Csup>2,3\u003C\u002Fsup>\u003C\u002Fa>,\n    \u003Ca href=\"https:\u002F\u002Fperson.zju.edu.cn\u002Fen\u002Fjkzhu\" target=\"_blank\">Jianke Zhu\u003Csup>1,4\u003C\u002Fsup>\u003C\u002Fa>,\n    \u003Ca href=\"https:\u002F\u002Fwww.kelei.site\u002F\" target=\"_blank\">Lei Ke\u003Csup>2\u003C\u002Fsup>\u003C\u002Fa>\n    \u003Cbr>\n    \u003Csup>1\u003C\u002Fsup>Zhejiang University,\n    \u003Csup>2\u003C\u002Fsup>Tencent Hunyuan 
LLM,\n    \u003Csup>3\u003C\u002Fsup>HKUST,\n    \u003Csup>4\u003C\u002Fsup>Shenzhen Loop Area Institute\n\u003C\u002Fp>\n\n\u003Cdiv align=\"center\">\n    \u003Ca href='https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.15876' target=\"_blank\">\u003Cimg src='https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2605.15876-b31b1b?logo=arxiv&logoColor=red'>\u003C\u002Fa>  \n    \u003Ca href='https:\u002F\u002Fdepthvlm.github.io\u002F' target=\"_blank\">\u003Cimg src='https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject-Home%20Page-Green?logo=safari&logoColor=white'>\u003C\u002Fa>  \n    \u003Ca href='https:\u002F\u002Fhuggingface.co\u002FJonnyYu828\u002FDepthVLM-4B' target=\"_blank\">\n        \u003Cimg src='https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%93%A6%EF%B8%8F%20Hugging%20Face-Model-orange'>\n    \u003C\u002Fa>\n    \u003Ca href='https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FJonnyYu828\u002FDepthVLM-Bench' target=\"_blank\">\n        \u003Cimg src='https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%A4%97%20Hugging%20Face-Benchmark-blue'>\n    \u003C\u002Fa>\n\u003C\u002Fdiv>\n\n\n\n\n\u003Cp align=\"center\">\n  \u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F772c15d7-2fac-4bbb-9874-112073feefe7\"\n         width=\"80%\"\n         autoplay\n         muted\n         loop\n         playsinline\n         controls>\n    Your browser does not support the video tag.\n  \u003C\u002Fvideo>\n\u003C\u002Fp>\n\n\n\n\n\n\n## 🔍 Overview\n\n\u003Cdiv align=\"left\">\n\u003Cimg src=\"assets\u002Fteaser1.png\" width=\"99%\" alt=\"model\">\n\u003C\u002Fdiv>\n\u003Cbr>\n\u003Cdiv align=\"left\">\n\u003Cimg src=\"assets\u002Fteaser2.png\" width=\"99%\" alt=\"model\">\n\u003C\u002Fdiv>\n\n**DepthVLM** serves as a unified foundation model for both low-level dense geometry prediction and high-level multimodal understanding, while achieving substantially faster inference compared with existing VLM-based approaches such 
as DepthLM and Youtu-VL.\n\n\n## 📰 News\n- [2026-05-18] 🔥 We release [DepthVLM-Bench](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FJonnyYu828\u002FDepthVLM-Bench) on Hugging Face 🤗.\n- [2026-05-18] 🔥 We release the checkpoint of [DepthVLM-4B](https:\u002F\u002Fhuggingface.co\u002FJonnyYu828\u002FDepthVLM-4B) on Hugging Face 🤗.\n- [2026-05-18] 🔥 We release the training and inference code.\n- [2026-05-15] 🔥 We release the [paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.15876) of DepthVLM.\n\n\n## 🛠️ Installation\n\n```\ngit clone https:\u002F\u002Fgithub.com\u002Fhanxunyu\u002FDepthVLM.git\ncd DepthVLM\n\nconda create -n depthvlm python=3.10 -y\nconda activate depthvlm\npip install -r requirements.txt\npip install flash-attn==2.6.3 --no-build-isolation\n```\n## 📊 Data Preparation\n- Due to licensing restrictions, we are unable to directly release the curated data. Instead, we provide the full data curation pipeline for reproducibility. Please refer to [data_process.md](.\u002Fdata_process\u002Fdata_process.md) for detailed dataset-specific preparation instructions.\n- We provide visualization examples from ScanNet++ in the [examples](.\u002Fexamples) folder.\n- We also release the curated annotations of [DepthVLM-Bench](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FJonnyYu828\u002FDepthVLM-Bench) on Hugging Face 🤗.\n\n## 📦️ Pretrained Models\nWe provide the pretrained model [DepthVLM-4B](https:\u002F\u002Fhuggingface.co\u002FJonnyYu828\u002FDepthVLM-4B) on Hugging Face 🤗. 
\n\n\n## 🤖 Inference Examples\n\nRun our example inference script to generate the predicted depth maps and 3D point clouds.\n```\n# visualization examples\nbash examples\u002Frun_demo.sh\n```\n\nSpecify the annotation and dataset paths in [configs\u002Feval_datasets.conf](configs\u002Feval_datasets.conf), choose the evaluation protocol with `EVAL_MODE=\"sparse\"` for sparse-point evaluation or `EVAL_MODE=\"dense\"` for full-depth-map evaluation, and then run the evaluation script on [DepthVLM-Bench](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FJonnyYu828\u002FDepthVLM-Bench).\n```\nbash eval\u002Feval.sh\n```\n\n\n## 🚀 Two-Stage Training\nSpecify the annotation and dataset paths in [configs\u002Ftrain_datasets.conf](configs\u002Ftrain_datasets.conf), then run the following training scripts.\n\nStage 1: depth head-only training\n```\n# stage-1\nbash train\u002Ftrain-stage1.sh\n```\nStage 2: end-to-end fine-tuning\n```\nbash train\u002Ftrain-stage2.sh\n```\n[DepthVLM-4B](https:\u002F\u002Fhuggingface.co\u002FJonnyYu828\u002FDepthVLM-4B) was trained for two days on 80 NVIDIA H20 GPUs (96 GB).\n\n\n## 🔬 Experiment Results\n\n### Comparison with VLMs (Sparse Points)\n\u003Cdiv align=\"left\">\n\u003Cimg src=\"assets\u002Ftable1.png\" width=\"99%\" alt=\"model\">\n\u003C\u002Fdiv>\n\n### Comparison with Pure Vision Models (Sparse Points)\n\u003Cdiv align=\"left\">\n\u003Cimg src=\"assets\u002Ftable2.png\" width=\"99%\" alt=\"model\">\n\u003C\u002Fdiv>\n\n### Comparison with Pure Vision Models (Full Depth Map)\n\u003Cdiv align=\"left\">\n\u003Cimg src=\"assets\u002Ftable2-full-map.png\" width=\"99%\" alt=\"model\">\n\u003C\u002Fdiv>\n\n### Visualization Comparison\n\u003Cdiv align=\"left\">\n\u003Cimg src=\"assets\u002Fvisualization.png\" width=\"99%\" alt=\"model\">\n\u003C\u002Fdiv>\n\u003Cbr>\n\u003Cdiv align=\"left\">\n\u003Cimg src=\"assets\u002Fexample1.gif\" width=\"99%\" alt=\"example 1\">\n\u003C\u002Fdiv>\n\u003Cbr>\n\u003Cdiv align=\"left\">\n\u003Cimg 
src=\"assets\u002Fexample2.gif\" width=\"99%\" alt=\"example 2\">\n\u003C\u002Fdiv>\n\u003Cbr>\n\u003Cdiv align=\"left\">\n\u003Cimg src=\"assets\u002Fexample3.gif\" width=\"99%\" alt=\"example 3\">\n\u003C\u002Fdiv>\n\n## 👏 Acknowledgements\nWe are grateful for the open-source contributions of other projects:\n- [DepthLM](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FDepthLM_Official)\n- [Youtu-VL](https:\u002F\u002Fgithub.com\u002FTencentCloudADP\u002Fyoutu-vl)\n- [Qwen3-VL](https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen3-VL)\n\n\n## 🖊️ Citation\n\n```BibTeX\n@article{yu2026unlocking,\n  title={Unlocking Dense Metric Depth Estimation in VLMs},\n  author={Hanxun Yu and Xuan Qu and Yuxin Wang and Jianke Zhu and Lei Ke},\n  journal={arXiv preprint arXiv:2605.15876},\n  year={2026}\n}\n```\n","DepthVLM is a unified foundation model for both low-level dense geometry prediction and high-level multimodal understanding. By combining depth estimation with a vision-language model (VLM), the project substantially speeds up inference while maintaining high accuracy, outperforming existing VLM-based methods such as DepthLM and Youtu-VL. Technically, it builds on advanced deep learning frameworks and is optimized for large-scale datasets. It suits applications that need to process image depth information efficiently alongside semantic understanding, such as autonomous driving, robot navigation, and augmented reality.","2026-06-11 03:55:27","CREATED_QUERY"]