[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-74277":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":16,"stars7d":17,"stars30d":18,"stars90d":16,"forks30d":16,"starsTrendScore":16,"compositeScore":19,"rankGlobal":10,"rankLanguage":10,"license":20,"archived":21,"fork":21,"defaultBranch":22,"hasWiki":21,"hasPages":21,"topics":23,"createdAt":10,"pushedAt":10,"updatedAt":27,"readmeContent":28,"aiSummary":29,"trendingCount":16,"starSnapshotCount":16,"syncStatus":30,"lastSyncTime":31,"discoverSource":32},74277,"lingbot-depth","Robbyant\u002Flingbot-depth","Robbyant","Masked Depth Modeling for Spatial Perception","https:\u002F\u002Ftechnology.robbyant.com\u002Flingbot-depth",null,"Python",1207,94,14,17,0,9,44,62.83,"Apache License 2.0",false,"main",[24,25,26],"depth","depth-camera","masked-image-modeling","2026-06-11 04:06:10","# LingBot-Depth: Masked Depth Modeling for Spatial Perception\n\n\n[![License](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-Apache%202.0-blue.svg)](LICENSE)\n[![Python 3.9+](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpython-3.9+-blue.svg)](https:\u002F\u002Fwww.python.org\u002Fdownloads\u002F)\n[![PyTorch 2.6+](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpytorch-2.6+-green.svg)](https:\u002F\u002Fpytorch.org\u002F)\n\n📄 **[Technical Report](https:\u002F\u002Fgithub.com\u002FRobbyant\u002Flingbot-depth\u002Fblob\u002Fmain\u002Ftech-report.pdf)** |\n📄 **[arXiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F2601.17895)** |\n🌐 **[Project Page](https:\u002F\u002Ftechnology.robbyant.com\u002Flingbot-depth)** |\n💻 **[Code](https:\u002F\u002Fgithub.com\u002Frobbyant\u002Flingbot-depth)** |\n🤗 **[Data](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Frobbyant\u002Fmdm_depth)** |\n🤗 **[Hugging 
Face](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Frobbyant\u002Flingbot-depth)** |\n🤖 **[ModelScope](https:\u002F\u002Fwww.modelscope.cn\u002Fcollections\u002FRobbyant\u002FLingBot-Depth)** |\n🤖 **[Video](https:\u002F\u002Fwww.bilibili.com\u002Fvideo\u002FBV1oa6uBXEyh)**\n\n\n**LingBot-Depth** transforms incomplete and noisy depth sensor data into high-quality, metric-accurate 3D measurements. By jointly aligning RGB appearance and depth geometry in a unified latent space, our model serves as a powerful spatial perception foundation for robot learning and 3D vision applications.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002Fteaser\u002Fteaser-crop.png\" width=\"100%\">\n\u003C\u002Fp>\n\nOur approach refines raw sensor depth into clean, complete measurements, enabling:\n- **Depth Completion & Refinement**: Fills missing regions with metric accuracy and improved quality\n- **Scene Reconstruction**: High-fidelity indoor mapping with a strong depth prior\n- **4D Point Tracking**: Accurate dynamic tracking in metric space for robot learning\n- **Dexterous Manipulation**: Robust grasping with precise geometric understanding\n\n## News\n\n- **[2026.03.31]** Our dataset for masked depth modeling is now public.\n- **[2026.02.15]** Released LingBot-Depth-v0.5, which fixes a bug in the previous version.\n\n## Artifacts Release\n\n### Model Zoo\n\nWe provide pretrained models for different scenarios:\n\n| Model | Hugging Face Model | ModelScope Model | Description |\n|-------|-----------|-----------|-------------|\n| LingBot-Depth-v0.5 | [robbyant\u002Flingbot-depth-pretrain-vitl-14-v0.5](https:\u002F\u002Fhuggingface.co\u002Frobbyant\u002Flingbot-depth-pretrain-vitl-14-v0.5\u002Ftree\u002Fmain) | [robbyant\u002Flingbot-depth-pretrain-vitl-14-v0.5](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002FRobbyant\u002Flingbot-depth-pretrain-vitl-14-v0.5)| ⭐ **Recommended!** General-purpose depth refinement and completion for both dense and sparse raw 
depth (fixes a bug in LingBot-Depth-v0.1)|\n| LingBot-Depth-v0.1 | [robbyant\u002Flingbot-depth-pretrain-vitl-14](https:\u002F\u002Fhuggingface.co\u002Frobbyant\u002Flingbot-depth-pretrain-vitl-14\u002Ftree\u002Fmain) | [robbyant\u002Flingbot-depth-pretrain-vitl-14](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002FRobbyant\u002Flingbot-depth-pretrain-vitl-14)| General-purpose depth refinement |\n| LingBot-Depth-DC | [robbyant\u002Flingbot-depth-postrain-dc-vitl14](https:\u002F\u002Fhuggingface.co\u002Frobbyant\u002Flingbot-depth-postrain-dc-vitl14\u002Ftree\u002Fmain) | [robbyant\u002Flingbot-depth-postrain-dc-vitl14](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002FRobbyant\u002Flingbot-depth-postrain-dc-vitl14)| Optimized for sparse depth completion |\n\n### Data Release\n- The curated 3M RGB-D dataset is now available at [Hugging Face](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Frobbyant\u002Fmdm_depth) and [ModelScope](https:\u002F\u002Fmodelscope.cn\u002Fdatasets\u002FRobbyant\u002FLingBot-Depth-Dataset)\n- Dataset overview:\n\n| Name | Description | Samples |\n|------|-------------|--------:|\n| **RobbyReal** | Real-world indoor scenes captured with multiple RGB-D cameras | 1,400,000 |\n| **RobbyVla** | Real-world data collected during VLA (Vision-Language-Action) robot manipulation tasks | 580,960 |\n| **RobbySim** | Simulated data rendered from two camera viewpoints | 999,264 |\n| **RobbySimVal** | Validation split of simulated data | 38,976 |\n| **Total** | | **3,019,200** |\n\n## Installation\n\n### Requirements\n\n- Python ≥ 3.9\n- PyTorch ≥ 2.0.0\n- CUDA-capable GPU (recommended)\n\n### From source\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Frobbyant\u002Flingbot-depth\ncd lingbot-depth\n\n# Install the package (use 'python -m pip' to ensure correct environment)\nconda create -n lingbot-depth python=3.9\nconda activate lingbot-depth\npython -m pip install -e .\n```\n## Quick Start\n\n**Inference:**\n\n```python\nimport 
torch\nimport cv2\nimport numpy as np\nfrom mdm.model.v2 import MDMModel\n\n# Load model\ndevice = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\nmodel = MDMModel.from_pretrained('robbyant\u002Flingbot-depth-pretrain-vitl-14-v0.5').to(device)\n\n# Load and prepare inputs\nimage = cv2.cvtColor(cv2.imread('examples\u002F0\u002Frgb.png'), cv2.COLOR_BGR2RGB)\nh, w = image.shape[:2]\nimage = torch.tensor(image \u002F 255, dtype=torch.float32, device=device).permute(2, 0, 1)[None]\n\ndepth = cv2.imread('examples\u002F0\u002Fraw_depth.png', cv2.IMREAD_UNCHANGED).astype(np.float32) \u002F 1000.0\ndepth = torch.tensor(depth, dtype=torch.float32, device=device)[None]\n\nintrinsics = np.loadtxt('examples\u002F0\u002Fintrinsics.txt')\nintrinsics[0] \u002F= w  # Normalize fx and cx by width\nintrinsics[1] \u002F= h  # Normalize fy and cy by height\nintrinsics = torch.tensor(intrinsics, dtype=torch.float32, device=device)[None]\n\n# Run inference\noutput = model.infer(\n    image,\n    depth_in=depth,\n    intrinsics=intrinsics)\n\ndepth_pred = output['depth']  # Refined depth map\npoints = output['points']      # 3D point cloud\n```\n\n**Run example:**\n\nThe model is automatically downloaded from Hugging Face on first run (no manual download needed):\n\n```bash\n# Basic usage - processes example 0\npython example.py\n\n# Use a different example (0-7 available)\npython example.py --example 1\n\n# Use depth completion optimized model\npython example.py --model robbyant\u002Flingbot-depth-postrain-dc-vitl14-v0.5\n\n# Custom output directory\npython example.py --output my_results\n\n# See all options\npython example.py --help\n```\n\nThis processes the example data and saves results to `result\u002F` (or custom directory):\n```\nresult\u002F\n├── rgb.png                 # Input RGB image\n├── depth_input.npy        # Input depth (float32, meters)\n├── depth_refined.npy      # Refined depth (float32, meters)\n├── depth_input.png        # Input depth 
visualization\n├── depth_refined.png      # Refined depth visualization\n├── depth_comparison.png   # Side-by-side comparison\n└── point_cloud.ply       # 3D point cloud\n```\n\n**Available examples:** 8 example scenes (0-7) included in `examples\u002F` directory.\n\n## Method\n\nWe introduce a masked depth modeling approach that learns robust RGB-D representations through self-supervised learning. The model employs a Vision Transformer encoder with specialized depth-aware attention mechanisms to jointly process RGB and depth inputs.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002Fattention\u002Ffig-attention-vis.png\" width=\"100%\">\n\u003C\u002Fp>\n\n**Depth-aware attention visualization.** Visualizing attention from depth queries (Q1–Q3, marked with ⋆) to RGB tokens in two scenes: (a) aquarium and (b) indoor shelf. Each row shows masked input depth, attention weights on RGB, and refined output. Different queries attend to spatially corresponding regions, demonstrating cross-modal alignment.\n\n**Key Innovations:**\n- **Masked Depth Modeling**: Self-supervised pre-training via depth reconstruction\n- **Cross-Modal Attention**: Joint RGB-Depth alignment in unified latent space\n- **Metric-Scale Preservation**: Maintains real-world measurements for downstream tasks\n\n## Training Data\n\nOur model is trained on a large-scale diverse dataset combining real-world and simulated RGB-D captures:\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002Fdataset\u002Fdiversity_figure.png\" width=\"100%\">\n\u003C\u002Fp>\n\n**Training dataset.** 2M real-world and 1M simulated samples spanning diverse indoor environments (top). 
Representative RGB-D inputs with ground truth depth (bottom).\n\n**Dataset Composition:**\n- **Real Captures**: 2M samples from residential, office, and commercial environments\n- **Simulated Data**: 1M photo-realistic renders with perfect ground truth\n- **Modalities**: RGB images, raw depth, refined ground truth depth\n- **Diversity**: Multiple sensor types, lighting conditions, and scene complexities\n\n## Applications\n\n### 4D Point Tracking\n\nLingBot-Depth provides metric-accurate 3D geometry essential for tracking dynamic targets:\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002Fdownstream_tracking\u002Ffig-dynamic-tracking.png\" width=\"100%\">\n\u003C\u002Fp>\n\n**4D point tracking.** Robust tracking in gym environments with dynamic human motion. Top: query point selection. Middle: 3D tracking on deforming geometry. Bottom: refined depth maps. Demonstrated on scooter, rowing machine, gym equipment, and pull-up bar.\n\n### Dexterous Manipulation\n\nHigh-quality geometric understanding enables reliable robotic grasping across diverse objects and materials:\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002Fdownstream_grasp\u002Ffig-grasp-demo.png\" width=\"100%\">\n\u003C\u002Fp>\n\n**Dexterous grasping.** Robust manipulation enabled by refined depth. Top: point cloud reconstruction. 
Bottom: successful grasps on steel cup, glass cup, storage box, and toy car.\n\n## Hardware Setup\n\nWe developed a scalable RGB-D capture system for large-scale data collection:\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002Fdevice\u002Fdevice-full.jpg\" width=\"60%\">\n\u003C\u002Fp>\n\n**RGB-D capture system.** Multi-sensor setup with Intel RealSense, Orbbec Gemini, and Azure Kinect for scalable real-world data collection.\n\n## Model Details\n\n### Architecture\n\n- **Encoder**: Vision Transformer (Large) with RGB-D fusion\n- **Decoder**: Multi-scale feature pyramid with specialized heads\n- **Heads**: Depth regression\n- **Training**: Masked depth modeling with reconstruction objective\n\n### Input Format\n\n**RGB Image:**\n- Shape: `[B, 3, H, W]` normalized to [0, 1]\n- Format: PyTorch tensor, float32\n\n**Depth Map:**\n- Shape: `[B, H, W]`\n- Unit: Meters (configurable via scale parameter)\n- Invalid regions: 0 or NaN\n\n**Camera Intrinsics:**\n- Shape: `[B, 3, 3]`\n- Normalized format: `fx'=fx\u002FW, fy'=fy\u002FH, cx'=cx\u002FW, cy'=cy\u002FH`\n- Example:\n  ```\n  [[fx\u002FW,   0,   cx\u002FW],\n   [  0,  fy\u002FH,  cy\u002FH],\n   [  0,    0,    1  ]]\n  ```\n\n### Output Format\n\nThe model returns a dictionary:\n\n```python\n{\n    'depth': torch.Tensor,   # Refined depth [B, H, W]\n    'points': torch.Tensor,  # Point cloud [B, H, W, 3] in camera space\n}\n```\n\n### Inference Parameters\n\n```python\nmodel.infer(\n    image,                                   # RGB tensor [B, 3, H, W]\n    depth_in=None,                           # Input depth [B, H, W]\n    use_fp16=True,                           # Mixed precision inference\n    intrinsics=None,                         # Camera intrinsics [B, 3, 3]\n)\n```\n\n## Citation\n\nIf you find this work useful for your research, please cite:\n\n```bibtex\n@article{lingbot-depth2026,\n  title={Masked Depth Modeling for Spatial Perception},\n  author={Tan, Bin and Sun, Changjiang and Qin, 
Xiage and Adai, Hanat and Fu, Zelin and Zhou, Tianxiang and Zhang, Han and Xu, Yinghao and Zhu, Xing and Shen, Yujun and Xue, Nan},\n  journal={arXiv preprint arXiv:2601.17895},\n  year={2026}\n}\n```\n\nPlease also consider citing DINOv2, which serves as our backbone:\n\n```bibtex\n@article{oquab2023dinov2,\n  title={DINOv2: Learning Robust Visual Features without Supervision},\n  author={Oquab, Maxime and Darcet, Timothée and Moutakanni, Theo and Vo, Huy and Szafraniec, Marc and Khalidov, Vasil and Fernandez, Pierre and Haziza, Daniel and Massa, Francisco and El-Nouby, Alaaeldin and others},\n  journal={Transactions on Machine Learning Research},\n  year={2024}\n}\n```\n\n## License\n\nThis project is released under the Apache License 2.0. See the [LICENSE](LICENSE) file for details.\n\n## Acknowledgments\n\nThis work builds upon several excellent open-source projects:\n\n- [DINOv2](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fdinov2) - Self-supervised vision transformer backbone\n- [Masked Autoencoders](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fmae) - Self-supervised learning framework\n- The broader open-source computer vision and robotics communities\n\n## Contact\n\nFor questions, discussions, or collaborations:\n\n- **Issues**: Open an [issue](https:\u002F\u002Fgithub.com\u002Frobbyant\u002Flingbot-depth\u002Fissues) on GitHub\n- **Email**: Contact Dr. [Bin Tan](https:\u002F\u002Ficetttb.github.io\u002F) (tanbin.tan@antgroup.com) or Dr. [Nan Xue](https:\u002F\u002Fxuenan.net) (xuenan.xue@antgroup.com)\n\n","LingBot-Depth is a depth modeling project for spatial perception that turns incomplete, noisy depth sensor data into high-quality, high-accuracy 3D measurements. It jointly aligns RGB appearance and depth geometry in a unified latent space, providing a strong foundation for robot learning and 3D vision applications. Its core capabilities include depth completion and refinement, scene reconstruction, 4D point tracking, and dexterous manipulation, improving the accuracy and completeness of depth data ranging from sparse to dense. It suits scenarios that require high-accuracy 3D environment perception, such as indoor navigation, object recognition, and grasping.",2,"2026-06-11 03:49:47","high_star"]