[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-82089":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":15,"stars7d":16,"stars30d":17,"stars90d":15,"forks30d":15,"starsTrendScore":13,"compositeScore":18,"rankGlobal":10,"rankLanguage":10,"license":19,"archived":20,"fork":20,"defaultBranch":21,"hasWiki":20,"hasPages":20,"topics":22,"createdAt":10,"pushedAt":10,"updatedAt":27,"readmeContent":28,"aiSummary":29,"trendingCount":15,"starSnapshotCount":15,"syncStatus":30,"lastSyncTime":31,"discoverSource":32},82089,"ViGeo","aigc3d\u002FViGeo","aigc3d","ViGeo: Towards Consistent Video Geometry Estimation","",null,"Python",79,1,36,0,12,34,0.9,"Apache License 2.0",false,"main",[23,24,25,26],"depth-estimation","feed-forward-models","foundation-models","pytorch","2026-06-12 02:04:23","\u003Cdiv align=\"center\">\n  \u003Ch1 align=\"center\">\u003Csup>ViGeo\u003C\u002Fsup>\u003C\u002Fh1>\n    \u003Cp>\n        \u003Cstrong>\n        \u003Ca href=\"https:\u002F\u002Fpkqbajng.github.io\u002F\" style=\"text-decoration: none; color: inherit;\">Zhu Yu\u003C\u002Fa>\u003Csup>*\u003C\u002Fsup>,\n        \u003Ca href=\"https:\u002F\u002Fg-1nonly.github.io\u002F\" style=\"text-decoration: none; color: inherit;\">Jingnan Gao\u003C\u002Fa>\u003Csup>*\u003C\u002Fsup>,\n        \u003Ca href=\"https:\u002F\u002Frm-zhang.github.io\u002F\" style=\"text-decoration: none; color: inherit;\">Runmin Zhang\u003C\u002Fa>,\n        \u003Ca href=\"https:\u002F\u002Flingtengqiu.github.io\u002F\" style=\"text-decoration: none; color: inherit;\">Lingteng Qiu\u003C\u002Fa>,\n        et al.\n        \u003C\u002Fstrong>\n    \u003C\u002Fp>\n    \u003Cp>\n        \u003Ca href=\"https:\u002F\u002Fpkqbajng.github.io\u002FViGeo\u002F\" style=\"text-decoration: none; margin: 0 
8px;\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FHomepage-ViGeo-blue?style=flat\" alt=\"Homepage\">\u003C\u002Fa>\n        \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.30060\" style=\"text-decoration: none; margin: 0 8px;\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPaper-arXiv-red?style=flat&logo=arxiv\" alt=\"arXiv\">\u003C\u002Fa>\n        \u003Ca href=\"\" style=\"text-decoration: none; margin: 0 8px;\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FModel-HuggingFace-yellow?style=flat&logo=huggingface\" alt=\"Model\">\u003C\u002Fa>\n\u003C\u002Fdiv>\n\nViGeo estimates scene geometry from either video clips or single-frame inputs, including depth, 3D points, normals, confidence, and camera poses for sequences. VideoLDCM supports depth completion for both videos and single images; in our paper, it is used as the data-refinement model to turn sparse or noisy depth observations into cleaner dense depth supervision.\n\nViGeo supports three inference modes: `offline`, `online`, and `chunk`. Use `offline` when the full input is available, `online` for streaming frame-by-frame inference, and `chunk` for long videos that should be processed in segments with cached context.\n\nFor training, please refer to the [train branch](https:\u002F\u002Fgithub.com\u002Faigc3d\u002FViGeo\u002Ftree\u002Ftrain).\nFor full benchmark evaluation, please refer to the [benchmark branch](https:\u002F\u002Fgithub.com\u002Faigc3d\u002FViGeo\u002Ftree\u002Fbenchmark).\n\n## To Do List\n\nWe have fixed several numerical errors in the paper and submitted an updated version to arXiv. Until the update appears on arXiv, please refer to `assets\u002Fpaper.pdf` for the correct version.\n\n- [x] Release ViGeo\n\n  A preliminary ViGeo checkpoint has been released. 
Please note that the current checkpoint was trained with a known issue in the loss implementation, which may cause minor visualization artifacts in camera poses and distant regions. This checkpoint is consistent with the results reported in the paper and can be used to obtain dense geometry estimation results. We are preparing an updated checkpoint with a sky mask head and will release it soon.\n\n- [ ] Release Hugging Face demo\n- [ ] Update pose benchmarks\n\n## Installation\n\nViGeo uses Python 3.10. We use PyTorch 2.7.1 with CUDA 12.6 by default, following the training environment; other compatible PyTorch\u002FCUDA versions should also work.\n\n```bash\nconda create -n vigeo python=3.10 -y\nconda activate vigeo\n\npip install torch==2.7.1 torchvision==0.22.1 torchaudio==2.7.1 --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu126\n\ngit clone https:\u002F\u002Fgithub.com\u002Faigc3d\u002FViGeo.git\ncd ViGeo\npip install -r requirements.txt\npip install -e .\n```\n\nThe lightweight inference package does not require `utils3d`. For the optional VideoLDCM data refinement demo, install the extra dependencies:\n\n```bash\npip install xformers==0.0.31 --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu126\npip install -r requirements_refine.txt\n```\n\n## Pretrained Models\n\n| Model | Download | Description |\n| --- | --- | --- |\n| ViGeo | [LINK](https:\u002F\u002Fhuggingface.co\u002Fpkqbajng\u002FViGeo) | Main visual geometry model for depth, points, normals, poses, and confidence. |\n| VideoLDCM | [LINK](https:\u002F\u002Fhuggingface.co\u002Fpkqbajng\u002FVideoLDCM) | Data-refinement model for sparse-depth filtering, Poisson completion, and depth refinement. 
|\n\n## Quick Start for ViGeo\n\nInputs are RGB tensors in `[0, 1]` with shape `[T, 3, H, W]` or `[B, T, 3, H, W]`.\n\n```python\nimport torch\n\nfrom vigeo import ViGeo\nfrom utils import load_image_sequence\n\ndevice = torch.device(\"cuda\")\nimage_paths = [\"path\u002Fto\u002FimageA.png\", \"path\u002Fto\u002FimageB.png\", \"path\u002Fto\u002FimageC.png\"]\nimages = load_image_sequence(image_paths).to(device)  # [T, 3, H, W], RGB in [0, 1]\nmodel = ViGeo.from_pretrained(\"pkqbajng\u002FViGeo\").to(device).eval()\n\nwith torch.inference_mode():\n    output = model.infer(images, mode=\"offline\")\n\ndepth = output[\"depth_pred\"]      # [T, 1, H, W]\npoints = output[\"points_pred\"]    # [T, H, W, 3]\nnormals = output[\"normal_pred\"]   # [T, H, W, 3], inward normals\nnormals_out = -normals            # outward normals for visualization\u002Fevaluation\nposes = output[\"pose_pred\"]       # [T, 3, 4], camera-to-world\nconf = output[\"conf_pred\"]        # [T, 1, H, W]\n```\n\nFor batched input `[B, T, 3, H, W]`, tensor outputs keep the leading batch dimension.\n\nViGeo uses a right-handed camera coordinate system with `(X, Y, Z) = (right, down, front)`. The raw `normal_pred` output follows the inward normal convention. The demo and normal benchmarks use outward normals for RGB normal-map visualization and evaluation; for example, a fronto-parallel wall facing the camera is visualized\u002Fevaluated with normal `(0, 0, 1)`. Please use `normals = -normal_pred` when outward normals are needed.\n\n## Inference Modes\n\nViGeo provides `offline`, `chunk`, and `online` inference modes. `offline` processes the full input sequence at once and is preferred when the complete video or image set is available.\n\n```python\noutput = model.infer(images, mode=\"offline\")\n```\n\n`chunk` is designed for long-video inference. The caller can split a long sequence into segments and keep `kv_caches` between calls. 
The default chunk size is `16`.\n\n```python\nkv_caches = None\nfor image_chunk in images.split(16, dim=0):\n    output = model.infer(\n        image_chunk,\n        mode=\"chunk\",\n        chunk_size=16,\n        kv_caches=kv_caches,\n    )\n    kv_caches = output[\"kv_caches\"]\n```\n\n`online` supports streaming inference, usually with one new frame per call.\n\n```python\nkv_caches = None\nfor image_chunk in images.split(1, dim=0):\n    output = model.infer(\n        image_chunk,\n        mode=\"online\",\n        kv_caches=kv_caches,\n    )\n    kv_caches = output[\"kv_caches\"]\n```\n\n## Quick Start for VideoLDCM\n\n`videoldcm\u002F` is kept as a package beside `vigeo\u002F`. This demo loads RGB images and sparse depth maps, then `infer` runs MoGe, Poisson completion, and VideoLDCM refinement.\n\n```python\nimport torch\n\nfrom videoldcm import videoldcm\nfrom utils import load_depth_sequence, load_image_sequence\n\nimage_paths = [\"path\u002Fto\u002Fimage_000.png\", \"path\u002Fto\u002Fimage_001.png\"]\nsparse_depth_paths = [\"path\u002Fto\u002Fsparse_depth_000.npy\", \"path\u002Fto\u002Fsparse_depth_001.npy\"]\ndevice = torch.device(\"cuda\")\n\nimage = load_image_sequence(image_paths).to(device)          # [S, 3, H, W]\nsparse_depth = load_depth_sequence(sparse_depth_paths).to(device)  # [S, 1, H, W]\n\ncompletion_model = videoldcm.from_pretrained(\"pkqbajng\u002FVideoLDCM\").to(device).eval()\n\nwith torch.inference_mode():\n    output = completion_model.infer(image=image, sparse_depth=sparse_depth)\n    refined_depth = output[\"depth_pred\"]  # [S, 1, H, W]\n```\n\n## VideoLDCM Data Refinement\n\nFor data refinement, you can expose the sparse-depth filtering and Poisson completion steps explicitly.\n\n\u003Cdetails>\n\u003Csummary>Show data refinement code\u003C\u002Fsummary>\n\n```python\nimport torch\nimport utils3d\n\nfrom videoldcm import videoldcm\nfrom videoldcm.poisson_completion import poisson_completion\nfrom utils import (\n    
load_depth_sequence,\n    load_image_sequence,\n    load_intrinsic,\n    multi_scale_filter_depth,\n)\n\nimage_paths = [\"path\u002Fto\u002Fimage_000.png\", \"path\u002Fto\u002Fimage_001.png\"]\nsparse_depth_paths = [\"path\u002Fto\u002Fsparse_depth_000.npy\", \"path\u002Fto\u002Fsparse_depth_001.npy\"]\ndevice = torch.device(\"cuda\")\n\nimage = load_image_sequence(image_paths).to(device)                # [S, 3, H, W]\nsparse_depth = load_depth_sequence(sparse_depth_paths).to(device)  # [S, 1, H, W]\n\nS, _, H, W = image.shape\nintrinsic, focal = load_intrinsic(\"path\u002Fto\u002Fintrinsic.npy\", H, W)\nintrinsic = intrinsic.to(device)  # [3, 3]\nfocal = focal.to(device)          # scalar\n\ncompletion_model = videoldcm.from_pretrained(\"pkqbajng\u002FVideoLDCM\").to(device).eval()\n\nwith torch.inference_mode():\n    moge_out = completion_model.moge.infer(\n        image,\n        apply_mask=False,\n        force_projection=True,\n    )\n    moge_mask = moge_out[\"mask\"].unsqueeze(1)  # [S, 1, H, W]\n    mono_depth = moge_out[\"depth\"].unsqueeze(1).float().masked_fill(~moge_mask, 0.0)\n\n    points_gt = utils3d.pt.depth_map_to_point_map(\n        sparse_depth.squeeze(1),\n        intrinsics=intrinsic.unsqueeze(0).expand(S, -1, -1),\n    )  # [S, H, W, 3]\n\n    # Removes wrong sparse-depth points with multi-scale geometry consistency.\n    filtered_mask = multi_scale_filter_depth(\n        moge_out[\"points\"],\n        points_gt,\n        moge_mask & (sparse_depth > 0.0001),\n        focal=focal.expand(S),\n    )  # [S, H, W]\n    prior = sparse_depth * filtered_mask.unsqueeze(1).float()\n\n    coarse_depth = poisson_completion(\n        sparse=prior,\n        mono_depth=mono_depth,\n        confidence=moge_mask.float(),\n        num_scales=5,\n        max_iter_per_scale=[5000, 2000, 1000, 500, 250],\n        max_resolution_ratio=0.5,\n    )  # [S, 1, H, W]\n    output = completion_model.infer_without_poisson(\n        image=image,\n        prior=prior,\n     
   coarse_depth=coarse_depth,\n        mask=moge_mask,\n    )\n    refined_depth = output[\"depth_pred\"]  # [S, 1, H, W]\n```\n\n\u003C\u002Fdetails>\n\n## License\n\nViGeo is licensed under the Apache License, Version 2.0. See `LICENSE` for details.\n\n## Bibtex\n```bibtex\n@article{yu2026vigeo,\n  title={Towards Consistent Video Geometry Estimation},\n  author={Yu, Zhu and Gao, Jingnan and Zhang, Runmin and Qiu, Lingteng and Zhao, Zhengyi and Peng, Rui\n          and Yan, Yichao and Qiu, Kejie and Zhu, Siyu and Dong, Zilong and Cao, Si-Yuan and Shen, Hui-Liang},\n  journal={arXiv:2605.30060},\n  year={2026}\n}\n```\n\n","ViGeo is a video geometry estimation project that estimates scene depth, 3D points, normals, confidence, and camera poses from video clips or single-frame images. Built on PyTorch, it uses feed-forward and foundation-model approaches for depth estimation and data refinement; in particular, the VideoLDCM module turns sparse or noisy depth observations into cleaner, dense depth supervision. The project supports both offline and online inference, which suits applications that need real-time geometry from video streams, such as autonomous driving and augmented reality. ViGeo also provides segmented processing for long videos to accommodate videos of different lengths.",2,"2026-06-11 04:07:42","CREATED_QUERY"]