[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-71930":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":23,"hasPages":23,"topics":25,"createdAt":10,"pushedAt":10,"updatedAt":26,"readmeContent":27,"aiSummary":28,"trendingCount":16,"starSnapshotCount":16,"syncStatus":29,"lastSyncTime":30,"discoverSource":31},71930,"vggt","facebookresearch\u002Fvggt","facebookresearch","[CVPR 2025 Best Paper Award] VGGT: Visual Geometry Grounded Transformer","",null,"Python",13301,1481,479,247,0,34,75,228,102,119.51,"Other",false,"main",[],"2026-06-12 04:01:02","\u003Cdiv align=\"center\">\n\u003Ch1>VGGT: Visual Geometry Grounded Transformer\u003C\u002Fh1>\n\n\u003Ca href=\"https:\u002F\u002Fjytime.github.io\u002Fdata\u002FVGGT_CVPR25.pdf\" target=\"_blank\" rel=\"noopener noreferrer\">\n  \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPaper-VGGT\" alt=\"Paper PDF\">\n\u003C\u002Fa>\n\u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.11651\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2503.11651-b31b1b\" alt=\"arXiv\">\u003C\u002Fa>\n\u003Ca href=\"https:\u002F\u002Fvgg-t.github.io\u002F\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject_Page-green\" alt=\"Project Page\">\u003C\u002Fa>\n\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Ffacebook\u002Fvggt\">\u003Cimg src='https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%A4%97%20Hugging%20Face-Demo-blue'>\u003C\u002Fa>\n\n\n**[Visual Geometry Group, University of 
Oxford](https:\u002F\u002Fwww.robots.ox.ac.uk\u002F~vgg\u002F)**; **[Meta AI](https:\u002F\u002Fai.facebook.com\u002Fresearch\u002F)**\n\n\n[Jianyuan Wang](https:\u002F\u002Fjytime.github.io\u002F), [Minghao Chen](https:\u002F\u002Fsilent-chen.github.io\u002F), [Nikita Karaev](https:\u002F\u002Fnikitakaraevv.github.io\u002F), [Andrea Vedaldi](https:\u002F\u002Fwww.robots.ox.ac.uk\u002F~vedaldi\u002F), [Christian Rupprecht](https:\u002F\u002Fchrirupp.github.io\u002F), [David Novotny](https:\u002F\u002Fd-novotny.github.io\u002F)\n\u003C\u002Fdiv>\n\n```bibtex\n@inproceedings{wang2025vggt,\n  title={VGGT: Visual Geometry Grounded Transformer},\n  author={Wang, Jianyuan and Chen, Minghao and Karaev, Nikita and Vedaldi, Andrea and Rupprecht, Christian and Novotny, David},\n  booktitle={Proceedings of the IEEE\u002FCVF Conference on Computer Vision and Pattern Recognition},\n  year={2025}\n}\n```\n\n## Updates\n\n- [May 15, 2026] We fixed an implementation issue that was keeping redundant intermediate tensors in memory. With the same GPU memory budget, VGGT can now run on roughly 2-3x more input frames! See [VGGT-Omega](https:\u002F\u002Fvggt-omega.github.io\u002F) for more details.\n\n\n\n- [July 29, 2025] We've updated the license for VGGT to permit **commercial use** (excluding military applications). All code in this repository is now under a commercial-use-friendly license. However, only the newly released checkpoint [**VGGT-1B-Commercial**](https:\u002F\u002Fhuggingface.co\u002Ffacebook\u002FVGGT-1B-Commercial) is licensed for commercial usage — the original checkpoint remains non-commercial. Full license details are available [here](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fvggt\u002Fblob\u002Fmain\u002FLICENSE.txt). Access to the checkpoint requires completing an application form, which is processed by a system similar to LLaMA's approval workflow, automatically. The new checkpoint delivers similar performance to the original model. 
Please submit an issue if you notice a significant performance discrepancy.\n\n\n\n- [July 6, 2025] Training code is now available in the `training` folder, including an example to finetune VGGT on a custom dataset. \n\n\n- [June 13, 2025] Honored to receive the Best Paper Award at CVPR 2025! Apologies if I’m slow to respond to queries or GitHub issues these days. If you’re interested, our oral presentation is available [here](https:\u002F\u002Fdocs.google.com\u002Fpresentation\u002Fd\u002F1JVuPnuZx6RgAy-U5Ezobg73XpBi7FrOh\u002Fedit?usp=sharing&ouid=107115712143490405606&rtpof=true&sd=true). Another long presentation can be found [here](https:\u002F\u002Fdocs.google.com\u002Fpresentation\u002Fd\u002F1aSv0e5PmH1mnwn2MowlJIajFUYZkjqgw\u002Fedit?usp=sharing&ouid=107115712143490405606&rtpof=true&sd=true) (Note: it’s shared in .pptx format with animations — quite large, but feel free to use it as a template if helpful.)\n\n\n- [June 2, 2025] Added a script to run VGGT and save predictions in COLMAP format, with bundle adjustment support optional. The saved COLMAP files can be directly used with [gsplat](https:\u002F\u002Fgithub.com\u002Fnerfstudio-project\u002Fgsplat) or other NeRF\u002FGaussian splatting libraries.\n\n\n- [May 3, 2025] Evaluation code for reproducing our camera pose estimation results on Co3D is now available in the [evaluation](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fvggt\u002Ftree\u002Fevaluation) branch. \n\n\n## Overview\n\nVisual Geometry Grounded Transformer (VGGT, CVPR 2025) is a feed-forward neural network that directly infers all key 3D attributes of a scene, including extrinsic and intrinsic camera parameters, point maps, depth maps, and 3D point tracks, **from one, a few, or hundreds of its views, within seconds**.\n\n\n## Quick Start\n\nFirst, clone this repository to your local machine, and install the dependencies (torch, torchvision, numpy, Pillow, and huggingface_hub). 
\n\n```bash\ngit clone git@github.com:facebookresearch\u002Fvggt.git \ncd vggt\npip install -r requirements.txt\n```\n\nAlternatively, you can install VGGT as a package (\u003Ca href=\"docs\u002Fpackage.md\">click here\u003C\u002Fa> for details).\n\n\nNow, try the model with just a few lines of code:\n\n```python\nimport torch\nfrom vggt.models.vggt import VGGT\nfrom vggt.utils.load_fn import load_and_preprocess_images\n\ndevice = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n# bfloat16 is supported on Ampere and newer GPUs (Compute Capability 8.0+).\n# Only query the device capability when CUDA is available; otherwise fall back to float16.\ndtype = torch.bfloat16 if device == \"cuda\" and torch.cuda.get_device_capability()[0] >= 8 else torch.float16\n\n# Initialize the model and load the pretrained weights.\n# This will automatically download the model weights the first time it's run, which may take a while.\nmodel = VGGT.from_pretrained(\"facebook\u002FVGGT-1B\").to(device)\n\n# Load and preprocess example images (replace with your own image paths)\nimage_names = [\"path\u002Fto\u002FimageA.png\", \"path\u002Fto\u002FimageB.png\", \"path\u002Fto\u002FimageC.png\"]  \nimages = load_and_preprocess_images(image_names).to(device)\n\nwith torch.no_grad():\n    with torch.cuda.amp.autocast(dtype=dtype):\n        # Predict attributes including cameras, depth maps, and point maps.\n        predictions = model(images)\n```\n\nThe model weights will be automatically downloaded from Hugging Face. If you encounter issues such as slow loading, you can manually download them [here](https:\u002F\u002Fhuggingface.co\u002Ffacebook\u002FVGGT-1B\u002Fblob\u002Fmain\u002Fmodel.pt) and load them from the local file, or load them directly from the URL:\n\n```python\nmodel = VGGT()\n_URL = \"https:\u002F\u002Fhuggingface.co\u002Ffacebook\u002FVGGT-1B\u002Fresolve\u002Fmain\u002Fmodel.pt\"\nmodel.load_state_dict(torch.hub.load_state_dict_from_url(_URL))\n```\n\n## Detailed Usage\n\n\u003Cdetails>\n\u003Csummary>Click to expand\u003C\u002Fsummary>\n\nYou can also choose which attributes (branches) to predict, as shown below. 
This achieves the same result as the example above. This example uses a batch size of 1 (processing a single scene), but it naturally works for multiple scenes.\n\n```python\nfrom vggt.utils.pose_enc import pose_encoding_to_extri_intri\nfrom vggt.utils.geometry import unproject_depth_map_to_point_map\n\nwith torch.no_grad():\n    with torch.cuda.amp.autocast(dtype=dtype):\n        images = images[None]  # add batch dimension\n        aggregated_tokens_list, ps_idx = model.aggregator(images)\n                \n    # Predict Cameras\n    pose_enc = model.camera_head(aggregated_tokens_list)[-1]\n    # Extrinsic and intrinsic matrices, following OpenCV convention (camera from world)\n    extrinsic, intrinsic = pose_encoding_to_extri_intri(pose_enc, images.shape[-2:])\n\n    # Predict Depth Maps\n    depth_map, depth_conf = model.depth_head(aggregated_tokens_list, images, ps_idx)\n\n    # Predict Point Maps\n    point_map, point_conf = model.point_head(aggregated_tokens_list, images, ps_idx)\n        \n    # Construct 3D Points from Depth Maps and Cameras\n    # which usually leads to more accurate 3D points than point map branch\n    point_map_by_unprojection = unproject_depth_map_to_point_map(depth_map.squeeze(0), \n                                                                extrinsic.squeeze(0), \n                                                                intrinsic.squeeze(0))\n\n    # Predict Tracks\n    # choose your own points to track, with shape (N, 2) for one scene\n    query_points = torch.FloatTensor([[100.0, 200.0], \n                                        [60.72, 259.94]]).to(device)\n    track_list, vis_score, conf_score = model.track_head(aggregated_tokens_list, images, ps_idx, query_points=query_points[None])\n```\n\n\nFurthermore, if certain pixels in the input frames are unwanted (e.g., reflective surfaces, sky, or water), you can simply mask them by setting the corresponding pixel values to 0 or 1. 
Precise segmentation masks aren't necessary; simple bounding box masks work well (check this [issue](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fvggt\u002Fissues\u002F47) for an example).\n\n\u003C\u002Fdetails>\n\n\n## Interactive Demo\n\nWe provide multiple ways to visualize your 3D reconstructions. Before using these visualization tools, install the required dependencies:\n\n```bash\npip install -r requirements_demo.txt\n```\n\n### Interactive 3D Visualization\n\n**Please note:** VGGT typically reconstructs a scene in less than 1 second. However, visualizing 3D points may take tens of seconds due to third-party rendering, independent of VGGT's processing time. Visualization is especially slow when the number of images is large.\n\n\n#### Gradio Web Interface\n\nOur Gradio-based interface allows you to upload images\u002Fvideos, run reconstruction, and interactively explore the 3D scene in your browser. You can launch it on your local machine or try it on [Hugging Face](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Ffacebook\u002Fvggt).\n\n\n```bash\npython demo_gradio.py\n```\n\n\u003Cdetails>\n\u003Csummary>Click to preview the Gradio interactive interface\u003C\u002Fsummary>\n\n![Gradio Web Interface Preview](https:\u002F\u002Fjytime.github.io\u002Fdata\u002Fvggt_hf_demo_screen.png)\n\u003C\u002Fdetails>\n\n\n#### Viser 3D Viewer\n\nUse the following command to run reconstruction and visualize the point clouds in viser. Note that this script requires a path to a folder containing images. It assumes the folder contains only image files. 
You can set `--use_point_map` to use the point cloud from the point map branch, instead of the depth-based point cloud.\n\n```bash\npython demo_viser.py --image_folder path\u002Fto\u002Fyour\u002Fimages\u002Ffolder\n```\n\n## Exporting to COLMAP Format\n\nWe also support exporting VGGT's predictions directly to COLMAP format, by:\n\n```bash \n# Feedforward prediction only\npython demo_colmap.py --scene_dir=\u002FYOUR\u002FSCENE_DIR\u002F \n\n# With bundle adjustment\npython demo_colmap.py --scene_dir=\u002FYOUR\u002FSCENE_DIR\u002F --use_ba\n\n# Run with bundle adjustment using reduced parameters\n# Reduces max_query_pts from 4096 (default) to 2048 and query_frame_num from 8 (default) to 5\n# Trade-off: Potentially less robust reconstruction in complex scenes (you may consider setting query_frame_num equal to your total number of images) \n# See demo_colmap.py for additional bundle adjustment configuration options\npython demo_colmap.py --scene_dir=\u002FYOUR\u002FSCENE_DIR\u002F --use_ba --max_query_pts=2048 --query_frame_num=5\n```\n\nPlease ensure that the images are stored in `\u002FYOUR\u002FSCENE_DIR\u002Fimages\u002F`. This folder should contain only the images. Check the examples folder for the desired data structure. \n\nThe reconstruction result (camera parameters and 3D points) will be automatically saved under `\u002FYOUR\u002FSCENE_DIR\u002Fsparse\u002F` in the COLMAP format, such as:\n\n``` \nSCENE_DIR\u002F\n├── images\u002F\n└── sparse\u002F\n    ├── cameras.bin\n    ├── images.bin\n    └── points3D.bin\n```\n\n## Integration with Gaussian Splatting\n\n\nThe exported COLMAP files can be directly used with [gsplat](https:\u002F\u002Fgithub.com\u002Fnerfstudio-project\u002Fgsplat) for Gaussian Splatting training. 
Install `gsplat` following their official instructions (we recommend `gsplat==1.3.0`).\n\nAn example command to train the model is:\n```bash\ncd gsplat\npython examples\u002Fsimple_trainer.py default --data_factor 1 --data_dir \u002FYOUR\u002FSCENE_DIR\u002F --result_dir \u002FYOUR\u002FRESULT_DIR\u002F\n```\n\n\n\n## Zero-shot Single-view Reconstruction\n\nOur model shows surprisingly good performance on single-view reconstruction, although it was never trained for this task. The model does not need to duplicate the single-view image into a pair; instead, it directly infers the 3D structure from the tokens of the single-view image. Feel free to try it with our demos above, which naturally work for single-view reconstruction.\n\n\nWe did not quantitatively test monocular depth estimation performance ourselves, but [@kabouzeid](https:\u002F\u002Fgithub.com\u002Fkabouzeid) generously provided a comparison of VGGT to recent methods [here](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fvggt\u002Fissues\u002F36). VGGT shows competitive or better results compared to state-of-the-art monocular approaches such as DepthAnything v2 or MoGe, despite never being explicitly trained for single-view tasks. \n\n## Research Progression\n\nOur work builds upon a series of previous research projects. 
If you're interested in understanding how our research evolved, check out our previous works:\n\n\n\u003Ctable border=\"0\" cellspacing=\"0\" cellpadding=\"0\">\n  \u003Ctr>\n    \u003Ctd align=\"left\">\n      \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fjytime\u002FDeep-SfM-Revisited\">Deep SfM Revisited\u003C\u002Fa>\n    \u003C\u002Ftd>\n    \u003Ctd style=\"white-space: pre;\">──┐\u003C\u002Ftd>\n    \u003Ctd>\u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd align=\"left\">\n      \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FPoseDiffusion\">PoseDiffusion\u003C\u002Fa>\n    \u003C\u002Ftd>\n    \u003Ctd style=\"white-space: pre;\">─────►\u003C\u002Ftd>\n    \u003Ctd>\n      \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fvggsfm\">VGGSfM\u003C\u002Fa> ──►\n      \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fvggt\">VGGT\u003C\u002Fa>\n    \u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd align=\"left\">\n      \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fco-tracker\">CoTracker\u003C\u002Fa>\n    \u003C\u002Ftd>\n    \u003Ctd style=\"white-space: pre;\">──┘\u003C\u002Ftd>\n    \u003Ctd>\u003C\u002Ftd>\n  \u003C\u002Ftr>\n\u003C\u002Ftable>\n\n\n## Acknowledgements\n\nThanks to these great repositories: [PoseDiffusion](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FPoseDiffusion), [VGGSfM](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fvggsfm), [CoTracker](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fco-tracker), [DINOv2](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fdinov2), [Dust3r](https:\u002F\u002Fgithub.com\u002Fnaver\u002Fdust3r), [Moge](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Fmoge), [PyTorch3D](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fpytorch3d), [Sky Segmentation](https:\u002F\u002Fgithub.com\u002Fxiongzhu666\u002FSky-Segmentation-and-Post-processing), [Depth Anything 
V2](https:\u002F\u002Fgithub.com\u002FDepthAnything\u002FDepth-Anything-V2), [Metric3D](https:\u002F\u002Fgithub.com\u002FYvanYin\u002FMetric3D) and many other inspiring works in the community.\n\n## Checklist\n\n- [x] Release the training code\n- [ ] Release VGGT-500M and VGGT-200M\n\n\n## License\nSee the [LICENSE](.\u002FLICENSE.txt) file for details about the license under which this code is made available.\n\nPlease note that only this [model checkpoint](https:\u002F\u002Fhuggingface.co\u002Ffacebook\u002FVGGT-1B-Commercial) allows commercial usage. This new checkpoint achieves the same performance level (might be slightly better) as the original one, e.g., AUC@30: 90.37 vs. 89.98 on the Co3D dataset.\n","VGGT (Visual Geometry Grounded Transformer) is a Transformer model grounded in visual geometry that received the CVPR 2025 Best Paper Award. Its core function is to improve Transformer performance on computer vision tasks by incorporating visual geometry information, and it performs particularly well on tasks that require understanding the spatial relationships between objects in an image. Technical highlights include efficient memory management and a new license that permits commercial use. It suits applications that need high accuracy and a deep understanding of the geometric relationships between objects, such as autonomous driving and robot vision.",2,"2026-06-11 03:39:31","high_star"]