[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-72417":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":23,"hasPages":25,"topics":26,"createdAt":10,"pushedAt":10,"updatedAt":27,"readmeContent":28,"aiSummary":29,"trendingCount":16,"starSnapshotCount":16,"syncStatus":30,"lastSyncTime":31,"discoverSource":32},72417,"FoundationStereo","NVlabs\u002FFoundationStereo","NVlabs","[CVPR 2025 Best Paper Nomination] FoundationStereo: Zero-Shot Stereo Matching","https:\u002F\u002Fnvlabs.github.io\u002FFoundationStereo\u002F",null,"Python",2748,262,51,78,0,9,20,68,27,94.06,"Other",false,"master",true,[],"2026-06-12 04:01:05","# FoundationStereo: Zero-Shot Stereo Matching\n\nThis is the official implementation of our paper accepted by CVPR 2025 Oral (**Best Paper Nomination**)\n\n[[Website]](https:\u002F\u002Fnvlabs.github.io\u002FFoundationStereo\u002F) [[Paper]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.09898) [[Video]](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=R7RgHxEXB3o)\n\nAuthors: Bowen Wen, Matthew Trepte, Joseph Aribido, Jan Kautz, Orazio Gallo, Stan Birchfield\n\n# Abstract\nTremendous progress has been made in deep stereo matching to excel on benchmark datasets through per-domain fine-tuning. However, achieving strong zero-shot generalization — a hallmark of foundation models in other computer vision tasks — remains challenging for stereo matching. We introduce FoundationStereo, a foundation model for stereo depth estimation designed to achieve strong zero-shot generalization. 
To this end, we first construct a large-scale (1M stereo pairs) synthetic training dataset featuring large diversity and high photorealism, followed by an automatic self-curation pipeline to remove ambiguous samples. We then design a number of network architecture components to enhance scalability, including a side-tuning feature backbone that adapts rich monocular priors from vision foundation models to mitigate the sim-to-real gap, and long-range context reasoning for effective cost volume filtering. Together, these components lead to strong robustness and accuracy across domains, establishing a new standard in zero-shot stereo depth estimation.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Fraw.githubusercontent.com\u002FNVlabs\u002FFoundationStereo\u002Fwebsite\u002Fstatic\u002Fimages\u002Fintro.jpg\" width=\"800\"\u002F>\n\u003C\u002Fp>\n\n\n**TLDR**: Our method takes as input a pair of stereo images and outputs a dense disparity map, which can be converted to a metric-scale depth map or 3D point cloud.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\".\u002Fteaser\u002Finput_output.gif\" width=\"600\"\u002F>\n\u003C\u002Fp>\n\n# Changelog\n| Date       | Description                                                                                                         |\n|------------|---------------------------------------------------------------------------------------------------------------------|\n| 2025\u002F12\u002F15 | Checkout our real-time model [Fast-FoundationStereo](https:\u002F\u002Fnvlabs.github.io\u002FFast-FoundationStereo\u002F)\n| 2025\u002F08\u002F05 | Our commercial model is available now at [here](https:\u002F\u002Fcatalog.ngc.nvidia.com\u002Forgs\u002Fnvidia\u002Fteams\u002Ftao\u002Fmodels\u002Ffoundationstereo)! |\n| 2025\u002F07\u002F03 | Improve ONNX and TRT support. 
Add support for Jetson                                                                |\n\n\n# Leaderboards 🏆\nWe obtained the 1st place on the world-wide [Middlebury leaderboard](https:\u002F\u002Fvision.middlebury.edu\u002Fstereo\u002Feval3\u002F) and [ETH3D leaderboard](https:\u002F\u002Fwww.eth3d.net\u002Flow_res_two_view).\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Fraw.githubusercontent.com\u002FNVlabs\u002FFoundationStereo\u002Fwebsite\u002Fstatic\u002Fimages\u002Fmiddlebury_leaderboard.jpg\" width=\"700\"\u002F>\n  \u003Cbr>\n  \u003Cimg src=\"https:\u002F\u002Fraw.githubusercontent.com\u002FNVlabs\u002FFoundationStereo\u002Fwebsite\u002Fstatic\u002Fimages\u002Feth_leaderboard.png\" width=\"700\"\u002F>\n\u003C\u002Fp>\n\n\n# Comparison with Monocular Depth Estimation\nOur method outperforms existing approaches in zero-shot stereo matching tasks across different scenes.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Fraw.githubusercontent.com\u002FNVlabs\u002FFoundationStereo\u002Fwebsite\u002Fstatic\u002Fimages\u002Fmono_comparison.png\" width=\"700\"\u002F>\n\u003C\u002Fp>\n\n# Installation\n\nWe've tested on Linux with GPU 3090, 4090, A100, V100, Jetson Orin. Other GPUs should also work, but make sure you have enough memory\n\n```\nconda env create -f environment.yml\nconda run -n foundation_stereo pip install flash-attn\nconda activate foundation_stereo\n```\n\nNote that `flash-attn` needs to be installed separately to avoid [errors during environment creation](https:\u002F\u002Fgithub.com\u002FNVlabs\u002FFoundationStereo\u002Fissues\u002F20).\n\n\n# Model Weights\n- Download the foundation model for zero-shot inference on your data. Put the entire folder (e.g. 
`23-51-11`) under `.\u002Fpretrained_models\u002F`.\n\n\n| Model     | Description                                                                 |\n|-----------|-----------------------------------------------------------------------------|\n| [23-51-11](https:\u002F\u002Fdrive.google.com\u002Fdrive\u002Ffolders\u002F1VhPebc_mMxWKccrv7pdQLTvXYVcLYpsf?usp=sharing)  | Our best performing model for general use, based on Vit-large               |\n| [11-33-40](https:\u002F\u002Fdrive.google.com\u002Fdrive\u002Ffolders\u002F1VhPebc_mMxWKccrv7pdQLTvXYVcLYpsf?usp=sharing)  | Slightly lower accuracy but faster inference, based on Vit-small            |\n| [NVIDIA-TAO](https:\u002F\u002Fcatalog.ngc.nvidia.com\u002Forgs\u002Fnvidia\u002Fteams\u002Ftao\u002Fmodels\u002Ffoundationstereo)       | For commercial usage (adapted from Vit-small model)                 |\n\n# Run demo\n```\npython scripts\u002Frun_demo.py --left_file .\u002Fassets\u002Fleft.png --right_file .\u002Fassets\u002Fright.png --ckpt_dir .\u002Fpretrained_models\u002F23-51-11\u002Fmodel_best_bp2.pth --out_dir .\u002Ftest_outputs\u002F\n```\nYou can see output point cloud.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\".\u002Fteaser\u002Foutput.jpg\" width=\"700\"\u002F>\n\u003C\u002Fp>\n\nTips:\n- The input left and right images should be **rectified and undistorted**, which means there should not be fisheye kind of lens distortion and the epipolar lines are horizontal between the left\u002Fright images. If you obtain images from stereo cameras such as Zed, they usually have [handled this](https:\u002F\u002Fgithub.com\u002Fstereolabs\u002Fzed-sdk\u002Fblob\u002F3472a79fc635a9cee048e9c3e960cc48348415f0\u002Frecording\u002Fexport\u002Fsvo\u002Fpython\u002Fsvo_export.py#L124) for you.\n- Do not swap left and right image. 
The left image must come from the left-side camera (objects appear farther to the right in the left image than in the right image).\n- We recommend using PNG files, which have no lossy compression.\n- Our method works best on stereo RGB images. However, we have also tested it on monochrome and IR stereo images (e.g. from the RealSense D4XX series), and it works well on those too.\n- For all options and instructions, run `python scripts\u002Frun_demo.py --help`.\n- To get a point cloud for your own data, you need to specify the camera intrinsics. In the intrinsic file passed in the args, the 1st line is the flattened 1x9 intrinsic matrix and the 2nd line is the baseline (distance) between the left and right cameras, in meters.\n- For high-resolution images (>1000px), you can either (1) run with `--hiera 1` to enable hierarchical inference, which gives full-resolution depth but is slower; or (2) run at a smaller scale, e.g. `--scale 0.5`, which gives lower-resolution depth but is faster.\n- For faster inference, you can reduce the input image resolution with e.g. `--scale 0.5`, and reduce the number of refinement iterations with e.g. 
`--valid_iters 16`.\n\n\n\n# ONNX\u002FTensorRT(TRT) Inference\n\nWe only support docker setup for ONNX\u002FTRT version.\n\n- Build docker (tested on NVIDIA Driver Version: 560.35.03, CUDA Version: 12.6)\n```bash\nexport DIR=$(pwd)\ncd docker && docker build --network host -t foundation_stereo .\nbash run_container.sh\ncd \u002F\ngit clone https:\u002F\u002Fgithub.com\u002Fonnx\u002Fonnx-tensorrt.git\ncd onnx-tensorrt\npython3 setup.py install\napt-get install -y libnvinfer-dispatch10 libnvinfer-bin tensorrt\ncd $DIR\n```\n\n\n- Make ONNX:\n```\nXFORMERS_DISABLED=1 python scripts\u002Fmake_onnx.py --save_path .\u002Fpretrained_models\u002Ffoundation_stereo.onnx --ckpt_dir .\u002Fpretrained_models\u002F23-51-11\u002Fmodel_best_bp2.pth --height 448 --width 672 --valid_iters 20\n```\n\n- Convert to TRT:\n```\ntrtexec --onnx=pretrained_models\u002Ffoundation_stereo.onnx --verbose --saveEngine=pretrained_models\u002Ffoundation_stereo.plan --fp16\n```\n\n- Run TRT:\n```\npython scripts\u002Frun_demo_tensorrt.py \\\n        --left_img ${PWD}\u002Fassets\u002Fleft.png \\\n        --right_img ${PWD}\u002Fassets\u002Fright.png \\\n        --save_path ${PWD}\u002Foutput \\\n        --pretrained pretrained_models\u002Ffoundation_stereo.plan \\\n        --height 448 \\\n        --width 672 \\\n        --pc \\\n        --z_far 100.0\n```\n\nWe have observed 6X speed on the same GPU 3090 with TensorRT FP16. Although how much it speeds up depends on various factors, we recommend trying it out if you care about faster inference. 
Also remember to adjust the args setting based on your need.\n\n# Running on Jetson\nPlease refer to [readme_jetson.md](readme_jetson.md).\n\n# FSD Dataset\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Fraw.githubusercontent.com\u002FNVlabs\u002FFoundationStereo\u002Fwebsite\u002Fstatic\u002Fimages\u002Fsdg_montage.jpg\" width=\"800\"\u002F>\n\u003C\u002Fp>\n\nYou can download the whole dataset [here](https:\u002F\u002Fdrive.google.com\u002Fdrive\u002Ffolders\u002F1YdC2a0_KTZ9xix_HyqNMPCrClpm0-XFU?usp=sharing) (>1TB). We also provide a small [sample data](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1dJwK5x8xsaCazz5xPGJ2OKFIWrd9rQT5\u002Fview?usp=drive_link) (3GB) to peek. The whole dataset contains ~1M data points, where each consists of:\n- Left and right images\n- Ground-truth disparity\n\nYou can check how to read data by using our example with the sample data:\n```\npython scripts\u002Fvis_dataset.py --dataset_path .\u002FDATA\u002Fsample\u002Fmanipulation_v5_realistic_kitchen_2500_1\u002Fdataset\u002Fdata\u002F\n```\n\nIt will produce:\n\u003Cp align=\"center\">\n  \u003Cimg src=\".\u002Fteaser\u002Ffsd_sample.png\" width=\"800\"\u002F>\n\u003C\u002Fp>\n\nFor dataset license, please check [this](https:\u002F\u002Fgithub.com\u002FNVlabs\u002FFoundationStereo\u002Fblob\u002Fmaster\u002FLICENSE).\n\n\n# FAQ\n- Q: Conda install does not work for me?\u003Cbr>\n  A: Check [this](https:\u002F\u002Fgithub.com\u002FNVlabs\u002FFoundationStereo\u002Fissues\u002F20)\n\n- Q: I'm not getting point cloud or getting incomplete point cloud?\u003Cbr>\n  A: Check the flags in argparse about point cloud processing, such as `--z_far`, `--remove_invisible`, `--denoise_cloud`.\n\n- Q: My GPU doesn't support Flash attention?\u003Cbr>\n  A: See [this](https:\u002F\u002Fgithub.com\u002FNVlabs\u002FFoundationStereo\u002Fissues\u002F13#issuecomment-2708791825)\n\n- Q: RuntimeError: cuDNN error: CUDNN_STATUS_NOT_SUPPORTED. 
This error may appear if you passed in a non-contiguous input.\u003Cbr>\n  A: This may indicate an out-of-memory (OOM) issue. Try reducing your image resolution or using a GPU with more memory.\n\n- Q: How to run with RealSense?\u003Cbr>\n  A: See [this](https:\u002F\u002Fgithub.com\u002FNVlabs\u002FFoundationStereo\u002Fissues\u002F26) and [this](https:\u002F\u002Fgithub.com\u002FNVlabs\u002FFoundationStereo\u002Fissues\u002F80)\n\n- Q: I have two or more RGB cameras. Can I run this? \u003Cbr>\n  A: You can first rectify a pair of images with this [OpenCV function](https:\u002F\u002Fdocs.opencv.org\u002F4.x\u002Fd9\u002Fd0c\u002Fgroup__calib3d.html#ga617b1685d4059c6040827800e72ad2b6) to obtain a stereo image pair (after rectification the two views have no relative rotation), then feed the pair into FoundationStereo.\n\n- Q: How to run on Windows? \u003Cbr>\n  A: See [this](https:\u002F\u002Fgithub.com\u002FNVlabs\u002FFoundationStereo\u002Fissues\u002F219).\n\n- Q: Can I use it for commercial purposes? \u003Cbr>\n  A: We released a commercial version [here](https:\u002F\u002Fcatalog.ngc.nvidia.com\u002Forgs\u002Fnvidia\u002Fteams\u002Ftao\u002Fmodels\u002Ffoundationstereo). You can also drop me an email at bowenw@nvidia.com for further inquiries.\n\n\n# BibTeX\n```\n@article{wen2025stereo,\n  title={FoundationStereo: Zero-Shot Stereo Matching},\n  author={Bowen Wen and Matthew Trepte and Joseph Aribido and Jan Kautz and Orazio Gallo and Stan Birchfield},\n  journal={CVPR},\n  year={2025}\n}\n```\n\n# Acknowledgement\nWe would like to thank Gordon Grigor, Jack Zhang, Karsten Patzwaldt, Hammad Mazhar and other NVIDIA Isaac team members for their tremendous engineering support and valuable discussions. 
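Thanks to the authors of [DINOv2](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fdinov2), [DepthAnything V2](https:\u002F\u002Fgithub.com\u002FDepthAnything\u002FDepth-Anything-V2), [Selective-IGEV](https:\u002F\u002Fgithub.com\u002FWindsrain\u002FSelective-Stereo) and [RAFT-Stereo](https:\u002F\u002Fgithub.com\u002Fprinceton-vl\u002FRAFT-Stereo) for their code release. Finally, thanks to CVPR reviewers and AC for their appreciation of this work and constructive feedback.\n\n\n# Contact\nFor commercial inquiries, additional technical support, and other questions, please reach out to [Bowen Wen](https:\u002F\u002Fwenbowen123.github.io\u002F) (bowenw@nvidia.com).\n","FoundationStereo is a zero-shot stereo matching project that takes a pair of stereo images as input and outputs a dense disparity map, which can be converted to a metric-scale depth map or 3D point cloud. Its core contributions include a large-scale (1 million stereo pairs), highly photorealistic synthetic training dataset, an automatic self-curation pipeline that removes ambiguous samples, and a set of network architecture components that improve the model's scalability. Technically, it adapts rich monocular priors from vision foundation models to reduce the sim-to-real gap and uses long-range context reasoning for effective cost volume filtering. The project suits scenarios that need robustness and accuracy across domains without domain-specific fine-tuning, such as autonomous driving and robot navigation.",2,"2026-06-11 03:41:58","high_star"]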