[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-76188":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":14,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":14,"stars7d":16,"stars30d":17,"stars90d":15,"forks30d":15,"starsTrendScore":18,"compositeScore":19,"rankGlobal":10,"rankLanguage":10,"license":10,"archived":20,"fork":20,"defaultBranch":21,"hasWiki":22,"hasPages":20,"topics":23,"createdAt":10,"pushedAt":10,"updatedAt":24,"readmeContent":25,"aiSummary":26,"trendingCount":15,"starSnapshotCount":15,"syncStatus":27,"lastSyncTime":28,"discoverSource":29},76188,"GemDepth","Yuecheng919\u002FGemDepth","Yuecheng919","【ICML 2026】GemDepth: Geometry-Embedded Features for 3D-Consistent Video Depth","",null,"Python",125,8,3,0,4,49,9,2.86,false,"main",true,[],"2026-06-12 02:03:40","\u003Cdiv align=\"center\">\n\u003Ch2 align=\"center\"> GemDepth: Geometry-Embedded Features for 3D-Consistent Video Depth \u003C\u002Fh2>\n\n[**Yuecheng Liu**](https:\u002F\u002Fgithub.com\u002FYuecheng919\u002F)\u003Csup>1\u003C\u002Fsup>, [**Junda Cheng**](https:\u002F\u002Fgithub.com\u002FJunda24)\u003Csup>1*\u003C\u002Fsup>, [**Longliang Liu**](https:\u002F\u002Fgithub.com\u002FlongliangLiu)\u003Csup>1,2\u003C\u002Fsup>, [**Wenjing Liao**](https:\u002F\u002Fgithub.com\u002Fwaldeinsamkeits)\u003Csup>1,2\u003C\u002Fsup>, [**Hanrui Cheng**](https:\u002F\u002Fgithub.com\u002FMarcelRay0312)\u003Csup>1,2\u003C\u002Fsup>, [**Yuzhou Wang**](https:\u002F\u002Fgithub.com\u002FYuzhouWang999)\u003Csup>1\u003C\u002Fsup>, [**Xin Yang**](https:\u002F\u002Fsites.google.com\u002Fview\u002Fxinyang\u002Fhome)\u003Csup>1,3\u003C\u002Fsup>\n\u003Cbr>\u003Cbr>\n\u003Csup>*\u003C\u002Fsup>Corresponding Author\n\u003Cbr>\n\u003Csup>1\u003C\u002Fsup>HUST, \u003Csup>2\u003C\u002Fsup>Carizon, \u003Csup>3\u003C\u002Fsup>Optics Valley Laboratory\n  
\u003Ch5>If you like our project, please give us a star ⭐ on GitHub for the latest updates!\u003C\u002Fh5>\n  \n [![Project Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FGemDepth-Website-green?logo=googlechrome&logoColor=green)](https:\u002F\u002Fr11031.github.io\u002Fwebsite\u002F) [![Model](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🤗%20HuggingFace-Model%20-yellow)](https:\u002F\u002Fhuggingface.co\u002FYuechengLiu\u002FGemDepth) [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-Paper-b31b1b?logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.10525)\n\u003C\u002Fdiv>\n\n## 🤗 Demo Video\n\n\u003Cdiv align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=o6Z2p-P9hSE\">\n    \u003Cimg src=\"https:\u002F\u002Fimg.youtube.com\u002Fvi\u002Fo6Z2p-P9hSE\u002Fmaxresdefault.jpg\" width=\"75%\" alt=\"GemDepth Overview Video\">\n  \u003C\u002Fa>\n\u003C\u002Fdiv>\n\n## 📢 News\n- **[2026.05.18]** 🤗🤗🤗 Evaluation datasets released on Hugging Face.\n- **[2026.05.16]** 🤗🤗🤗 Hugging Face Gradio demos released.\n- **[2026.05.16]** Added GPU memory adjustment schemes for inference and training.\n- **[2026.05.15]** 🤗🤗🤗 Pre-trained weights released on Hugging Face.\n- **[2026.05.14]** Added [`run_video_pointcloud`](https:\u002F\u002Fgithub.com\u002FYuecheng919\u002FGemDepth\u002Fevaluation\u002Finference\u002Frun_video_pointcloud) for point cloud reconstruction.\n- **[2026.05.09]** 🔥🔥🔥 GemDepth is out! It effectively recovers fine-grained details and achieves better 3D temporal consistency.\n\n\n## 👋 Introduction\n\nWelcome to the official repository for **GemDepth**!\n\nGemDepth is a framework built on the insight that an explicit awareness of camera motion and global 3D structure is a prerequisite for 3D consistency. Distinctively, GemDepth introduces a Geometry-Embedding Module (GEM) that predicts inter-frame camera poses to generate implicit geometric embeddings. 
This injection of motion priors equips the network with intrinsic 3D perception and alignment capabilities. Guided by these geometric cues, our Alternating Spatio-Temporal Transformer (ASTT) captures latent point-level correspondences to simultaneously enhance spatial precision for sharp details and enforce rigorous temporal consistency.\n\nGemDepth achieves state-of-the-art performance across multiple datasets, particularly in complex dynamic scenarios.\n\n![network](assets\u002Fmedia\u002Fnetwork.png)\n\n## 📝 Benchmark Performance\n![benchmark](assets\u002Fmedia\u002Fzero-shot.png)\n\nComparisons with state-of-the-art methods across four of the most widely used benchmarks.\n\n## ⏳ Usage\n\n### Preparation\n```Shell\ngit clone https:\u002F\u002Fgithub.com\u002FYuecheng919\u002FGemDepth\ncd GemDepth\nconda create -n gemdepth python=3.10\nconda activate gemdepth\npip install -r requirements.txt\n```\n\n### Model weights\n\n| Model | Link |\n|:----:|:----:|\n| GemDepth | [Download 🤗](https:\u002F\u002Fhuggingface.co\u002FYuechengLiu\u002FGemDepth\u002Fresolve\u002Fmain\u002Fgemdepth.pth?download=true) |\n\nThe final directory structure should look like this:\n```\nGemDepth\n├── checkpoint\u002F\n│   └── gemdepth.pth\n├── configs\u002F\n├── model\u002F\n├── ...\n```\n\n### Use our model\n```python\nimport torch\nfrom model.gemdepth import GemDepth\n\n# `args` is assumed to be an argparse namespace (encoder, max_len, target_fps, input_size, fp32).\nDEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'\n# Encoder configurations, selected with args.encoder ('vits' or 'vitl').\nmodel_configs = {\n    'vits': {'encoder': 'vits', 'features': 64, 'out_channels': [48, 96, 192, 384]},\n    'vitl': {'encoder': 'vitl', 'features': 256, 'out_channels': [256, 512, 1024, 1024]},\n}\ngemdepth = GemDepth(**model_configs[args.encoder])\ncheckpoint = 
torch.load(\".\u002Fcheckpoint\u002Fgemdepth.pth\", map_location='cpu', weights_only=False)\ngemdepth.load_state_dict(checkpoint, strict=True)\ngemdepth = gemdepth.to(DEVICE).eval()\n# read_video_frames and video_path must be supplied by the caller; their import and definition are not shown here.\nframes, target_fps = read_video_frames(video_path, args.max_len, args.target_fps, 1280)\ndepths, fps = gemdepth.infer_video_depth(frames, target_fps, input_size=args.input_size, device=DEVICE, fp32=args.fp32)\n```\n\n### Running the script on a video\n```bash\n# Video depth output only\npython evaluation\u002Finference\u002Frun_video.py --input_dir .\u002Fassets\u002Fexample_videos --output_dir .\u002Fassets\u002Fexample_result\n# Video depth and point cloud output\npython evaluation\u002Finference\u002Frun_video_pointcloud.py --input_dir .\u002Fassets\u002Fexample_videos --output_dir .\u002Fassets\u002Fexample_result\n```\nTip: If GPU memory is insufficient, you can adjust the inference settings in `model\u002Fgemdepth.py`. The default settings are:\n```python\nINFER_LEN = 32\nOVERLAP = 10\nKEYFRAMES = [0, 12, 24, 25, 26, 27, 28, 29, 30, 31]\nINTERP_LEN = 8\n```\nwhich requires about 44 GB of GPU memory. You can reduce them as follows:\n```python\nINFER_LEN = 16\nOVERLAP = 6\nKEYFRAMES = [0, 6, 12, 13, 14, 15]\nINTERP_LEN = 4\n```\nwhich requires about 25 GB of GPU memory, or:\n```python\nINFER_LEN = 8\nOVERLAP = 4\nKEYFRAMES = [0, 3, 6, 7]\nINTERP_LEN = 2\n```\nwhich requires about 15 GB of GPU memory. 
You can adjust these parameters according to your GPU memory.\n\n### Interactive Demo\n\nWe provide an interactive Gradio interface for you to easily test GemDepth on your own videos without writing any code.\n\n```bash\npip install -r demo\u002Frequirements.txt\npython demo\u002Fapp.py\n```\nOur Gradio-based interface allows you to upload videos, run video depth prediction and pointcloud reconstruction, and interactively explore the 3D scene in your browser.\n\n## ✏️ Training Data\n* [TartanAir](https:\u002F\u002Fgithub.com\u002Fcastacks\u002Ftartanair_tools)\n* [VKITTI](https:\u002F\u002Feurope.naverlabs.com\u002Fresearch\u002Fcomputer-vision\u002Fproxy-virtual-worlds-vkitti-1)\n* [VKITTI2](https:\u002F\u002Feurope.naverlabs.com\u002Fproxy-virtual-worlds-vkitti-2)\n* [PointOdyssey](https:\u002F\u002Fgithub.com\u002Fy-zheng18\u002Fpoint_odyssey)\n* [MVS-Synth](https:\u002F\u002Fphuang17.github.io\u002FDeepMVS\u002Fmvs-synth.html)\n* [Dynamic Replica](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fdynamic_stereo)\n* [IRS](https:\u002F\u002Fgithub.com\u002FHKBU-HPML\u002FIRS)\n\n## ✈️ Evaluation\n\n### Prepare Evaluation Datasets\n| Datasets      |                                               Link                                                |\n|:----:|:-------------------------------------------------------------------------------------------------:|\n| Sintel| [Download 🤗](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FYuechengLiu\u002Ftest_datasets\u002Fresolve\u002Fmain\u002Fsintel.tar.gz?download=true) |\n| KITTI| [Download 🤗](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FYuechengLiu\u002Ftest_datasets\u002Fresolve\u002Fmain\u002Fkitti.tar.gz?download=true) |\n| Bonn| [Download 🤗](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FYuechengLiu\u002Ftest_datasets\u002Fresolve\u002Fmain\u002Fbonn.tar.gz?download=true) |\n| Scannet| [Download 
🤗](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FYuechengLiu\u002Ftest_datasets\u002Fresolve\u002Fmain\u002Fscannet.tar.gz?download=true) |\n\nYou can download the evaluation datasets directly via the links above, or follow the preprocessing steps below.\n\nFollowing [VideoDepthAnything](https:\u002F\u002Fgithub.com\u002FDepthAnything\u002FVideo-Depth-Anything\u002Ftree\u002Fmain), download the raw datasets from the following links:\n[Sintel](http:\u002F\u002Fsintel.is.tue.mpg.de\u002F), [KITTI](https:\u002F\u002Fwww.cvlibs.net\u002Fdatasets\u002Fkitti\u002F), [Bonn](https:\u002F\u002Fwww.ipb.uni-bonn.de\u002Fdata\u002Frgbd-dynamic-dataset\u002Findex.html), [ScanNet](http:\u002F\u002Fwww.scan-net.org\u002F)\n\n```bash\npip install natsort\ncd dataset\u002Fdataset_extract\npython dataset_extract${dataset}.py\n```\nThis script extracts the dataset to the `dataset\u002Fdataset_extract\u002Fdataset` folder and generates the JSON file for the dataset.\n\n### Run inference\n```bash\npython evaluation\u002Finference\u002Finfer\u002Finfer.py \\\n    --infer_path ${out_path} \\\n    --json_file ${json_path} \\\n    --datasets ${dataset}\n```\nOptions:\n- `--infer_path`: path where the output results are saved\n- `--json_file`: path to the JSON file for the dataset, such as `sintel_video.json`, `kitti_video_500.json`, or `scannet_video_tae.json`\n- `--datasets`: dataset name, one of `sintel`, `kitti`, `bonn`, `scannet`\n\n### Run evaluation\n```bash\n## ~500 frames\npython evaluation\u002Feval\u002Feval.py \\\n    --infer_path ${pred_root} \\\n    --benchmark_path ${benchmark_root} \\\n    --datasets ${dataset}\n```\n\n## ✈️ Training\nTo train GemDepth on the mixed datasets, run\n```Shell\n## stage1\nCUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 accelerate launch train.py --config-name stage1\n## stage2\nCUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 accelerate launch train.py --config-name stage2\n```\nTip: If GPU memory is insufficient, you can adjust `seq_len` in the config 
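file.\n\n\n","GemDepth is a framework for generating 3D-consistent video depth maps. Its core function is to predict inter-frame camera poses with a Geometry-Embedding Module (GEM) to produce implicit geometric embeddings, combined with an Alternating Spatio-Temporal Transformer (ASTT) that captures latent point-level correspondences, thereby enhancing spatial precision and temporal consistency. The technique is particularly suited to complex dynamic scenes that require fine detail recovery and 3D temporal consistency, such as autonomous driving and virtual reality. The project is written in Python and currently has 78 stars on GitHub, indicating its popularity in the research community.",2,"2026-06-11 03:54:46","CREATED_QUERY"]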