[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-80060":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":14,"subscribersCount":14,"size":14,"stars1d":15,"stars7d":12,"stars30d":16,"stars90d":14,"forks30d":14,"starsTrendScore":12,"compositeScore":17,"rankGlobal":9,"rankLanguage":9,"license":18,"archived":19,"fork":19,"defaultBranch":20,"hasWiki":19,"hasPages":19,"topics":21,"createdAt":9,"pushedAt":9,"updatedAt":22,"readmeContent":23,"aiSummary":24,"trendingCount":14,"starSnapshotCount":14,"syncStatus":25,"lastSyncTime":26,"discoverSource":27},80060,"cambrian-p","cambrian-mllm\u002Fcambrian-p","cambrian-mllm","Cambrian-P: Pose-Grounded Video Understanding",null,"Python",95,4,62,0,1,27,48.8,"Other",false,"main",[],"2026-06-11 04:07:03","\u003Cdiv align=\"center\">\n\n# *Cambrian-P*:\u003Cbr> Pose-Grounded Video Understanding\n\n\u003Cp>\n    \u003Cimg src=\"figs\u002Fteaser.png\" alt=\"Cambrian-P\" width=\"800\" height=\"auto\">\n\u003C\u002Fp>\n\n\u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.22819\" target=\"_blank\">\n    \u003Cimg alt=\"arXiv\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2605.22819-red?logo=arxiv\" height=\"25\" \u002F>\n\u003C\u002Fa>\n\u003Ca href=\"https:\u002F\u002Fcambrian-mllm.github.io\u002Fcambrian-p\u002F\" target=\"_blank\">\n    \u003Cimg alt=\"Website\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🌎_Website-cambrian--mllm.github.io-blue\" height=\"25\" \u002F>\n\u003C\u002Fa>\n\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Fnyu-visionx\u002Fcambrian-p-models\" target=\"_blank\">\n    \u003Cimg alt=\"HF Models: Cambrian-P\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%A4%97%20_Models-Cambrian--P-ffc107?color=ffc107&logoColor=white\" height=\"25\" \u002F>\n\u003C\u002Fa>\n\u003Ca 
href=\"https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fnyu-visionx\u002FCambrian-P-Data\" target=\"_blank\">\n    \u003Cimg alt=\"HF Data: Cambrian-P-Data\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%A4%97%20_Data-Cambrian--P--Data-3cb371?color=3cb371&logoColor=white\" height=\"25\" \u002F>\n\u003C\u002Fa>\n\n\u003Cdiv style=\"font-family: charter;\">\n    \u003Ca href=\"https:\u002F\u002Fjihanyang.github.io\u002F\" target=\"_blank\">Jihan Yang\u003C\u002Fa>\u003Csup>1*\u003C\u002Fsup>,\n    \u003Ca href=\"https:\u002F\u002Fwww.zifanzhao.com\" target=\"_blank\">Zifan Zhao\u003C\u002Fa>\u003Csup>1*\u003C\u002Fsup>,\n    \u003Ca href=\"https:\u002F\u002Fxichenpan.com\u002F\" target=\"_blank\">Xichen Pan\u003C\u002Fa>\u003Csup>1\u003C\u002Fsup>,\n    \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fvealocia\" target=\"_blank\">Shusheng Yang\u003C\u002Fa>\u003Csup>1\u003C\u002Fsup>,\n    \u003Ca href=\"https:\u002F\u002Fjunyi42.github.io\u002F\" target=\"_blank\">Junyi Zhang\u003C\u002Fa>\u003Csup>2\u003C\u002Fsup>,\n    \u003Cbr>\n    \u003Ca href=\"https:\u002F\u002Fbingykang.github.io\u002F\" target=\"_blank\">Bingyi Kang\u003C\u002Fa>\u003Csup>1\u003C\u002Fsup>,\n    \u003Ca href=\"https:\u002F\u002Fhowardhsu.github.io\u002F\" target=\"_blank\">Hu Xu\u003C\u002Fa>\u003Csup>3\u003C\u002Fsup>,\n    \u003Ca href=\"https:\u002F\u002Fwww.sainingxie.com\u002F\" target=\"_blank\">Saining Xie\u003C\u002Fa>\u003Csup>1\u003C\u002Fsup>\n\u003C\u002Fdiv>\n\n\u003Cdiv style=\"font-family: charter;\">\n    \u003Csup>1\u003C\u002Fsup>New York University&nbsp;&nbsp;\n    \u003Csup>2\u003C\u002Fsup>UC Berkeley&nbsp;&nbsp;\n    \u003Csup>3\u003C\u002Fsup>Meta FAIR\n\u003C\u002Fdiv>\n\n\u003Cdiv style=\"font-family: charter;\">\n*Equal Contribution — JY led the project, JY and ZZ contributed equally.\n\u003C\u002Fdiv>\n\n\u003C\u002Fdiv>\n\n## Release\n\n- 🔥 **Cambrian-P** is out — [paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.22819), [model 
checkpoints](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Fnyu-visionx\u002Fcambrian-p-models), the annotated pose [Cambrian-P-Data](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fnyu-visionx\u002FCambrian-P-Data), and full training\u002Feval code are all released.\n\n## Contents\n\n- [*Cambrian-P*: Pose-Grounded Video Understanding](#cambrian-p-pose-grounded-video-understanding)\n  - [Release](#release)\n  - [Contents](#contents)\n  - [Cambrian-P Weights](#cambrian-p-weights)\n    - [VSI-Bench Performance](#vsi-bench-performance)\n    - [Pose Estimation Performance](#pose-estimation-performance)\n    - [Model Card](#model-card)\n  - [Train](#train)\n    - [Environment Preparation](#environment-preparation)\n    - [Data Preparation](#data-preparation)\n    - [Training Scripts](#training-scripts)\n  - [Evaluation](#evaluation)\n  - [Citation](#citation)\n  - [Related Projects](#related-projects)\n  - [License](#license)\n\n## Cambrian-P Weights\n\nCambrian-P is a pose-grounded video MLLM. Built on top of the [Cambrian-S](https:\u002F\u002Fgithub.com\u002Fcambrian-mllm\u002Fcambrian-s) architecture (SigLIP2-SO400m vision encoder + Qwen2.5 LLM + MLP projector), it introduces one learnable camera token per frame (via two learnable query embeddings — one for the first frame, one for the rest) and a lightweight pose head adapted from [VGGT](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fvggt). A single forward pass answers spatial video questions *and* regresses per-frame camera translation, rotation, and field-of-view — enabling both **improved spatial video QA** and **streaming camera pose estimation**.\n\n### VSI-Bench Performance\n\nSpatial video understanding on [VSI-Bench](https:\u002F\u002Fvision-x-nyu.github.io\u002Fthinking-space\u002F). 
Cambrian-P-7B (Qwen2.5-7B + SigLIP2-SO400m) achieves 73.7 average accuracy, the best among 7B-scale spatial-specialist models, with a **+4.5%** gain over Cambrian-S-7B (its no-pose counterpart) and particularly strong results on Relative Direction, Object Count, Route Plan, and Appearance Order.\n\n\u003Cp align=\"center\">\n    \u003Cimg src=\"figs\u002Fvsibench_results.png\" alt=\"VSI-Bench Performance\" width=\"900\" height=\"auto\">\n\u003C\u002Fp>\n\n### Pose Estimation Performance\n\nStreaming camera pose estimation on ScanNet, TUM-dynamic, and Sintel, following the [MonST3R](https:\u002F\u002Fgithub.com\u002FJunyi42\u002Fmonst3r) protocol. For ScanNet and TUM-dynamic we sample the first 90 frames at temporal stride 3; for Sintel we exclude static \u002F near-straight sequences. All metrics use Sim(3) alignment.\n\nCambrian-P achieves the lowest ATE on ScanNet among all streaming models, competitive with offline pipelines — without a DINOv2 encoder or a bidirectional transformer.\n\n\u003Cp align=\"center\">\n    \u003Cimg src=\"figs\u002Fpose_results.png\" alt=\"Pose Estimation Results\" width=\"900\" height=\"auto\">\n\u003C\u002Fp>\n\nMetric definitions follow [evo](https:\u002F\u002Fgithub.com\u002FMichaelGrupp\u002Fevo): ATE = absolute trajectory error RMSE (meters), RPE-t \u002F RPE-r = per-frame relative pose error RMSE (meters \u002F degrees).\n\n### Model Card\n\nWe release **five Cambrian-P-7B variants**. 
All share the same backbone (Qwen2.5-7B + SigLIP2-SO400m + per-frame camera tokens + VGGT-style pose head) and finetune from Cambrian-S-7B stage 3.\n\n| Model | Training Data | Hugging Face |\n|---|---|---|\n| **Cambrian-P-7B** | VSI | [nyu-visionx\u002FCambrian-P-7B](https:\u002F\u002Fhuggingface.co\u002Fnyu-visionx\u002FCambrian-P-7B) |\n| Cambrian-P-7B-32f | VSI | [nyu-visionx\u002FCambrian-P-7B-32f](https:\u002F\u002Fhuggingface.co\u002Fnyu-visionx\u002FCambrian-P-7B-32f) |\n| Cambrian-P-7B-Mix-MA | VSI + MapAnything | [nyu-visionx\u002FCambrian-P-7B-Mix-MA](https:\u002F\u002Fhuggingface.co\u002Fnyu-visionx\u002FCambrian-P-7B-Mix-MA) |\n| Cambrian-P-7B-Mix-3R | VSI + partial VLM-3R | [nyu-visionx\u002FCambrian-P-7B-Mix-3R](https:\u002F\u002Fhuggingface.co\u002Fnyu-visionx\u002FCambrian-P-7B-Mix-3R) |\n| Cambrian-P-7B-Mix-CamS | VSI + Cambrian-S | [nyu-visionx\u002FCambrian-P-7B-Mix-CamS](https:\u002F\u002Fhuggingface.co\u002Fnyu-visionx\u002FCambrian-P-7B-Mix-CamS) |\n\n\n## Train\n\n### Environment Preparation\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fcambrian-mllm\u002Fcambrian-p.git\nconda create -n cambrianp python=3.11 cmake=3.14.0\nconda activate cambrianp\n\ncd cambrian-p\u002Fvggt && pip install -e .\npip install hydra-core tensorboard iopath wcmatch fvcore\n\ncd .. 
&& pip install --upgrade pip && pip install -e \".[train]\"\n\n# PyTorch 2.4.1 + CUDA 12.1 + flash-attn 2.8.3\npip install torch==2.4.1+cu121 torchvision==0.19.1+cu121 torchaudio==2.4.1+cu121 \\\n    --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu121\npip install --no-deps https:\u002F\u002Fgithub.com\u002FDao-AILab\u002Fflash-attention\u002Freleases\u002Fdownload\u002Fv2.8.3\u002Fflash_attn-2.8.3+cu12torch2.4cxx11abiFALSE-cp311-cp311-linux_x86_64.whl\npip install accelerate==0.29.3 easydict matplotlib roma evo imageio OpenEXR\n```\n\nSee [`doc\u002Fenv_install.md`](doc\u002Fenv_install.md) for the detailed version.\n\n### Data Preparation\n\nCambrian-P fine-tunes from Cambrian-S-7B stage 3 on three required pieces:\n\n| Piece | Source | Size | Used for |\n|---|---|---|---|\n| 1. VSI-590K (VQA + scene geometry) | [`nyu-visionx\u002Fvsi-590k`](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fnyu-visionx\u002Fvsi-590k) | ~236 GB | Spatial QA + scene pose supervision |\n| 2. Cambrian-S 3M videos | [`nyu-visionx\u002FCambrian-S-3M`](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fnyu-visionx\u002FCambrian-S-3M) | per-source | Video backbone for the pose-annotated half of training |\n| 3. 
Cambrian-P pseudo pose annotations | [`nyu-visionx\u002FCambrian-P-Data`](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fnyu-visionx\u002FCambrian-P-Data) | ~850 MiB | Dense pose supervision on part of Cambrian-S-3M |\n\nSee [`doc\u002Fdata_preparation.md`](doc\u002Fdata_preparation.md) for the full recipe.\n\n\n### Training Scripts\n\nTraining scripts are in [`cambrianp\u002Fscripts\u002F`](cambrianp\u002Fscripts\u002F).\n\n**Sample training run** *(run [§Data Preparation Quickstart](doc\u002Fdata_preparation.md#quickstart) first)*:\n\n```bash\nconda activate cambrianp\nexport WANDB_API_KEY=\u003Cyour-key>\nexport DATA_DIR=\u002Fpath\u002Fto\u002Fvsi-590k\nexport VIPE_CAMBRIANS_DATA_ROOT=\u002Fpath\u002Fto\u002Fcambrian_s_3m\nexport VIPE_CAMBRIANS_RESULTS_ROOT=\u002Fpath\u002Fto\u002Fcambrian_p_pose\nexport OUTPUT_DIR=$PWD\u002Fckpts\n\nbash cambrianp\u002Fscripts\u002FCambrian-P-7B.sh\n```\n\n## Evaluation\n\nPlease refer to [`doc\u002Fevaluation.md`](doc\u002Fevaluation.md).\n\n## Citation\n\nIf you find our work useful for your research, please consider citing:\n\n```bibtex\n@article{yang2026cambrianp,\n  title   = {Cambrian-P: Pose-Grounded Video Understanding},\n  author  = {Yang, Jihan and Zhao, Zifan and Pan, Xichen and Yang, Shusheng and Zhang, Junyi and Kang, Bingyi and Xu, Hu and Xie, Saining},\n  journal = {arXiv preprint arXiv:2605.22819},\n  year    = {2026},\n}\n```\n\n## License\n\nSee [`LICENSE`](LICENSE).\n\n## Related Projects\n\n- [Cambrian-1](https:\u002F\u002Fgithub.com\u002Fcambrian-mllm\u002Fcambrian) — *Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs*\n- [Thinking in Space](https:\u002F\u002Fgithub.com\u002Fvision-x-nyu\u002Fthinking-space) — *Thinking in Space: How Multimodal Large Language Models See, Remember and Recall Spaces* (introduces VSI-Bench)\n- [Cambrian-S](https:\u002F\u002Fgithub.com\u002Fcambrian-mllm\u002Fcambrian-s) — spatial 
supersensing in video (shared training-data recipe)\n- [VGGT](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fvggt) — reconstruction backbone\n- [DUSt3R](https:\u002F\u002Fgithub.com\u002Fnaver\u002Fdust3r) \u002F [MonST3R](https:\u002F\u002Fgithub.com\u002FJunyi42\u002Fmonst3r) — evaluation protocol\n- [CUT3R](https:\u002F\u002Fgithub.com\u002FCUT3R\u002FCUT3R), [StreamVGGT](https:\u002F\u002Fgithub.com\u002Fwzzheng\u002FStreamVGGT), [MapAnything](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fmap-anything) — evaluation baselines\n","Cambrian-P is a pose-grounded video understanding project. It adds per-frame camera tokens and a lightweight pose head to a video MLLM, so that a single forward pass answers spatial video questions and regresses per-frame camera pose (translation, rotation, and field of view). The project is written in Python and provides model checkpoints, pose annotation data, and full training and evaluation code. Cambrian-P is suited to applications that need spatial video question answering or streaming camera pose estimation.",2,"2026-06-11 03:59:04","CREATED_QUERY"]