[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-72154":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":15,"stars7d":17,"stars30d":18,"stars90d":16,"forks30d":16,"starsTrendScore":19,"compositeScore":20,"rankGlobal":10,"rankLanguage":10,"license":21,"archived":22,"fork":22,"defaultBranch":23,"hasWiki":22,"hasPages":24,"topics":25,"createdAt":10,"pushedAt":10,"updatedAt":30,"readmeContent":31,"aiSummary":32,"trendingCount":16,"starSnapshotCount":16,"syncStatus":33,"lastSyncTime":34,"discoverSource":35},72154,"SpatialLM","manycore-research\u002FSpatialLM","manycore-research","[NeurIPS 2025] SpatialLM: Training Large Language Models for Structured Indoor Modeling","https:\u002F\u002Fmanycore-research.github.io\u002FSpatialLM",null,"Python",4587,380,54,5,0,17,40,15,82.24,"Other",false,"main",true,[26,27,28,29],"mllm","point-clouds","scene-understanding","spatial-intelligence","2026-06-12 04:01:03","# SpatialLM\n\n\u003C!-- markdownlint-disable first-line-h1 -->\n\u003C!-- markdownlint-disable html -->\n\u003C!-- markdownlint-disable no-duplicate-header -->\n\n\u003Cdiv align=\"center\">\n  \u003Cimg src=\"figures\u002Flogo_light.png#gh-light-mode-only\" width=\"60%\" alt=\"SpatialLM\" \u002F>\n  \u003Cimg src=\"figures\u002Flogo_dark.png#gh-dark-mode-only\" width=\"60%\" alt=\"SpatialLM\" \u002F>\n\u003C\u002Fdiv>\n\u003Chr style=\"margin-top: 0; margin-bottom: 8px;\">\n\u003Cdiv align=\"center\" style=\"margin-top: 0; padding-top: 0; line-height: 1;\">\n    \u003Ca href=\"https:\u002F\u002Fmanycore-research.github.io\u002FSpatialLM\" target=\"_blank\" style=\"margin: 2px;\">\u003Cimg alt=\"Project\"\n    src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🌐%20Website-SpatialLM-ffc107?color=42a5f5&logoColor=white\" style=\"display: 
inline-block; vertical-align: middle;\"\u002F>\u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.07491\" target=\"_blank\" style=\"margin: 2px;\">\u003Cimg alt=\"arXiv\"\n    src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-Techreport-b31b1b?logo=arxiv&logoColor=white\" style=\"display: inline-block; vertical-align: middle;\"\u002F>\u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fmanycore-research\u002FSpatialLM\" target=\"_blank\" style=\"margin: 2px;\">\u003Cimg alt=\"GitHub\"\n    src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FGitHub-SpatialLM-24292e?logo=github&logoColor=white\" style=\"display: inline-block; vertical-align: middle;\"\u002F>\u003C\u002Fa>\n\u003C\u002Fdiv>\n\u003Cdiv align=\"center\" style=\"line-height: 1;\">\n    \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fmanycore-research\u002FSpatialLM1.1-Qwen-0.5B\" target=\"_blank\" style=\"margin: 2px;\">\u003Cimg alt=\"Hugging Face\"\n    src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%A4%97%20Hugging%20Face-SpatialLM-ffc107?color=ffc107&logoColor=white\" style=\"display: inline-block; vertical-align: middle;\"\u002F>\u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fmanycore-research\u002FSpatialLM-Dataset\" target=\"_blank\" style=\"margin: 2px;\">\u003Cimg alt=\"Dataset\"\n    src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%A4%97%20Dataset-Dataset-ffc107?color=ffc107&logoColor=white\" style=\"display: inline-block; vertical-align: middle;\"\u002F>\u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fmanycore-research\u002FSpatialLM-Testset\" target=\"_blank\" style=\"margin: 2px;\">\u003Cimg alt=\"Dataset\"\n    src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%A4%97%20Dataset-Testset-ffc107?color=ffc107&logoColor=white\" style=\"display: inline-block; vertical-align: 
middle;\"\u002F>\u003C\u002Fa>\n\u003C\u002Fdiv>\n\n## ✨ News\n\n- [Sept, 2025] [SpatialLM-Dataset](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fmanycore-research\u002FSpatialLM-Dataset) is now available on Hugging Face.\n- [Sept, 2025] SpatialLM accepted at NeurIPS 2025.\n- [Jun, 2025] Added finetuning instructions in [FINETUNE.md](.\u002FFINETUNE.md).\n- [Jun, 2025] Check out our new models: [SpatialLM1.1-Llama-1B](https:\u002F\u002Fhuggingface.co\u002Fmanycore-research\u002FSpatialLM1.1-Llama-1B) and [SpatialLM1.1-Qwen-0.5B](https:\u002F\u002Fhuggingface.co\u002Fmanycore-research\u002FSpatialLM1.1-Qwen-0.5B), now available on Hugging Face. SpatialLM1.1 doubles the point cloud resolution, incorporates a more powerful point cloud encoder [Sonata](https:\u002F\u002Fxywu.me\u002Fsonata\u002F) and supports detection with user-specified categories.\n- [Jun, 2025] SpatialLM [Technical Report](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.07491) is now on arXiv.\n- [Mar, 2025] We're excited to release the [SpatialLM-Llama-1B](https:\u002F\u002Fhuggingface.co\u002Fmanycore-research\u002FSpatialLM-Llama-1B) and [SpatialLM-Qwen-0.5B](https:\u002F\u002Fhuggingface.co\u002Fmanycore-research\u002FSpatialLM-Qwen-0.5B) on Hugging Face.\n- [Mar, 2025] Initial release of SpatialLM!\n\n## Introduction\n\nSpatialLM is a 3D large language model designed to process 3D point cloud data and generate structured 3D scene understanding outputs. These outputs include architectural elements like walls, doors, windows, and oriented object bounding boxes with their semantic categories. Unlike previous methods that require specialized equipment for data collection, SpatialLM can handle point clouds from diverse sources such as monocular video sequences, RGBD images, and LiDAR sensors. This multimodal architecture effectively bridges the gap between unstructured 3D geometric data and structured 3D representations, offering high-level semantic understanding. 
It enhances spatial reasoning capabilities for applications in embodied robotics, autonomous navigation, and other complex 3D scene analysis tasks.\n\n\u003Cdiv align=\"center\">\n  \u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Fc0218d6a-f676-41f8-ae76-bba228866306\" poster=\"figures\u002Fcover.png\"> \u003C\u002Fvideo>\n  \u003Cp>\u003Ci>SpatialLM reconstructs 3D layout from a monocular RGB video with MASt3R-SLAM. Results aligned to video with GT cameras for visualization.\u003C\u002Fi>\u003C\u002Fp>\n\u003C\u002Fdiv>\n\n## SpatialLM Models\n\n\u003Cdiv align=\"center\">\n\n|       **Model**        | **Download**                                                                      |\n| :--------------------: | --------------------------------------------------------------------------------- |\n| SpatialLM1.1-Llama-1B  | [🤗 HuggingFace](https:\u002F\u002Fhuggingface.co\u002Fmanycore-research\u002FSpatialLM1.1-Llama-1B)  |\n| SpatialLM1.1-Qwen-0.5B | [🤗 HuggingFace](https:\u002F\u002Fhuggingface.co\u002Fmanycore-research\u002FSpatialLM1.1-Qwen-0.5B) |\n| SpatialLM1.0-Llama-1B  | [🤗 HuggingFace](https:\u002F\u002Fhuggingface.co\u002Fmanycore-research\u002FSpatialLM-Llama-1B)     |\n| SpatialLM1.0-Qwen-0.5B | [🤗 HuggingFace](https:\u002F\u002Fhuggingface.co\u002Fmanycore-research\u002FSpatialLM-Qwen-0.5B)    |\n\n\u003C\u002Fdiv>\n\n## Usage\n\n### Installation\n\nTested with the following environment:\n\n- Python 3.11\n- Pytorch 2.4.1\n- CUDA Version 12.4\n\n```bash\n# clone the repository\ngit clone https:\u002F\u002Fgithub.com\u002Fmanycore-research\u002FSpatialLM.git\ncd SpatialLM\n\n# create a conda environment with cuda 12.4\nconda create -n spatiallm python=3.11\nconda activate spatiallm\nconda install -y -c nvidia\u002Flabel\u002Fcuda-12.4.0 cuda-toolkit conda-forge::sparsehash\n\n# Install dependencies with poetry\npip install poetry && poetry config virtualenvs.create false --local\npoetry install\n# SpatialLM1.0 
dependency\npoe install-torchsparse # Building wheel for torchsparse will take a while\n# SpatialLM1.1 dependency\npoe install-sonata # Building wheel for flash-attn will take a while\n```\n\n### Inference\n\nIn the current version of SpatialLM, input point clouds are assumed to be axis-aligned, with the z-axis as the up axis. This orientation is crucial for maintaining consistency in spatial understanding and scene interpretation across different datasets and applications.\nExample preprocessed point clouds, reconstructed from RGB videos using [MASt3R-SLAM](https:\u002F\u002Fgithub.com\u002Frmurai0610\u002FMASt3R-SLAM), are available in [SpatialLM-Testset](#spatiallm-testset).\n\nDownload an example point cloud:\n\n```bash\nhuggingface-cli download manycore-research\u002FSpatialLM-Testset pcd\u002Fscene0000_00.ply --repo-type dataset --local-dir .\n```\n\nRun inference:\n\n```bash\npython inference.py --point_cloud pcd\u002Fscene0000_00.ply --output scene0000_00.txt --model_path manycore-research\u002FSpatialLM1.1-Qwen-0.5B\n```\n\n### Detection with user-specified categories\n\nSpatialLM1.1 supports object detection conditioned on user-specified categories by leveraging the flexibility of LLMs.\n\nSpatialLM1.1 offers three variants of structured indoor modeling tasks:\n\n- **Structured Reconstruction**: Detect walls, doors, windows, boxes.\n- **Layout Estimation**: Detect walls, doors, windows.\n- **3D Object Detection**: Detect boxes.\n\nFor tasks that include object box estimation, you can specify a subset of the 59 furniture categories, and the model will only predict objects within those specified categories. 
For example:\n\n```bash\npython inference.py --point_cloud pcd\u002Fscene0000_00.ply --output scene0000_00.txt --model_path manycore-research\u002FSpatialLM1.1-Qwen-0.5B --detect_type object --category bed nightstand\n```\n\n### Visualization\n\nUse `rerun` to visualize the point cloud and the predicted structured 3D layout output:\n\n```bash\n# Convert the predicted layout to Rerun format\npython visualize.py --point_cloud pcd\u002Fscene0000_00.ply --layout scene0000_00.txt --save scene0000_00.rrd\n\n# Visualize the point cloud and the predicted layout\nrerun scene0000_00.rrd\n```\n\n### Evaluation\n\nTo evaluate the performance of SpatialLM, we provide an `eval.py` script that reproduces the SpatialLM-Testset benchmark results reported in the [Benchmark Results](#benchmark-results) section.\n\nDownload the testset:\n\n```bash\nhuggingface-cli download manycore-research\u002FSpatialLM-Testset --repo-type dataset --local-dir SpatialLM-Testset\n```\n\nRun evaluation:\n\n```bash\n# Run inference on the PLY point clouds in the folder SpatialLM-Testset\u002Fpcd with the SpatialLM1.1-Qwen-0.5B model\npython inference.py --point_cloud SpatialLM-Testset\u002Fpcd --output SpatialLM-Testset\u002Fpred --model_path manycore-research\u002FSpatialLM1.1-Qwen-0.5B\n\n# Evaluate the predicted layouts\npython eval.py --metadata SpatialLM-Testset\u002Ftest.csv --gt_dir SpatialLM-Testset\u002Flayout --pred_dir SpatialLM-Testset\u002Fpred --label_mapping SpatialLM-Testset\u002Fbenchmark_categories.tsv\n```\n\n### Example using a custom video\n\nWe provide an example of how to use our model to estimate scene layout starting from an RGB video with the newly released [SLAM3R](https:\u002F\u002Fgithub.com\u002FPKU-VCL-3DV\u002FSLAM3R) in [EXAMPLE.md](EXAMPLE.md). These steps also work for MASt3R-SLAM and other reconstruction methods.\n\n### Finetune on Custom Data\n\nFor instructions on fine-tuning SpatialLM on your own data, please refer to [FINETUNE.md](.\u002FFINETUNE.md). 
We provide an example using the [ARKitScenes](https:\u002F\u002Fgithub.com\u002Fapple\u002FARKitScenes) dataset.\n\n## SpatialLM Dataset\n\nThe SpatialLM dataset is a large-scale, high-quality synthetic dataset designed by professional 3D designers and used for real-world production. It contains point clouds from 12,328 diverse indoor scenes comprising 54,778 rooms, each paired with rich ground-truth 3D annotations. SpatialLM dataset provides an additional valuable resource for advancing research in indoor scene understanding, 3D perception, and related applications.\n\nFor access to photorealistic RGB\u002FDepth\u002FNormal\u002FSemantic\u002FInstance panoramic renderings and camera trajectories used to generate the SpatialLM point clouds, please refer to the [SpatialGen project](https:\u002F\u002Fmanycore-research.github.io\u002FSpatialGen) for more details.\n\n\u003Cdiv align=\"center\">\n\n|    **Dataset**    | **Download**                                                                       |\n| :---------------: | ---------------------------------------------------------------------------------- |\n| SpatialLM-Dataset | [🤗 Datasets](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fmanycore-research\u002FSpatialLM-Dataset) |\n\n\u003C\u002Fdiv>\n\n## SpatialLM Testset\n\nWe provide a test set of 107 preprocessed point clouds, reconstructed from RGB videos using [MASt3R-SLAM](https:\u002F\u002Fgithub.com\u002Frmurai0610\u002FMASt3R-SLAM). 
SpatialLM-Testset is considerably more challenging than prior datasets of clean RGBD scans because of the noise and occlusions in point clouds reconstructed from monocular RGB videos.\n\n\u003Cdiv align=\"center\">\n\n|    **Dataset**    | **Download**                                                                       |\n| :---------------: | ---------------------------------------------------------------------------------- |\n| SpatialLM-Testset | [🤗 Datasets](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fmanycore-research\u002FSpatialLM-TestSet) |\n\n\u003C\u002Fdiv>\n\n## Benchmark Results\n\n### Layout Estimation\n\nLayout estimation focuses on predicting architectural elements, i.e., walls, doors, and windows, within an indoor scene. We evaluated this task on the [Structured3D](https:\u002F\u002Fstructured3d-dataset.org) dataset. For [RoomFormer](https:\u002F\u002Fgithub.com\u002Fywyue\u002FRoomFormer), we directly downloaded the model checkpoint. SceneScript and SpatialLM were first trained on our dataset and then fine-tuned on Structured3D.\n\nWe thank @chinmay0301ucsd for identifying and fixing a bug [#88](https:\u002F\u002Fgithub.com\u002Fmanycore-research\u002FSpatialLM\u002Fpull\u002F88) in the evaluation script that affected door and window metrics. As a result, the scores are higher than previously reported.\n\n\u003Cdiv align=\"center\">\n\n|   **Method**    | **RoomFormer** | **SceneScript (finetuned)** | **SpatialLM1.1-Qwen-0.5B (finetuned)** |\n| :-------------: | :------------: | :-------------------------: | :------------------------------------: |\n| **F1 @.25 IoU** |      83.4      |            90.4             |                  94.3                  |\n| **F1 @.5 IoU**  |      81.4      |            89.2             |                  93.5                  |\n\n\u003C\u002Fdiv>\n\n### 3D Object Detection\n\nWe evaluate 3D object detection on [ScanNet](http:\u002F\u002Fwww.scan-net.org) with annotations of 18 object categories. 
For [V-DETR](https:\u002F\u002Fgithub.com\u002FV-DETR\u002FV-DETR), we directly download the model checkpoint. SceneScript and SpatialLM were first trained on our dataset, and further fine-tuned on ScanNet.\n\n\u003Cdiv align=\"center\">\n\n|   **Method**    | **V-DETR** | **SceneScript (finetuned)** | **SpatialLM1.1-Qwen-0.5B (finetuned)** |\n| :-------------: | :--------: | :-------------------------: | :------------------------------------: |\n| **F1 @.25 IoU** |    65.1    |            49.1             |                  65.6                  |\n| **F1 @.5 IoU**  |    56.8    |            36.8             |                  52.6                  |\n\n\u003C\u002Fdiv>\n\n### Zero-shot Detection on Videos\n\nZero-shot detection results on the challenging SpatialLM-Testset are reported in the following table:\n\n\u003Cdiv align=\"center\">\n\n|   **Method**    | **SpatialLM1.1-Llama-1B** | **SpatialLM1.1-Qwen-0.5B** |\n| :-------------: | :-----------------------: | :------------------------: |\n|   **Layout**    |   **F1 @.25 IoU (2D)**    |    **F1 @.25 IoU (2D)**    |\n|      wall       |           68.9            |            68.2            |\n|      door       |           49.1            |            47.4            |\n|     window      |           47.0            |            51.4            |\n|                 |                           |                            |\n|   **Objects**   |   **F1 @.25 IoU (3D)**    |    **F1 @.25 IoU (2D)**    |\n|     curtain     |           34.9            |            37.0            |\n|   nightstand    |           62.8            |            67.0            |\n|   chandelier    |           53.5            |            36.8            |\n|    wardrobe     |           29.4            |            39.6            |\n|       bed       |           96.8            |            95.2            |\n|      sofa       |           66.9            |            69.1            |\n|      chair      |           20.8            |     
       32.3            |\n|     cabinet     |           15.2            |            11.2            |\n|  dining table   |           40.7            |            24.2            |\n|     plants      |           29.5            |            26.3            |\n|   tv cabinet    |           34.4            |            27.3            |\n|  coffee table   |           56.4            |            64.9            |\n|   side table    |           14.6            |            9.7             |\n| air conditioner |           16.7            |            24.0            |\n|     dresser     |           46.7            |            46.7            |\n|      stool      |           17.6            |            30.8            |\n|  refrigerator   |            0.0            |            16.7            |\n|    painting     |           34.9            |            38.2            |\n|     carpet      |           40.3            |            24.1            |\n|       tv        |           16.0            |            18.0            |\n\n\u003C\u002Fdiv>\n\n### Result Visualizations\n\n\u003Cdiv align=\"center\">\n\n|                                                            Layout Estimation                                                            |                                                          Object Detection                                                          |                                                       Zero-shot Reconstruction                                                        |\n| :-------------------------------------------------------------------------------------------------------------------------------------: | :--------------------------------------------------------------------------------------------------------------------------------: | :-----------------------------------------------------------------------------------------------------------------------------------: |\n|                                                  
![Structured3D](.\u002Ffigures\u002Fstru3d.jpg)                                                  |                                                 ![ScanNet](.\u002Ffigures\u002Fscannet.jpg)                                                  |                                                 ![Zero-shot](.\u002Ffigures\u002Fzeroshot.jpg)                                                  |\n| [Structured3D Results](https:\u002F\u002Fmanycore-research-azure.kujiale.com\u002Fmanycore-research\u002FSpatialLM\u002Fsupplementary\u002Fvisualization_layout.html) | [ScanNet Results](https:\u002F\u002Fmanycore-research-azure.kujiale.com\u002Fmanycore-research\u002FSpatialLM\u002Fsupplementary\u002Fvisualization_object.html) | [Zeroshot Results](https:\u002F\u002Fmanycore-research-azure.kujiale.com\u002Fmanycore-research\u002FSpatialLM\u002Fsupplementary\u002Fvisualization_zeroshot.html) |\n\n\u003C\u002Fdiv>\n\n## License\n\nSpatialLM-Llama-1B is derived from Llama3.2-1B-Instruct, which is licensed under the Llama3.2 license.\nSpatialLM-Qwen-0.5B is derived from the Qwen-2.5 series, originally licensed under the Apache 2.0 License.\n\nSpatialLM1.0 is built upon the SceneScript point cloud encoder, which is licensed under the CC-BY-NC-4.0 License. TorchSparse, utilized in this project, is licensed under the MIT License.\n\nSpatialLM1.1 is built upon the Sonata point cloud encoder, whose model weights are licensed under the CC-BY-NC-4.0 License. 
Code built on Pointcept is licensed under the Apache 2.0 License.\n\n## Citation\n\nIf you find this work useful, please consider citing:\n\n```bibtex\n@inproceedings{SpatialLM,\n  title     = {SpatialLM: Training Large Language Models for Structured Indoor Modeling},\n  author    = {Mao, Yongsen and Zhong, Junhao and Fang, Chuan and Zheng, Jia and Tang, Rui and Zhu, Hao and Tan, Ping and Zhou, Zihan},\n  booktitle = {Advances in Neural Information Processing Systems},\n  year      = {2025}\n}\n```\n\n## Acknowledgements\n\nWe would like to thank the following projects that made this work possible:\n\n[Llama3.2](https:\u002F\u002Fgithub.com\u002Fmeta-llama) | [Qwen2.5](https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen2.5) | [Transformers](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftransformers) | [SceneScript](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fscenescript) | [TorchSparse](https:\u002F\u002Fgithub.com\u002Fmit-han-lab\u002Ftorchsparse) | [Sonata](https:\u002F\u002Fxywu.me\u002Fsonata\u002F) | [Pointcept](https:\u002F\u002Fgithub.com\u002FPointcept\u002FPointcept)\n","SpatialLM is a project for training large language models for structured indoor modeling. Using advanced point cloud processing techniques, it can understand complex indoor scenes and generate high-precision 3D spatial models. It adopts Sonata as a more powerful point cloud encoder and supports detection with user-defined categories, significantly improving the model's ability to capture detail and interpret environments. SpatialLM suits applications that require precise indoor spatial analysis, such as smart homes, virtual reality, and building information management.",2,"2026-06-11 03:40:37","high_star"]