[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-2721":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":15,"stars7d":17,"stars30d":18,"stars90d":16,"forks30d":16,"starsTrendScore":19,"compositeScore":20,"rankGlobal":10,"rankLanguage":10,"license":21,"archived":22,"fork":22,"defaultBranch":23,"hasWiki":24,"hasPages":24,"topics":25,"createdAt":10,"pushedAt":10,"updatedAt":30,"readmeContent":31,"aiSummary":32,"trendingCount":16,"starSnapshotCount":16,"syncStatus":17,"lastSyncTime":33,"discoverSource":34},2721,"Volt","YilmazKadir\u002FVolt","YilmazKadir","Volume Transformer: Revisiting Vanilla Transformers for 3D Scene Understanding","https:\u002F\u002Fwww.vision.rwth-aachen.de\u002FVolt",null,"Python",156,5,8,1,0,2,22,3,47.53,"MIT License",false,"main",true,[26,27,28,29],"3d-scene-understanding","instance-segmentation","semantic-segmentation","transformers","2026-06-12 04:00:15","\u003Ch1 align=\"center\">Volume Transformer: Revisiting Vanilla Transformers for 3D Scene Understanding\u003C\u002Fh1>\n\n\u003Cp align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2604.19609\">Paper\u003C\u002Fa>\n  ·\n  \u003Ca href=\"http:\u002F\u002Fvision.rwth-aachen.de\u002FVolt\">Project Page\u003C\u002Fa>\n  ·\n  \u003Ca href=\"#citation\">BibTeX\u003C\u002Fa>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  This repository contains the official implementation of Volume Transformer (Volt).\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Fomnomnom.vision.rwth-aachen.de\u002Fdata\u002FVolt\u002FVolt.jpg\" alt=\"main_figure\" width=\"900\" \u002F>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  Volt partitions the input 3D scene into non-overlapping volumetric patches and embeds each patch into a token 
with a linear tokenizer. The resulting token sequence is processed by a Transformer encoder with global attention. The latent tokens are then upsampled back to the voxel resolution with a single transposed convolution and mapped to semantic predictions by a linear classification head.\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  The core Volt model implementation can be found in\n  \u003Ca href=\"pointcept\u002Fmodels\u002Fvolt\u002Fvolt_base.py\">\u003Ccode>pointcept\u002Fmodels\u002Fvolt\u002Fvolt_base.py\u003C\u002Fcode>\u003C\u002Fa>.\n\u003C\u002Fp>\n\n## 📢 News\n\n- 2026-04-22: Code release.\n\n## Setup\n\nThis repository is built on top of [Pointcept](https:\u002F\u002Fgithub.com\u002FPointcept\u002FPointcept\u002Fblob\u002F04a0232b70f5c7091ffdc6bfe7a476e3eb7daff2) and incorporates components from [SGIFormer](https:\u002F\u002Fgithub.com\u002FRayYoh\u002FSGIFormer\u002Fblob\u002F4c05d57bbbd676b6a2398b03deac916e603a9dd7) for instance segmentation. For integrating image features with 3D backbones, please refer to our [DITR](https:\u002F\u002Fgithub.com\u002FVisualComputingInstitute\u002Fditr) codebase.\n\n### Dependencies\nWe recommend using [`uv`](https:\u002F\u002Fdocs.astral.sh\u002Fuv\u002F#highlights), a fast Python package and environment manager, to install the environment.\n\nTo install `uv` on macOS and Linux, run:\n```bash\ncurl -LsSf https:\u002F\u002Fastral.sh\u002Fuv\u002Finstall.sh | sh\n```\n\nThen set up the environment with:\n```bash\n# Make sure to load CUDA 12.6 beforehand\n# This will automatically create a virtual environment (.venv) and install dependencies from pyproject.toml\nuv sync\nsource .venv\u002Fbin\u002Factivate\n```\n\n## Data Preprocessing\nFollow the dataset setup instructions in the [Pointcept README](https:\u002F\u002Fgithub.com\u002FPointcept\u002FPointcept\u002Fblob\u002F04a0232b70f5c7091ffdc6bfe7a476e3eb7daff2\u002FREADME.md).\n\n### Indoor Datasets\nPreprocessing for indoor datasets is identical to Pointcept.\n\n### 
nuScenes\nFor **nuScenes**, run the preprocessing script below. Unlike Pointcept preprocessing, we additionally write panoptic labels to the `.pkl` files.\n```bash\nuv run --no-project --python 3.12 --with nuscenes-devkit python pointcept\u002Fdatasets\u002Fpreprocessing\u002Fnuscenes\u002Fpreprocess_nuscenes_info.py --dataset_root ${NUSCENES_DIR} --output_root ${PROCESSED_NUSCENES_DIR}\n```\n\n### SemanticKITTI\nFor **SemanticKITTI**, run the following script to generate the instance database used for instance CutMix.\n\n```bash\npython pointcept\u002Fdatasets\u002Fpreprocessing\u002Fsemantic_kitti\u002Fbuild_instance_db_h5.py --dataset_root ${KITTI_DIR} --output_root \"data\u002Fsemantic_kitti_instances\"\n```\n\n### Waymo\nFor **Waymo**, run the preprocessing script below. Waymo provides point clouds from multiple LiDAR sensors. Unlike Pointcept preprocessing, we use only the points from the TOP LiDAR sensor, since only those points have semantic labels.\n\n```bash\nuv run --no-project --python 3.10 --with waymo-open-dataset-tf-2-11-0 python pointcept\u002Fdatasets\u002Fpreprocessing\u002Fwaymo\u002Fpreprocess_waymo.py --dataset_root ${WAYMO_DIR} --output_root ${PROCESSED_WAYMO_DIR} --splits training validation --num_workers ${NUM_WORKERS}\n```\n\n## Train\n\nDownload the UNet teacher weights from [Hugging Face](https:\u002F\u002Fhuggingface.co\u002FKadirYilmaz\u002FVolt\u002Ftree\u002Fmain):\n\n```bash\nhf download KadirYilmaz\u002FVolt --include \"teacher_weights\u002F*.pth\" --local-dir weights\u002F\n```\nThen, run the training script with the `semseg-volt-distill` config for each dataset.\n\n```bash\n### ScanNet\nsh scripts\u002Ftrain.sh -g 4 -d scannet -c semseg-volt-distill -n semseg-volt-distill\n### ScanNet200\nsh scripts\u002Ftrain.sh -g 4 -d scannet200 -c semseg-volt-distill -n semseg-volt-distill\n### ScanNet++\nsh scripts\u002Ftrain.sh -g 4 -d scannetpp -c semseg-volt-distill -n semseg-volt-distill\n### nuScenes\nsh scripts\u002Ftrain.sh -g 4 -d nuscenes -c 
semseg-volt-distill -n semseg-volt-distill\n### SemanticKITTI\nsh scripts\u002Ftrain.sh -g 4 -d semantic_kitti -c semseg-volt-distill -n semseg-volt-distill\n### Waymo\nsh scripts\u002Ftrain.sh -g 4 -d waymo -c semseg-volt-distill -n semseg-volt-distill\n```\n\nFor joint training, use the `semseg-volt-joint-small` config instead.\n```bash\n### ScanNet\nsh scripts\u002Ftrain.sh -g 4 -d scannet -c semseg-volt-joint-small -n semseg-volt-joint-small\n### ScanNet200\nsh scripts\u002Ftrain.sh -g 4 -d scannet200 -c semseg-volt-joint-small -n semseg-volt-joint-small\n### NuScenes\nsh scripts\u002Ftrain.sh -g 4 -d nuscenes -c semseg-volt-joint-small -n semseg-volt-joint-small\n### SemanticKITTI\nsh scripts\u002Ftrain.sh -g 4 -d semantic_kitti -c semseg-volt-joint-small -n semseg-volt-joint-small\n### Waymo\nsh scripts\u002Ftrain.sh -g 4 -d waymo -c semseg-volt-joint-small -n semseg-volt-joint-small\n```\n\n### Instance Segmentation\n\nFirst, run the preprocessing script to generate superpoints for ScanNet and ScanNet200.\n```bash\npython pointcept\u002Fdatasets\u002Fpreprocessing\u002Fscannet\u002Fpreprocess_superpoints.py --dataset_root ${RAW_SCANNET_DIR} --output_root ${PROCESSED_SCANNET_DIR}\n```\n\nDownload the pretrained Volt-S backbone weights from [HuggingFace](https:\u002F\u002Fhuggingface.co\u002FKadirYilmaz\u002FVolt\u002Ftree\u002Fmain)\n```bash\nmkdir -p weights\ncurl -L -o weights\u002Fvolt-small-scannet.pth https:\u002F\u002Fhuggingface.co\u002FKadirYilmaz\u002FVolt\u002Fresolve\u002Fmain\u002FVolt_experiments\u002Fjoint_training_small\u002Fscannet\u002Fmodel\u002Fmodel_last.pth\ncurl -L -o weights\u002Fvolt-small-scannet200.pth https:\u002F\u002Fhuggingface.co\u002FKadirYilmaz\u002FVolt\u002Fresolve\u002Fmain\u002FVolt_experiments\u002Fjoint_training_small\u002Fscannet200\u002Fmodel\u002Fmodel_last.pth\n```\nAlternatively you can train them yourself using the corresponding configs above.\n\nThen, run the training script with the 
`insseg-spformer-volt-S-0-base` config for ScanNet and ScanNet200.\n\n```bash\n### ScanNet\nsh scripts\u002Ftrain.sh -g 4 -d scannet -c insseg-spformer-volt-S-0-base -n insseg-volt\n### ScanNet200\nsh scripts\u002Ftrain.sh -g 4 -d scannet200 -c insseg-spformer-volt-S-0-base -n insseg-volt\n```\n\n## Model Zoo\n\nWe provide the experiment directories, including configs, logs, and checkpoints. The experiments can also be browsed on [Hugging Face](https:\u002F\u002Fhuggingface.co\u002FKadirYilmaz\u002FVolt\u002Ftree\u002Fmain).\n\n### Semantic Segmentation: Single-Dataset Training\n\n| Model | Dataset | Val mIoU | Exp. Dir |\n| :--- | :--- | :---: | :---: |\n| Volt-S | ScanNet | 76.3 | [link](https:\u002F\u002Fhuggingface.co\u002FKadirYilmaz\u002FVolt\u002Ftree\u002Fmain\u002FVolt_experiments\u002Fsingle_dataset\u002Fscannet) |\n| Volt-S | ScanNet200 | 36.1 | [link](https:\u002F\u002Fhuggingface.co\u002FKadirYilmaz\u002FVolt\u002Ftree\u002Fmain\u002FVolt_experiments\u002Fsingle_dataset\u002Fscannet200) |\n| Volt-S | ScanNet++ | 50.2 | [link](https:\u002F\u002Fhuggingface.co\u002FKadirYilmaz\u002FVolt\u002Ftree\u002Fmain\u002FVolt_experiments\u002Fsingle_dataset\u002Fscannetpp) |\n| Volt-S | nuScenes | 81.1 | [link](https:\u002F\u002Fhuggingface.co\u002FKadirYilmaz\u002FVolt\u002Ftree\u002Fmain\u002FVolt_experiments\u002Fsingle_dataset\u002Fnuscenes) |\n| Volt-S | SemanticKITTI | 70.3 | [link](https:\u002F\u002Fhuggingface.co\u002FKadirYilmaz\u002FVolt\u002Ftree\u002Fmain\u002FVolt_experiments\u002Fsingle_dataset\u002Fsemantic_kitti) |\n| Volt-S | Waymo | 71.2 | [link](https:\u002F\u002Fhuggingface.co\u002FKadirYilmaz\u002FVolt\u002Ftree\u002Fmain\u002FVolt_experiments\u002Fsingle_dataset\u002Fwaymo) |\n\n### Semantic Segmentation: Joint Training\n\n| Model | Dataset | Val mIoU | Exp. 
Dir |\n| :--- | :--- | :---: | :---: |\n| Volt-S | ScanNet | 80.2 | [link](https:\u002F\u002Fhuggingface.co\u002FKadirYilmaz\u002FVolt\u002Ftree\u002Fmain\u002FVolt_experiments\u002Fjoint_training_small\u002Fscannet) |\n| Volt-S | ScanNet200 | 38.5 | [link](https:\u002F\u002Fhuggingface.co\u002FKadirYilmaz\u002FVolt\u002Ftree\u002Fmain\u002FVolt_experiments\u002Fjoint_training_small\u002Fscannet200) |\n| Volt-S | nuScenes | 81.8 | [link](https:\u002F\u002Fhuggingface.co\u002FKadirYilmaz\u002FVolt\u002Ftree\u002Fmain\u002FVolt_experiments\u002Fjoint_training_small\u002Fnuscenes) |\n| Volt-S | SemanticKITTI | 72.8 | [link](https:\u002F\u002Fhuggingface.co\u002FKadirYilmaz\u002FVolt\u002Ftree\u002Fmain\u002FVolt_experiments\u002Fjoint_training_small\u002Fsemantic_kitti) |\n| Volt-S | Waymo | 72.5 | [link](https:\u002F\u002Fhuggingface.co\u002FKadirYilmaz\u002FVolt\u002Ftree\u002Fmain\u002FVolt_experiments\u002Fjoint_training_small\u002Fwaymo) |\n\n## Citation\n\nIf you use our work in your research, please use the following BibTeX entry.\n\n```\n@article{yilmaz2026volt,\n  title     = {{Volume Transformer: Revisiting Vanilla Transformers for 3D Scene Understanding}},\n  author    = {Yilmaz, Kadir and Kruse, Adrian and Höfer, Tristan and de Geus, Daan and Leibe, Bastian},\n  journal   = {arXiv preprint arXiv:2604.19609},\n  year      = {2026}\n}\n```\n","Volume Transformer (Volt) is a project focused on 3D scene understanding that adapts the conventional Transformer architecture to process three-dimensional data. It partitions the input 3D scene into non-overlapping volumetric patches and uses a linear tokenizer to embed each patch into a token; the resulting token sequence is processed by a Transformer encoder with global attention. The latent tokens are then upsampled back to the voxel resolution with a transposed convolution and mapped to semantic predictions. The project targets applications that need high-accuracy 3D instance segmentation and semantic segmentation, such as autonomous driving and robot navigation. It is written in Python and integrates components from the Pointcept and SGIFormer projects.","2026-06-11 02:50:59","CREATED_QUERY"]