[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-79992":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":14,"stars7d":14,"stars30d":12,"stars90d":15,"forks30d":15,"starsTrendScore":12,"compositeScore":16,"rankGlobal":9,"rankLanguage":9,"license":17,"archived":18,"fork":18,"defaultBranch":19,"hasWiki":20,"hasPages":18,"topics":21,"createdAt":9,"pushedAt":9,"updatedAt":22,"readmeContent":23,"aiSummary":24,"trendingCount":15,"starSnapshotCount":15,"syncStatus":25,"lastSyncTime":26,"discoverSource":27},79992,"4DThinker","zhangquanchen\u002F4DThinker","zhangquanchen","4DThinker: Thinking with 4D Imagery for Dynamic Spatial Understanding",null,"Python",74,3,71,1,0,44.61,"Apache License 2.0",false,"main",true,[],"2026-06-12 04:01:26","\u003Cdiv align=\"center\">\n\u003Ch1>4DThinker: Thinking with 4D Imagery for Dynamic Spatial Understanding\u003C\u002Fh1>\n\u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.05997\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2605.05997-b31b1b\" alt=\"arXiv\">\u003C\u002Fa>\n\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fpapers\u002F2605.05997\">\u003Cimg src='https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%A4%97%20Hugging%20Face-Paper-orange'>\u003C\u002Fa>\n\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fjankin123\u002F4DThinker-Training-Data\">\u003Cimg src='https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%A4%97%20Hugging%20Face-Dataset-blue'>\u003C\u002Fa>\n\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fjankin123\u002F4DThinker-3B\">\u003Cimg src='https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%A4%97%20Hugging%20Face-4DThinker3B-yellow'>\u003C\u002Fa>\n\u003C!-- \u003Cp align=\"center\">\n  🤗 \u003Cb>2.0 \u003Ca 
href=\"#models\">Models\u003C\u002Fa>\u003C\u002Fb> · \u003Cb>\u003Ca href=\"#datasets\">Datasets\u003C\u002Fa>\u003C\u002Fb> · \u003Cb>\u003Ca href=\"#citation\">Technical Report\u003C\u002Fa>\u003C\u002Fb>\n\u003C\u002Fp> -->\n\n**THU**; **Meituan**; **CUHK**; **NUS**; **LMMs-Lab**; **UCLA**\n\u003C\u002Fdiv>\n\n## Overview\n\n\u003Cimg src=\"assets\u002Fpipeline.png\" alt=\"drawing\" width=\"500\"\u002F>\n\n## Introduction\nDynamic spatial reasoning from monocular video is essential for bridging visual intelligence and the physical world, yet remains challenging for vision-language models (VLMs). Prior approaches either verbalize spatial-temporal reasoning entirely as text, which is inherently verbose and imprecise for complex dynamics, or rely on external geometric modules that increase inference complexity without fostering intrinsic model capability. In this paper, we present 4DThinker, the first framework that enables VLMs to \"think with 4D\" through dynamic latent mental imagery, i.e., internally simulating how scenes evolve within the continuous hidden space. Specifically, we first introduce a scalable, annotation-free data generation pipeline that synthesizes 4D reasoning data from raw videos. We then propose Dynamic-Imagery Fine-Tuning (DIFT), which jointly supervises textual tokens and 4D latents to ground the model in dynamic visual semantics. Building on this, 4D Reinforcement Learning (4DRL) further tackles complex reasoning tasks via outcome-based rewards, restricting policy gradients to text tokens to ensure stable optimization. 
Extensive experiments across multiple dynamic spatial reasoning benchmarks demonstrate that 4DThinker consistently outperforms strong baselines and offers a new perspective toward 4D reasoning in VLMs.\n\n## Project Structure\n\n```\n4DThinker\u002F\n├── README.md\n├── LICENSE.txt\n├── .gitignore\n├── dift\u002F                        # DIFT training code\n│   ├── src\u002F                     # main.py, trainer.py, task.py, utils.py, inference.py\n│   ├── transformers\u002F            # Custom Qwen2.5-VL transformers fork\n│   ├── configs\u002F                 # DeepSpeed configs (ds_zero2.json, ds_zero3.json)\n│   ├── train.sh                 # Multi-GPU training script\n│   ├── train_single_gpu.sh      # Single-GPU training script\n│   └── requirements_dift.txt\n├── 4drl\u002F                        # 4DRL (GRPO) training code\n│   ├── src\u002Fopen-r1-multimodal\u002F  # RL trainer package\n│   ├── transformers_rl\u002F         # Custom transformers fork for RL\n│   ├── trl\u002F                     # Modified trl package\n│   ├── run_scripts\u002F             # train_4dthinker.sh\n│   ├── configs\u002F                 # DeepSpeed configs\n│   └── requirements_4drl.txt\n├── evaluation\u002F                  # DSR benchmark evaluation\n│   ├── dsr_eval.py\n│   ├── batch_dsr_eval.sh\n│   └── results\u002F                 # Evaluation output\n├── preprocess\u002F                  # Data generation pipeline\n│   ├── run.sh                   # Entry point: loops process_minibatch.py\n│   ├── process_minibatch.py     # Frame extraction + SAM3 masks + object detection\n│   ├── merge_jsonl.py           # Merge per-video data.jsonl\n│   ├── generate_camera_qa.py    # Camera movement QA + CoT\n│   ├── generate_dynamic_qa.py   # Object motion QA + CoT\n│   ├── convert_format.py        # Convert to training JSONL format\n│   ├── check_output_image.py    # Validate \u003Coutput_image> tags\n│   └── sam3\u002F                    # SAM3 segmentation model\n├── data\u002F      
                  # [HuggingFace] Training data\n│   ├── dift_data.jsonl          # DIFT training data (38K samples)\n│   ├── 4drl_data_filtered.jsonl # 4DRL training data (37K samples)\n│   └── processed_data\u002F          # Video frames & masks\n├── raw_data\u002F                    # [Download yourself] Evaluation benchmark data\n└── model\u002F                       # [HuggingFace] Model checkpoints\n    ├── dift\u002F                    # DIFT checkpoint\n    └── 4drl\u002F                    # 4DRL checkpoint\n```\n\n> **Note**: `data\u002F`, `raw_data\u002F`, and `model\u002F` are hosted on HuggingFace due to their large size. See the respective HuggingFace repositories for download instructions.\n\n## Env Setup\n\n### Preprocess Environment (optional)\n\n```bash\ncd preprocess\u002Fsam3\npip install -e .\n```\n\n### DIFT Environment\n\n```bash\nconda create -n 4dthinker python=3.10 -y\nconda activate 4dthinker\n\npip install -r dift\u002Frequirements_dift.txt\ncd dift\npip install -e .\u002Ftransformers\u002F\n```\n\n### 4DRL Environment\n\n```bash\nconda create -n 4dthinker-4drl python=3.10 -y\nconda activate 4dthinker-4drl\n\npip install -r 4drl\u002Frequirements_4drl.txt\ncd 4drl\npip install -e .\u002Ftransformers_rl\u002F\ncp -rf .\u002Ftrl $(python -c \"import site; print(site.getsitepackages()[0])\")\u002Ftrl\n# Install RL trainer\npip install -e .\u002Fsrc\u002Fopen-r1-multimodal\u002F\n```\n\n## Data Preprocessing\n\nThe `preprocess\u002F` directory contains the full annotation-free data generation pipeline. Starting from raw SpatialVID videos, it produces structured 4D reasoning data (CoT interleaved with dynamic mental imagery). 
The preprocessed dataset is available [here](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fjankin123\u002F4DThinker-Training-Data).\n\n### Pipeline Overview\n\u003Cimg src=\"assets\u002Fdata_gen.png\" alt=\"drawing\" width=\"500\"\u002F>\n\n### Prerequisites\n\n- SAM3 model checkpoint at `preprocess\u002Fsam3\u002Fmodels\u002Fsam3.pt`\n- SpatialVID data (videos + annotations + metadata CSV)\n- OpenAI-compatible API access (for Gemini-based QA generation)\n\n### Usage\n\n```bash\ncd preprocess\n\n# Set environment variables\nexport OPENAI_API_KEY=your_api_key\nexport OPENAI_BASE_URL=https:\u002F\u002Fapi.openai.com\u002Fv1\nexport DATA_BASE_DIR=\u002Fpath\u002Fto\u002Fyour\u002Fdata\n\n# Step 1: Process videos (frame extraction + SAM3 masks + object identification)\n# This script loops automatically until all videos are processed.\nbash run.sh\n\n# Step 2: Merge per-video results into a single JSONL\npython merge_jsonl.py\n\n# Step 3: Generate motion QA pairs with imagery-based CoT\npython generate_camera_qa.py    # Camera motion questions\npython generate_dynamic_qa.py   # Object motion questions\n\n# Step 4: Convert to training format and validate\npython convert_format.py .\u002Fcamera_data_qa_all.jsonl .\u002Fcamera_qa_converted.jsonl\npython convert_format.py .\u002Fdynamic_data_qa_all.jsonl .\u002Fdynamic_qa_converted.jsonl\npython check_output_image.py .\u002Fcamera_qa_converted.jsonl\npython check_output_image.py .\u002Fdynamic_qa_converted.jsonl\n```\n\n## Training\n\nA demo checkpoint trained from Qwen2.5-VL-3B is available [here](https:\u002F\u002Fhuggingface.co\u002Fjankin123\u002F4DThinker-3B). 
\n\n### DIFT Training\n```bash\nconda activate 4dthinker\nbash dift\u002Ftrain.sh\n\n# OR\n\nbash dift\u002Ftrain_single_gpu.sh\n```\n\nKey arguments:\n- `MODEL_PATH`: Path to Qwen2.5-VL-3B-Instruct base model\n- `DATA_PATH`: Path to `dift_data.jsonl`\n- `--latent_size`: Number of latent tokens per image (default: 4)\n- `--ce_weight` \u002F `--sim_weight`: Loss weights (default: 0.1 \u002F 1.0)\n\n### 4DRL Training\n\n```bash\nconda activate 4dthinker-4drl\nbash 4drl\u002Frun_scripts\u002Ftrain_4dthinker.sh\n```\n\nKey arguments:\n- `MODEL_PATH`: Path to DIFT checkpoint directory\n- `DATA_PATH`: Path to `4drl_data_filtered.jsonl`\n\n## Inference\n\nSee `dift\u002Fsrc\u002Finference.py`.\n\n## Evaluation\nOn the DSR benchmark:\n\n```bash\nconda activate 4dthinker\n# Single model evaluation\nCUDA_VISIBLE_DEVICES=0 python evaluation\u002Fdsr_eval.py \\\n    --model_path model\u002Fdift\u002Fcheckpoints \\\n    --benchmark_path .\u002Fraw_data\u002FDSR_Suite-Data\u002Fbenchmark.parquet \\\n    --video_root .\u002Fraw_data\u002FDSR-data\u002Fbmk_video \\\n    --latent_size 4\n\n# Batch evaluation (multiple checkpoints in parallel)\nbash evaluation\u002Fbatch_dsr_eval.sh\n```\n\n## Acknowledgements\nThis repo also benefits from [SpatialVID](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FSpatialVID\u002FSpatialVID), [DSR_Suite](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FTencentARC\u002FDSR_Suite-Data), [Dyn-Bench](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fyu2hi13\u002FDyn-Bench), [Mirage](https:\u002F\u002Fgithub.com\u002FUMass-Embodied-AGI\u002FMirage), [trl](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl), [transformers](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftransformers), [SAM3](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fsam3).\n\nThanks for their wonderful work.\n\n## Bibtex\nIf you find 4DThinker helpful for your work, please cite\n\n```\n@article{chen20264dthinker,\n  title={4DThinker: Thinking with 4D Imagery for 
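Dynamic Spatial Understanding},\n  author={Chen, Zhangquan and Zhang, Manyuan and Yu, Xinlei and An, Xiang and Li, Bo and Xie, Xin and Wang, ZiDong and Sun, Mingze and Chen, Shuang and Li, Hongyu and others},\n  journal={arXiv preprint arXiv:2605.05997},\n  year={2026}\n}\n```\n","4DThinker is a framework for dynamic spatial understanding that uses 4D imagery to help vision-language models (VLMs) reason more effectively about space and time. Its core components include a scalable, annotation-free data generation pipeline that synthesizes 4D reasoning data from raw videos, and Dynamic-Imagery Fine-Tuning (DIFT), which jointly supervises text tokens and 4D latents to strengthen the model's grasp of dynamic visual semantics. In addition, 4D Reinforcement Learning (4DRL) tackles complex reasoning tasks through outcome-based rewards. The project suits applications that need advanced dynamic spatial reasoning from monocular video, such as autonomous driving and robot navigation.",2,"2026-06-11 03:58:48","CREATED_QUERY"]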