[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-79934":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":14,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":16,"stars7d":17,"stars30d":17,"stars90d":15,"forks30d":15,"starsTrendScore":18,"compositeScore":19,"rankGlobal":10,"rankLanguage":10,"license":20,"archived":21,"fork":21,"defaultBranch":22,"hasWiki":21,"hasPages":23,"topics":24,"createdAt":10,"pushedAt":10,"updatedAt":25,"readmeContent":26,"aiSummary":27,"trendingCount":15,"starSnapshotCount":15,"syncStatus":16,"lastSyncTime":28,"discoverSource":29},79934,"TrackCraft3r","cvlab-kaist\u002FTrackCraft3r","cvlab-kaist","Official code implementation for TrackCraft3R: Repurposing Video Diffusion Transformers for Dense 3D Tracking","https:\u002F\u002Fcvlab-kaist.github.io\u002FTrackCraft3r\u002F",null,"Python",92,4,1,0,2,7,6,2.1,"Apache License 2.0",false,"main",true,[],"2026-06-12 02:03:55","# TrackCraft3R: Repurposing Video Diffusion Transformers for Dense 3D Tracking\n\n[Paper](paper\u002F2605.12587v1.pdf) &nbsp;|&nbsp; [arXiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.12587) &nbsp;|&nbsp; [Project Page](https:\u002F\u002Fcvlab-kaist.github.io\u002FTrackCraft3r)\n\nThis repository contains the official training code for **TrackCraft3R**, the first method that repurposes a pre-trained video diffusion transformer (Wan2.1-T2V-1.3B) as a single-pass dense 3D tracker. Given a monocular video together with its predicted depth and camera, TrackCraft3R predicts dense 3D trajectories in a single forward pass.\n\n---\n\n## 1. Environment\n\nWe tested training on **8 × NVIDIA H200 (141 GB)** GPUs with CUDA 12.1, Python 3.10, and PyTorch 2.4.\n\n```bash\n# 1. Create a fresh conda env\nconda create -n trackcraft3r python=3.10 -y\nconda activate trackcraft3r\n\n# 2. 
Install PyTorch (match your CUDA version)\npip install torch torchvision --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu121\n\n# 3. Install the rest of the dependencies\npip install -e .\npip install -r requirements.txt\npip install accelerate huggingface_hub wandb imageio[ffmpeg]\n```\n\nConfigure `accelerate` once:\n\n```bash\naccelerate config   # or: accelerate config default\n```\n\n---\n\n## 2. Pre-trained Weights\n\n### 2.1 Wan2.1-T2V-1.3B base model\n\nThe training code initializes the DiT, VAE, and T5 text encoder from the public Wan2.1-T2V-1.3B checkpoint. Download it once with:\n\n```bash\npython scripts\u002Fdownload_wan_1.3B.py --target .\u002Fcheckpoints\u002Fwan_models\n```\n\nThis pulls only the files we need (`diffusion_pytorch_model*.safetensors`, `models_t5_umt5-xxl-enc-bf16.pth`, `Wan2.1_VAE.pth`) into `.\u002Fcheckpoints\u002Fwan_models\u002FWan-AI\u002FWan2.1-T2V-1.3B\u002F`. The training scripts read this location through the `MODELSCOPE_CACHE` environment variable (default `.\u002Fcheckpoints\u002Fwan_models`).\n\n### 2.2 TrackCraft3R checkpoint\n\nTrained weights are released on the Hugging Face Hub at\n[`trackcraft3r\u002Fcheckpoint`](https:\u002F\u002Fhuggingface.co\u002Ftrackcraft3r\u002Fcheckpoint):\n\n```bash\nhuggingface-cli download trackcraft3r\u002Fcheckpoint --local-dir .\u002Fcheckpoints\u002Ftrackcraft3r\n# → .\u002Fcheckpoints\u002Ftrackcraft3r\u002Fmodel.safetensors\n```\n\n---\n\n## 3. Training Datasets\n\nWe train on four synthetic datasets: [Kubric](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Fkubric), Dynamic Replica and PointOdyssey (downloaded via the scripts in [St4RTrack](https:\u002F\u002Fgithub.com\u002FHavenFeng\u002FSt4RTrack)), and [TartanAir](https:\u002F\u002Ftheairlab.org\u002Ftartanair-dataset\u002F). 
For Kubric we render 6K sequences (480×832, 81 frames).\n\nOnce downloaded\u002Frendered, set the four environment variables that the training scripts read:\n\n```bash\nexport KUBRIC_ROOT=\u002Fpath\u002Fto\u002Fkubric\nexport DYNAMIC_REPLICA_ROOT=\u002Fpath\u002Fto\u002Fdynamic_replica\nexport POINTODYSSEY_ROOT=\u002Fpath\u002Fto\u002Fpoint_odyssey\u002Ftrain\nexport TARTANAIR_ROOT=\u002Fpath\u002Fto\u002Ftartanair\n```\n\nSee `diffsynth\u002Ftrainers\u002Fsynthetic_dataset.py` for the exact directory structure each loader expects.\n\n---\n\n## 4. Training\n\nTraining proceeds in two stages.\n\n### 4.1 Stage 1: DiT LoRA + I\u002FO projections\n\nTrain the DiT with LoRA together with the input\u002Foutput projection layers. The VAE encoders\u002Fdecoders are frozen.\n\n```bash\nbash scripts\u002Ftrain_stage1.sh\n```\n\nCheckpoints are saved every 100 steps to `.\u002Fcheckpoints\u002Fstage1\u002F`.\n\n### 4.2 Stage 2: DiT LoRA + I\u002FO projections + VAE\n\nContinue from a Stage-1 checkpoint and additionally unfreeze the VAE encoder\u002Fdecoder. The pointmap encoder and visibility decoder are deep-copied to give two independent encoders (RGB \u002F pointmap) and two independent decoders (residual track \u002F visibility), all trained jointly with the DiT LoRA and the input\u002Foutput projection layers.\n\n```bash\n# Pick the Stage-1 state directory you want to resume from\nexport RESUME_FROM=.\u002Fcheckpoints\u002Fstage1\u002Fstate-XXXX\n\nbash scripts\u002Ftrain_stage2.sh\n```\n\nCheckpoints are saved to `.\u002Fcheckpoints\u002Fstage2\u002F`.\n\n### 4.3 W&B logging\n\nRun `wandb login` once before training. To skip W&B entirely, `export WANDB_MODE=disabled`.\n\n---\n\n## 5. Evaluation\n\nWe provide two evaluation scripts that reproduce the numbers reported in\nthe paper: an **interleaved eval** for long-video inference and a\n**stride eval** for large-motion inference. 
See\n[`evaluation\u002FREADME.md`](evaluation\u002FREADME.md) for details.\n\n```bash\n# 1) Download eval dataset and checkpoint\nhuggingface-cli download trackcraft3r\u002Ftrackcraft3r-eval --repo-type dataset --local-dir .\u002Feval_dataset\nhuggingface-cli download trackcraft3r\u002Fcheckpoint --local-dir .\u002Fcheckpoints\u002Ftrackcraft3r\n\n# 2) Run\nbash evaluation\u002Fscripts\u002Feval_interleaved.sh \\\n    --checkpoint_path .\u002Fcheckpoints\u002Ftrackcraft3r\u002Fmodel.safetensors \\\n    --data_root .\u002Feval_dataset --output_dir .\u002Feval_results\u002Finterleaved\n\nbash evaluation\u002Fscripts\u002Feval_stride.sh \\\n    --checkpoint_path .\u002Fcheckpoints\u002Ftrackcraft3r\u002Fmodel.safetensors \\\n    --data_root .\u002Feval_dataset --output_dir .\u002Feval_results\u002Fstride\n```\n\n---\n\n## 6. Run on your own video\n\nEnd-to-end pipeline:\n\n```\nyour_video → preprocess (DA3 or ViPE) → build_user_npz → inference → visualize\n```\n\nThe walkthrough below uses the included sample\n[`assets\u002Fexample\u002Fbreakdance.mp4`](assets\u002Fexample\u002Fbreakdance.mp4).\n\n\n### 6.1 Extract depth + camera\n\nYou can use either DA3 or ViPE. Both produce z-depth + per-frame\ncamera and feed into the same downstream NPZ. Each is installed\nin its own conda env to avoid clashing with `trackcraft3r`.\n\n**Option A: Depth-Anything-V3** ([repo](https:\u002F\u002Fgithub.com\u002FByteDance-Seed\u002Fdepth-anything-3)):\n\n```bash\n# 1. Set up a dedicated env for DA3.\n#    Pinning torch + xformers up-front prevents `pip install -e .` from\n#    pulling the latest torch (which may not match your CUDA driver).\nconda create -n da3 python=3.10 -y\nconda activate da3\npip install torch==2.5.1 torchvision==0.20.1 xformers==0.0.28.post3 \\\n    --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu121\n\n# 2. 
Clone + install DA3\ngit clone https:\u002F\u002Fgithub.com\u002FByteDance-Seed\u002Fdepth-anything-3 .\u002FDepth-Anything-V3\ncd .\u002FDepth-Anything-V3 && pip install -e . && cd ..\n\n# 3. Run on the example video (use --frame_dir for an image directory)\npython scripts\u002Fpreprocess_da3.py \\\n    --video_path .\u002Fassets\u002Fexample\u002Fbreakdance.mp4 \\\n    --output_dir .\u002Fpreproc\u002Fbreakdance_da3\u002F \\\n    --da3_root .\u002FDepth-Anything-V3 \\\n    --model_name \"depth-anything\u002FDA3NESTED-GIANT-LARGE\"\n\nconda deactivate\n```\n\nWrites `depth.npy` (T,H,W float32 z-depth), `extrinsics.npy`\n(T,4,4 **W2C**), `intrinsics.npy` (T,3,3, rescaled to original image\nres).\n\n**Option B: ViPE** ([repo](https:\u002F\u002Fgithub.com\u002Fnv-tlabs\u002Fvipe)):\n\n```bash\n# 1. Set up a dedicated env for ViPE (follow ViPE's README for the exact\n#    torch\u002FCUDA versions it expects)\nconda create -n vipe python=3.10 -y\nconda activate vipe\n\n# 2. Clone + install ViPE\ngit clone https:\u002F\u002Fgithub.com\u002Fnv-tlabs\u002Fvipe .\u002FViPE\ncd .\u002FViPE && pip install -e . && cd ..\n\n# 3. Run on the example video\npython scripts\u002Fpreprocess_vipe.py \\\n    --video_path .\u002Fassets\u002Fexample\u002Fbreakdance.mp4 \\\n    --output_dir .\u002Fpreproc\u002Fbreakdance_vipe\u002F \\\n    --vipe_root .\u002FViPE\n\nconda deactivate\n```\n\nWrites `depth.npy` (z-depth), `extrinsics.npy` (T,4,4 **C2W** per\n`vipe\u002Futils\u002Fio.py \"cam2world matrices\"`), `intrinsics.npy`. 
ViPE's\nintrinsics are constant across frames in a single clip.\n\nAfter §6.1, `conda activate trackcraft3r` to run §6.2 onwards.\n\n### 6.2 Build the TrackCraft3R-format NPZ\n\nDA3 path:\n```bash\npython scripts\u002Fbuild_user_npz.py \\\n    --video_path     .\u002Fassets\u002Fexample\u002Fbreakdance.mp4 \\\n    --depth_npy      .\u002Fpreproc\u002Fbreakdance_da3\u002Fdepth.npy \\\n    --extrinsics_npy .\u002Fpreproc\u002Fbreakdance_da3\u002Fextrinsics.npy \\\n    --intrinsics_npy .\u002Fpreproc\u002Fbreakdance_da3\u002Fintrinsics.npy \\\n    --depth_convention z --extrinsics_convention w2c \\\n    --output_npz .\u002Fbreakdance_user.npz\n```\n\nViPE path:\n```bash\npython scripts\u002Fbuild_user_npz.py \\\n    --video_path     .\u002Fassets\u002Fexample\u002Fbreakdance.mp4 \\\n    --depth_npy      .\u002Fpreproc\u002Fbreakdance_vipe\u002Fdepth.npy \\\n    --extrinsics_npy .\u002Fpreproc\u002Fbreakdance_vipe\u002Fextrinsics.npy \\\n    --intrinsics_npy .\u002Fpreproc\u002Fbreakdance_vipe\u002Fintrinsics.npy \\\n    --depth_convention z --extrinsics_convention c2w \\\n    --output_npz .\u002Fbreakdance_user.npz\n```\n\nThe output NPZ contains:\n\n* `images_jpeg_bytes` — JPEG-encoded RGB frames\n* `depth_map` (T, H, W) — z-depth from DA3 \u002F ViPE\n* `extrinsics_w2c` (T, 4, 4) — frame-0-normalized world-to-camera\n* `fx_fy_cx_cy` (4,) — predicted intrinsics from DA3 \u002F ViPE\n\n### 6.3 Run inference + save the dense prediction\n\n`--num_frames` × `--frame_stride` decides which frames the model sees. 
The NPZ from §6.2\nkeeps all frames so you can re-run inference at different settings\nwithout re-building.\n\n```bash\nMODELSCOPE_CACHE=.\u002Fcheckpoints\u002Fwan_models \\\npython scripts\u002Finference_user_video.py \\\n    --checkpoint_path .\u002Fcheckpoints\u002Ftrackcraft3r\u002Fmodel.safetensors \\\n    --input_npz  .\u002Fbreakdance_user.npz \\\n    --output_npz .\u002Fbreakdance_dense.npz \\\n    --num_frames 12 --frame_stride 5\n```\n\nSaves `track_map` (T,H,W,3) per-pixel 3D tracks in frame-0 cam space (used\nfor the overlaid track trails), `recon_map` (T,H,W,3) per-frame depth\nback-projection in the same frame-0 cam space (used for the per-frame RGB\npoint cloud), and `rgb` (T,H,W,3) for point-cloud coloring.\n\n### 6.4 Visualize with Viser\n\n```bash\npython scripts\u002Fvisualize_dense.py --dense_npz .\u002Fbreakdance_dense.npz --port 8080\n```\n\n---\n\n## 7. Acknowledgements\n\n- [Wan2.1-T2V-1.3B](https:\u002F\u002Fgithub.com\u002FWan-Video\u002FWan2.1): base video backbone\n- [DiffSynth-Studio](https:\u002F\u002Fgithub.com\u002Fmodelscope\u002FDiffSynth-Studio): training pipeline framework\n- [St4RTrack](https:\u002F\u002Fgithub.com\u002FHavenFeng\u002FSt4RTrack): evaluation code\n- [Any4D](https:\u002F\u002Fgithub.com\u002FAny-4D\u002FAny4D): visualization code\n\n","TrackCraft3R is a project that repurposes a pre-trained video diffusion transformer for dense 3D tracking. Its core function is to predict dense 3D trajectories in a single forward pass from a monocular video together with its predicted depth and camera information. Technically, it builds on the Wan2.1-T2V-1.3B model and uses techniques such as LoRA (low-rank adaptation) during training to optimize performance. The project suits applications that need high-accuracy 3D tracking, such as object tracking in autonomous driving, augmented reality, or virtual reality. The development environment requires Python 3.10, PyTorch 2.4, and a CUDA-capable GPU.","2026-06-11 03:58:36","CREATED_QUERY"]