[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-80152":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":15,"stars7d":12,"stars30d":16,"stars90d":15,"forks30d":15,"starsTrendScore":15,"compositeScore":17,"rankGlobal":9,"rankLanguage":9,"license":9,"archived":18,"fork":18,"defaultBranch":19,"hasWiki":20,"hasPages":18,"topics":21,"createdAt":9,"pushedAt":9,"updatedAt":22,"readmeContent":23,"aiSummary":24,"trendingCount":15,"starSnapshotCount":15,"syncStatus":12,"lastSyncTime":25,"discoverSource":26},80152,"semantic-wm","chandar-lab\u002Fsemantic-wm","chandar-lab","repository for training action-conditioned latent diffusion world models for robot video generation",null,"Python",64,2,52,3,0,11,1.43,false,"main",true,[],"2026-06-12 02:03:58","# Reconstruction or Semantics?\n\nOfficial code for **Reconstruction or Semantics? What Makes a Latent Space Useful for Robotic World Models**.\n\nThis repository trains action-conditioned latent diffusion world models for robot video generation and policy evaluation. The paper studies whether robotic world models should operate in reconstruction-aligned latent spaces, such as VAEs and Cosmos, or semantic latent spaces from pretrained vision encoders, such as V-JEPA 2.1, Web-DINO, and SigLIP 2.\n\n**Links:** [Project page](https:\u002F\u002Fhskalin.github.io\u002Fsemantic-wm\u002F) | [arXiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.06388) | [Hugging Face](https:\u002F\u002Fhuggingface.co\u002FNilaksh404\u002Fsemantic-wm)\n\nThe main finding is that pixel fidelity alone is not enough for choosing a world-model latent space. 
Reconstruction encoders can score well on visual metrics, but semantic encoders generally preserve action information, task progress, planning utility, and downstream policy behavior better across model scales.\n\n---\n\n## What This Code Includes\n\n- **Multiple encoders**: VAE (SD3), DINOv2-based RAE, SigLIP2\u002FWebSSL ScaleRAE, Qwen2.5-VL, V-JEPA 2.1, Cosmos CI16x16, VA-VAE\n- **Semantic adapters (S-VAE)**: Compress high-dimensional encoder features (768–1280-d) to a compact diffusion-friendly space (96-d) via Transformer-based VAE\n- **Pixel decoder**: Lightweight CNN that directly maps adapter latents to RGB, bypassing the large Transformer decoder\n- **Flow matching & DDPM**: Both objectives supported, with logit-normal time sampling and time-shift scheduling\n- **Multi-view support**: Transfer single-view pretrained weights to 3-camera setups via view-aware 3D RoPE\n- **Evaluation suite**: PSNR\u002FSSIM\u002FLPIPS\u002FFID\u002FFVD, PCK (keypoint tracking), controllability (action optimization in latent space), and trajectory success probing\n\n---\n\n## Installation\n\n```bash\nuv venv\nuv pip install -r requirements.txt\n```\n\n---\n\n## Quick Start\n\n### 1. Download Data\n\n```bash\npip install tensorflow tensorflow_datasets\npython -m src.data.download_data --dataset_name bridge_v2 --output_dir .\u002Fdata\n```\n\n### 2. Train an Adapter\n\nRequired for representation encoders (RAE, ScaleRAE, Qwen, V-JEPA 2.1). Not needed for VAE, Cosmos, or VA-VAE.\n\n```bash\npython -m src.launch_adapter \\\n    --encoder_type scale_rae_webssl \\\n    --adapter_type svae \\\n    --adapter_latent_dim 96 \\\n    --dataset_dir .\u002Fdata \\\n    --subset_names bridge_v2 \\\n    --batch_size 16 \\\n    --num_epochs 50 \\\n    --use_pixel_decoder true \\\n    --stage svae\n```\n\n### 3. 
Train the World Model (DiT)\n\n```bash\n# Single GPU\npython -m src.launch \\\n    --encoder_type scale_rae_webssl \\\n    --adapter_type svae \\\n    --adapter_checkpoint_path outputs\u002Fadapter_svae\u002Fadapter_ckpt_50.pt \\\n    --adapter_latent_dim 96 \\\n    --dit_size XL \\\n    --objective flow_matching \\\n    --dataset_dir .\u002Fdata \\\n    --subset_names bridge_v2 \\\n    --batch_size 8\n\n# Multi-GPU\ntorchrun --nproc_per_node=4 -m src.launch \\\n    --encoder_type scale_rae_webssl \\\n    --adapter_type svae \\\n    --adapter_checkpoint_path outputs\u002Fadapter_svae\u002Fadapter_ckpt_50.pt \\\n    --adapter_latent_dim 96 \\\n    --dit_size XL \\\n    --objective flow_matching \\\n    --dataset_dir .\u002Fdata \\\n    --batch_size 8\n```\n\nCheckpoints and GIF samples are written to `outputs\u002F\u003Ctimestamp>\u002F`.\n\n### 4. Evaluate\n\n```bash\npython -m src.launch_eval \\\n    --model_preset DiT-S_WEBSSL_WIDE \\\n    --dataset_dir .\u002Fdata \\\n    --subset_names bridge_v2 \\\n    --metrics \"psnr,ssim,lpips,fvd,pck,controllability\"\n```\n\n---\n\n## Architecture\n\n### Encoders (`src\u002Fmodels\u002Fbase_autoencoder.py`)\n\nAll encoders inherit from `BaseAutoencoder` and expose a uniform `encode(x)` \u002F `decode(z)` \u002F `latent_dim` API. 
Instantiate via `create_autoencoder(config)`.\n\n| `encoder_type` | Class | Latent Dim | Backbone |\n|---|---|---|---|\n| `vae` | `VAE` | 16 | Stable Diffusion 3, frozen |\n| `rae` | `RAE` | 768 | DINOv2-Base + ViT-MAE decoder |\n| `scale_rae_siglip` | `ScaleRAE` | 1152 | SigLIP2 + ViT-XL decoder |\n| `scale_rae_webssl` | `ScaleRAE` | 1024 | WebSSL\u002FDINOv2 + ViT-XL decoder |\n| `qwen` | `QwenEncoderWrapper` | 1280 | Qwen2.5-VL-3B (3D temporal) |\n| `vjepa2` | `VJEPA2EncoderWrapper` | 1024 | V-JEPA 2.1 ViT-L\u002F16 (image mode) |\n| `cosmos` | `CosmosTokenizerWrapper` | 16 | Cosmos CI16x16; no adapter needed |\n| `vavae` | `VAVAEWrapper` | 32 | VA-VAE f16d32; no adapter needed |\n\n### Adapters (`src\u002Fmodels\u002Fadapters.py`)\n\nProject high-dimensional encoder latents (d_h = 768–1280) down to a compact space (d_l, default 96). The adapter is **always frozen** during DiT training.\n\n| `adapter_type` | Description |\n|---|---|\n| `identity` | Pass-through (use with VAE, Cosmos, VA-VAE) |\n| `mlp` | Two-layer MLP: d_h → hidden → d_l |\n| `svae` | Transformer blocks + diagonal Gaussian; optional pixel decoder |\n\n### Diffusion Transformer (`src\u002Fmodels\u002Fmodel.py`)\n\nDiT variants with causal attention across time, action conditioning via concatenation, and spatial\u002Ftemporal rotary embeddings.\n\n| Size | Hidden | Depth | Heads |\n|---|---|---|---|\n| S | 384 | 12 | 6 |\n| B | 768 | 12 | 12 |\n| L | 1024 | 24 | 16 |\n| XL | 1152 | 28 | 16 |\n\n### Inference API (`src\u002Fmodels\u002Fworld_model.py`)\n\n```python\nfrom src.models.world_model import WorldModel\n\nmodel = WorldModel(checkpoint_path=\"model.pt\")\nmodel.reset(initial_frames)                    # encode and cache history\nnext_frames = model.generate_chunk(action_vector)  # autoregressive generation\n```\n\n---\n\n## Evaluation Metrics\n\n| Metric | Description |\n|---|---|\n| PSNR \u002F SSIM \u002F LPIPS \u002F FID \u002F FVD | Standard pixel-level video quality |\n| **PCK@k** | 
Percentage of Correct Keypoints within k pixels (via CoTracker); measures spatial structure preservation |\n| **Controllability** | Action optimization error in latent space (CEM\u002Fgradient\u002Fgrid); isolates DiT action-following quality |\n| **Probe accuracy** | Trajectory success classifier on frozen features; measures semantic fidelity of generated videos |\n\n---\n\n## Citation\n\nIf you use this code or build on the paper, please cite:\n\n```bibtex\n@article{nilaksh2026reconstruction,\n  title={Reconstruction or Semantics? What Makes a Latent Space Useful for Robotic World Models},\n  author={Nilaksh and Saurav Jha and Artem Zholus and Sarath Chandar},\n  year={2026},\n  eprint={2605.06388},\n  archivePrefix={arXiv},\n  url={https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.06388}\n}\n```\n","This project trains action-conditioned latent diffusion world models for robot video generation and policy evaluation. It supports multiple encoders, such as a VAE and a DINOv2-based RAE, and uses semantic adapters to compress high-dimensional features into a compact 96-dimensional space, which a lightweight CNN then decodes to pixels. It also supports both flow matching and DDPM objectives, plus multi-view training that transfers single-view pretrained weights to three-camera setups. The codebase suits research on choosing latent spaces that better serve robotic world models, especially where action information, task progress, and planning utility must be preserved.","2026-06-11 03:59:26","CREATED_QUERY"]