[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-77540":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":14,"stars7d":17,"stars30d":18,"stars90d":16,"forks30d":16,"starsTrendScore":19,"compositeScore":20,"rankGlobal":10,"rankLanguage":10,"license":21,"archived":22,"fork":22,"defaultBranch":23,"hasWiki":22,"hasPages":22,"topics":24,"createdAt":10,"pushedAt":10,"updatedAt":25,"readmeContent":26,"aiSummary":27,"trendingCount":16,"starSnapshotCount":16,"syncStatus":14,"lastSyncTime":28,"discoverSource":29},77540,"RAEv2","nanovisionx\u002FRAEv2","nanovisionx","Official Implementation for RAEv2: Improved Baselines with Representation Autoencoders","",null,"Python",258,10,2,4,0,25,230,18,69.62,"Other",false,"main",[],"2026-06-12 04:01:21","## Improved Baselines with Representation Autoencoders\u003Cbr>\u003Csub>Official PyTorch Implementation\u003C\u002Fsub>\n\n### [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.18324) | [Project Page](https:\u002F\u002Fraev2.github.io)\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002Fhero.png\" width=\"95%\" alt=\"RAEv2 pareto-optimal reconstruction-generation and 10x faster convergence\">\n\u003C\u002Fp>\n\nThis repository contains the **PyTorch\u002FGPU** implementation of our paper:\nImproved Baselines with Representation Autoencoders.\n\n> [**Improved Baselines with Representation Autoencoders**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.18324)\u003Cbr>\n> [Jaskirat Singh](https:\u002F\u002F1jsingh.github.io\u002F)\u003Csup>1,2\u003C\u002Fsup>, [Boyang Zheng](http:\u002F\u002Fbytetriper.github.io\u002F)\u003Csup>3\u003C\u002Fsup>, [Zongze Wu](https:\u002F\u002Fbetterze.github.io\u002Fwebsite\u002F)\u003Csup>1\u003C\u002Fsup>, [Richard 
Zhang](https:\u002F\u002Fscholar.google.com\u002Fcitations?user=LW8ze_UAAAAJ&hl=en)\u003Csup>1\u003C\u002Fsup>, [Eli Shechtman](https:\u002F\u002Fscholar.google.com\u002Fcitations?user=B_FTboQAAAAJ)\u003Csup>1\u003C\u002Fsup>, [Saining Xie](https:\u002F\u002Fwww.sainingxie.com\u002F)\u003Csup>3\u003C\u002Fsup>\u003Cbr>\n> \u003Csup>1\u003C\u002Fsup>Adobe Research, \u003Csup>2\u003C\u002Fsup>ANU, \u003Csup>3\u003C\u002Fsup>New York University\u003Cbr>\n\n```bibtex\n@article{singh2026raev2,\n  title={Improved Baselines with Representation Autoencoders},\n  author={Singh, Jaskirat and Zheng, Boyang and Wu, Zongze and Zhang, Richard and Shechtman, Eli and Xie, Saining},\n  journal={arXiv preprint arXiv:2605.18324},\n  year={2026}\n}\n```\n\nRAEv2 simplifies and improves representation autoencoders, achieving over 10x faster convergence, better generation, and better reconstruction. RAEv2 achieves state-of-the-art gFID and FDr6 in just 80 epochs, compared with 800 epochs for prior baselines, without any post-training. We also validate the improved training recipe in diverse settings, including T2I generation and world models, and observe consistent improvements.\n\n## Dependency Setup\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fnanovisionx\u002FRAEv2.git\ncd RAEv2\n\n# install uv project manager (if you don't already have it)\ncurl -LsSf https:\u002F\u002Fastral.sh\u002Fuv\u002Finstall.sh | sh\n\n# install dependencies\nuv sync\n```\n\n## Data\n\nPre-processed datasets at 256x256. 
All rights to the original owners.\n\n| Subset | Task | Source | Format | Notes |\n|---|---|---|---|---|\n| `imagenet-256` | ImageNet | [ImageNet](https:\u002F\u002Fimage-net.org\u002F) | Arrow | Or use your own ImageNet |\n| `blip3o-256` | T2I | [BLIP3o](https:\u002F\u002Fhuggingface.co\u002FBLIP3o) | WDS | Captioned image pairs |\n| `rendertext-256` | T2I | [RenderedText](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fwendlerc\u002FRenderedText) | WDS | Rendered-text images |\n| `scale-rae-256` | T2I | [Scale-RAE](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fnyu-visionx\u002Fscale-rae-data) | WDS | Synthetic FLUX images |\n| `recon-256` | NWM | [RECON](https:\u002F\u002Fsites.google.com\u002Fview\u002Frecon-robot\u002Fdataset) | WDS | Robot navigation frames |\n\n```bash\n# (Recommended) ~5-10x faster downloads\nexport HF_HUB_ENABLE_HF_TRANSFER=1\n\n# Download all subsets into data\u002F\nuv run hf download nanovisionx\u002FRAEv2-data --repo-type dataset --exclude .gitattributes --local-dir data\u002F\n\n# Or download a specific subset (uncomment one):\n# uv run hf download nanovisionx\u002FRAEv2-data --repo-type dataset --exclude .gitattributes --include \"imagenet-256\u002F**\"     --local-dir data\u002F   # ImageNet\n# uv run hf download nanovisionx\u002FRAEv2-data --repo-type dataset --exclude .gitattributes --include \"blip3o-256\u002F**\"       --local-dir data\u002F   # BLIP3o\n# uv run hf download nanovisionx\u002FRAEv2-data --repo-type dataset --exclude .gitattributes --include \"rendertext-256\u002F**\"  --local-dir data\u002F   # RenderedText\n# uv run hf download nanovisionx\u002FRAEv2-data --repo-type dataset --exclude .gitattributes --include \"scale-rae-256\u002F**\"    --local-dir data\u002F   # Scale-RAE\n# uv run hf download nanovisionx\u002FRAEv2-data --repo-type dataset --exclude .gitattributes --include \"recon-256\u002F**\"        --local-dir data\u002F   # RECON\n```\n\n## Pretrained Models\n\n```bash\n# Download all (encoders + 
stage 1 + stage 2)\nuv run hf download nyu-visionx\u002FRAEv2-models --exclude .gitattributes --local-dir pretrained_models\u002F\n\n# Or download a specific subset (uncomment one):\n# uv run hf download nyu-visionx\u002FRAEv2-models --include \"encoders\u002F**\" --exclude .gitattributes --local-dir pretrained_models\u002F   # Pretrained vision encoders\n# uv run hf download nyu-visionx\u002FRAEv2-models --include \"stage1\u002F**\"   --exclude .gitattributes --local-dir pretrained_models\u002F   # RAEv2 stage 1 checkpoints\n# uv run hf download nyu-visionx\u002FRAEv2-models --include \"stage2\u002F**\"   --exclude .gitattributes --local-dir pretrained_models\u002F   # RAEv2 stage 2 checkpoints\n```\n\n## Stage 1: Generalized Representation Autoencoders\n\n### Training\n\nWe support 80+ pre-trained vision encoders across different encoder families and sizes (DINOv2, DINOv3, WebSSL, EUPE, MAE, iJEPA, MoCov3, CLIP, SigLIP2 etc.). See `src\u002Fencoders\u002F` for the full list and naming spec.\n\n**Naming**: e.g.\n- DINOv3-L: `dinov3-vit-l16`\n- DINOv3-L-K7 (multi-layer-sum, last 7 layers): `dinov3mls-vit-l16[layers=11.13.15.17.19.21.23]`\n\n```bash\nexport WANDB_ENTITY=\u003Cyour-entity>\nexport WANDB_PROJECT=\u003Cyour-project>\nexport EXPERIMENT_NAME=\u003Cyour-run-name>\n\nuv run torchrun --nproc_per_node=8 \\\n    src\u002Ftrain_stage1.py \\\n    --config \u003CCONFIG_PATH> \\\n    --results-dir ckpts\u002Fstage1 \\\n    --precision bf16 \\\n    --compile \\\n    --wandb\n```\n\n**ImageNet config:**\n- `configs\u002Fstage1\u002Ftraining\u002Fdinov3l-k7-imagenet.yaml`\n- `configs\u002Fstage1\u002Ftraining\u002Fdinov3l-k23-imagenet.yaml`\n\n**General Config:** Similar to proprietary VAEs, training with more data helps further improve reconstruction performance.\n- `configs\u002Fstage1\u002Ftraining\u002Fdinov3l-k7-general.yaml`\n- `configs\u002Fstage1\u002Ftraining\u002Fdinov3l-k23-general.yaml`\n\n**After training:** extract the EMA decoder and compute encoder 
statistics for latent normalization.\n\n```bash\n# 1. Extract EMA decoder from the final checkpoint\nuv run python scripts\u002Fstage1\u002Fextract_decoder.py \\\n    --config \u003CCONFIG_PATH> \\\n    --ckpt ckpts\u002Fstage1\u002F\u003CRUN_NAME>\u002Fcheckpoints\u002Fep-XXXXXXX.pt \\\n    --use-ema \\\n    --out pretrained_models\u002Fstage1\u002F\u003Cimagenet|general>\u002F\u003Cencoder>-k\u003CN>\u002Fdecoder.pt\n\n# 2. Compute encoder stats (multi-GPU, single node)\nuv run torchrun --nproc_per_node=8 \\\n    scripts\u002Fstage1\u002Fcompute_encoder_stats.py \\\n    --config \u003CCONFIG_PATH> \\\n    --use-hf-dataset \\\n    --hf-data-dir data\u002Fimagenet-256 \\\n    --batch-size 256 \\\n    --output-path pretrained_models\u002Fstage1\u002F\u003Cimagenet|general>\u002F\u003Cencoder>-k\u003CN>\u002Fstats.pt\n```\n\n### Evaluation\n\n**Sampling**: reconstruct an image with a trained RAE. See `configs\u002Fstage1\u002Fsampling\u002F` for the full list of sampling configs. E.g.,\n\n```bash\nuv run python scripts\u002Fstage1\u002Fsample.py \\\n    --config configs\u002Fstage1\u002Fsampling\u002Fdinov3l-k23-general.yaml \\\n    --image assets\u002Fsamples\u002Fsample_1.png\n```\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002Frecon_comparison.png\" width=\"100%\" alt=\"Stage 1 reconstruction comparison: FLUX VAE \u002F SD-VAE \u002F SDXL-VAE \u002F RAEv2 K=23 \u002F RAEv2 K=7 on handwritten text\">\n\u003C\u002Fp>\n\n**Evaluation**: offline reconstruction metrics (rFID, PSNR, SSIM, LPIPS) on eval datasets (e.g., ImageNet, RenderedText, etc.).\n\n```bash\nexport EXPERIMENT_NAME=\u003Cyour-run-name>\n\nuv run torchrun --nproc_per_node=8 \\\n    src\u002Foffline_eval_stage1.py \\\n    --config configs\u002Fstage1\u002Fsampling\u002Fdinov3l-k23-general.yaml\n```\n\n## Stage 2: Latent Diffusion Transformers\n\n### Training\n\nWe support training RAEv2 across diverse settings: ImageNet, text-to-image (T2I), and navigation world models.\n\n```bash\nexport 
WANDB_ENTITY=\u003Cyour-entity>\nexport WANDB_PROJECT=\u003Cyour-project>\nexport EXPERIMENT_NAME=\u003Cyour-run-name>\n\nuv run torchrun --nproc_per_node=8 \\\n    src\u002Ftrain.py \\\n    --config \u003CCONFIG_PATH> \\\n    --results-dir ckpts\u002Fstage2 \\\n    --precision bf16 \\\n    --compile \\\n    --wandb\n```\n\nExample training configs for different tasks (all under `configs\u002Fstage2\u002Ftraining\u002F`):\n\n| Task | k=1 | k=7 | k=23 |\n|---|---|---|---|\n| ImageNet | `imagenet-dinov3l-k1.yaml` | `imagenet-dinov3l-k7.yaml` | `imagenet-dinov3l-k23.yaml` |\n| T2I | `t2i-dinov3l-k1.yaml` | `t2i-dinov3l-k7.yaml` | `t2i-dinov3l-k23.yaml` |\n| NWM | `nwm-dinov3l-k1.yaml` | `nwm-dinov3l-k7.yaml` | `nwm-dinov3l-k23.yaml` |\n\n### Evaluation\n\n**Online Evaluation**: Similar to [JiT](https:\u002F\u002Fgithub.com\u002FLTH14\u002FJiT), we support online evaluation during training. See the `eval` block in any config under `configs\u002Fstage2\u002Ftraining\u002F`.\n\n| Task | Supported metrics |\n|---|---|\n| Stage 1 - Reconstruction | rFID, PSNR, LPIPS, SSIM |\n| Stage 2 - ImageNet | gFID, Inception Score, [FDr6](https:\u002F\u002Fgithub.com\u002FJiawei-Yang\u002FFD-loss) (6 representation spaces), [MIND](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2605.06797) \u002F [torch-fidelity](https:\u002F\u002Fgithub.com\u002Ftoshas\u002Ftorch-fidelity) (6 representation spaces) |\n| Stage 2 - T2I | GenEval, DPGBench, GenAI Bench, gFID, VQAScore |\n| Stage 2 - NWM | LPIPS, gFID |\n\n**Offline Evaluation**: Model checkpoints can also be evaluated after training.\n```bash\nexport EXPERIMENT_NAME=\u003Cyour-run-name>\nuv run torchrun --nproc_per_node=8 src\u002Foffline_eval.py \\\n    --config configs\u002Fstage2\u002Fsampling\u002Fimagenet-dinov3l-k7.yaml\n```\n\n## Acknowledgments\n\nThe codebase is built upon some amazing projects:\n- [RAE](https:\u002F\u002Fgithub.com\u002Fbytetriper\u002FRAE)\n- [REPA](https:\u002F\u002Fgithub.com\u002Fsihyun-yu\u002FREPA)\n- 
[REPA-E](https:\u002F\u002Fgithub.com\u002FEnd2End-Diffusion\u002FREPA-E)\n- [JiT](https:\u002F\u002Fgithub.com\u002FLTH14\u002FJiT)\n\nWe thank the authors for making their work publicly available. We also sincerely thank Xingjian Leng for support and help with online GenEval and DPGBench evaluation during T2I training.\n\n## BibTeX\n\n```bibtex\n@article{singh2026raev2,\n  title={Improved Baselines with Representation Autoencoders},\n  author={Singh, Jaskirat and Zheng, Boyang and Wu, Zongze and Zhang, Richard and Shechtman, Eli and Xie, Saining},\n  journal={arXiv preprint arXiv:2605.18324},\n  year={2026}\n}\n```\n","RAEv2 is a PyTorch-based project that improves representation autoencoders to raise baseline model performance. Its core contributions include a simplified model design, more than 10x faster convergence, and better results on generation and reconstruction tasks. The project uses GPU acceleration and shows clear gains across multiple image generation and world-model tasks. It suits applications that need efficient, high-quality image generation or reconstruction, such as image processing and computer vision research.","2026-06-11 03:55:36","CREATED_QUERY"]