[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-79087":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":23,"hasPages":23,"topics":25,"createdAt":10,"pushedAt":10,"updatedAt":28,"readmeContent":29,"aiSummary":30,"trendingCount":16,"starSnapshotCount":16,"syncStatus":31,"lastSyncTime":32,"discoverSource":33},79087,"PiD","nv-tlabs\u002FPiD","nv-tlabs","PiD: Fast and High-Resolution Latent Decoding with Pixel Diffusion","https:\u002F\u002Fresearch.nvidia.com\u002Flabs\u002Fsil\u002Fprojects\u002Fpid\u002F",null,"Python",717,36,10,7,0,5,74,569,45,8.7,"Other",false,"main",[26,27],"diffusion-decoder","pixel-diffusion","2026-06-12 02:03:49","# PiD — Pixel Diffusion Decoder\n\n> **TL;DR** — PiD is a plug-and-play diffusion decoder that replaces VAE\u002FRAE decoders, turning latent representations directly into super-resolved pixels in a single pass.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"figures\u002Fteaser.jpg\" alt=\"PiD teaser\" width=\"100%\">\n\u003C\u002Fp>\n\nhttps:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Fa556e2d4-5de5-4bcf-9daa-80f7ea6b2124\n\nPiD reformulates the latent-to-pixel decoder as a conditional pixel-space diffusion\nmodel, unifying decoding and upsampling into a single generative module.\nIt directly denoises in high-resolution pixel\nspace and produces a super-resolved image in one pass.\n\n**[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.23902), [Project Page](https:\u002F\u002Fresearch.nvidia.com\u002Flabs\u002Fsil\u002Fprojects\u002Fpid\u002F), [Model 
Weights](https:\u002F\u002Fhuggingface.co\u002Fnvidia\u002FPiD)**\n\n[Yifan Lu](https:\u002F\u002Fyifanlu0227.github.io\u002F),\n[Qi Wu](https:\u002F\u002Fwilsoncernwq.github.io\u002F),\n[Jay Zhangjie Wu](https:\u002F\u002Fzhangjiewu.github.io\u002F),\n[Zian Wang](https:\u002F\u002Fwww.cs.toronto.edu\u002F~zianwang\u002F),\n[Huan Ling](https:\u002F\u002Fwww.cs.toronto.edu\u002F~linghuan\u002F),\n[Sanja Fidler](https:\u002F\u002Fwww.cs.utoronto.ca\u002F~fidler\u002F),\n[Xuanchi Ren](https:\u002F\u002Fxuanchiren.com\u002F) \u003Cbr>\n\n## News\n- 🚀 [May 25, 2026] Paper, code, and model weights released, with PiD options for **FLUX**, **FLUX.2**, **Z-Image**, **Z-Image-Turbo**, **SD3**, **DINOv2**, and **SigLIP**.\n- 🔜 [Coming Soon] PiD option for **Qwen-Image**.\n- 🔜 [Coming Soon] PiD undistilled checkpoints.\n- ⏳ [Planned] Training scripts.\n\n## Installation\n\n> [!TIP]\n> **Quick Start** — if your environment already has PyTorch (with CUDA), `transformers>=4.57.x`, and `diffusers>=0.37`, you don't need to build a new conda env. Just install the small set of utility deps the inference code pulls eagerly and you're ready to run the diffusers backbones (`flux`\u002F`flux2`\u002F`sd3`\u002F`zimage`\u002F`zimage_turbo`):\n>\n> ```bash\n> pip install hydra-core omegaconf pyyaml \\\n>     attrs einops loguru termcolor fvcore iopath wandb \\\n>     imageio opencv-python-headless pandas \\\n>     safetensors sentencepiece boto3 botocore\n> pip install -e .\n> ```\n> To validate that your environment is ready for inference, run `python verify_env.py`.\n\n\nFull conda-managed install (preferred if you're starting from scratch):\n\n```bash\n# 1. Create and activate the conda environment.\nconda env create -f environment.yml\nconda activate pid\n\n# 2. Install this package in editable mode.\npip install -e .\n```\n\n## Checkpoints and assets\n\nPretrained PiD checkpoints live under `checkpoints\u002F`. 
Each diffusers backbone ships\ntwo variants — the original `2k` decoder (trained at 2048px) and a `2kto4k` decoder\n(trained with multi-resolution data bucketing 2048→3840 + an SD3-style dynamic\nshift, intended for 1024 LDM → 4K decoding). Pick the variant at the CLI via\n`--pid_ckpt_type {2k,2kto4k}` (default: `2k`).\n\n### Downloading\n\nThe released decoder weights and the encoder\u002Fdecoder (\"VAE\") weights they\ndepend on are hosted at [`nvidia\u002FPiD`](https:\u002F\u002Fhuggingface.co\u002Fnvidia\u002FPiD) on\nthe Hugging Face Hub. Pull just the `checkpoints\u002F` tree into this repo:\n\n```bash\nhf download nvidia\u002FPiD --local-dir . --include \"checkpoints\u002F*\"\n```\n\n## Running inference\n\nPiD ships two complementary entry points per backbone:\n\n| Backbone | `from_clean_*` (image → encode → PiD) | `from_ldm_*` (text\u002Fclass → LDM → PiD) |\n|----------|---------------------------------------|---------------------------------------|\n| flux     | `from_clean_flux.py`    | `from_ldm_flux.py`    |\n| flux2    | `from_clean_flux2.py`   | `from_ldm_flux2.py`   |\n| sd3      | `from_clean_sd3.py`     | `from_ldm_sd3.py`     |\n| zimage   | reuses `flux`           | `from_ldm_zimage.py`  |\n| zimage_turbo | reuses `flux`       | `from_ldm_zimage_turbo.py` |\n| dinov2   | `from_clean_dinov2.py`  | `from_ldm_dinov2.py`  |\n| siglip   | `from_clean_siglip.py`  | `from_ldm_siglip.py`  |\n\nAll scripts live under `pid\u002F_src\u002Finference\u002F` and decode each captured latent\ntwice — once with the backbone's native VAE (baseline) and once with PiD.\n\n> [!IMPORTANT]\n> Picking the checkpoint variant — `--pid_ckpt_type`\n> Every entry point accepts `--pid_ckpt_type {2k,2kto4k}` (default `2k`):\n>\n> - **`2k`** — the original 2048px-trained decoder.\n> - **`2kto4k`** — the up-to-4K-resolution decoder. Available for `flux` \u002F `flux2` \u002F `sd3` \u002F `zimage` \u002F `zimage_turbo` only. 
Worse than `2k` at 2048px resolution.\n>\n> For the exact checkpoint path for each backbone, see [docs\u002Fcheckpoints.md](docs\u002Fcheckpoints.md).\n> A quick sanity check that the right variant loaded: when `2kto4k` is active you\nshould see `PixelDiT dynamic shift: base_shift=4.0 base_image_size=1024` in the\ninit log; for `2k` that line is absent. Both `2k` and `2kto4k` support non-square aspect ratios.\n\n### 📕 `from_ldm_*`: text \u002F class → latent diffusion → PiD decode\n\nRuns the corresponding latent-diffusion backbone on a prompt (or class id for\nthe class-conditional `dinov2` backbone), captures the intermediate `x_t` at\nuser-specified denoising steps (early LDM termination) and the final clean `x_0`, then decodes\neach captured latent with both the native VAE \u002F RAE decoder (baseline) and PiD.\n\nFor `flux` \u002F `flux2` \u002F `sd3` \u002F `zimage` \u002F `zimage_turbo` the LDM is a HuggingFace `diffusers`\npipeline (`FluxPipeline`, `Flux2Pipeline`, `StableDiffusion3Pipeline`,\n`ZImagePipeline`).\n\nFor `dinov2` and `siglip` the LDM is the upstream\n[RAE](https:\u002F\u002Fgithub.com\u002Fbytetriper\u002FRAE) (class-conditional ImageNet-512) or\n[Scale-RAE](https:\u002F\u002Fgithub.com\u002FZitengWangNYU\u002FScale-RAE) (text-conditional\n256px) repo — see the optional-deps section below for installation.\n\n#### Example 1 — Single-GPU, single prompt (Flux, default `2k` decoder)\n\n```bash\nPYTHONPATH=. 
python -m pid._src.inference.from_ldm_flux \\\n    --prompt \"A photorealistic half-body portrait of a brown tabby cat with bold stripes sitting attentively on a rustic wooden kitchen table, soft morning light streaming sideways through a large window, fine fur detail and stripe patterns sharply visible, intense amber-green eyes in razor-sharp focus, warm farmhouse kitchen softly out of focus, cinematic shallow depth of field, ultra-detailed fur texture, photorealistic\" \\\n    --ldm_inference_steps 28 --save_xt_steps 24 \\\n    --output_dir .\u002Fresults\u002Fofficial_demo\u002Fflux \\\n    --cfg_scale 1 --pid_inference_steps 4 --scale 4\n```\n\n#### Example 2 — Single-GPU, 4K decode (Flux, `2kto4k` decoder)\n\nSame backbone as Example 1 but with `--resolution 1024 --pid_ckpt_type 2kto4k`,\nso the LDM produces a 1024² latent and PiD decodes it to 4K.\n\n```bash\nPYTHONPATH=. python -m pid._src.inference.from_ldm_flux \\\n    --prompt \"A photorealistic half-body portrait of a brown tabby cat with bold stripes sitting attentively on a rustic wooden kitchen table, soft morning light streaming sideways through a large window, fine fur detail and stripe patterns sharply visible, intense amber-green eyes in razor-sharp focus, warm farmhouse kitchen softly out of focus, cinematic shallow depth of field, ultra-detailed fur texture, photorealistic\" \\\n    --resolution 1024 --pid_ckpt_type 2kto4k \\\n    --ldm_inference_steps 28 --save_xt_steps 24 \\\n    --output_dir .\u002Fresults\u002Fofficial_demo\u002Fflux_4k \\\n    --cfg_scale 1 --pid_inference_steps 4 --scale 4\n```\n\n#### Example 3 — Multi-GPU with a prompt file (Z-Image)\n\n`torchrun` shards `--prompt_file` across ranks; each rank writes to\n`--output_dir` independently.\n\n```bash\nPYTHONPATH=. 
torchrun --nproc_per_node=4 \\\n    -m pid._src.inference.from_ldm_zimage \\\n    --prompt_file pid\u002F_src\u002Finference\u002Fprompts\u002Fprompt_creative.txt \\\n    --ldm_inference_steps 50 --save_xt_steps 46 \\\n    --output_dir .\u002Fresults\u002Fofficial_demo\u002Fzimage \\\n    --cfg_scale 1 --pid_inference_steps 4 --scale 4\n```\n\n#### Example 4 — Multi-GPU, 1K to 4K decode (Z-Image-Turbo, `2kto4k` decoder)\n\nZ-Image-Turbo defaults to 9 diffusers steps with `guidance_scale=0.0`. The final\nclean latent `x0` is always saved and is the recommended Turbo output to inspect.\n`--save_xt_steps 7` is optional; it saves an additional near-final `x_t` sample\nfor comparison.\n\n```bash\nPYTHONPATH=. torchrun --nproc_per_node=8 \\\n    -m pid._src.inference.from_ldm_zimage_turbo \\\n    --prompt_file pid\u002F_src\u002Finference\u002Fprompts\u002Fprompt_zimage_turbo.txt \\\n    --resolution 1024 --pid_ckpt_type 2kto4k \\\n    --output_dir .\u002Fresults\u002Fofficial_demo\u002Fzimage_turbo_4k \\\n    --cfg_scale 1 --pid_inference_steps 4 --scale 4\n```\n\n#### `dinov2` \u002F `siglip` backbones\n\nThe upstream RAE \u002F Scale-RAE LDMs don't live in `diffusers` — see\n[`docs\u002Fdinov2_siglip.md`](docs\u002Fdinov2_siglip.md) for setup and end-to-end\nexamples.\n\n#### Suggested step settings per diffusers backbone\n\n(See each script's docstring for the exact recipe.)\n\n| Backbone | LDM steps flag          | Default steps | Optional `--save_xt_steps` | Recommended latent |\n|----------|-------------------------|---------------|----------------------------|--------------------|\n| flux     | `--ldm_inference_steps` | 28            | `22 24 26`                 | step `24`          |\n| sd3      | `--ldm_inference_steps` | 28            | `22 24 26`                 | step `24`          |\n| flux2    | `--ldm_inference_steps` | 50            | `44 46 48`                 | step `46`          |\n| zimage   | `--ldm_inference_steps` | 50            | `44 46 48`      
           | step `46`          |\n| zimage_turbo | `--ldm_inference_steps` | 9         | `7`                        | `x0`               |\n\n---\n### 📗 `from_clean_*`: image → VAE encode → PiD decode\n\nNo latent diffusion model is run. The input image is encoded by the VAE,\noptionally corrupted with Gaussian noise at each\nsigma in `--degrade_sigmas`, then decoded by PiD at `--scale * input_resolution`.\n\nSingle-GPU example (Flux):\n\n```bash\nPYTHONPATH=. python -m pid._src.inference.from_clean_flux \\\n    --manifest assets\u002Fclean_image_manifest.jsonl \\\n    --input_resolution 512 \\\n    --degrade_sigmas 0.0 \\\n    --output_dir .\u002Fresults\u002Fofficial_demo_from_clean\u002Fflux \\\n    --cfg_scale 1 --pid_inference_steps 4 --scale 4\n```\n\nYou can pass a single image with `--input_path` and a prompt with `--prompt`\ninstead of `--manifest`, and a sigma sweep such as `--degrade_sigmas 0.0 0.2 0.4 0.8`\nto decode noise-corrupted latents.\n\nThe `dinov2` \u002F `siglip` `from_clean_*` flows take the same flags but with\ndifferent default resolutions and scales —\nsee [`docs\u002Fdinov2_siglip.md`](docs\u002Fdinov2_siglip.md).\n\n### Common arguments\n\n| Flag | Meaning |\n|------|---------|\n| `--pid_inference_steps`| Number of denoising steps for PiD (4 for the released distilled checkpoints) |\n| `--scale`              | PiD upscale factor (output = `baseline * scale`); 8 for Scale-RAE and 4 for other backbones |\n| `--cfg_scale`          | Classifier-free guidance scale for PiD |\n| `--output_dir`         | Where to write the side-by-side comparison images |\n| `--seed`               | Base random seed |\n\nMulti-GPU runs use `torchrun --nproc_per_node=N`; each rank processes a shard\nof the prompts \u002F manifest entries and writes to `--output_dir` independently.\n\n## Repository layout\n\n```\npid\u002F_src\u002Finference\u002F\n├── from_ldm_{flux,flux2,sd3,zimage,zimage_turbo,dinov2,siglip}.py  # text\u002Fclass → LDM → PiD decode\n├── 
from_clean_{flux,flux2,sd3,dinov2,siglip}.py       # image → encode → PiD decode\n├── _demo_common.py                                    # shared CLI + run loop for from_ldm_*\n├── _demo_from_clean_common.py                         # shared CLI + run loop for from_clean_*\n├── checkpoint_registry.py                             # backbone → PiD checkpoint mapping\n├── pipeline_registry.py                               # diffusers backbone → HF pipeline mapping\n├── rae_generation.py                                  # DINOv2-RAE LDM helpers (from_ldm_dinov2)\n├── scale_rae_generation.py                            # Scale-RAE LDM helpers (from_ldm_siglip)\n└── prompts\u002F                                           # prompt files for from_ldm_*\n```\n\n## License\n\nPiD codebase is licensed under the [Apache License 2.0](LICENSE).\n\n## Contributing\n\nSee [`CONTRIBUTING.md`](CONTRIBUTING.md) for development setup, code style,\nand the DCO sign-off requirement.\n\n## Acknowledgments\n\nThe authors would like to acknowledge [Yongsheng Yu](https:\u002F\u002Fwww.yongshengyu.com\u002F) and [Wei Xiong](https:\u002F\u002Fwxiong.me\u002F) for open-sourcing [PixelDiT](https:\u002F\u002Fpixeldit.github.io\u002F)'s model and weights, and thank Product Managers [Aditya Mahajan](https:\u002F\u002Fwww.linkedin.com\u002Fin\u002Faditya-mahajan1) and [Matt Cragun](https:\u002F\u002Fwww.linkedin.com\u002Fin\u002Fmcragun\u002F) for their valuable support and guidance.\n\n\n## Citation\n\n```bibtex\n@article{lu2026pid,\n    title={PiD: Fast and High-Resolution Latent Decoding with Pixel Diffusion},\n    author={Lu, Yifan and Wu, Qi and Wu, Jay Zhangjie and Wang, Zian and Ling, Huan and Fidler, Sanja and Ren, Xuanchi},\n    journal={arXiv preprint arXiv:2605.23902},\n    
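year={2026}\n}\n```\n","PiD is a plug-and-play diffusion decoder that converts latent representations directly into high-resolution pixel images. Its core technique unifies decoding and upsampling in a conditional pixel-space diffusion model, denoising directly in high-resolution pixel space and producing a super-resolved image in a single pass. The project is written in Python and suits applications that need to generate high-quality images quickly from low-resolution inputs or latent representations, such as image processing and computer vision tasks. PiD also supports several popular diffusion model backbones and provides pretrained weights to simplify deployment.",2,"2026-06-11 03:57:25","CREATED_QUERY"]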