[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-80897":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":13,"openIssues":13,"contributorsCount":14,"subscribersCount":14,"size":14,"stars1d":15,"stars7d":15,"stars30d":16,"stars90d":14,"forks30d":14,"starsTrendScore":17,"compositeScore":18,"rankGlobal":10,"rankLanguage":10,"license":19,"archived":20,"fork":20,"defaultBranch":21,"hasWiki":22,"hasPages":20,"topics":23,"createdAt":10,"pushedAt":10,"updatedAt":31,"readmeContent":32,"aiSummary":33,"trendingCount":14,"starSnapshotCount":14,"syncStatus":34,"lastSyncTime":35,"discoverSource":36},80897,"RTDMD","Harahan\u002FRTDMD","Harahan","[Arxiv 2026] This is the official PyTorch implementation of \"RTDMD: Reinforcing Few-step Generators via Reward-Tilted Distribution Matching\"","https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.26108",null,"Python",24,1,0,5,8,15,54.2,"Apache License 2.0",false,"main",true,[24,25,26,27,28,29,30],"dmd","flux","flux2-klein","grpo","reinforcement-learning","sd35","text-to-image","2026-06-12 04:01:30","\u003Cdiv align=\"center\">\n\n\n\u003Cimg width=\"70%\" height=\"70%\" alt=\"logo\" src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F4d534e80-f8ec-4c0b-948f-730cc0311961\" \u002F>\n\n\n\u003Ch2> Reinforcing Few-step Generators via Reward-Tilted Distribution Matching \u003C\u002Fh2>\n\n\u003Cp>\u003Cb>Reward-Tilted DMD &nbsp;·&nbsp; Ambient-Consistent Distillation &nbsp;·&nbsp; Hybrid Policy 
Gradient\u003C\u002Fb>\u003C\u002Fp>\n\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-arXiv-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.26108)\n[![Github](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FHarahan%2FRTDMD-000000?style=for-the-badge&logo=github&logoColor=white)](https:\u002F\u002Fgithub.com\u002FHarahan\u002FRTDMD)\n[![Hugging Face Collection](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FRTDMD_Collection-fcd022?style=for-the-badge&logo=huggingface&logoColor=000)](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002FHarahan\u002Frtdmd)\n\n[![License: Apache 2.0](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-Apache%202.0-blue.svg)](https:\u002F\u002Fopensource.org\u002Flicenses\u002FApache-2.0)\n[![Python](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPython-3.10%2B-blue.svg)](https:\u002F\u002Fwww.python.org\u002F)\n\n\u003C\u002Fdiv>\n\n\u003Cdiv align=\"center\">\n\n[Yushi Huang](https:\u002F\u002Fharahan.github.io\u002F)\u003Csup>1, 2,\u003C\u002Fsup>\\*\u003Csup>†\u003C\u002Fsup>, [Xiangxin Zhou](https:\u002F\u002Fzhouxiangxin1998.github.io\u002F)\u003Csup>2,\u003C\u002Fsup>\\*, Ruoyu Wang\u003Csup>2, 3,\u003C\u002Fsup>\\*\u003Csup>†\u003C\u002Fsup>, [Chi Zhang](https:\u002F\u002Ficoz69.github.io\u002F)\u003Csup>3\u003C\u002Fsup>, [Jun Zhang](https:\u002F\u002Feejzhang.people.ust.hk\u002F)\u003Csup>1\u003C\u002Fsup>,[Tianyu Pang](https:\u002F\u002Fp2333.github.io\u002F)\u003Csup>2,\u003C\u002Fsup>‡\n\n\u003Csup>1\u003C\u002Fsup>The Hong Kong University of Science and Technology &nbsp;&nbsp;\n\u003Csup>2\u003C\u002Fsup>Tencent Hunyuan &nbsp;&nbsp;\n\u003Csup>3\u003C\u002Fsup>Westlake University\n\n\\* Equal contribution &nbsp;·&nbsp; † Work done during internship at Tencent Hunyuan &nbsp;·&nbsp; ‡ Corresponding author\n\n\u003C\u002Fdiv>\n\n---\n\n## 📑 Table of Contents\n\n- [📖 Abstract](#-abstract)\n- [🍭 Method Overview](#-method-overview)\n- [📊 Main 
Results](#-main-results)\n- [✅ TODO](#-todo)\n- [📁 Repository Layout](#-repository-layout)\n- [🛠️ Installation](#%EF%B8%8F-installation)\n- [🚀 Quick Start](#-quick-start)\n  - [1. Cold-start distillation (AC-DMD)](#1-cold-start-distillation-ac-dmd)\n  - [2. RL fine-tune (RTDMD = GRPO + AC-DMD \u002F BP aux)](#2-rl-fine-tune-rtdmd--grpo--ac-dmd--bp-aux)\n  - [3. Inference](#3-inference)\n  - [4. Reward evaluation](#4-reward-evaluation)\n- [⚙️ Configuration](#%EF%B8%8F-configuration)\n- [🎁 Reward Scorers](#-reward-scorers)\n- [🙌 Acknowledgements](#-acknowledgements)\n- [📄 Citation](#-citation)\n- [⚖️ License](#%EF%B8%8F-license)\n\n---\n\n## 📖 Abstract\n\nWe propose **Reward-Tilted Distribution Matching Distillation (RTDMD)**, a\ntwo-stage framework that unifies distribution-matching distillation with\nreward-guided RL for few-step flow generators. Minimizing the KL divergence to\na *reward-tilted teacher distribution* decomposes naturally into a\n**distribution-matching** term and a **reward-maximization** term — instantiated\nas **Ambient-Consistent DMD (AC-DMD)** for the cold start and a **hybrid policy\ngradient** (SubGRPO + final-step reward back-propagation) for the RL stage.\nWith **4 NFE** RTDMD reaches new SOTA on SD3-M \u002F SD3.5-M \u002F FLUX.2 4B; the\ndistilled FLUX.2 4B even beats the full FLUX.2 9B teacher (50 NFE) on most\nrewards.\n\n\u003Ctable align=\"center\">\n  \u003Ctr>\n    \u003Ctd align=\"center\" width=\"50%\">\n      \u003Cimg src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Fcb1fb0da-d388-4846-9017-66bccebd0749\" alt=\"RTDMD teaser\" width=\"100%\">\n      \u003Cbr\u002F>\n      \u003Cem>4-step samples from RTDMD-distilled FLUX.2 4B (no classifier-free guidance).\u003C\u002Fem>\n    \u003C\u002Ftd>\n    \u003Ctd align=\"center\" width=\"50%\">\n      \u003Cimg src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F3d76d99a-6fe1-4059-8e68-9461ba067b01\" alt=\"RTDMD comparison\" width=\"100%\">\n 
     \u003Cbr\u002F>\n      \u003Cem>Qualitative comparison for few-step diffusion models (4 NFE).\u003C\u002Fem>\n    \u003C\u002Ftd>\n  \u003C\u002Ftr>\n\u003C\u002Ftable>\n\n---\n\n## 🍭 Method Overview\n\n\u003Cdiv align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F61a64fca-a143-40ae-9e36-79c6fcb5b696\" alt=\"RTDMD method overview\" width=\"70%\">\n  \u003Cbr\u002F>\n  \u003Cem>RTDMD overview. \u003Cb>Det.\u003C\u002Fb> = deterministic final step, \u003Cb>Stoc.\u003C\u002Fb> = stochastic intermediate steps. Trajectories: teacher (blue), few-step generator (green), fake score (yellow).\u003C\u002Fem>\n\u003C\u002Fdiv>\n\nFor the generator $G_\\theta$, the reward-tilted KL objective decomposes as\n\n$$\n\\nabla_\\theta D_{\\text{KL}}(p_\\theta \\| \\tilde{p}_\\psi) =\n\\underbrace{\\nabla_\\theta D_{\\text{KL}}(p_\\theta \\| p_\\psi)}_{\\text{distribution matching}} - \\beta\\underbrace{\\nabla_\\theta \\mathbb{E}_{\\hat{\\mathbf{x}}_0 \\sim p_\\theta}[r(\\hat{\\mathbf{x}}_0)]}_{\\text{reward maximization}}.\n$$\n\nThe two terms map directly to the two trainers exposed by the CLI:\n\n| Stage | Trainer | Key knobs |\n| --- | --- | --- |\n| 1. AC-DMD cold start | `ACDMDTrainer` (`--trainer ac_dmd`) | sub-interval renoising, consistency weight `γ`, CPS sampler `η = 0.9` |\n| 2. RTDMD RL fine-tune | `RTDMDTrainer` (`--trainer rtdmd`)  | SubGRPO + final-step BP + AC-DMD |\n\n---\n\n## 📊 Main Results\n\nAll numbers are on **4 NFE** (4 inference steps); the teacher uses its standard\nmulti-step setting. 
**Bold** = best; \u003Cins>underline\u003C\u002Fins> = second-best.\n\n### SD3-M (paper Table 1)\n\n| Method | NFE | CLIPScore ↑ | Aesthetic ↑ | PickScore ↑ | HPSv2 ↑ | ImageReward ↑ |\n| --- | :---: | :---: | :---: | :---: | :---: | :---: |\n| SD3-M teacher (w\u002F CFG) | 100 | 0.2936 | 5.5711 | 22.3236 | 0.2810 | 1.0759 |\n| GDMD               | 4 | 0.2930 | 5.8728 | 22.4614 | \u003Cins>0.3076\u003C\u002Fins> | 1.2702 |\n| R\u003Csub>dm\u003C\u002Fsub>     | 4 | \u003Cins>0.2936\u003C\u002Fins> | \u003Cins>5.8769\u003C\u002Fins> | \u003Cins>22.5783\u003C\u002Fins> | 0.2957 | \u003Cins>1.2897\u003C\u002Fins> |\n| **RTDMD (Ours)**   | 4 | **0.3161** | **5.9642** | **22.8593** | **0.3211** | **1.3024** |\n\nRTDMD is the only 4-NFE model that **surpasses the 100-NFE SD3-M teacher with\nCFG** across all five metrics — see the paper for the full baseline table.\n\n### FLUX.2 4B (paper Table 2)\n\n| Method | NFE | ImageReward ↑ | CLIPScore ↑ | Aesthetic ↑ | PickScore ↑ | HPSv2 ↑ | HPSv3 ↑ | GenEval ↑ | GenEval2 ↑ | OCR ↑ |\n| --- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |\n| FLUX.2 4B teacher       | 50 | 0.8538 | 0.2834 | 5.3333 | 22.3938 | 0.2771 | 11.7025 | 0.7631 | 0.2207 | 0.6133 |\n| FLUX.2 9B teacher       | 50 | 1.0021 | \u003Cins>0.2962\u003C\u002Fins> | 5.2030 | 22.6382 | 0.2800 | 11.6883 | 0.7568 | 0.3557 | 0.7432 |\n| Z-Image 6B              | 50 | 0.7841 | 0.2841 | 5.2488 | 22.2118 | 0.2714 | 10.0857 | 0.6563 | 0.3012 | 0.7373 |\n| Z-Image-Turbo 6B        |  4 | 0.9696 | 0.2764 | 5.2894 | 22.7994 | 0.2954 | 12.9136 | 0.7562 | 0.3530 | 0.7539 |\n| FLUX.2 4B               |  4 | 1.0506 | 0.2864 | 5.2658 | 22.7370 | 0.2890 | 12.9295 | 0.7722 | 0.2403 | 0.6375 |\n| FLUX.2 9B               |  4 | \u003Cins>1.1998\u003C\u002Fins> | 0.2919 | \u003Cins>5.3730\u003C\u002Fins> | \u003Cins>23.0178\u003C\u002Fins> | 0.2991 | 13.2955 | \u003Cins>0.7814\u003C\u002Fins> | \u003Cins>0.3570\u003C\u002Fins> | 
\u003Cins>0.7566\u003C\u002Fins> |\n| Z-Image 6B w\u002F TDM-R1    |  4 | 1.1543 | 0.2836 | 5.2450 | 22.8202 | \u003Cins>0.3064\u003C\u002Fins> | \u003Cins>13.4349\u003C\u002Fins> | 0.7737 | **0.4073** | **0.7665** |\n| **FLUX.2 4B w\u002F RTDMD (Ours)** | 4 | **1.3712** | **0.3219** | **5.7746** | **23.9642** | **0.3516** | **15.5772** | **0.9046** | 0.2755 | 0.6858 |\n\nRTDMD on FLUX.2 4B is the best 4-NFE model on **7 of 9** rewards\n(ImageReward \u002F CLIPScore \u002F Aesthetic \u002F PickScore \u002F HPSv2 \u002F HPSv3 \u002F GenEval)\nand beats the **FLUX.2 9B teacher at 50 NFE** on every one of those seven,\nincluding **+0.37 ImageReward**, **+0.57 Aesthetic**, **+1.33 PickScore**,\n**+3.89 HPSv3**, and **+0.15 GenEval**.\n\n---\n\n## ✅ TODO\n\n- [ ] Release more RTDMD checkpoints (FLUX.2 9B and FLUX.1-dev) on the [RTDMD HF collection](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002FHarahan\u002Frtdmd)\n\n---\n\n## 📁 Repository Layout\n\n```\nRTDMD\u002F\n├── main.py                # Training entry point\n├── inference.py           # Inference entry point\n├── configs\u002F\n│   ├── cold_start\u002F        # AC-DMD distillation YAMLs (5 backbones)\n│   ├── rtdmd\u002F             # RTDMD RL fine-tune YAMLs (5 backbones)\n│   └── inference\u002F         # Inference YAMLs (5 backbones)\n├── rtdmd\u002F                 # Source package: trainers\u002F, models\u002F, schedulers\u002F,\n│                          # rewards\u002F, data\u002F, parallel\u002F, utils\u002F, diffusers_patch\u002F\n└── scripts\u002F\n    ├── cold_start.sh      # AC-DMD launcher (single \u002F multi-node)\n    ├── rtdmd.sh           # RTDMD launcher  (single \u002F multi-node)\n    ├── inference.sh       # Inference launcher\n    └── merge_lora_transformer.py\n```\n\n---\n\n## 🛠️ Installation\n\nReference environment (the setup used to produce the paper numbers):\n\n| Component | Version |\n| --- | --- |\n| Python    | 3.10 |\n| CUDA      | 12.4 |\n| PyTorch   | 2.6.0 |\n| GPU       | NVIDIA H20 \u002F H100 
\u002F H800 \u002F A100-80GB |\n| NCCL \u002F IB | RoCE or InfiniBand for multi-node |\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FHarahan\u002FRTDMD.git\ncd RTDMD\n\nconda create -n rtdmd python=3.10 -y\nconda activate rtdmd\n\npip install -r requirements.txt\npip install -e .\n```\n\n`requirements.txt` is a pinned snapshot of the paper environment\n(`flash-attn`, `peft`, the exact `diffusers` git commit, and `mmcv` \u002F\n`mmdet` for the GenEval scorer). If `flash-attn` fails to build, drop the\nline — the model loaders fall back to PyTorch SDPA automatically.\n\n### Pretrained models\n\n`pretrained_path` and `*_init_path` accept either a local directory or a\nHuggingFace Hub repo id; `diffusers.from_pretrained()` downloads and caches\nthe weights on first use. Gated repos (e.g. `black-forest-labs\u002FFLUX.1-dev`)\nrequire `huggingface-cli login` with an authorized token first.\n\n### Reward checkpoints\n\nPoint `RTDMD_REWARD_CKPT_PATH` (or each config's `reward_ckpt_path`) at a\nlocal directory for the reward-model weights. 
**Most scorers auto-download\non first use**: PickScore, HPSv3, ImageReward, CLIPScore, GenEval2\n(Qwen3-VL Soft-TIFA), OCR (PaddleOCR), and the GenEval Mask2Former\nbackbone (pulled from the OpenMMLab CDN into `reward_ckpts\u002F`).\n\nOnly two scorers need a one-time `wget`:\n\n```bash\nmkdir -p reward_ckpts && cd reward_ckpts\n# Aesthetic predictor (LAION)\nwget https:\u002F\u002Fgithub.com\u002Fchristophschuhmann\u002Fimproved-aesthetic-predictor\u002Fraw\u002Frefs\u002Fheads\u002Fmain\u002Fsac+logos+ava1-l14-linearMSE.pth\n# HPSv2.1 (OpenCLIP backbone + HPS classifier head)\nwget https:\u002F\u002Fhuggingface.co\u002Flaion\u002FCLIP-ViT-H-14-laion2B-s32B-b79K\u002Fresolve\u002Fmain\u002Fopen_clip_pytorch_model.bin\nwget https:\u002F\u002Fhuggingface.co\u002Fxswu\u002FHPSv2\u002Fresolve\u002Fmain\u002FHPS_v2.1_compressed.pt\ncd ..\nexport RTDMD_REWARD_CKPT_PATH=$(pwd)\u002Freward_ckpts\n```\n\nGenEval evaluates against the COCO-80 object categories (the\nMask2Former detector we use is trained on COCO) — the class-name lookup\nships at `rtdmd\u002Frewards\u002Fassets\u002Fobject_names.txt`, so no extra setup is\nneeded beyond `pip install -r requirements.txt`.\n\nOptional pre-warm so the first training step doesn't stall on\nHuggingFace downloads:\n\n```bash\npython - \u003C\u003C'PY'\nfrom transformers import AutoModel, AutoProcessor, CLIPModel, CLIPProcessor\nAutoProcessor.from_pretrained(\"laion\u002FCLIP-ViT-H-14-laion2B-s32B-b79K\")\nAutoModel.from_pretrained(\"yuvalkirstain\u002FPickScore_v1\")\nCLIPModel.from_pretrained(\"openai\u002Fclip-vit-large-patch14\")\nCLIPProcessor.from_pretrained(\"openai\u002Fclip-vit-large-patch14\")\nPY\n```\n\n---\n\n## 🚀 Quick Start\n\nAll examples below use **FLUX.2-klein 4B**. The other four supported\nbackbones (SD3-M, SD3.5-M, FLUX.1-dev, FLUX.2-klein 9B) use the exact same\ncommands — only the YAML basename changes under each\n`configs\u002F{cold_start,rtdmd,inference}\u002F` directory.\n\n### 1. 
Cold-start distillation (AC-DMD)\n\nAll five models run cold-start on **1 node × 8 GPUs**:\n\n```bash\nbash scripts\u002Fcold_start.sh 8 configs\u002Fcold_start\u002Fflux2_4b.yaml\n```\n\n### 2. RL fine-tune (RTDMD = GRPO + AC-DMD \u002F BP aux)\n\nRecommended scale per model:\n\n| Model              | Nodes × GPUs\u002Fnode | Total GPUs |\n| --- | --- | --- |\n| SD3.5-M            | 1 × 8             | 8          |\n| SD3-M              | 2 × 8             | 16         |\n| FLUX.2-klein 4B    | 2 × 8             | 16         |\n| FLUX.1-dev         | 4 × 8             | 32         |\n| FLUX.2-klein 9B    | 4 × 8             | 32         |\n\nSingle-node (e.g., SD3.5-M):\n\n```bash\nbash scripts\u002Frtdmd.sh 8 configs\u002Frtdmd\u002Fsd35m.yaml\n```\n\nMulti-node — FLUX.2-klein 4B on 2 × 8 GPUs (set the env vars on **each** node):\n\n```bash\n# Node 0\nNNODES=2 NODE_RANK=0 MASTER_ADDR=\u003Cchief-ip> \\\n    bash scripts\u002Frtdmd.sh 8 configs\u002Frtdmd\u002Fflux2_4b.yaml\n\n# Node 1\nNNODES=2 NODE_RANK=1 MASTER_ADDR=\u003Cchief-ip> \\\n    bash scripts\u002Frtdmd.sh 8 configs\u002Frtdmd\u002Fflux2_4b.yaml\n```\n\nFor 4-node jobs (FLUX.1-dev \u002F FLUX.2-klein 9B) set `NNODES=4` and launch on\nranks `0..3` the same way. When the scheduler exports `CHIEF_IP \u002F INDEX \u002F\nHOST_NUM \u002F HOST_GPU_NUM` these are picked up automatically.\n\n### 3. Inference\n\nOne YAML per model under `configs\u002Finference\u002F`. Each ships with the\n**distilled + RL LoRA stack** enabled by default. 
The three LoRA regimes are\nselected by the YAML's `lora_paths`:\n\n- `lora_paths: []`                 → plain pretrained model, no LoRA\n- `lora_paths: [distilled]`        → distilled-only LoRA\n- `lora_paths: [distilled, rl]`    → distilled + RL LoRAs merged in order *(YAML default)*\n\nDistilled few-step generation (FLUX.2-klein 4B), 8 GPUs, no reward scoring:\n\n```bash\nbash scripts\u002Finference.sh 8 configs\u002Finference\u002Fflux2_4b.yaml \\\n    --override eval_reward=false --prompt \"a cute cat sitting on a windowsill\"\n```\n\nNo LoRA (plain pretrained) or distilled-only LoRA via CLI override:\n\n```bash\n# No LoRA\nbash scripts\u002Finference.sh 8 configs\u002Finference\u002Fflux2_4b.yaml --override lora_paths=\n\n# Distilled-only LoRA\nbash scripts\u002Finference.sh 8 configs\u002Finference\u002Fflux2_4b.yaml \\\n    --override lora_paths=\u002Fpath\u002Fto\u002Fflux2_4b_cold_start_ckpt\u002Fcheckpoint-15000\u002Fgenerator_ema.pt\n```\n\n### 4. Reward evaluation\n\nSame launcher with `eval_reward=true` (already the YAML default). Generates\nimages for the datasets baked into the YAML and writes per-reward + weighted\nmean scores to `inference_outputs\u002F\u003Cmodel>\u002Fmetadata.json`:\n\n```bash\nbash scripts\u002Finference.sh 8 configs\u002Finference\u002Fflux2_4b.yaml\n```\n\nThe default eval block mirrors training: `drawbench` for most rewards plus\n`hpsv3` \u002F `geneval` \u002F `geneval2` \u002F `ocr` on their own sub-datasets, capped at\n`num_media_images: 64` prompts per dataset. 
See the `reward_fn` and\n`reward_dataset_map` sections of each inference YAML for per-reward weights\nand dataset routing.\n\n---\n\n## ⚙️ Configuration\n\nConfiguration is pure-Python dataclass + YAML with dot-notation CLI overrides:\n\n```bash\nbash scripts\u002Frtdmd.sh 8 configs\u002Frtdmd\u002Fflux2_4b.yaml \\\n    --override train.seed=123 dmd.fake_update_ratio=10\n```\n\nTop-level sections of `RTDMDConfig` (see [`rtdmd\u002Fconfig.py`](rtdmd\u002Fconfig.py)):\n\n| Section       | Purpose |\n| --- | --- |\n| `model`       | Pretrained path (HF Hub repo id or local dir), dtype, LoRA settings (generator \u002F fake-score \u002F teacher). |\n| `dmd`         | DMD hyperparameters: CPS sampler `η`, denoising step list, fake-score TTUR ratio. |\n| `ac_dmd`      | AC-DMD sub-interval renoising bounds and consistency-loss knobs. |\n| `grpo`        | GRPO sampling \u002F PPO settings + `last_step_loss` (AC-DMD \u002F BP aux on the deterministic last step). |\n| `solver`      | Per-role AdamW configs (`generator` \u002F `fake_score` \u002F `teacher`). |\n| `train`       | Steps, batch size, autocast dtype, EMA, resume. |\n| `distributed` | `fsdp` or `ddp`; FSDP sharding strategy (`full_shard` \u002F `hybrid` \u002F `shard_grad_op`); CPU offload for frozen aux models. |\n| `eval`        | Periodic reward-evaluation knobs. |\n| `logging`     | wandb project \u002F run name \u002F tags. 
|\n\nThe dataclass loader silently drops unknown keys, so old configs remain\nloadable across refactors.\n\n---\n\n## 🎁 Reward Scorers\n\n`MultiScorer` (in [`rtdmd\u002Frewards\u002F`](rtdmd\u002Frewards\u002F)) wraps nine backends\nthat can be combined as `{name: weight}` inside any `reward_fn` block:\n`pickscore`, `hpsv2`, `hpsv3`, `clipscore`, `aesthetic`, `imagereward`,\n`ocr`, `geneval`, `geneval2`.\n\nThe differentiable subset (`pickscore`, `hpsv2`, `clipscore`, `imagereward`)\ncan be plugged into reward back-propagation on the deterministic final step\nby setting `last_step_loss.bp_enabled: true` in the RTDMD YAML — the rest are\nscored offline as part of GRPO advantages.\n\n---\n\n## 🙌 Acknowledgements\n\n- [diffusers](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdiffusers),\n  [transformers](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftransformers), and\n  [peft](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fpeft) — base generative-model\n  stack and LoRA.\n- [Flow-GRPO](https:\u002F\u002Fgithub.com\u002Fyifan123\u002Fflow_grpo) — the\n  SDE-step-with-logprob routine in\n  [`rtdmd\u002Fdiffusers_patch\u002Fsde_with_logprob.py`](rtdmd\u002Fdiffusers_patch\u002Fsde_with_logprob.py)\n  is ported from this project.\n- Teacher backbones:\n  [Stable Diffusion 3 \u002F 3.5](https:\u002F\u002Fhuggingface.co\u002Fstabilityai),\n  [FLUX.1](https:\u002F\u002Fhuggingface.co\u002Fblack-forest-labs\u002FFLUX.1-dev), and\n  [FLUX.2](https:\u002F\u002Fhuggingface.co\u002Fblack-forest-labs).\n\n---\n\n## 📄 Citation\n\n```bibtex\n@misc{huang2026reinforcingfewstepgeneratorsrewardtilted,\n      title={Reinforcing Few-step Generators via Reward-Tilted Distribution Matching}, \n      author={Yushi Huang and Xiangxin Zhou and Ruoyu Wang and Chi Zhang and Jun Zhang and Tianyu Pang},\n      year={2026},\n      eprint={2605.26108},\n      archivePrefix={arXiv},\n      primaryClass={cs.CV},\n      url={https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.26108}, 
\n}\n```\n\n---\n\n## ⚖️ License\n\nThis project is licensed under the Apache License 2.0; see\n[LICENSE](LICENSE). The supported teacher checkpoints (SD3 \u002F SD3.5 \u002F FLUX.1 \u002F\nFLUX.2) are released under their original licenses; please comply with each\nupstream license when using them.\n","RTDMD is a PyTorch-based framework that strengthens few-step generators through reward-tilted distribution matching. Its core idea is to combine distribution-matching distillation with reward-guided reinforcement learning by minimizing the KL divergence to a reward-tilted teacher distribution. Technically, it uses Ambient-Consistent DMD (AC-DMD) for the cold-start stage and a hybrid policy gradient (SubGRPO plus final-step reward back-propagation) for the RL stage. The project is well suited to applications that need efficient, high-quality generation, such as image generation and text-to-image tasks.",2,"2026-06-11 04:02:45","CREATED_QUERY"]