[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-72538":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":25,"hasPages":23,"topics":26,"createdAt":10,"pushedAt":10,"updatedAt":27,"readmeContent":28,"aiSummary":29,"trendingCount":16,"starSnapshotCount":16,"syncStatus":30,"lastSyncTime":31,"discoverSource":32},72538,"flow_grpo","yifan123\u002Fflow_grpo","yifan123","[NeurIPS 2025] An official implementation of Flow-GRPO: Training Flow Matching Models via Online RL","https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.05470",null,"Python",2329,161,22,26,0,4,21,79,12,28.63,"MIT License",false,"main",true,[],"2026-06-12 02:03:04","\u003Ch1 align=\"center\"> Flow-GRPO:\u003Cbr>Training Flow Matching Models via Online RL \u003C\u002Fh1>\n\u003Cdiv align=\"center\">\n  \u003Ca href='https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.05470'>\u003Cimg src='https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FArXiv-red?logo=arxiv'>\u003C\u002Fa>  &nbsp;\n  \u003Ca href='https:\u002F\u002Fgongyeliu.github.io\u002FFlow-GRPO\u002F'>\u003Cimg src='https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FVisualization-green?logo=github'>\u003C\u002Fa> &nbsp;\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fyifan123\u002Fflow_grpo\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FCode-9E95B7?logo=github\">\u003C\u002Fa> &nbsp; \n  \u003Ca href='https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Fjieliu\u002Fsd35m-flowgrpo-68298ec27a27af64b0654120'>\u003Cimg src='https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FModel-blue?logo=huggingface'>\u003C\u002Fa> &nbsp; \n  
\u003Ca href='https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fjieliu\u002FSD3.5-M-Flow-GRPO'>\u003Cimg src='https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-blue?logo=huggingface'>\u003C\u002Fa> &nbsp;\n\u003C\u002Fdiv>\n\n## Changelog\n\n\u003Cstrong>2026-05-07\u003C\u002Fstrong>\n\n* 🚀 Flow-GRPO is now supported in [verl-omni](https:\u002F\u002Fgithub.com\u002Fverl-project\u002Fverl-omni)! This provides a verl-style training framework for Flow-GRPO users.\n\n\n\u003Cdetails>\n\u003Csummary>\u003Cstrong>Update History\u003C\u002Fstrong>\u003C\u002Fsummary>\n\n**2025-11-04**\n\n* 🚀 Adding **GRPO-Guard**.\n\n**2025-11-04**\n\n* Adding support for [Bagel-7B](https:\u002F\u002Fhuggingface.co\u002FByteDance-Seed\u002FBAGEL-7B-MoT).\n\n**2025-10-14**\n\n* Refactor FlowGRPO-Fast for compatibility with FlowGRPO, add CPS sampling and No-CFG training on SD3.\n\n**2025-08-15**\n\n* Adding support for **Qwen-Image** and **Qwen-Image-Edit**.\n\n**2025-08-15**\n\n* Thanks [Jing Wang](https:\u002F\u002Fscholar.google.com.hk\u002Fcitations?user=Q9Np_KQAAAAJ&hl=zh-CN) for adding **Wan2.1**. Training command:\n```bash\naccelerate launch --config_file scripts\u002Faccelerate_configs\u002Fmulti_gpu.yaml --num_processes=1 --main_process_port 29503 scripts\u002Ftrain_wan2_1.py --config config\u002Fgrpo.py:general_ocr_wan2_1\n```\n\n**2025-08-14**\n\n* Adding reward curve of Flow-GRPO-Fast vs. Flow-GRPO. In Pickscore reward, Flow-GRPO-Fast is comparable to Flow-GRPO with only 2 steps training.\n\n\n**2025-08-04**\n\n* Adding support for **FLUX.1-Kontext-dev**. For the counting task, we use Geneval reward to detect object counts and CLIP feature similarity to ensure consistency between the original and edited images. This implementation offers a runnable pipeline, but the training set contains only 800 samples. 
Making Flow-GRPO truly effective for editing tasks still requires further exploration by the community.\n\n\n**2025-07-31**\n\n- Adding Flow-GRPO-Fast.\n\n**2025-07-28**\n\n- Adding support for **FLUX.1-dev**.\n- Adding support for CLIPScore as reward model.\n- Introducing `config.sample.same_latent` to control whether the same noise is reused for identical prompts, addressing [Issue #7](https:\u002F\u002Fgithub.com\u002Fyifan123\u002Fflow_grpo\u002Fissues\u002F7).\n\n**2025-05-15** \n\n- 🔥We showcase image examples from three tasks and their training evolution at https:\u002F\u002Fgongyeliu.github.io\u002FFlow-GRPO. Check them out!\n- 🔥We now provide an online demo for all three tasks at https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fjieliu\u002FSD3.5-M-Flow-GRPO. You're welcome to try it out!\n\u003C\u002Fdetails>\n\n## 🤗 Model\n| Task    | Model |\n| -------- | -------- |\n| GenEval     | [🤗GenEval](https:\u002F\u002Fhuggingface.co\u002Fjieliu\u002FSD3.5M-FlowGRPO-GenEval) |\n| Text Rendering     | [🤗Text](https:\u002F\u002Fhuggingface.co\u002Fjieliu\u002FSD3.5M-FlowGRPO-Text) |\n| Human Preference Alignment     | [🤗PickScore](https:\u002F\u002Fhuggingface.co\u002Fjieliu\u002FSD3.5M-FlowGRPO-PickScore) |\n\n## Training Speed\n\nTo improve training efficiency, we provide a better set of parameters for Flow-GRPO.\nWe found the following adjustments significantly accelerate training:\n\n* No CFG during training or testing — the RL process effectively performs **CFG distillation**.\n* Use the window mechanism from **Flow-GRPO-Fast** or **[MixGRPO](https:\u002F\u002Fwww.arxiv.org\u002Fabs\u002F2507.21802)** — only train on partial steps.\n* Adopt **[Coefficients-Preserving Sampling](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.05952) (CPS)** — CPS provides a notable improvement on GenEval, and produces higher-quality samples. 
A typical setting is `noise_level = 0.8`, which works well without tuning for different models or step counts.\n\nThe figure below shows the test-set performance curves using GenEval and PickScore as rewards, where both training and evaluation are performed **without CFG**. The experiments are configured with [**geneval_sd3_fast_nocfg**](https:\u002F\u002Fgithub.com\u002Fyifan123\u002Fflow_grpo\u002Fblob\u002Fmain\u002Fconfig\u002Fgrpo.py#L163) and [**pickscore_sd3_fast_nocfg**](https:\u002F\u002Fgithub.com\u002Fyifan123\u002Fflow_grpo\u002Fblob\u002Fmain\u002Fconfig\u002Fgrpo.py#L323), using scripts from `scripts\u002Fmulti_node\u002Fsd3_fast`.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"flow_grpo\u002Fassets\u002Fflow_grpo_fast_nocfg_geneval.svg\" alt=\"Flow-GRPO-Fast Illustration\" width=\"350\"\u002F>\n  \u003Cimg src=\"flow_grpo\u002Fassets\u002Fflow_grpo_fast_nocfg_pickscore.svg\" alt=\"Flow-GRPO-Fast Illustration\" width=\"350\"\u002F> \n\u003C\u002Fp>\n\n## 🛡️ Over-optimization (GRPO-Guard) 🔥🔥\n\nTo mitigate implicit over-optimization in flow matching, our team proposes [GRPO-Guard](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.22319) ([🔥Project Page](https:\u002F\u002Fjingw193.github.io\u002FGRPO-Guard\u002F)).\n\nWe first observe that the importance ratio exhibits an inherent bias:\n\n1. Its mean is consistently **below 1**, and the deviation becomes more pronounced at low-noise steps (e.g., step 8 in SD3.5-M).\n\n2. Its variance varies notably across different steps.\n\nIdeally, the importance ratio distribution should have a mean of 1 and stable variance. The clipping operation truncates overly confident positive or negative samples outside the region [1−ϵ,1+ϵ], ensuring stable gradient updates. However, the observed bias in the importance ratio disrupts this mechanism: gradients of positive samples are no longer properly constrained, **leading the policy model into over-optimization**. 
As a result, the proxy score continues to rise while the gold score declines, causing a severe degradation in image quality.\n\n\nThe biased ratio distributions are summarized in the table below.\n\n| FlowGRPO | GRPO-Guard|\n| - | - |\n| ![flow_grpo ratio](flow_grpo\u002Fassets\u002FGRPO-Guard\u002Fgif_1.gif) | ![grpo_guard ratio](flow_grpo\u002Fassets\u002FGRPO-Guard\u002Fgif_2.gif)  |\n| The clipping mechanism is imbalanced, failing to constrain overconfident positive samples. | The clipping mechanism is imbalanced, failing to constrain overconfident positive samples.|\n\n\nTo address this issue, [GRPO-Guard](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.22319) introduces two mechanisms that effectively alleviate over-optimization:\n\n- **RatioNorm**: Corrects the distributional bias of importance ratios and unifies their statistics across denoising steps.\n\n- **Gradient Reweight**: Further reweights the gradients of different denoising steps based on RatioNorm, balancing their contributions and preventing excessive optimization under specific noise levels.\n\nThe following figure compares over-optimization between GRPO-Guard and FlowGRPO on text rendering tasks. 
GRPO-Guard maintains the same rising trend in proxy scores as FlowGRPO while preventing rapid declines in gold scores, thus preserving high image quality and diversity.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"flow_grpo\u002Fassets\u002FGRPO-Guard\u002FGRPO-Guard-figure1.png\" alt=\"GRPO-Guard Illustration\" width=\"900\"\u002F>\n\u003C\u002Fp>\n\n**Start Training**\n\nAfter downloading the base model and setting up the reward model, run the following script to start GRPO-Guard training for the SD3.5-M text rendering task.\n```bash\n# Master node\nbash scripts\u002Fmulti_node\u002Fsd3_grpo_guard.sh 0\n# Other nodes\nbash scripts\u002Fmulti_node\u002Fsd3_grpo_guard.sh 1\n```\n\n## Flow-GRPO-Fast\nWe propose Flow-GRPO-Fast, an accelerated variant of Flow-GRPO that requires training on **only one or two denoising steps** per trajectory. For each prompt, we first generate a deterministic trajectory using ODE sampling. At a randomly chosen intermediate step, we inject noise and switch to SDE sampling to generate a group. The rest of the process continues with ODE sampling. This confines stochasticity to one or two steps, allowing training to focus solely on those steps. This few-step training idea was primarily proposed by [Ziyang Yuan](https:\u002F\u002Fscholar.google.com\u002Fcitations?user=fWxWEzsAAAAJ&hl=en) during our discussions in early June. \n\nFlow-GRPO-Fast achieves significant efficiency gains:\n\n- Each trajectory is trained only once or twice, significantly reducing the training cost.\n\n- Sampling before branching requires only a single prompt without group expansion, further speeding up data collection.\n\nExperiments on PickScore show that Flow-GRPO-Fast matches the reward performance of Flow-GRPO while offering faster training speed. The x-axis in the figure represents training epochs. 
Flow-GRPO-Fast with 2 training steps per iteration performs better than Flow-GRPO, while Flow-GRPO-Fast with only 1 training step per iteration performs slightly worse than Flow-GRPO. In both cases, compared to Flow-GRPO’s 10 training steps per iteration, the training process is significantly faster.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"flow_grpo\u002Fassets\u002Fflow_grpo_fast.png\" alt=\"Flow-GRPO-Fast Illustration\" width=\"450\"\u002F>\n\u003C\u002Fp>\n\n\nPlease use scripts in `scripts\u002Fmulti_node\u002Fsd3_fast` to run these experiments.\n\n\n## 🚀 Quick Start\n### 1. Environment Setup\nClone this repository and install packages.\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fyifan123\u002Fflow_grpo.git\ncd flow_grpo\nconda create -n flow_grpo python=3.10.16\n# Activate the new environment so the package is installed into it\nconda activate flow_grpo\npip install -e .\n```\n\n### 2. Model Download\nTo avoid redundant downloads and wasted storage during multi-GPU training, please download the required models in advance.\n\n**Models**\n* **SD3.5**: `stabilityai\u002Fstable-diffusion-3.5-medium`\n* **Flux**: `black-forest-labs\u002FFLUX.1-dev`\n\n**Reward Models**\n* **PickScore**:\n  * `laion\u002FCLIP-ViT-H-14-laion2B-s32B-b79K`\n  * `yuvalkirstain\u002FPickScore_v1`\n* **CLIPScore**: `openai\u002Fclip-vit-large-patch14`\n* **Aesthetic Score**: `openai\u002Fclip-vit-large-patch14`\n\n\n### 3. Reward Preparation\nThe steps above only install the current repository. Since each reward model may rely on different versions, combining them in one Conda environment can cause version conflicts. To avoid this, we adopt a remote server setup inspired by ddpo-pytorch. 
You only need to install the specific reward model you plan to use.\n\n#### GenEval\nPlease create a new Conda virtual environment and install the corresponding dependencies according to the instructions in [reward-server](https:\u002F\u002Fgithub.com\u002Fyifan123\u002Freward-server).\n\n#### OCR\nPlease install paddle-ocr:\n```bash\npip install paddlepaddle-gpu==2.6.2\npip install paddleocr==2.9.1\npip install python-Levenshtein\n```\nThen, pre-download the model using the Python command line:\n```python\nfrom paddleocr import PaddleOCR\nocr = PaddleOCR(use_angle_cls=False, lang=\"en\", use_gpu=False, show_log=False)\n```\n\n#### Pickscore\nPickScore requires no additional installation. Note that the original [pickscore](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fyuvalkirstain\u002Fpickapic_v1) dataset corresponds to `dataset\u002Fpickscore` in this repository, containing some NSFW prompts. We strongly recommend using [pickapic\\_v1\\_no\\_images\\_training\\_sfw](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FCarperAI\u002Fpickapic_v1_no_images_training_sfw), the SFW version of the Pick-a-Pic dataset, which corresponds to `dataset\u002Fpickscore_sfw` in this repository.\n\n#### DeQA\nPlease create a new Conda virtual environment and install the corresponding dependencies according to the instructions in [reward-server](https:\u002F\u002Fgithub.com\u002Fyifan123\u002Freward-server).\n\n#### UnifiedReward\nSince `sglang` may conflict with other environments, we recommend creating a new conda environment.\n```bash\nconda create -n sglang python=3.10.16\nconda activate sglang\npip install \"sglang[all]\"\n```\nWe use sglang to deploy the reward service. 
After installing sglang, please run the following command to launch UnifiedReward:\n```bash\npython -m sglang.launch_server --model-path CodeGoat24\u002FUnifiedReward-7b-v1.5 --api-key flowgrpo --port 17140 --chat-template chatml-llava --enable-p2p-check --mem-fraction-static 0.85\n```\n#### ImageReward\nPlease install imagereward:\n```bash\npip install image-reward\npip install git+https:\u002F\u002Fgithub.com\u002Fopenai\u002FCLIP.git\n```\n\n### 4. Start Training\n\n#### GRPO\n\n\n**Single-node training**\n\n```bash\n# sd3\nbash scripts\u002Fsingle_node\u002Fgrpo.sh\n# flux\nbash scripts\u002Fsingle_node\u002Fgrpo_flux.sh\n```\n\n---\n\n\u003Cdetails> \u003Csummary>Multi-node training for SD3:\u003C\u002Fsummary>\n\n```bash\n# Master node\nbash scripts\u002Fmulti_node\u002Fsd3.sh 0\n# Other nodes\nbash scripts\u002Fmulti_node\u002Fsd3.sh 1\nbash scripts\u002Fmulti_node\u002Fsd3.sh 2\nbash scripts\u002Fmulti_node\u002Fsd3.sh 3\n```\n---\n\u003C\u002Fdetails>\n\n\n\u003Cdetails> \u003Csummary>Multi-node training for FLUX.1-dev\u003C\u002Fsummary>\n\n```bash\n# Master node\nbash scripts\u002Fmulti_node\u002Fflux.sh 0\n# Other node\nbash scripts\u002Fmulti_node\u002Fflux.sh 1\nbash scripts\u002Fmulti_node\u002Fflux.sh 2\nbash scripts\u002Fmulti_node\u002Fflux.sh 3\n```\nFor Flow-GRPO-Fast, please use `scripts\u002Fmulti_node\u002Fflux_fast.sh`. See the W&B logs for [Geneval](https:\u002F\u002Fapi.wandb.ai\u002Flinks\u002Fljie\u002Fqz47q208) (with `geneval_flux_fast` in the config) and [PickScore](https:\u002F\u002Fapi.wandb.ai\u002Flinks\u002Fljie\u002Fncdwa0wo) (with `pickscore_flux_fast` in the config).\n\n---\n\u003C\u002Fdetails>\n\n\n\u003Cdetails> \u003Csummary>Multi-node training for FLUX.1-Kontext-dev\u003C\u002Fsummary>\n\nPlease first download [generated\\_images.zip](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fjieliu\u002Fcounting_edit\u002Fblob\u002Fmain\u002Fgenerated_images.zip) and extract it into the `counting_edit` directory. 
You can also use the scripts in the `counting_edit` directory to generate the data yourself.\n\nPlease install `diffusers` from the main branch to support `FLUX.1-Kontext-dev`:\n```bash\npip install git+https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdiffusers.git\n```\nAfter upgrading Diffusers, some packages such as PEFT may also need to be upgraded. If you encounter any errors, please upgrade them according to the error messages.\nThen, run the scripts:\n```bash\n# Master node\nbash scripts\u002Fmulti_node\u002Fflux_kontext.sh 0\n# Other nodes\nbash scripts\u002Fmulti_node\u002Fflux_kontext.sh 1\nbash scripts\u002Fmulti_node\u002Fflux_kontext.sh 2\nbash scripts\u002Fmulti_node\u002Fflux_kontext.sh 3\n```\n---\n\u003C\u002Fdetails>\n\n\n\u003Cdetails> \u003Csummary>Multi-node training for Qwen-Image:\u003C\u002Fsummary>\n\nIn the implementation of Qwen-Image, we have unified Flow-GRPO and Flow-GRPO-Fast. You can control the size of the SDE window with `config.sample.sde_window_size`, and adjust the position of the window with `config.sample.sde_window_range`.\n\nPlease install `diffusers` from the main branch to support `Qwen-Image`:\n```bash\npip install git+https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdiffusers.git\n```\nThen run the scripts:\n```bash\n# Master node\nbash scripts\u002Fmulti_node\u002Fqwenimage.sh 0\n# Other nodes\nbash scripts\u002Fmulti_node\u002Fqwenimage.sh 1\nbash scripts\u002Fmulti_node\u002Fqwenimage.sh 2\nbash scripts\u002Fmulti_node\u002Fqwenimage.sh 3\n```\nUsing the provided configuration, the resulting reward curve of Qwen-Image on the test set is shown below.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"flow_grpo\u002Fassets\u002Fflow_grpo_fast_qwenimage.png\" alt=\"Flow-GRPO-Fast Illustration\" width=\"350\"\u002F>\n\u003C\u002Fp>\n---\n\u003C\u002Fdetails>\n\n\n\u003Cdetails> \u003Csummary>Multi-node training for Qwen-Image-Edit:\u003C\u002Fsummary>\n\nSame as Flux Kontext, please first download 
[generated\\_images.zip](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fjieliu\u002Fcounting_edit\u002Fblob\u002Fmain\u002Fgenerated_images.zip) and extract it into the `counting_edit` directory. You can also use the scripts in the `counting_edit` directory to generate the data yourself.\n\nPlease install `diffusers` from the main branch to support `Qwen-Image-Edit`:\n```bash\npip install git+https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdiffusers.git\n```\nThen run the scripts:\n```bash\n# Master node\nbash scripts\u002Fmulti_node\u002Fqwenimage_edit.sh 0\n# Other nodes\nbash scripts\u002Fmulti_node\u002Fqwenimage_edit.sh 1\nbash scripts\u002Fmulti_node\u002Fqwenimage_edit.sh 2\nbash scripts\u002Fmulti_node\u002Fqwenimage_edit.sh 3\n```\n\nUsing the provided configuration, the resulting reward curve of Qwen-Image-Edit on the test set is shown below.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"flow_grpo\u002Fassets\u002Fqwenimageedit_epoch.png\" alt=\"Flow-GRPO-Fast Illustration\" width=\"350\"\u002F>\n  \u003Cimg src=\"flow_grpo\u002Fassets\u002Fqwenimageedit_time.png\" alt=\"Flow-GRPO-Fast Illustration\" width=\"350\"\u002F> \n\u003C\u002Fp>\n---\n\u003C\u002Fdetails>\n\n\n\u003Cdetails> \u003Csummary>Multi-node training for Bagel:\u003C\u002Fsummary>\n\nPlease first upgrade `transformers` to **version>=4.44.0** and install `flash-attn`:\n```bash\npip install transformers==4.44.0\npip install flash-attn==2.7.4.post1 --no-build-isolation\n```\n\nThen run the scripts:\n```bash\n# Master node\nbash scripts\u002Fmulti_node\u002Fbagel\u002Fmain.sh 0\n# Other nodes\nbash scripts\u002Fmulti_node\u002Fbagel\u002Fmain.sh 1\nbash scripts\u002Fmulti_node\u002Fbagel\u002Fmain.sh 2\nbash scripts\u002Fmulti_node\u002Fbagel\u002Fmain.sh 3\n```\n\nUsing the provided configuration, the resulting reward (PickScore) curve of Bagel on the test set is shown below (with 32 GPUs).\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"flow_grpo\u002Fassets\u002Fbagel_pickscore.svg\" 
alt=\"Flow-GRPO-Fast Illustration\" width=\"350\"\u002F>\n\u003C\u002Fp>\n\n**[Note]: About resource requirements & OOM**\n\nThe default training script adopts full-parameter mode, which requires at least **8 × 80GB GPUs**. If you encounter OOM issues, you can switch to LoRA training with the config provided in `config\u002Fgrpo.py:pickscore_bagel_lora`.\n\n---\n\u003C\u002Fdetails>\n\n\n#### DPO \u002F OnlineDPO \u002F SFT \u002F OnlineSFT\nSingle-node training:\n```bash\nbash scripts\u002Fsingle_node\u002Fdpo.sh\nbash scripts\u002Fsingle_node\u002Fsft.sh\n```\nMulti-node training:\n\nPlease update the entry Python script and config file names in the `scripts\u002Fmulti_node` bash file.\n\n\n## FAQ\n\n* Please use **fp16** for training whenever possible, as it provides higher precision than bf16, resulting in smaller log-probability errors between data collection and training. For Flux and Wan, because fp16 inference cannot produce valid images or videos, you will have to use **bf16** for training. Note that log-probability errors tend to be smaller at high-noise steps and larger at low-noise steps. Training only on high-noise steps yields better results in this case. Thanks to [Jing Wang](https:\u002F\u002Fscholar.google.com.hk\u002Fcitations?user=Q9Np_KQAAAAJ&hl=zh-CN) for these observations.\n\n* When using **Flow-GRPO-Fast**, set a relatively small `clip_range`, otherwise training may crash.\n\n* When implementing a new model, please check whether using different batch sizes leads to slight differences in the output. SD3 has this issue, which is why I ensure that the batch size for training is the same as that used for data collection.\n\n\n## How to Support Other Models\n\nTo integrate a new model into this framework, please follow the steps below:\n\n**1. 
Add the following files adapted for your model:**\n\n* `flow_grpo\u002Fdiffusers_patch\u002Fsd3_pipeline_with_logprob.py`:\n  This file is adapted from [pipeline\\_stable\\_diffusion\\_3.py](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdiffusers\u002Fblob\u002Fmain\u002Fsrc\u002Fdiffusers\u002Fpipelines\u002Fstable_diffusion_3\u002Fpipeline_stable_diffusion_3.py). You can refer to diffusers for your model.\n\n* `scripts\u002Ftrain_sd3.py`:\n  This script is based on [train\\_dreambooth\\_lora\\_sd3.py](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdiffusers\u002Fblob\u002Fmain\u002Fexamples\u002Fdreambooth\u002Ftrain_dreambooth_lora_sd3.py) from the DreamBooth examples.\n\n* `flow_grpo\u002Fdiffusers_patch\u002Fsd3_sde_with_logprob.py`:\n  This file handles SDE sampling. In most cases, you don't need to modify it. However, if your definitions of `dt` or `velocity` differ in sign or convention, please adjust accordingly.\n\n**2. Verify SDE sampling:**\nSet `noise_level = 0` in [sde\\_demo.py](https:\u002F\u002Fgithub.com\u002Fyifan123\u002Fflow_grpo\u002Ftree\u002Fmain\u002Fscripts\u002Fdemo\u002Fsd3_sde_demo.py) to check whether the generated images look normal. This helps verify that your SDE implementation is correct.\n\n**3. Ensure on-policy consistency:**\nSet [`config.sample.num_batches_per_epoch = 1`](https:\u002F\u002Fgithub.com\u002Fyifan123\u002Fflow_grpo\u002Fblob\u002Fmain\u002Fconfig\u002Fgrpo.py#L120) and [`config.train.gradient_accumulation_steps = 1`](https:\u002F\u002Fgithub.com\u002Fyifan123\u002Fflow_grpo\u002Fblob\u002Fmain\u002Fconfig\u002Fgrpo.py#L125C5-L125C47) to enforce a purely on-policy setup, where the model collecting samples is identical to the one being trained.\nUnder this setting, the [ratio](https:\u002F\u002Fgithub.com\u002Fyifan123\u002Fflow_grpo\u002Fblob\u002Fmain\u002Fscripts\u002Ftrain_sd3.py#L886) should remain exactly 1. 
If it's not, please check whether the sampling and training code paths differ—for example, through use of `torch.compile` or other model wrappers—and make sure both share the same logic.\n\n**4. Tune reward behavior:**\nStart with `config.train.beta = 0` to observe if the reward increases during training. You may also need to adjust the noise level [here](https:\u002F\u002Fgithub.com\u002Fyifan123\u002Fflow_grpo\u002Fblob\u002Fmain\u002Fflow_grpo\u002Fdiffusers_patch\u002Fsd3_sde_with_logprob.py#L47) based on your model. Other hyperparameters are generally model-agnostic and can be kept as default.\n\n\n## 🏁 Multi Reward Training\nFor multi-reward settings, you can pass in a dictionary where each key is a reward name and the corresponding value is its weight.\nFor example:\n\n```python\n{\n    \"pickscore\": 0.5,\n    \"ocr\": 0.2,\n    \"aesthetic\": 0.3\n}\n```\n\nThis means the final reward is a weighted sum of the individual rewards.\n\nThe following reward models are currently supported:\n* **Geneval** evaluates T2I models on complex compositional prompts.\n* **OCR** provides an OCR-based reward.\n* **PickScore** is a general-purpose T2I reward model trained on human preferences.\n* **[DeQA](https:\u002F\u002Fgithub.com\u002Fzhiyuanyou\u002FDeQA-Score)** is a multimodal LLM-based image quality assessment model that measures the impact of distortions and texture damage on perceived quality.\n* **ImageReward** is a general-purpose T2I reward model capturing text-image alignment, visual fidelity, and safety.\n* **QwenVL** is an experimental reward model using prompt engineering.\n* **Aesthetic** is a CLIP-based linear regressor predicting image aesthetic scores.\n* **JPEG\\_Compressibility** measures image size as a proxy for quality.\n* **UnifiedReward** is a state-of-the-art reward model for multimodal understanding and generation, topping the human preference leaderboard.\n\n        \n## ✨ Important Hyperparameters\nYou can adjust the parameters in 
`config\u002Fgrpo.py` to tune different hyperparameters. An empirical finding is that `config.sample.train_batch_size * num_gpu \u002F config.sample.num_image_per_prompt * config.sample.num_batches_per_epoch = 48` works well, i.e., `group_number=48`, `group_size=24`.\nAdditionally, set `config.train.gradient_accumulation_steps = config.sample.num_batches_per_epoch \u002F\u002F 2`.\n\n## 🤗 Acknowledgement\nThis repo is based on [ddpo-pytorch](https:\u002F\u002Fgithub.com\u002Fkvablack\u002Fddpo-pytorch) and [diffusers](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdiffusers). We thank the authors for their valuable contributions to the AIGC community. Special thanks to Kevin Black for the excellent *ddpo-pytorch* repo.\n\n## ⭐Citation\nIf you find Flow-GRPO useful for your research or projects, we would greatly appreciate it if you could cite the following paper:\n```\n@article{liu2025flow,\n  title={Flow-grpo: Training flow matching models via online rl},\n  author={Liu, Jie and Liu, Gongye and Liang, Jiajun and Li, Yangguang and Liu, Jiaheng and Wang, Xintao and Wan, Pengfei and Zhang, Di and Ouyang, Wanli},\n  journal={arXiv preprint arXiv:2505.05470},\n  year={2025}\n}\n```\nIf you find GRPO-Guard useful for your research or projects, we would greatly appreciate it if you could cite the following paper:\n```\n@misc{wang2025grpoguardmitigatingimplicitoveroptimization,\n    title={GRPO-Guard: Mitigating Implicit Over-Optimization in Flow Matching via Regulated Clipping}, \n    author={Jing Wang and Jiajun Liang and Jie Liu and Henglin Liu and Gongye Liu and Jun Zheng and Wanyuan Pang and Ao Ma and Zhenyu Xie and Xintao Wang and Meng Wang and Pengfei Wan and Xiaodan Liang},\n    year={2025},\n    eprint={2510.22319},\n    archivePrefix={arXiv},\n    primaryClass={cs.CV},\n    url={https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.22319}, \n}\n```\nIf you find Flow-DPO useful for your research or projects, we would greatly appreciate it if you could cite the following 
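paper:\n```\n@article{liu2025improving,\n  title={Improving video generation with human feedback},\n  author={Liu, Jie and Liu, Gongye and Liang, Jiajun and Yuan, Ziyang and Liu, Xiaokun and Zheng, Mingwu and Wu, Xiele and Wang, Qiulin and Qin, Wenyu and Xia, Menghan and others},\n  journal={arXiv preprint arXiv:2501.13918},\n  year={2025}\n}\n```\n","Flow-GRPO is the official implementation of training flow matching models via online reinforcement learning. The project is written in Python and supports multiple pretrained models and reward mechanisms, such as CLIPScore as a reward model, and it introduces new features such as GRPO-Guard to improve model performance and safety. Flow-GRPO also provides a fast variant, Flow-GRPO-Fast, which greatly reduces the number of training steps while maintaining quality. The project suits applications that need efficient and flexible image generation or editing, such as text-to-image synthesis and image editing. Through the provided visualization page and online demo, users can see how the model works and how it performs on different tasks.",2,"2026-06-11 03:42:29","high_star"]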