[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-71006":3},{"id":4,"name":5,"fullName":6,"owner":5,"repo":5,"description":7,"homepage":8,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":16,"stars7d":17,"stars30d":18,"stars90d":15,"forks30d":15,"starsTrendScore":19,"compositeScore":20,"rankGlobal":9,"rankLanguage":9,"license":21,"archived":22,"fork":22,"defaultBranch":23,"hasWiki":24,"hasPages":22,"topics":25,"createdAt":9,"pushedAt":9,"updatedAt":34,"readmeContent":35,"aiSummary":36,"trendingCount":15,"starSnapshotCount":15,"syncStatus":37,"lastSyncTime":38,"discoverSource":39},71006,"OpenRLHF","OpenRLHF\u002FOpenRLHF","An Easy-to-use, Scalable and High-performance Agentic RL Framework based on Ray (PPO & DAPO & REINFORCE++ &  VLM & TIS & vLLM & Ray & Async  RL)","https:\u002F\u002Fopenrlhf.readthedocs.io\u002F",null,"Python",9623,967,52,297,0,17,48,143,51,39.96,"Apache License 2.0",false,"main",true,[26,27,28,29,30,31,32,33],"large-language-models","proximal-policy-optimization","raylib","reinforcement-learning","reinforcement-learning-from-human-feedback","transformers","visual-language-models","vllm","2026-06-12 02:02:46","\u003Cdiv align=\"center\">\n    \u003Cimg alt=\"OpenRLHF logo\" src=\".\u002Fdocs\u002Flogo.png\" style=\"height: 140px;\" \u002F>\n\u003C\u002Fdiv>\n\u003Cdiv align=\"center\">\n\u003Cp align=\"center\">\n      \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fgraphs\u002Fcontributors\">\n        \u003Cimg alt=\"GitHub Contributors\" src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fcontributors\u002FOpenRLHF\u002FOpenRLHF\" \u002F>\n      \u003C\u002Fa>\n      \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fissues\">\n        \u003Cimg alt=\"Issues\" 
src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fissues\u002FOpenRLHF\u002FOpenRLHF?color=0088ff\" \u002F>\n      \u003C\u002Fa>\n      \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fdiscussions\">\n        \u003Cimg alt=\"Discussions\" src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fdiscussions\u002FOpenRLHF\u002FOpenRLHF?color=0088ff\" \u002F>\n      \u003C\u002Fa>\n      \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fpulls\">\n        \u003Cimg alt=\"GitHub pull requests\" src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fissues-pr\u002FOpenRLHF\u002FOpenRLHF?color=0088ff\" \u002F>\n      \u003C\u002Fa>\n      \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fstargazers\">\n        \u003Cimg alt=\"GitHub stars\" src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenRLHF\u002FOpenRLHF?color=ccf\" \u002F>\n      \u003C\u002Fa>\n      \u003Ca href=\"https:\u002F\u002Fdeepwiki.com\u002FOpenRLHF\u002FOpenRLHF\">\u003Cimg src=\"https:\u002F\u002Fdeepwiki.com\u002Fbadge.svg\" alt=\"Ask DeepWiki\">\u003C\u002Fa>\n      \u003Cbr>\n      \u003Cem>Open-source \u002F Comprehensive \u002F Lightweight \u002F Easy-to-use\u003C\u002Fem>\n    \u003C\u002Fp>\n\u003C\u002Fdiv>\n\n\u003Chr>\n\n\u003Cspan>[ English | \u003Ca href=\"README_zh.md\">中文\u003C\u002Fa> | \u003Ca href=\"README_ja.md\">日本語\u003C\u002Fa> ]\u003C\u002Fspan>\n\nOpenRLHF is **the first** high-performance, production-ready open-source RLHF framework that combines a **Ray + vLLM distributed architecture** with a **unified agent-based design paradigm** for scalable and extensible reinforcement learning from human feedback.\n\n📚 **Learn More**: [Documentation](https:\u002F\u002Fopenrlhf.readthedocs.io\u002F) | [Slides](https:\u002F\u002Fdocs.google.com\u002Fpresentation\u002Fd\u002F1JRhB1d7csofx0PIZBmfyBdMluxNd5JLPpUHrrvVhGnk\u002Fedit?usp=sharing) | [Technical 
Report](https:\u002F\u002Fwww.researchgate.net\u002Fpublication\u002F393414548_OpenRLHF_An_Easy-to-use_Scalable_and_High-performance_RLHF_Framework) | [Video](https:\u002F\u002Fwww.bilibili.com\u002Fvideo\u002FBV1dv2jBxEQG\u002F)\n\n## 📖 Table of Contents\n\n- [🗞️ News](#news)\n- [🏗️ Architecture Foundation](#architecture-foundation-ray--vllm-distribution) - Ray + vLLM + DeepSpeed distributed infrastructure\n- [🎯 Design Paradigm](#design-paradigm-agent-based-execution) - Unified agent-based execution pipeline\n- [🚀 RL Algorithms](#state-of-the-art-rl-algorithms) - PPO, REINFORCE++, GRPO, RLOO\n- [📋 Features Overview](#comprehensive-features) - Complete RLHF pipeline capabilities\n- [🎬 Quick Start](#quick-start) - Installation and typical workflow\n- [🎓 Training Guide](#supervised-fine-tuning) - SFT, Reward Model, RL Training\n- [🎯 Single-Turn Agent](#single-turn-agent-reinforced-fine-tuning-with-custom-rewards) - Custom reward functions\n- [🤖 Multi-Turn Agent](#multi-turn-agent-complex-environment-interactions) - Complex environments\n- [🔧 Advanced Topics](#advanced-topics) - LoRA, performance tuning\n\n---\n\n\u003Ca id=\"news\">\u003C\u002Fa>\n## News\n\n\u003Cdetails>\n\u003Csummary>Show News\u003C\u002Fsummary>\n\n- [2026\u002F4] OpenRLHF 0.10 adds **Multi-Turn VLM RL** — multi-step interactions with images in both prompts and environment feedback (e.g. screenshots). Example: [vlm_multiturn_agent.py](.\u002Fexamples\u002Fpython\u002Fvlm_multiturn_agent.py)\n- [2026\u002F4] OpenRLHF 0.10 adds **VLM (Vision-Language Model) RLHF support** — train VLMs like Qwen3.5 with image inputs end-to-end. 
Training script: [train_vlm_math_hybrid_engine.sh](.\u002Fexamples\u002Fscripts\u002Ftrain_vlm_math_hybrid_engine.sh)\n- [2026\u002F2] [ProRL V2](https:\u002F\u002Fdeveloper.nvidia.com\u002Fblog\u002Fscaling-llm-reinforcement-learning-with-prolonged-training-using-prorl-v2\u002F) uses REINFORCE++-baseline to train a state-of-the-art 1.5B reasoning model with prolonged RL training. Training script: [train_prorlv2_math_hybrid_engine.sh](.\u002Fexamples\u002Fscripts\u002Ftrain_prorlv2_math_hybrid_engine.sh)\n- [2026\u002F10] [ScaleRL](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.13786) validates the effectiveness of REINFORCE++-baseline in large-scale training scenarios. See also the [REINFORCE++ slides](https:\u002F\u002Fdocs.google.com\u002Fpresentation\u002Fd\u002F1stieP_3PM1z4Hq1YWR3GywFkxcHEAlstXMaS23KlGN4)\n- [2026\u002F6] [Magistral](https:\u002F\u002Fmistral.ai\u002Fstatic\u002Fresearch\u002Fmagistral.pdf) uses a method quite similar to REINFORCE++-baseline to train its reasoning models.\n- [2026\u002F5] [MARTI](https:\u002F\u002Fgithub.com\u002FTsinghuaC3I\u002FMARTI) has been released as a fork of OpenRLHF. It is designed to train LLM-based multi-agent systems using RL by integrating centralized multi-agent interactions with distributed policy training.\n- [2026\u002F5] OpenRLHF 0.8.0 supports async RLHF training via `--train.async_enable` and async agent RLHF via `--train.agent_func_path`. 
See [train_reinforce_baseline_ray_agent_async.sh](.\u002Fexamples\u002Fscripts\u002Ftrain_reinforce_baseline_ray_agent_async.sh) for a runnable example.\n- [2026\u002F4] Published the blog post [Accelerating RLHF with vLLM, Best Practice from OpenRLHF](https:\u002F\u002Fblog.vllm.ai\u002F2026\u002F04\u002F23\u002Fopenrlhf-vllm.html)\n- [2026\u002F4] Clean OpenRLHF: Refactored the source code based on Single Controller and Unified Packing Samples\n- [2026\u002F3] The CMU [Advanced Natural Language Processing Spring 2026](https:\u002F\u002Fcmu-l3.github.io\u002Fanlp-spring2026\u002F) course uses OpenRLHF as its RLHF framework teaching case.\n- [2026\u002F2] [Logic-RL](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.14768) and [PRIME](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.01456) demonstrate that REINFORCE++ trains more stably than GRPO and faster than PPO.\n- [2026\u002F2] [LMM-R1](https:\u002F\u002Fgithub.com\u002FTideDra\u002Flmm-r1) is a fork of OpenRLHF that aims to provide high-performance RL infrastructure for reproducing DeepSeek-R1 on multimodal tasks.\n- [2026\u002F2] MIT & Microsoft published [On the Emergence of Thinking in LLMs I: Searching for the Right Intuition](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2502.06773), which uses OpenRLHF\n- [2026\u002F1] HKUST reproduced the [DeepSeek-R1-Zero and DeepSeek-R1 training on small models using OpenRLHF](https:\u002F\u002Fgithub.com\u002Fhkust-nlp\u002FsimpleRL-reason)\n- [2024\u002F12] We \"proposed\" 😊 [REINFORCE++: A Simple and Efficient Approach for Aligning Large Language Models](https:\u002F\u002Fwww.researchgate.net\u002Fpublication\u002F387487679_REINFORCE_An_Efficient_RLHF_Algorithm_with_Robustnessto_Both_Prompt_and_Reward_Models).\n- [2024\u002F12] We analyzed PPO, REINFORCE++, GRPO, and RLOO in the [Notion Blogpost](https:\u002F\u002Fhijkzzz.notion.site\u002Funraveling-rlhf-and-its-variants-engineering-insights#147d9a33ecc9806090f3d5c749d31f05).\n- [2023\u002F8] OpenRLHF was 
open-sourced.\n\n\u003C\u002Fdetails>\n\n---\n\n\u003Ca id=\"architecture-foundation-ray--vllm-distribution\">\u003C\u002Fa>\n## 🏗️ Architecture Foundation: Ray + vLLM Distribution\n\nOpenRLHF is **the first RLHF framework** built on Ray + vLLM distributed architecture, orchestrating multiple components across GPUs efficiently:\n\n\u003Cdiv align=\"center\">\n  \u003Cimg alt=\"OpenRLHF Architecture (Ray + vLLM)\" src=\".\u002Fdocs\u002Fopenrlhf_architecture.svg\" style=\"max-width: 100%; height: auto;\" \u002F>\n\u003C\u002Fdiv>\n\n### Core Infrastructure Components\n\n**Ray - Distributed Scheduler and Controller**  \nOpenRLHF leverages [Ray](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fray) for efficient distributed scheduling. It separates the Actor, Reward, Reference, and Critic models across different GPUs, enabling scalable training for models up to **70B+ parameters**.\n\n**Hybrid Engine Scheduling**: All models and vLLM engines can share GPU resources—minimizing idle time and maximizing GPU utilization. This allows running full RLHF pipelines on limited hardware.\n\n**vLLM - High-Performance Inference Engine**  \nRLHF training spends **80% of the time on sample generation**. Powered by [vLLM](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fvllm) with Auto Tensor Parallelism (AutoTP) and Pipeline Parallelism (PP), OpenRLHF delivers high-throughput, memory-efficient generation.\n\n**DeepSpeed - Memory-Efficient Training**  \nBuilt on [DeepSpeed](https:\u002F\u002Fgithub.com\u002Fdeepspeedai\u002FDeepSpeed) ZeRO-3, [deepcompile](https:\u002F\u002Fgithub.com\u002Fdeepspeedai\u002FDeepSpeed\u002Fblob\u002Fmaster\u002Fblogs\u002Fdeepcompile\u002FREADME.md), [AutoTP](https:\u002F\u002Fgithub.com\u002Fdeepspeedai\u002FDeepSpeed\u002Fblob\u002Fmaster\u002Fblogs\u002Fhuggingface-tp\u002FREADME.md), and RingAttention. 
Enables large model training without heavyweight frameworks while working directly with HuggingFace models.\n\n**Transformers - Model Interface**  \nNative integration with HuggingFace Transformers for seamless model loading, state management, and fine-tuning of pretrained models.\n\n**NCCL \u002F CUDA IPC - High-Speed Communication**  \nEfficient inter-GPU communication for distributed training and inference.\n\n---\n\n\u003Ca id=\"design-paradigm-agent-based-execution\">\u003C\u002Fa>\n## 🎯 Design Paradigm: Agent-Based Execution\n\n**On top of the Ray distributed architecture**, OpenRLHF is **the first RLHF framework** to implement a **unified agent-based paradigm**. Every training run—whether standard PPO or complex multi-turn reasoning—follows a consistent agent execution pipeline.\n\n### Why Agent-Based?\n\nOpenRLHF **unifies generation and training through token-in-token-out agent execution**, ensuring perfect consistency, easy single\u002Fmulti-turn extension, and zero text-level mismatches.\n\n### Agent Architecture\n\n```\n                 ┌─────────────────────────────┐\n                 │    AgentExecutorBase        │\n                 │  (Token-in-Token-out Core)  │\n                 └─────────────────────────────┘\n                              │\n                 ┌────────────┴────────────┐\n                 ↓                         ↓\n         SingleTurnExecutor        MultiTurnExecutor\n                 │                         │\n      ┌──────────┴──────────┐   ┌─────────┴──────────┐\n      ↓                     ↓   ↓                    ↓\n  Standard RLHF      Custom Reward   Multi-Step    External Env\n  (One-shot gen)     Function      Reasoning     (OpenAI Agent Server)\n      ↓                     ↓           ↓                ↓\n      └─────────────────────┴───────────┴────────────────┘\n                              │\n                    Consistent Token Trajectories\n                              │\n                    
┌─────────┴─────────┐\n                    │  RL Algorithms    │\n                    │  (Decoupled)      │\n                    │                   │\n                    │  PPO, REINFORCE++ │\n                    │  GRPO, RLOO, etc. │\n                    └───────────────────┘\n```\n\n### Core Design Principles\n\n\u003Cdetails>\n\u003Csummary>Show core design principles\u003C\u002Fsummary>\n\n| Principle | Description | Benefit |\n|-----------|-------------|---------|\n| **Token-in-Token-out** | All sampling produces token-level trajectories | Zero text-level mismatch |\n| **Unified Interface** | Same `AgentExecutorBase` API for all modes | Switch modes with one flag |\n| **Algorithm-Agnostic** | RL algorithms (PPO, REINFORCE++, etc.) are decoupled from agent executors | Any algorithm works with any mode |\n| **Extensible** | Plug in custom rewards\u002Fenvironments easily | Rapid experimentation |\n| **Production-Ready** | Sync\u002FAsync\u002FHybrid Engine support | From research to deployment |\n\n\u003C\u002Fdetails>\n\n### Two Execution Modes (Orthogonal to RL Algorithms)\n\nThe agent execution mode is **independent** of the RL algorithm you choose. You can use **any algorithm** (PPO, REINFORCE++, GRPO, etc.) with **any execution mode**:\n\n| Mode | Use Cases | Interface | Complexity |\n|------|-----------|-----------|------------|\n| **Single-Turn** | Standard RLHF, custom reward functions | Optional `reward_func()` | ⭐ Default (99% use cases) |\n| **Multi-Turn** | Multi-step reasoning, interactive environments | `reset()` + `step()` | ⭐⭐ Advanced |\n\n---\n\n\u003Ca id=\"state-of-the-art-rl-algorithms\">\u003C\u002Fa>\n## 🚀 State-of-the-Art RL Algorithms\n\nOpenRLHF implements **PPO, REINFORCE++, REINFORCE++-baseline, GRPO, RLOO** with advanced optimization tricks inspired by practical guides and community best practices. \n\n**Key Design**: RL algorithms are **decoupled from agent execution modes**. 
All algorithms work seamlessly with both single-turn and multi-turn agent executors, running through the unified token-in-token-out pipeline for consistent behavior.\n\n\u003Cdetails>\n\u003Csummary>Show algorithm comparison table\u003C\u002Fsummary>\n\n| Algorithm | `--algo.advantage.estimator` | Key Feature | Best Use Case |\n|-----------|------------------------|-------------|---------------|\n| **PPO** | (default) | Full critic network | Stable training, proven results |\n| **REINFORCE++** | `reinforce` | PPO tricks without critic | Efficient training, less memory |\n| **REINFORCE++-baseline** | `reinforce_baseline` | Mean reward baseline | Reasoning tasks (RLVR), robust to reward scales |\n| **RLOO** | `rloo` | Per-token KL + PPO-clip | Multi-sample training |\n| **GRPO** | `group_norm` | Group normalization | Batch-based training |\n| **Dr. GRPO** | `dr_grpo` | Simplified GRPO | Removes local `\u002Fstd` norm |\n\n\u003C\u002Fdetails>\n\nReferences: [Zhihu article](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F622134699) | [Notion best practices](https:\u002F\u002Fhijkzzz.notion.site\u002Frlhf-implementation-tricks?v=158d9a33ecc98132bf9e000c39227361)\n\n---\n\n\u003Ca id=\"comprehensive-features\">\u003C\u002Fa>\n## 📋 Comprehensive Features\n\nOpenRLHF provides a complete RLHF pipeline with agent-based flexibility:\n\n### 🎯 Agent-Based RL Training (Core Innovation)\n\n\u003Cdetails>\n\u003Csummary>Show agent-based RL training details\u003C\u002Fsummary>\n\n**Single-Turn Mode** (Default - 99% of use cases)\n- One-shot generation per prompt\n- Works with all RL algorithms: [PPO](.\u002Fexamples\u002Fscripts\u002Ftrain_ppo_ray_hybrid_engine.sh), [REINFORCE++\u002Fbaseline\u002FGRPO\u002FRLOO](.\u002Fexamples\u002Fscripts\u002Ftrain_reinforce_baseline_hybrid_engine.sh)\n- [Custom reward functions](.\u002Fexamples\u002Fscripts\u002Ftrain_ppo_with_reward_fn.sh) (`--reward.remote_url`)\n- [Hybrid 
Engine](.\u002Fexamples\u002Fscripts\u002Ftrain_ppo_ray_hybrid_engine.sh) for maximum GPU utilization\n\n**Multi-Turn Mode** (Advanced - Interactive tasks)\n- Multi-step interactions with environment feedback\n- Works with all RL algorithms\n- [Custom agent functions](.\u002Fexamples\u002Fscripts\u002Ftrain_reinforce_baseline_ray_agent_async.sh) (`--train.agent_func_path`)\n- OpenAI-compatible server: see `examples\u002Fpython\u002Fagent_func_openai_server_executor.py` for an agent executor that wraps vLLM as a local OpenAI Agent Server\n- Async pipeline (`--train.async_enable`) for higher throughput: [train_reinforce_baseline_ray_agent_async.sh](.\u002Fexamples\u002Fscripts\u002Ftrain_reinforce_baseline_ray_agent_async.sh)\n\n\u003C\u002Fdetails>\n\n### 🎓 Supervised Training & Preference Learning\n\n\u003Cdetails>\n\u003Csummary>Show supervised training & preference learning table\u003C\u002Fsummary>\n\n| Method | Script | Description |\n|--------|--------|-------------|\n| **SFT** | [train_sft.sh](.\u002Fexamples\u002Fscripts\u002Ftrain_sft.sh) | Supervised fine-tuning with packing |\n| **DPO\u002FIPO\u002FcDPO** | [train_dpo_llama.sh](.\u002Fexamples\u002Fscripts\u002Ftrain_dpo_llama.sh) | Direct preference optimization |\n| **Reward Model** | [train_rm.sh](.\u002Fexamples\u002Fscripts\u002Ftrain_rm.sh) | Train reward models |\n\n\u003C\u002Fdetails>\n\n### ⚡ Advanced Capabilities\n\n\u003Cdetails>\n\u003Csummary>Show advanced capabilities\u003C\u002Fsummary>\n\n**Efficiency Optimizations**\n- Sample packing (`--ds.packing_samples`) for all training modes\n- vLLM acceleration (`--vllm.num_engines`) for fast generation\n- DAPO [dynamic filtering](.\u002Fexamples\u002Fscripts\u002Ftrain_dapo_ray_hybrid_engine.sh) (`--algo.dynamic_filtering_enable`)\n  - 🎲 Dynamic Sampling: for each prompt, generate multiple responses and **filter** them by your reward \u002F agent **0–1 `scores`** signal\n    - Enable: `--algo.dynamic_filtering_enable`\n    - Score range: 
`--algo.dynamic_filtering_range 0.0 1.0`\n    - Requires: `--rollout.n_samples_per_prompt > 1` and either `--reward.remote_url` or `--train.agent_func_path`\n    - Example: `.\u002Fexamples\u002Fscripts\u002Ftrain_dapo_ray_hybrid_engine.sh`\n\n**Scalability**\n- DeepSpeed AutoTP for tensor parallelism (see `--ds.tensor_parallel_size` in training scripts)\n- [RingAttention](.\u002Fexamples\u002Ftest_scripts\u002Ftrain_dpo_ring_llama.sh) for long context (`--ds.ring_attn_size`)\n- Multi-node training with [SLURM](.\u002Fexamples\u002Fscripts\u002Ftrain_ppo_ray_slurm.sh)\n\n**Model Support**\n- [VLM (Vision-Language Models)](.\u002Fexamples\u002Fscripts\u002Ftrain_vlm_math_hybrid_engine.sh) — single-turn and [multi-turn with image feedback](.\u002Fexamples\u002Fpython\u002Fvlm_multiturn_agent.py) (`--data.image_key`, `--data.max_images_per_prompt`)\n- [LoRA\u002FQLoRA](.\u002Fexamples\u002Fscripts\u002Ftrain_sft_mixtral_lora.sh) (`--ds.lora.rank`, `--ds.load_in_4bit`)\n- [Mixture of Experts (MoE)](.\u002Fexamples\u002Ftest_scripts\u002Ftrain_sft_moe.sh) (`--actor.aux_loss_coef`)\n- FlashAttention (`--ds.attn_implementation`)\n- HuggingFace chat templates (`--data.apply_chat_template`)\n\n**Optimizers**\n- AdamW (default): `--{actor,critic}.optim adam --{actor,critic}.adam.lr 2e-6`\n- [Muon](https:\u002F\u002Fkellerjordan.github.io\u002Fposts\u002Fmuon\u002F) (via DeepSpeed ≥ 0.18.2, 2D weights only; embeddings \u002F head \u002F 1-D params use aux-AdamW): `--{actor,critic}.optim muon --{actor,critic}.muon.lr 1e-4 --{actor,critic}.muon.momentum 0.95`. 
Newton-Schulz produces scale-invariant updates, so disable global grad clipping with `--{actor,critic}.max_norm 0` (the Adam default `1.0` would clip away the Muon update).\n\n**Reward Shaping**\n- DAPO-style overlong penalty for length control (`--reward.overlong_buffer_len`, `--reward.overlong_penalty_factor`) — soft-penalize responses that exceed `max_new_tokens - overlong_buffer_len`\n- ProRL-style truncation penalty (`--reward.stop_properly_penalty_coef`) — for samples with `finish_reason='length'`: `coef ∈ [0, 1]` multiplicatively scales the reward; `coef \u003C 0` sets the reward to that fixed value (e.g. `-0.5`)\n\n**Production Features**\n- Wandb (`--logger.wandb.key`) and TensorBoard (`--logger.tensorboard_dir`) logging\n- Checkpoint recovery (`--ckpt.load_enable`, `--ckpt.save_steps`)\n- Best-checkpoint saving on eval metrics (`--ckpt.best_metric_key`)\n- Evaluation datasets (`--eval.dataset`, `--eval.temperature`, `--eval.n_samples_per_prompt`) — supported in async training\n- Multi-process data loading (`--data.dataloader_num_workers`, available for PPO\u002FSFT\u002FRM\u002FDPO)\n- PPO observability: actor\u002Fcritic grad-norm and per-phase timing (`timing\u002Fmake_experience`, `timing\u002Fppo_train`, `timing\u002Fbroadcast`, `timing\u002Fgeneration`, `timing\u002Fstep_total`)\n\n\u003C\u002Fdetails>\n\n---\n\n\u003Ca id=\"quick-start\">\u003C\u002Fa>\n## 🎬 Quick Start\n\n### Installation\n\n**Recommended**: Use Docker for hassle-free setup\n\n```bash\n# 1. Launch Docker container\ndocker run --runtime=nvidia -it --rm --shm-size=\"10g\" --cap-add=SYS_ADMIN \\\n  -v $PWD:\u002Fopenrlhf nvcr.io\u002Fnvidia\u002Fpytorch:25.11-py3 bash\n\n# 2. Clean conflicting packages\nsudo pip uninstall xgboost transformer_engine flash_attn pynvml -y\n\n# 3. 
Install OpenRLHF (choose one)\npip install openrlhf                    # Basic\npip install openrlhf[vllm]              # + vLLM 0.19.1 (recommended)\npip install openrlhf[vllm_latest]       # + Latest vLLM\npip install openrlhf[vllm,ring,liger]   # + All optimizations\n```\n\n**Alternative: Install from source**\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF.git\ncd OpenRLHF\npip install -e .\n```\n\n> [!TIP]\n> We recommend **vLLM 0.19.1+** for best performance. See [Dockerfiles](.\u002Fdockerfile\u002F) and [Nvidia-Docker Install Script](.\u002Fexamples\u002Fscripts\u002Fnvidia_docker_install.sh).\n\n### Prepare Datasets\n\nOpenRLHF provides flexible data processing methods:\n\n**Key Parameters**:\n- `--data.input_key`: Specify JSON key name for input data\n- `--data.apply_chat_template`: Use HuggingFace tokenizer's [chat template](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftransformers\u002Fmain\u002Fen\u002Fchat_templating)\n- `--data.input_template`: Custom template string (alternative to chat template)\n- `--data.prompt_probs` \u002F `--data.dataset_probs`: Mix multiple datasets (e.g., `0.1,0.4,0.5`)\n- `--eval.dataset`: Specify evaluation dataset path\n\n**Chat Template Example**:\n\n```python\ndataset = [{\"input_key\": [\n  {\"role\": \"user\", \"content\": \"Hello, how are you?\"},\n  {\"role\": \"assistant\", \"content\": \"I'm doing great. How can I help you today?\"},\n  {\"role\": \"user\", \"content\": \"I'd like to show off how chat templating works!\"},\n]}]\n\ntokenizer.apply_chat_template(dataset[0][\"input_key\"], tokenize=False)\n# Output: \"\u003Cs>[INST] Hello, how are you? [\u002FINST]I'm doing great...\u003C\u002Fs> [INST] I'd like to show off... [\u002FINST]\"\n```\n\n> [!NOTE]\n> JSON key options vary by dataset type. 
See [Reward Dataset](https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fblob\u002Fmain\u002Fopenrlhf\u002Fdatasets\u002Freward_dataset.py#L10), [SFT Dataset](https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fblob\u002Fmain\u002Fopenrlhf\u002Fdatasets\u002Fsft_dataset.py#L9), and [Prompt Dataset](https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fblob\u002Fmain\u002Fopenrlhf\u002Fdatasets\u002Fprompts_dataset.py#L6)\n\n\u003Ca id=\"supervised-fine-tuning\">\u003C\u002Fa>\n### Supervised Fine-tuning\n\nOpenRLHF's model checkpoint is fully compatible with HuggingFace models. You can specify the model name or path using `--actor.model_name_or_path  {name or path}`, `--reward.model_name_or_path  {name or path}` and `--critic.model_name_or_path  {name or path}`. We have provided some pre-trained checkpoints and datasets on [HuggingFace OpenRLHF](https:\u002F\u002Fhuggingface.co\u002FOpenRLHF).\n\nThen you can use the startup scripts we provide in the [examples\u002Fscripts](.\u002Fexamples\u002Fscripts\u002F) directory, or start the training using the following commands.\n\n\u003Cdetails>\n\u003Csummary>SFT command\u003C\u002Fsummary>\n\n```bash\ndeepspeed --module openrlhf.cli.train_sft \\\n   --data.max_len 4096 \\\n   --data.dataset Open-Orca\u002FOpenOrca \\\n   --data.input_key question \\\n   --data.output_key response \\\n   --data.input_template $'User: {}\\nAssistant: ' \\\n   --train.batch_size 256 \\\n   --train.micro_batch_size 2 \\\n   --data.max_samples 500000 \\\n   --actor.model_name_or_path meta-llama\u002FMeta-Llama-3-8B \\\n   --ckpt.output_dir .\u002Fcheckpoint\u002Fllama3-8b-sft \\\n   --ckpt.save_steps -1 \\\n   --logger.logging_steps 1 \\\n   --eval.steps -1 \\\n   --ds.zero_stage 2 \\\n   --train.max_epochs 1 \\\n   --ds.packing_samples \\\n   --ds.param_dtype bf16 \\\n   --adam.lr 5e-6 \\\n   --actor.gradient_checkpointing_enable \\\n   --logger.wandb.key {wandb_token}\n\n# Additional options:\n# 
--data.apply_chat_template                # Use HF tokenizer chat template\n# --ds.ring_attn_size 2                      # Enable RingAttention (install ring_flash_attn first)\n# --data.multiturn                          # Multi-turn fine-tuning loss\n# --actor.pretrain_mode_enable                      # Continued pre-training mode\n```\n\n\u003C\u002Fdetails>\n\n\n### Reward Model Training\n\n\u003Cdetails>\n\u003Csummary>Reward model training command\u003C\u002Fsummary>\n\n```bash\ndeepspeed --module openrlhf.cli.train_rm \\\n   --ckpt.output_dir .\u002Fcheckpoint\u002Fllama3-8b-rm \\\n   --ckpt.save_steps -1 \\\n   --logger.logging_steps 1 \\\n   --eval.steps -1 \\\n   --train.batch_size 256 \\\n   --train.micro_batch_size 1 \\\n   --actor.model_name_or_path OpenRLHF\u002FLlama-3-8b-sft-mixture \\\n   --ds.param_dtype bf16 \\\n   --train.max_epochs 1 \\\n   --data.max_len 8192 \\\n   --ds.zero_stage 3 \\\n   --adam.lr 9e-6 \\\n   --data.dataset OpenRLHF\u002Fpreference_dataset_mixture2_and_safe_pku \\\n   --data.apply_chat_template \\\n   --chosen_key chosen \\\n   --rejected_key rejected \\\n   --ds.packing_samples \\\n   --actor.gradient_checkpointing_enable \\\n   --logger.wandb.key {wandb_token}\n\n```\n\n\u003C\u002Fdetails>\n\nIt is recommended to set the `--value_prefix_head` option of the Reward Model to `score`, so that we can load the model using `AutoModelForSequenceClassification`:\n\n```python\nreward_model = AutoModelForSequenceClassification.from_pretrained(\n              reward_model_path,\n              num_labels=1,\n              torch_dtype=torch.bfloat16,\n              attn_implementation=\"flash_attention_2\",\n              use_cache=False,\n          )\ninputs = xxxx (Left Padding Input Tokens)\nreward = reward_model.model(*inputs).last_hidden_state\nreward = reward_model.score(reward)[:, -1]\n```\n\n### RL Training: PPO\u002FREINFORCE++ with Ray and vLLM\n\nAll RL training in OpenRLHF runs through the **agent execution pipeline**. 
The following example shows single-turn agent execution (default mode) with Hybrid Engine for optimal performance:\n\n```bash\n# launch the master node of ray in container\nray start --head --node-ip-address 0.0.0.0 --num-gpus 8\n\n# if you want to launch ray on more nodes, use\nray start --address {MASTER-NODE-ADDRESS}:6379  --num-gpus 8\n\nray job submit --address=\"http:\u002F\u002F127.0.0.1:8265\" \\\n   --runtime-env-json='{\"working_dir\": \"\u002Fopenrlhf\"}' \\\n   -- python3 -m openrlhf.cli.train_ppo_ray \\\n   --ref.num_nodes 1 \\\n   --ref.num_gpus_per_node 8 \\\n   --reward.num_nodes 1 \\\n   --reward.num_gpus_per_node 8 \\\n   --critic.num_nodes 1 \\\n   --critic.num_gpus_per_node 8 \\\n   --actor.num_nodes 1 \\\n   --actor.num_gpus_per_node 8 \\\n   --vllm.num_engines 4 \\\n   --vllm.tensor_parallel_size 2 \\\n   --train.colocate_all \\\n   --vllm.gpu_memory_utilization 0.5 \\\n   --actor.model_name_or_path OpenRLHF\u002FLlama-3-8b-sft-mixture \\\n   --reward.model_name_or_path OpenRLHF\u002FLlama-3-8b-rm-700k \\\n   --ckpt.output_dir \u002Fopenrlhf\u002Fexamples\u002Ftest_scripts\u002Ffinal\u002Fllama3-8b-rlhf \\\n   --ckpt.path \u002Fopenrlhf\u002Fexamples\u002Ftest_scripts\u002Fckpt\u002Fllama3-8b-rlhf \\\n   --ckpt.save_hf \\\n   --train.batch_size 128 \\\n   --rollout.batch_size 1024 \\\n   --train.dynamic_batch_enable \\\n   --rollout.n_samples_per_prompt 1 \\\n   --train.max_epochs 1 \\\n   --prompt_max_len 1024 \\\n   --data.max_samples 100000 \\\n   --generate_max_len 1024 \\\n   --ds.zero_stage 3 \\\n   --ds.param_dtype bf16 \\\n   --actor.adam.lr 5e-7 \\\n   --critic.adam.lr 9e-6 \\\n   --algo.kl.init_coef 0.01 \\\n   --data.prompt_dataset OpenRLHF\u002Fprompt-collection-v0.1 \\\n   --data.input_key context_messages \\\n   --data.apply_chat_template \\\n   --reward.normalize_enable \\\n   --actor.gradient_checkpointing_enable \\\n   --ds.packing_samples \\\n   --vllm.sync_backend nccl \\\n   --vllm.enforce_eager \\\n   --vllm.enable_sleep 
\\\n   --ds.enable_sleep \\\n   --logger.wandb.key {wandb_token}\n\n# Algorithm Variants (all use single-turn agent execution):\n# --algo.advantage.estimator reinforce        # REINFORCE++\n# --algo.advantage.estimator rloo             # RLOO\n# --algo.advantage.estimator reinforce_baseline  # REINFORCE++-baseline (best for RLVR)\n# --algo.advantage.estimator group_norm       # GRPO\n# --algo.advantage.estimator dr_grpo          # Dr. GRPO\n\n# Advanced Options:\n# --algo.kl.init_coef 0                                    # No reference model\n# --reward.remote_url http:\u002F\u002Fhost:5000\u002Fget_reward         # HTTP reward model\n# --rollout.n_samples_per_prompt 4                            # Multiple samples per prompt\n# --rollout.vllm_generate_batch_size 2048                     # Oversample at generation (> rollout_batch_size); requires --train.async_enable\n# --algo.advantage.is_correction_enable                         # vLLM importance sampling correction for off-policy rollouts\n# --algo.advantage.is_correction_type tis                       # Correction type: tis (token clamp) | icepop (token filter) | seq-mask-tis (seq-level geom mean)\n# --algo.advantage.is_correction_threshold 0.5 5.0               # IS truncation interval: [low, high]\n# --ckpt.best_metric_key eval_default_pass1                # Save best checkpoint by eval metric (empty = auto-detect first pass1, 'none' = disable)\n# --actor.policy_loss_type gspo                             # Use GSPO policy loss variant (vs default 'ppo')\n```\n\n> [!TIP]\n> **For reasoning tasks (RLVR)**: Use `--algo.advantage.estimator reinforce_baseline` for REINFORCE++-baseline—it's robust to different reward scales.\n\n> [!NOTE]\n> **Ray Environment Setup**: Let Ray auto-deploy with `--runtime-env-json='{\"setup_commands\": [\"pip install openrlhf[vllm]\"]}'`\n\n> [!NOTE]\n> **Troubleshooting GPU index errors**: Set `export RAY_EXPERIMENTAL_NOSET_CUDA_VISIBLE_DEVICES=1` if you encounter DeepSpeed GPU device 
setup issues.\n\n📚 **More Examples**: See [examples\u002Fscripts](.\u002Fexamples\u002Fscripts\u002F) and [Documentation](https:\u002F\u002Fopenrlhf.readthedocs.io\u002Fen\u002Flatest\u002Fusage.html)\n\n---\n\n\u003Ca id=\"single-turn-agent-reinforced-fine-tuning-with-custom-rewards\">\u003C\u002Fa>\n## 🎯 Single-Turn Agent: Reinforced Fine-tuning with Custom Rewards\n\n**Single-turn agent execution** (the default mode) supports custom reward functions, which makes it well suited to reinforced fine-tuning without a trained reward model. Instead of using a pre-trained reward model, you provide a Python function that computes rewards on the fly.\n\n**Ideal for**:\n- Rule-based rewards (length, format, code execution, math verification)\n- External API rewards (judge models, compilers, test suites)\n- Hybrid rewards (combining multiple signals)\n\n### Example: Custom Reward Function\n\n```python\n# reward_func.py\nimport torch\n\ndef reward_func(queries, prompts, labels):\n    \"\"\"\n    Compute custom rewards for generated responses.\n    \n    Args:\n        queries: List[str] - Full text (prompt + response)\n        prompts: List[str] - Original prompts only\n        labels: List[str] - Ground truth labels (from --data.label_key)\n    \n    Returns:\n        dict with:\n            - rewards: Tensor for advantage calculation\n            - scores: Tensor for dynamic filtering (0-1 range)\n            - extra_logs: Dict for wandb logging\n    \"\"\"\n    batch_size = len(queries)\n    \n    # Example: Random rewards (replace with your logic)\n    # Real examples: code execution, math verification, format checking\n    reward = torch.randint(0, 2, (batch_size,)).float()\n\n    return {\n        \"rewards\": reward,           # Used in RL advantage calculation\n        \"scores\": reward,            # Used for dynamic filtering (--algo.dynamic_filtering_enable)\n        \"extra_logs\": {              # Logged to wandb\n            \"custom_metric\": reward.mean().item(),\n        },\n    }\n```\n\n### 
Usage\n\n```bash\nray job submit --address=\"http:\u002F\u002F127.0.0.1:8265\" \\\n  --runtime-env-json='{\"working_dir\": \"\u002Fopenrlhf\"}' \\\n  -- python3 -m openrlhf.cli.train_ppo_ray \\\n  --actor.model_name_or_path meta-llama\u002FMeta-Llama-3-8B \\\n  --train.dynamic_batch_enable \\\n  --reward.remote_url \u002Fpath\u002Fto\u002Freward_func.py \\\n  --data.label_key answer \\\n  --data.prompt_dataset your_prompt_dataset \\\n  ... # other training args\n```\n\n**Key Parameter**: `--data.label_key answer` passes the \"answer\" field from your dataset to `reward_func` as `labels`.\n\n> [!TIP]\n> **Use Cases**: Code generation (execute tests), Math (verify solutions), Formatting (check structure), Multi-objective (combine multiple signals)\n\n📖 **Full Example**: [examples\u002Fscripts\u002Ftrain_ppo_with_reward_fn.sh](.\u002Fexamples\u002Fscripts\u002Ftrain_ppo_with_reward_fn.sh)\n\n---\n\n\u003Ca id=\"multi-turn-agent-complex-environment-interactions\">\u003C\u002Fa>\n## 🤖 Multi-Turn Agent: Complex Environment Interactions\n\nFor tasks requiring **multi-step interactions** (reasoning chains, coding with feedback, game playing), OpenRLHF provides the **Multi-Turn Agent Execution** mode.\n\n### Building Custom Multi-Turn Agents\n\nImplement `AgentInstanceBase` with `reset\u002Fstep` methods:\n\n```python\n# agent_func.py\nimport random\nfrom typing import Any, Dict\n\nimport torch\nfrom openrlhf.utils.agent import AgentInstanceBase, MultiTurnAgentExecutor\n\n\n# A simple n-step random environment\nclass AgentInstance(AgentInstanceBase):\n    async def __init__(self, *args, **kwargs):\n        self.step_idx = 0\n        self.max_steps = random.randint(1, 3)  # 1-3 steps\n\n    async def reset(self, states: dict, **kwargs):\n        return {\"observation\": states[\"observation\"]}  # Return original text observation\n\n    async def step(self, states: dict, **kwargs) -> Dict[str, Any]:\n        print(f\"step_idx: {self.step_idx}, max_steps: 
{self.max_steps}\")\n\n        observation_text = states[\"observation_text\"]\n        action_text = states[\"action_text\"]\n        label = states[\"label\"]\n\n        # Check if episode is done\n        done = self.step_idx >= self.max_steps\n        reward = torch.randint(0, 2, (1,)).float() if done else torch.tensor(0)\n\n        # Generate environment feedback based on whether episode is done\n        environment_feedback = (\n            \"\\n\\nHuman: [CORRECT]\\n\u003C\u002Fs>\"\n            if done\n            else \"\\n\\nHuman: [INCORRECT]\\nPlease analyze the issues and try again.\\n\u003C\u002Fs>\\n\\nAssistant: \"\n        )\n\n        self.step_idx += 1\n\n        return {\n            \"rewards\": reward,  # Rewards for advantage calculation\n            \"scores\": reward,  # Scores for dynamic filtering (0-1 reward)\n            \"environment_feedback\": environment_feedback,  # Environment feedback text\n            \"done\": done,  # Boolean indicating if the episode is complete\n            \"sampling_params\": states.get(\"sampling_params\", None),  # Parameters for vLLM sampling in next step\n            \"extra_logs\": {\"dummy_scores\": reward},  # Additional logging information\n        }\n\n\nclass AgentExecutor(MultiTurnAgentExecutor):\n    def __init__(self):\n        super().__init__(AgentInstance)\n```\n\nThen launch with:\n\n```bash\nray job submit --address=\"http:\u002F\u002F127.0.0.1:8265\" \\\n  --runtime-env-json='{\"working_dir\": \"\u002Fopenrlhf\"}' \\\n  -- python3 -m openrlhf.cli.train_ppo_ray \\\n  ...\n  --train.dynamic_batch_enable \\\n  --train.agent_func_path \u002Fpath\u002Fto\u002Fagent_func.py \\\n  --train.async_enable  # Optional: enable async pipeline\n```\n\n### Configuration Options\n\n**Async Pipeline** (for higher throughput):\n- Enable: `--train.async_enable`\n- Buffer size: `--train.async_queue_size 1` (larger = more off-policy, default 1)\n- Partial rollout: `--train.partial_rollout_enable` — uses vLLM 
pause\u002Fresume for weight sync instead of locking, allowing generation to overlap with training. In-flight samples may contain tokens from both old and new weights.\n\n**Training Modes**:\n- **Synchronous**: Default, better stability\n- **Asynchronous**: Higher throughput, may affect convergence\n- **Hybrid Engine**: Best GPU utilization with `--train.colocate_all` (remove `--train.async_enable`)\n\n> [!NOTE]\n> For fully custom token-level execution, inherit `AgentExecutorBase` and implement `execute()`. This design enforces the **token-in-token-out principle** to keep sampling and training consistent.\n\n> [!WARNING] \n> Asynchronous training may affect training stability. Use it only when throughput is critical and convergence is validated.\n\n📚 **Examples**:\n- Single-turn: [train_ppo_ray_hybrid_engine.sh](.\u002Fexamples\u002Fscripts\u002Ftrain_ppo_ray_hybrid_engine.sh)\n- Custom reward: [train_ppo_with_reward_fn.sh](.\u002Fexamples\u002Fscripts\u002Ftrain_ppo_with_reward_fn.sh)\n- Multi-turn: [train_reinforce_baseline_ray_agent_async.sh](.\u002Fexamples\u002Fscripts\u002Ftrain_reinforce_baseline_ray_agent_async.sh)\n- Multi-turn VLM (image feedback): [vlm_multiturn_agent.py](.\u002Fexamples\u002Fpython\u002Fvlm_multiturn_agent.py)\n\n### OpenAI-Compatible Agent Server\n\nFor multi-turn agents that need an OpenAI-compatible chat API (e.g., integrating external tool-use frameworks), [`agent_func_openai_server_executor.py`](.\u002Fexamples\u002Fpython\u002Fagent_func_openai_server_executor.py) wraps vLLM as a local `\u002Fv1\u002Fchat\u002Fcompletions` server while collecting token-level traces for RL training.\n\n- Exposes standard OpenAI endpoints (`\u002Fv1\u002Fchat\u002Fcompletions`, `\u002Fv1\u002Fmodels`, `\u002Ftokenize`)\n- Automatically collects token IDs and logprobs per session for RL training\n- Delta-tokenization reuses prefix tokens across multi-turn calls\n- Override `run_agent()` to plug in your own multi-turn workflow\n\n```bash\npython3 -m 
openrlhf.cli.train_ppo_ray \\\n  --train.agent_func_path examples\u002Fpython\u002Fagent_func_openai_server_executor.py \\\n  ... # other training args\n```\n\n---\n\n\u003Ca id=\"advanced-topics\">\u003C\u002Fa>\n## 🔧 Advanced Topics\n\n### LoRA: Merging Adapters\n\nWhen using LoRA\u002FQLoRA, OpenRLHF saves only the adapter weights. To deploy or continue training, merge the adapter with the base model:\n\n```bash\npython -m openrlhf.cli.lora_combiner \\\n    --model_path meta-llama\u002FMeta-Llama-3-8B \\\n    --lora_path .\u002Fcheckpoint\u002Fllama3-8b-rm \\\n    --output_path .\u002Fcheckpoint\u002Fllama-3-8b-rm-combined \\\n    --is_rm \\\n    --ds.param_dtype bf16\n```\n\n### Performance Tuning Guide\n\nOptimize OpenRLHF for your hardware and workload with these recommendations:\n\n#### 🎯 Execution Modes: Throughput vs. Stability\n\nPick the execution mode based on your priority — OpenRLHF gives you a clear tradeoff knob:\n\n| Mode | Flags | Characteristics | When to Use |\n|------|-------|-----------------|-------------|\n| **Hybrid Engine (colocated)** | `--train.colocate_all`\u003Cbr>`--vllm.enable_sleep`\u003Cbr>`--ds.enable_sleep` | **Most stable** — strictly on-policy, every rollout uses the latest weights. Serial generate→train cycle. | Research, sensitive RL algorithms, reproducibility, recipe validation |\n| **Async Training** | `--train.async_enable`\u003Cbr>`--train.async_queue_size N` | **Highest throughput** — generation and training run in parallel. Tune off-policyness via `--train.async_queue_size` (larger = more off-policy). | Production throughput when convergence is already validated |\n| **Async + Partial Rollout** | `--train.async_enable`\u003Cbr>`--train.partial_rollout_enable` | **Maximum overlap** — vLLM pause\u002Fresume instead of locking, in-flight samples may mix old\u002Fnew weights. Most aggressive off-policy. 
| Pushing async throughput further; pair with `--algo.advantage.is_correction_enable` |\n\n#### ⚡ Other Speed Optimizations\n\n| Optimization | Flag | When to Use |\n|--------------|------|-------------|\n| **Sample Packing** | `--ds.packing_samples` | Always (especially training) |\n| **Dynamic Batch** | `--train.dynamic_batch_enable` | Variable sequence lengths |\n| **DeepCompile** | `--ds.deepcompile` | PyTorch 2.0+ |\n| **Overlap Comm** | `--ds.overlap_comm` | Sufficient GPU memory |\n| **Prefix Caching** | vLLM config | `n_samples_per_prompt` > 1 |\n| **Oversampling** | `--rollout.vllm_generate_batch_size > --rollout.batch_size` | Async mode, to amortize generation cost \u002F feed dynamic filtering |\n\n#### 💾 Memory Management\n\n**When you have enough memory**:\n- ✅ Disable `--ds.adam_offload`\n- ✅ Enable `--ds.overlap_comm`\n- ✅ Use `--train.colocate_critic_reward` and `--train.colocate_actor_ref`\n\n**When hitting OOM**:\n- ❌ Disable all `--train.colocate_*` options\n- ✅ Reduce batch sizes\n- ✅ Enable gradient checkpointing\n\n#### 🎮 Batch Size Tuning\n\n1. **Generation Phase**: Maximize `--rollout.micro_batch_size`, minimize vLLM TP size\n2. **Training Phase**: Maximize `--train.micro_batch_size`, enable `--ds.packing_samples`\n3. **vLLM**: Always use `--vllm.sync_backend nccl`\n\n> [!TIP]\n> **Quick Start Template**: For 8x A100 (80GB), try Hybrid Engine + `--vllm.gpu_memory_utilization 0.5` + `--train.colocate_all`\n\n📖 **More Details**: [Performance Tuning Documentation](https:\u002F\u002Fopenrlhf.readthedocs.io\u002Fen\u002Flatest\u002Fperformance.html)\n\n\n## Companies and Organizations using OpenRLHF\n\n- Google\n- ByteDance\n- Tencent\n- Alibaba\n- Baidu\n- China Telecom\n- Vivo\n- Allen AI\n- NexusFlow\n- Jülich Supercomputing Centre (JSC)\n- Berkeley Starling Team\n- M-A-P\n- ...\n\n## Join Us\n\n**How to Join?**\n\n1. Email us at janhu9527@gmail.com or join the [GitHub Organization](https:\u002F\u002Fgithub.com\u002FOpenRLHF). 
Please include the following details:\n   - Your name\n   - Your GitHub username\n   - Your areas of interest\n   - Your skills and experience related to NLP and\u002For AI\n1. You can also join us through the official GitHub [OpenRLHF ↗](https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF) project page. Just open an issue describing how you would like to contribute, and we will get back to you.\n\n**What can you do?**\n\n1. Join the team and participate in the development of the OpenRLHF project.\n1. Contribute to the project by submitting pull requests.\n1. Help improve documentation, fix bugs, or create new features.\n1. Share the project and help us grow the community.\n\n## Sponsor Us\n\nYour sponsorship helps us maintain and improve OpenRLHF. If you find this project useful, please consider sponsoring us on [Open Collective ↗](https:\u002F\u002Fopencollective.com\u002FOpenRLHF).\n\n## Starchart\n\n[![Star History Chart](https:\u002F\u002Fapi.star-history.com\u002Fsvg?repos=OpenRLHF\u002FOpenRLHF&type=Date)](https:\u002F\u002Fstar-history.com\u002F#OpenRLHF\u002FOpenRLHF&Date)\n\n## Contributors\n\nA big thank you to all our contributors! 
If you want to contribute, feel free to make a pull request or create an issue.\n\n\u003Ca href=\"https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF\u002Fgraphs\u002Fcontributors\">\n  \u003Cimg src=\"https:\u002F\u002Fcontrib.rocks\u002Fimage?repo=OpenRLHF\u002FOpenRLHF\" \u002F>\n\u003C\u002Fa>\n\n## References & Acknowledgements\n\nWe would like to express our gratitude to the following projects and organizations for their contributions to the field of AI and NLP:\n\n- [Hugging Face Transformers ↗](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftransformers)\n- [OpenAI GPT ↗](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fgpt-3)\n- [LLaMA ↗](https:\u002F\u002Fllama.meta.com\u002F)\n- [DeepSpeed ↗](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FDeepSpeed)\n- [Ray ↗](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fray)\n\nOur project would also like to thank [ColossalChat](https:\u002F\u002Fgithub.com\u002Fhpcaitech\u002FColossalAI\u002Ftree\u002Fmain\u002Fapplications\u002FColossalChat) and [DeepSpeedChat](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FDeepSpeedExamples\u002Ftree\u002Fmaster\u002Fapplications\u002FDeepSpeed-Chat). In the early stages of the project, we referred to their code design. 
\nOur project would like to thank [Netmind.AI](https:\u002F\u002Fwww.netmind.ai\u002F) for providing GPU support for the development of ring attention.\n\n(2024\u002F7) Our GitHub organization has changed from OpenLLMAI to OpenRLHF.\n\n## Citation\nOpenRLHF\n\n```\n@article{hu2024openrlhf,\n  title={OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework},\n  author={Jian Hu and Xibin Wu and Zilin Zhu and Xianyu and Weixun Wang and Dehao Zhang and Yu Cao},\n  journal={arXiv preprint arXiv:2405.11143},\n  year={2024}\n}\n```\nREINFORCE++-baseline\n```\n@article{hu2026reinforce++,\n  title={Reinforce++: A simple and efficient approach for aligning large language models},\n  author={Hu, Jian},\n  journal={arXiv preprint arXiv:2501.03262},\n  year={2026}\n}\n```\n\n______________________________________________________________________\n\n*OpenRLHF © 2026 OpenRLHF. All Rights Reserved.*\n","OpenRLHF is an easy-to-use, scalable, and high-performance agentic reinforcement learning framework based on Ray. It integrates advanced algorithms such as PPO, DAPO, and REINFORCE++, supports vision-language models (VLM) and TIS, and uses a vLLM-based distributed architecture for efficient parallel processing. The framework adopts a unified agent-based design paradigm that adapts flexibly to tasks of different scales. It is particularly suited to training large language models that learn from human feedback, in areas such as dialogue system optimization and game AI development.",2,"2026-06-11 03:35:24","high_star"]