[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-1132":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":23,"hasPages":25,"topics":26,"createdAt":10,"pushedAt":10,"updatedAt":27,"readmeContent":28,"aiSummary":29,"trendingCount":16,"starSnapshotCount":16,"syncStatus":30,"lastSyncTime":31,"discoverSource":32},1132,"Relax","redai-infra\u002FRelax","redai-infra","An Asynchronous Reinforcement Learning Engine for Omni-Modal Post-Training at Scale","https:\u002F\u002Fredai-infra.github.io\u002FRelax",null,"Python",423,45,343,8,0,10,20,61,30,81.09,"Apache License 2.0",false,"main",true,[],"2026-06-12 04:00:07","\u003Cdiv align=\"center\">\n\n## Relax: An Asynchronous Reinforcement Learning Engine for Omni-Modal Post-Training at Scale\n\n**Towards Async, Omni-Modal RL at Scale, Just Relax.**\n\n\u003Cimg src=\".\u002Fassets\u002FRelax.jpg\" width=\"100%\" alt=\"Relax\">\n\n\u003Cp>\n  \u003Ca href=\".\u002FLICENSE\">\n    \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Flicense-Apache%202.0-blue.svg\" alt=\"License\">\n  \u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fwww.python.org\u002Fdownloads\u002F\">\n    \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpython-3.12-blue.svg\" alt=\"Python 3.12\">\n  \u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2604.11554\">\n    \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fstatic\u002Fv1?label=arXiv&message=Paper&color=red\" alt=\"arXiv\">\n  \u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fredai-infra.github.io\u002FRelax\">\n    \u003Cimg 
src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fdocs-latest-brightgreen.svg\" alt=\"Documentation\">\n  \u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fmy.feishu.cn\u002Fwiki\u002FZcTQwrmwbiWRhvkgcMxciefHn7f\" target=\"_blank\">\n    \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FWeChat-green?logo=wechat\" alt=\"WeChat QR\">\n  \u003C\u002Fa>\n\u003C\u002Fp>\n\n\u003Cp>\n  \u003Ca href=\".\u002FREADME.md\">📖 English\u003C\u002Fa> | \u003Ca href=\".\u002FREADME_zh.md\">📖 中文\u003C\u002Fa>\n\u003C\u002Fp>\n\u003C\u002Fdiv>\n\n______________________________________________________________________\n\n**Relax** (**R**einforcement **E**ngine **L**everaging **A**gentic **X**-modality) is a high-performance reinforcement learning post-training framework open-sourced by the Xiaohongshu AI Infra Team for multimodal large language models. Built on Ray Serve with a service-oriented architecture, Relax uses Megatron-LM as the training backend and SGLang as the inference engine. 
Through the [TransferQueue](https:\u002F\u002Fgithub.com\u002Fredai-infra\u002FTransferQueue) data transfer system, it achieves complete decoupling of training and inference, supporting end-to-end multimodal RL training from text to images, videos, and audio.\n\n______________________________________________________________________\n\n## ✨ Highlights\n\n- 🌐 **Full Omni-Modal Training** — One unified framework for text, vision, and audio RL — one of the few systems capable of end-to-end Omni model (Qwen3-Omni) post-training\n- ⚙️ **Service-Oriented Six-Layer Architecture** — Every role is an independent Ray Serve deployment, with native service-level elastic scheduling and fault recovery\n- ⚡ **Fully Async via TransferQueue** — Rollout, Actor, ActorFwd, Reference, and Advantages run on independent GPU clusters with streaming data exchange and configurable staleness\n- 🤖 **Agentic RL** — Multi-turn interaction, loss masking, flexible termination, and VLM multimodal context carry-over for closed-loop \"execute → observe → decide\" training\n- 🔀 **Elastic Rollout Scaling** — Dynamically grow\u002Fshrink inference engines mid-training via HTTP REST API, with same-cluster (`ray_native`) and cross-cluster (`external`) federation modes\n- 🧠 **Rich Algorithm Suite** — GRPO, GSPO, SAPO, and On-Policy Distillation out of the box, with pluggable rewards and built-in **GenRM** (LLM-as-judge) mode\n- 🚀 **Megatron + SGLang Backends** — Megatron-LM (TP\u002FPP\u002FCP\u002FEP) for MoE and deep models, SGLang for high-throughput inference, DCS for NCCL-broadcast weight sync\n- 📦 **Production-Ready Ops** — HealthManager auto-recovery, centralized Metrics Service (WandB \u002F TensorBoard \u002F ClearML), and Apprise real-time notifications\n\n______________________________________________________________________\n\n## 📢 News\n\n| 📣 Updates                                      |\n| :---------------------------------------------- |\n| **\\[04\u002F15\u002F2026\\]** 🎉 Relax is now 
open-source! |\n\n______________________________________________________________________\n\n## 🏗️ Architecture\n\n\u003Cdiv align=\"center\">\n  \u003Cimg src=\".\u002Fassets\u002Farch.png\" width=\"80%\" alt=\"Relax Architecture\">\n\u003C\u002Fdiv>\n\nRelax adopts a **six-layer service-oriented architecture** where every role is deployed as an independent [Ray Serve](https:\u002F\u002Fdocs.ray.io\u002Fen\u002Flatest\u002Fserve\u002Findex.html) deployment, cleanly separating entrypoints, orchestration, components, engines, backends, and distributed capabilities:\n\n| Layer             | Responsibility                                                                                                               |\n| :---------------- | :--------------------------------------------------------------------------------------------------------------------------- |\n| **Entrypoints**   | `train.py` — signal handling, CLI parsing, Ray cluster connection, Controller launch                                         |\n| **Orchestration** | `Controller` (training loop, global restart), `Service` (placement groups, lifecycle), `Registry` (role & algorithm mapping) |\n| **Components**    | Ray Serve deployments: **Actor**, **Rollout**, **Critic**, **ActorFwd**, **Advantages**, **GenRM**                           |\n| **Engine**        | SGLang rollout engine, pluggable reward functions, request router, data filters                                              |\n| **Backends**      | **Megatron-LM** training backend (TP\u002FPP\u002FCP\u002FEP) and **SGLang** inference engine                                               |\n| **Distributed**   | Ray Actor groups (RolloutManager \u002F GenRMManager) and **DCS** (Distributed Checkpoint Service) for NCCL\u002FGLOO weight sync      |\n\n**Two execution modes** are supported:\n\n- **Colocate (Sync)** — Actor and Rollout time-share the same GPUs; Rollout writes a full batch to TransferQueue, then yields GPUs for training. 
This mode is memory-efficient on constrained hardware and strictly on-policy (`max_staleness=0`).\n- **Fully Async** — Actor, Rollout, ActorFwd, Reference, and Advantages run on **independent GPU clusters** in parallel, exchanging data through TransferQueue and syncing weights asynchronously through DCS for maximum throughput with configurable staleness.\n\n> 📖 Learn more: [Architecture Guide](docs\u002Fen\u002Fguide\u002Farchitecture.md) · [Fully Async Training](docs\u002Fen\u002Fguide\u002Ffully-async-training.md) · [Elastic Rollout Scaling](docs\u002Fen\u002Fguide\u002Felastic-rollout.md)\n\n______________________________________________________________________\n\n## 🧠 Supported Algorithms\n\n| Algorithm                  | Type                | Description                             |\n| :------------------------- | :------------------ | :-------------------------------------- |\n| **GRPO**                   | Policy Optimization | Group Relative Policy Optimization      |\n| **GSPO**                   | Policy Optimization | Group Sequence Policy Optimization      |\n| **SAPO**                   | Policy Optimization | Sample-Aware Policy Optimization        |\n| **On-Policy Distillation** | Knowledge Transfer  | Teacher-student KL penalty distillation |\n\n> 📖 Adding a new algorithm is straightforward — implement a service class, register it in the `ALGOS` registry, and you're done.\n\n______________________________________________________________________\n\n## 🤖 Supported Models\n\nRelax is designed for **omni-modal RL training** — text, vision, and audio in one unified framework. 
Multimodal data is configured via the `--multimodal-keys` flag, with complete image\u002Fvideo\u002Faudio processing pipelines under `relax\u002Futils\u002Fmultimodal\u002F` for fine-grained control over image token counts, video frame sampling, and audio sample rates.\n\n| Model Family   | Sizes             | Modality              | Typical Tasks                                        | Backend  |\n| :------------- | :---------------- | :-------------------- | :--------------------------------------------------- | :------- |\n| **Qwen3**      | 4B, 30B-A3B (MoE) | Text                  | Math reasoning, code, multi-turn dialogue, tool use  | Megatron |\n| **Qwen3-VL**   | 4B, 30B-A3B       | Vision + Language     | Visual QA, image understanding, multimodal reasoning | Megatron |\n| **Qwen3.5**    | 30B-A3B           | Vision + Language     | Visual QA, image understanding, multimodal reasoning | Megatron |\n| **Qwen3-Omni** | 30B-A3B           | Text + Vision + Audio | Audio-visual QA, omni-modal understanding            | Megatron |\n\n> 📖 New architectures are integrated via [Megatron Bridge](relax\u002Fbackends\u002Fmegatron\u002Fmbridge\u002F) for automatic HF ↔ Megatron weight conversion.\n\n______________________________________________________________________\n\n## 📦 Installation\n\nThe recommended way to run Relax is via the official Docker image, which ships with all CUDA, PyTorch, Megatron-LM, SGLang, and Ray dependencies pre-installed and version-matched.\n\n```bash\n# Pull the official image\ndocker pull relaxrl\u002Frelax:latest\n\n# Launch a container with GPUs, shared memory, and your workspace mounted\ndocker run -it --gpus all --ipc=host --network=host \\\n  -v \u002Fpath\u002Fto\u002Fyour\u002Fworkspace:\u002Froot \\\n  relaxrl\u002Frelax:latest bash\n\n# Inside the container\ngit clone https:\u002F\u002Fgithub.com\u002Fredai-infra\u002FRelax.git \u002Froot\u002FRelax\ncd \u002Froot\u002FRelax && pip install -e .\n```\n\n> 📖 For GPU driver 
requirements, multi-node setup, and persistent storage mounts, see the [Installation Guide](docs\u002Fen\u002Fguide\u002Finstallation.md).\n\n______________________________________________________________________\n\n## 🚀 Quick Start\n\nThree end-to-end tasks cover **text**, **vision-language**, and **omni-modal** training. Each task downloads a public HuggingFace dataset and model, then launches training with a single script. Set `EXP_DIR=\u002Froot` (or wherever your models and datasets live) and the scripts will locate them automatically.\n\n### Task 1 — DAPO Math (Text, 8 GPUs)\n\nTrain Qwen3-4B on [`dapo-math-17k`](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fzhuzilin\u002Fdapo-math-17k) with GRPO. Reward is rule-based answer extraction plus symbolic math verification.\n\n```bash\nhf download --repo-type dataset zhuzilin\u002Fdapo-math-17k --local-dir \u002Froot\u002Fdapo-math-17k\nhf download Qwen\u002FQwen3-4B --local-dir \u002Froot\u002FQwen3-4B\n\ncd \u002Froot\u002FRelax && export EXP_DIR=\u002Froot\nbash scripts\u002Ftraining\u002Ftext\u002Frun-qwen3-4B-8xgpu.sh\n```\n\n### Task 2 — Open-R1 (Vision-Language, 8 GPUs)\n\nTrain Qwen3-VL-4B on [`multimodal-open-r1-8k-verified`](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Flmms-lab\u002Fmultimodal-open-r1-8k-verified) with GRPO using the `openr1mm` reward.\n\n```bash\nhf download --repo-type dataset lmms-lab\u002Fmultimodal-open-r1-8k-verified \\\n  --local-dir \u002Froot\u002Fmultimodal-open-r1-8k-verified\nhf download Qwen\u002FQwen3-VL-4B-Instruct --local-dir \u002Froot\u002FQwen3-VL-4B-Instruct\n\ncd \u002Froot\u002FRelax && export EXP_DIR=\u002Froot\nbash scripts\u002Ftraining\u002Fmultimodal\u002Frun-qwen3-vl-4B-8xgpu.sh\n```\n\n### Task 3 — AVQA (Omni-Modal: Image + Audio, 16 GPUs \u002F 2 nodes)\n\nTrain Qwen3-Omni-30B-A3B on [`AVQA-R1-6K`](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fharryhsing\u002FAVQA-R1-6K) with GRPO and a multiple-choice reward.\n\n```bash\nhf download 
--repo-type dataset harryhsing\u002FAVQA-R1-6K --local-dir \u002Froot\u002FAVQA-R1-6K\nhf download Qwen\u002FQwen3-Omni-30B-A3B-Instruct --local-dir \u002Froot\u002FQwen3-Omni-30B-A3B-Instruct\n\ncd \u002Froot\u002FRelax && export EXP_DIR=\u002Froot\nbash -x scripts\u002Fentrypoint\u002Fspmd-multinode.sh \\\n  scripts\u002Ftraining\u002Fmultimodal\u002Frun-qwen3-30B-A3B-omni-16xgpu.sh\n```\n\nOnce running, you should see logs like:\n\n```text\nFinish rollout 0\u002F200\ntraining step 0\u002F200\n```\n\nCheckpoints are saved in Megatron DCP format; convert them to HuggingFace weights with `scripts\u002Ftools\u002Fconvert_torch_dist_to_hf_bridge.py`.\n\n> 📖 Full walkthrough: [Quick Start Guide](docs\u002Fen\u002Fguide\u002Fquick-start.md) · [Customize Training](docs\u002Fen\u002Fguide\u002Fcustomize-training.md) · [Configuration Guide](docs\u002Fen\u002Fguide\u002Fconfiguration.md)\n\n______________________________________________________________________\n\n## ⚡ Key Features\n\n### Fully Async Training via TransferQueue\n\nIn fully-async mode, Rollout, Actor, ActorFwd, Reference, and Advantages run on **independent GPU clusters** in parallel. 
Three mechanisms make this efficient:\n\n- **StreamingDataLoader** — Actor begins consuming samples as Rollout incrementally writes them to TransferQueue, eliminating GPU idle time between phases.\n- **Configurable staleness** — `--max-staleness` bounds how far off-policy the training data may drift, trading on-policy accuracy against throughput.\n- **DCS weight sync** — After each training step, weights are NCCL-broadcast from Actor to Rollout\u002FActorFwd\u002FReference via the Distributed Checkpoint Service, overlapped with the next training computation.\n\n### Agentic RL\n\nRelax provides first-class support for multi-turn, closed-loop \"execute → observe → decide\" training:\n\n- **Multi-turn sampling with loss masking** — model outputs (mask=1) are cleanly separated from environment observations (mask=0) so only model actions participate in training.\n- **Environment \u002F Rollout decoupling** — a standard `BaseInteractionEnv` interface (`reset`, `step`, `format_observation`) lets environments evolve independently of the sampler.\n- **VLM multimodal context carry-over** — `image_data` on the Rollout side and `multimodal_train_inputs` on the training side are incrementally merged each turn so visual observations concatenate correctly.\n- **Flexible termination** — combine `max_turns`, token-budget exhaustion, and env-signalled `done`. 
The DeepEyes example demonstrates Agentic multi-turn GRPO with Qwen3-VL-30B-A3B.\n\n### Elastic Rollout Scaling\n\nSince 60–70% of RL training time is spent in the Rollout phase, Relax exposes **HTTP REST APIs** to dynamically add or remove inference engines mid-training without interrupting the training loop:\n\n- **`ray_native`** mode — specify a target engine count; Relax allocates resources and launches new SGLang engines inside the current Ray cluster.\n- **`external`** mode — register SGLang engines already deployed in other clusters for cross-cluster federated inference on preemptible or idle resources.\n\nScaling is asynchronous, idempotent, mutually exclusive, and supports graceful drain-and-remove plus cancellation with rollback. Engines from startup parameters are protected; only dynamically added engines can be scaled in.\n\n### Megatron Training Backend & SGLang Inference\n\nTraining uses **Megatron-LM** with full Tensor \u002F Pipeline \u002F Context \u002F Expert parallelism for MoE and ultra-deep models. Inference uses **SGLang** with process-lifecycle management. New model architectures plug in through Megatron Bridge for automatic HF ↔ Megatron weight conversion.\n\n### Pluggable Reward Hub\n\nBuilt-in rewards for math (DeepScaler, DAPO), GPQA, F1, IFBench, multiple-choice, multimodal Open-R1, and **GenRM** (generative LLM-as-judge). 
Add a custom reward by dropping a single file into `relax\u002Fengine\u002Frewards\u002F`.\n\n### Production Operations\n\n- **HealthManager** — heartbeat monitoring with two-tier auto-recovery (in-place restart first, global restart as fallback).\n- **Metrics Service** — centralized Ray Serve deployment that fans out to TensorBoard, WandB, and ClearML.\n- **Notifications** — real-time training alerts via Apprise (Slack, WeChat, email, and more).\n\n______________________________________________________________________\n\n## 📚 Documentation\n\nFull bilingual documentation is available at **[redai-infra.github.io\u002FRelax](https:\u002F\u002Fredai-infra.github.io\u002FRelax)**.\n\n______________________________________________________________________\n\n## 🧪 Examples\n\n| Example                                                      | Description                                           |\n| :----------------------------------------------------------- | :---------------------------------------------------- |\n| [DeepEyes](.\u002Fexamples\u002Fdeepeyes\u002F)                             | Multi-modal vision-language RL with Qwen3-VL          |\n| [On-Policy Distillation](.\u002Fexamples\u002Fon_policy_distillation\u002F) | Teacher-student knowledge distillation via KL penalty |\n\n______________________________________________________________________\n\n## 🤝 Contributing\n\nWe welcome contributions of all kinds! 
Please read our [Contributing Guide](docs\u002Fen\u002Fguide\u002Fhow-to-contribute.md) to get started.\n\n______________________________________________________________________\n\n## 📝 Citation\n\nIf you find Relax useful in your research, please cite:\n\n```bibtex\n@software{relax2026,\n  title  = {Relax: An Asynchronous Reinforcement Learning Engine for Omni-Modal Post-Training at Scale},\n  author = {Relax Contributors},\n  url    = {https:\u002F\u002Farxiv.org\u002Fabs\u002F2604.11554},\n  year   = {2026}\n}\n```\n\n______________________________________________________________________\n\n## 📜 License\n\nThis project is licensed under the [Apache License 2.0](.\u002FLICENSE).\n\n______________________________________________________________________\n\n## 🙏 Acknowledgements\n\nRelax is built upon the shoulders of excellent open-source projects:\n\n- [Slime](https:\u002F\u002Fgithub.com\u002FTHUDM\u002Fslime) — Scalable training and inference framework for reinforcement learning\n- [SGLang](https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang) — Fast serving framework for large language models\n- [Megatron-LM](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FMegatron-LM) & [Megatron-Bridge](https:\u002F\u002Fgithub.com\u002FNVIDIA-NeMo\u002FMegatron-Bridge) — Large-scale distributed training framework and HF ↔ Megatron weight conversion bridge, with sincere thanks to the entire **NVIDIA** team\n- [TransferQueue](https:\u002F\u002Fgithub.com\u002FAscend\u002FTransferQueue) — High-performance distributed data transfer queue\n- [Ray](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fray) — Distributed computing framework\n- [HuggingFace Transformers](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftransformers) — State-of-the-art model hub\n\nWe sincerely thank all contributors and the open-source community for making this project possible.\n","Relax is an asynchronous reinforcement learning engine for large-scale multimodal post-training. It is built on Ray 
Serve with a service-oriented six-layer architecture, uses Megatron-LM as the training backend and SGLang as the inference engine, and fully decouples training from inference through TransferQueue. The framework supports end-to-end reinforcement learning on multimodal data including text, images, video, and audio, and is particularly suited to optimizing large language models that need cross-modal processing. Relax also offers elastic scaling, fully asynchronous operation, and a rich built-in algorithm library to meet high-performance requirements in complex environments.",2,"2026-06-11 02:41:48","CREATED_QUERY"]