[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-80684":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":8,"htmlUrl":8,"language":9,"languages":8,"totalLinesOfCode":8,"stars":10,"forks":11,"watchers":11,"openIssues":12,"contributorsCount":12,"subscribersCount":12,"size":12,"stars1d":12,"stars7d":13,"stars30d":13,"stars90d":12,"forks30d":12,"starsTrendScore":12,"compositeScore":14,"rankGlobal":8,"rankLanguage":8,"license":15,"archived":16,"fork":16,"defaultBranch":17,"hasWiki":18,"hasPages":16,"topics":19,"createdAt":8,"pushedAt":8,"updatedAt":20,"readmeContent":21,"aiSummary":22,"trendingCount":12,"starSnapshotCount":12,"syncStatus":23,"lastSyncTime":24,"discoverSource":25},80684,"D-ARL","YinqiBai962\u002FD-ARL","YinqiBai962",null,"Python",95,1,0,48,0.9,"Apache License 2.0",false,"main",true,[],"2026-06-12 02:04:05","\u003Cdiv align=center>\n\n# D-ARL: A Distribution-Matched Asynchronous Reinforcement Learning Framework for Language Reasoning\n\n### ICML 2026\n### Stable and efficient asynchronous RL for LLM reasoning\n\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPaper-ICML_2026-red?style=flat-square)](.\u002F20518_D_ARL_A_Distribution_Mat.pdf)\n[![License](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-Apache--2.0-orange?style=flat-square)](.\u002FLICENSE)\n[![Built on verl](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FBuilt_on-verl-blue?style=flat-square)](https:\u002F\u002Fgithub.com\u002Fvolcengine\u002Fverl)\n[![Docs](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDocs-Available-black?style=flat-square)](.\u002Fdocs)\n\n\u003C\u002Fdiv>\n\u003Cbr>\n\n> [!IMPORTANT]\n> This repository contains the code for **D-ARL**, our **ICML 2026 accepted** paper on asynchronous reinforcement learning for language reasoning.\n>\n> D-ARL is built on top of `verl`, and focuses on one central challenge in async RL for LLM post-training:\n> **distribution mismatch between stale behavior policies and the 
current policy**.\n\n---\n\n## News 🔥\n\n- 2026.05  **D-ARL** is accepted to **ICML 2026**.\n\n---\n\n## Introduction 📖\n\nAsynchronous RL is attractive for LLM post-training because it decouples rollout generation from policy optimization and improves hardware utilization.\n\nHowever, this efficiency comes with a price:\n\n- rollouts are produced by **stale behavior policies**,\n- replayed data may no longer match the **current policy**,\n- and naive async training can become unstable.\n\n**D-ARL** addresses this problem with:\n\n1. **Distribution-matched replay** over the most recent `K` behavior policies,\n2. **Variance-guided sample selection** for better-aligned asynchronous data,\n3. **Multi-behavior policy optimization** to exploit multi-source replay.\n\nIn short, D-ARL aims to make async RL both **fast** and **stable**.\n\n\u003Cimg src=\".\u002Fdocs\u002Ffigures\u002Fdarl_page-01.png\" alt=\"darl-framework\" width=\"900\" div align=center>\n\n---\n\n## Why D-ARL?\n\nMost async RL pipelines use stale data because it is available.  
\nD-ARL uses stale data only when it is still **useful**.\n\n### Core idea\n\n- Keep a replay buffer over recent behavior policies\n- Measure how well samples match the current policy\n- Select **distribution-matched** high-quality samples\n- Optimize with a **multi-behavior** objective instead of collapsing all replay data into one source\n\nThis is the key difference between D-ARL and plain async RL.\n\n\u003Cimg src=\".\u002Fdocs\u002Ffigures\u002Fdarl_page-02.png\" alt=\"darl-two-stage-framework\" width=\"900\" div align=center>\n\n---\n\n## Highlights\n\n- **Stable and fast async RL** for language reasoning\n- **Replay buffer over multiple behavior policies**\n- **Distribution-aware sample selection**\n- **Multi-behavior policy optimization**\n- Implemented on top of **verl**\n- Ready-to-run scripts for **GRPO\u002FPPO-style** LLM post-training\n\n---\n\n## Repository Structure\n\n- [`shells\u002F`](.\u002Fshells): launch scripts for async, decoupled, and D-ARL experiments\n- [`recipe\u002F`](.\u002Frecipe): algorithm entrypoints and replay-related training recipes\n- [`docs\u002F`](.\u002Fdocs): extended documentation from the `verl` ecosystem\n- [`examples\u002F`](.\u002Fexamples): example usages inherited from `verl`\n- [`tests\u002F`](.\u002Ftests): tests\n\n---\n\n## Quick Start 💻\n\n### Step 1: Install\n\nWe recommend Python `>= 3.10` in a clean conda environment.\n\n```bash\nconda create -n darl python=3.10 -y\nconda activate darl\n\npip install -r requirements.txt\npip install -e .\n```\n\nAdditional dependency files are also provided:\n\n- [`requirements-cuda.txt`](.\u002Frequirements-cuda.txt)\n- [`requirements-npu.txt`](.\u002Frequirements-npu.txt)\n- [`requirements_sglang.txt`](.\u002Frequirements_sglang.txt)\n- [`requirements_transferqueue.txt`](.\u002Frequirements_transferqueue.txt)\n\n### Step 2: Run D-ARL\n\nFor a direct D-ARL-style run:\n\n```bash\nbash shells\u002Fone_step_off_grpo_1.7b_gsm8k_fsdp2_1_1_replay_4_pmis_sg.sh\n```\n\n### Step 3: 
Compare baselines\n\nUse the following scripts as a minimal comparison suite:\n\n```bash\nbash shells\u002Fone_step_off_grpo_1.7b_gsm8k_fsdp2_1_1_replay_4_async.sh\nbash shells\u002Fone_step_off_grpo_1.7b_gsm8k_fsdp2_1_1_replay_4_decoupled.sh\nbash shells\u002Fone_step_off_grpo_1.7b_gsm8k_fsdp2_1_1_replay_4_pmis_sg.sh\n```\n\nConceptually:\n\n- `*_async.sh`: naive asynchronous baseline\n- `*_decoupled.sh`: decoupled baseline\n- `*_pmis_sg.sh`: D-ARL configuration with replay-based selection\n\n### Step 4: Cluster launch\n\nIf you are using Slurm:\n\n```bash\nbash shells\u002Frun.sh\n```\n\nYou will likely need to adapt:\n\n- environment name\n- CUDA \u002F NCCL setup\n- model path\n- data path\n- GPU allocation\n\n---\n\n## Main D-ARL Knobs\n\nThe D-ARL logic is exposed directly in the shell configs, e.g.:\n\n```bash\n+replay.enable=True\n+replay.k=4\n+replay.pmis=True\n+replay.warmup=5\n```\n\nCommonly used async stability options include:\n\n```bash\nalgorithm.rollout_is=True\nalgorithm.rollout_is_threshold=2.0\n```\n\nThese options control:\n\n- whether replay is enabled,\n- how many recent behavior policies are retained,\n- whether distribution-aware replay selection is used,\n- and how aggressively off-policy mismatch is controlled.\n\n---\n\n## Benchmarks\n\nWe evaluate D-ARL on six public reasoning benchmarks:\n\n- **Mathematical reasoning**\n  - GSM8K\n  - LightEval\n  - MATH-500\n  - AIME24\n- **Code reasoning**\n  - LiveCodeBench\n  - HumanEval\n\nBackbone models in the paper include:\n\n- Qwen3-1.7B\n- Qwen3-4B\n\n---\n\n## Documentation\n\nFor more implementation details, see:\n\n- [`docs\u002Findex.rst`](.\u002Fdocs\u002Findex.rst)\n- [`docs\u002Fadvance\u002Fone_step_off.md`](.\u002Fdocs\u002Fadvance\u002Fone_step_off.md)\n- [`docs\u002Fadvance\u002Ffully_async.md`](.\u002Fdocs\u002Fadvance\u002Ffully_async.md)\n- [`docs\u002Fadvance\u002Frollout_is.md`](.\u002Fdocs\u002Fadvance\u002Frollout_is.md)\n\nThese are particularly useful for 
understanding:\n\n- one-step off-policy training,\n- fully asynchronous training,\n- rollout importance sampling,\n- and system-level async RL design.\n\n---\n\n## Citation\n\nIf you find D-ARL helpful for your research, please cite:\n\n```bibtex\n@inproceedings{darl2026,\n  title     = {D-ARL: A Distribution-Matched Asynchronous Reinforcement Learning Framework for Language Reasoning},\n  booktitle = {International Conference on Machine Learning (ICML)},\n  year      = {2026}\n}\n```\n\nYou can replace this with the full camera-ready BibTeX once the official proceedings metadata is available.\n\n---\n\n## Acknowledgement\n\nThis repository is built on top of [`verl`](https:\u002F\u002Fgithub.com\u002Fvolcengine\u002Fverl), an excellent open-source RL training framework for LLMs.\n\nD-ARL extends this foundation with a research focus on:\n\n- asynchronous policy staleness,\n- distribution-matched replay,\n- and robust optimization for language reasoning.\n","D-ARL is a distribution-matched asynchronous reinforcement learning framework for language reasoning. Using distribution-matched replay, variance-guided sample selection, and multi-behavior policy optimization as its core techniques, the project addresses the data distribution mismatch that stale behavior policies cause in conventional asynchronous reinforcement learning, improving training stability while preserving training speed. It is suited to the post-training stage of large language models (LLMs), particularly scenarios that require efficient use of compute resources and a stable training process. It is developed in Python, built on top of verl, and provides a set of easy-to-run experiment scripts.",2,"2026-06-11 04:01:37","CREATED_QUERY"]