[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-80732":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":25,"hasPages":23,"topics":26,"createdAt":10,"pushedAt":10,"updatedAt":27,"readmeContent":28,"aiSummary":29,"trendingCount":16,"starSnapshotCount":16,"syncStatus":14,"lastSyncTime":30,"discoverSource":31},80732,"DARE","EtaYang10th\u002FDARE","EtaYang10th","DARE: Difficulty-Adaptive Reinforcement Learning with Co-Evolved Difficulty Estimation","https:\u002F\u002Fetayang10th.github.io\u002Fdare.github.io\u002F",null,"Python",79,10,2,1,0,14,17,36,42,3.12,"MIT License",false,"main",true,[],"2026-06-12 02:04:06","# DARE: Difficulty-Adaptive Reinforcement Learning with Co-Evolved Difficulty Estimation\n\n\u003Cp align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.09188\">\n    \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2605.09188-b31b1b.svg\" alt=\"arXiv:2605.09188\"\u002F>\n  \u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fpdf\u002F2605.09188\">\n    \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPaper-PDF-1a56db.svg\" alt=\"Paper PDF\"\u002F>\n  \u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FEtaYang10th\u002FDARE\">\n    \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FCode-GitHub-24292e?logo=github&logoColor=white\" alt=\"DARE code\"\u002F>\n  \u003C\u002Fa>\n  \u003Ca href=\"LICENSE\">\n    \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-MIT-green.svg\" alt=\"License: MIT\"\u002F>\n  
\u003C\u002Fa>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  \u003Cb>Project Page:\u003C\u002Fb> \u003Ca href=\"https:\u002F\u002Fetayang10th.github.io\u002Fdare.github.io\u002F\">etayang10th.github.io\u002Fdare.github.io\u003C\u002Fa>\n  \u003Cbr\u002F>\n  \u003Cb>Paper:\u003C\u002Fb> \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.09188\">arXiv:2605.09188\u003C\u002Fa> · \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fpdf\u002F2605.09188\">PDF\u003C\u002Fa>\n  \u003Cbr\u002F>\n  \u003Cb>Code:\u003C\u002Fb> \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FEtaYang10th\u002FDARE\">github.com\u002FEtaYang10th\u002FDARE\u003C\u002Fa>\n\u003C\u002Fp>\n\nImplementation of **DARE**, a difficulty-adaptive RL framework for LLM reasoning that couples policy-aligned difficulty estimation with difficulty-specific training strategies. DARE improves **training efficiency**, **final accuracy**, and **inference-time token usage** over existing difficulty-aware RL methods.\n\n![Overview](figures\u002Foverview.jpg)\n\n---\n\n## Table of Contents\n[Empirical Highlights](#empirical-highlights) | [Why DARE?](#why-dare) | [What DARE Does](#what-dare-does) | [Installation](#installation) | [Stage 1 — Cold-Start Difficulty Estimator](#stage-1--cold-start-difficulty-estimator-optional-but-recommended) | [Stage 2 — DARE RL Training](#stage-2--dare-rl-training) | [Repository Layout](#repository-layout) | [Acknowledgements](#acknowledgements)\n\n---\n\n## Empirical Highlights\n\nAcross three model scales (Qwen2.5-Math-1.5B, SmolLM3-3B-Base, Qwen2.5-Math-7B) and five math-reasoning benchmarks (MATH-500, GSM8K, AIME-AMC, MinervaMath, OlympiadBench), plus a code-generation transfer setting (HumanEval \u002F MBPP \u002F LiveCodeBench), DARE consistently:\n\n- converges faster than GRPO, DOTS, EDCO, MoPPS, LLM-Judge, and Previous-FR baselines,\n- produces **shorter** outputs on easy prompts and **higher** accuracy on hard prompts,\n- improves final accuracy beyond filtration-only 
methods.\n\n![Results](figures\u002Fperformance.jpg)\n\nSee the paper for full tables, ablations (reward-shaping, clipping, Beta concentration `κ`, SNIS clip `c`), and per-difficulty-level token\u002Faccuracy breakdowns.\n\n---\n\n\n## Why DARE?\n\nRL fine-tuning for LLM reasoning is expensive, and many rollouts produce weak learning signals. Prior *difficulty-aware data selection* methods (e.g., embedding-based, entropy-based, Bayesian bandit, and LLM-as-judge estimators) try to focus on medium-difficulty prompts, but an audit of these methods reveals three gaps:\n\n1. **Inaccurate difficulty under policy drift.** Static or slowly-adapting estimators drift away from the current policy as training proceeds, so the \"medium\" prompts they pick are often trivially easy or intractably hard for the live policy.\n2. **Limited final-performance gains from selection alone.** Filtration-only methods primarily shift *which* prompts are trained on; with enough budget they converge to roughly the same final accuracy as plain GRPO, leaving hard tasks unsolved.\n3. **No change in inference efficiency.** Models trained with difficulty filtering still emit uniformly long CoT responses across difficulty levels.\n\nDARE addresses all three at once.\n\n## What DARE Does\n\nDARE is organized around three components that run each epoch (see the pseudo-code in the paper appendix):\n\n1. **Co-Evolved Difficulty Estimation (SNIS + FIFO buffer).**\n   A prompt-wise FIFO replay buffer stores `(response, reward, behavior log-prob)` tuples. For each prompt, DARE estimates the *current-policy* failure rate via self-normalized importance sampling over the buffer, with a clipped log-ratio for stability. Unseen prompts get an embedding-based cold-start difficulty from a small reference set. The resulting estimate `d_q` co-evolves with the policy without re-rolling every prompt each step.\n2. 
**Symmetric-Beta Dynamic Data Selection.**\n   Prompts are sampled without replacement according to `p(q) ∝ Beta(d_q; α, α)` with `α = 1 + κ\u002F2`. This keeps high probability mass on medium prompts (where GRPO's group-relative advantage is largest) while preserving a nonzero tail on easy and hard prompts, avoiding both easy-skill forgetting and hard-prompt starvation — a strict generalization over hard-threshold filtering.\n3. **Difficulty-Adaptive Policy Optimization.**\n   With thresholds `(d_easy, d_hard)`, prompts are partitioned into three tiers and trained differently:\n   - **Easy** (`d_q \u003C d_easy`): fewer rollouts `G_easy \u003C G`, a length-weighted penalty on *correct* rollouts, and a relaxed upper clip `ε⁺ > ε` to learn concise correct solutions.\n   - **Medium** (`d_easy ≤ d_q ≤ d_hard`): standard GRPO with `G` rollouts and symmetric clip `ε`.\n   - **Hard** (`d_q > d_hard`): more rollouts `G_hard > G`, with a fraction drawn as **hint-augmented rollouts** (successful historical trajectories retrieved from the replay buffer), plus a bounded length bonus on *incorrect* rollouts to prevent early give-up. Correctness stays the dominant signal because `λ_hard \u003C 1`.\n\nEach batch mixes a fraction `σ` of fresh on-policy rollouts with `1 − σ` replay-buffer trajectories, trained under a difficulty-conditioned clipped surrogate objective.\n\n\n## Installation\n\nTested with **Python 3.10 \u002F 3.11** and **CUDA 12.1**. 
Pick one of the two setups.\n\n### Option A — Conda (recommended for local boxes)\n\n```bash\nconda create -n dare python=3.10 -y\nconda activate dare\n\ncd rl_training\npip install -e .\u002Fverl\npip install -r ..\u002Frequirements.txt\n```\n\n### Option B — One-shot bootstrap\n\n`environment.sh` creates a fresh conda env on local SSD, installs a matching `torch` + prebuilt `flash-attn` wheel, then installs the local `verl`:\n\n```bash\nbash environment.sh\n```\n\n---\n\n## Stage 1 — Cold-Start Difficulty Estimator (Optional but Recommended)\n\nDARE uses an embedding-based teacher only for **prompts without buffered rollouts**. If you already ship a cached teacher bank (see `adaptive_difficulty_prediction\u002Fall_merged_teacher_bank\u002F` and `adaptive_difficulty_prediction\u002Foutputs\u002F…\u002Fmodel_final.pt`), you can skip straight to Stage 2.\n\nTo retrain the cold-start teacher on your own data:\n\n1. Prepare training data. In `adaptive_difficulty_prediction\u002Fload_data.py`, replace the training and reference pickles; see formats under `adaptive_difficulty_prediction\u002Fdatasets\u002F`.\n2. Run embedding extraction and teacher training:\n\n   ```bash\n   cd adaptive_difficulty_prediction\n   bash run_bash\u002Frun_embed.sh\n   bash run_bash\u002Frun_train.sh\n   ```\n\nThe resulting checkpoint is consumed by the RL trainer via `TEACHER_MODEL_CHECKPOINT_PATH` and `teacher_model.embedding_path` (set in the scripts under `rl_training\u002Frun_bash\u002F`).\n\n---\n\n## Stage 2 — DARE RL Training\n\nAll training entry points live in `rl_training\u002Frun_bash\u002F`. They share `common_ds_teacher_replay_config.sh`, which exposes the key DARE knobs:\n\n| Group | Variable | Meaning |\n|---|---|---|\n| Selection | `SELECTION_METHOD` | `is` (SNIS), `bayesian` (MoPPS-style), `teacher` (DOTS), `LLM_predict`, or `\"\"` (uniform). |\n| Selection | `SAMPLE_DIST_TYPE` + `BETA_PEAK` \u002F `BETA_KAPPA` | Enable `beta` to match the paper's symmetric-Beta sampler. 
|\n| SNIS | `IS_CLIP_RANGE`, `IS_ESS_THRESHOLD`, `IS_TOKEN_CLIP_EPSILON` | Clipping `c`, ESS fallback, and token-level IS clip. |\n| Rollouts | `NUM_GENERATIONS`, `MAX_NUM_GENERATIONS` | Base `G` and max group size for tiered reallocation. |\n| Tiered shaping | `EASY_THRESHOLD`, `EASY_LENGTH_PENALTY_COEFF` | `d_easy` and `λ_easy`. Set coeff to `0` to disable. |\n| Tiered shaping | `HARD_LENGTH_THRESHOLD`, `HARD_LENGTH_BONUS_COEFF` | `d_hard` and `λ_hard`. |\n| Hard memory | `HARD_MEMORY_*` | Control hint-augmented rollouts (retrieval from replay buffer). |\n| Replay | `SIGMA`, `BUFFER_SIZE`, `REPLAY_STRATEGY` | Fresh\u002Freplay mix and buffer capacity. |\n| Clipping | `POLICY_CLIP_RATIO`, `POLICY_CLIP_LOWER_BOUND`, `POLICY_CLIP_UPPER_BOUND` | Base `ε`, with optional relaxed upper clip for easy prompts. |\n\nRecommended defaults matching the paper: `IS_CLIP_RANGE=4.0`, `BETA_KAPPA=100`, `(EASY_THRESHOLD, HARD_LENGTH_THRESHOLD) = (0.3, 0.8)` (i.e. `d_easy=0.3`, `d_hard=0.8` measured as failure rate), `(G_easy, G, G_hard) = (4, 8, 16)`, `λ_easy = λ_hard = 1e-4`.\n\n### Run\n\n```bash\ncd rl_training\n\n# Small model, single node: Qwen2.5-Math-1.5B on DeepScaleR\nbash run_bash\u002F1_ours_small_model.sh\n# or the 4-GPU variant\nbash run_bash\u002F2_ours_small_model.sh\n\n# IS-only selection baseline\nbash run_bash\u002FIS_big_model.sh\n\n# Large model: Qwen2.5-3B on 8 GPUs with teacher + replay\nbash run_bash\u002F12_final_ds_teacher_replay.sh\n```\n\nEach script prints the resolved paths (`TEACHER_ROOT`, `OUTPUT_BASE`, Ray port, CUDA devices, wandb name) before launching. Output checkpoints and eval CSVs land under `rl_training\u002Foutput\u002F\u003Cmodel>\u002F\u003Crun>\u002F`.\n\n### Evaluation\n\n`EVALUATE_DATASET` selects which benchmarks run each epoch; defaults to `math500,aime2024,aime2025,gsm8k,aimo_amc`. 
Per-epoch accuracies are written to `plots\u002Feval_results.csv` inside the run directory.\n\n---\n\n## Repository Layout\n\n```\n.\n├── adaptive_difficulty_prediction\u002F   # Stage 1: embedding teacher + cold-start estimator\n│   ├── load_data.py, train.py, save_embedding.py, model.py\n│   ├── datasets\u002F                     # example data formats\n│   └── run_bash\u002F{run_embed.sh, run_train.sh, run_train_coding.sh}\n├── rl_training\u002F                      # Stage 2: DARE RL loop\n│   ├── verl\u002F                         # local verl fork (editable install)\n│   └── run_bash\u002F*.sh                 # entry points (see table above)\n├── environment.sh                    # one-shot bootstrap\n└── requirements.txt\n```\n\nThe core DARE logic lives in `rl_training\u002Fverl\u002Fverl\u002Ftrainer\u002Fppo\u002F` — notably `is_data_selector.py` (SNIS + Beta sampler + Bayesian variant) and `ray_trainer.py` (tiered rollout allocation, reward shaping, replay mix).\n\n---\n\n## Citation\n\nIf DARE helps your research, please cite:\n\n```bibtex\n@misc{zhou2026dare,\n  title         = {DARE: Difficulty-Adaptive Reinforcement Learning with Co-Evolved Difficulty Estimation},\n  author        = {Zhou, Yang and Jin, Can and Dong, Zihan and Wang, Zhepeng and\n                   Yang, Yanting and Zhao, Shiyu and Li, Lei and Bao, Runxue and\n                   Xie, Yaochen and Metaxas, Dimitris N.},\n  year          = {2026},\n  eprint        = {2605.09188},\n  archivePrefix = {arXiv},\n  primaryClass  = {cs.LG},\n  url           = {https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.09188}\n}\n```\n\n---\n\n## Acknowledgements\n\nParts of this code build on [`verl`](https:\u002F\u002Fgithub.com\u002Fvolcengine\u002Fverl) and [`rllm`](https:\u002F\u002Fgithub.com\u002Fagentica-project\u002Frllm). 
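We thank the authors for releasing their implementations.\n","DARE is a difficulty-adaptive reinforcement learning framework for large language model reasoning that combines policy-aligned difficulty estimation with difficulty-specific training strategies. Its core function is to optimize training by dynamically adjusting the difficulty of the training data, which improves training efficiency, final accuracy, and inference-time token efficiency. DARE is implemented in Python and performs well on multiple math-reasoning benchmarks, converging faster and producing better results than other difficulty-aware RL methods. It suits applications that need to train large language models efficiently for complex reasoning tasks or code generation.","2026-06-11 04:01:48","CREATED_QUERY"]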