[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-80963":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":15,"stars7d":16,"stars30d":17,"stars90d":15,"forks30d":15,"starsTrendScore":15,"compositeScore":18,"rankGlobal":9,"rankLanguage":9,"license":9,"archived":19,"fork":19,"defaultBranch":20,"hasWiki":21,"hasPages":19,"topics":22,"createdAt":9,"pushedAt":9,"updatedAt":23,"readmeContent":24,"aiSummary":25,"trendingCount":15,"starSnapshotCount":15,"syncStatus":16,"lastSyncTime":26,"discoverSource":27},80963,"DenoiseRL","ALEX-nlp\u002FDenoiseRL","ALEX-nlp","DenoiseRL: Bootstrapping Reasoning Models to Recover from Noisy Prefixes",null,"Python",37,4,31,1,0,2,6,2.1,false,"main",true,[],"2026-06-12 02:04:09","# DenoiseRL: Bootstrapping Reasoning Models to Recover from Noisy Prefixes\n\n**Caijun Xu, Changyi Xiao, Zhongyuan Peng, Yixin Cao**\n\nFudan University · Shanghai Innovation Institute\n\n\u003C!-- [![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-PDF-b31b1b)](paper\u002F_Arxiv__TEAI_DenoiseRL__Bootstrapping_Reasoning_Models_to_Recover_from_Noisy_Prefixes__Copy_.pdf)\n[![Code](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcode-official-blue)](#)\n[![License](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Flicense-Apache--2.0-green)](#) -->\n\n![DenoiseRL overview](.\u002Fimg\u002FDenoiseRL.svg)\n\n*Figure 1. DenoiseRL conditions the policy on a truncated incorrect prefix produced by a weak model and trains it, via verifiable-reward RL, to denoise the corrupted reasoning state and recover the correct solution path.*\n\n---\n\nThis repository contains the **official implementation** of *DenoiseRL: Bootstrapping Reasoning Models to Recover from Noisy Prefixes*. 
DenoiseRL is a recovery-oriented reinforcement learning framework that **replaces stronger-teacher supervision with structured perturbations derived from weak-model failures**. Rather than imitating a stronger model or curating harder data, the policy is conditioned on incorrect reasoning prefixes and explicitly optimized to revise mistakes and reach a verified answer.\n\n## Table of Contents\n\n- [DenoiseRL: Bootstrapping Reasoning Models to Recover from Noisy Prefixes](#denoiserl-bootstrapping-reasoning-models-to-recover-from-noisy-prefixes)\n  - [Table of Contents](#table-of-contents)\n  - [1. Motivation](#1-motivation)\n  - [2. Method](#2-method)\n  - [3. Key Results](#3-key-results)\n  - [4. Repository Layout](#4-repository-layout)\n  - [5. Installation](#5-installation)\n  - [6. Data Preparation](#6-data-preparation)\n  - [7. Training](#7-training)\n  - [8. Reproduction Guidance](#8-reproduction-guidance)\n\n---\n\n## 1. Motivation\n\nState-of-the-art reasoning RL pipelines (e.g., GRPO and DAPO) are typically constrained along two axes:\n\n1. **Supervisory ceiling.** Performance gains often hinge on access to a *stronger* teacher model, capping further progress when such teachers are unavailable.\n2. **Data engineering cost.** Capability scaling commonly relies on heavy hard-data curation, adversarial synthesis, or trajectory filtering.\n\nDenoiseRL departs from both directions. We **invert the role of weak models**: instead of treating them as imperfect supervisors, we exploit them as low-cost generators of structured corruptions. The policy is conditioned on truncated incorrect prefixes and trained — under standard verifiable rewards — to **denoise** the corrupted state and arrive at a verified solution. This casts reasoning RL as a denoising problem, drawing a conceptual parallel to denoising autoencoders and BART-style pretraining.\n\n## 2. 
Method\n\nEach training step samples, per problem, a mixture of two rollout types and updates the policy with a single GRPO-style group baseline shared across the mixture:\n\n- **Main rollouts** *(N per problem):* standard on-policy generations conditioned on the prompt.\n- **Denoise rollouts** *(K per problem):* generations conditioned on a truncated weak-model wrong prefix. Given a wrong solution `w`, we retain its first `p = max(1, ⌊rho · |w|⌋)` tokens as an assistant-side prefix; the policy continues from this corrupted state.\n\nThree design choices stabilize and amplify the recovery signal:\n\n- **Length-fair folding.** The visible response is `ỹ = [w₁:p, y_{p+1:p+L}]` with `p + L ≤ R`, preserving a comparable response budget against main rollouts.\n- **Continuation-only optimization.** PPO\u002FGRPO gradients flow primarily through the model-generated continuation; the heavily off-policy prefix is verifier-visible but excluded from the loss, avoiding the high-variance importance ratios documented in prior PPO-style off-policy literature.\n- **Shared group baseline.** Main and denoise trajectories of the same problem share a single advantage baseline, so denoise rollouts naturally provide negative or contrastive signal for problems that are otherwise saturated.\n\nThe joint objective can be written as a mixture\n`J(θ) = N\u002F(N+K) · J_main(θ) + K\u002F(N+K) · J_denoise(θ)`,\nwhich is interpretable as optimizing the policy under a mixture of *solving-from-scratch* and *recovering-from-corruption* distributions.\n\n## 3. Key Results\n\nReported in the paper across Qwen3-4B and Qwen3-8B policy backbones (training corpus: MATH-7.5K; weak model: Qwen2.5-1.5B-Instruct). For AMC23, AIME24, and AIME25 we report AVG@16; for MATH500 and BBEH we report AVG@1.\n\n**Qwen3-4B-Base**\n\n\n| Method             | MATH500  | AMC23    | AIME24   | AIME25   | BBEH     | Avg.     
|\n| ------------------ | -------- | -------- | -------- | -------- | -------- | -------- |\n| Base               | 70.0     | 43.1     | 8.3      | 7.7      | 4.1      | 26.6     |\n| GRPO               | 83.6     | 63.1     | 22.1     | 18.1     | 11.1     | 39.6     |\n| DAPO               | 83.8     | 62.5     | 20.6     | 21.5     | 10.4     | 39.8     |\n| **DenoiseRL-GRPO** | **85.8** | 61.4     | **24.8** | **23.3** | 14.8     | **42.0** |\n| DenoiseRL-DAPO     | 84.6     | **63.6** | 21.9     | 21.7     | **15.7** | 41.5     |\n\n\n**Qwen3-8B-Base**\n\n\n| Method             | MATH500  | AMC23    | AIME24   | AIME25   | BBEH     | Avg.     |\n| ------------------ | -------- | -------- | -------- | -------- | -------- | -------- |\n| Base               | 70.4     | 49.2     | 11.9     | 10.8     | 4.1      | 29.3     |\n| GRPO               | 87.8     | 69.7     | 24.0     | 22.9     | 10.6     | 43.0     |\n| DAPO               | 87.0     | 69.7     | 23.8     | 21.7     | 11.7     | 42.8     |\n| DenoiseRL-GRPO     | 87.2     | 70.3     | 24.6     | 23.1     | 11.5     | 43.3     |\n| **DenoiseRL-DAPO** | **88.2** | **71.4** | **27.0** | **24.8** | **12.6** | **44.8** |\n\n\nAdditional takeaways from the ablation studies:\n\n- **Recovery intensity matters.** Sweeping `K ∈ {1, 4, 8}` at `rho = 0.2` shows `K = 4` provides the best trade-off; over-emphasized recovery (`K = 8`) hurts the primary solving objective.\n- **Off-policy prefix updates are unstable.** Directly backpropagating through prefix tokens leads to validation collapse and runaway response length — consistent with prior observations on PPO sensitivity to heavily off-policy tokens.\n- **Length-fair folding helps.** Removing the `p + L ≤ R` cap weakens the 4B average by ~1.8 points (42.0 → 40.2).\n- **Throughput overhead is modest.** Per-step training time on Qwen3-4B-Base is 49.7s for DenoiseRL vs. 
43.8s for GRPO under matched rollout budgets.\n\nWe refer readers to the paper for full ablations, the case-study analyses of recovery behavior, and the length \u002F overthinking dynamics under varying `rho`.\n\n## 4. Repository Layout\n\n```\nDenoiseRL\u002F\n├── recipe\u002Fdenoise\u002F                    # DenoiseRL recipe (entrypoints, trainer, configs)\n│   ├── main_dapo.py                   # training entrypoint\n│   ├── dapo_ray_trainer.py            # Ray-based DAPO\u002FGRPO trainer with denoise rollouts\n│   ├── data_prepare.py                # weak-model wrong-prefix construction\n│   ├── config\u002F                        # Hydra training configs\n│   ├── denoise_qwen3-{1.7b,4b,8b}_v1.0.sh\n│   └── dapo_denoise_qwen3-{1.7b,4b,8b}_v1.0.sh\n├── verl\u002F                              # local fork of the verl RL framework (editable install)\n├── img\u002FDenoiseRL.png                  # overview figure\n├── paper\u002F                             # paper PDF\n└── requirements.txt\n```\n\n## 5. Installation\n\nDenoiseRL builds on a customized fork of `verl`. Two steps in particular are **mandatory**: installing the pinned dependencies and registering the local `verl` package in editable mode.\n\n```bash\n# (1) Create an isolated environment and install pinned dependencies.\npython -m venv .venv\nsource .venv\u002Fbin\u002Factivate\npip install -r requirements.txt\n```\n\n> **Note.** The `--no-deps` flag is intentional: dependency resolution is already pinned via `requirements.txt`, and re-resolving from `verl\u002Fsetup.py` can silently override critical versions (e.g., `vllm`, `transformers`, `flash-attn`). The editable install ensures that any local modification to the framework propagates at runtime without re-installation.\n\nHardware-sensitive components (`flash-attn`, `vllm`, `cupy-cuda12x`, `torch_npu`, etc.) should be installed against the CUDA\u002Fdriver stack of the target cluster.\n\n## 6. 
Data Preparation\n\n`recipe\u002Fdenoise\u002Fdata_prepare.py` constructs the per-problem pool `W(q)` of incorrect-but-well-formed weak-model rollouts. It runs the weak model with vLLM, scores each rollout against the ground truth, and augments the source parquet with a `wrong_answer_with_boxed` column storing the wrong rollouts that nevertheless emit a parseable `\\boxed{...}`.\n\n```bash\npython recipe\u002Fdenoise\u002Fdata_prepare.py \\\n  --model    \u002Fpath\u002Fto\u002Fweak-model \\\n  --dataset  \u002Fpath\u002Fto\u002Ftrain.parquet \\\n  --rollout-n 8 \\\n  --output-dir .\u002Fdata\n```\n\nThe resulting `*.with_wrong_boxed.parquet` is consumed directly by `TRAIN_FILE` in the training scripts. Problems with an empty wrong-rollout pool fall back to standard main rollouts, as described in the paper.\n\n## 7. Training\n\nEach model scale ships with two recipes: a GRPO-style backbone and a DAPO variant.\n\n```bash\n# GRPO backbone\nbash recipe\u002Fdenoise\u002Fdenoise_qwen3-1.7b_v1.0.sh\nbash recipe\u002Fdenoise\u002Fdenoise_qwen3-4b_v1.0.sh\nbash recipe\u002Fdenoise\u002Fdenoise_qwen3-8b_v1.0.sh\n\n# DAPO variant\nbash recipe\u002Fdenoise\u002Fdapo_denoise_qwen3-1.7b_v1.0.sh\nbash recipe\u002Fdenoise\u002Fdapo_denoise_qwen3-4b_v1.0.sh\nbash recipe\u002Fdenoise\u002Fdapo_denoise_qwen3-8b_v1.0.sh\n```\n\nThe DenoiseRL-specific knobs are exposed at the top of each script:\n\n\n| Knob                                      | Symbol | Description                                                                                |\n| ----------------------------------------- | ------ | ------------------------------------------------------------------------------------------ |\n| `n_resp_per_prompt`                       | `N`    | number of main on-policy rollouts per problem                                              |\n| `sub_rollout_k`                           | `K`    | number of denoise rollouts per problem                                                     
|\n| `part_response_ratio_strategy`            | —      | `fixed` \u002F `normal` \u002F `uniform` sampler for `rho`                                           |\n| `part_response_ratio_fixed`               | `rho`  | prefix ratio under the `fixed` strategy                                                    |\n| `part_response_ratio_{mean,std,low,high}` | —      | parameters for `normal` \u002F `uniform` strategies                                             |\n| `partial_mode`                            | —      | `cutdown` (mask prefix, length-fair), `shift` (gradient on prefix), `none` (no length cap) |\n| `use_problem_id_as_uid`                   | —      | share a single GRPO baseline across all `N + K` rollouts of one problem                    |\n\n\nCluster \u002F path settings — `MODEL_PATH`, `TRAIN_FILE`, `TEST_FILE`, `num_gpus`, `tensor_model_parallel_size` — are likewise configured at the top of each script.\n\n## 8. Reproduction Guidance\n\nTo reproduce the headline numbers reported in the paper, we recommend:\n\n- **Rollout composition:** `N = 12, K = 4` per problem.\n- **Prefix intensity:** `part_response_ratio_strategy=fixed` with `part_response_ratio_fixed=0.2`.\n- **Folding policy:** `partial_mode=cutdown` (length-fair; prefix masked from PPO loss).\n- **Response budget:** `max_response_length` consistent across main and denoise rollouts.\n- **Optimization:** continuation-only gradient flow; do not enable gradients on the off-policy prefix.\n- **Group baseline:** `use_problem_id_as_uid=True` to share advantages across the full `N + K` group.\n\nDeviating from any of the above (in particular enabling gradient on the prefix or removing the length-fair cap) is documented in the paper as a source of instability.\n\n\u003C!-- ## 9. 
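Citation\n\nIf you find this work useful, please consider citing:\n\n```bibtex\n@article{xu2026denoiserl,\n  title   = {DenoiseRL: Bootstrapping Reasoning Models to Recover from Noisy Prefixes},\n  author  = {Xu, Caijun and Xiao, Changyi and Peng, Zhongyuan and Cao, Yixin},\n  journal = {arXiv preprint},\n  year    = {2026}\n}\n```\n\nFor questions or collaboration, please contact the corresponding authors as listed in the paper. -->","DenoiseRL is a reinforcement learning framework that aims to recover the correct solution path by correcting erroneous reasoning prefixes produced by a weak model. Its core idea is to replace supervision from a strong teacher model with structured perturbations derived from weak-model failures, so that the policy is trained conditioned on incorrect reasoning prefixes and optimized under verifiable rewards to correct mistakes and reach the correct answer. Technically, DenoiseRL borrows the concept behind denoising autoencoders: errors in the reasoning process are treated as noise, and reinforcement learning is used to remove that noise. The project suits scenarios that call for more robust reasoning models, especially when high-quality labeled data or a strong teacher model is unavailable, where this method can effectively improve model performance.","2026-06-11 04:03:00","CREATED_QUERY"]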