[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-1947":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":8,"htmlUrl":8,"language":9,"languages":8,"totalLinesOfCode":8,"stars":10,"forks":11,"watchers":12,"openIssues":13,"contributorsCount":14,"subscribersCount":14,"size":14,"stars1d":14,"stars7d":14,"stars30d":13,"stars90d":14,"forks30d":14,"starsTrendScore":14,"compositeScore":15,"rankGlobal":8,"rankLanguage":8,"license":16,"archived":17,"fork":17,"defaultBranch":18,"hasWiki":19,"hasPages":17,"topics":20,"createdAt":8,"pushedAt":8,"updatedAt":21,"readmeContent":22,"aiSummary":23,"trendingCount":14,"starSnapshotCount":14,"syncStatus":13,"lastSyncTime":24,"discoverSource":25},1947,"MEDS","Linxi000\u002FMEDS","Linxi000",null,"Python",144,1,3,2,0,38.1,"Apache License 2.0",false,"main",true,[],"2026-06-12 04:00:12","\u003Cimg src=\"https:\u002F\u002Fcapsule-render.vercel.app\u002Fapi?type=waving&height=265&text=The%20Past%20Is%20Not%20Past&desc=Memory-Enhanced%20Dynamic%20Reward%20Shaping&fontAlign=50&fontAlignY=32&fontSize=47&descAlign=50&descAlignY=58&descSize=36&color=0:AAB7CB,33:739CCF,66:76AEA4,100:C0BC90&fontColor=FFFFFF\" \u002F>\n\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Freadme-typing-svg.demolab.com?font=Fira+Code&weight=600&size=15&duration=1&pause=9999999&color=3F335F&center=true&vCenter=true&repeat=false&width=620&height=26&lines=%E2%AD%90+If+you+like+this+project%2C+give+it+a+star%21+%E2%AD%90\" alt=\"Give it a star\" \u002F>\n\n  \u003Cbr>\n\n  \u003Cimg src=\"https:\u002F\u002Freadme-typing-svg.demolab.com?font=JetBrains+Mono&size=26&duration=3000&pause=1000&color=8A7FB8&center=true&vCenter=true&width=900&height=120&lines=Powered+by+OpenMOSS;From+Fudan+University+%26+Shanghai+Innovation+Institution\" alt=\"Typing SVG\" \u002F>\n  \u003Cbr>\n  \n  \u003Ca href=\"http:\u002F\u002Farxiv.org\u002Fabs\u002F2604.11297\">\n    \u003Cimg 
src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPaper-arXiv%3A2604.11297-b31b1b\" alt=\"Paper\">\n  \u003C\u002Fa>\n\n  \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fpapers\u002F2604.11297\">\n  \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPaper-%F0%9F%A4%97%20Hugging%20Face-FFD21E\" alt=\"Hugging Face Paper\">\n  \u003C\u002Fa>\n\u003C\u002Fp>\n\n\n## Overview\n\u003Cdiv align=\"center\">\n  \u003Cimg src=\".\u002Fassets\u002Fmain.jpg\" alt=\"Main Figure\" width=\"900\" \u002F>\n\u003C\u002Fdiv>\n\n\u003Cb>MEDS\u003C\u002Fb>💊 (\u003Cb>M\u003C\u002Fb>emory-\u003Cb>E\u003C\u002Fb>nhanced \u003Cb>D\u003C\u002Fb>ynamic \u003Cb>R\u003C\u002Fb>eward \u003Cb>S\u003C\u002Fb>haping) is a memory-enhanced RL training recipe for LLMs built on top of [veRL](https:\u002F\u002Fgithub.com\u002Fvolcengine\u002Fverl). Unlike standard memoryless reward designs, MEDS incorporates historical error signals into reward shaping, allowing the training process to recognize and discourage repeated mistakes.\n\nTo achieve this, MEDS reuses layer-wise logits from the forward pass as lightweight representations of reasoning behavior, clusters similar error patterns, and applies stronger penalties to repeated failures. 
This encourages broader exploration and leads to better reasoning performance and greater sampling diversity.\n\n## Contents\n\n- [Overview](#overview)\n- [Getting Started](#getting-started)\n  - [Environment Setup](#environment-setup)\n  - [Training with MEDS](#training-with-meds)\n- [Configuration](#configuration)\n- [Evaluation](#evaluation)\n- [Data Preparation](#data-preparation)\n- [Citation](#citation)\n- [License](#license)\n\n## Getting Started\n\n### Environment Setup\n\n```bash\n# Clone the MEDS repository\ngit clone https:\u002F\u002Fgithub.com\u002FLinxi000\u002FMEDS.git\ncd MEDS\n\n# Create a new conda environment\nconda create -n meds python=3.10\nconda activate meds\n\n# Install Python dependencies\npip install -r requirements.txt\n```\n\nMEDS is built on top of veRL. Please follow the [veRL installation guide](https:\u002F\u002Fgithub.com\u002Fvolcengine\u002Fverl) to set up the framework, then place the MEDS directory into your verl repo root to get started.\n\nIn addition to the standard veRL installation, MEDS depends on updated code components in the following veRL files:\n\n- `verl\u002Fworkers\u002Ffsdp_workers.py`: includes MEDS-related worker-side logic used during rollout\u002Ftraining.\n- `verl\u002Fworkers\u002Factor\u002Fdp_actor.py`: includes MEDS-related actor-side logic used for reward shaping behavior.\n\nBefore training, make sure these two files are synced with the MEDS-integrated version in this repository.\n\n### Training with MEDS\n\nSet the required paths and launch training:\n\n```bash\nexport MODEL_PATH=\"${HOME}\u002Fmodels\u002FQwen2.5-Math-7B\"  # Base model path\nexport TRAIN_FILE=\"${HOME}\u002Fdata\u002Funified_math.parquet\"\nexport TEST_FILE=\"${HOME}\u002Fdata\u002Faime-2024.parquet\"\nexport CKPTS_DIR=\"${HOME}\u002Fckpts\u002FMEDS\u002Fmeds_7b\"\n\nbash recipe\u002Fmeds\u002Frun_meds.sh\n```\n\nThis runs MEDS training with the default configuration (Qwen2.5-Math-7B, 8 GPUs per node, 100 epochs).\n\n## 
Configuration\n\nKey hyperparameters in `recipe\u002Fmeds\u002Frun_meds.sh`:\n\n\n| Parameter                | Default   | Description                                                      |\n| ------------------------ | --------- | ---------------------------------------------------------------- |\n| `cluster_method`         | `hdbscan` | Clustering algorithm for error pattern grouping                  |\n| `use_layer_diff`         | `False`   | Whether to use layer-difference features for clustering          |\n| `use_last_n_layers`      | `14`      | Number of last transformer layers used for clustering            |\n| `cluster_penalty_target` | `wrong`   | Which responses to penalize: `wrong` \u002F `right` \u002F `both` \u002F `none` |\n| `penalty_coef`           | `0.1`     | Strength of the diversity penalty                                |\n\n\nFine-grained Hydra config options are in `recipe\u002Fmeds\u002Fconfig\u002Fmeds_trainer.yaml`.\n\n## Evaluation\n\nOur evaluation pipeline fully follows the official open-source implementation from [LIMO](https:\u002F\u002Fgithub.com\u002FGAIR-NLP\u002FLIMO), particularly its evaluation module.\n\nFirst, install the required dependencies under the `LIMO\u002Feval` directory:\n\n```bash\npip install -r requirements.txt\n```\n\nTo launch the full evaluation pipeline, run:\n\n```bash\nbash eval.sh\n```\n\nTo evaluate a specific checkpoint, update the `--model_name_or_path` argument in `eval.sh` to point to your target checkpoint directory.\n\nTo change the number of GPUs or specify particular devices, modify `CUDA_VISIBLE_DEVICES` in the script. 
For example:\n\n```bash\nCUDA_VISIBLE_DEVICES=0,1,2,3\n```\n\nTo evaluate pass@k, adjust the following parameters:\n\n- `n_sampling`: total number of samples generated per problem\n- `k`: the k value used for pass@k computation\n\nThe maximum generation length used in our experiments is `max_tokens = 8192`.\n\n## Data Preparation\n\nThe training set is a unified math dataset combining [DAPO-Math-17K](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FBytedTsinghua-SIA\u002FDAPO-Math-17K) and difficulty levels 3–5 of [MATH-lighteval](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FDigitalLearningGmbH\u002FMATH-lighteval), with deduplication applied. The validation set is [AIME 2024](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FHuggingFaceH4\u002Faime_2024).\n\n## Citation\n\nIf you find our work helpful, please consider citing:\n\n```bibtex\n@article{liu2026meds,\n  title={The Past Is Not Past: Memory-Enhanced Dynamic Reward Shaping},\n  author={Liu, Yang and Wang, Enxi and Gao, Yufei and Zhang, Weixin and Wang, Bo and Zeng, Zhiyuan and Zhang, Yikai and Zheng, Yining and Qiu, Xipeng},\n  journal={arXiv preprint arXiv:2604.11297},\n  year={2026}\n}\n\n```\n\n## License\n\nThis project is licensed under the [Apache-2.0 License](LICENSE).\n\n## Acknowledgments\n\nWe gratefully acknowledge the open-source projects that made this work possible. This project is built on top of the [veRL](https:\u002F\u002Fgithub.com\u002Fvolcengine\u002Fverl) framework, uses [HDBSCAN](https:\u002F\u002Fgithub.com\u002Fscikit-learn-contrib\u002Fhdbscan) for clustering, and is partly inspired by [DAPO](https:\u002F\u002Fgithub.com\u002FBytedTsinghua-SIA\u002FDAPO). Our models are trained based on [Qwen2.5](https:\u002F\u002Fqwenlm.github.io\u002Fblog\u002Fqwen2.5\u002F) and [Qwen3](https:\u002F\u002Fqwenlm.github.io\u002Fblog\u002Fqwen3\u002F). 
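We sincerely thank the contributors and maintainers of these projects for their valuable contributions to the open-source community.","MEDS is a memory-enhanced dynamic reward shaping recipe built on the veRL framework, designed for reinforcement learning training of large language models. Its core function is to improve the reward mechanism by introducing historical error signals: it uses layer-wise logits from the forward pass as lightweight representations of reasoning behavior, clusters similar error patterns, and applies stronger penalties to repeated failures, thereby encouraging broader exploration and improving reasoning performance and sampling diversity. The project suits scenarios that require better language model performance on complex tasks, such as dialogue systems and text generation. It is written in Python and released under the Apache License 2.0.","2026-06-11 02:46:59","CREATED_QUERY"]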