[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-76195":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":16,"stars7d":16,"stars30d":17,"stars90d":16,"forks30d":16,"starsTrendScore":16,"compositeScore":18,"rankGlobal":10,"rankLanguage":10,"license":19,"archived":20,"fork":20,"defaultBranch":21,"hasWiki":20,"hasPages":22,"topics":23,"createdAt":10,"pushedAt":10,"updatedAt":24,"readmeContent":25,"aiSummary":26,"trendingCount":16,"starSnapshotCount":16,"syncStatus":27,"lastSyncTime":28,"discoverSource":29},76195,"realm-retrieve","bettyguo\u002Frealm-retrieve","bettyguo","When to retrieve during reasoning. Adaptive, step-level RAG for large reasoning models.","",null,"Python",117,13,9,10,0,40,3.44,"Other",false,"main",true,[],"2026-06-12 02:03:40","\u003Cdiv align=\"center\">\n\n\u003Cimg src=\"assets\u002Fbanner.svg\" alt=\"ReaLM-Retrieve\" width=\"820\"\u002F>\n\n# ReaLM-Retrieve\n\n### When to Retrieve **During** Reasoning — Adaptive RAG for Large Reasoning Models\n\n\u003Cp>\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fbettyguo\u002Frealm-retrieve\u002Factions\u002Fworkflows\u002Fci.yml\">\u003Cimg alt=\"CI\" src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Factions\u002Fworkflow\u002Fstatus\u002Fbettyguo\u002Frealm-retrieve\u002Fci.yml?branch=main&label=CI&logo=github&style=flat-square\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fbettyguo\u002Frealm-retrieve\u002Fblob\u002Fmain\u002FLICENSE\">\u003Cimg alt=\"License\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Flicense-Apache--2.0-blue.svg?style=flat-square\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fwww.python.org\u002Fdownloads\u002Frelease\u002Fpython-3110\u002F\">\u003Cimg alt=\"Python\" 
src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpython-3.10%20%7C%203.11-blue.svg?style=flat-square&logo=python&logoColor=white\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fpytorch.org\u002F\">\u003Cimg alt=\"PyTorch\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPyTorch-2.2%2B-EE4C2C.svg?style=flat-square&logo=pytorch&logoColor=white\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fdoi.org\u002F10.1145\u002F3805712.3809722\">\u003Cimg alt=\"Paper\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FSIGIR-2026-b31b1b.svg?style=flat-square\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fbettyguo\u002Frealm-retrieve\u002Fblob\u002Fmain\u002Fnotebooks\u002F01_quickstart.ipynb\">\u003Cimg alt=\"Open In Colab\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FColab-Quickstart-F9AB00?style=flat-square&logo=googlecolab&logoColor=white\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fbettyguo\u002Frealm-retrieve\u002Fstargazers\">\u003Cimg alt=\"Stars\" src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fbettyguo\u002Frealm-retrieve?style=flat-square&color=ffd54f&label=%E2%98%85\">\u003C\u002Fa>\n\u003C\u002Fp>\n\n\u003Cp>\n  \u003Ca href=\"#-quickstart\">⚡ Quickstart\u003C\u002Fa> ·\n  \u003Ca href=\"#-results\">📊 Results\u003C\u002Fa> ·\n  \u003Ca href=\"#-how-it-works\">🧠 How it works\u003C\u002Fa> ·\n  \u003Ca href=\"#-installation\">📦 Install\u003C\u002Fa> ·\n  \u003Ca href=\"docs\u002F\">📖 Docs\u003C\u002Fa> ·\n  \u003Ca href=\"#-citation\">📝 Cite\u003C\u002Fa>\n\u003C\u002Fp>\n\n\u003C\u002Fdiv>\n\n---\n\n> **TL;DR** — Large reasoning models (DeepSeek-R1, o1, QwQ) think for thousands of tokens before answering. Classic RAG retrieves *once*, up front. ReaLM-Retrieve learns **where inside the chain of thought** retrieval actually helps, and skips the rest. 
The result: **+5.8 F1** on MuSiQue, **47 % fewer retrieval calls**, and a **2.1×** better accuracy-per-call ratio than IRCoT.\n\n\u003Ctable>\n\u003Ctr>\n\u003Ctd width=\"50%\">\n\n#### 🎯 What it solves\nLong-form reasoning generates **knowledge gaps mid-stream**. Retrieving once is too early; retrieving every sentence is wasteful. ReaLM-Retrieve detects the gaps *as they appear* and intervenes only when external evidence is likely to flip the next step.\n\n\u003C\u002Ftd>\n\u003Ctd width=\"50%\">\n\n#### 🧪 Why it works\nA learned policy combines **three uncertainty signals** — verbalised self-assessment, entity-coverage entropy, and consistency across sampled continuations — into a single per-step score, then is fine-tuned with REINFORCE against an F1 \u002F cost reward.\n\n\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003C\u002Ftable>\n\n---\n\n## ⚡ Quickstart\n\nA **2-minute, CPU-friendly toy run** so you can see the pipeline end-to-end without renting an A100:\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fbettyguo\u002Frealm-retrieve.git\ncd realm-retrieve\npip install -e \".[dev]\"\nmake quickstart            # runs examples\u002Fquickstart.py on a toy 5-question corpus\n```\n\nOutput (truncated):\n\n```\n[1\u002F5] Question: Which country hosted the 2008 Summer Olympics?\n      RSUS=0.18  →  policy = SKIP  (no retrieval)\n      Answer: China                            ✓\n[2\u002F5] Question: What's the capital of the country that...\n      RSUS=0.72  →  policy = RETRIEVE  (top-5 docs, 4.1 ms)\n      Answer: Stockholm                        ✓\n─────────────────────────────────────────────────\n  EM 80.0  |  F1 86.7  |  retrievals\u002Fq 1.6  |  F1\u002Fcall 54.2\n```\n\nPrefer notebooks?  → [**Open in Colab**](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fbettyguo\u002Frealm-retrieve\u002Fblob\u002Fmain\u002Fnotebooks\u002F01_quickstart.ipynb)\nPrefer Docker?     
→ `docker run --rm -it ghcr.io\u002Fbettyguo\u002Frealm-retrieve:latest make quickstart`\n\n---\n\n## 📊 Results\n\n### Main benchmark — MuSiQue (multi-hop QA, DeepSeek-R1-32B)\n\n| Method               | EM       | F1       | Retrievals\u002Fq | F1 \u002F call |\n|----------------------|---------:|---------:|-------------:|----------:|\n| No retrieval         | 41.2     | 48.7     | 0.0          |     —     |\n| Single RAG           | 52.6     | 59.4     | 1.0          | 59.4      |\n| IRCoT                | 58.3     | 65.4     | 3.4          | 19.2      |\n| FLARE                | 55.1     | 62.3     | 2.8          | 22.3      |\n| Self-RAG †           | 54.8     | 61.9     | 2.1          | 29.5      |\n| Search-R1            | 59.1     | 66.8     | 2.4          | 27.8      |\n| **ReaLM-Retrieve**   | **63.5** | **71.2** | **1.8**      | **39.6**  |\n\n\u003Csub>† Llama-2-13B base. All gains over IRCoT\u002FSearch-R1 significant at *p \u003C 0.01* (paired bootstrap, 10K iter, Bonferroni-corrected).\u003C\u002Fsub>\n\n### Generalisation across benchmarks\n\n| Dataset      | F1 (Δ vs. IRCoT) | Retrievals\u002Fq (Δ vs. IRCoT) |\n|--------------|-----------------:|---------------------------:|\n| MuSiQue      |  **71.2 (+5.8)** |  **1.8 (−47 %)**           |\n| HotpotQA     |  **78.4 (+3.1)** |  **1.4 (−51 %)**           |\n| 2WikiMHQA    |  **74.9 (+4.7)** |  **1.6 (−43 %)**           |\n\nReproduce: `make eval DATASET=hotpotqa`. 
Full numbers + ablations are in [§5 of the paper](paper\u002F).\n\n---\n\n## 🧠 How it works\n\n```\n                ┌─────────────────────────────────────────┐\n                │             user question               │\n                └──────────────────┬──────────────────────┘\n                                   ▼\n              ┌────────────────────────────────────────────┐\n              │   Large Reasoning Model (DeepSeek-R1 …)    │\n              │       generates extended chain-of-thought  │\n              └──────────────────┬─────────────────────────┘\n                                 ▼\n            ① ReasoningStepSegmenter   (94.2 F1 boundary classifier)\n                                 ▼\n            ② RSUS  =  α·U_verb + β·U_ent + γ·U_cons     ◀──── per step\n                                 ▼\n            ③ π(retrieve | state)        (REINFORCE policy, λ-curriculum)\n                                 ▼\n            ④ ColBERTv2 + PLAID retrieval  →  context fusion\n                                 ▼\n                        final answer\n```\n\n| Component                | Where                                            | What it does                                                  |\n|--------------------------|--------------------------------------------------|---------------------------------------------------------------|\n| **Segmenter**            | [segmentation.py](src\u002Frealm_retrieve\u002Fmodels\u002Fsegmentation.py) | Splits a reasoning chain into logical steps (avg 127 tok).    |\n| **RSUS calculator**      | [rsus.py](src\u002Frealm_retrieve\u002Fmodels\u002Frsus.py)     | 3-signal step-level uncertainty score.                        |\n| **Policy network**       | [policy.py](src\u002Frealm_retrieve\u002Fmodels\u002Fpolicy.py) | 4-layer transformer; binary retrieve vs. continue.            |\n| **Retriever**            | [retriever.py](src\u002Frealm_retrieve\u002Fmodels\u002Fretriever.py) | ColBERTv2 + PLAID late-interaction search. 
                   |\n| **LRM adapter**          | [reasoning_model.py](src\u002Frealm_retrieve\u002Fmodels\u002Freasoning_model.py) | Unified API for DeepSeek-R1, o1, QwQ.                         |\n\n---\n\n## 📦 Installation\n\n### Minimal (CPU, toy demo)\n```bash\npip install -e \".[dev]\"\npython -m spacy download en_core_web_sm\n```\n\n### Full (GPU, training + serving)\n```bash\npip install -e \".[all]\"           # serve + api + train + dev + docs\npython -m spacy download en_core_web_sm\n```\n\n### Docker\n```bash\ndocker build -t realm-retrieve .\ndocker run --gpus all --rm -it realm-retrieve make quickstart\n```\n\n### System requirements\n\n|                  | Toy demo    | Inference  | Full training            |\n|------------------|-------------|------------|--------------------------|\n| GPU              | none        | 1 × 24 GB  | 8 × A100 80 GB           |\n| RAM              | 4 GB        | 32 GB      | 512 GB (ColBERT index)   |\n| Disk             | 200 MB      | 50 GB      | 68 GB (data + indices)   |\n| Wall clock       | \u003C 2 min     | varies     | ≈ 12.5 days policy RL    |\n\n---\n\n## 🗂 Project layout\n\n```\nrealm-retrieve\u002F\n├── src\u002Frealm_retrieve\u002F      # installable package\n│   ├── models\u002F              #   segmentation · rsus · policy · retriever · LRM\n│   └── evaluation\u002F          #   QA + IR + efficiency + bootstrap\n├── configs\u002F                 # Hydra configs (datasets, models, experiments)\n├── examples\u002F                # Runnable demos — start here\n│   └── quickstart.py        #   end-to-end CPU toy\n├── notebooks\u002F               # Colab walk-throughs\n├── tests\u002F                   # pytest unit + integration suite\n├── docs\u002F                    # mkdocs site\n├── paper\u002F                   # SIGIR '26 camera-ready\n└── Makefile                 # `make help` for everything\n```\n\n---\n\n## 🛠 Common tasks\n\n```bash\nmake help                      # list every target\nmake 
quickstart                # toy CPU demo\nmake install-dev               # editable install + dev tools + pre-commit\nmake data                      # download MuSiQue \u002F HotpotQA \u002F 2WikiMHQA + build indices\nmake train-segmentation        # train the step-boundary classifier\nmake train-policy              # REINFORCE policy (50 K steps)\nmake eval DATASET=musique      # evaluate on a benchmark\nmake lint typecheck test       # local CI\nmake docs-serve                # browse docs at http:\u002F\u002Flocalhost:8000\n```\n\n---\n\n## ❓ FAQ\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>I don't have 8× A100s. Can I still use this?\u003C\u002Fb>\u003C\u002Fsummary>\n\nYes. The **policy and segmenter checkpoints are tiny** (≈ 40 MB and 6 MB).\nYou only need a big GPU for the *reasoning model* itself — and you can swap\nin any LRM via the OpenAI \u002F Anthropic adapter, or run DeepSeek-R1-Distill-Qwen-1.5B\nlocally on a single 24 GB card. See [`docs\u002Finference.md`](docs\u002Finference.md).\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>How does RSUS differ from FLARE's token-probability trigger?\u003C\u002Fb>\u003C\u002Fsummary>\n\nFLARE triggers on single low-probability tokens, so it fires constantly during\nexploratory reasoning. 
RSUS aggregates **three orthogonal signals** at the\n*step* level (≈ 127 tokens), so it only fires when the model is collectively\nuncertain about a *claim*, not just a stylistic word choice.\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>Can I replace ColBERTv2 with my own retriever?\u003C\u002Fb>\u003C\u002Fsummary>\n\nYes — implement the [`Retriever` protocol](src\u002Frealm_retrieve\u002Fmodels\u002Fretriever.py)\n(`retrieve(query, k) -> List[Dict]`) and inject it via Hydra:\n`retrieval.checkpoint=path\u002Fto\u002Fyours`.\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>Does it work for code, math, or non-English?\u003C\u002Fb>\u003C\u002Fsummary>\n\nThe entity-entropy signal needs a NER model — swap `en_core_web_sm` for any\nspaCy language model. For code\u002Fmath, set `β=0` and rely on verbalised +\nconsistency signals; we report ablations in §6.3 of the paper.\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>How do I cite this work?\u003C\u002Fb>\u003C\u002Fsummary>\n\nSee [Citation](#-citation) below or `CITATION.cff` in this repo.\n\u003C\u002Fdetails>\n\n---\n\n## 🛣 Roadmap\n\nWe ship in three-week milestones. 
Click into any issue for the design sketch\nand acceptance criteria — most are scoped tightly enough to land in a single\nPR.\n\n### v1.1 — *due 2026-06-11* &nbsp; [milestone](https:\u002F\u002Fgithub.com\u002Fbettyguo\u002Frealm-retrieve\u002Fmilestone\u002F2)\n- [ ] [#6](https:\u002F\u002Fgithub.com\u002Fbettyguo\u002Frealm-retrieve\u002Fissues\u002F6) `chore(types):` modernise type hints to PEP-604 — **good first issue**\n- [ ] [#7](https:\u002F\u002Fgithub.com\u002Fbettyguo\u002Frealm-retrieve\u002Fissues\u002F7) `test(ci):` CPU smoke test that wires the full pipeline from a wheel — **good first issue**\n\n### v1.2 — *due 2026-07-09* &nbsp; [milestone](https:\u002F\u002Fgithub.com\u002Fbettyguo\u002Frealm-retrieve\u002Fmilestone\u002F3)\n- [ ] [#8](https:\u002F\u002Fgithub.com\u002Fbettyguo\u002Frealm-retrieve\u002Fissues\u002F8) `feat(retriever):` extract a `Retriever` Protocol (BM25 · SPLADE · custom) — **help wanted**\n\n### v2.0 — *due 2026-09-03* &nbsp; [milestone](https:\u002F\u002Fgithub.com\u002Fbettyguo\u002Frealm-retrieve\u002Fmilestone\u002F4)\n- [ ] [#9](https:\u002F\u002Fgithub.com\u002Fbettyguo\u002Frealm-retrieve\u002Fissues\u002F9) `research(rsus):` multilingual entity-entropy support (zh \u002F ja \u002F es)\n- [ ] [#10](https:\u002F\u002Fgithub.com\u002Fbettyguo\u002Frealm-retrieve\u002Fissues\u002F10) `feat(demo):` HuggingFace Space + interactive playground\n\n### Recently shipped (v1.0)\n- [x] [#1](https:\u002F\u002Fgithub.com\u002Fbettyguo\u002Frealm-retrieve\u002Fissues\u002F1) Hydra `config_path` corrected\n- [x] [#2](https:\u002F\u002Fgithub.com\u002Fbettyguo\u002Frealm-retrieve\u002Fissues\u002F2) REINFORCE trainer no longer crashes on empty episodes\n- [x] [#3](https:\u002F\u002Fgithub.com\u002Fbettyguo\u002Frealm-retrieve\u002Fissues\u002F3) Heavy imports are lazy → CPU consumers work out of the box\n- [x] [#4](https:\u002F\u002Fgithub.com\u002Fbettyguo\u002Frealm-retrieve\u002Fissues\u002F4) Ship the missing 
`configs\u002Fexperiments\u002F*.yaml`\n- [x] [#5](https:\u002F\u002Fgithub.com\u002Fbettyguo\u002Frealm-retrieve\u002Fissues\u002F5) Validate RSUS weights sum to 1\n\nSee [all open issues](https:\u002F\u002Fgithub.com\u002Fbettyguo\u002Frealm-retrieve\u002Fissues) or\nthe [v1.1 board](https:\u002F\u002Fgithub.com\u002Fbettyguo\u002Frealm-retrieve\u002Fmilestone\u002F2) — and\nplease open an issue if you hit something unexpected.\n\n---\n\n## 🤝 Contributing\n\nPull requests are welcome. Please read [CONTRIBUTING.md](CONTRIBUTING.md) for the dev loop and our [Code of Conduct](CODE_OF_CONDUCT.md) before opening a PR. Security-sensitive reports go through [SECURITY.md](SECURITY.md).\n\n---\n\n## 🌟 Stargazers\n\nIf ReaLM-Retrieve helps your research, **a ⭐ keeps us going** — it's the single most useful signal we get from the community.\n\n[![Star History Chart](https:\u002F\u002Fapi.star-history.com\u002Fsvg?repos=bettyguo\u002Frealm-retrieve&type=Date)](https:\u002F\u002Fstar-history.com\u002F#bettyguo\u002Frealm-retrieve&Date)\n\n---\n\n## 📝 Citation\n\n```bibtex\n@inproceedings{guo2026realmretrieve,\n  title        = {When to Retrieve During Reasoning: Adaptive Retrieval for Large Reasoning Models},\n  author       = {Guo, Dongxin and Wu, Jikun and Yiu, Siu Ming},\n  booktitle    = {Proceedings of the 49th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '26)},\n  year         = {2026},\n  publisher    = {ACM},\n  address      = {Melbourne, Australia},\n  doi          = {10.1145\u002F3805712.3809722}\n}\n```\n\n---\n\n## 📜 License\n\nReleased under the [Apache 2.0 License](LICENSE). The SIGIR '26 manuscript in [`paper\u002F`](paper\u002F) is © ACM 2026 and is shared here for non-commercial research use under the conference's open-access terms.\n\n\u003Csub>Built with ❤ at HKU & Stellaris AI. 
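Inspired by IRCoT, FLARE, Self-RAG, and Search-R1.\u003C\u002Fsub>\n","ReaLM-Retrieve is an adaptive retrieval-augmented generation (RAG) system for large reasoning models. It detects knowledge gaps dynamically during reasoning and retrieves only when necessary, which improves the model's efficiency and accuracy. Its core mechanism decides when to retrieve from three uncertainty signals (self-assessment, entity-coverage entropy, and consistency across sampled continuations), and the policy is fine-tuned with REINFORCE to balance F1 against retrieval cost. The technique is particularly suited to applications that require long chains of reasoning with intermediate knowledge gaps, such as complex question answering or in-depth text analysis, where it can substantially reduce unnecessary retrieval requests while improving overall performance.",2,"2026-06-11 03:54:46","CREATED_QUERY"]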