[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-75419":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":14,"subscribersCount":14,"size":14,"stars1d":15,"stars7d":16,"stars30d":17,"stars90d":14,"forks30d":14,"starsTrendScore":18,"compositeScore":19,"rankGlobal":9,"rankLanguage":9,"license":20,"archived":21,"fork":21,"defaultBranch":22,"hasWiki":21,"hasPages":21,"topics":23,"createdAt":9,"pushedAt":9,"updatedAt":24,"readmeContent":25,"aiSummary":26,"trendingCount":14,"starSnapshotCount":14,"syncStatus":27,"lastSyncTime":28,"discoverSource":29},75419,"PaperGuru-Benchmark","PaperGuru-AI\u002FPaperGuru-Benchmark","PaperGuru-AI","Lifecycle-Aware Memory for long-horizon LLM agents — 66.05% on PaperBench, 94.66% on SurveyBench, 10 peer-reviewed acceptances at FSE\u002FICML\u002FTOSEM\u002FAEI\u002FICoGB",null,"TeX",842,129,87,0,199,239,740,597,10.34,"Other",false,"main",[],"2026-06-12 02:03:33","\u003Cdiv align=\"center\">\n\n\u003Cimg src=\"assets\u002Ffigures\u002Fhero_banner.png\" alt=\"PaperGuru — research at the speed of light\" width=\"100%\" \u002F>\n\n\u003Cbr\u002F>\n\n# PaperGuru &nbsp;·&nbsp; The Lifecycle-Aware Memory Benchmark\n\n**The first long-term memory primitive for long-horizon LLM agents — with state-of-the-art results on PaperBench and SurveyBench, and a peer-reviewed track record across FSE 2026, ICML 2026, TOSEM, AEI, and 
ICoGB.**\n\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-PDF-1f3a8a?style=for-the-badge&logo=arxiv&logoColor=white)](paper\u002FPaperGuru-CCM.pdf)\n[![PaperBench](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPaperBench-66.05%25-0a0a0b?style=for-the-badge)](PaperBench\u002F)\n[![SurveyBench](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FSurveyBench-94.66%25-0a0a0b?style=for-the-badge)](SurveyBench\u002F)\n[![Accepted](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FAccepted-10%20papers-c9a14a?style=for-the-badge)](#-track-record)\n[![中文 README](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F中文-README-dc2626?style=for-the-badge)](README.zh-CN.md)\n[![Join WeChat](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FJOIN-WECHAT-07c160?style=for-the-badge&logo=wechat&logoColor=white&labelColor=555555)](assets\u002Fbadges\u002Fwechat_qr.png)\n\n\u003C\u002Fdiv>\n\n---\n\n## 📰 TL;DR\n\n> AI infrastructure has three commodity primitives — **compute** (NVIDIA), **models** (frontier LLM weights), and **retrieval** (Pinecone-class vector databases). A fourth primitive — **long-term memory with lifecycle semantics** — is missing, and every long-horizon LLM system reinvents it badly. **PaperGuru** is the first system designed from a 4-axiom formalisation of *Lifecycle-Aware Memory* (LAM), and on the two most rigorous published benchmarks it delivers state-of-the-art from a single algorithmic mechanism.\n\n| Benchmark | Metric | PaperGuru | Best published baseline | Lift |\n|---|---|---:|---:|---:|\n| **PaperBench** (OpenAI, 2025) | Mean reproduction across 23 papers | **66.05%** | 35.74% | **+30.21%** |\n| **PaperBench** | Papers above 41% human ML-PhD bar | **20 \u002F 23** | 4 \u002F 23 | **+16 papers** |\n| **SurveyBench** (Yan et al., 2025) | Content score (5-axis avg.) 
| **94.66%** | 80.60% | **+14.06%** |\n| **SurveyBench** | Composite richness (figures · tables · code) | **43.76%** | 20.36% | **+23.40%** |\n| **Real world** | Peer-reviewed acceptances since Q4 2025 | **10 papers** | — | 5 venues |\n\n\u003Cbr\u002F>\n\n\u003Cdiv align=\"center\">\n\u003Cimg src=\"assets\u002Ffigures\u002Fbefore_after.png\" alt=\"Before \u002F After PaperGuru\" width=\"92%\" \u002F>\n\u003Cbr\u002F>\n\u003Csub>\u003Ci>Same researcher, same task. The difference is the memory primitive underneath.\u003C\u002Fi>\u003C\u002Fsub>\n\u003C\u002Fdiv>\n\n---\n\n## 📑 Table of Contents\n\n- [What is PaperGuru?](#-what-is-paperguru)\n- [Why memory, why now](#-why-memory-why-now)\n- [The CCM architecture](#-the-ccm-architecture)\n- [Pipeline at a glance](#-pipeline-at-a-glance)\n- [Results · PaperBench](#-results--paperbench)\n- [Results · SurveyBench](#-results--surveybench)\n- [Track Record](#-track-record)\n- [What is in this repository](#-what-is-in-this-repository)\n- [Reproducing the figures](#-reproducing-the-figures)\n- [License](#-license)\n\n---\n\n## ✨ What is PaperGuru?\n\nPaperGuru is a memory architecture for long-horizon LLM agents. It is **not** another RAG library and **not** another agent framework. It is the first concrete instantiation of **Lifecycle-Aware Memory (LAM)** — a four-axiom system primitive that production agents have been quietly missing.\n\nThe four LAM axioms (formalised in §3 of the paper):\n\n1. **Versioned content** &nbsp;·&nbsp; statements once correct must become *stale* after revision, deprecation, or retraction — the memory layer must know.\n2. **Structural multi-hop relevance** &nbsp;·&nbsp; the right evidence is two citations away, not one cosine-similarity hop.\n3. **Bounded query cost under unbounded archive growth** &nbsp;·&nbsp; the archive grows every day; routing cost cannot grow with it.\n4. 
**Provenance-grounded composition** &nbsp;·&nbsp; every claim in the agent's output must trace back to a verifiable artifact in memory.\n\nPaperGuru satisfies all four axioms jointly through a single mechanism — the **Capital Chunk Memory (CCM)** — described next.\n\n---\n\n## 🌍 Why memory, why now\n\nContext windows have moved from 4K to 1M tokens in eighteen months, and the next milestone — *persistent* memory across sessions — is on every foundation lab's roadmap. The dominant unit of LLM deployment is no longer the single-prompt completion; it is the **long-horizon agentic system**:\n\n- a multi-day software-engineering session that touches hundreds of files,\n- a literature-grounded research assistant that drafts a 200K-token survey from a citation graph spanning ten years,\n- a paper-to-code reproduction agent that turns a paper PDF into a runnable submission tree,\n- a clinical-evidence agent that reads a decade of trial records before recommending treatment.\n\nAcross every published evaluation of these systems — **SurveyBench**, **PaperBench**, **SWE-bench-Live** — the ceiling is no longer set by the backbone's reasoning ability. 
It is set by what the system *remembers* between turns and *retrieves* from outside the working set.\n\n---\n\n## 🏛 The CCM architecture\n\n\u003Cdiv align=\"center\">\n\u003Cimg src=\"assets\u002Ffigures\u002Farchitecture.png\" alt=\"PaperGuru CCM architecture\" width=\"92%\" \u002F>\n\u003C\u002Fdiv>\n\nPaperGuru separates memory into two surfaces:\n\n- **Chunk heads** — a compact, bounded routing surface (one head per artifact)\n- **Chunk contents** — the unbounded raw text, accessed lazily on demand\n\nA central **capital chunk** indexes all heads and supports *capital-first routing* over a **temporal artifact graph** that unifies two edge classes:\n\n| Edge class | Examples |\n|---|---|\n| **Structural edges** | `cites`, `benchmarked-on`, `introduced-by`, `implements` |\n| **Historical-causality edges** | `discussed-in`, `deprecated-by`, `retracted-by`, `superseded-by` |\n\nQuery-time context is constructed through a **route-first → expand-second → distill-last** pipeline that yields compact, provenance-grounded **evidence cards** — the single data structure on which the rest of the system operates.\n\n> **Why this matters in practice.** Flat retrieval breaks the moment a paper is revised; agent-specific memory hacks (MemGPT tiers, Ebbinghaus forgetting, knowledge-graph wrappers) each handle one or two of the four axioms but never all four. CCM is the first design we know of that satisfies all four jointly without per-task hand-tuning.\n\n---\n\n## ⚙️ Pipeline at a glance\n\n\u003Cdiv align=\"center\">\n\u003Cimg src=\"assets\u002Ffigures\u002Fpipeline.png\" alt=\"PaperGuru CCM pipeline\" width=\"92%\" \u002F>\n\u003Cbr\u002F>\n\u003Csub>\u003Ci>Search → Extract → Reason → Verify. 
The Reason stage is where the Compose \u002F Critique \u002F Mutate cycle runs.\u003C\u002Fi>\u003C\u002Fsub>\n\u003C\u002Fdiv>\n\n\u003Cbr\u002F>\n\n\u003Cdiv align=\"center\">\n\u003Cimg src=\"assets\u002Fdemos\u002Fpipeline_animated.svg\" alt=\"PaperGuru live pipeline\" width=\"92%\" \u002F>\n\u003Cbr\u002F>\n\u003Csub>\u003Ci>Live SVG animation — the same pipeline, rendered as a beating system. Three sub-blocks of the Reason stage cycle through Compose \u002F Critique \u002F Mutate; verification ticks resolve as evidence cards land.\u003C\u002Fi>\u003C\u002Fsub>\n\u003C\u002Fdiv>\n\n| Stage | Input | Output | Compute share* |\n|---|---|---|---:|\n| **01 · SEARCH**  | Topic query, candidate archive | Ranked artifact heads | ~15% |\n| **02 · EXTRACT** | Heads + chunk contents | Evidence cards (text + provenance) | ~20% |\n| **03 · REASON**  | Evidence cards | Draft segments (Compose → Critique → Mutate loop) | ~45% |\n| **04 · VERIFY**  | Draft segments | Cited, provenance-checked output | ~20% |\n\n\u003Csup>* Indicative share of total wall-clock for a typical 200K-token survey run; varies by task.\u003C\u002Fsup>\n\n---\n\n## 📊 Results · PaperBench\n\nPaperBench (OpenAI, 2025) is the canonical paper-to-code reproduction benchmark: each submission is a runnable code tree scored by a leaf-judge LLM against a hand-written rubric. 
The official human-expert baseline is **41% over a 48-hour ML-PhD budget**.\n\n### Aggregate\n\n\u003Cdiv align=\"center\">\n\u003Cimg src=\"assets\u002Ffigures\u002Fpaperbench_topline.png\" alt=\"PaperBench top-line comparison\" width=\"88%\" \u002F>\n\u003C\u002Fdiv>\n\nPaperGuru reaches a **per-paper mean of 66.05% across all 23 papers**, beating every published baseline and clearing the human-expert bar by **+25 points**.\n\n### Per-paper\n\n\u003Cdiv align=\"center\">\n\u003Cimg src=\"assets\u002Ffigures\u002Fpaperbench_bar.png\" alt=\"PaperBench per-paper breakdown\" width=\"100%\" \u002F>\n\u003C\u002Fdiv>\n\nFor the 20 papers that have a published baseline:\n\n- **19 \u002F 20** papers improve over the strongest baseline\n- Mean lift: **+30.21% absolute**\n- Median lift: **+27.0% absolute**\n- Largest gain: `stay-on-topic-with-classifier-free-guidance` (+68.03%)\n- Only regression: `pinn` (−4.47%, where the cited PINN baseline already used hand-tuned domain priors)\n\n### Distribution of lifts\n\n\u003Cdiv align=\"center\">\n\u003Cimg src=\"assets\u002Ffigures\u002Flift_distribution.png\" alt=\"Distribution of per-paper lifts\" width=\"80%\" \u002F>\n\u003C\u002Fdiv>\n\nThree additional papers (`semantic-self-consistency`, `self-composing-policies`, `self-expansion`) have no published baseline; PaperGuru scores **95.45%**, **65.03%**, and **39.77%** respectively.\n\n📂 **All 23 reproduction submissions are in [`PaperBench\u002Fsubmissions\u002F`](PaperBench\u002Fsubmissions\u002F).** Aggregate scores are in [`PaperBench\u002Faggregate-final.json`](PaperBench\u002Faggregate-final.json) and the per-paper comparison report is in [`PaperBench\u002FPER_PAPER_COMPARISON.md`](PaperBench\u002FPER_PAPER_COMPARISON.md).\n\n---\n\n## 📊 Results · SurveyBench\n\nSurveyBench (Yan et al., 2025) evaluates long-form survey writing along three dimensions — Content, Outline, and Richness — under an LLM judge. 
PaperGuru is evaluated under the official `claude-opus-4.7` judge with all dimensions normalised to [0, 100%].\n\n### Content quality (5-axis radar)\n\n\u003Cdiv align=\"center\">\n\u003Cimg src=\"assets\u002Ffigures\u002Fsurveybench_radar.png\" alt=\"SurveyBench radar comparison\" width=\"80%\" \u002F>\n\u003C\u002Fdiv>\n\nPaperGuru scores **94.66%** on the content average — a **+14.06%** absolute lift over the strongest baseline (AutoSurvey at 80.60%) — and reaches the ceiling on **Focus** and **Fluency**.\n\n| Dimension | PaperGuru | AutoSurvey | LLM×MR-v2 | SurveyForge | ASur | Lift |\n|---|---:|---:|---:|---:|---:|---:|\n| Coverage  | **94.00%** | 61.00% | 70.00% | 60.00% | 59.00% | **+24.00%** |\n| Coherence | **87.40%** | 80.00% | 78.00% | 79.00% | 76.00% | **+7.40%**  |\n| Depth     | **92.00%** | 75.00% | 75.00% | 72.00% | 60.00% | **+17.00%** |\n| Focus     | **100.00%**| 99.00% | 94.00% | 86.00% | 76.00% | **+1.00%**  |\n| Fluency   | **100.00%**| 88.00% | 80.00% | 80.00% | 80.00% | **+12.00%** |\n| **Content avg.** | **94.66%** | 80.60% | 79.40% | 75.40% | 70.20% | **+14.06%** |\n\n### Composite richness (the structural advantage)\n\n\u003Cdiv align=\"center\">\n\u003Cimg src=\"assets\u002Ffigures\u002Fsurveybench_richness.png\" alt=\"SurveyBench composite richness\" width=\"92%\" \u002F>\n\u003C\u002Fdiv>\n\nRichness measures whether the generated survey contains *evidence-grounded* artifacts: figures, tables, executable code blocks, and resolved citations. **Two of four baselines produce zero**. 
PaperGuru reaches **43.76%**, more than **2× the strongest baseline** — and crucially this is a *file-system measurement* (no LLM judge), so the gap is preserved under judge swap and under input truncation.\n\n📂 **All 20 generated surveys are in [`SurveyBench\u002F`](SurveyBench\u002F)** in three formats: [`pdf\u002F`](SurveyBench\u002Fpdf\u002F), [`markdown\u002F`](SurveyBench\u002Fmarkdown\u002F), and [`latex\u002F`](SurveyBench\u002Flatex\u002F) (full LaTeX sources for reproducibility).\n\n---\n\n## 🏆 Track Record\n\nPaperGuru-assisted manuscripts have been **formally accepted** at top-tier venues across software engineering, machine learning, and engineering informatics — with thirty more under active review at NeurIPS 2026, CCS 2026, and adjacent venues.\n\n\u003Cdiv align=\"center\">\n\u003Cimg src=\"assets\u002Ffigures\u002Ftrophy_wall.png\" alt=\"PaperGuru track record · 5 venues, 10 acceptances\" width=\"100%\" \u002F>\n\u003C\u002Fdiv>\n\n| Venue | Tier | Year | Status |\n|---|---|---|---|\n| **FSE 2026** (ACM, CCF-A) | Diamond | 2026 | 5 papers accepted (3 IVR + 2 Poster) |\n| **ICML 2026** (CCF-A) | Diamond | 2026 | 1 long paper accepted to main proceedings |\n| **TOSEM** (ACM Trans., CCF-A) | Diamond | 2026 | 2 articles accepted (under publication embargo) |\n| **AEI** (Elsevier, SCI Q1) | Platinum | 2026 | 1 paper, minor revision accepted |\n| **ICoGB 2026** (Civil Engineering) | Gold | 2026 | 1 cross-disciplinary paper accepted |\n\n> **What this proves.** The same algorithmic memory that wins PaperBench (CS reproduction) and SurveyBench (CS literature synthesis) also writes the publishable manuscript that gets through peer review at venues spanning software engineering, ML, civil engineering, and engineering informatics. 
**Only the artifact archive differs.**\n\n---\n\n## 📦 What is in this repository\n\n```\nPaperGuru-Benchmark\u002F\n├── README.md                          ← you are here\n├── README.zh-CN.md                    ← 中文版\n├── LICENSE                            ← MIT\n│\n├── paper\u002F\n│   └── PaperGuru-CCM.pdf              ← the full paper (NeurIPS 2026 submission)\n│\n├── PaperBench\u002F                        ← 23 reproduction submissions\n│   ├── README.md\n│   ├── aggregate-final.json           ← all scores in machine-readable form\n│   ├── PER_PAPER_COMPARISON.md        ← per-paper PaperGuru vs baselines\n│   ├── REPORT.md                      ← narrative report\n│   └── submissions\u002F\n│       ├── adaptive-pruning\u002F          ← runnable code tree, one per paper\n│       ├── all-in-one\u002F\n│       ├── ...                        ← 23 directories total\n│       └── what-will-my-model-forget\u002F\n│\n├── SurveyBench\u002F                       ← 20 generated surveys, three formats\n│   ├── README.md\n│   ├── pdf\u002F                           ← compiled PDFs (review-ready)\n│   ├── markdown\u002F                      ← markdown source (web-friendly)\n│   └── latex\u002F                         ← full LaTeX sources (rebuild-able)\n│\n└── assets\u002F\n    ├── badges\u002F                        ← 5 venue badges (transparent PNG)\n    │   ├── fse.png    icml.png    tosem.png    aei.png    icogb.png\n    ├── figures\u002F                       ← every figure in this README\n    │   ├── hero_banner.png\n    │   ├── architecture.png\n    │   ├── pipeline.png\n    │   ├── before_after.png\n    │   ├── trophy_wall.png\n    │   ├── paperbench_bar.png         paperbench_topline.png\n    │   ├── surveybench_radar.png      surveybench_richness.png\n    │   ├── lift_distribution.png\n    │   ├── data.json                  ← raw numbers used to build all charts\n    │   ├── build_figures.py           ← rebuild data charts\n    │   └── build_trophy_wall.py       ← 
rebuild trophy wall composite\n    └── demos\u002F\n        └── pipeline_animated.svg      ← the live SVG animation\n```\n\n**Repository size**: ~350 MB (PaperBench submissions and SurveyBench LaTeX sources dominate). No file exceeds 20 MB, so the repository works with plain `git` on GitHub; Git LFS is not required.\n\n---\n\n## 🔬 Reproducing the figures\n\nAll data figures in this README are built by a single Python script, `build_figures.py`, with the raw numbers stored alongside as JSON; a second script rebuilds the trophy wall composite. To rebuild them:\n\n```bash\ncd assets\u002Ffigures\npython3 -m pip install matplotlib numpy pillow\npython3 build_figures.py        # rebuilds the 5 data charts + data.json\npython3 build_trophy_wall.py    # rebuilds the trophy_wall.png composite\n```\n\nThe raw numbers (every cell in every results table) are in [`assets\u002Ffigures\u002Fdata.json`](assets\u002Ffigures\u002Fdata.json). If you use these numbers, please cite the paper.\n\n---\n\n## 📜 License\n\nThis release is distributed under the [MIT License](LICENSE). The PaperBench reproduction submissions inherit the licenses of their corresponding original papers; check each subdirectory before redistributing. The SurveyBench generated surveys may be cited and quoted with attribution.\n\n---\n\n\u003Cdiv align=\"center\">\n\n**PaperGuru** &nbsp;·&nbsp; the missing memory primitive for long-horizon LLM agents.\n\n\u003Csub>Built for researchers, by researchers. Verified on the hardest published benchmarks. 
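Backed by ten peer-reviewed acceptances and counting.\u003C\u002Fsub>\n\n\u003Cbr\u002F>\n\n[paper](paper\u002FPaperGuru-CCM.pdf) &nbsp;·&nbsp;\n[PaperBench](PaperBench\u002F) &nbsp;·&nbsp;\n[SurveyBench](SurveyBench\u002F) &nbsp;·&nbsp;\n[中文](README.zh-CN.md)\n\n\u003C\u002Fdiv>\n","PaperGuru-Benchmark is a benchmark project for long-term memory mechanisms, designed for long-horizon large language model (LLM) agents. Its core contribution is a lifecycle-aware memory primitive that, through a single algorithmic mechanism, achieves leading results on two authoritative benchmarks, PaperBench and SurveyBench, scoring 66.05% and 94.66% respectively; 10 peer-reviewed papers have also been accepted at top venues such as FSE and ICML. The project is particularly suited to research scenarios that need long-term memory support, such as generating and managing literature reviews in academic research and continuously tracking and analysing complex projects.",2,"2026-06-11 03:52:44","CREATED_QUERY"]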