[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-76145":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":8,"htmlUrl":8,"language":9,"languages":8,"totalLinesOfCode":8,"stars":10,"forks":11,"watchers":12,"openIssues":12,"contributorsCount":13,"subscribersCount":13,"size":13,"stars1d":13,"stars7d":14,"stars30d":15,"stars90d":13,"forks30d":13,"starsTrendScore":13,"compositeScore":16,"rankGlobal":8,"rankLanguage":8,"license":17,"archived":18,"fork":18,"defaultBranch":19,"hasWiki":20,"hasPages":18,"topics":21,"createdAt":8,"pushedAt":8,"updatedAt":22,"readmeContent":23,"aiSummary":24,"trendingCount":13,"starSnapshotCount":13,"syncStatus":25,"lastSyncTime":26,"discoverSource":27},76145,"gbrain-evals","garrytan\u002Fgbrain-evals","garrytan",null,"HTML",235,43,3,0,19,77,59.13,"MIT License",false,"main",true,[],"2026-06-11 04:06:42","# gbrain-evals\n\n**Public benchmarks for personal-knowledge agent stacks.** Two families,\nboth reproducible: BrainBench (our own corpus, in-house Cats 1–12) and\npublic benchmarks (LongMemEval today, ConvoMem + LoCoMo on the roadmap).\n\n## Latest results\n\n**LongMemEval `_s` (full 500-question public benchmark, 2026-05-07)** —\ngbrain-hybrid hits **97.60% R@5**, beating MemPalace's published raw\nbaseline by 1.0pt on the same dataset, K, and n with no LLM in the\nretrieval loop. Per-type wins: +7.1pt single-session-assistant, +1.5pt\nmulti-session, ties on user\u002Fpreference, -1.5pt temporal-reasoning.\n**Read the full report:** [docs\u002Fbenchmarks\u002F2026-05-07-longmemeval-s.md](docs\u002Fbenchmarks\u002F2026-05-07-longmemeval-s.md).\n\n**BrainBench v0.12.1 (in-house corpus, 2026-04-19)** — gbrain P@5 49.1%,\nR@5 97.9% on the 240-page fictional-life corpus. Beats its own\ngraph-disabled variant by +31.4pt P@5, grep-only by 32 points, vector by\n38 points. 
The graph layer is load-bearing.\n\n| Benchmark | Latest result | Date | Report |\n|---|---|---|---|\n| LongMemEval `_s` (public) | gbrain-hybrid 97.60% R@5 | 2026-05-07 | [link](docs\u002Fbenchmarks\u002F2026-05-07-longmemeval-s.md) |\n| BrainBench Cat 13b — Source Swamp | gbrain top-1 93.3% | 2026-04-25 | [link](docs\u002Fbenchmarks\u002F2026-04-25-brainbench-cat13b-source-swamp.md) |\n| BrainBench v0.20.0 baseline | gbrain P@5 49.1% \u002F R@5 97.9% | 2026-04-23 | [link](docs\u002Fbenchmarks\u002F2026-04-23-brainbench-v0.20.0.md) |\n| Cross-system comparison | MemPal \u002F Hindsight \u002F Mastra \u002F Stella \u002F Contriever | living | [docs\u002Fcomparison-systems.md](docs\u002Fcomparison-systems.md) |\n\n## Why a separate repo\n\nBenchmark corpora (world-v1 + amara-life-v1 = ~4MB) shouldn't land in\nevery gbrain install. This repo is what you clone when you want to run\nBrainBench against gbrain, not what you clone to use gbrain as a brain.\n\n`gbrain-evals` depends on `gbrain` via the GitHub URL. When you `bun install`\nhere, gbrain gets pulled in as a library. Evals call into gbrain's core\nmodules (`pglite-engine`, `operations`, `link-extraction`, etc.) 
via the\n`gbrain\u002F*` subpath exports.\n\n## 5-minute quickstart\n\n```sh\n# Clone + install (pulls gbrain as a library dep)\ngit clone https:\u002F\u002Fgithub.com\u002Fgarrytan\u002Fgbrain-evals.git\ncd gbrain-evals\nbun install\n```\n\n### Run LongMemEval (public benchmark, 500 questions × 4 adapters)\n\n```sh\n# Download the LongMemEval _s split (~278MB, one-time)\nmkdir -p ~\u002Fdatasets\u002Flongmemeval\ncurl -Lo ~\u002Fdatasets\u002Flongmemeval\u002Flongmemeval_s.json \\\n  https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fxiaowu0162\u002Flongmemeval\u002Fresolve\u002Fmain\u002Flongmemeval_s\n\nexport OPENAI_API_KEY=\"sk-...\"        # required for vector + hybrid adapters\nexport ANTHROPIC_API_KEY=\"sk-ant-...\" # required for hybrid+expansion adapter only\n\n# 4 adapters × 500 questions, 3 parallel workers, 10-min batches w\u002F resume\nbash eval\u002Frunner\u002Flongmemeval-batch.sh\n\n# One adapter only\nbash eval\u002Frunner\u002Flongmemeval-batch.sh --adapters hybrid\n\n# Stratified sample for fast iteration\nbun eval\u002Frunner\u002Flongmemeval.ts --stratify 10  # 10 Q's per type\n```\n\nFirst run pays ~$2 OpenAI embeddings; subsequent runs hit the local\ncontent-addressed cache (~$0). 
See the published\n[longmemeval-s benchmark report](docs\u002Fbenchmarks\u002F2026-05-07-longmemeval-s.md)\nfor headline numbers and methodology.\n\n### Run BrainBench (in-house corpus, 240-page fictional life)\n\n```sh\n# Run the full 4-adapter benchmark (N=5, ~15 min, no API keys required)\nbun run eval:run\n\n# Fast iteration (N=1)\nbun run eval:run:dev\n\n# Per-link-type accuracy report\nbun run eval:type-accuracy\n\n# Browse the fictional corpus\nbun run eval:world:view\n\n# Full BrainBench v1 scorecard (all Cats, published tier N=10)\nbun run eval:brainbench:published       # ~$200 Opus baseline\nbun run eval:brainbench                 # N=5 iteration (~$100)\nbun run eval:brainbench:smoke           # N=1 smoke (~$22)\n```\n\n## Public benchmarks\n\n| Benchmark | Source | What it tests | gbrain best result |\n|---|---|---|---|\n| **LongMemEval `_s`** | [xiaowu0162\u002Flongmemeval](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fxiaowu0162\u002Flongmemeval) | Retrieval recall over long-running chat (500 Q's, 6 question types, ~50 distractor sessions per haystack) | **97.60% R@5** ([report](docs\u002Fbenchmarks\u002F2026-05-07-longmemeval-s.md)) |\n| LongMemEval `_oracle` | same source, easier split (3 sessions per haystack) | sanity baseline | not yet run (trivial) |\n| LongMemEval `_m` | same source, harder split (200 distractors) | retrieval under heavier noise | filed as TODO |\n| ConvoMem (Salesforce) | 75K+ multi-turn QA pairs | conversational memory at scale | filed as TODO |\n| LoCoMo | 1,986 multi-hop QA | multi-hop reasoning over conversation | filed as TODO |\n\nEach public benchmark gets a runner under `eval\u002Frunner\u002F\u003Cbench>.ts`, a\nreport under `docs\u002Fbenchmarks\u002F\u003Cdate>-\u003Cbench>.md`, and per-row entries in\n[`docs\u002Fcomparison-systems.md`](docs\u002Fcomparison-systems.md) with sourced\nnumbers from MemPalace, Hindsight, Mastra, Supermemory, Stella,\nContriever, and BM25 baselines.\n\n## BrainBench Cat 
catalog\n\n| Cat | What it tests | Threshold | Status |\n|-----|--------------|-----------|--------|\n| 1+2 | Retrieval (relational queries over 240-page rich-prose) | P@5 > 0.39, R@5 > 0.83 | shipping |\n| 2 | Per-link-type accuracy on rich prose | type F1 per category | shipping |\n| 3 | Identity resolution (aliases, handles, emails) | recall > 0.80 | shipping |\n| 4 | Temporal queries (as-of, point, range, recency) | as-of recall > 0.80 | shipping |\n| 5 | Source attribution \u002F provenance (claim → source classification) | citation_accuracy > 0.90 | shipping (programmatic) |\n| 6 | Auto-link precision under prose (at scale) | link_precision > 0.95 | shipping (baseline-only) |\n| 7 | Performance \u002F latency | p95 \u003C 200ms per query | shipping |\n| 8 | Skill behavior compliance (brain-first, back-link, citation, tier) | all > 0.90 | shipping (programmatic) |\n| 9 | End-to-end workflows (5 flows × rubric) | 80% pass per workflow | shipping (programmatic) |\n| 10 | Robustness \u002F adversarial (22 hand-crafted cases) | 100% pass, no crash | shipping |\n| 11 | Multi-modal ingest (PDF + audio + HTML) | text > 0.95, WER \u003C 0.15 | shipping (opt-in fixtures) |\n| 12 | MCP operation contract (trust boundary, input validation) | no silent corruption | shipping |\n\nCats 5, 8, 9 are \"programmatic\" — they need runtime inputs (claim catalog,\nprobe catalog, scenarios + agent state) and are invoked via their `runCatN`\nharness API rather than as standalone CLI scripts.\n\n## The fictional corpus: world-v1 + amara-life-v1\n\n**world-v1** (committed, 2.0MB): 240 Opus-generated biographical pages.\n80 people, 80 companies, 50 meetings, 30 concepts. Each page carries\n`_facts` gold metadata that never crosses the adapter boundary (Day 9\nsealed-qrels enforcement).\n\n**amara-life-v1** (committed, 2.1MB): Amara Okafor's messy week in April\n2026. 
50 emails + 300 Slack messages across 4 channels + 20 calendar\nevents + 8 meeting transcripts + 40 first-person notes + 6 reference docs.\nPlanted perturbations: 10 contradictions, 5 stale facts, 5 paraphrased-\ninjection poison items, 3 implicit preferences.\n\nRegenerate with `bun run eval:generate-amara-life` (requires\n`ANTHROPIC_API_KEY`, ~$4 Opus, ~15 min, deterministic from seed=42).\n\n## Repo layout\n\n```\ngbrain-evals\u002F\n├── CLAUDE.md                         Spec for the platonic-ideal benchmark report\n├── eval\u002F\n│   ├── data\u002F\n│   │   ├── world-v1\u002F                 240 committed biographical pages (BrainBench)\n│   │   ├── amara-life-v1\u002F            Amara's fictional life (BrainBench)\n│   │   ├── gold\u002F                     Sealed qrels + perturbation gold\n│   │   ├── longmemeval\u002Fembed-cache\u002F  Embedding cache (gitignored, ~700MB)\n│   │   └── multimodal\u002F               PDF\u002Faudio\u002FHTML fixtures (on-demand)\n│   ├── schemas\u002F                      Portable JSON Schema contracts\n│   ├── generators\u002F                   world.ts + amara-life.ts + Opus\n│   ├── runner\u002F                       12 Cat runners + LongMemEval + adapters + judge\n│   │   ├── adapters\u002F                 grep-only, vector, vector-grep-rrf-fusion, claude-sonnet\n│   │   ├── loaders\u002F                  PDF + corpus loaders\n│   │   ├── queries\u002F                  Tier 5 fuzzy + 5.5 synthetic\n│   │   ├── all.ts                    BrainBench master runner\n│   │   ├── cat{5,6,8,9,11}-*.ts      v1 Complete runners\n│   │   ├── longmemeval.ts            LongMemEval runner (4 adapters, NDJSON resume, parallel sharding)\n│   │   ├── longmemeval-cache.ts      Content-addressed embedding cache (SQLite, WAL, busy_timeout)\n│   │   ├── longmemeval-aggregate.ts  Aggregator over the NDJSON stream\n│   │   ├── longmemeval-batch.sh      3-worker × 10-min batch wrapper with auto-resume\n│   │   ├── longmemeval-chart.ts      
Inline-SVG headline + per-type chart generator\n│   │   ├── tool-bridge.ts            12 read + 3 dry_run tools\n│   │   ├── judge.ts                  Haiku judge, structured evidence contract\n│   │   ├── recorder.ts               6-artifact flight-recorder\n│   │   └── llm-budget.ts             Shared Anthropic-call semaphore\n│   ├── reports\u002F                      Transient run output (gitignored)\n│   └── cli\u002F                          world-view, query-validate, query-new\n├── test\u002Feval\u002F                        Unit tests (314 tests, 1354 expect calls)\n└── docs\u002F\n    ├── benchmarks\u002F                   Published scorecards per release + their charts\n    └── comparison-systems.md         Living list of named systems w\u002F published R@k numbers\n```\n\n## Three contributor paths\n\n### 1. Reproduce a published scorecard\n```sh\ngit checkout \u003Ccommit-sha-from-scorecard>\nbun run eval:run\n# Match within tolerance bands (deterministic adapters byte-match)\n```\n\n### 2. Submit a new adapter\n1. Implement `eval\u002Frunner\u002Fadapters\u002F\u003Cyour-adapter>.ts` against the `Adapter`\n   interface (`init(pages, config) → BrainState`, `query(q, state) → RankedDoc[]`).\n2. Register it in `eval\u002Frunner\u002Fmulti-adapter.ts`.\n3. Run `bun run eval:run` — it scores side-by-side against the 4 references.\n4. Open a PR with your scorecard in `docs\u002Fbenchmarks\u002FYYYY-MM-DD-\u003Cstack>.md`.\n\n### 3. Extend a Cat\n1. Add a new Cat runner at `eval\u002Frunner\u002FcatN-*.ts`.\n2. Wire into `eval\u002Frunner\u002Fall.ts` CATEGORIES.\n3. Add tests at `test\u002Feval\u002FcatN.test.ts`.\n4. 
Commit a baseline to `docs\u002Fbenchmarks\u002F`.\n\n## Design doc + methodology\n\n- `docs\u002Fbenchmarks\u002FTEMPLATE-brainbench-v1.md` — scorecard format (coming in v1 Complete ship)\n- BrainBench v1 design doc: `~\u002F.gstack\u002Fprojects\u002Fgarrytan-gbrain\u002Fgarrytan-garrytan-gbrain-evals-design-20260418-081754.md` (original)\n- 3-axis metric framework: Retrieval (Cat 1-4), Ingestion (Cat 2, 6, 11), Assistant\u002Fpersonalization (Cat 5, 8, 9)\n- Anti-gaming: sealed qrels at the adapter boundary, N=3\u002F5\u002F10 tolerance bands,\n  judge-version pinning, randomized query order per seeded run\n\n## License\n\nMIT. Fixtures (world-v1, amara-life-v1) are fully fictional and redistributable.\n\n## Relationship to gbrain\n\n`gbrain-evals` is a **consumer** of `gbrain`. The benchmark imports gbrain's\npublic surface via `gbrain\u002F*` subpath exports:\n\n- `gbrain\u002Foperations` — the 36 operations (tool-bridge exposes 12 read-only + 3 dry_run)\n- `gbrain\u002Fpglite-engine` — in-memory Postgres for adapter state\n- `gbrain\u002Flink-extraction` — extractor under test\n- `gbrain\u002Fimport-file`, `gbrain\u002Fembedding`, `gbrain\u002Ftranscription` — ingest pipeline\n- `gbrain\u002Fsearch\u002Fvector-grep-rrf-fusion` — vector-grep-rrf-fusion RAG implementation\n- `gbrain\u002Ftypes`, `gbrain\u002Fconfig`, `gbrain\u002Fengine` — type contracts\n\nAny adapter that implements the `Adapter` interface can be scored — gbrain\nis one of many reference stacks, not the benchmark's subject.\n","gbrain-evals is a public benchmarking project for personal-knowledge agent stacks. It provides two reproducible benchmark families: BrainBench (an in-house dataset) and LongMemEval (a public dataset). Its core functionality includes evaluating the performance of different kinds of personal-knowledge agents, with support for multiple adapters, and it significantly improves retrieval precision through its graph layer. It is particularly suited to researchers and developers who need to evaluate and compare personal knowledge management systems or the long-term memory of chatbots. The project depends on gbrain as its core library, is written in HTML, and is open-sourced under the MIT License.",2,"2026-06-11 03:54:39","CREATED_QUERY"]