[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-82048":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":14,"subscribersCount":14,"size":14,"stars1d":13,"stars7d":15,"stars30d":16,"stars90d":14,"forks30d":14,"starsTrendScore":17,"compositeScore":18,"rankGlobal":9,"rankLanguage":9,"license":19,"archived":20,"fork":20,"defaultBranch":21,"hasWiki":22,"hasPages":20,"topics":23,"createdAt":9,"pushedAt":9,"updatedAt":24,"readmeContent":25,"aiSummary":26,"trendingCount":14,"starSnapshotCount":14,"syncStatus":13,"lastSyncTime":27,"discoverSource":28},82048,"Co-Scientist","Kaimen-Inc\u002FCo-Scientist","Kaimen-Inc","Open Source Reimplementation of Google Deepmind's Co-Scientist",null,"Python",161,34,2,0,17,95,11,66.63,"Apache License 2.0",false,"main",true,[],"2026-06-12 04:01:36","# AI co-scientist\n\nAn open source re-implementation of Google's **AI co-scientist** ([Gottweis et al., *Nature*, 2026](https:\u002F\u002Fwww.nature.com\u002Farticles\u002Fs41586-026-10644-y); [research blog, 2025](https:\u002F\u002Fresearch.google\u002Fblog\u002Faccelerating-scientific-breakthroughs-with-an-ai-co-scientist\u002F)) — a multi-agent system that takes a natural-language research goal and produces a tournament-ranked **research overview** of novel hypotheses.\n\nThe agent roster, prompts, and control flow follow the paper. 
Source materials that were used to instruct the coding agent (Claude Code) are included with the repo:\n\n- [`reference\u002F8 Pseudocode of Co-Scientist agents`](reference\u002F) — the supplementary pseudocode for Supervisor, Generation, Reflection, Ranking, Evolution, Proximity, Meta-review.\n- [`reference\u002F9 Prompts for the specialized agents in .md`](reference\u002F) — the per-agent prompts from the paper's supplement, used verbatim (modulo Jinja interpolation) in [`config\u002Fprompts\u002F`](config\u002Fprompts\u002F).\n- [`reference\u002FAICoScientist-*.png`](reference\u002F) — the architecture and component diagrams from the paper.\n\nThe agents:\n\n- **Generation** — proposes hypotheses via literature review and simulated scientific debate.\n- **Reflection** — reviews hypotheses for novelty, correctness, and testability; deep-verifies the underlying assumptions.\n- **Ranking** — runs an Elo tournament with simulated debates between hypotheses.\n- **Evolution** — combines top-ranked hypotheses, simplifies them, makes them more feasible, or reimagines them out of the box.\n- **Proximity** — embeds and clusters hypotheses to drive dedup and informative tournament pairings.\n- **Meta-review** — synthesizes system-wide feedback and the final research overview.\n\nA **Supervisor** parses the goal into a research plan and schedules agent tasks through a durable SQLite-backed queue with bounded concurrency.\n\nThis is an independent re-implementation in Python on top of pluggable LLM provider SDKs — not affiliated with Google or the paper's authors.\n\n> [`docs\u002FBENCH_RESULTS.md`](docs\u002FBENCH_RESULTS.md) — every cross-model bench ever run on this code, with per-candidate Elo, every hypothesis produced, gold-set hits, and direct file pointers. 
Auto-generated from the bench DB.\n\n## Contents\n\n- [Architecture](#architecture)\n- [Install](#install)\n- [Initialize](#initialize)\n- [Run a research session](#run-a-research-session)\n- [LLM provider](#llm-provider)\n- [Configuration](#configuration)\n- [Bench: compare models head-to-head](#bench-compare-models-head-to-head)\n- [Repository layout](#repository-layout)\n\n## Architecture\n\n```\n                       co-scientist run \"\u003Cgoal>\"\n                                  │\n                                  ▼\n            ┌──────────────────────────────────────┐\n            │            Supervisor                │  durable task queue (SQLite)\n            │  • parse_goal → ResearchPlan         │  bounded concurrency\n            │  • enqueue initial Generation tasks  │  lease + dead-letter + resume\n            │  • main loop: claim → run → follow-up│  termination: BUDGET \u002F WALL_CLOCK\n            │  • decide_next_steps when idle       │              \u002F ELO_STABLE \u002F IDLE \u002F EXTERNAL\n            │  • finalize: meta-review overview    │\n            └──────────────────────────────────────┘\n                                  │  tasks\n            ┌─────────────────────┼─────────────────────────────┐\n            ▼                     ▼                             ▼\n   ┌──────────────┐      ┌──────────────┐              ┌──────────────┐\n   │  Generation  │ hyp  │  Reflection  │ review       │   Ranking    │\n   │  literature  │─────►│  full +      │─────────────►│ pairwise vs  │──► Elo\n   │  + debate    │      │  verification│              │   debate     │\n   └──────────────┘      └──────────────┘              └──────────────┘\n            ▲                     ▲                             │\n            │                     │ informative pairings        ▼\n   ┌──────────────┐      ┌──────────────┐              ┌──────────────┐\n   │  Evolution   │◄─────│ Meta-review  │              │  Proximity   │\n   │ combine \u002F    │ 
feed │ system fdbk  │              │ FAISS embed  │\n   │ simplify \u002F   │ back │ + final      │              │ + cluster \u002F  │\n   │ feasibility \u002F│      │ overview     │              │ dedup        │\n   │ out_of_box   │      └──────────────┘              └──────────────┘\n   └──────────────┘\n            │\n            ▼\n       new hypotheses re-enter the cycle\n\n\n  Shared infrastructure\n  ─────────────────────\n  • LLMProvider  ─ anthropic \u002F openai \u002F openrouter \u002F gemini \u002F groq \u002F\n                   together \u002F mistral \u002F ollama \u002F openai_compatible\n  • ToolRegistry ─ web_fetch + pubmed_search \u002F arxiv_search \u002F europe_pmc_search;\n                   web_search auto-registered iff TAVILY\u002FBRAVE key set;\n                   science-skills discovered via SKILL.md frontmatter\n  • TokenBudget  ─ per-agent shares + global cap; reservation released on retry\n  • EventBus     ─ in-memory fan-out to SSE for the live web UI\n  • FaissStore   ─ IndexFlatIP per session, asyncio-locked, atomic save\u002Fload;\n                   Voyage → OpenAI → hash-fallback embedder chain\n  • SQLite       ─ sessions \u002F hypotheses \u002F reviews \u002F tournament_matches \u002F\n                   elo_journal \u002F tasks \u002F transcripts \u002F system_feedback \u002F\n                   embeddings_meta \u002F spans \u002F events \u002F bench_* (15 tables;\n                   WAL, busy_timeout, idempotent migration runner)\n```\n\n## Install\n\n```bash\n# Recommended: Python 3.11–3.13 (FAISS wheel availability)\npython3.12 -m venv .venv\nsource .venv\u002Fbin\u002Factivate\npip install -e \".[dev]\"\n\ncp .env.example .env\n# fill in the API key for whichever LLM provider you'll use (see below).\n```\n\n## Initialize\n\n```bash\nco-scientist init\nco-scientist list\n```\n\n`init` creates `data\u002F` (artifacts, vectors, logs) and applies migrations to `data\u002Fco_scientist.db`. 
The output prints which LLM provider it sees configured and whether its API key is set.\n\n## Run a research session\n\n```bash\nco-scientist run \"Identify hypotheses about microbiome-driven inflammation\" \\\n  --n 3 --budget-usd 2.0 --wall-clock 600\n```\n\nThis kicks off Generation → Reflection → Ranking → Evolution → Meta-review under the configured LLM provider. The Supervisor schedules tasks, the Elo tournament refines a leaderboard, and the final research overview is written to `data\u002Fartifacts\u002F\u003Csession_id>\u002Ffinal\u002Foverview.md`.\n\n```bash\nco-scientist serve            # FastAPI + htmx + SSE dashboard at localhost:7878\nco-scientist report \u003Cid>      # print the final overview\nco-scientist status \u003Cid>      # session metadata + counts\nco-scientist pause \u003Cid> | resume \u003Cid> | abort \u003Cid>\nco-scientist feedback \u003Cid> --kind directive --text \"focus on metabolic pathways\"\nco-scientist estimate         # pre-flight cost estimate; warns if > 1.2× budget\nco-scientist eval [agent]     # run the rubric eval bundle (offline mode optional)\nco-scientist tools list       # show every registered tool the agents can call\n```\n\n## LLM provider\n\nThe agents talk to one LLM provider per session, configured in [`config\u002Fdefault.toml`](config\u002Fdefault.toml) (override with your own `co-scientist.toml`):\n\n```toml\n[llm]\nprovider = \"anthropic\"\n```\n\n| provider              | Endpoint                                                | Required key            | Example models                                            |\n| --------------------- | ------------------------------------------------------- | ----------------------- | --------------------------------------------------------- |\n| `anthropic` *(default)* | api.anthropic.com                                     | `ANTHROPIC_API_KEY`     | `claude-opus-4-7`, `claude-sonnet-4-6`                    |\n| `openai`              | api.openai.com                 
                         | `OPENAI_API_KEY`        | `gpt-5`, `gpt-4o`, `o3-mini`                              |\n| `openrouter`          | openrouter.ai — 200+ models from every major vendor     | `OPENROUTER_API_KEY`    | `anthropic\u002Fclaude-3.5-sonnet`, `openai\u002Fgpt-5`, `google\u002Fgemini-2.5-pro`, `meta-llama\u002Fllama-3.3-70b-instruct` |\n| `gemini` \u002F `google`   | generativelanguage.googleapis.com (OpenAI-compat)       | `GEMINI_API_KEY`        | `gemini-2.5-pro`, `gemini-2.5-flash`                      |\n| `groq`                | api.groq.com                                            | `GROQ_API_KEY`          | `llama-3.3-70b-versatile`, `mixtral-8x7b-32768`           |\n| `together`            | api.together.xyz                                        | `TOGETHER_API_KEY`      | `meta-llama\u002FLlama-3.3-70B-Instruct-Turbo`                 |\n| `mistral`             | api.mistral.ai                                          | `MISTRAL_API_KEY`       | `mistral-large-latest`, `codestral-latest`                |\n| `ollama`              | localhost:11434 — local models                          | *(none)*                | `llama3.3:70b`, `qwen2.5:32b`                             |\n| `openai_compatible`   | Anything else; set `[llm.openai] base_url` explicitly   | `OPENAI_API_KEY`        | depends                                                   |\n\nEach session uses exactly one provider; for multi-vendor routing in a single session, use `provider = \"openrouter\"` and let OpenRouter dispatch upstream per model:\n\n```toml\n[llm]\nprovider = \"openrouter\"\n[llm.openrouter]\nreferer = \"https:\u002F\u002Fyour-app.example.com\"   # optional, for catalog attribution\ntitle   = \"My Co-Scientist\"\n\n[models]\ngeneration         = \"anthropic\u002Fclaude-3.5-sonnet\"\nreflection         = \"openai\u002Fgpt-5\"\nranking_pairwise   = \"google\u002Fgemini-2.5-flash\"\nmetareview_final   = 
\"anthropic\u002Fclaude-opus-4-7\"\n```\n\nCost is estimated via `co_scientist\u002Fllm\u002Frouting.py`'s `PRICE_TABLE`; unknown models match a family-hint (flash \u002F mini \u002F opus \u002F sonnet \u002F gemini \u002F llama \u002F mistral) so brand-new previews price sensibly. Tighten `[run] budget_usd` if running on a new model you haven't sanity-checked.\n\n**Provider feature support:**\n\n| Feature              | anthropic | openai (o-series) | openai (gpt) | openai_compatible |\n| -------------------- | --------- | ----------------- | ------------ | ----------------- |\n| Tool \u002F function call | ✅        | ✅                | ✅           | depends on endpoint |\n| Extended reasoning   | ✅ (thinking) | ✅ (`reasoning_effort`) | ❌ (dropped) | endpoint-specific |\n| Prompt-cache breakpoints | ✅    | ❌                | ❌           | ❌                |\n| Batch API (50%-off ranking) | ✅ | ❌            | ❌           | ❌                |\n\n## Configuration\n\nLayered: [`config\u002Fdefault.toml`](config\u002Fdefault.toml) → `~\u002F.co-scientist\u002Fconfig.toml` → `.\u002Fco-scientist.toml` → `--config \u003Cpath>`. Secrets come from environment only (see [`.env.example`](.env.example)).\n\n## Bench: compare models head-to-head\n\n`co-scientist bench` runs the same goal under N different `(provider, model)` configurations and ranks them via a single shared Elo tournament. Each candidate independently generates hypotheses; then every candidate-pair plays `--matches` head-to-head debates, judged by ONE fixed judge model (picked separately so no candidate scores its own work).\n\n> **For live numbers** — per-candidate Elo, the actual hypotheses each model proposed, gold-set hits, and what the data showed — see [`docs\u002FBENCH_RESULTS.md`](docs\u002FBENCH_RESULTS.md). 
It includes a headline-findings section at the top so you don't have to scroll through every bench.\n\n### Presets\n\n| `--preset`               | What it does |\n| ---                      | --- |\n| `paper`                  | Co-Scientist paper baselines (Gemini 2 Flash Thinking, Gemini 2 Pro, OpenAI o1, Claude Haiku) via OpenRouter, head-to-head Elo only |\n| `paper-aml`              | Same candidates + the paper's AML drug-repurposing goal + gold-set recall scoring (defaults to the strict top-3 set: Nanvuranlat \u002F KIRA6 \u002F Leflunomide) |\n| `paper-aml-vs-raw`       | `paper-aml` but each model runs **both** in the full pipeline AND as a single raw LM call — isolates the multi-agent harness's value-add |\n| `frontier-aml-vs-raw`    | Same pipeline-vs-raw setup but with current frontier models (Claude Opus 4.7, GPT-5, Gemini 3 Pro \u002F Flash) |\n\n```bash\n# Reproduce the paper preference-ranking comparison:\nco-scientist bench --preset paper --budget-per-candidate 1.5\n\n# Score against the paper's AML drug picks:\nco-scientist bench --preset paper-aml --n 3 --matches 2\n\n# Compare multi-agent pipeline vs raw model call on the same goal\n# (--budget-per-candidate defaults to 3.0; frontier models need it):\nco-scientist bench --preset paper-aml-vs-raw --n 1\n\n# Current frontier models, pipeline vs raw:\nco-scientist bench --preset frontier-aml-vs-raw --n 1\n```\n\n### Pipeline vs raw LM (one model, isolated)\n\nThe `--preset *-vs-raw` presets pit each model's **full co-scientist Generation pipeline** (literature tools + tool loop + dedup + `record_hypothesis`) against a **single raw LM call** with the same model + a forced `record_hypothesis` function call (no tools). Lets you measure how much of the system's output quality comes from the multi-agent harness vs the underlying model. 
→ live numbers in [`docs\u002FBENCH_RESULTS.md`](docs\u002FBENCH_RESULTS.md#headline-findings).\n\n### Gold-set scoring (AML drug repurposing)\n\n`paper-aml*` presets score **recall** against a curated answer key from the Co-Scientist paper. Two gold sets ship; both stay registered so historical bench artifacts remain interpretable.\n\n| label                                                   | size | what it is |\n| ---                                                     | --- | --- |\n| `aml-repurposing-paper-top3` *(default for `paper-aml*`)* | 3 | Top-3 of the original paper's list: candidates with no prior published AML repurposing, no prior preclinical evidence in AML, and no external inputs (no DepMap scores, no expert curation). → **Nanvuranlat (JPH-203 \u002F KYT-0353), KIRA6, Leflunomide (Arava \u002F HWA-486 \u002F Teriflunomide \u002F Aubagio)** |\n| `aml-repurposing-paper-5`                               | 5 | Broader 5-drug list referenced in the paper's main text: **Binimetinib (MEK162), Pacritinib (SB1518 \u002F Vonjo), Cerivastatin (Baycol), Pravastatin (Pravachol), Dimethyl fumarate (DMF \u002F BG-12 \u002F Tecfidera)** |\n\nSwap with `--goldset`:\n\n```bash\nco-scientist bench --preset paper-aml --goldset aml-repurposing-paper-5   # broader list\nco-scientist bench --preset paper-aml --goldset none                       # head-to-head only\n```\n\nThe matcher is whole-token, case-insensitive, and looks at every searched field of every hypothesis (title \u002F summary \u002F full_text \u002F `entities` \u002F citation excerpts). Drug **class** mentions (e.g. \"DHODH inhibitor\") do **not** count — the candidate has to name the actual compound (or one of its registered aliases).\n\n### Custom candidates\n\n`label=provider:model[@mode]`. `mode` is `pipeline` (default) or `direct`. 
Pipeline goes through the full Generation agent stack; direct is a single forced-tool LM call with no literature tools.\n\n```bash\nco-scientist bench \"Identify hypotheses about X\" \\\n  -c flash3=openrouter:google\u002Fgemini-3-flash-preview \\\n  -c flash3-raw=openrouter:google\u002Fgemini-3-flash-preview@direct \\\n  -c gpt5=openai:gpt-5 \\\n  -c opus=anthropic:claude-opus-4.7 \\\n  --judge anthropic:claude-sonnet-4-6\n```\n\n### Where results live\n\nEvery bench writes to SQLite + JSON on disk:\n\n```\ndata\u002Fco_scientist.db                          ← SQLite, all metadata\n  bench_runs                                  one row per bench\n  bench_candidates                            one row per (bench × candidate × mode)\n  bench_matches                               one row per head-to-head\n\ndata\u002Fartifacts\u002F\u003Csession_id>\u002F                  ← JSON on disk\n  bench\u002F\u003Cbench_id>.json                       run summary + per-entity gold_hit_detail\n  hypotheses\u002F\u003Chyp_id>.json                    every hypothesis the bench produced\n  transcripts\u002Fgeneration\u002F\u003Ctrn_id>.json        every LLM call\n```\n\nThe auto-generated [`docs\u002FBENCH_RESULTS.md`](docs\u002FBENCH_RESULTS.md) (rebuild with `python scripts\u002Fbuild_bench_report.py`) walks every recorded bench and renders the per-candidate result table, every hypothesis attributed to the model that produced it, and a post-hoc rescore against every registered gold set.\n\n### Mechanics\n\n- **Generation runs in parallel** per candidate under a deep-copied Config (`cfg.llm.provider`, `cfg.models.*`, thinking budgets zeroed for non-Anthropic).\n- **Round-robin pairings**: every pair plays `--matches` head-to-heads (one random hypothesis from each side per match).\n- **Structured verdict** via a forced `record_verdict` function call — no fragile `better idea: \u003CN>` text parsing across providers.\n- Bench runs are **isolated from regular sessions** — they don't 
write to `tournament_matches` or affect any session's leaderboard.\n\n## Repository layout\n\n```\nco_scientist\u002F\n  agents\u002F       # supervisor + 6 specialized agents (base, generation, reflection,\n                # ranking, evolution, proximity, metareview)\n  bench\u002F        # cross-model bench runner (Elo tournament + gold-set scoring)\n  llm\u002F          # provider abstraction (anthropic\u002Fopenai\u002Fopenrouter\u002Fgemini\u002F...),\n                # tool loop, token budgets, model routing, retry, batch, estimator\n  storage\u002F      # SQLite schema + migrations, db connection, 10 repos\n  tools\u002F        # tool registry; web_fetch, web_search, pubmed\u002Farxiv\u002Feurope_pmc,\n                # science-skills bridge\n  vectors\u002F      # embeddings (Voyage\u002FOpenAI\u002Fhash-fallback) + FAISS IndexFlatIP\n  orchestrator\u002F # task scheduling, Elo updates, termination, event bus\n  safety\u002F       # injection quoting, classifier, citation verifier\n  obs\u002F          # metrics (tokens, cost, cache hit ratio, latency)\n  web\u002F          # FastAPI + htmx + SSE UI + sanitized markdown renderer\n  evals\u002F        # per-agent + e2e + regression evals\n  tests\u002F        # 213 unit tests + fixtures + smoke\nconfig\u002F\n  default.toml\n  prompts\u002F      # 14 Jinja2 templates (one per agent.mode), derived from\n                # the paper's supplementary prompts\ndocs\u002F\n  BENCH_RESULTS.md   # every bench ever run (auto-generated)\nscripts\u002F\n  build_bench_report.py\nreference\u002F      # paper source materials (pseudocode, prompts, diagrams)\ndata\u002F           # gitignored; runtime artifacts (SQLite, FAISS, transcripts)\nvendor\u002F         # gitignored; pinned clone of google-deepmind\u002Fscience-skills\n```\n\n## 
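License\n\nApache-2.0.\n","Co-Scientist is an open source project that aims to reproduce Google DeepMind's AI co-scientist. Implemented in Python, it uses a multi-agent system to turn a natural-language research goal into a research overview of novel hypotheses. Its core components are the Generation, Reflection, Ranking, Evolution, Proximity, and Meta-review agents, each responsible for a specific task such as proposing hypotheses, checking their novelty and correctness, or running simulated debates to rank them. It suits settings where the scientific research process needs to be accelerated, particularly when exploring new theories or experimental designs. It also supports pluggable large language model provider SDKs and persists its task queue in SQLite for stable operation.","2026-06-11 04:07:35","CREATED_QUERY"]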