[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-82241":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":13,"openIssues":14,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":15,"stars7d":16,"stars30d":16,"stars90d":15,"forks30d":15,"starsTrendScore":15,"compositeScore":17,"rankGlobal":10,"rankLanguage":10,"license":18,"archived":19,"fork":19,"defaultBranch":20,"hasWiki":21,"hasPages":21,"topics":22,"createdAt":10,"pushedAt":10,"updatedAt":34,"readmeContent":35,"aiSummary":36,"trendingCount":15,"starSnapshotCount":15,"syncStatus":16,"lastSyncTime":37,"discoverSource":38},82241,"ken","townsendmerino\u002Fken","townsendmerino","Fast hybrid code search for agents. Pure Go, drop-in MCP-compatible with semble.","",null,"Go",23,1,3,0,2,0.9,"MIT License",false,"main",true,[23,24,25,26,27,28,29,30,31,32,33],"agents","bm25","code-search","embeddings","go","golang","mcp","mcp-server","model2vec","retrieval","semble","2026-06-12 02:04:24","# ken\n\n**Fast hybrid code search for agents.** Pure Go, single static binary, drop-in MCP-compatible with [MinishLab\u002Fsemble](https:\u002F\u002Fgithub.com\u002FMinishLab\u002Fsemble) — same tool schemas, same output format, same install steps swapped to a Go binary.\n\n*Built collaboratively: most of the Go implementation written by Claude, with constraints, architectural decisions, and review discipline from [@townsendmerino](https:\u002F\u002Fgithub.com\u002Ftownsendmerino). The verbatim-port rule and the corpus-scale parity harness — the things that make this a faithful port instead of an approximate one — came from the human side. 
See [How this was built](#how-this-was-built).*\n\n[![CI](https:\u002F\u002Fgithub.com\u002Ftownsendmerino\u002Fken\u002Factions\u002Fworkflows\u002Fci.yml\u002Fbadge.svg)](https:\u002F\u002Fgithub.com\u002Ftownsendmerino\u002Fken\u002Factions\u002Fworkflows\u002Fci.yml)\n[![License: MIT](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-MIT-yellow.svg)](LICENSE)\n[![Go Reference](https:\u002F\u002Fpkg.go.dev\u002Fbadge\u002Fgithub.com\u002Ftownsendmerino\u002Fken.svg)](https:\u002F\u002Fpkg.go.dev\u002Fgithub.com\u002Ftownsendmerino\u002Fken)\n![Go 1.26+](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fgo-1.26%2B-blue)\n\nken is a Go port of semble. The retrieval algorithm is ported verbatim from semble's `search.py` + `ranking\u002F*.py`; ken adds two things on top: **runtime properties** (single-binary distribution, no Python interpreter import on cold start, no GIL on the indexing pipeline) and **measured agent-input efficiency** (~44× fewer tokens than grep+Read at recall@10 on semble's diverse-query benchmark; at corpus scale — CoIR-CSN-Python's 280K files — corpus-wide grep is functionally impossible and ken's 1,296-token result is the only workable path). The honest tradeoff: ken's recall caps at 82–91% vs grep's ~99%, so exhaustive enumeration (refactors, pre-rename audits) still belongs to grep — but for \"find the chunk that answers this,\" ken wins by 1–2 orders of magnitude on tokens. Full table in [`docs\u002FBENCH.md`](docs\u002FBENCH.md#token-budget-recall--agent-side-efficiency). If you already use semble in your agent, you can swap to ken-mcp without re-prompting; the wire format is the same string semble emits.\n\n## Embedded-corpus build pattern (v0.6.0)\n\nThe library form of ken-mcp lets SDK authors ship docs as a **single static MCP server binary**. Write ~20 lines of `main.go`, `\u002F\u002Fgo:embed` your `docs\u002F` and the Model2Vec model, `go build` — push a binary to a GitHub release. 
Users `brew install`, add one line to their agent config, and their coding agent has high-quality local retrieval over your SDK's docs. No backend, no vector DB to operate, no network egress per query, no \"is the cache stale\" question — the binary IS the corpus, version-pinned by build artifact.\n\n```go\npackage main\n\nimport (\n    \"context\"\n    \"embed\"\n    \"io\u002Ffs\"\n    \"log\"\n    \"os\"\n\n    \"github.com\u002Ftownsendmerino\u002Fken\u002Fmcp\"\n\n    _ \"github.com\u002Ftownsendmerino\u002Fken\u002Fchunk\u002Fmarkdown\"\n)\n\n\u002F\u002Fgo:embed docs\u002F*.md\nvar docsFS embed.FS\n\n\u002F\u002Fgo:embed model\u002Ftokenizer.json model\u002Fconfig.json model\u002Fmodel.safetensors\nvar modelFS embed.FS\n\nfunc main() {\n    docsSub, _  := fs.Sub(docsFS, \"docs\")\n    modelSub, _ := fs.Sub(modelFS, \"model\")\n    if err := mcp.Run(context.Background(), docsSub, mcp.Options{\n        Mode:        \"hybrid\",\n        ChunkerName: \"markdown\",\n        ModelFS:     modelSub,\n        LogWriter:   os.Stderr,\n    }); err != nil {\n        log.Fatal(err)\n    }\n}\n```\n\n[`cmd\u002Fken-mcp-docs\u002F`](cmd\u002Fken-mcp-docs\u002F) is the canonical worked example — it bakes ken's own [`docs\u002F*.md`](docs\u002F) and the Model2Vec model into a 74 MB static binary built via [`scripts\u002Fbuild-docs-mcp.sh`](scripts\u002Fbuild-docs-mcp.sh). Design and rationale: [ADR-016](docs\u002FDECISIONS.md#adr-016-embedded-corpus-mcp-build-pattern-via-mcprun-library-function).\n\n### Pre-building the index for faster cold start (v0.8.3)\n\nEvery `mcp.Run` startup walks the embedded corpus, chunks every indexable file, and (for semantic \u002F hybrid mode) calls `model.Encode` on every chunk to produce the dense embedding matrix. 
For a small docs corpus the cost is modest; for a larger embedded corpus (~10K+ chunks) it can add seconds-to-minutes of CPU on each process launch.\n\nv0.8.3 ships **pre-built embedded indices** — serialize the index once at build time, ship the bytes inside your `\u002F\u002Fgo:embed` corpus, skip the walk + chunk + embed pass at runtime. Cold start drops from \"build the index\" to \"read pre-serialized bytes + verify header.\"\n\nTwo additions to your build script:\n\n```bash\n# Before `go build`, pre-build the index from your corpus.\nken build-index .\u002Fcorpus \\\n    -o .\u002Fcorpus\u002F.ken\u002Findex.bin \\\n    --mode hybrid \\\n    --chunker markdown \\\n    --model ~\u002F.ken\u002Fmodel\n\n# Then `go build` as usual — \u002F\u002Fgo:embed corpus picks up\n# .ken\u002Findex.bin alongside the rest of your files.\ngo build .\u002Fcmd\u002Fyour-docs-mcp\n```\n\nYour `main.go` is unchanged from the v0.6.0 baseline — `mcp.Run` auto-discovers `corpus\u002F.ken\u002Findex.bin` from the supplied `fs.FS` and loads it. SDK authors who use a non-conventional layout (index outside the corpus FS, in a sibling `embed.FS`, etc.) can set `mcp.Options.PrebuiltIndex []byte` explicitly.\n\n**Lazy fallback on any load failure** — corrupt bytes, format-version mismatch, mode \u002F chunker mismatch — produces a stderr warning + falls back to the v0.6.0 build-from-corpus path. The pre-built path is purely an optimization, never a requirement; a stale or corrupt pre-built file gets you a slower-but-still-working binary, not a crash. Re-run `ken build-index` to refresh.\n\nDesign and rationale: [ADR-024](docs\u002FDECISIONS.md#adr-024-pre-built-embedded-indices-for-mcprun-v083).\n\n### Live demos (v0.1.0)\n\nTwo downloadable `mcp.Run` binaries that use this pattern against real codebases, with full audit transcripts:\n\n- **`ken-demo-kubernetes`** — Kubernetes v1.31.0 source, `regex` chunker (Go AST-tracking). 
59,795 chunks, 216 MB binary, ≈ 3.9 s to ready, ~60 ms first query.\n- **`ken-demo-postgres`** — PostgreSQL 17.0 source, `treesitter` chunker (real C AST, 0% silent fallback verified). 64,506 chunks, 265 MB binary, ≈ 3.5 s to ready, ~30 ms first query.\n\nThe 4-second startup is \"loads a pre-built index,\" not \"instant\" — the writeup linked below has the honest measurement breakdown and the audit that caught two bugs in ken itself before publication.\n\n- Download: [`demos\u002Fv0.1.0` release](https:\u002F\u002Fgithub.com\u002Ftownsendmerino\u002Fken\u002Freleases\u002Ftag\u002Fdemos\u002Fv0.1.0) — `darwin\u002Farm64`, `darwin\u002Famd64`, `linux\u002Famd64`, `linux\u002Farm64`.\n- Audit transcripts: [`demos\u002Ftranscripts\u002F`](demos\u002Ftranscripts) (nine captured agent conversations) + [`demos\u002Ftranscript-audit-rubric.md`](demos\u002Ftranscript-audit-rubric.md).\n- Writeup: [*I shipped two downloadable code search binaries. The audit caught two bugs.*](https:\u002F\u002Ftownsendmerino.github.io\u002Fken\u002Fdemos-audit\u002F)\n\n### Why this is interesting\n\n- **Zero-infrastructure distribution.** No backend, no vector DB, no per-query cloud calls. The binary IS the corpus.\n- **Version-pinned by build artifact.** The corpus, the model, and the search algorithm all ship together. There is no \"stale index\" question — to update, rebuild and re-release.\n- **Agent sandboxing by construction (for `mcp.Run` embedded-corpus binaries).** The embedded-corpus build has no path-resolution code, so there is no path-traversal escape route. 
The corpus is structurally sealed; an agent cannot pivot from \"search the docs\" to \"read the host's secrets.\" Note this property is specific to `mcp.Run` — `cmd\u002Fken-mcp` against a live filesystem resolves the agent-supplied `repo` argument to a real path on disk and is a different threat model (see [SSRF guard env var below](#mcp-server) for the related defense).\n- **Air-gapped friendly.** All queries answered locally, no network egress. For enterprise \u002F restricted-egress environments this is the difference between \"we can use this\" and \"we can't.\"\n\nFor multi-repo code search with live file-watching, use [`cmd\u002Fken-mcp`](cmd\u002Fken-mcp\u002F) directly (below) — the two modes coexist by design.\n\n## Who's this for?\n\nken has several distinct use cases. Pick the entry point that matches yours; each bullet names the relevant binary or mode and links to the in-depth section where its workflow is documented.\n\n- **AI coding agent users** (Claude Code, Cursor, Codex, opencode, VS Code) — install `ken-mcp` as an MCP server in your agent client. Same `search` \u002F `find_related` tool schemas and wire format as semble, so existing semble configurations work with the `command:` path swapped to `ken-mcp`. See [MCP server](#mcp-server) and [Comparison to semble](#comparison-to-semble).\n\n- **SDK \u002F docs authors shipping a single static binary** — use `mcp.Run` to bake your `\u002F\u002Fgo:embed` corpus + the Model2Vec model into one executable that serves MCP search. No backend, no per-platform asset bundles, version-pinned by build artifact. See [Embedded-corpus build pattern](#embedded-corpus-build-pattern-v060) and [Pre-building the index for faster cold start](#pre-building-the-index-for-faster-cold-start-v083).\n\n- **Backend developers with code + database schemas** — point `ken-mcp` at your repo alongside a local Postgres \u002F SQLite \u002F MySQL \u002F MariaDB dev DB. 
Code chunks, `.sql` file chunks, and live schema chunks compete in one ranked retrieval. See [Indexing database schemas](#indexing-database-schemas-v070-expanded-in-v071) and [Tier 2 — Live Postgres introspection](#tier-2--live-postgres-introspection-ken_db_dsn).\n\n- **CLI-first code search** — `ken index \u003Cpath>` + `ken search \u003Cpath> \u003Cquery>` for fast local exploration of an unfamiliar repository, with `--json` output for piping into other tools. Pure Go, single static binary cross-compiled to any `GOOS` \u002F `GOARCH`. See [Quickstart](#quickstart).\n\n- **Air-gapped or restricted-egress environments** — embedding inference, BM25 scoring, and fusion all run locally on the CPU. No network calls for search; no external vector DB; no API keys. See [Why this is interesting](#why-this-is-interesting).\n\n## What ken indexes well\n\nken's hybrid BM25 + Model2Vec retrieval is calibrated for two content types:\n\n- **Source code** — Python, Go, TypeScript, Java, Rust have language-aware chunking via the regex chunker (default) or the optional tree-sitter chunker. Other languages fall back to the line chunker.\n- **Documentation** — markdown files (`.md`, `.mdx`, `.markdown`) chunk on heading boundaries, keep code blocks \u002F tables \u002F lists atomic, and handle YAML\u002FTOML frontmatter. Mixed corpora (code + docs in one repo) work out of the box — each file routes to the right chunker by extension.\n\nFor plain-text corpora with no code or structured documentation (novels, journals, raw transcripts), the BM25 side works fine in `--mode=bm25`, but the semantic model is code-trained — semantic ranking quality on pure literary prose is unvalidated. If that's your use case, expect BM25 mode to do most of the heavy lifting.\n\n## Indexing database schemas (v0.7.0, expanded in v0.7.1)\n\nAgents working on a real codebase need schema context **alongside** the code. ken v0.7.0 indexes both. 
An agent answering \"how do users get authenticated\" gets the Go function doing auth, the SQL it executes, the `users` table definition, AND the FK relationships from `sessions.user_id` — all in one ranked result list. Design rationale in [ADR-017](docs\u002FDECISIONS.md#adr-017-database-schema-indexing--two-tier-static-sql--live-postgres-with-documented-pii-stance).\n\n### Tier 1 — Static `.sql` parsing (automatic)\n\nWhen ken's walker sees `.sql` files in the corpus, it parses each `CREATE TABLE` \u002F `INDEX` \u002F `VIEW` \u002F `ALTER TABLE` and emits one structural chunk per object alongside the raw line-chunked file. No env var, no opt-in. Migration files in `migrations\u002F` are now first-class retrieval units:\n\n```\n-- file: migrations\u002F001_users.sql\nTABLE users\n  id          BIGSERIAL PRIMARY KEY\n  email       VARCHAR(255) NOT NULL UNIQUE\n  role        VARCHAR(32)  NOT NULL DEFAULT 'guest'\n  created_at  TIMESTAMP    NOT NULL DEFAULT NOW()\n\n  INDEX users_email_idx ON (email)\n```\n\nAs of **v0.7.1** (ADR-018), when ken's walker detects a directory containing numbered `.sql` migration files — Goose \u002F dbmate \u002F Rails-4 (`\\d+_*.sql`), Flyway (`V\\d+__*.sql`), or Rails-5 \u002F Alembic (`\\d{14}_*.sql`) — it **folds CREATE TABLE + later ALTER TABLE statements into a single \"current state\" chunk per table**. An agent asking \"what columns does `users` have?\" gets one denormalized chunk reflecting the final schema, not N+1 chunks (CREATE + every ALTER) it has to mentally replay.\n\n```\n-- file: migrations\u002F0001_init.sql\n-- folded from migrations\nTABLE users\n  id          BIGSERIAL PRIMARY KEY\n  email       VARCHAR(255) NOT NULL UNIQUE\n  status      VARCHAR(16) NOT NULL DEFAULT 'active'   -- added by 0002_add_status.sql\n```\n\nFolding covers `ADD COLUMN`, `DROP COLUMN`, `ALTER COLUMN ... 
TYPE`, `ADD CONSTRAINT`, `DROP CONSTRAINT`, and — as of **v0.8.1 Part C** ([ADR-022](docs\u002FDECISIONS.md#adr-022-rename-column--rename-constraint-folding-via-eager-application-v081-part-c), closes [#14](https:\u002F\u002Fgithub.com\u002Ftownsendmerino\u002Fken\u002Fissues\u002F14)) — **`RENAME COLUMN` and `RENAME CONSTRAINT`**. RENAME is applied eagerly during replay: a `RENAME COLUMN old TO new` mutates the in-flight folded table so subsequent ALTERs see the post-rename state, and this-table column references inside constraint definitions (PK \u002F UNIQUE \u002F FK source-side \u002F CHECK) get rewritten via a word-boundary regex scoped to the first parenthesized group. Cross-table FK target-side column references (the `REFERENCES other(remote)` portion) are NOT propagated — that's cross-table dependency requiring migration-DAG analysis and remains out of scope. Operators using MySQL's `CHANGE old new TYPE` syntax (rename + retype in one statement) see the BOTH-chunks fallback below.\n\n**RENAME folding is a Tier-1 chunk-content fidelity improvement, NOT a search-ranking improvement.** ken's hybrid retrieval recall@10 numbers ([`docs\u002FBENCH.md`](docs\u002FBENCH.md)) measure a different system — they're about whether the right chunk surfaces in the top-10 results, not about whether the chunk contains post-rename column names. v0.8.1 Part C closes the latter gap without affecting the former.\n\nWhen an ALTER can't be folded cleanly (unknown column, type conflict, missing CREATE TABLE, RENAME of a column that doesn't exist, anonymous constraint with no name to match), ken emits **both** the original per-file ALTER chunk **and** the folded chunk for what could be resolved — the agent never sees less information than v0.7.0.\n\nSet `KEN_SQL_NO_AUTO_MIGRATIONS=1` to restore the v0.7.0 per-file behavior. 
Useful for operators who maintain a canonical `schema\u002Fcurrent.sql` and don't want migration history surfaced separately.\n\n### Tier 2 — Live Postgres introspection (`KEN_DB_DSN`)\n\nWhen `KEN_DB_DSN` is set, ken connects to the live database, introspects via `information_schema` \u002F `pg_catalog`, and emits one chunk per table \u002F view \u002F index \u002F function. Every chunk carries a freshness header naming the engine + host (never credentials):\n\n```\n-- indexed at 2026-08-15T14:23Z from postgres@dev-pg.local\nTABLE users\n  id          bigint        PK\n  email       varchar(255)  NOT NULL UNIQUE\n  role        varchar(32)   NOT NULL DEFAULT 'guest'\n  created_at  timestamp     NOT NULL DEFAULT now()\n\n  INDEX users_email_idx ON (email)\n  FK referenced by: sessions(user_id), audit_log(actor_id)\n```\n\n`KEN_DB_DSN` must be the URL form (`postgres:\u002F\u002Fuser:pass@host:port\u002Fdb?sslmode=...`). Tier 2 requires `KEN_MCP_DEFAULT_REPO` to be set to a local path — DB chunks attach to that repo's index. Multi-repo searches (no default) get FS-only.\n\n### PII stance: documentation + sane defaults\n\n> **This is intended for development databases. Do not point this at production data — sample rows are sent to your LLM provider as part of search results.** ken does NOT ship column-exclusion DSLs, redaction modes, or row-synthesis controls. 
If you can't trust the database with the LLM, don't connect ken to it.\n\nThe defaults are conservative:\n- **Schema-only is the default.** `KEN_DB_SAMPLE_ROWS=0` (unset) means no row data is read.\n- **The opt-in is unambiguous.** `KEN_DB_SAMPLE_ROWS=N` reads as \"rows you're choosing to expose.\"\n- **Every chunk carries provenance.** The freshness header surfaces `postgres@dev-pg.local` in agent output, so reviewers see where the data came from.\n\n### Row sampling (opt-in)\n\n`KEN_DB_SAMPLE_ROWS=3` appends 3 deterministically-ordered rows per table to that table's chunk:\n\n```\n  Sample rows (3 of ~12,847):\n    (1,   alice@example.com, admin,  2024-01-15)\n    (47,  bob@example.com,   member, 2024-03-22)\n    (203, claire@example.com,guest,  2025-11-08)\n```\n\nRows are ordered by the first PK column (fallback `ORDER BY 1` for tables without PK) so successive reindexes produce identical content. Long cells truncate at 80 chars with `…`.\n\n### Three reindex layers\n\n1. **Build-once-at-startup (default).** When `ken-mcp` starts, it introspects once and never refreshes. Restart to pick up schema changes. No background goroutines, no polling cost.\n2. **Periodic.** `KEN_DB_REINDEX_INTERVAL=5m` enables a background ticker that re-introspects on the configured cadence (Go duration string: `5m`, `1h`, etc.). Failures log a warn and skip that tick; agents tolerate stale schema better than no schema.\n3. **Manual via SIGHUP.** Standard Unix convention. Useful with migrate-up workflows:\n\n   ```makefile\n   migrate-up:\n       psql -f migrations\u002F$(NEXT).sql\n       kill -HUP $$(pgrep ken-mcp)\n   ```\n\n   The Refresher's mutex serializes concurrent triggers, so SIGHUP is safe to spam. 
No-op on Windows.\n\n### Failure modes (all non-fatal)\n\n- DSN unset → silent no-op (FS-only).\n- DSN invalid → stderr warn, FS-only.\n- `KEN_MCP_DEFAULT_REPO` unset or http(s) URL → stderr warn, Tier 2 stays off.\n- Initial connect \u002F introspection fails → stderr warn, Tier 2 stays off but FS-only server keeps running.\n- Periodic tick fails → stderr warn, skip tick, retry next interval.\n- SIGHUP refresh fails → stderr warn, previous chunks remain in the snapshot.\n\nTier 2 going dark never crashes `ken-mcp`. Restart picks up a recovered DSN.\n\n### Engine scope\n\nAs of **v0.7.2**, Tier 2 supports **Postgres + SQLite + MySQL**. Engine routing inside `internal\u002Fdb.IndexSchema` dispatches on the DSN scheme:\n\n| Scheme | Driver | Typical use |\n|---|---|---|\n| `postgres:\u002F\u002F` \u002F `postgresql:\u002F\u002F` | `github.com\u002Fjackc\u002Fpgx\u002Fv5` (pure Go) | server-backed dev DBs |\n| `sqlite:\u002F\u002F` \u002F `sqlite3:\u002F\u002F` | `modernc.org\u002Fsqlite` (pure Go, transpiled from C, no cgo) | Rails \u002F Django \u002F Phoenix \u002F Laravel \u002F FastAPI \u002F embedded apps |\n| `mysql:\u002F\u002F` or `user:pass@tcp(host:port)\u002Fdb` | `github.com\u002Fgo-sql-driver\u002Fmysql` (pure Go) | MySQL 5.7+ \u002F MySQL 8.x \u002F **MariaDB 10.x+** (first-class as of v0.8.1) — Rails, Django, Laravel, .NET, LAMP |\n\nSQLite DSN examples:\n- `sqlite:\u002F\u002F\u002Fvar\u002Fdata\u002Fdev.db` — absolute path (note the triple slash: scheme + empty host + absolute path).\n- `sqlite:\u002F\u002F.\u002Fdev.db` — relative path, resolved against `KEN_MCP_DEFAULT_REPO`. 
Convenient when the SQLite file lives inside the repo (overwhelmingly common).\n\nMySQL DSN examples:\n- `mysql:\u002F\u002Falice:s3cret@db.local:3306\u002Fmydb?parseTime=true` — URL form (canonical, matches the Postgres pattern).\n- `alice:s3cret@tcp(db.local:3306)\u002Fmydb?parseTime=true` — native go-sql-driver form, accepted directly because that's what most .env files in the wild already contain.\n- `alice@unix(\u002Fvar\u002Frun\u002Fmysqld\u002Fmysqld.sock)\u002Fmydb` — Unix-socket form.\n\n`parseTime=true` is forced on internally if absent — without it, DATE\u002FDATETIME\u002FTIMESTAMP columns deserialize as `[]byte` and don't render cleanly in row samples.\n\n**MariaDB is first-class as of v0.8.1** ([ADR-021](docs\u002FDECISIONS.md#adr-021-mariadb-first-class-engine-support-v081-part-b)) — same `KEN_DB_DSN` env var, same MySQL DSN forms above, same `go-sql-driver\u002Fmysql` driver. CI's `test-db-integration` job now runs the integration suite against both `mysql:8` and `mariadb:11-jammy` service containers; the v0.8.1 normalization layer strips MariaDB's legacy `bigint(20)` \u002F `int(11)` integer display widths so chunks stay byte-identical across engines. End users see no operator-visible difference — point ken at MariaDB the same way you point it at MySQL.\n\n`KEN_DB_MARIADB_TEST_DSN` is a **CI \u002F development-only** env var that the integration test suite uses to run the same tests against a live MariaDB container in parallel with `KEN_DB_MYSQL_TEST_DSN`. End users do not need to set it; both engines share `KEN_DB_DSN` for production use.\n\nThe freshness header omits credentials and shows the engine label only (`postgres@dev-pg.local`, `mysql@db.local`, `sqlite@dev.db`); ports are surfaced only when non-default. SQLite uses the file basename so chunks don't leak local filesystem layout. 
The same row-sampling \u002F periodic-refresh \u002F SIGHUP machinery works for all three engines without configuration changes.\n\n### Filtering indexed schemas\n\nProduction-cloned dev DBs accumulate noise — audit \u002F cron \u002F queue \u002F per-tenant schemas the agent shouldn't suggest using. As of **v0.7.2** two env vars filter the schema set Tier 2 indexes:\n\n- `KEN_DB_SCHEMAS` — comma-separated allow-list. Only these schemas are indexed (intersected with the engine's default exclusions, which always apply). Example: `KEN_DB_SCHEMAS=public,billing`.\n- `KEN_DB_EXCLUDE_SCHEMAS` — comma-separated deny-list. Extends (does not replace) the default exclusions. Example: `KEN_DB_EXCLUDE_SCHEMAS=audit,cron,legacy`.\n\nResolution rules:\n- Neither set → default behavior (everything except engine system schemas). v0.7.0 \u002F v0.7.1 byte-identical.\n- Only `KEN_DB_SCHEMAS` → index exactly those schemas (system schemas still filtered).\n- Only `KEN_DB_EXCLUDE_SCHEMAS` → index everything not in the deny-list and not in system schemas.\n- Both set → stderr warn, allow-list wins, deny-list ignored.\n\nDefault exclusions are **never user-overridable**: `pg_catalog` and `information_schema` for Postgres; `information_schema`, `mysql`, `performance_schema`, `sys` for MySQL. Operators who genuinely need to index those should not point ken at the DB.\n\nSQLite is a single-schema engine and ignores both env vars (debug-level log when they're set with a SQLite DSN).\n\nWildcards (e.g. `KEN_DB_SCHEMAS=tenant_*`) are explicitly out of scope for v0.7.2 — multi-tenant operators can fall back to the explicit form `KEN_DB_SCHEMAS=tenant_001,tenant_002,...` until field signal calls for wildcard syntax. 
See [ADR-019](docs\u002FDECISIONS.md#adr-019-mysql-engine--schema-filtering-for-multi-schema-dev-databases) for the rejected-alternatives audit trail.\n\n### LISTEN\u002FNOTIFY push notifications (v0.8.0, Postgres only)\n\nThe v0.7.x reindex layers (startup + `KEN_DB_REINDEX_INTERVAL` + SIGHUP) are pull-based — operators waiting for the next interval tick see stale schemas in the meantime. As of **v0.8.0**, Postgres deployments can opt into push-based change detection: schema changes propagate to ken's index within ~100ms instead of waiting for the next tick.\n\n**One-time setup.** ken does NOT modify your database without explicit consent. Run the embedded SQL script once:\n\n```bash\nken-mcp print-listen-script | psql $KEN_DB_DSN\n```\n\nThis installs a single schema-level event trigger (`ken_schema_changed_trigger`) that fires on tracked DDL (`CREATE \u002F ALTER \u002F DROP TABLE`, `INDEX`, `VIEW`, `MATERIALIZED VIEW`, `FUNCTION`, `TRIGGER`, `TYPE`) and emits a `pg_notify('ken_schema_changed', ...)`. The script is idempotent (`DROP IF EXISTS` + `CREATE`); re-running is safe.\n\n**Activate.** Set `KEN_DB_LISTEN=1` (or `true` \u002F `yes`) on the ken-mcp process. 
The listener uses a dedicated pgx connection separate from the introspection pool, so a long `WaitForNotification` call doesn't tie up the connection introspection needs.\n\n**Failure modes** (all non-fatal; interval polling continues regardless):\n- Event trigger not installed → clear warn naming the fix command (`ken-mcp print-listen-script | psql $KEN_DB_DSN`); listener idles until the next reconnect re-checks.\n- Connection drops (network partition, server restart) → exponential-backoff reconnect (100ms → 30s cap), reset on each successful reconnect.\n- Non-Postgres DSN → debug log + no-op (MySQL and SQLite have no equivalent push mechanism).\n\n**Recommendation: use alongside `KEN_DB_REINDEX_INTERVAL`, not instead.** NOTIFY connections can drop silently (network partition, brief reconnect window); interval polling acts as a defense-in-depth backstop that catches missed notifications without operator intervention. Both can run simultaneously; the `Refresher`'s internal mutex serializes concurrent refreshes so a NOTIFY arriving mid-tick collapses cleanly.\n\nSee [ADR-020](docs\u002FDECISIONS.md#adr-020-listennotify-push-based-schema-change-detection-v080-part-1) for the alternatives considered (auto-install rejected; per-table opt-in rejected; replace-interval rejected; no-debouncing rejected; faked-MySQL rejected).\n\n### Agent-triggered reindex (`reindex_db` tool, v0.8.0 Part 2)\n\nAgents can refresh ken's view of the database schema on demand by calling the `reindex_db` MCP tool. Useful after the agent has run a migration, or before asking a schema-dependent question:\n\n> **User:** \"I just ran migration 042 that adds the `email_verified` column to `users`. Does the schema reflect that?\"\n> **Agent:** *(calls `reindex_db`)* → *(calls `search \"users table\"`)* → returns the post-migration schema.\n\n**Always available** when `KEN_DB_DSN` is set; no env var to enable. 
When no DB is configured, the tool is not registered at all (the agent's `tools\u002Flist` shows only `search` and `find_related`).\n\n**Engine-agnostic.** Works for Postgres, MySQL, and SQLite — the tool is a thin wrapper around the same `Refresher` that drives `KEN_DB_REINDEX_INTERVAL`, SIGHUP, and (Postgres) LISTEN\u002FNOTIFY.\n\n**Fail-fast on contention.** If a reindex is already in flight (LISTEN burst, interval tick, SIGHUP, or prior `reindex_db` call), the tool returns `Reindex already in progress; nothing to do.` immediately rather than queuing. The agent can retry, back off, or proceed with stale data based on its workflow — no silent queueing, no unbounded memory growth, no time-based cooldown env vars.\n\n**Pairs with LISTEN\u002FNOTIFY and `KEN_DB_REINDEX_INTERVAL`.** Push notifications cover Postgres deployments with the trigger installed; interval polling covers MySQL \u002F SQLite and the LISTEN-not-set-up Postgres case; `reindex_db` covers the case where the agent itself caused the schema change and knows it needs to refresh.\n\nSee [ADR-020 Part 2](docs\u002FDECISIONS.md#part-2-agent-callable-reindex-via-reindex_db-mcp-tool-v080-part-2) for the alternatives considered (cooldown, queue, async-return, env-var-disable, auto-call-from-search all rejected with mechanism-level failure modes).\n\n### Embedded DB support for SDK authors (v0.8.0 Part 3, opt-in)\n\nSDK authors using [`mcp.Run`](#using-ken-in-your-own-mcp-server-mcprun) (the v0.6.0 embedded-corpus entrypoint) can wire Tier 2 DB support — schema introspection, optional LISTEN\u002FNOTIFY, optional interval reindex, and the `reindex_db` MCP tool — via the new opt-in `mcp\u002Fdb` package:\n\n```go\npackage main\n\nimport (\n    \"context\"\n    \"log\"\n    \"os\"\n    \"time\"\n\n    \"github.com\u002Ftownsendmerino\u002Fken\u002Fmcp\"\n    mcpdb \"github.com\u002Ftownsendmerino\u002Fken\u002Fmcp\u002Fdb\"\n)\n\nfunc main() {\n    ctx := context.Background()\n\n    \u002F\u002F Opt-in: only 
SDK authors who want DB support import mcp\u002Fdb.\n    refresher, err := mcpdb.Setup(ctx, mcpdb.Config{\n        DSN:             os.Getenv(\"MY_DB_DSN\"),\n        SampleRows:      0,\n        ReindexInterval: 5 * time.Minute,\n        EnableListen:    true, \u002F\u002F requires one-time `mcpdb.ListenNotifyScript | psql $DSN` setup\n    })\n    if err != nil {\n        log.Fatal(err)\n    }\n\n    \u002F\u002F refresher is nil when MY_DB_DSN is unset → opts.DB stays nil →\n    \u002F\u002F reindex_db tool NOT registered (the v0.6.0 docs-only behavior).\n    \u002F\u002F When non-nil, mcp.Run calls refresher.Start internally and\n    \u002F\u002F defers the returned cleanup.\n    if err := mcp.Run(ctx, myEmbeddedDocsCorpus, mcp.Options{\n        Mode:        \"hybrid\",\n        ChunkerName: \"markdown\",\n        DB:          refresher, \u002F\u002F *mcpdb.Refresher satisfies mcp.DBIntegration\n    }); err != nil {\n        log.Fatal(err)\n    }\n}\n```\n\n**v0.6.0 binary-size contract preserved.** SDK authors who DON'T import `mcp\u002Fdb` get a binary identical in dep-tree shape to v0.7.2's `mcp.Run` use case — no pgx, no SQLite, no MySQL driver, no `internal\u002Fdb` in the link graph. 
The opt-in package boundary is enforced at CI time by `TestBinary_MCPPackageStaysDBFree`, which shells out to `go list -deps github.com\u002Ftownsendmerino\u002Fken\u002Fmcp` and fails if any DB driver path appears.\n\n**SDK authors who want `print-listen-script`** in their own CLI can grab the embedded SQL script from `mcpdb.ListenNotifyScript` (a re-export of `internal\u002Fdb.ListenNotifyScript`) without depending on the `internal\u002F` package:\n\n```go\nif len(os.Args) > 1 && os.Args[1] == \"print-listen-script\" {\n    _, _ = io.WriteString(os.Stdout, mcpdb.ListenNotifyScript)\n    return\n}\n```\n\n**Chunk integration is end-to-end.** Calling `reindex_db` from an agent against an `mcp.Run + mcp\u002Fdb.Setup` binary runs the introspection AND makes the new DB chunks searchable in the agent's next `search` \u002F `find_related` call. The pipeline: `mcp.Run` wraps the embedded `*search.Index` in `atomic.Pointer[search.Index]`; `mcp\u002Fdb.Refresher.Start` (called by `mcp.Run` on startup) wires the swap callback to `*search.Index.WithExtraChunks` + atomic-pointer store; each refresh rebuilds against the original corpus + the latest DB chunks. `cmd\u002Fken-mcp` continues to use `*WatchedIndex.SetExtraChunks` for its fsnotify-rooted path; the SDK-author + CLI surfaces converge on the same `Refresher` + `reindex_db` semantics. 
See [ADR-020 Part 3](docs\u002FDECISIONS.md#part-3-opt-in-mcpdb-package-preserving-v060-binary-size-contract-v080-part-3) for the full design + the rejected alternatives.\n\n## Quickstart\n\n```bash\n# Install both binaries (Go 1.26+).\ngo install github.com\u002Ftownsendmerino\u002Fken\u002Fcmd\u002Fken@latest\ngo install github.com\u002Ftownsendmerino\u002Fken\u002Fcmd\u002Fken-mcp@latest\n\n# Download the default Model2Vec model (~64 MB, one-time).\n# Pure Go, no Python tooling required.\nken download-model\n\n# Search any local repo from the CLI.\nken search \u002Fpath\u002Fto\u002Fmyrepo \"save model to disk\" --model ~\u002F.ken\u002Fmodel\n```\n\nOr skip the model download and use lexical-only mode:\n\n```bash\nken search \u002Fpath\u002Fto\u002Fmyrepo \"validateToken\" --mode bm25\n```\n\nLibrary use (sketch):\n\n```go\nimport \"github.com\u002Ftownsendmerino\u002Fken\u002Finternal\u002Fsearch\"\n\nix, _ := search.FromPath(\"\u002Fpath\u002Fto\u002Fmyrepo\", search.ModeHybrid, \"regex\", \"\u002Fpath\u002Fto\u002Fmodel\")\nfor _, r := range ix.Search(\"save model to disk\", 10) {\n    fmt.Printf(\"%.3f  %s:%d-%d\\n\", r.Score, r.Chunk.File, r.Chunk.StartLine, r.Chunk.EndLine)\n}\n```\n\nPre-built binaries for macOS and Linux are attached to each [release](https:\u002F\u002Fgithub.com\u002Ftownsendmerino\u002Fken\u002Freleases).\n\nAs of v0.3, `ken index \u003Cpath>` defaults to **watch mode** — it keeps the process alive and re-indexes files on change (2 s debounce); pass `--no-watch` for the v0.2 build-once-and-exit behavior. `ken-mcp` watches always — an agent editing the repo mid-session sees its own changes without a restart.\n\nAs of v0.5.0, ken respects **nested `.gitignore` files** (per-directory), matching git's behavior: a `.gitignore` inside a subdirectory applies to paths under it, with outer scopes evaluated first and inner scopes last (last match wins). 
Monorepos with per-package `node_modules\u002F` exclusions in subdirectory `.gitignore` files are correctly pruned without a root-level entry.\n\nThe default `regex` chunker handles most cases well. If you index a lot of Kotlin \u002F Zig \u002F TypeScript \u002F Java \u002F PHP, the opt-in `treesitter` chunker (`--chunker=treesitter` \u002F `KEN_MCP_CHUNKER=treesitter`) measurably wins for those languages — see [\"Choosing a chunker\"](#choosing-a-chunker) for the per-language recommendation.\n\n## Features\n\n- **Pure Go, no cgo.** Single static binary; `GOOS`\u002F`GOARCH` cross-compiles for free; no `libtokenizers.a` to vendor per platform.\n- **Drop-in MCP-compatible with semble.** Same `search` \u002F `find_related` tool schemas, same markdown-string output format, install snippets adapted from semble's README.\n- **Algorithm verbatim from semble.** BM25 + Model2Vec semantic + α-weighted RRF fusion + code-aware rerank (definition \u002F embedded-symbol \u002F file-coherence \u002F stem-match boosts) + path penalties + file-saturation decay. See [docs\u002FDESIGN.md §7](docs\u002FDESIGN.md#7-hybrid-retrieval--rerank).\n- **Measured agent-input efficiency.** ~44× fewer tokens than grep+Read at recall@10 on semble NL queries (4,269 vs 189,591 tok); ~16× on symbol queries; at 280K-file corpus scale, grep+Read is functionally impossible and ken is the only workable path. Full breakdown + caveats in [`docs\u002FBENCH.md`](docs\u002FBENCH.md#token-budget-recall--agent-side-efficiency).\n- **Tokenizer parity proven against `transformers.AutoTokenizer`** on an 11k-input adversarial+repo corpus (`scripts\u002Fparity_dump.py` + `internal\u002Fembed\u002Fparity_test.go`).\n- **Fast cold start.** No Python interpreter import (`ken search` from a tiny index returns in ~10–20 ms on a Mac).\n- **Concurrent indexing scaled to cores.** No GIL.\n- **CPU-only.** No API keys, no GPU, no external services.\n\n## MCP server\n\n`ken-mcp` speaks JSON-RPC over stdio. 
Configure your agent to invoke it; it serves the same two tools (`search`, `find_related`) semble does, with the same arg shapes and the same markdown-string output.\n\n### Install in your agent\n\n```bash\n# Claude Code\nclaude mcp add ken -s user -- \u002Fabsolute\u002Fpath\u002Fto\u002Fken-mcp\n```\n\n`~\u002F.cursor\u002Fmcp.json` (or `.cursor\u002Fmcp.json`):\n```json\n{ \"mcpServers\": { \"ken\": { \"command\": \"\u002Fabsolute\u002Fpath\u002Fto\u002Fken-mcp\" } } }\n```\n\n`~\u002F.codex\u002Fconfig.toml`:\n```toml\n[mcp_servers.ken]\ncommand = \"\u002Fabsolute\u002Fpath\u002Fto\u002Fken-mcp\"\n```\n\n`~\u002F.opencode\u002Fconfig.json`:\n```json\n{ \"mcp\": { \"ken\": { \"type\": \"local\", \"command\": [\"\u002Fabsolute\u002Fpath\u002Fto\u002Fken-mcp\"] } } }\n```\n\n`.vscode\u002Fmcp.json`:\n```json\n{ \"servers\": { \"ken\": { \"command\": \"\u002Fabsolute\u002Fpath\u002Fto\u002Fken-mcp\" } } }\n```\n\n### Environment\n\n| Variable | Default | Purpose |\n|---|---|---|\n| `KEN_MCP_DEFAULT_REPO` | (unset) | Pre-indexed source; lets tools omit the `repo` arg. |\n| `KEN_MCP_MODE` | `hybrid` | `bm25` \u002F `semantic` \u002F `hybrid`. Auto-downgrades to `bm25` with a stderr warning if the model dir is unreachable. |\n| `KEN_MCP_MODEL_DIR` | (unset) | Path to a Model2Vec snapshot containing `model.safetensors`. Empty ⇒ `bm25`-only. |\n| `KEN_MCP_CHUNKER` | `regex` | `regex` \u002F `treesitter` \u002F `line` \u002F `markdown`. See [\"Choosing a chunker\"](#choosing-a-chunker). |\n| `KEN_DB_DSN` | (unset) | Database DSN. Postgres (`postgres:\u002F\u002F...` \u002F `postgresql:\u002F\u002F...`), SQLite (`sqlite:\u002F\u002F\u002Fabs\u002Fpath.db`, `sqlite:\u002F\u002F.\u002Frel\u002Fpath.db`, `sqlite3:\u002F\u002F...`), or MySQL (`mysql:\u002F\u002Fuser:pass@host:3306\u002Fdb`, native `user:pass@tcp(host:3306)\u002Fdb`, or `user:pass@unix(\u002Fsock)\u002Fdb`) — engine routing dispatches on the scheme (or `@tcp(`\u002F`@unix(` for the native MySQL form). 
Enables [Tier 2 DB indexing](#tier-2--live-postgres-introspection-ken_db_dsn). Requires `KEN_MCP_DEFAULT_REPO` to be a local path. |\n| `KEN_DB_SAMPLE_ROWS` | `0` | Rows per table to sample. **Default 0 means schema-only.** See the [PII stance](#pii-stance-documentation--sane-defaults) before enabling. |\n| `KEN_DB_REINDEX_INTERVAL` | (off) | Go duration (`5m`, `1h`). Background refresh cadence. Off by default — restart or `SIGHUP` to refresh. |\n| `KEN_DB_LISTEN` | `0` | `1` \u002F `true` \u002F `yes` activates Postgres LISTEN\u002FNOTIFY push notifications (v0.8.0). Requires the one-time setup script: `ken-mcp print-listen-script \\| psql $KEN_DB_DSN`. Non-Postgres DSNs log debug + no-op. See [LISTEN\u002FNOTIFY push notifications](#listennotify-push-notifications-v080-postgres-only). |\n| `KEN_DB_SCHEMAS` | (unset) | Comma-separated allow-list of schema names (Postgres) \u002F database names (MySQL). Example: `public,billing`. Default exclusions (`pg_catalog`, `information_schema`, `mysql`, `performance_schema`, `sys`) always still apply. SQLite ignores. See [Filtering indexed schemas](#filtering-indexed-schemas). |\n| `KEN_DB_EXCLUDE_SCHEMAS` | (unset) | Comma-separated deny-list. Extends (does not replace) the default exclusions. Example: `audit,cron,legacy`. When set alongside `KEN_DB_SCHEMAS`, the allow-list wins (stderr warn). SQLite ignores. |\n| `KEN_SQL_NO_AUTO_MIGRATIONS` | (off) | `1` \u002F `true` \u002F `yes` disables v0.7.1 Tier-1 migration-history folding (restores v0.7.0 per-file behavior). Useful when you maintain a canonical `schema\u002Fcurrent.sql` and don't want migration history surfaced as folded chunks. |\n| `KEN_MCP_CACHE_SIZE` | `16` | LRU bound on the repo→Index cache. |\n| `KEN_MCP_LOG_LEVEL` | `warn` | `debug` \u002F `info` \u002F `warn` \u002F `error`. All logs go to stderr; **stdout is the JSON-RPC channel** ([details](docs\u002FDESIGN.md#hard-rule--stdoutstderr-contract)). 
|\n| `KEN_ALLOW_PRIVATE_CLONE_TARGETS` | `0` | Defaults off. When an agent passes an http(s) URL as `repo`, ken pre-resolves the host and rejects loopback \u002F link-local \u002F RFC1918 \u002F RFC4193 \u002F unspecified addresses (SSRF guard — blocks an agent from coaxing ken-mcp into probing cloud-metadata or other internal endpoints). Set to `1` \u002F `true` \u002F `yes` if you legitimately need to clone from an internal git host. |\n\n### Tuning ken's routing for your repo\n\nBy default, `ken-mcp`'s server-side instructions tell agents to prefer ken's `search` and `find_related` tools over grep, Glob, or Read for code-related questions — semble's verbatim behavior, faithful to the drop-in claim. For many repos that default is right; for some it's too aggressive (small codebases where grep is plenty fast; refactors that need exhaustive enumeration that top-N retrieval can silently miss).\n\nIf you'd rather have agents route between ken and grep deliberately, add something like the following to your repo's `CLAUDE.md`:\n\n> **Search routing — ken vs grep.** The `ken` MCP server is user-scoped (`claude mcp add ken -s user …`); not every session has it. Check the tool list before assuming.\n>\n> - **ken** — first-pass \"show me the surface of X\", semantic \u002F conceptual queries (\"where do we handle X?\"), unfamiliar areas. Returns a ranked top-N grouped across layers (handler → store → resolver → migrations → generated → docs). ~1–2 s warm round-trip.\n> - **grep \u002F rg** — exhaustive enumeration, pre-rename audits, every literal occurrence, known-identifier lookups, one-off literal checks. ~0.06 s and deterministic. 
**Use grep before any rename or refactor that must be complete** — ken is top-N and can miss matches past its result window.\n> - Don't reach for ken on a one-off literal lookup where you already know the symbol — the latency tax isn't worth it.\n\nken's defaults stay unchanged; this is per-repo tuning, not a configuration flag.\n\n## Tools\n\nBoth tools return a formatted markdown string identical to semble's `_format_results` output.\n\n### `search`\n\n| Arg | Type | Required | Default | Description |\n|---|---|---|---|---|\n| `query` | string | ✓ | — | Natural language or code query. |\n| `repo` | string |   | — | `https:\u002F\u002F` \u002F `http:\u002F\u002F` URL or local directory. Required if no `KEN_MCP_DEFAULT_REPO`. |\n| `mode` | `hybrid`\\|`semantic`\\|`bm25` |   | `hybrid` | Search mode. |\n| `top_k` | int |   | `5` | Number of results. |\n\n### `find_related`\n\n| Arg | Type | Required | Default | Description |\n|---|---|---|---|---|\n| `file_path` | string | ✓ | — | Path as it appears in a `search` result. |\n| `line` | int (1-indexed) | ✓ | — | A line inside the chunk to seed the similarity search. |\n| `repo` | string |   | — | Same as for `search`. |\n| `top_k` | int |   | `5` | Number of similar chunks. |\n\nExample response (verbatim from a real session against this repo's polyglot fixture):\n\n```\nSearch results for: \"validate_user\" (mode=bm25)\n\n## 1. 
auth.py:1-22  [score=5.518]\n​```\n\"\"\"Authentication helpers.\"\"\"\n\nimport hashlib\n\n@dataclass\nclass User:\n    name: str\n    token: str\n\n    def is_valid(self):\n        return bool(self.token)\n\n# validate_user checks a token against a user record.\ndef validate_user(user, token):\n    return user.token == token\n​```\n```\n\n## How it works\n\n```\ngitignore-respecting walk\n    → regex chunker (Python \u002F Go \u002F TS \u002F Java \u002F Rust) with line-chunker fallback\n    → BM25 (Lucene variant, k1=1.5, b=0.75)  +  Model2Vec semantic (cosine over a dense matrix)\n    → α-weighted RRF fusion (α auto-detected: 0.3 for symbol queries, 0.5 for NL)\n    → file-coherence boost + query-type boosts (definition \u002F embedded-symbol \u002F stem-match)\n    → path penalties (test files, compat \u002F legacy, `.d.ts`) + file-saturation decay\n    → top-k\n```\n\nThe retrieval algorithm is a verbatim port of semble's `search.py` + `ranking\u002F*.py`; see [docs\u002FDESIGN.md §7](docs\u002FDESIGN.md#7-hybrid-retrieval--rerank) for every constant, every pipeline-order subtlety, and where the original scoping reconstruction diverged from semble's live source. The Model2Vec inference path (three-tensor `safetensors` layout, the `mapping[]` indirection, the float64 precision contract that's load-bearing for ≥1−1e-5 cosine parity) is in [§4](docs\u002FDESIGN.md#4-model2vec-inference-format).\n\n## Using ken as a library over `fs.FS`\n\nAs of v0.5.0 the walker and indexer take any `fs.FS`, so ken can index an `embed.FS`, an `fstest.MapFS`, a tarball-backed FS, or any other `fs.FS` implementation — useful for agent sandboxing (no escape from the corpus) and offline analysis (no unpack-to-disk step). 
The `--watch` codepath stays real-FS-only.\n\n```go\npackage main\n\nimport (\n    \"embed\"\n    \"fmt\"\n    \"log\"\n\n    \"github.com\u002Ftownsendmerino\u002Fken\u002Finternal\u002Fsearch\"\n)\n\n\u002F\u002Fgo:embed corpus\u002F**\nvar corpus embed.FS\n\nfunc main() {\n    \u002F\u002F BM25-only, so the model path stays empty.\n    ix, err := search.FromFS(corpus, search.ModeBM25, \"regex\", \"\")\n    if err != nil {\n        log.Fatal(err)\n    }\n    for _, r := range ix.Search(\"validate token\", 5) {\n        fmt.Printf(\"%.3f  %s:%d\\n\", r.Score, r.Chunk.File, r.Chunk.StartLine)\n    }\n}\n```\n\nFor test fixtures, `testing\u002Ffstest.MapFS` works the same way: `search.FromFS(fstest.MapFS{\"a.go\": {Data: []byte(\"...\")}}, …)`. The legacy `search.FromPath(root, …)` is now a thin deprecated wrapper around `search.FromFS(os.DirFS(root), …)`. See [ADR-014](docs\u002FDECISIONS.md#adr-014-fsfs-as-canonical-walkerindexer-surface) for the design rationale.\n\n## Choosing a chunker\n\nken ships with **two code-aware chunkers** behind the same `--chunker=` flag (CLI) \u002F `KEN_MCP_CHUNKER=` env var (MCP); the env var also accepts `line` and `markdown`, per the [Environment](#environment) table:\n\n- **`regex`** *(default)* — hand-rolled per-language regex rules for Python \u002F Go \u002F TypeScript \u002F Java \u002F Rust with a line-window fallback for everything else.\n- **`treesitter`** *(opt-in)* — pure-Go tree-sitter via [`gotreesitter`](https:\u002F\u002Fgithub.com\u002Fodvcencio\u002Fgotreesitter), running the cAST split-then-merge algorithm from [arXiv 2506.15655](https:\u002F\u002Farxiv.org\u002Fhtml\u002F2506.15655). Its 206 embedded grammars are ~19 MB on disk (gotreesitter's `embed.FS` payload); importing the chunker adds ~26 MB to the linked binary (parser runtime + embed payload + symbol bookkeeping; measured darwin\u002Farm64). Importing is per-binary at compile time — `cmd\u002Fken` and `cmd\u002Fken-mcp` blank-import it; `cmd\u002Fken-mcp-docs` deliberately doesn't. Once imported, chunker choice is a runtime flag (`--chunker=treesitter` \u002F `KEN_MCP_CHUNKER=treesitter`). 
(v0.8.2 found per-grammar build-tag gating couldn't shrink the binary because the embed layer was a monolithic upstream glob; gotreesitter v0.20.0-rc2 fixed that, so as of [ADR-033](docs\u002FDECISIONS.md#adr-033-adopt-gotreesitter-grammarsubset-slim-release-binaries-v0200-rc2) ken's *release* binaries build slim — embedding only the 17 dispatched grammars (~14 MB lighter) — while the library `go build` stays all-grammars. History in [ADR-023](docs\u002FDECISIONS.md#adr-023-gotreesitter-grammar_subset-machinery--binary-size-reduction-outcome-v082-investigation-outcome) and its [calibration amendment](docs\u002FDECISIONS.md#calibration-amendment-post-v083-audit).)\n\n**TL;DR:** stay on `regex` unless you index one of the languages where treesitter measurably wins.\n\nThe NDCG@10 difference is small (overall hybrid: treesitter 0.838 vs regex 0.842 — Δ −0.004, within bench noise), but it's not uniform per-language. From the v0.2.0 measurement on semble's 63-repo benchmark:\n\n| Language | regex | treesitter | Recommendation |\n|---|---:|---:|---|\n| Kotlin | 0.806 | **0.817** | **`treesitter`** *(+0.011)* |\n| Zig | 0.867 | **0.880** | **`treesitter`** *(+0.013)* |\n| TypeScript | 0.676 | **0.685** | **`treesitter`** *(+0.009)* |\n| Java | 0.829 | **0.835** | **`treesitter`** *(+0.006)* |\n| PHP | 0.860 | **0.865** | **`treesitter`** *(+0.005)* |\n| Python | **0.870** | 0.861 | `regex` *(−0.009)* |\n| C | **0.748** | 0.731 | `regex` *(−0.017)* |\n| C++ | **0.896** | 0.884 | `regex` *(−0.012)* |\n| Rust | **0.806** | 0.793 | `regex` *(−0.013)* |\n| Lua | **0.838** | 0.816 | `regex` *(−0.022)* |\n| Scala | **0.905** | 0.883 | `regex` *(−0.022)* |\n| Go | **0.849** | 0.846 | either *(tied within ±0.005)* |\n| JavaScript | 0.917 | 0.912 | either |\n| Ruby | 0.903 | 0.903 | either |\n| Swift | 0.846 | 0.841 | either |\n| Elixir | 0.911 | 0.907 | either |\n| Haskell | 0.738 | 0.739 | either |\n| C# | 0.859 | 0.859 | either *(treesitter auto-falls-back to line)* |\n| Bash | 
0.821 | 0.821 | either *(treesitter auto-falls-back to line)* |\n\nNotes on the auto-fallback rows:\n- **C#** — the gotreesitter v0.18.0 C# grammar OOMs on real-world C# files (1.7+ GB RSS during indexing). The treesitter chunker detects unsupported languages and routes them through the line chunker, so C# behaves identically under both selections.\n- **Bash** — the bash grammar is pathologically slow on real bash-it content (~39% of files time out). Same auto-fallback behavior.\n\nThe full per-language NDCG breakdown and the empirical findings that informed these recommendations are in [`docs\u002FBENCH.md`](docs\u002FBENCH.md). The rationale for default-stays-regex is in [`docs\u002FDECISIONS.md` ADR-011](docs\u002FDECISIONS.md#adr-011-default-chunker-stays-regex-in-v020-treesitter-is-opt-in).\n\n## Comparison to semble\n\n| Property | semble | ken |\n|---|---|---|\n| Language | Python | Go |\n| Distribution | `uvx` \u002F `pip install` | single static binary |\n| Cold start | (Python interpreter + `import numpy` + model load: ~500 ms per [semble README](https:\u002F\u002Fgithub.com\u002FMinishLab\u002Fsemble#benchmarks)) | ~10–20 ms `ken search` over a tiny index (measured, M2 Mac) |\n| Index this repo (1,710 chunks, hybrid w\u002F model) | (not measured locally) | **0.36 s** (measured‡) |\n| Index `\u002Ftmp\u002Fsemble` checkout (1,151 chunks, hybrid w\u002F model) | (not measured locally) | **1.27 s** (measured‡) |\n| Index this repo (1,710 chunks, BM25 only) | (not measured locally) | **0.06 s** (measured‡) |\n| Retrieval algorithm | reference implementation | verbatim port (constants and pipeline order ported from `search.py` + `ranking\u002F*.py`) |\n| NDCG@10 on semble's benchmark | 0.854 ([semble README](https:\u002F\u002Fgithub.com\u002FMinishLab\u002Fsemble#benchmarks)) | **0.842 hybrid** (gap 0.012; full corpus of 63 repos, 1,251 queries)† |\n| NDCG@10 on CoIR-CSN-Python (external) | (not measured; semble doesn't run this bench) | **0.8743 bm25 \u002F 0.7839 hybrid** 
([see why](#benchmarks--external-reference-coir-csn-python))†† |\n| Median tokens to recall@10 on agent queries | (not measured; semble doesn't run this bench) | **4,269 tok @ 82% recall** on semble NL queries — vs grep+Read's 189,591 tok @ 99.9% (44× cheaper at 17 pp lower recall)††† |\n| MCP server | yes | yes — drop-in compatible (same tool schemas, same wire format) |\n| Binary size | n\u002Fa (Python env) | release (slim) `ken` ~22 MB · `ken-mcp` ~38 MB; default `go build` (all 206 grammars) `ken` ~36 MB · `ken-mcp` ~54 MB. Slim embeds only the 17 dispatched grammars via `grammar_subset` build tags ([ADR-033](docs\u002FDECISIONS.md#adr-033-adopt-gotreesitter-grammarsubset-slim-release-binaries-v0200-rc2)); measured darwin\u002Farm64 — see [Choosing a chunker](#choosing-a-chunker) |\n| Requires `huggingface-cli` for model | yes | **no** — `ken download-model` fetches direct from HF (or skip and use `--mode bm25`) |\n\n† **Measured at v0.1.0 \u002F v0.2.0 against semble's published benchmark** (63 repos, 1251 queries, semble's own `benchmarks.metrics.ndcg_at_k` + `target_rank`). Reproduce: see [`docs\u002FBENCH.md`](docs\u002FBENCH.md). Ablation breakdown vs semble's published raw retrieval numbers:\n>\n> | Mode | semble (raw) | ken regex (default) | ken treesitter (opt-in) |\n> |---|---:|---:|---:|\n> | Semantic only (potion-code-16M) | 0.650 | **0.647** | — |\n> | BM25 only | 0.675 | 0.624 | 0.621 |\n> | **Hybrid (full ranker)** | **0.854** | **0.842** | **0.838** |\n>\n> The semantic-raw match within 0.003 isolates and validates the embedding + tokenizer + ANN port. The BM25 tokenizer was also re-aligned to a verbatim port of semble's `tokens.py` (snake-case compound preservation, ASCII-only identifier extraction, compound-first emission order). 
The v0.2.0 tree-sitter chunker (`--chunker=treesitter` via [`gotreesitter`](https:\u002F\u002Fgithub.com\u002Fodvcencio\u002Fgotreesitter)) trades NDCG per-language without net movement — clear wins on Kotlin \u002F Zig \u002F TypeScript \u002F Java \u002F PHP, losses on Python \u002F Rust \u002F C \u002F Lua \u002F Scala — so the **default chunker stays regex** and treesitter is opt-in. See [\"Choosing a chunker\"](#choosing-a-chunker) for the per-language recommendation and [`docs\u002FDECISIONS.md` ADR-011](docs\u002FDECISIONS.md#adr-011-default-chunker-stays-regex-in-v020-treesitter-is-opt-in) for the full rationale.\n\n†† CoIR-CSN-Python numbers reported separately because they tell a different story than semble's bench: on CSN, BM25 beats hybrid by ~0.09 due to a substring-leak artifact in how CoIR reframes the CodeSearchNet dataset (queries are Python function sources; documents are docstrings extracted from those same functions, so the answer is a literal substring of the query). See the [\"Benchmarks — external reference\"](#benchmarks--external-reference-coir-csn-python) section and [`docs\u002FBENCH.md`](docs\u002FBENCH.md#external-benchmark--coir-csn-python) for the corrected explanation. semble's bench is the verbatim-port confirmation; CoIR-CSN is the externally-reproducible anchor against published code-IR baselines but is read as a dataset-construction case study, not as evidence about ken's hybrid retrieval on natural NL-to-code queries.\n\n††† Measured at v0.3.0 against semble's 63-repo benchmark (914 NL queries from semble's 1,251-query corpus, ranked by ken's regex chunker, K=10). The honest framing: ken trades ~17 percentage points of recall for ~44× fewer agent-input tokens. 
Exhaustive enumeration (refactors, pre-rename audits) still belongs to grep — ken is for \"find the chunk that answers this.\" Full per-query-class table (symbol + NL) and the methodology + caveats are in [`docs\u002FBENCH.md`](docs\u002FBENCH.md#token-budget-recall--agent-side-efficiency).\n\n‡ **Indexing times re-measured 2026-05-29** at commit `fe53e91` (post-perf-campaign: v0.8.5–v0.8.7 tokenizer-allocation reduction + indexing-pipeline parallelism, [ADR-027](docs\u002FDECISIONS.md#adr-027-bm25-tokenizer-allocation-reduction--rune--byte--syncpool-scratch--lowercase-fast-path-v085)\u002F[ADR-030](docs\u002FDECISIONS.md#adr-030-indexing-pipeline-parallelism--phase-a-per-file-workers-for-chunk--embed-v087)), via `ken perf index \u003Cpath> --mode …` (darwin\u002Farm64, Go 1.26.3, 8 cores). \"This repo\" is the ken repo root — it grew from 542 to 1,710 chunks (3.2×) since the prior measurement, yet hybrid indexing *dropped* from 0.45 s to 0.36 s and BM25 held at 0.06 s for 3.2× the work.\n\nsemble timings cited above are from semble's own [README \"Benchmarks\" section](https:\u002F\u002Fgithub.com\u002FMinishLab\u002Fsemble#benchmarks); ken's are measured on the ken repo root and on a sibling shallow clone of `\u002Ftmp\u002Fsemble`. 
Cold-start was timed by `\u002Fusr\u002Fbin\u002Ftime -p ken search testdata\u002Frepo \"validate\" -k 1 --mode bm25` over three trials (M2 MacBook Air, Go 1.26.3, darwin\u002Famd64 build under Rosetta).\n\n## Benchmarks — external reference (CoIR-CSN-Python)\n\nA single externally-reproducible NDCG@10 number on [CoIR](https:\u002F\u002Fgithub.com\u002FCoIR-team\u002Fcoir)'s `CodeSearchNet-python` task, independent of semble's own benchmark — gives readers a comparable anchor against published code-IR baselines.\n\nResult (v0.2.0, 1000-query subsample, regex chunker):\n\n| Mode                       | NDCG@10 |\n|----------------------------|--------:|\n| bm25                       |  0.8743 |\n| semantic                   |  0.7405 |\n| **hybrid (default)**       | **0.7839** |\n\nReproduce:\n\n```bash\npython scripts\u002Fbench_coir.py                                # ~45 s download + 280k corpus files\nKEN_COIR_QUERY_LIMIT=1000 go test -tags=bench .\u002Fbench\u002Fndcg\u002F -run TestCoIR -v   # ~13 min\n```\n\nA nuance worth surfacing up front: **on CSN-Python, BM25 beats hybrid by 0.09** — opposite of what semble's bench shows. CSN-Python's queries (as CoIR re-hosts the dataset) are full Python function sources, and the relevant document for each query is the docstring extracted from that same function. Because the docstring lives inside the function source as a literal substring (the function's own `\"\"\"...\"\"\"` block), any lexical retriever with identifier-aware tokenization wins — BM25 has the answer string as input. ken's α=0.5 RRF fusion then drags the hybrid number down by averaging in the weaker semantic ranking. Not a ken bug; it's a structural artifact of how CoIR reframed CodeSearchNet for retrieval, and doesn't generalize to natural NL-to-code distributions. 
Detailed empirical findings and the comparison to potion-code-16M's published aggregate are in [`docs\u002FBENCH.md`](docs\u002FBENCH.md#external-benchmark--coir-csn-python).\n\n## Roadmap\n\nThe full risk register with explicit triggers is in [docs\u002FDESIGN.md §10](docs\u002FDESIGN.md#10-risk-register). Highlights:\n\n- **NDCG vs semble — measured at v0.1.0 \u002F v0.2.0**: hybrid 0.842 (regex) and 0.838 (treesitter) vs semble's 0.854. The ~0.012 gap is **not primarily chunker-driven** — v0.2.0's tree-sitter chunker trades per-language wins and losses without closing the gap (see [docs\u002FBENCH.md](docs\u002FBENCH.md) \"v0.2.0 empirical findings\"). The algorithm port itself is validated by the semantic-raw match within 0.003.\n- **Tree-sitter chunker (Option A)** — landed in v0.2.0 via [`gotreesitter`](https:\u002F\u002Fgithub.com\u002Fodvcencio\u002Fgotreesitter) as opt-in (`--chunker=treesitter`). Default stays `regex`. Per-language guidance in [\"Choosing a chunker\"](#choosing-a-chunker).\n- **Chroma chunker (Option B)** — broader language coverage via a token-stream lexer. Trigger: a polyglot repo where neither chunker covers a needed language. Not currently triggered.\n- **Class-body-aware Python chunking** — currently top-level only; large Django models \u002F SQLAlchemy bases line-split through methods. Trigger: Python NDCG visibly below the other languages (not currently triggered).\n- **~~Incremental indexing~~ — landed in v0.3.** `ken-mcp` watches the repo file tree and republishes a snapshot 2s after any edit, so an agent querying its own working tree sees its own edits without a restart. `ken index --watch` (default) keeps the CLI alive in a similar role; `ken index --no-watch` restores the v0.2 build-and-exit behavior. Tombstones for deletes, no compaction — memory grows monotonically with cumulative edit volume, which is fine for typical agent-session lifetimes; compaction is a v0.3.x trigger if multi-day sessions hit pressure. 
Atomic-snapshot reads keep query latency unchanged from v0.2. Implementation: [`internal\u002Fsearch\u002Fwatch.go`](internal\u002Fsearch\u002Fwatch.go), design rationale in [`docs\u002FDECISIONS.md` ADR-012](docs\u002FDECISIONS.md#adr-012-incremental-indexing-via-fsnotify--atomic-snapshot-swap).\n- **Token-budget recall — agent-side efficiency vs grep+Read.** Measured at v0.3.0; ken surfaces the qrel target chunk in ~44× fewer tokens than the tokenized-grep baseline at K=10 on semble's NL queries (82% recall vs 99%), and in ~10,000× fewer tokens on the 280K-file CoIR-CSN-Python corpus (91% vs 100% recall). Grep wins on recall completeness; ken wins decisively on agent-input cost. See [`docs\u002FBENCH.md` \"Token-budget recall\"](docs\u002FBENCH.md#token-budget-recall--agent-side-efficiency).\n\n## How this was built\n\nken is a port. The retrieval algorithm is verbatim from [MinishLab\u002Fsemble](https:\u002F\u002Fgithub.com\u002FMinishLab\u002Fsemble) (Python). The Go implementation was written by Claude under a fixed set of constraints: pure Go with no cgo; algorithm constants ported verbatim and never tuned; and the original source wins whenever Claude's reconstruction of an algorithm detail diverges from semble's live code.\n\nThat last rule caught five material errors during the rerank-pipeline port (see [docs\u002FDESIGN.md §7](docs\u002FDESIGN.md#7-hybrid-retrieval--rerank)) — each one a confident-sounding hallucination of an algorithm detail that turned out to be wrong when checked against the Python source. The discipline of always checking is human-supplied.\n\nBenchmark numbers in the [Comparison table](#comparison-to-semble) are measured against semble's own harness using its native NDCG@10 metric, not synthesized — reproducible via [`docs\u002FBENCH.md`](docs\u002FBENCH.md). 
The 11k-input tokenizer parity test ([`scripts\u002Fparity_dump.py`](scripts\u002Fparity_dump.py) + [`internal\u002Fembed\u002Fparity_test.go`](internal\u002Fembed\u002Fparity_test.go)) was a human call — \"the 18-case spot-check isn't enough\" — and surfaced three real bugs the spot-check missed.\n\nThe ADR-style record of every architectural decision (alternatives considered, consequences) lives in [`docs\u002FDECISIONS.md`](docs\u002FDECISIONS.md).\n\n## Acknowledgments\n\nken stands on MinishLab's shoulders. The retrieval algorithm, the model, the entire approach to embedding-table-driven code search — all theirs.\n\n- **[semble](https:\u002F\u002Fgithub.com\u002FMinishLab\u002Fsemble)** — the original Python implementation. ken's retrieval pipeline is a verbatim port; constants and pipeline order come straight from `search.py` and `ranking\u002F*.py`. © Thomas van Dongen, MIT.\n- **[model2vec](https:\u002F\u002Fgithub.com\u002FMinishLab\u002Fmodel2vec)** — the static-embedding library whose three-tensor format ken implements. © Thomas van Dongen, MIT.\n- **[potion-code-16M](https:\u002F\u002Fhuggingface.co\u002Fminishlab\u002Fpotion-code-16M)** — model weights, distilled from `nomic-ai\u002FCodeRankEmbed` (MIT) which is itself initialized from `Snowflake\u002Fsnowflake-arctic-embed-m-long` (Apache-2.0). © Minish Lab. Redistributed per [`NOTICE`](NOTICE).\n\n## License\n\nken is [MIT-licensed](LICENSE). It bundles attribution for the redistributed model weights and their upstream lineage in [`NOTICE`](NOTICE), and a generated list of Go-module dependency licenses in [`THIRD_PARTY_LICENSES.md`](THIRD_PARTY_LICENSES.md). 
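Every link in the provenance chain is permissive (MIT, Apache-2.0) or file-level copyleft only (MPL-2.0). The MPL-2.0 entry, `go-sql-driver\u002Fmysql` v1.10.0 (added in v0.7.2), is safe to redistribute when used as an unmodified library; see [docs\u002FDESIGN.md §6](docs\u002FDESIGN.md#6-license--attribution-chain).\n\nFor contributors: see [`CLAUDE.md`](CLAUDE.md) for build \u002F test \u002F formatting conventions and the project's invariants (precision contract, stdout\u002Fstderr contract).\n","ken is a fast hybrid code search tool designed for agents. It is written in Go, ships as a single static binary, and is compatible with semble, supporting the same tool schemas, output format, and install steps. Its core feature is an efficient code retrieval algorithm based on BM25 and embeddings that significantly reduces the number of tokens needed per query. In addition, ken's embedded-corpus build pattern lets developers package documentation into a single MCP server binary, so users can run high-quality documentation retrieval locally without depending on backend services or network requests. This design is particularly suited to scenarios that need fast access to large amounts of code documentation or development guides, such as programming assistants or automated development environments.","2026-06-11 04:08:09","CREATED_QUERY"]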