[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-76148":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":14,"subscribersCount":14,"size":14,"stars1d":14,"stars7d":15,"stars30d":16,"stars90d":14,"forks30d":14,"starsTrendScore":17,"compositeScore":18,"rankGlobal":9,"rankLanguage":9,"license":19,"archived":20,"fork":20,"defaultBranch":21,"hasWiki":20,"hasPages":20,"topics":22,"createdAt":9,"pushedAt":9,"updatedAt":23,"readmeContent":24,"aiSummary":25,"trendingCount":14,"starSnapshotCount":14,"syncStatus":15,"lastSyncTime":26,"discoverSource":27},76148,"psql_bm25s","Intelligent-Internet\u002Fpsql_bm25s","Intelligent-Internet","PostgreSQL BM25S extension",null,"PLpgSQL",140,3,129,0,2,7,1,43.51,"Apache License 2.0",false,"main",[],"2026-06-12 04:01:20","# 🐿️ psql_bm25s\n\n`psql_bm25s` ([Technical Report](docs\u002Ftechnical-report.md)) is an\nindependent PostgreSQL extension for BM25-family lexical retrieval. It\nis inspired by the public [`bm25s`](https:\u002F\u002Fgithub.com\u002Fxhluca\u002Fbm25s)\nwork and implemented as a PostgreSQL-native access method.\n\n\u003Cimg width=\"1500\" height=\"600\" alt=\"commons-banner-github\" src=\"doc\u002Fbanner.png\" \u002F>\n\nThe project keeps the BM25 contract explicit where it matters most:\n\n- BM25-family semantics\n- corpus-statistics-driven ranking\n- query-first exact top-k retrieval\n\nThe mainline extension adds a PostgreSQL-specific storage and\nmaintenance design for database workloads with frequent `INSERT`,\n`UPDATE`, and `DELETE`. 
No source code from the Python reference\nimplementation `bm25s` is vendored or copied into this repository.\n\n## What It Does\n\nCurrent mainline capabilities:\n\n- native `CREATE INDEX USING psql_bm25s`\n- indexed column types:\n  - `text`\n  - `varchar`\n  - `text[]`\n  - `varchar[]`\n  - `int4[]`\n- multicolumn fusion indexes over `text[]`, `varchar[]`, `text`, or\n  `varchar` columns\n- opt-in field-aware multicolumn indexes with query-time field\n  weights\n- canonical exact BM25 retrieval APIs:\n  - `psql_bm25s_query_tokens(...)`\n  - `psql_bm25s_query_ids(...)`\n- SQL-facing convenience surfaces:\n  - `psql_bm25s_query(...)`\n  - `psql_bm25s_query_prepared(...)`\n  - `tokens @@ 'query text'`\n  - `ORDER BY tokens \u003C=> ... ASC LIMIT k`\n  - `ORDER BY token_ids \u003C=> ... ASC LIMIT k`\n- text preprocessing helpers:\n  - tokenization\n  - normalization\n  - optional stopwords\n  - optional English stemming\n  - optional Latin-diacritic folding\n- token-stream and raw-text highlight and snippet helpers\n- automatic mutable-workload maintenance:\n  - exact `INSERT` \u002F `UPDATE`\n  - exact delete cleanup through PostgreSQL heap cleanup and index\n    maintenance\n  - bounded deferred overlays for lower write cost\n- maintenance introspection and recommendation helpers\n- PostgreSQL-native durability and physical replication compatibility\n- weighted multi-index query fusion helpers for field-aware retrieval\n- C-backed hybrid BM25\u002Fvector late-fusion helpers without a hard vector\n  extension dependency\n\n## Relationship To BM25S\n\nThe Python reference implementation `bm25s` project is the main public\nreference for the eager sparse scoring formulation used as technical\nbackground:\n\n- \u003Chttps:\u002F\u002Fgithub.com\u002Fxhluca\u002Fbm25s>\n\nRelated scoring concepts include:\n\n- `robertson`\n- `lucene`\n- `atire`\n- `bm25l`\n- `bm25+`\n- eager sparse scoring\n- CSC-style postings layout\n- top-k retrieval over precomputed sparse 
scores\n\nThe extension does not try to reproduce the `bm25s` Python package interface.\nIt implements its own C and PostgreSQL storage, access-method, and\nmaintenance layers around BM25-family retrieval semantics.\n\n## Why This Extension Exists\n\n`bm25s` is excellent for fast BM25 retrieval over static or externally\nmanaged corpora. PostgreSQL adds different requirements:\n\n- index persistence inside the database\n- transactional writes\n- crash recovery\n- physical replication\n- planner-visible SQL access\n- operationally manageable index maintenance under row churn\n\n`psql_bm25s` keeps BM25-family semantics explicit while adding the\nstorage and maintenance layers needed for the index to behave like a\ndatabase index rather than a one-shot offline artifact.\n\n## Performance Status\n\nCurrent high-level read:\n\n- the canonical rowset query path is the main performance path\n- the current public benchmark reference is the PG18 `15 x 5` BEIR\n  matrix\n- published BEIR reference targets are included in the project benchmark\n  suite\n- historical same-machine studies remain useful background, but they\n  are no longer the default benchmark reference\n- the mutable-maintenance design dramatically improves write-side\n  maintenance cost relative to eager full rebuilds, while keeping exact\n  reads and PostgreSQL durability\u002Freplication guarantees\n\nCurrent performance matrix, with dataset size and QPS:\n\nThese numbers are benchmark context, not universal claims about any\nproject. They depend on the measured versions, configuration, hardware,\nworkload, and query settings. 
Third-party project names identify the\nmeasured engines for reproducibility.\n\n| Dataset | Docs | Queries | Python reference implementation `bm25s` QPS | `psql_bm25s ids` QPS | `psql_bm25s text[]` QPS | `pg_search` QPS | `vchord_bm25` QPS |\n| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |\n| `arguana` | 8,674 | 1,406 | 1158.34 | 1402.63 | 1112.01 | 115.94 | 78.77 |\n| `climate-fever` | 5,416,593 | 1,535 | 3.04 | 57.78 | 50.75 | 2.84 | 5.25 |\n| `cqadupstack` | 457,199 | 13,145 | 111.56 | 443.13 | 438.42 | 13.99 | 60.51 |\n| `dbpedia-entity` | 4,635,922 | 467 | 3.47 | 128.19 | 91.19 | 5.21 | 23.66 |\n| `fever` | 5,416,568 | 123,142 | 3.15 | 97.56 | 80.15 | 5.62 | 12.13 |\n| `fiqa` | 57,638 | 6,648 | 810.51 | 1409.52 | 1186.41 | 17.76 | 190.57 |\n| `hotpotqa` | 5,233,329 | 97,852 | 4.16 | 55.40 | 49.86 | 3.42 | 9.31 |\n| `msmarco` | 8,841,823 | 509,962 | 1.61 | 96.67 | 82.13 | 4.44 | 18.20 |\n| `nfcorpus` | 3,633 | 3,237 | 3155.35 | 3373.94 | 3326.96 | 1132.17 | 1252.75 |\n| `nq` | 2,681,468 | 3,452 | 10.55 | 174.34 | 176.69 | 6.28 | 21.96 |\n| `quora` | 522,931 | 15,000 | 90.56 | 637.98 | 619.64 | 13.26 | 154.36 |\n| `scidocs` | 25,657 | 1,000 | 1203.09 | 1835.85 | 1614.92 | 17.89 | 367.04 |\n| `scifact` | 5,183 | 1,109 | 2964.86 | 2557.47 | 2240.18 | 500.04 | 629.42 |\n| `trec-covid` | 171,332 | 50 | 210.50 | 191.94 | 154.48 | 8.75 | 75.66 |\n| `webis-touche2020` | 382,545 | 49 | 240.36 | 82.97 | 74.10 | 8.14 | 86.04 |\n\nTrend view by dataset scale:\n\n![QPS vs dataset scale](docs\u002Fperformance\u002Freports\u002Fpg18-qps-vs-dataset-scale-2026-04-02.svg)\n\nCurrent matrix readout:\n\n- Median QPS ratios versus Python reference implementation `bm25s` are `3.97x` for\n  `psql_bm25s ids`, `3.93x` for `psql_bm25s text[]`, `0.54x` for\n  `vchord_bm25`, and `0.17x` for `pg_search`.\n- Dataset counts at or above Python reference implementation `bm25s` are `12\u002F15` for\n  `psql_bm25s ids`, `11\u002F15` for `psql_bm25s text[]`, `7\u002F15` for\n  
`vchord_bm25`, and `3\u002F15` for `pg_search`.\n- On the largest workload, `msmarco`,\n  the measured QPS was `96.67` for `psql_bm25s ids`, `82.13` for\n  `psql_bm25s text[]`, `18.20` for `vchord_bm25`, `4.44` for\n  `pg_search`, and `1.61` for the Python reference implementation\n  `bm25s`.\n- Mutable-maintenance results are still important, but they are a\n  separate dimension from this cross-engine read-performance matrix.\n\n[More Performance and Benchmark detail](docs\u002Fperformance\u002FREADME.md)\n\n## Quick Start\n\nYou can get started in three common ways:\n\n### 1. Install from GitHub Releases\n\nDownload the package that matches your PostgreSQL major, OS, and\narchitecture from:\n\n- \u003Chttps:\u002F\u002Fgithub.com\u002FIntelligent-Internet\u002Fpsql_bm25s\u002Freleases>\n\nEach release `.zip` is built with the PostgreSQL extension files staged\nunder their final install paths. A simple install flow is:\n\n```bash\ncurl -L -o psql_bm25s.zip \\\n  https:\u002F\u002Fgithub.com\u002FIntelligent-Internet\u002Fpsql_bm25s\u002Freleases\u002Flatest\u002Fdownload\u002F\u003Cpackage>.zip\nunzip psql_bm25s.zip\nsudo rsync -a \u003Cpackage>\u002F \u002F\n```\n\nReplace `\u003Cpackage>` with the actual archive directory name from the\nrelease, for example `psql_bm25s-vX.Y.Z-linux-x86_64-pg18`.\n\nAfter copying the files, restart PostgreSQL if needed, then enable the\nextension in the target database:\n\n```sql\nCREATE EXTENSION psql_bm25s;\n```\n\nIf you want a non-`public` extension schema, choose it at creation time with\n`CREATE EXTENSION psql_bm25s WITH SCHEMA ext;`. The extension is not\nrelocatable after creation because SQL helper functions capture the extension\nschema for safe wrapper resolution.\n\n### 2. 
Run the Docker image from GitHub Packages\n\nPrebuilt PostgreSQL 18 images are published to:\n\n- \u003Chttps:\u002F\u002Fgithub.com\u002Forgs\u002FIntelligent-Internet\u002Fpackages?repo_name=psql_bm25s>\n\nPull either the floating PG18 tag or a versioned release tag:\n\n```bash\ndocker pull ghcr.io\u002Fintelligent-internet\u002Fpsql_bm25s:pg18\n# or:\ndocker pull ghcr.io\u002Fintelligent-internet\u002Fpsql_bm25s:pg18-vX.Y.Z\n```\n\nStart PostgreSQL with the extension preinstalled:\n\n```bash\ndocker run -d \\\n  --name psql-bm25s-pg18 \\\n  -e POSTGRES_PASSWORD=postgres \\\n  -p 5432:5432 \\\n  ghcr.io\u002Fintelligent-internet\u002Fpsql_bm25s:pg18\n```\n\nThe image init scripts create `psql_bm25s` in `postgres`, `template1`,\nand the optional `POSTGRES_DB` database on first boot.\n\n### 3. Build from source\n\nFor source builds, local validation, and release workflow notes, see\n[Contribution](docs\u002Fcontribution.md).\n\nAfter installing the extension, open `psql` and run the smallest useful\nexample: insert rows, index a `text` column with default options, and\nrun a top-k BM25 query.\n\n```sql\nCREATE EXTENSION psql_bm25s;\n\nDROP TABLE IF EXISTS docs;\n\nCREATE TABLE docs (\n    id integer primary key,\n    title text not null,\n    body text not null\n);\n\nINSERT INTO docs (id, title, body) VALUES\n    (1, 'red apple', 'fresh red apple fruit'),\n    (2, 'green apple', 'green apple slices'),\n    (3, 'orange citrus', 'orange citrus fruit'),\n    (4, 'cat guide', 'small cat animal care');\n\nCREATE INDEX docs_bm25_idx\n    ON docs USING psql_bm25s (body);\n\nSELECT d.id, d.title, h.score\nFROM psql_bm25s_query(\n    'docs_bm25_idx'::regclass,\n    'apple fruit',\n    5\n) AS h\nJOIN docs AS d ON d.ctid = h.ctid\nORDER BY h.score DESC, d.id;\n```\n\nWith no `WITH (...)` options, the index uses Lucene-style BM25 and IDF\ndefaults with the `realtime` consistency policy.\n\nThe extension also supports `varchar`, pretokenized `text[]` and\n`varchar[]`, and 
integer token id arrays for applications that own\ntokenization. Eventual and manual consistency are available for write-heavy\nor batch-maintained corpora.\nFor SQL operators, multi-field search, hybrid vector\u002FBM25 fusion, and\nmaintenance policy examples, start with:\n\n- [Supported Input Types](docs\u002Finput-types.md)\n- [Query Semantics](docs\u002Fquery-semantics.md)\n- [Multi-Field Search](docs\u002Fmulti-field-search.md)\n- [Field-Aware Indexes](docs\u002Ffield-aware-indexes.md)\n- [Hybrid Vector\u002FBM25 Search](docs\u002Fhybrid-search.md)\n- [Index Policy](docs\u002Findex-policy.md)\n\n## Canonical Retrieval APIs\n\nUse these as the default exact BM25 entry points:\n\n- `psql_bm25s_query_ids(...)`\n- `psql_bm25s_query_tokens(...)`\n\nThese are the clearest semantic surface and the main benchmark path.\n\n## Query Model\n\nThe extension exposes three layers of query surface:\n\n1. Canonical exact BM25 APIs\n   - `psql_bm25s_query_ids(...)`\n   - `psql_bm25s_query_tokens(...)`\n2. SQL convenience retrieval\n   - `psql_bm25s_query(...)`\n   - `psql_bm25s_query_prepared(...)`\n3. Planner\u002Foperator integration\n   - `@@`\n   - `\u003C=>`\n\nThe canonical exact BM25 contract is defined by rowset retrieval APIs\nthat return `psql_bm25s_result_hit` rows. They are the main benchmark path and\nthe clearest semantic surface.\n\nImportant semantic boundaries:\n\n- `@@` is a document-match predicate, not a ranking API\n- `\u003C=>` aligns with true BM25 ordering only when PostgreSQL executes a\n  real `psql_bm25s` index scan\n- raw-query retrieval is exact for the supported surface, but grouped\n  boolean and phrase queries may use bounded heap verification and are\n  intentionally slower than simple array-based retrieval\n\n## Hybrid BM25\u002FVector Fusion\n\nFor RAG-style ranking, `psql_bm25s` can combine BM25 candidates, vector\ncandidates, and other ranked SQL candidate sources into one weighted top-k\ninside PostgreSQL. 
The fusion path is C-backed and keeps vector support\noptional: `pgvector`, VectorChord, or another vector extension can own vector\nretrieval, while `psql_bm25s` owns normalization, weighting, de-duplication,\nand final ordering.\n\nUse [Hybrid Vector\u002FBM25 Search](docs\u002Fhybrid-search.md) for query examples and\n[Hybrid Fusion Engine](docs\u002Fhybrid-fusion-engine.md) for implementation\nboundaries, performance guidance, and validation coverage.\n\n## Mutable-Workload Index Design\n\n`psql_bm25s` extends the original static `bm25s` storage idea with\nPostgreSQL-native mutable-index maintenance. The goal is to keep BM25\nsemantics stable while making `INSERT`, `UPDATE`, `DELETE`, `VACUUM`,\nrestart, crash recovery, and physical replication operationally manageable.\n\nThe public maintenance switch is `consistency`:\n\n- `realtime`: default strong-freshness behavior for ordinary mutable tables\n- `eventual`: query-first behavior for large knowledge bases that can\n  tolerate short-term stale BM25 results while maintenance converges\n- `manual`: explicit-refresh behavior for static corpora, benchmarks, and\n  externally scheduled maintenance\n\nThe implementation records maintenance debt on the index, batches repeated\nwrites, supports bounded overlays where appropriate, follows PostgreSQL's\nnormal delete-cleanup lifecycle, and exposes automatic worker plus explicit\nmaintenance helpers for operators.\n\nFor details:\n\n- [Index Policy](docs\u002Findex-policy.md) explains the three consistency modes,\n  write\u002Fquery behavior, maintenance helpers, and scheduler guidance.\n- [Shared Generation Cache](docs\u002Fshared-generation-cache.md) explains the\n  shared immutable cache tiers for large connection-pool deployments,\n  including the zero-configuration DSM path and the optional\n  `shared_preload_libraries` arena.\n- [Index Parameters](docs\u002Findex-parameters.md) is the compact reference for\n  `CREATE INDEX ... 
WITH (...)` reloptions.\n- [Architecture and Design](docs\u002Farchitecture-and-design.md) summarizes the\n  storage and access-method design behind mutable indexes.\n- [Online Maintenance Future Plan](docs\u002Fonline-maintenance-future-plan.md)\n  records the larger future direction for generationed publish semantics.\n\n## Durability and Replication\n\nThe current implementation is a native PostgreSQL index relation.\n\nThat means:\n\n- index pages live in ordinary PostgreSQL index storage\n- writes are WAL-logged\n- the index survives restart and crash recovery\n- the index participates in physical replication like other PostgreSQL\n  indexes\n\nLogical replication follows normal PostgreSQL behavior:\n\n- table rows replicate\n- index relations do not replicate as logical data objects\n- indexes should be created or rebuilt on the subscriber side\n\n## Documentation\n\nCheck these links for detailed documentation:\n\n- [Documentation Root](docs\u002F)\n- [Contribution](docs\u002Fcontribution.md)\n- [Architecture and Design](docs\u002Farchitecture-and-design.md)\n- [Technical Report](docs\u002Ftechnical-report.md)\n- [API Reference](docs\u002Fapi-reference.md)\n- [Functions](docs\u002Ffunctions.md)\n- [Index Parameters](docs\u002Findex-parameters.md)\n- [Supported Input Types](docs\u002Finput-types.md)\n- [Hybrid Vector\u002FBM25 Search](docs\u002Fhybrid-search.md)\n- [Hybrid Vector\u002FBM25 Search Use-Case Design](docs\u002Fhybrid-vector-bm25-use-case-design.md)\n- [Hybrid Fusion Engine](docs\u002Fhybrid-fusion-engine.md)\n- [Query Semantics](docs\u002Fquery-semantics.md)\n- [Multi-Field Search](docs\u002Fmulti-field-search.md)\n- [Field-Aware Indexes](docs\u002Ffield-aware-indexes.md)\n- [Index Policy](docs\u002Findex-policy.md)\n- [Shared Generation Cache](docs\u002Fshared-generation-cache.md)\n- [Online Maintenance Future Plan](docs\u002Fonline-maintenance-future-plan.md)\n- [Performance and Benchmarks](docs\u002Fperformance\u002FREADME.md)\n- [Testing and 
Validation](docs\u002Ftesting-and-validation.md)\n\n## Future Work\n\nThe remaining roadmap is broader project work rather than another round of\nlocal collector tweaks:\n\n1. deeper storage and representation work for lower read-path overhead\n2. stronger heap and `text[]` fetch redesign, especially for phrase- and\n   verification-heavy paths\n3. more architectural query-state reuse where it can be made exact and\n   stable\n4. stronger differential fuzzing and randomized correctness checks\n5. prepared and index-bound query values\n6. broader filtered ranked scans and clearer top-k diagnostics\n7. better native APIs for pretokenized corpora\n8. richer tokenizer and normalization profiles\n9. longer-run maintenance policy tuning and workload replay\n10. richer presentation helpers beyond token-stream output\n\nSee also:\n\n- [Online Maintenance Future Plan](docs\u002Fonline-maintenance-future-plan.md)\n","`psql_bm25s` is a BM25-family lexical retrieval extension designed for PostgreSQL. It provides corpus-statistics-based ranking and exact top-k retrieval, supports index creation over several text types (such as `text`, `varchar`, and `text[]`), and can handle multicolumn fusion indexes. The extension also offers field-aware multicolumn indexes with query-time field weights, text preprocessing (including tokenization and normalization), and highlight and snippet helpers. It is particularly suited to applications with frequent inserts, updates, and deletes, such as content management systems or search engine backends. Automatic maintenance keeps the index efficient and consistent under dynamic workloads, and compatibility with PostgreSQL physical replication supports data safety and reliability.","2026-06-11 03:54:39","CREATED_QUERY"]