[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-76353":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":15,"stars7d":16,"stars30d":17,"stars90d":15,"forks30d":15,"starsTrendScore":18,"compositeScore":19,"rankGlobal":10,"rankLanguage":10,"license":10,"archived":20,"fork":20,"defaultBranch":21,"hasWiki":22,"hasPages":20,"topics":23,"createdAt":10,"pushedAt":10,"updatedAt":44,"readmeContent":45,"aiSummary":46,"trendingCount":15,"starSnapshotCount":15,"syncStatus":18,"lastSyncTime":47,"discoverSource":48},76353,"semble_rs","johunsang\u002Fsemble_rs","johunsang","Fast, AI-agent-native code search in Rust — hybrid BM25 + semantic, Tree-sitter AST chunking, dependency & impact analysis. Drop-in replacement for grep\u002Fcat\u002Fread\u002Fls in Claude Code, Codex, Cursor, Aider, OpenHands.","https:\u002F\u002Fgithub.com\u002Fjohunsang\u002Fsemble_rs",null,"Rust",123,16,1,0,3,35,2,3.69,false,"main",true,[24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43],"ai-agent","bm25","claude-code","cli","code-search","codex","coding-assistant","cursor","dependency-graph","developer-tools","embeddings","grep-replacement","hybrid-search","impact-analysis","llm","rag","rust","semantic-search","token-optimization","tree-sitter","2026-06-12 02:03:41","\u003C!-- Keywords: code search, semantic code search, AI agent, LLM, BM25, embeddings, tree-sitter, AST, dependency graph, impact analysis, Rust, CLI, Claude Code, Codex, Cursor, grep replacement, token reduction, potion-code, model2vec, hybrid search, RRF, build output digest, CI log compression, korean code search, 한글 코드 검색 -->\n\n\u003Ch2 align=\"center\"> semble_rs\u003Cbr\u002F> Fast and Accurate Code Search for Agents — in Rust\u003Cbr\u002F> \u003Csub>Replaces grep \u002F cat \u002F read \u002F ls and 
compresses build & CI output. Up to \u003Cb>-99%\u003C\u002Fb> tokens.\u003C\u002Fsub> \u003C\u002Fh2>\n\n\u003Cdiv align=\"center\">\n\n\u003Cp> \u003Ca href=\"https:\u002F\u002Fopensource.org\u002Flicenses\u002FMIT\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-MIT-yellow.svg\" alt=\"License: MIT\">\u003C\u002Fa> \u003Ca href=\"https:\u002F\u002Fwww.rust-lang.org\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Frust-1.75%2B-orange.svg\" alt=\"Rust\">\u003C\u002Fa> \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fplatform-macOS%20%7C%20Linux%20%7C%20Windows-blue.svg\" alt=\"Platform\"> \u003Ca href=\"#benchmarks\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fagent%20tokens-up%20to%20--99%25-brightgreen.svg\" alt=\"Token savings\">\u003C\u002Fa> \u003Ca href=\".\u002FREADME.ko.md\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%ED%95%9C%EA%B5%AD%EC%96%B4-README.ko.md-blue.svg\" alt=\"한국어\">\u003C\u002Fa> \u003C\u002Fp>\n\n\u003Cp> \u003Ca href=\"#quickstart\">Quickstart\u003C\u002Fa> • \u003Ca href=\"#search\">Search\u003C\u002Fa> • \u003Ca href=\"#tree\">Tree\u003C\u002Fa> • \u003Ca href=\"#digest\">Digest\u003C\u002Fa> • \u003Ca href=\"#dependency-graph\">Deps \u002F Impact\u003C\u002Fa> • \u003Ca href=\"#how-it-works\">How it works\u003C\u002Fa> • \u003Ca href=\"#benchmarks\">Benchmarks\u003C\u002Fa> \u003C\u002Fp>\n\n\u003C\u002Fdiv>\n\n`semble_rs` is a Rust port and superset of [MinishLab\u002Fsemble](https:\u002F\u002Fgithub.com\u002FMinishLab\u002Fsemble) built for AI coding agents. It returns the exact code chunks an agent needs, prints a token-cheap codebase tree instead of `ls -R`, and compresses 3 MB CI logs into 35 KB. One single binary, no daemon, no API keys, no GPU. 
Hybrid BM25 + [Model2Vec](https:\u002F\u002Fgithub.com\u002FMinishLab\u002Fmodel2vec) static embeddings with code-aware reranking, plus a dependency graph, AST chunking, and a `digest` pipeline for build \u002F test \u002F CI output.\n\n## Quickstart\n\n```bash\n# Install Rust if needed, then:\ngit clone https:\u002F\u002Fgithub.com\u002Fjohunsang\u002Fsemble_rs.git && cd semble_rs\ncargo install --path .\n```\n\nThe binary lands at `~\u002F.cargo\u002Fbin\u002Fsemble_rs`. On first run, the default embedding model `minishlab\u002Fpotion-code-16M` (\\~60 MB) is downloaded from HuggingFace.\n\n```bash\n# Map the codebase (replaces ls -R)\nsemble_rs tree .\u002Fmy-project --symbols\n\n# Find code by what it does (replaces grep + cat)\nsemble_rs search \"how is auth handled\" .\u002Fmy-project --outline\n\n# Compress build \u002F CI output before reading it\ncargo build 2>&1 | semble_rs digest\ngh run view \u003Cid> --log-failed | semble_rs digest\n```\n\nFor agent integration (Claude Code, Codex, Cursor), see [Agent integration](#agent-integration).\n\n## Main Features\n\n- **Fast**: indexes the local repo (22 files) in \\~150 ms, \\~10 s on 1,600 files. Static embedder — no transformer forward pass at query time.\n- **Token-efficient**: `tree` collapses `ls -R` by **4×–747×**; `--outline` is **-47%** vs full output; `digest` reaches **-98.9%** on real GitHub Actions logs.\n- **Hybrid retrieval**: BM25 + Model2Vec embeddings fused with RRF, then reranked with definition \u002F identifier-stem \u002F file-coherence boosts and noise penalties.\n- **Dependency graph**: `deps` \u002F `impact` show what a file imports, defines, and what changes if you touch it. Optional Graphviz `--dot` output.\n- **Build \u002F CI compression**: `digest` auto-detects cargo, pnpm\u002Fnpm\u002Fyarn\u002Fbun, tsc, pytest, go test, gradle, ruff, mypy, clang\u002Fgcc\u002Fcmake\u002Fmake\u002Fswiftc, GitHub Actions.\n- **Single binary**: no Python, no daemon, no API keys. 
Runs on CPU.\n\n## Search\n\n```bash\nsemble_rs search \"auth flow\" .\u002Fmy-project --outline    # pass 1: structural overview\nsemble_rs search \"loginWithEmail\" .\u002Fmy-project --compact   # pass 2: matching lines\nsemble_rs search \"save model\" https:\u002F\u002Fgithub.com\u002FMinishLab\u002Fmodel2vec   # git URL\n```\n\n`path` defaults to the current directory; git URLs are accepted (cloned shallow).\n\n### Output modes\n\n| Mode | Output | Token cost vs `--compact` | When to use |\n| --- | --- | --- | --- |\n| `--outline` | One signature line per chunk | **-47%** | First-pass structural scan |\n| `--group` | Directory grouping + match lines capped at 3 (`+N` overflow) | \\-47% | Many match lines per chunk |\n| `--compact` | Score + path + every matching line | baseline | Precision scan |\n| `--json --strip` | Chunk bodies (comments stripped) | +800% | Tooling \u002F pipeline integration |\n| `--json` | Chunk bodies (raw) | +900% | Tooling \u002F pipeline integration |\n\n**Recommended:** `--outline` to overview → `--compact` to narrow → `--json --strip` only if the chunk body itself is needed.\n\n### `find-related`\n\nGiven a `file:line` from a previous search result, returns chunks semantically similar to that location.\n\n```bash\nsemble_rs find-related src\u002Fauth.rs 42 .\u002Fmy-project\n```\n\n### `plan`\n\nWhen the agent doesn't know where to start, `plan` runs a small search and prints a recommended sequence of `--outline` \u002F `--group` \u002F `--compact` \u002F `deps` \u002F `impact` commands.\n\n```bash\nsemble_rs plan \"fix auth flow bug\" .\u002Fmy-project -k 5\n```\n\n`plan` is a guardrail, not an oracle: low-confidence candidates are leads, not facts. Skip it when the symbol or feature name is already known.\n\n### `--model`\n\nAll search-side commands accept `--model \u003Chf-repo-or-local-path>` to override the default embedder. 
Also honours the `SEMBLE_MODEL_PATH` environment variable.\n\n## Tree\n\n`semble_rs tree` prints the codebase file tree using the same gitignore-aware index as `search`. It exists because `ls -R` on a real project explodes into tens or hundreds of thousands of tokens (`.git\u002F`, `target\u002F`, `node_modules\u002F` all included). Measured on real repos:\n\n| Project | `semble_rs tree` | `ls -R` | Reduction |\n| --- | --- | --- | --- |\n| this repo (Rust + `target\u002F`) | **533 B** | 398,101 B | **747×** |\n| 6,693-file Python backend | **3,950 B** | 254,066 B | **64×** |\n| 325-file ML training repo | 838 B | 7,522 B | 9× |\n\n```bash\nsemble_rs tree                              # current directory\nsemble_rs tree -d                           # directories only\nsemble_rs tree --max-depth 2                # cap depth\nsemble_rs tree --symbols                    # append top-level symbols per file\nsemble_rs tree --lang rust,python           # filter by language\n```\n\n## Digest\n\n`semble_rs digest` collapses build \u002F test \u002F install \u002F CI output. 
Errors, file:line:col, tracebacks, panic stacks, and failed-step bodies are always preserved — only progress lines collapse to counts.\n\n```bash\ncargo build 2>&1            | semble_rs digest\npnpm install 2>&1           | semble_rs digest\npytest 2>&1                 | semble_rs digest\ngh run view \u003Cid> --log-failed | semble_rs digest\n```\n\nMeasured on 15 real-world fixtures:\n\n| Fixture | Raw → digest | Savings |\n| --- | --- | --- |\n| `cargo build` (clean, 218 crates) | 7,611 B → 59 B | **-99.2%** |\n| `cargo test` (45 passing) | 3,368 B → 369 B | \\-89.0% |\n| `pnpm install` | 1,323 B → 349 B | \\-73.6% |\n| `tsc` (13 errors, 5 codes) | 1,085 B → 648 B | \\-40.3% |\n| `pytest` (4 failures) | 2,762 B → 2,330 B | \\-15.6% |\n| **GitHub Actions log (rust-lang\u002Frust failed CI, real)** | **3.3 MB → 35 KB** | **-98.9%** ⭐ |\n| `go test` (with panic + stack) | 1,034 B → 475 B | \\-54.1% |\n| `gradle test` (2 failures) | 1,232 B → 522 B | \\-57.6% |\n| `ruff` \u002F `mypy` \u002F `clang` \u002F `cmake` \u002F `swift` | varies | \\-3% to -30% |\n| **TOTAL (15 fixtures)** | **3.33 MB → 43 KB** | **-98.7%** |\n\nAuto-detection covers cargo, pnpm\u002Fnpm\u002Fyarn\u002Fbun, tsc, pytest, go test, gradle, ruff, mypy, clang\u002Fgcc\u002Fcmake\u002Fmake\u002Fswiftc, GitHub Actions. 
Force a handler with `--format \u003Cname>`; inspect with `--show-format`.\n\n## Dependency graph\n\n```bash\nsemble_rs deps   src\u002Fauth.rs .\u002Fmy-project                  # what this file imports \u002F defines (flat)\nsemble_rs deps   src\u002Fauth.rs .\u002Fmy-project --tree           # transitive imports as ASCII tree\nsemble_rs deps   src\u002Fauth.rs .\u002Fmy-project --tree --max-depth 3\nsemble_rs deps   src\u002Fauth.rs .\u002Fmy-project --dot | dot -Tpng > deps.png\nsemble_rs impact src\u002Fauth.rs .\u002Fmy-project                  # who depends on this file (flat list)\nsemble_rs impact src\u002Fauth.rs .\u002Fmy-project --tree           # reverse-dependency tree\nsemble_rs impact src\u002Fauth.rs .\u002Fmy-project --dot | dot -Tpng > impact.png\n```\n\n`--tree` (v0.9.1+) renders forward (`deps`) or reverse (`impact`) dependencies as an ASCII tree with cycle detection (repeated nodes marked `(cycle)`) and `--max-depth N` truncation (`…`). No external tool required, agent-readable.\n\n`impact` is intended to be run before edits to a shared module to avoid surprises.\n\n### `find-pattern`\n\nThin wrapper around `ast-grep` for structural queries that semantic search can't express:\n\n```bash\nsemble_rs find-pattern 'fn $name($$$)' . --lang rust --compact\n```\n\nRequires `ast-grep` installed (`brew install ast-grep` or `cargo install ast-grep`).\n\n## Encode\n\n`semble_rs encode` exposes the embedding model as a CLI for scripting and debugging:\n\n```bash\nsemble_rs encode \"search result scoring\"            # one vector → JSON array\necho -e \"auth\\nlogin\\ntoken\" | semble_rs encode     # stdin, one sentence per line\nsemble_rs encode \"x\" --model minishlab\u002Fpotion-multilingual-128M\n```\n\n## Agent integration\n\nAppend a snippet like the following to your project-root `CLAUDE.md` or `AGENTS.md`. 
It works for Claude Code, Codex, Cursor (`.cursorrules`), Aider, and OpenHands.\n\n```markdown\n## Code search and exploration\n\nUse `semble_rs` instead of `ls -R`, `grep`, `cat`:\n\n​```bash\nsemble_rs tree . --symbols                         # codebase map (cheap)\nsemble_rs search \"\u003Cfeature or symbol>\" . --outline # pass 1\nsemble_rs search \"\u003Cfeature or symbol>\" . --compact # pass 2\nsemble_rs deps   \u003Cfile> .                          # what file imports \u002F defines\nsemble_rs impact \u003Cfile> .                          # files affected by changes\n​```\n\nCompress noisy command output before reading it:\n\n​```bash\ncargo build 2>&1 | semble_rs digest\npnpm install 2>&1 | semble_rs digest\ngh run view \u003Cid> --log-failed | semble_rs digest\n​```\n```\n\n`semble_rs savings` shows estimated tokens saved across past searches.\n\n## How it works\n\n`semble_rs` chunks every file with `tree-sitter` at function \u002F class \u002F module boundaries (line-based fallback for unsupported languages), then scores every query with two complementary retrievers: static [Model2Vec](https:\u002F\u002Fgithub.com\u002FMinishLab\u002Fmodel2vec) embeddings (default `minishlab\u002Fpotion-code-16M`) for semantic similarity, and BM25 for lexical matches on identifiers and API names. Score lists are fused with Reciprocal Rank Fusion.\n\nAfter fusion, results are reranked with code-aware signals:\n\n\u003Cdetails> \u003Csummary>\u003Cb>Ranking signals\u003C\u002Fb>\u003C\u002Fsummary>\n\n- **Adaptive weighting.** Symbol-like queries (`Foo::bar`, `_private`, `getUserById`) get more lexical weight; natural-language queries stay balanced.\n- **Definition boosts.** Chunks that define the queried symbol (a `class`, `def`, `func`, etc.) outrank chunks that merely reference it.\n- **Identifier stems.** Query tokens are stemmed and matched against identifier stems. 
Querying `parse config` boosts chunks containing `parseConfig`, `ConfigParser`, or `config_parser`.\n- **File coherence.** When multiple chunks of a file match, the file is boosted so the top result reflects file-level relevance.\n- **Sibling-chunk boost.** Chunks adjacent to a top hit get a small boost — definitions and their helpers usually cluster.\n- **Dependency boost.** Chunks in files imported by a top hit get boosted so call-chain context surfaces.\n- **Noise penalties.** Test files, `compat\u002F` \u002F `legacy\u002F` shims, example code, and `.d.ts` declaration stubs are down-ranked so canonical implementations surface first.\n\n\u003C\u002Fdetails>\n\nThe embedder is fully static (vocab embedding lookup → mean pool → SIF weighting → L2 normalize). All of this runs in milliseconds on CPU.\n\n## Benchmarks\n\n### Retrieval quality — 100-query benchmark (this repo)\n\n100 hand-labelled queries across 5 categories: exact symbol names, natural-language feature descriptions, scenarios, acronyms, and Korean queries. 
Default model `minishlab\u002Fpotion-code-16M`.\n\n| Metric | Score |\n| --- | --- |\n| Recall@1 | 70% |\n| Recall@5 | 90% |\n| Recall@10 | 95% |\n| MRR | 0.78 |\n| Median latency | 150 ms \u002F query (cold) |\n\n| Category | n | R@1 | R@5 | R@10 | MRR |\n| --- | --- | --- | --- | --- | --- |\n| exact_symbol | 30 | 93% | 100% | 100% | 0.96 |\n| nl_feature | 40 | 75% | 98% | 100% | 0.83 |\n| scenario | 10 | 70% | 100% | 100% | 0.77 |\n| acronym | 10 | 50% | 70% | 70% | 0.56 |\n| korean | 10 | 10% | 60% | 80% | 0.27 |\n\nQuery set: `docs\u002Feval_set_100.json` · per-miss analysis: `docs\u002Fbenchmark_100.md`.\n\n### Indexing and query latency by repo size\n\nThe index is rebuilt every run (no persistent cache).\n\n| Repo size (code files) | Indexing + first query |\n| --- | --- |\n| 22 (this repo) | **\\~0.15 s** |\n| 57–120 | \\~0.3–0.7 s |\n| 1,600 | \\~10 s |\n\n`digest` is independent of repo size: 3.3 MB CI log → 35 KB in **\\~20 ms**.\n\n### Token efficiency vs native shell tools\n\nMeasured on real projects:\n\n| Operation | `semble_rs` | Native | Reduction |\n| --- | --- | --- | --- |\n| **Codebase map** (this repo) | `tree` **533 B** | `ls -R` 398 KB | **747×** |\n| **Codebase map** (6,693-file Python backend) | `tree` **3,950 B** | `ls -R` 254 KB | **64×** |\n| **Codebase map** (325-file Python repo) | `tree` 838 B | `ls -R` 7,522 B | 9× |\n| **Code chunk lookup** (`--outline` vs `--compact`) | \\-47% | baseline | \\-47% |\n| **Build log** (`cargo build` clean) | `digest` 59 B | raw 7,611 B | **-99.2%** |\n| **CI failure log** (real GitHub Actions, rust-lang\u002Frust) | `digest` 35 KB | raw 3.3 MB | **-98.9%** ⭐ |\n| **15-fixture aggregate** | `digest` 43 KB | raw 3.33 MB | **-98.7%** |\n\n> Agents using `grep + cat + ls -R` spend most of their context window on irrelevant code and noise. 
`semble_rs` returns only what matters and compresses the rest.\n\n## Supported languages\n\n| Language | Search | AST chunking | Dependency graph |\n| --- | --- | --- | --- |\n| Rust | ✓ | ✓ | ✓ |\n| Python | ✓ | ✓ | ✓ |\n| JavaScript \u002F TypeScript | ✓ | ✓ | ✓ |\n| Go | ✓ | ✓ | ✓ |\n| Java | ✓ | ✓ | ✓ |\n| C \u002F C++ | ✓ | ✓ | ✓ |\n| Kotlin | ✓ | ✓ | ✓ |\n| Ruby | ✓ | ✓ | ✓ |\n| PHP | ✓ | ✓ | ✓ |\n| Swift | ✓ | ✓ | ✓ |\n| HTML \u002F CSS \u002F Vue \u002F Svelte | ✓ | line-based | partial |\n| Other | ✓ | line-based | — |\n\n## License\n\nMIT\n\n## Acknowledgements\n\n- [MinishLab\u002Fsemble](https:\u002F\u002Fgithub.com\u002FMinishLab\u002Fsemble) — original Python implementation by Stéphan Tulkens and Thomas van Dongen. `semble_rs` is a Rust port + superset of their work.\n- [Model2Vec](https:\u002F\u002Fgithub.com\u002FMinishLab\u002Fmodel2vec) and [model2vec-rs](https:\u002F\u002Fgithub.com\u002FMinishLab\u002Fmodel2vec-rs) — static distillation framework powering the embedder.\n- Embedding model: `minishlab\u002Fpotion-code-16M`.","`semble_rs` is a fast and accurate code search tool written in Rust, designed for AI agents. It combines BM25 with semantic search, uses Tree-sitter for AST chunking, and provides dependency and impact analysis. It can replace the traditional grep, cat, read, and ls commands, and is particularly suited to settings that need efficient codebase retrieval, such as AI-assisted coding environments like Claude Code, Codex, and Cursor. Its core features include fast indexing (about 10 seconds on 1,600 files) and efficient token compression (tree output is 4 to 747 times smaller than ls -R), and it runs without a GPU or API keys. It also supports log compression for build and CI output, greatly reducing the time and resources spent on data processing.","2026-06-11 03:54:58","CREATED_QUERY"]