[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-11084":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":12,"contributorsCount":14,"subscribersCount":14,"size":14,"stars1d":14,"stars7d":14,"stars30d":15,"stars90d":14,"forks30d":14,"starsTrendScore":14,"compositeScore":16,"rankGlobal":9,"rankLanguage":9,"license":17,"archived":18,"fork":18,"defaultBranch":19,"hasWiki":20,"hasPages":18,"topics":21,"createdAt":9,"pushedAt":9,"updatedAt":34,"readmeContent":35,"aiSummary":36,"trendingCount":14,"starSnapshotCount":14,"syncStatus":12,"lastSyncTime":37,"discoverSource":38},11084,"leanctx","jia-gao\u002Fleanctx","jia-gao","Drop-in prompt compression for production LLM apps. Cut your token bill 40-60% without changing your code. Python SDK, LLMLingua-2, MIT.",null,"Python",309,2,30,0,165,1.43,"MIT License",false,"main",true,[22,23,24,25,26,27,28,29,30,31,32,33],"anthropic","cost-optimization","gemini","langchain","langgraph","llm","llm-inference","llmlingua","openai","prompt-compression","python","rag","2026-06-12 02:02:29","# leanctx\n\n**Drop-in prompt compression for production LLM applications.**\nCut your input-token bill by 40–60% — without changing your code.\n\n```python\n# before\nfrom openai import OpenAI\n\n# after\nfrom leanctx import OpenAI  # same interface, compressed requests\n```\n\nOn the public **[LongBench v2](https:\u002F\u002Flongbench2.github.io\u002F)** leaderboard's short subset, leanctx-Lingua **doubles accuracy** versus naive head+tail truncation (40 % vs 20 %) while removing **57 % of tokens**. Open-source models, runs locally, MIT-licensed. 
Your prompts and user data never leave your infrastructure by default.\n\n[![PyPI](https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fv\u002Fleanctx)](https:\u002F\u002Fpypi.org\u002Fproject\u002Fleanctx\u002F)\n[![Python](https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fpyversions\u002Fleanctx)](https:\u002F\u002Fpypi.org\u002Fproject\u002Fleanctx\u002F)\n[![License](https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fl\u002Fleanctx)](LICENSE)\n\n---\n\n## Quickstart (60 seconds)\n\n```bash\npip install 'leanctx[openai,lingua]'    # or [anthropic], [gemini]\n```\n\n```python\nfrom leanctx import OpenAI\n\nclient = OpenAI(\n    leanctx_config={\n        \"mode\": \"on\",\n        \"trigger\": {\"threshold_tokens\": 2000},\n        \"routing\": {\"prose\": \"lingua\"},  # route prose through LLMLingua-2\n    },\n)\n\nresponse = client.chat.completions.create(\n    model=\"gpt-4o-mini\",\n    max_tokens=512,\n    messages=[{\"role\": \"user\", \"content\": LONG_DOCUMENT}],\n)\n\nprint(response.usage.leanctx_tokens_saved)  # e.g. 1841\nprint(response.usage.leanctx_ratio)         # e.g. 0.49\n```\n\n> **First Lingua call loads ~1.2 GB of model weights** to `~\u002F.cache\u002Fhuggingface\u002F`. Subsequent calls reuse the cache. 
Add `pip install 'leanctx[lingua]'` to opt in; without it, leanctx falls back to passthrough.\n\nVerify the install with no API key needed:\n\n```bash\nleanctx bench list                                   # 7 registered scenarios\nleanctx bench run agent-structural --workload agent  # 5 invariants enforced, exit 0 = pass\n```\n\n## Why this exists\n\nYou're building a production LLM app and your token bill is a line item:\n\n- RAG apps with large retrieved documents\n- Long-running conversational agents (LangChain \u002F LangGraph \u002F CrewAI)\n- Document-processing pipelines\n- Coding agents — Cursor-like \u002F Claude-Code-like, with growing tool-call histories\n\nExisting options have gaps:\n\n- **Provider prompt caching** (Anthropic \u002F OpenAI \u002F Gemini) wins on stable prefixes — system prompts, tool definitions, retrieved-document pools. **It doesn't help with dynamic per-query content** (chat history, freshly retrieved docs, tool outputs). Compose with leanctx, don't choose between them.\n- **Naive truncation** drops the middle of the document, exactly where many answers live. The LongBench v2 numbers above show this concretely.\n- **Hosted compression APIs** (Compresr, Token Company) require sending your context to their servers. Closed-source models. leanctx is MIT-licensed, runs the model locally, and never makes outbound calls except to your existing provider.\n\n## Real numbers\n\n### Public benchmark — LongBench v2 (Tsinghua KEG, 503 questions, 8K–2M words)\n\n15-item short-subset ablation, Claude Haiku 4.5 eval, 20K head+tail truncation cap (rate-limit-friendly). Same model, same questions, same truncation across all three conditions — so the comparison is apples-to-apples. 
Full 503-item sweep is on the v0.3.x roadmap.\n\n| Method | Accuracy | Tokens kept | Reproduce |\n|---|---:|---:|---|\n| Baseline (head+tail truncation only) | 20.0 % (3\u002F15) | 100 % of 20K cap | `leanctx bench run longbench-v2` |\n| **leanctx Lingua** (ratio=0.5) | **40.0 % (6\u002F15)** | **43 %** | `LEANCTX_LBV2_COMPRESSOR=lingua leanctx bench run longbench-v2` |\n| leanctx SelfLLM (Haiku, ratio=0.3) | 26.7 % (4\u002F15) | 1.4 % | `LEANCTX_LBV2_COMPRESSOR=selfllm leanctx bench run longbench-v2` |\n\nLingua **doubles** the baseline accuracy while removing 57 % of tokens. Naive head+tail truncation drops the middle; Lingua's extractive token classifier keeps answer-bearing tokens distributed across the full document. Per-question records: [`docs\u002Fblog\u002Fdata\u002Flbv2-2026-05-03\u002F`](docs\u002Fblog\u002Fdata\u002Flbv2-2026-05-03\u002F).\n\n### Internal benchmark — coding-agent transcript\n\nA realistic 9-message agent transcript — user question, file reads, grep, log dumps, failed edit, error trace — totaling ~2.1K tokens. 
Run through `leanctx.Anthropic` with content-aware routing (code → verbatim, errors → verbatim, prose → Lingua):\n\n| Metric | Before | After | Reduction |\n|---|:-:|:-:|:-:|\n| Tokens | 2148 | 1384 | **35.6 %** |\n| Tokens saved per request | | | **764** |\n\n**What got preserved verbatim** (asserted programmatically by the `agent-structural` bench scenario):\n- A 2 KB Python source file inside a `tool_result` block — byte-identical\n- A Python traceback in an `is_error` tool result — byte-identical\n- Every `tool_use_id` and the `name` \u002F `input` of every `tool_use` block — tool linkage and tool-call payloads untouched\n- `edit_file`'s `new_str` argument — the actual code edit isn't rewritten\n\n**What actually compressed:**\n- A 3.4 KB log dump shrank to 1.9 KB (45 % reduction) — the legitimate compression target\n- Grep results and prose reasoning blocks shrank by 30–50 %\n\nReproduce: `leanctx bench run agent-structural --workload agent` — runs the real LLMLingua-2 model, ~30 s on Apple Silicon, no API key required. Status flips to `failure` with named invariants if any regress; CI-gateable.\n\n### SelfLLM cross-provider comparison\n\nSame 1.7 KB SRE-incident document through `SelfLLM` against each provider's cheapest tier:\n\n| Provider  | Model              | Compression | Latency   | Cost per call |\n|-----------|--------------------|:-----------:|:---------:|:-------------:|\n| Anthropic | `claude-haiku-4-5` | **41.6 %**  | 3.05 s    | ~$0.0016      |\n| OpenAI    | `gpt-4o-mini`      | **49.1 %**  | 6.42 s    | ~$0.0003      |\n| Gemini    | `gemini-2.5-flash` | **48.7 %**  | **2.25 s** ⚡ | ~$0.0001      |\n\nAll three preserved every timestamp, metric value, and action item with no hallucination. 
Combined with `Lingua` (LLMLingua-2 local) hitting **44.7 %** char reduction on the same document at zero marginal cost, leanctx covers the full speed\u002Fcost\u002Fquality trade-off space.\n\nFull methodology, per-provider output samples, cost analysis, bugs found in flight: [`docs\u002Fbenchmarks\u002F`](docs\u002Fbenchmarks\u002F).\n\n## How it works\n\nleanctx wraps your existing SDK call and applies a configurable compression pipeline before the request hits the wire.\n\n```\nyour code\n   ↓\nleanctx.Anthropic \u002F OpenAI \u002F Gemini    ← drop-in wrapper\n   ↓\nMiddleware (mode=on\u002Foff, threshold)\n   ↓\nPer-message pipeline:\n   classify (code | error | prose | …)\n        ↓\n   route to compressor:\n        Verbatim  — never touch (code, errors, tool calls)\n        Lingua    — LLMLingua-2 local, free marginal cost\n        SelfLLM   — your configured LLM (Anthropic\u002FOpenAI\u002FGemini), highest quality\n   ↓\nreal Anthropic \u002F OpenAI \u002F Gemini SDK → API\n```\n\nTwo layers of config:\n\n- **`mode`** — `\"on\"` to compress, `\"off\"` to passthrough. 
Off is safe to leave deployed.\n- **`routing`** — maps content types (code \u002F error \u002F prose \u002F unknown \u002F long_important) to compressors (verbatim \u002F lingua \u002F selfllm).\n\nA fully-loaded production config:\n\n```python\nfrom leanctx import OpenAI\n\nclient = OpenAI(leanctx_config={\n    \"mode\": \"on\",\n    \"trigger\": {\"threshold_tokens\": 2000},  # don't bother below this\n    \"routing\": {\n        \"code\":           \"verbatim\",   # never touch code\n        \"error\":          \"verbatim\",   # never touch stack traces\n        \"prose\":          \"lingua\",     # local LLMLingua-2\n        \"long_important\": \"selfllm\",    # cheap LLM summarization\n    },\n    \"lingua\":  {\"ratio\": 0.5, \"device\": \"cpu\"},\n    \"selfllm\": {\"model\": \"gpt-4o-mini\", \"api_key\": \"sk-...\", \"ratio\": 0.3},\n    \"observability\": {\"otel\": True},     # opt-in OpenTelemetry\n})\n```\n\n## Compose with provider caching\n\nleanctx is **complementary** to Anthropic \u002F OpenAI \u002F Gemini prompt caching, not competitive:\n\n- **Provider caching wins** on stable prefixes: system prompts, tool definitions, retrieved-document pools that don't change across requests. Up to 90 % discount on cached reads.\n- **leanctx wins** on dynamic per-query content: chat history, freshly retrieved docs, tool outputs, log dumps that vary every call.\n- **They compose.** Mark your stable prefix with `cache_control` (provider-specific) and let leanctx compress the variable suffix. Both savings stack.\n\nThe OTel telemetry leanctx emits includes a `provider` label that you can correlate with provider-side cache-hit metrics in the same dashboard.\n\n## Observability (v0.3)\n\nleanctx emits OpenTelemetry spans + metrics for every compression call, opt-in via `leanctx_config[\"observability\"][\"otel\"]`. The library is **API-only**: it never owns the OTel SDK or registers providers. 
The application configures OTel; leanctx emits.\n\n```python\nclient = leanctx.Anthropic(\n    leanctx_config={\n        \"mode\": \"on\",\n        \"observability\": {\"otel\": True},\n    },\n)\n```\n\nEach wrapper-routed call produces one root `leanctx.compress` span (provider, method, input_tokens, output_tokens, cost_usd, duration_ms) plus per-compressor child spans. Five metrics — 4 counters + 1 histogram — labeled by `provider`\u002F`method`\u002F`status`. Closed `leanctx.method` taxonomy: `passthrough` | `below-threshold` | `empty` | `opaque-bailout` | `verbatim` | `lingua` | `selfllm` | `hybrid`.\n\nSee [`docs\u002Fobservability.md`](docs\u002Fobservability.md) for the full attribute reference, stream-lifetime contract, app-side OTel SDK setup, and cardinality guidance.\n\n## Reproducible benchmarks (v0.3)\n\nThe `leanctx bench` CLI ships seven named scenarios with versioned JSON output (`schema_version: \"1\"`):\n\n```bash\nleanctx bench list                                  # show registered scenarios\nleanctx bench run lingua-local --workload rag       # offline, no API key\nleanctx bench run agent-structural --workload agent # 5 invariants enforced\nleanctx bench run anthropic-e2e --workload chat     # full stack, respx-mocked\nleanctx bench run selfllm-anthropic --workload rag  # live API, set ANTHROPIC_API_KEY\nleanctx bench run longbench-v2 --workload rag       # public LongBench v2 ablation\n```\n\nVersioned schema, multi-run isolation (`--runs N` constructs fresh client\u002Fmiddleware each run), clean diagnostics for missing extras \u002F API keys (exit 3, no traceback). 
Built so downstream tooling can consume the JSON without breaking on schema changes.\n\n## Install\n\n```bash\npip install leanctx                              # core (passthrough only — useful for testing the wrapper)\npip install 'leanctx[anthropic,openai,gemini]'   # provider SDKs\npip install 'leanctx[lingua]'                    # + LLMLingua-2 local compression (~1.2 GB on first call)\npip install 'leanctx[otel]'                      # + OpenTelemetry API\u002FSDK\npip install 'leanctx[bench]'                     # + respx for offline scenarios\npip install 'leanctx[longbench]'                 # + HuggingFace datasets for LongBench v2\npip install 'leanctx[all]'                       # everything\n```\n\nDocker:\n\n```bash\ndocker build -t leanctx:slim .                             # 341 MB, all provider SDKs\ndocker build -t leanctx:lingua --build-arg LINGUA=true .   # + LLMLingua-2, ~3 GB\n```\n\n## Supported providers\n\n| Provider | Drop-in client | Streaming | Compression | SelfLLM target |\n|---|:-:|:-:|:-:|:-:|\n| Anthropic | `leanctx.Anthropic` \u002F `AsyncAnthropic` | ✅ | ✅ | ✅ |\n| OpenAI    | `leanctx.OpenAI` \u002F `AsyncOpenAI` | ✅ | ✅ | ✅ |\n| Gemini    | `leanctx.Gemini` (`.models` + `.aio.models`) | ✅ | ✅ \\* | ✅ |\n\n\\* **Gemini text-only requests compress fully.** Requests that include `function_call`, `function_response`, or multimodal (`inline_data`) parts automatically bail out to passthrough — leanctx never rewrites tool-call payloads (would change tool semantics) and doesn't touch images. Multimodal + function-call compression is on the v0.3.x roadmap. Spans for these calls carry `leanctx.method = opaque-bailout` so you can monitor the share.\n\n12 wrapper request paths instrumented (sync + async × stream + non-stream × 3 providers). 
Stream-path span lifetime closes at the first of: iterator exhaustion, explicit `.close()`, or `__del__` GC backstop — `duration_ms` covers the full stream lifetime.\n\n## Status\n\n[`v0.3.1`](https:\u002F\u002Fgithub.com\u002Fjia-gao\u002Fleanctx\u002Freleases\u002Ftag\u002Fv0.3.1) is on PyPI. Built across a 5-round Codex-reviewed RLCR loop; 257 tests passing, ruff + mypy --strict clean across 40 source files.\n\n## Roadmap\n\n- [x] **v0.1** — Python SDK, drop-in wrappers, LLMLingua-2 + SelfLLM (Anthropic), classifier, router, dedup + purge-errors strategies, LangChain helpers, Docker\n- [x] **v0.2** — SelfLLM on OpenAI + Gemini, block-aware compression (tool_use \u002F tool_result preserved), Gemini contents normalization, LCEL `compress_runnable`\n- [x] **v0.3** — OpenTelemetry observability across 12 wrapper paths, `leanctx bench` CLI (6 scenarios + versioned schema), `agent-structural` invariant enforcement, [public release `v0.3.1`](https:\u002F\u002Fpypi.org\u002Fproject\u002Fleanctx\u002F) — 2026-04-26\n- [ ] **v0.3.x** — full 503-item LongBench v2 sweep, ghcr.io Docker publish, OpenAI Responses-API intercept, multimodal + function-call compression for Gemini, LlamaIndex helpers, TypeScript SDK compression port\n- [ ] **v0.4** — per-tenant attribution (with cardinality cap), Helm chart \u002F K8s sidecar, stateful session dedup with explicit session IDs\n\n## License\n\nMIT. See [LICENSE](LICENSE).\n","leanctx is a prompt compression tool for production LLM applications that helps users cut input-token costs by 40-60% without modifying their code. Its core features include request compression through a simple Python SDK integration and support for LLMLingua-2, which improves accuracy while substantially reducing token usage. The project is especially suited to applications that process large volumes of text, such as retrieval-augmented generation applications, long-running conversational agents, document-processing pipelines, and coding assistants with growing tool-call histories. leanctx offers an open-source, locally run solution that by default keeps users' prompts and data inside their own infrastructure, which improves security and privacy.","2026-06-11 03:31:10","CREATED_QUERY"]