[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-79915":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":15,"stars7d":16,"stars30d":17,"stars90d":15,"forks30d":15,"starsTrendScore":14,"compositeScore":18,"rankGlobal":10,"rankLanguage":10,"license":19,"archived":20,"fork":20,"defaultBranch":21,"hasWiki":22,"hasPages":20,"topics":23,"createdAt":10,"pushedAt":10,"updatedAt":39,"readmeContent":40,"aiSummary":41,"trendingCount":15,"starSnapshotCount":15,"syncStatus":42,"lastSyncTime":43,"discoverSource":44},79915,"anansi","mdowis\u002Fanansi","mdowis","A self-healing web scraper built for hostile sites: selectors repair themselves, browser rendering kicks in when needed, and Chrome TLS fingerprinting evades bot detection. Ships with an MCP server so any LLM can drive a full crawl through conversation.","",null,"Python",93,17,1,0,3,5,3.77,"Apache License 2.0",false,"main",true,[24,25,26,27,28,29,30,31,32,33,34,35,36,37,38],"adaptive-scraping","ai-agent","anti-bot","crawler","data-extraction","llm-tools","mcp","mcp-server","pydantic","python","self-healing","stealth-browser","tls-fingerprint","web-scraper","web-scraping","2026-06-12 02:03:55","\u003Cimg src=\"https:\u002F\u002Frepository-images.githubusercontent.com\u002F1238896536\u002Fd711cc76-8358-4a4a-9160-341131498877\">\n\n> *The spider that learns.*\n\nEvery web scraper starts working. The question is how long before it breaks.\n\n**Anansi is built on a different assumption: the web is adversarial and unstable, and your scraper should handle that without your involvement.**\n\nWhen a site changes its layout, Anansi finds the data anyway and remembers the fix: CSS selectors are scored by confidence and healed automatically. When a page needs a browser to render, it switches to one silently. 
When bot detection gets in the way, it mimics Chrome's TLS fingerprint at the network level, the layer most scrapers never think about. When you re-crawl, unchanged pages are skipped before a request is even made. When extraction goes wrong, Pydantic validation catches it immediately instead of letting garbage accumulate in your database.\n\nThe result: a crawler that handles hostile sites, survives redesigns, and gets better the longer it runs. Ships with an **MCP server** so any LLM can drive a full crawl through a conversation.\n\n---\n\n## Capabilities\n\n| | |\n|---|---|\n| **Self-healing parser** | CSS selectors are stored with confidence scores. When one breaks, four healing strategies run — fuzzy class matching, text-pattern regex, structural context, XPath fallback — and the winner is persisted for next time. |\n| **Structured data extraction** | JSON-LD, Open Graph, and Microdata are extracted from every page automatically. Fields matched in schema.org markup skip CSS evaluation entirely — they're more stable and require no selector maintenance. |\n| **TLS \u002F HTTP-2 fingerprint mimicry** | Enterprise bot-detection (Cloudflare, Akamai, DataDome) fingerprints your TLS ClientHello *and* HTTP\u002F2 SETTINGS\u002Fframe ordering before inspecting a single header. With `impersonate=\"chrome124\"`, Anansi uses curl-cffi to reproduce both, plus per-host session warm-up and a graduated Akamai-block escalation ladder. Install the `tls` extra (see [Install](#install)); operator-gated, authorized use only. |\n| **Auto browser upgrade** | Every HTTP response is checked for SPA markers, noscript redirects, and suspiciously low text density. JS shells trigger a silent retry with a stealth Playwright browser. The decision is cached per domain for the crawl session. 
|\n| **Anti-bot & Cloudflare bypass** | The browser fetcher removes `webdriver` fingerprints, spoofs plugins, hardware concurrency, audio context, font measurements, battery API, and touch points, adds canvas\u002FWebGL noise, auto-dismisses GDPR\u002Fcookie consent banners, and waits out Cloudflare Turnstile challenges automatically. |\n| **Adaptive rate limiting** | A per-domain sliding window tracks error rates. A 429 immediately doubles the request gap and activates a circuit breaker. Sustained 5xx errors increase the gap further. Clean windows slowly decay back toward the base delay. |\n| **Incremental crawling** | ETag, Last-Modified, and content MD5 are stored per URL. Re-crawls send conditional GET headers — 304 responses skip parsing entirely, and hash comparison catches changes even without server-side ETag support. Sitemap `\u003Clastmod>` dates are used for a pre-flight filter that skips unchanged pages before a network request is even made. |\n| **URL canonicalization** | Tracking parameters (`utm_*`, `fbclid`, `gclid`, and 25 others) are stripped before URLs enter the queue. Remaining parameters are sorted and fragments removed — so `?utm_source=twitter` and `?utm_source=facebook` are the same crawl target. |\n| **Item validation** | Set `item_schema = MyPydanticModel` on a Spider and every yielded item is validated before persistence. Type coercion is automatic (`\"49.99\"` → `49.99`). Invalid items carry a `_validation_errors` key; valid\u002Finvalid counts and error rate appear in live crawl metrics. |\n| **Concurrent crawler** | Pure asyncio, semaphore-gated workers, SQLite-backed URL queue. Crawls survive process restarts. Pause mid-run and resume days later with `Crawler.resume(crawl_id, MySpider)`. |\n| **Proxy rotation** | HTTP\u002FHTTPS\u002FSOCKS5 with round-robin, random, or least-used strategies. Failed proxies are auto-quarantined and retested in the background. 
|\n| **MCP server** | FastMCP server exposes 17 scraping tools — fetch, extract, crawl, screenshot, train\u002Fvalidate selectors, cancel, cache control, and more — so any LLM or tool-calling agent can drive a full crawl through a conversation. |\n\nAlso includes: JS interaction (click, fill, scroll, infinite-scroll loop, wait), network request interception (capture JSON API responses from SPAs), robots.txt compliance, sitemap discovery, content deduplication, auth\u002Fcookie support, configurable retries with `Retry-After` support, CSV\u002FJSON\u002FJSONL export.\n\n---\n\n## Install\n\nThe distribution name is `anansi-scraper`; the import package is `anansi`. It\nis installed from this Git repository (not yet published to PyPI), so the\noptional extras use pip's `extras @ git+URL` syntax:\n\n```bash\n# Core install\npip install \"git+https:\u002F\u002Fgithub.com\u002Fmdowis\u002Fanansi\"\n\n# For browser-based fetching (Cloudflare bypass, JS rendering):\nplaywright install chromium\n\n# With the TLS-fingerprint-mimicry extra (curl-cffi impersonation):\npip install \"anansi-scraper[tls] @ git+https:\u002F\u002Fgithub.com\u002Fmdowis\u002Fanansi\"\n\n# With the OpenAI \u002F ChatGPT Agents SDK extra:\npip install \"anansi-scraper[openai] @ git+https:\u002F\u002Fgithub.com\u002Fmdowis\u002Fanansi\"\n```\n\nOnce installed, the MCP server is available as the `anansi-mcp` console script\nor via `python -m anansi.mcp_server.server`, and the CLI as `anansi`.\n\n**Windows:** `pip` is often not on PATH. Use `py -m pip install ...` instead. 
If `py` isn't found either, download Python from [python.org](https:\u002F\u002Fpython.org) and check **\"Add Python to PATH\"** during setup.\n\n---\n\n## How it works\n\n### Extraction pipeline\n\n```\n    Per field:\n         │\n    ┌────▼──────────────────────────┐\n    │  Structured data pre-pass     │  JSON-LD \u002F Open Graph \u002F Microdata\n    │                               │  matched fields skip all CSS work\n    └────┬──────────────────────────┘\n         │ field not in structured data\n    ┌────▼──────────────────────────┐\n    │  Try known selectors          │  ordered by confidence score (SQLite)\n    │  Try primary selector         │\n    └────┬──────────────────────────┘\n         │ all fail\n    ┌────▼──────────────────────────┐\n    │  Healing strategies           │\n    │  1. Text-pattern match        │  regex on element text\n    │  2. Attribute fuzzy match     │  Levenshtein-similar CSS classes\n    │  3. Structural context        │  parent\u002Fsibling navigation\n    │  4. XPath fallback            │  CSS→XPath conversion\n    └────┬──────────────────────────┘\n         │ winner (score ≥ 0.5)\n    ┌────▼──────────────────────────┐\n    │  Persist new selector         │  confidence stored in SQLite\n    │  Success: score × 1.05 + 0.02 │  cap 1.0\n    │  Failure: score × 0.85 − 0.05 │  floor 0.0\n    │  Unused >7d: score × 0.99\u002Fday │\n    └───────────────────────────────┘\n```\n\n### Auto browser upgrade\n\n```\n    HTTP fetch\n         │\n    ┌────▼──────────────────────┐\n    │  Domain cached as JS?     │──Yes──► BrowserFetcher directly\n    └────┬──────────────────────┘\n         │ No\n    ┌────▼──────────────────────┐\n    │  needs_browser(html)?     
│  SPA markers (React\u002FVue\u002FNext\u002FNuxt\u002FAngular)\n    │                           │  noscript redirect · text\u002FHTML \u003C 3%\n    └────┬──────────┬───────────┘\n         │ No       │ Yes\n         │          ▼\n         │   BrowserFetcher retry ──► cache domain for session\n         │\n    ┌────▼──────────────────────┐\n    │  Return HTTP result       │\n    └───────────────────────────┘\n```\n\nDisable with `auto_browser=False`, or force browser on a specific request with `meta[\"use_browser\"] = True`.\n\n### Adaptive rate limiting\n\n```\n    After each fetch:\n         │\n    ├── status 429 ──────────────► gap × 2 (cap 60 s) + 30 s circuit breaker\n    │\n    ├── window full, error rate > 30% ──► gap × 1.5\n    │\n    └── window full, error rate \u003C 5%  ──► gap × 0.95  (floor = base delay)\n```\n\nDisable with `adaptive_rate_limiting=False`.\n\n---\n\n## Quickstart\n\n### Extract structured data from a product page\n\n```python\nimport asyncio\nfrom anansi import AdaptiveParser\nfrom anansi.parser.adaptive import SelectorConfig\n\nasync def main():\n    html = ...  
# fetched HTML\n\n    parser = AdaptiveParser()\n    data = await parser.extract(html, {\n        # JSON-LD fields like \"name\" and \"price\" are pulled from structured\n        # data automatically — the CSS selectors below are only used as fallback\n        \"name\":  SelectorConfig(\"h1.product-title\", expected_pattern=r\"\\w+\"),\n        \"price\": SelectorConfig(\".price-tag\", expected_pattern=r\"\\$[\\d,.]+\"),\n        \"sku\":   \".product-sku\",\n    }, url=\"https:\u002F\u002Fshop.example.com\u002Fproduct\u002F42\")\n\n    print(data)\n    # {\"name\": \"Widget Pro\", \"price\": \"$49.99\", \"sku\": \"WGT-001\"}\n\n    # Raw structured data is also available directly\n    structured = await parser.extract_structured(html)\n    print(structured[\"json_ld\"])   # [{\"@type\": \"Product\", \"name\": \"Widget Pro\", ...}]\n    print(structured[\"open_graph\"]) # {\"title\": \"Widget Pro\", \"image\": \"https:\u002F\u002F...\"}\n\nasyncio.run(main())\n```\n\n### Run a resilient concurrent crawl\n\n```python\nfrom pydantic import BaseModel\nfrom anansi import Crawler, ProxyManager\nfrom anansi.core import Item, Request, Response\nfrom anansi.spider.spider import Spider\n\nclass ProductItem(BaseModel):\n    title: str\n    price: float        # \"49.99\" strings are auto-coerced\n    sku: str | None = None\n\nclass ShopSpider(Spider):\n    name = \"shop\"\n    start_urls = [\"https:\u002F\u002Fshop.example.com\u002Fproducts\"]\n    item_schema = ProductItem   # validate every yielded item against this model\n\n    async def parse(self, response: Response):\n        for link in response.css(\"a.product-link\"):\n            yield Request(response.urljoin(link[\"href\"]), callback=\"parse_product\")\n\n    async def parse_product(self, response: Response):\n        yield Item({\"title\": response.css(\"h1\")[0].get_text(), \"url\": response.url})\n\npm = ProxyManager([\"http:\u002F\u002Fproxy1:8080\", \"socks5:\u002F\u002Fproxy2:1080\"])\n\ncrawler = 
Crawler(\n    ShopSpider,\n    concurrency=10,\n    delay=0.5,\n    max_pages=1000,\n    proxy_manager=pm,\n    domain_delay=1.0,             # minimum gap between requests to same domain\n    respect_robots=True,          # honour robots.txt (default True)\n    cookies={\"session\": \"...\"},   # for login-protected sites\n    auto_browser=True,            # detect and upgrade JS shells (default True)\n    adaptive_rate_limiting=True,  # back off on errors, recover on clean runs (default True)\n    conditional_get=True,         # skip unchanged pages on re-crawl (default True)\n    canonicalize_urls=True,       # strip tracking params before queuing (default True)\n)\n\nasync for item in crawler.run():\n    print(item.data)\n\n# Pause from another coroutine, resume later (even after process restart):\ncrawler.pause()\nresumed = await Crawler.resume(crawler.crawl_id, ShopSpider, concurrency=10)\nasync for item in resumed.run():\n    print(item.data)\n\n# Export everything to CSV:\nawait Crawler.export_items(crawler.crawl_id, fmt=\"csv\", path=\"\u002Ftmp\u002Fproducts.csv\")\n```\n\n### TLS fingerprint mimicry\n\n```python\nfrom anansi.fetchers.http import HTTPFetcher\n\n# Requires the tls extra: pip install \"anansi-scraper[tls] @ git+https:\u002F\u002Fgithub.com\u002Fmdowis\u002Fanansi\"\nasync with HTTPFetcher(impersonate=\"chrome124\") as f:\n    result = await f.fetch(\"https:\u002F\u002Fbot-protected-site.com\")\n    print(result.html)\n\n# Per-request profile rotation — vary the TLS fingerprint across requests\n# to avoid a fixed JA3\u002FJA4 hash being flagged across sessions.\nasync with HTTPFetcher(impersonate=\"chrome124\") as f:\n    r1 = await f.fetch(\"https:\u002F\u002Fexample.com\u002Fpage1\", impersonate=\"chrome131\")\n    r2 = await f.fetch(\"https:\u002F\u002Fexample.com\u002Fpage2\", impersonate=\"safari18_0\")\n    r3 = await f.fetch(\"https:\u002F\u002Fexample.com\u002Fpage3\", impersonate=None)   # plain httpx\n```\n\nWithout `[tls]` 
installed, Anansi logs a warning and falls back to standard httpx automatically — no code change required.\n\n### CLI\n\n```bash\n# Fetch and print as markdown\nanansi fetch https:\u002F\u002Fexample.com --output markdown\n\n# Use browser (Cloudflare bypass, JS rendering)\nanansi fetch https:\u002F\u002Fprotected-site.com --browser\n\n# List all recorded crawls\nanansi crawls\n\n# Start the MCP server\nanansi mcp\n```\n\nMore examples in [`\u002Fexamples`](examples\u002F).\n\n---\n\n## MCP Server (LLM Integration)\n\nAnansi ships a **FastMCP** server that exposes all scraping capabilities as tools any LLM can call over stdio transport.\n\n> **Windows note:** Claude Desktop and most MCP clients on Windows spawn the server with a restricted PATH that often excludes `Python313\\Scripts\\`, so `anansi-mcp` may not be found. Use `python -m anansi.mcp_server.server` in any config where it fails.\n\n### Start the server\n\n```bash\nanansi-mcp\n# or\npython -m anansi.mcp_server.server\n```\n\n### Tools\n\n| Tool | Description |\n|---|---|\n| `fetch_url` | Fetch a single page — HTML, text, or markdown; supports chunking, browser mode, and browser actions |\n| `fetch_urls` | Fetch multiple URLs concurrently in one call |\n| `fetch_and_extract` | Fetch and extract structured fields (CSS + structured data) in one call |\n| `extract` | Extract structured data from an HTML string with adaptive selectors |\n| `crawl_site` | Launch a background crawl; returns a `crawl_id` immediately |\n| `get_crawl_items` | Retrieve persisted items from a crawl (paginated) |\n| `export_crawl` | Export items as JSONL, JSON, or CSV |\n| `crawl_metrics` | Live stats: pages\u002Fsec, error rate, unchanged pages, queue depth, item validation counts |\n| `pause_crawl` | Pause a running crawl |\n| `resume_crawl` | Resume a paused crawl (same process) |\n| `list_crawls` | List all crawls and their state |\n| `selector_health` | Inspect learned selector confidence scores for a URL pattern |\n| 
`cancel_crawl` | Permanently cancel a running or paused crawl (irreversible; distinct from `pause_crawl`) |\n| `screenshot_url` | Capture a PNG screenshot of any page via headless browser; returns base64 or saves to file |\n| `train_selector` | Manually teach the parser a correct CSS\u002FXPath\u002Ftext selector for a URL pattern at confidence 1.0 |\n| `validate_selector` | Test CSS selectors against a live page without affecting stored confidence scores |\n| `clear_cache` | Invalidate the in-memory page cache (all entries, or a single URL) |\n\n### `fetch_url` parameters\n\n| Parameter | Default | Description |\n|---|---|---|\n| `url` | required | The URL to fetch |\n| `use_browser` | `false` | Use headless browser (bypasses Cloudflare, renders JS) |\n| `proxy` | `null` | Proxy URL — `\"http:\u002F\u002Fuser:pass@host:port\"` |\n| `wait_for_selector` | `null` | Wait for this CSS selector before returning (browser only) |\n| `timeout` | `30.0` | Request timeout in seconds |\n| `format` | `\"html\"` | Output format: `\"html\"`, `\"text\"`, or `\"markdown\"` |\n| `chunk_size` | `null` | Max characters per chunk — `null` returns the full page |\n| `chunk_index` | `0` | Which chunk to return (0-indexed) |\n| `actions` | `null` | Browser interactions to run after page load (see below) |\n| `impersonate` | `null` | curl-cffi TLS\u002FHTTP-2 fingerprint target (e.g. `\"chrome124\"`); falls back to `ANANSI_IMPERSONATE` env var; per-request, overrides the instance default |\n| `capture_network` | `false` | **Browser only.** Intercept JSON API responses the page makes during load\u002Factions. Returns raw payloads in `captured_requests` — ideal for API-first SPAs. Bypasses cache. |\n| `capture_patterns` | `null` | URL substrings to filter captured responses (e.g. `[\"\u002Fapi\u002F\", \"\u002Fgraphql\"]`). Max 20 entries. Requires `capture_network=true`. |\n\n### Handling large pages\n\nRaw HTML is often 500 kB–2 MB. 
Three strategies, simplest to most granular:\n\n**Switch format** — strips markup (typically 5–10× smaller):\n```\nfetch_url(url=\"https:\u002F\u002Fexample.com\u002Farticle\", format=\"text\")\nfetch_url(url=\"https:\u002F\u002Fexample.com\u002Fdocs\",    format=\"markdown\")\n```\n\n**Chunk** — splits at DOM or paragraph boundaries; page is cached 5 min so subsequent chunks cost nothing:\n```\nfetch_url(url=\"https:\u002F\u002Fexample.com\", format=\"markdown\", chunk_size=20000, chunk_index=0)\n# → {content: \"...\", chunk_index: 0, total_chunks: 4}\nfetch_url(url=\"https:\u002F\u002Fexample.com\", format=\"markdown\", chunk_size=20000, chunk_index=1)\n```\n\n**Extract only what you need** — target specific fields with `fetch_and_extract` or `extract` so that only the extracted fields come back, not the full page content.\n\n### `fetch_and_extract` example\n\n```\nfetch_and_extract(\n    url=\"https:\u002F\u002Fshop.example.com\u002Fproduct\u002F1\",\n    selectors={\"title\": \"h1.product-title\", \"price\": \".price\", \"sku\": \".sku\"},\n)\n# → {\n#     \"url\": \"https:\u002F\u002F...\", \"status\": 200, \"elapsed\": 0.42,\n#     \"data\": {\"title\": \"Widget Pro\", \"price\": \"$49.99\", \"sku\": \"WGT-001\"},\n#     \"structured_data\": {\n#       \"json_ld\": [{\"@type\": \"Product\", \"name\": \"Widget Pro\", \"price\": \"49.99\"}],\n#       \"open_graph\": {\"title\": \"Widget Pro\", \"image\": \"https:\u002F\u002F...\"},\n#       \"microdata\": []\n#     }\n#   }\n```\n\nFields matched in JSON-LD or Open Graph appear in `data` directly — CSS selectors are not evaluated for them. `structured_data` always contains the raw metadata.\n\n### Browser interactions (`actions`)\n\nPass an `actions` list with `use_browser=true` for dynamically loaded content. 
Actions execute in order after page load.\n\n| Type | Required fields | Optional fields | Description |\n|---|---|---|---|\n| `click` | `selector` | — | Click a CSS-matched element |\n| `fill` | `selector`, `value` | — | Type text into an input |\n| `press` | `selector`, `key` | — | Press a key while an element is focused |\n| `scroll_to_bottom` | — | — | Scroll to the bottom of the page (single shot) |\n| `scroll_until_stable` | — | `max_scrolls` (1–30, default 10), `scroll_delay` (100–5000 ms, default 1500) | Scroll repeatedly until page height stops changing — handles infinite-scroll feeds, product listings, and lazy-loaded content. Stops when height is stable for 2 consecutive checks, or when the 60 s action budget is hit. |\n| `wait` | `ms` | — | Pause for N milliseconds |\n| `wait_for_selector` | `selector` | — | Wait until a CSS selector appears in the DOM |\n\n```\n# Infinite scroll — load all items automatically\nfetch_url(url=\"https:\u002F\u002Fexample.com\u002Ffeed\", use_browser=true, actions=[\n    {\"type\": \"scroll_until_stable\", \"max_scrolls\": 15, \"scroll_delay\": 1500},\n])\n\n# Submit a search form\nfetch_url(url=\"https:\u002F\u002Fexample.com\u002Fsearch\", use_browser=true, format=\"text\", actions=[\n    {\"type\": \"fill\", \"selector\": \"input[name=q]\", \"value\": \"web scraping\"},\n    {\"type\": \"press\", \"selector\": \"input[name=q]\", \"key\": \"Enter\"},\n    {\"type\": \"wait_for_selector\", \"selector\": \".results\"},\n])\n```\n\n### Network request interception (`capture_network`)\n\nMany modern sites (React, Next.js, Vue, Nuxt) render a minimal HTML shell and load all actual data via XHR\u002Ffetch API calls. 
`capture_network=true` registers a response listener _before_ navigation and collects every JSON API response the page makes — bypassing HTML parsing entirely.\n\n```\nfetch_url(\n    url=\"https:\u002F\u002Fshop.example.com\u002Fproducts\",\n    use_browser=true,\n    capture_network=true,\n    capture_patterns=[\"\u002Fapi\u002Fproducts\", \"\u002Fgraphql\"],\n    actions=[{\"type\": \"scroll_until_stable\"}],\n)\n# → {\n#     \"url\": \"https:\u002F\u002F...\", \"status\": 200, \"via_browser\": true,\n#     \"captured_requests\": [\n#       {\"url\": \"https:\u002F\u002Fshop.example.com\u002Fapi\u002Fproducts?page=1\", \"status\": 200,\n#        \"body\": {\"items\": [...], \"total\": 240}},\n#       ...\n#     ],\n#     \"content\": \"...\",   # HTML shell (often minimal)\n#   }\n```\n\n- Capped at 50 responses, 200 KB each (larger responses are silently skipped)\n- `capture_patterns` filters by URL substring; omit to capture all JSON responses\n- Results bypass the page cache (each call re-fetches and re-intercepts)\n\n### Client configuration\n\n**Claude Code:**\n```bash\nclaude mcp add anansi -- anansi-mcp\n```\n\n**Claude Desktop \u002F Cursor \u002F Windsurf** — add to the client's MCP config file:\n```json\n{ \"mcpServers\": { \"anansi\": { \"command\": \"anansi-mcp\" } } }\n```\n\n**If `anansi-mcp` is not found** (common on Windows where the Scripts directory isn't on PATH):\n```json\n{ \"mcpServers\": { \"anansi\": { \"command\": \"python\", \"args\": [\"-m\", \"anansi.mcp_server.server\"] } } }\n```\n\n**Any LLM via Python:**\n```python\nfrom mcp import ClientSession, StdioServerParameters\nfrom mcp.client.stdio import stdio_client\n\nserver = StdioServerParameters(command=\"anansi-mcp\")\nasync with stdio_client(server) as (read, write):\n    async with ClientSession(read, write) as session:\n        await session.initialize()\n        tools = await session.list_tools()\n        result = await session.call_tool(\"fetch_url\", {\"url\": 
\"https:\u002F\u002Fexample.com\"})\n```\n\n**LangChain:**\n```python\nfrom langchain_mcp_adapters.tools import load_mcp_tools\n# load_mcp_tools(session) returns standard LangChain Tool objects\n```\n\n**ChatGPT Desktop App** — open Settings → Connectors → Add MCP Server and paste:\n```json\n{ \"command\": \"anansi-mcp\", \"args\": [], \"env\": {} }\n```\n\n**ChatGPT \u002F OpenAI Agents SDK (programmatic):**\n```bash\npip install \"anansi-scraper[openai] @ git+https:\u002F\u002Fgithub.com\u002Fmdowis\u002Fanansi\"\n```\n```python\nfrom agents import Agent, Runner\nfrom agents.mcp import MCPServerStdio\n\nasync with MCPServerStdio(params={\"command\": \"anansi-mcp\", \"args\": []}) as server:\n    agent = Agent(name=\"Scraper\", instructions=\"Use Anansi tools.\", mcp_servers=[server])\n    result = await Runner.run(agent, \"Fetch https:\u002F\u002Fexample.com and summarise it.\")\n    print(result.final_output)\n```\n\n**Remote SSE transport** (for web-based ChatGPT or shared team access):\n```bash\n# Start Anansi as an HTTP server\nanansi-mcp --transport sse --host 0.0.0.0 --port 8000\n```\nThen point ChatGPT Desktop (or the Agents SDK) at `http:\u002F\u002F\u003Chost>:8000\u002Fsse`:\n```json\n{ \"url\": \"http:\u002F\u002Flocalhost:8000\u002Fsse\" }\n```\n```python\nfrom agents.mcp import MCPServerSse\nasync with MCPServerSse(params={\"url\": \"http:\u002F\u002Flocalhost:8000\u002Fsse\"}) as server:\n    ...\n```\n\nSee [`examples\u002F05_mcp_chatgpt_usage.py`](examples\u002F05_mcp_chatgpt_usage.py) for a runnable end-to-end example.\n\n---\n\n## Architecture\n\n```\nanansi\u002F\n├── core.py              # Request, Response, Item, Spider base\n├── db.py                # SQLite schema (selectors.db, crawls.db, url_cache)\n├── fetchers\u002F\n│   ├── base.py          # BaseFetcher, FetchResult\n│   ├── http.py          # HTTPFetcher — httpx\u002Fcurl-cffi, retry, UA rotation, TLS mimicry\n│   ├── browser.py       # BrowserFetcher — Playwright, stealth JS, 
Cloudflare bypass\n│   └── smart.py         # needs_browser() — JS shell detection heuristics\n├── parser\u002F\n│   ├── adaptive.py      # AdaptiveParser — structured pre-pass + self-healing selectors\n│   ├── strategies.py    # text_match, attribute_fuzzy, structural, xpath_fallback\n│   └── structured.py    # extract_jsonld, extract_opengraph, extract_microdata\n├── proxy\u002F\n│   └── manager.py       # ProxyManager — rotation, health checks, quarantine\n├── sitemap.py           # SitemapEntry, iter_sitemap_entries — \u003Clastmod> aware\n├── spider\u002F\n│   ├── spider.py        # Spider base class, @rule, item_schema, sitemap filtering\n│   ├── queue.py         # SQLiteQueue — URL canonicalization, persistent queue\n│   └── crawler.py       # Crawler — adaptive throttle, validation, conditional GET\n├── utils\u002F\n│   └── url.py           # canonicalize_url — tracking param stripping, param sort\n└── mcp_server\u002F\n    └── server.py        # FastMCP server — 17 LLM-callable tools\n```\n\n---\n\n## Legal \u002F Acceptable Use\n\nAnansi is a powerful scraping tool. **You are solely responsible for how you use it.**\nBefore scraping any site, ensure you have the right to access and use the data and\nthat you comply with the site's Terms of Service, its `robots.txt`, applicable rate\nlimits, and all relevant laws (including computer-misuse statutes such as the CFAA,\ndata-protection law such as GDPR\u002FCCPA, and copyright\u002Fdatabase rights).\n\nThe anti-bot, TLS-fingerprint-impersonation, and Cloudflare-handling features are\nintended for **authorized** testing, research, and scraping of content you have the\nright to access — not for circumventing access controls without permission. The\nauthors accept no liability for damages, account bans, legal consequences, or losses\narising from use or misuse of this software. 
See [`DISCLAIMER.md`](DISCLAIMER.md) for\nthe full statement.\n\n### Operator controls\n\nThese environment variables are read once at process start and are **not** settable\nby an MCP\u002FLLM client — only by whoever runs the server:\n\n| Variable | Default | Effect when set to `1`\u002F`true` |\n|---|---|---|\n| `ANANSI_ALLOW_PRIVATE_NETWORKS` | off | Allows fetches\u002Fcrawls to resolve to loopback, RFC1918, link-local, and cloud-metadata addresses. Off by default so the untrusted LLM cannot reach internal services (SSRF). Enable only on a trusted, isolated host. |\n| `ANANSI_DISABLE_ANTIBOT` | off | Disables **all** anti-bot evasion: stealth-JS injection, the Cloudflare-challenge wait, curl-cffi TLS\u002FHTTP-2 impersonation, the per-host session warm-up, the browser→HTTP cookie hand-off, and the Akamai escalation ladder. Block *detection* still runs so callers get an honest blocked status. Always wins over `ANANSI_IMPERSONATE`. |\n| `ANANSI_IMPERSONATE` | unset | Default curl-cffi TLS\u002FHTTP-2 impersonation target applied to HTTP fetches (e.g. `chrome124`). Must be an allowlisted target; an invalid value fails loud at startup. A per-call `impersonate=` argument (also allowlist-validated) overrides it. |\n\n#### Surviving Akamai \u002F edge bot-managers (authorized use)\n\nAkamai Bot Manager blocks via TLS JA3\u002FJA4 fingerprint, HTTP\u002F2 frame-ordering\nfingerprint, and behavioral scoring of cold (cookie-less, no-`Referer`)\nrequests — block pages show `Reference #…` \u002F `errors.edgesuite.net` and a\n`Server: AkamaiGHost` header. Recommended operator recipe:\n\n1. Install the `tls` extra and set `ANANSI_IMPERSONATE=chrome124` (replays a\n   real Chrome TLS **and** HTTP\u002F2 fingerprint — the single biggest lever).\n2. Leave the per-host session warm-up and `Referer` continuity on (default)\n   so behavioral scoring sees a warm session.\n3. 
Supply **residential or mobile** proxies via the existing proxy support\n   for the hardest tier — datacenter IPs are heavily penalized.\n4. Allow browser escalation (`use_browser` \u002F the automatic ladder) so the\n   Akamai sensor JS can run when impersonation alone is insufficient.\n\n**Honest limit:** the highest Akamai tier validates `_abck` via sensor JS and\nalso blocks headless Chromium. Even with impersonation + browser + warm-up it\nmay remain unreliable without residential\u002Fmobile egress, and sometimes even\nthen. Anansi makes a best effort and reports an honest blocked status when it\ncannot get through. These features are for **authorized** scraping only — see\n[`DISCLAIMER.md`](DISCLAIMER.md); `ANANSI_DISABLE_ANTIBOT=1` turns all of it\noff.\n\n---\n\n## License\n\nLicensed under the Apache License, Version 2.0 — see [`LICENSE`](LICENSE) and\n[`NOTICE`](NOTICE). Use of this software is additionally subject to the\nacceptable-use terms in [`DISCLAIMER.md`](DISCLAIMER.md).\n","Anansi is a self-healing web scraper designed for adversarial websites. It automatically repairs CSS selectors, switches to browser rendering when necessary, and evades bot detection by mimicking Chrome's TLS fingerprint. Its core features include self-healing parsing, structured data extraction, TLS\u002FHTTP-2 fingerprint mimicry, automatic browser upgrade, anti-bot and Cloudflare bypass, and adaptive rate limiting. These features make Anansi well suited to scraping sites that change layout frequently or deploy complex anti-scraping mechanisms, keeping crawls stable over the long term. The project also provides an MCP server that lets any large language model drive the entire crawl through conversation.",2,"2026-06-11 03:58:31","CREATED_QUERY"]