[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-74690":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":16,"stars7d":17,"stars30d":18,"stars90d":15,"forks30d":15,"starsTrendScore":19,"compositeScore":20,"rankGlobal":10,"rankLanguage":10,"license":21,"archived":22,"fork":22,"defaultBranch":23,"hasWiki":24,"hasPages":22,"topics":25,"createdAt":10,"pushedAt":10,"updatedAt":46,"readmeContent":47,"aiSummary":48,"trendingCount":15,"starSnapshotCount":15,"syncStatus":49,"lastSyncTime":50,"discoverSource":51},74690,"webclaw","0xMassi\u002Fwebclaw","0xMassi","Fast, local-first web content extraction for LLMs. Scrape, crawl, extract structured data — all from Rust. CLI, REST API, and MCP server.","https:\u002F\u002Fwebclaw.io",null,"Rust",1322,150,10,0,24,66,195,72,104.54,"GNU Affero General Public License v3.0",false,"main",true,[26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45],"ai","ai-agents","ai-scraping","cli","crawler","data-extraction","firecrawl-alternative","html-to-markdown","llm","markdown","mcp","mcp-server","rag","rust","self-hosted","tls-fingerprinting","web-crawler","web-extraction","web-scraper","web-scraping","2026-06-12 04:01:15","\u003Cp align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Fwebclaw.io\">\n    \u003Cimg src=\".github\u002Fbanner.png\" alt=\"webclaw\" width=\"760\" \u002F>\n  \u003C\u002Fa>\n\u003C\u002Fp>\n\n\u003Ch1 align=\"center\">webclaw\u003C\u002Fh1>\n\n\u003Cp align=\"center\">\n  \u003Cstrong>Turn websites into clean markdown, JSON, and LLM-ready context.\u003C\u002Fstrong>\u003Cbr\u002F>\n  \u003Csub>CLI, MCP server, REST API, and SDKs for AI agents and RAG pipelines.\u003C\u002Fsub>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  \u003Ca 
href=\"https:\u002F\u002Fgithub.com\u002F0xMassi\u002Fwebclaw\u002Fstargazers\">\u003Cimg src=\"https:\u002F\u002Fshieldcn.dev\u002Fgithub\u002Fstars\u002F0xMassi\u002Fwebclaw.svg?variant=branded&logo=github\" alt=\"Stars\" \u002F>\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002F0xMassi\u002Fwebclaw\u002Freleases\">\u003Cimg src=\"https:\u002F\u002Fshieldcn.dev\u002Fgithub\u002Ftag\u002F0xMassi\u002Fwebclaw.svg?variant=branded&logo=rust\" alt=\"Version\" \u002F>\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002F0xMassi\u002Fwebclaw\u002Fblob\u002Fmain\u002FLICENSE\">\u003Cimg src=\"https:\u002F\u002Fshieldcn.dev\u002Fgithub\u002Flicense\u002F0xMassi\u002Fwebclaw.svg?variant=branded\" alt=\"License\" \u002F>\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fwww.npmjs.com\u002Fpackage\u002Fcreate-webclaw\">\u003Cimg src=\"https:\u002F\u002Fshieldcn.dev\u002Fnpm\u002Fdt\u002Fcreate-webclaw.svg?variant=branded\" alt=\"npm installs\" \u002F>\u003C\u002Fa>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Fdiscord.gg\u002FKDfd48EpnW\">\u003Cimg src=\"https:\u002F\u002Fshieldcn.dev\u002Fbadge\u002FDiscord-Join.svg?variant=branded&logo=discord\" alt=\"Discord\" \u002F>\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fx.com\u002Fwebclaw_io\">\u003Cimg src=\"https:\u002F\u002Fshieldcn.dev\u002Fbadge\u002FFollow-@webclaw__io.svg?variant=branded&logo=x\" alt=\"X \u002F Twitter\" \u002F>\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fwebclaw.io\">\u003Cimg src=\"https:\u002F\u002Fshieldcn.dev\u002Fbadge\u002FHosted-webclaw.io.svg?variant=branded&logo=safari\" alt=\"Hosted webclaw\" \u002F>\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fwebclaw.io\u002Fdocs\">\u003Cimg src=\"https:\u002F\u002Fshieldcn.dev\u002Fbadge\u002FDocs-Read.svg?variant=branded&logo=readthedocs\" alt=\"Docs\" \u002F>\u003C\u002Fa>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002Fdemo.gif\" alt=\"webclaw 
extracting clean markdown from a page\" width=\"760\" \u002F>\n\u003C\u002Fp>\n\n---\n\nMost web scraping tools give your agent one of two bad outputs:\n\n- a blocked page, login wall, or empty app shell\n- raw HTML full of nav, scripts, styling, ads, and duplicated boilerplate\n\n[webclaw.io](https:\u002F\u002Fwebclaw.io) is the hosted web extraction API for webclaw. This repo contains the open-source CLI, MCP server, extraction engine, and self-hostable server.\n\nwebclaw turns a URL into clean content your tools can actually use.\n\n```bash\nwebclaw https:\u002F\u002Fexample.com --format markdown\n```\n\n```md\n# Example Domain\n\nThis domain is for use in illustrative examples in documents.\n\nYou may use this domain in literature without prior coordination or asking for permission.\n```\n\nUse it from the terminal, wire it into Claude\u002FCursor through MCP, call the hosted API from your app, or self-host the OSS server.\n\n---\n\n## Install\n\n### Agent setup\n\nThe fastest way to connect webclaw to Claude Code, Claude Desktop, Cursor, Windsurf, OpenCode, Codex CLI, and other MCP-compatible tools:\n\n```bash\nnpx create-webclaw\n```\n\nThe installer detects supported clients and configures the MCP server for you.\n\n### Homebrew\n\n```bash\nbrew tap 0xMassi\u002Fwebclaw\nbrew install webclaw\n```\n\n### Prebuilt binaries\n\nDownload macOS and Linux binaries from [GitHub Releases](https:\u002F\u002Fgithub.com\u002F0xMassi\u002Fwebclaw\u002Freleases).\n\n### Docker\n\n```bash\ndocker run --rm ghcr.io\u002F0xmassi\u002Fwebclaw https:\u002F\u002Fexample.com\n```\n\n### Cargo\n\n```bash\ncargo install --git https:\u002F\u002Fgithub.com\u002F0xMassi\u002Fwebclaw.git webclaw-cli\ncargo install --git https:\u002F\u002Fgithub.com\u002F0xMassi\u002Fwebclaw.git webclaw-mcp\n```\n\nIf building from source fails because native build tools are missing, install the platform prerequisites:\n\n| OS | Command |\n| --- | --- |\n| Debian \u002F Ubuntu | `sudo apt install -y 
pkg-config libssl-dev cmake clang git build-essential` |\n| Fedora \u002F RHEL | `sudo dnf install -y pkg-config openssl-devel cmake clang git make gcc` |\n| Arch | `sudo pacman -S pkg-config openssl cmake clang git base-devel` |\n| macOS | `xcode-select --install` |\n\n---\n\n## Quick Start\n\n### Scrape one page\n\n```bash\nwebclaw https:\u002F\u002Fstripe.com --format markdown\n```\n\n### Return LLM-optimized text\n\n```bash\nwebclaw https:\u002F\u002Fdocs.anthropic.com --format llm\n```\n\n### Keep only the main content\n\n```bash\nwebclaw https:\u002F\u002Fexample.com\u002Fblog\u002Fpost --only-main-content\n```\n\n### Include or exclude selectors\n\n```bash\nwebclaw https:\u002F\u002Fexample.com \\\n  --include \"article, main, .content\" \\\n  --exclude \"nav, footer, .sidebar, .ad\"\n```\n\n### Crawl a documentation site\n\n```bash\nwebclaw https:\u002F\u002Fdocs.rust-lang.org --crawl --depth 2 --max-pages 50\n```\n\n### Extract brand assets\n\n```bash\nwebclaw https:\u002F\u002Fgithub.com --brand\n```\n\n### Compare a page over time\n\n```bash\nwebclaw https:\u002F\u002Fexample.com\u002Fpricing --format json > pricing-old.json\nwebclaw https:\u002F\u002Fexample.com\u002Fpricing --diff-with pricing-old.json\n```\n\n---\n\n## MCP Server\n\nwebclaw ships with an MCP server for AI agents.\n\n```bash\nnpx create-webclaw\n```\n\nManual config:\n\n```json\n{\n  \"mcpServers\": {\n    \"webclaw\": {\n      \"command\": \"~\u002F.webclaw\u002Fwebclaw-mcp\"\n    }\n  }\n}\n```\n\nThen ask your agent things like:\n\n```text\nScrape these competitor pricing pages and summarize the differences.\n```\n\n```text\nCrawl this documentation site and prepare clean context for a RAG index.\n```\n\n```text\nExtract the brand colors, fonts, and logos from this company website.\n```\n\n---\n\n## Tools\n\n| Tool | What it does | Local |\n| --- | --- | :-: |\n| `scrape` | Extract one URL as markdown, text, JSON, LLM format, or HTML | Yes |\n| `crawl` | Follow same-origin links and 
extract discovered pages | Yes |\n| `map` | Discover URLs without extracting every page | Yes |\n| `batch` | Scrape multiple URLs in parallel | Yes |\n| `extract` | Convert page content into structured data | Yes, with local or configured LLM |\n| `summarize` | Summarize a page | Yes, with local or configured LLM |\n| `diff` | Compare page content snapshots | Yes |\n| `brand` | Extract colors, fonts, logos, and metadata | Yes |\n| `search` | Search the web and scrape results | Hosted API |\n| `research` | Multi-source research workflow | Hosted API |\n\n---\n\n## SDKs\n\n```bash\nnpm install @webclaw\u002Fsdk\npip install webclaw\ngo get github.com\u002F0xMassi\u002Fwebclaw-go\n```\n\n\u003Cdetails>\n\u003Csummary>TypeScript\u003C\u002Fsummary>\n\n```ts\nimport { Webclaw } from \"@webclaw\u002Fsdk\";\n\nconst client = new Webclaw({ apiKey: process.env.WEBCLAW_API_KEY! });\n\nconst page = await client.scrape({\n  url: \"https:\u002F\u002Fexample.com\",\n  formats: [\"markdown\"],\n  only_main_content: true,\n});\n\nconsole.log(page.markdown);\n```\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>Python\u003C\u002Fsummary>\n\n```python\nfrom webclaw import Webclaw\n\nclient = Webclaw(api_key=\"wc_your_key\")\n\npage = client.scrape(\n    \"https:\u002F\u002Fexample.com\",\n    formats=[\"markdown\"],\n    only_main_content=True,\n)\n\nprint(page.markdown)\n```\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>cURL\u003C\u002Fsummary>\n\n```bash\ncurl -X POST https:\u002F\u002Fapi.webclaw.io\u002Fv1\u002Fscrape \\\n  -H \"Authorization: Bearer $WEBCLAW_API_KEY\" \\\n  -H \"Content-Type: application\u002Fjson\" \\\n  -d '{\n    \"url\": \"https:\u002F\u002Fexample.com\",\n    \"formats\": [\"markdown\"],\n    \"only_main_content\": true\n  }'\n```\n\n\u003C\u002Fdetails>\n\n---\n\n## Output Formats\n\n| Format | Use it when you need |\n| --- | --- |\n| `markdown` | Clean page content with structure preserved |\n| `llm` | Compact context for agents and 
RAG pipelines |\n| `text` | Plain text with minimal formatting |\n| `json` | Structured metadata, links, images, and extracted fields |\n| `html` | Cleaned HTML for custom processing |\n\n---\n\n## Local First, Hosted When Needed\n\nThe CLI and MCP server work locally without an account for the core extraction path.\n\nUse the hosted API at [webclaw.io](https:\u002F\u002Fwebclaw.io) when you need:\n\n- protected-site access without managing infrastructure\n- JavaScript rendering\n- async crawl and research jobs\n- web search\n- watches and production usage tracking\n- SDKs for application code\n\n```bash\nexport WEBCLAW_API_KEY=wc_your_key\n\nwebclaw https:\u002F\u002Fexample.com --cloud\n```\n\n---\n\n## What You Can Build\n\n| Use case | Example |\n| --- | --- |\n| AI agent web access | Give Claude, Cursor, or another MCP client clean page context |\n| RAG ingestion | Crawl docs, help centers, blogs, and knowledge bases |\n| Competitor monitoring | Track pricing pages, changelogs, docs, and product pages |\n| Structured extraction | Turn messy pages into typed JSON for automations |\n| Research workflows | Search, scrape, summarize, and cite multiple sources |\n| Brand intelligence | Extract logos, colors, fonts, and social metadata |\n\n## Architecture\n\n```text\nwebclaw\u002F\n  crates\u002F\n    webclaw-core     HTML to markdown, text, JSON, and LLM-ready output\n    webclaw-fetch    Fetching, crawling, batching, and mapping\n    webclaw-llm      Local and hosted LLM provider support\n    webclaw-pdf      PDF text extraction\n    webclaw-mcp      MCP server for AI agents\n    webclaw-cli      Command-line interface\n```\n\n`webclaw-core` is pure extraction logic: no network I\u002FO, small surface area, and usable independently from the fetching layer.\n\n---\n\n## Configuration\n\n| Variable | Description |\n| --- | --- |\n| `WEBCLAW_API_KEY` | Hosted API key |\n| `OLLAMA_HOST` | Ollama URL for local LLM features |\n| `OPENAI_API_KEY` | OpenAI-compatible LLM 
provider key |\n| `OPENAI_BASE_URL` | OpenAI-compatible base URL |\n| `ANTHROPIC_API_KEY` | Anthropic-compatible LLM provider key |\n| `ANTHROPIC_BASE_URL` | Anthropic-compatible base URL |\n| `WEBCLAW_PROXY` | Single proxy URL |\n| `WEBCLAW_PROXY_FILE` | Proxy pool file |\n\n---\n\n## Contributing\n\nThe most useful contributions right now are practical and small:\n\n- add examples for real agent and RAG workflows\n- improve SDK snippets\n- report pages that extract poorly\n- add failing fixtures for messy HTML\n- improve docs for MCP clients and local setup\n- test the CLI on more Linux\u002FmacOS environments\n\nGood first places to start:\n\n- [Good first issues](https:\u002F\u002Fgithub.com\u002F0xMassi\u002Fwebclaw\u002Fissues?q=label%3A%22good+first+issue%22)\n- [Open a bug report](https:\u002F\u002Fgithub.com\u002F0xMassi\u002Fwebclaw\u002Fissues\u002Fnew)\n- [Start a discussion](https:\u002F\u002Fgithub.com\u002F0xMassi\u002Fwebclaw\u002Fdiscussions)\n\nIf a page extracts badly, include:\n\n```text\nURL:\nCommand or API request:\nExpected output:\nActual output:\nFormat used: markdown \u002F llm \u002F text \u002F json \u002F html\nCLI, MCP, SDK, or API:\n```\n\nPlease remove secrets, cookies, private tokens, and customer data from logs before posting.\n\n---\n\n## Community Plugins\n\nThird-party plugins that integrate webclaw with AI agent platforms:\n\n| Plugin | Platform | What it does |\n|---|---|---|\n| [openclaw-webclaw](https:\u002F\u002Fgithub.com\u002Fjal-co\u002Fopenclaw-webclaw) | [OpenClaw](https:\u002F\u002Fopenclaw.ai) | Native webclaw v1 API plugin with 9 tools: scrape, search, crawl, extract, summarize, diff, map, batch, brand |\n| [hermes-webclaw](https:\u002F\u002Fgithub.com\u002Fjal-co\u002Fhermes-webclaw) | [Hermes Agent](https:\u002F\u002Fgithub.com\u002FNousResearch\u002Fhermes-agent) | Web search provider and 9 dedicated tools for the full v1 API surface. 
Install with `hermes plugins install jal-co\u002Fhermes-webclaw` |\n\nBuilt a webclaw integration? [Open a PR](https:\u002F\u002Fgithub.com\u002F0xMassi\u002Fwebclaw\u002Fpulls) to add it here.\n\n---\n\n## Contributors\n\nThanks to everyone improving webclaw through issues, examples, docs, bug reports, and pull requests.\n\n\u003Ca href=\"https:\u002F\u002Fgithub.com\u002F0xMassi\u002Fwebclaw\u002Fgraphs\u002Fcontributors\">\n  \u003Cimg src=\"https:\u002F\u002Fcontrib.rocks\u002Fimage?repo=0xMassi\u002Fwebclaw\" alt=\"webclaw contributors\" \u002F>\n\u003C\u002Fa>\n\n---\n\n## Star History\n\n\u003Ca href=\"https:\u002F\u002Fwww.star-history.com\u002F?repos=0xMassi%2Fwebclaw&type=date&legend=top-left\">\n \u003Cpicture>\n   \u003Csource media=\"(prefers-color-scheme: dark)\" srcset=\"https:\u002F\u002Fapi.star-history.com\u002Fchart?repos=0xMassi\u002Fwebclaw&type=date&theme=dark&legend=top-left\" \u002F>\n   \u003Csource media=\"(prefers-color-scheme: light)\" srcset=\"https:\u002F\u002Fapi.star-history.com\u002Fchart?repos=0xMassi\u002Fwebclaw&type=date&legend=top-left\" \u002F>\n   \u003Cimg alt=\"Star History Chart\" src=\"https:\u002F\u002Fapi.star-history.com\u002Fchart?repos=0xMassi\u002Fwebclaw&type=date&legend=top-left\" \u002F>\n \u003C\u002Fpicture>\n\u003C\u002Fa>\n\n---\n\n## License\n\n[AGPL-3.0](LICENSE)\n","webclaw is a fast, local-first web content extraction tool designed for large language models. It scrapes, crawls, and extracts structured data from web pages, and outputs formats such as Markdown and JSON for use by AI agents and retrieval-augmented generation (RAG) pipelines. Written in Rust, it offers several access paths, including a command-line interface (CLI), a REST API, and an MCP server, and is efficient and stable. It suits scenarios that need automated processing of web content into directly usable data formats, such as building knowledge bases and content analysis and summarization.",2,"2026-06-11 03:50:25","high_star"]