[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-80787":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":13,"openIssues":14,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":13,"stars7d":13,"stars30d":13,"stars90d":15,"forks30d":15,"starsTrendScore":16,"compositeScore":17,"rankGlobal":10,"rankLanguage":10,"license":18,"archived":19,"fork":19,"defaultBranch":20,"hasWiki":19,"hasPages":19,"topics":21,"createdAt":10,"pushedAt":10,"updatedAt":27,"readmeContent":28,"aiSummary":29,"trendingCount":15,"starSnapshotCount":15,"syncStatus":14,"lastSyncTime":30,"discoverSource":31},80787,"scraperecon","DaKheera47\u002Fscraperecon","DaKheera47","CLI recon tool for scraper developers. Detects TLS fingerprinting, JS challenges, bot protection, and rate limits across 4 stages","",null,"Python",40,1,2,0,3,0.9,"MIT License",false,"main",[22,23,24,25,26],"bot-detection","curl-cffi","python","scraping","tls-impersonation","2026-06-12 02:04:06","# scraperecon\n\nRun this before you write a scraper. 
It tells you what bot protection a site has, whether plain HTTP or TLS impersonation is enough to get through, and how aggressively it rate limits — before you've written a single line of scraper code.\n\n\u003Cimg width=\"1117\" height=\"409\" alt=\"image\" src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Faa7d0670-c612-4cfb-a579-b610b9c04163\" \u002F>\n\n\u003Cbr\u002F>\n\n## Usage\n\n```bash\nscraperecon https:\u002F\u002Ftarget.com\n```\n\n```\nscraperecon — https:\u002F\u002Ftarget.com\n━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\n\nScrape Report\n  robots.txt: https:\u002F\u002Ftarget.com\u002Frobots.txt\n  robots.txt blocks scraping, proceed at own caution\n  Sitemap:    12,843 URLs across 4 sitemap file(s)\n\nStage 1 — Plain HTTP (httpx, scraper User-Agent)\n  Status:   403 Forbidden\n  Time:     212ms\n  Verdict:  Blocked\n\nStage 2 — TLS Impersonation (chrome131)\n  Status:   200 OK\n  Time:     389ms\n  Verdict:  Open\n  Note:     TLS fingerprint was the blocker ✓\n\nStage 3 — Vendor Detection\n  Vendor:     Cloudflare\n  Confidence: High\n  Signals:    cf-ray header, __cf_bm cookie, challenges.cloudflare.com in body\n\nStage 4 — Rate Limit Probe\n  Skipped (pass --probe-rate to enable)\n\nEmbedded Data Patterns (2\u002F5 detected in Stage 2 (chrome131))\nPattern                 Detected  Signal                              Why It Matters\nJSON-LD                 Yes       script[type=\"application\u002Fld+json\"]  Parse schema payloads for product,\n                                                                      article, job, or org fields.\nNext.js hydration       Yes       script#__NEXT_DATA__                Read props\u002FpageProps from the\n                                                                      Next.js bootstrap JSON.\nNuxt payload            No        window.__NUXT__ or data-nuxt-data   Inspect the Nuxt payload for\n                                                                      server-rendered 
entities and route data.\nApollo\u002FRelay cache      No        window.__APOLLO_STATE__ or          Extract normalized GraphQL entities\n                                  __RELAY_PAYLOADS__                  from the hydrated client cache.\nBootstrapped app state  No        window.__INITIAL_STATE__ or         Mine the initial Redux-style store\n                                  __PRELOADED_STATE__                 for records already sent to the client.\n\nRecommendation\n  Use curl_cffi with chrome131 TLS profile\n  No CAPTCHA detected at probe volume\n  Proxy rotation not required at low request rates\n```\n\n---\n\n## What it does\n\nscraperecon starts with a scrape report, then runs four stages against a URL in order, stopping early where it can.\n\n**Scrape Report**\n\nFetches `robots.txt` before the network challenge stages. If the target path is disallowed for generic crawlers, scraperecon prints:\n\n```text\nrobots.txt blocks scraping, proceed at own caution\n```\n\nIt also discovers sitemap URLs from `robots.txt`, falls back to `\u002Fsitemap.xml` when needed, follows sitemap indexes recursively, and counts the number of page URLs found. This is useful for sizing a site before you write a scraper.\n\n**Stage 1 — Plain HTTP**\n\nA basic GET with no tricks. If this comes back clean, you don't need anything else — plain `httpx` or `requests` will work fine and you can stop here.\n\nIt also checks whether a 200 response is actually real content or a JS challenge page. Cloudflare in particular loves returning 200 with a challenge rather than a 403. scraperecon catches that and marks it `Challenged` instead of lying to you with a green `Open`.\n\n**Stage 2 — TLS Impersonation**\n\nIf Stage 1 was blocked or challenged, it retries using `curl_cffi` impersonating Chrome's TLS fingerprint. 
A lot of bot detection happens at the TLS handshake level — Python's `requests` library has a completely different fingerprint from a real browser, and that alone is enough to get you blocked on many sites before the server has even looked at your headers. If Stage 2 passes where Stage 1 didn't, you know exactly what the fix is.\n\n**Stage 3 — Vendor Detection**\n\nInspects headers, cookies, and the response body for known signatures and tells you which bot protection vendor is running. This matters because Cloudflare, DataDome, Akamai, and PerimeterX all require different bypass strategies. Knowing which one you're dealing with upfront saves you from trying things that were never going to work.\n\n**Stage 4 — Rate Limit Probe** _(opt-in)_\n\nFires N requests with configurable concurrency and watches what happens — hard 429s, silent response time degradation, mid-session redirects. Off by default because blasting a site without thinking about it is bad practice. Pass `--probe-rate` when you actually need the data.\n\n**Embedded Data Patterns**\n\nChecks the best HTML body it retrieved and reports five common client-visible data formats that are often directly scrapable without browser automation:\n\n- JSON-LD\n- Next.js `__NEXT_DATA__`\n- Nuxt `__NUXT__` payloads\n- Apollo\u002FRelay hydrated GraphQL caches\n- Redux-style bootstrapped app state\n\nIf any of these are present, the terminal table tells you what was found and what kind of extraction path is likely to work.\nWhen a payload is valid JSON, it also shows up to 7 top-level keys with a `+ n more` suffix when the object is larger.\n\n---\n\n## Install\n\n```bash\npipx install scraperecon\n```\n\n---\n\n## Usage\n\n```bash\nscraperecon https:\u002F\u002Ftarget.com\nscraperecon https:\u002F\u002Ftarget.com --probe-rate\nscraperecon https:\u002F\u002Ftarget.com --probe-rate --concurrency 10 --requests 50\nscraperecon https:\u002F\u002Ftarget.com --impersonate safari170\nscraperecon https:\u002F\u002Ftarget.com 
--show-sitemap-preview\nscraperecon https:\u002F\u002Ftarget.com --show-embedded-keys\nscraperecon https:\u002F\u002Ftarget.com --save\nscraperecon https:\u002F\u002Ftarget.com --json | jq .recommendation\n```\n\n| Flag                     | Default   | Description                                                                          |\n| ------------------------ | --------- | ------------------------------------------------------------------------------------ |\n| `--probe-rate`           | off       | Run Stage 4 rate limit probe                                                         |\n| `--concurrency`          | 5         | Workers for rate probe                                                               |\n| `--requests`             | 20        | Total requests for rate probe                                                        |\n| `--impersonate`          | chrome131 | TLS profile for Stage 2. Options: `chrome131`, `chrome120`, `safari170` |\n| `--timeout`              | 10        | Per-request timeout in seconds                                                       |\n| `--json`                 | off       | Machine-readable JSON output                                                         |\n| `--show-sitemap-preview` | off       | Show up to 3 sample URLs for each detected sitemap in the human-readable report      |\n| `--show-embedded-keys`   | off       | Show parsed top-level keys for embedded data patterns in the human-readable report   |\n| `--save`                 | off       | Save the full HTML responses to local files (`\u003Cdomain>_stage1.html`, etc.)           
|\n| `--skip-tls`             | off       | Skip Stage 2                                                                         |\n| `--skip-vendor`          | off       | Skip Stage 3                                                                         |\n\n---\n\n## Reading the recommendation\n\nAt the end of every run you get a plain-English recommendation based on what was found.\n\n- **Plain HTTP should be sufficient** — `httpx` or `requests` will work. No special setup needed.\n- **Use curl_cffi with `\u003Cprofile>`** — TLS fingerprinting is blocking you. Switch to `curl_cffi` with the listed profile.\n- **May need browser automation** — both plain and TLS requests were blocked. You're likely looking at a full JS challenge (Turnstile, hCaptcha). Playwright with a stealth plugin is probably your next move.\n- **Proxy rotation recommended** — the rate probe hit throttling. At any real request volume you'll need rotating proxies.\n- **CAPTCHA detected** — the response body contained CAPTCHA indicators. Automated solving or a managed scraping service required.\n\n---\n\n## JSON output\n\nPass `--json` to get machine-readable output. 
Robots and sitemap data live under `scrape_report`, before the stage results.\n\n```json\n{\n  \"target\": \"https:\u002F\u002Ftarget.com\",\n  \"scrape_report\": {\n    \"blocked\": true,\n    \"robots_url\": \"https:\u002F\u002Ftarget.com\u002Frobots.txt\",\n    \"sitemap_url_count\": 12843,\n    \"sitemaps_checked\": 4,\n    \"sitemap_sources\": [\n      \"https:\u002F\u002Ftarget.com\u002Fsitemap.xml\",\n      \"https:\u002F\u002Ftarget.com\u002Fsitemap_0.xml\"\n    ]\n  },\n  \"stages\": {\n    \"plain\": {},\n    \"tls\": {},\n    \"vendor\": {},\n    \"rate_limit\": null\n  },\n  \"scrapable_patterns\": [\n    {\n      \"name\": \"JSON-LD\",\n      \"detected\": true,\n      \"signal\": \"script[type=\\\"application\u002Fld+json\\\"]\",\n      \"extraction_hint\": \"Parse schema payloads for product, article, job, or org fields.\"\n    }\n  ]\n}\n```\n\n---\n\n## Adding vendor signatures\n\nSignatures live in `scraperecon\u002Fdata\u002Fsignatures.json`. It's a flat JSON file — no code required. If you know a signal that's missing, open a PR.\n\n```json\n{\n  \"name\": \"YourVendor\",\n  \"signals\": [\n    { \"type\": \"header_present\", \"key\": \"x-your-vendor\", \"weight\": 0.8 },\n    { \"type\": \"cookie_name\", \"value\": \"your_cookie\", \"weight\": 0.6 }\n  ]\n}\n```\n\nSignal types: `header_present`, `header_value`, `cookie_name`, `body_contains`, `status_code`.\n\n---\n\n## What it won't do\n\nscraperecon is a recon tool, not a scraping library. It tells you what you need — it doesn't do it for you. No CAPTCHA solving, no Playwright integration, no proxy support, no persistent history, and no crawling page URLs from sitemap results.\n\n---\n\nEvery scraper project starts with the same 20 minutes of manual work: try curl, get blocked, try curl_cffi, check the headers, fire some requests and see what happens. 
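This automates that.\n","scraperecon is a command-line reconnaissance tool for scraper developers that detects a target site's anti-scraping mechanisms. It identifies TLS fingerprinting, JavaScript challenges, bot protection, and rate limiting, probing step by step across four stages. The project is written in Python and uses the curl-cffi library to imitate browser behavior, supporting TLS impersonation to get past some protections. It suits scraper developers who need to understand a target site's protection strategy in advance, so they can assess difficulty and feasibility before writing scraper code. Open source under the MIT License; currently has 39 stars.","2026-06-11 04:02:20","CREATED_QUERY"]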