[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-1487":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":16,"stars7d":17,"stars30d":18,"stars90d":15,"forks30d":15,"starsTrendScore":17,"compositeScore":19,"rankGlobal":9,"rankLanguage":9,"license":20,"archived":21,"fork":21,"defaultBranch":22,"hasWiki":23,"hasPages":21,"topics":24,"createdAt":9,"pushedAt":9,"updatedAt":25,"readmeContent":26,"aiSummary":27,"trendingCount":15,"starSnapshotCount":15,"syncStatus":28,"lastSyncTime":29,"discoverSource":30},1487,"superspider","Lyx3314844-03\u002Fsuperspider","Lyx3314844-03","Enterprise-grade multi-language web scraping framework (Java\u002FGo\u002FRust\u002FPython) with complete capabilities",null,"HTML",218,38,210,4,0,1,3,7,48.97,"MIT License",false,"master",true,[],"2026-06-12 04:00:09","# SuperSpider\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"docs\u002Fassets\u002Fsuperspider-wordmark.svg\" alt=\"SuperSpider — Multi-Language Web Crawler Framework\" width=\"900\" \u002F>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"docs\u002Fassets\u002Fsuperspider-icon.svg\" alt=\"SuperSpider icon\" width=\"160\" \u002F>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  \u003Cb>Multi-Language Web Crawler Framework\u003C\u002Fb>\u003Cbr\u002F>\n  AI · Media Download · Distributed · Anti-Bot · Node-Reverse\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  \u003Ca href=\"#-if-this-project-helped-you\">⭐ If this project helped you, please give it a star!\u003C\u002Fa>\n\u003C\u002Fp>\n\n---\n\nSuperSpider is a **multi-language web crawler framework** that ships four production-ready runtimes in Python, Go, Rust, and Java. 
Each runtime covers the same broad capability surface — web crawling, browser automation, AI extraction, media download, anti-bot, distributed execution — but is optimized for a different engineering environment.\n\n| Runtime | Language | Delivery | Tagline |\n| --- | --- | --- | --- |\n| 🐍 **pyspider** | Python | virtualenv | AI-first, project-oriented, rapid iteration |\n| 🐹 **gospider** | Go | compiled binary | Concurrent, binary-first, distributed workers |\n| 🦀 **rustspider** | Rust | release binary | Performance-first, feature-gated, strongly typed |\n| ☕ **javaspider** | Java | Maven \u002F JAR | Enterprise-first, browser workflows, audit trails |\n\n---\n\n## 🕷️ What Can SuperSpider Do?\n\n### 🌐 Web Crawling\n- HTTP and browser-based crawling (Playwright + Selenium)\n- Scrapy-style project interface with plugin injection\n- Dynamic site handling (JavaScript-rendered pages)\n- **Crawler type templates** — hydrated SPA, bootstrap JSON, infinite scroll, login session, and e-commerce search JobSpec starters\n- **Site presets** — JD, Taobao, Tmall, Pinduoduo, Xiaohongshu, and Douyin Shop starter JobSpec templates\n- **Spider class kits** — reusable spider class templates for PySpider, GoSpider, RustSpider, and JavaSpider\n- **Native ecommerce crawler classes** — catalog\u002Fdetail\u002Freview wrappers in all four runtimes, with browser-backed capture companions where the runtime supports them\n- **Native ecommerce examples** — catalog\u002Fdetail\u002Freview examples in all four runtimes with a JD fast path plus a `generic` fallback for unknown storefronts\n- Proxy pool with health checking and automatic rotation\n- Rate limiting, circuit breaker, deduplication\n- Robots.txt compliance\n- Session and cookie management\n- Checkpoint and incremental crawl (resume interrupted crawls)\n- **Priority-based crawling** — request priority queue with SQLite persistence\n- **Multi-threaded execution** — thread pool, concurrent executor, async executor, rate-limited 
executor\n- **Incremental crawling with ETag\u002FLast-Modified** — content hash comparison, min-change interval enforcement, delta token generation\n- **Cookie management** — per-domain cookie jar with SameSite, Secure, HttpOnly, auto-expiry, Netscape export\u002Fimport\n- **Persistent priority queue** — SQLite-backed with URL deduplication, priority sorting, visited tracking\n- **Worker pool** — configurable thread pool with shutdown, wait, and statistics\n- **Concurrent executor** — semaphore-controlled ThreadPoolExecutor with execute-many\n- **Async executor** — asyncio.Semaphore-controlled async task execution\n- **Rate-limited executor** — token bucket algorithm with wait and execute\n- **Priority task queue** — heapq-based priority queue for task scheduling\n\n### 🎬 Media Download — 10 Platforms\nAll four runtimes can download from:\n\n| Platform | Format |\n| --- | --- |\n| **YouTube** | HLS, DASH, MP4 |\n| **Bilibili** | HLS, DASH, M4S |\n| **IQIYI** | HLS, DASH |\n| **Tencent Video** | HLS, direct link |\n| **Youku** | HLS, DASH |\n| **Douyin** | MP4, direct link |\n| **Generic HLS** | M3U8 streams |\n| **Generic DASH** | MPD manifests |\n| **FFmpeg merge** | TS\u002FM4S → MP4 |\n| **DRM detection** | Widevine, PlayReady |\n\n### 🤖 AI Extraction\n- **LLM extraction** — OpenAI (GPT-4o, etc.) 
and Anthropic\u002FClaude\n- **Entity extraction** — named entities, structured data\n- **Content summarization** — automatic page summarization\n- **Sentiment analysis** — positive\u002Fnegative\u002Fneutral classification\n- **Smart parser** — auto-detects page type and extracts relevant fields (PySpider only)\n- **Schema-driven output** — strongly typed structured extraction (PySpider only)\n- **Few-shot examples** — guide the LLM with examples\n- **XPath suggestion studio** — AI-suggested XPath selectors\n- **Keyword extraction** — AI-powered keyword extraction from page content\n- **Content classification** — AI categorization into predefined categories\n- **Translation** — AI-powered content translation to target languages\n- **Q&A over content** — ask questions about crawled page content with context\n\n### 🛡️ Anti-Bot\n- TLS fingerprint rotation (JA3\u002FJA4 mimicry)\n- Browser behavior simulation (mouse movement, scroll, reading pace)\n- WAF and access-friction detection with compliant browser upgrade paths\n- Night mode (reduced crawl rate during off-hours)\n- **Access friction classifier**: shared `level`, `signals`, `recommended_actions`, `challenge_handoff`, and `capability_plan` across all four runtimes\n- **HTTP response diagnostics**: PySpider `Response.meta[\"access_friction\"]`, GoSpider `Response.AccessFriction`, RustSpider `Response.access_friction`, and JavaSpider `Page.getField(\"access_friction\")`\n- **Captcha and login challenge handling**: detect CAPTCHA\u002Fauth\u002Frisk-control pages, pause for authorized human access, persist session assets, and resume only after validation\n- **Cloudflare\u002FAkamai handling**: vendor profiling, browser-render recommendation, artifact capture, and stop conditions when access is denied\n- **Browser fingerprint management**: Canvas, WebGL, font fingerprint generation with session persistence\n- **Smart delay strategy**: adaptive frequency-based delay adjustment with human-like jitter\n- **Cookie 
management**: per-domain cookie jar with automatic rotation\n- SSRF protection (blocks internal network access, cloud metadata endpoints)\n- **Input sanitization**: XSS prevention, HTML cleaning, dangerous character filtering\n- **Block detection**: keyword-based block\u002Fban detection with automatic proxy switching\n- **Compliance boundary**: the framework does not promise automated CAPTCHA cracking, forced risk-control bypass, or access to private\u002Flogin-gated data without authorization\n\n### 🔒 Security & Reliability\n- **SSRF protection**: blocks requests to private IPs, cloud metadata (169.254.169.254), loopback, multicast\n- **URL validation**: protocol whitelisting, domain allowlist\u002Fblocklist, port restrictions, length limits\n- **Input sanitization**: script tag removal, event handler stripping, HTML entity decoding, filename sanitization\n- **Circuit breaker**: configurable failure threshold, half-open state recovery, prevents cascade failures\n- **Retry strategies**: fixed, linear, exponential, exponential+jitter backoff with configurable status code handling\n- **Failure classification**: automatic categorization (blocked, throttled, anti_bot, timeout, server, proxy)\n- **Request fingerprinting**: SHA-256 fingerprints based on URL + method + headers + cookies + body\n- **Content deduplication**: SHA-256 content hashing to avoid re-processing identical pages\n\n### 🔐 JS Encryption \u002F Node-Reverse\nMany modern sites protect their APIs with JavaScript-generated signatures. 
SuperSpider handles this via a Node.js bridge:\n- Node-reverse client for JS-encrypted sites\n- Encrypted site crawler (HMAC, AES, token-based)\n- JS signature execution via Node.js bridge server\n- Supports HMAC-SHA256, AES-encrypted params, timestamp tokens\n\n### 🌍 Distributed Crawling\n- **Redis** queue (native, all four runtimes)\n- **RabbitMQ** — broker-native (Go, Java), bridge (Rust), native (Python)\n- **Kafka** — broker-native (Go, Java), bridge (Rust), native (Python)\n- Distributed workers with state machine\n- Node discovery: environment variables, file, DNS-SRV, Consul, etcd\n- Dataset mirror to database backends\n- **Autoscaled frontier**: auto-adjusts concurrency based on latency and failure rate\n- **Session pool**: reusable session slots with fingerprint profile + proxy affinity\n- **Dead-letter queue**: failed requests after max retries are quarantined for inspection\n- **Lease-based request dispatching**: TTL-gated leases with heartbeat renewal and domain inflight limits\n- **Checkpoint persistence**: SQLite-backed checkpoint manager with auto-save intervals\n- **Proxy scoring**: success\u002Ffailure ratio-based proxy selection with automatic degradation\n- **Middleware chain**: composable request\u002Fresponse processing pipeline\n\n### 🗄️ Storage Backends\n| Backend | PySpider | GoSpider | RustSpider | JavaSpider |\n| --- | --- | --- | --- | --- |\n| SQLite | ✅ | ✅ | ✅ | ✅ |\n| PostgreSQL | ✅ | ✅ process | ✅ driver+process | ✅ |\n| MySQL | ✅ | ✅ process | ✅ driver+process | ✅ |\n| MongoDB | ✅ | ✅ process | ✅ driver+process | ✅ |\n| JSON \u002F CSV \u002F JSONL | ✅ | ✅ | ✅ | ✅ |\n\n### 📊 Observability\n- Audit trail: in-memory, file, JSONL, composite\n- Monitoring and metrics dashboard\n- Preflight validation (check config before crawling)\n- Checkpoint and resume (SQLite-backed)\n- Incremental crawl (only crawl new\u002Fchanged pages)\n- **Structured event logging**: trace-id correlated events with Prometheus text and OpenTelemetry export\n- 
**Observability collector**: request latency tracking, failure classification, outcome histograms\n- **Artifact store**: filesystem-based artifact storage for screenshots, traces, JSON snapshots, HTML\n- **Graph artifact persistence**: per-page DOM graph (nodes, edges, stats) saved automatically during crawl\n- **Frontier state snapshot**: pending, known, leases, domain-inflight, dead-letters all exportable as JSON\n- **Dashboard API**: `\u002Fapi\u002Fv1\u002Fmonitors\u002F\u003Cname>\u002Fdashboard` provides real-time crawl stats, performance metrics, resource usage\n- **REST API server**: full spider lifecycle management via HTTP (start\u002Fstop\u002Fstats\u002Fqueues\u002Ftasks)\n- **API authentication**: Bearer token \u002F X-API-Token support for production API security\n\n### 🧭 Operator and Authoring Surface\n- **Shared config scaffolding** — `config init` bootstraps the cross-runtime contract config\n- **Site profiling** — `profile-site` emits `crawler_type`, `site_family`, `runner_order`, `strategy_hints`, and `job_templates`\n- **Pre-crawl discovery** — `sitemap-discover` expands crawl candidates before you commit to selectors\n- **Selector debugging** — `selector-studio` lets you validate CSS\u002FXPath\u002Fregex rules against saved HTML\n- **Control-plane tools** — `plugins`, `jobdir`, `http-cache`, `console`, and `audit` are public operator surfaces\n- **Browser tooling** — `fetch`, `trace`, `mock`, and `codegen` are exposed in the browser CLI across the runtimes\n- **Shared starter assets** — `examples\u002Fcrawler-types\u002F`, `examples\u002Fsite-presets\u002F`, and `examples\u002Fclass-kits\u002F` are the canonical starting points for hard site families\n- **Research flows** — async research, notebook-style output, and scenario playbooks are part of the published runtime surface\n\n## Public E-commerce Scope\n\nThe native ecommerce examples, crawler classes, and class kits are built for publicly accessible marketplace data.\n\nThey currently aim 
for:\n\n- product links and identifiers\n- price and promotion signals\n- shop \u002F seller signals\n- review and rating summaries\n- images, videos, embedded JSON, and API candidates\n\nCurrent fast paths:\n\n- `jd`: SKU extraction plus price\u002Freview public APIs\n- `taobao`, `tmall`, `pinduoduo`, `amazon`: JSON-LD product \u002F aggregate-rating fast paths when present\n- `generic`: fallback public-data extraction for unknown storefronts\n\nEach runtime now exposes a unified ecommerce crawler class style entrypoint, plus a browser-backed companion in the runtimes that support full browser capture. The naming is intentionally consistent so the examples can be lifted into a project with minimal translation.\n\nThey do not guarantee universal extraction across every storefront and they do not imply access to login-gated, private, or user-owned commerce data.\n\n---\n\n## 🐍 PySpider — AI-First Python Crawler\n\n**Best for:** AI-powered extraction, rapid prototyping, research workflows\n\n### Unique Capabilities\n- **Smart parser** — automatically detects page type (article, product, listing, etc.) 
and extracts relevant fields without writing selectors\n- **Schema-driven LLM extraction** — define a JSON schema, get structured output from any page\n- **Graph crawler** — crawl relationship graphs, extract nodes and edges; REST API `\u002Fapi\u002Fv1\u002Fgraph\u002Fextract`\n- **Research runtime** — Jupyter-style notebook output for data analysis\n- **Plugin injection** — extend any part of the pipeline with Python plugins\n- **Async runtime** — full async\u002Fawait support with aiohttp\n- **REST API server** — Flask-based server with spider start\u002Fstop, task management, queue control, monitoring dashboards, and metrics\n- **Advanced anti-bot** — browser fingerprint generation (Canvas, WebGL, fonts), TLS profile management, captcha detection\u002Fhandling, human behavior simulation (mouse trajectories, scroll, reading time), smart delay strategy\n- **Cloudflare & Akamai bypass** — specialized header profiles for major WAFs\n- **Security suite** — SSRF protection (blocks private IPs, cloud metadata), URL validation (protocol\u002Fdomain\u002Fport whitelisting), input sanitization (XSS prevention, HTML cleaning)\n- **Circuit breaker** — configurable failure threshold with half-open recovery, prevents cascade failures\n- **Retry strategies** — fixed, linear, exponential, exponential+jitter backoff with async support\n- **Failure classification** — automatic categorization: blocked, throttled, anti_bot, timeout, server, proxy, runtime\n- **Autoscaled frontier** — auto-adjusts concurrency based on latency and failure rate with dead-letter queue\n- **Session pool** — reusable session slots with fingerprint profile + proxy affinity, max 32 sessions\n- **Middleware chain** — composable request\u002Fresponse processing pipeline\n- **Request fingerprinting** — SHA-256 fingerprints from URL + method + headers + cookies + body + meta\n- **Artifact store** — filesystem storage for screenshots, traces, JSON, HTML with metadata\n- **Prometheus + OTel export** — metrics 
export in Prometheus text format and OpenTelemetry payload\n- **Robots.txt compliance** — crawl-delay respect and disallow enforcement\n- **Curl converter** — convert curl commands to spider requests\n- **Production config** — multi-environment configuration with validation\n- **Crawler type playbook** — `docs\u002FCRAWLER_TYPE_PLAYBOOK.md` plus `examples\u002Fcrawler-types\u002F`\n\n### Install\n```bash\n# Install all four runtimes\nscripts\\windows\\install-superspider.bat\nbash scripts\u002Flinux\u002Finstall-superspider.sh\nbash scripts\u002Fmacos\u002Finstall-superspider.sh\n\n# Install only PySpider\n# Windows\nscripts\\windows\\install-pyspider.bat\n\n# Linux \u002F macOS\nbash scripts\u002Flinux\u002Finstall-pyspider.sh\nbash scripts\u002Fmacos\u002Finstall-pyspider.sh\n```\n\n**Output:** `.venv-pyspider` — run `python -m pyspider version` to verify\n\n---\n\n## 🐹 GoSpider — Concurrent Binary Crawler\n\n**Best for:** High-concurrency production crawling, binary deployment, distributed worker clusters\n\n### Unique Capabilities\n- **Single binary** — no runtime dependencies, deploy anywhere\n- **Native Selenium\u002FWebDriver** — direct WebDriver protocol, no wrapper overhead\n- **Broker-native queues** — RabbitMQ and Kafka via native Go clients\n- **Dedicated platform extractors** — separate packages for Bilibili, IQIYI, Tencent, Youku, Douyin\n- **Process + driver DB adapters** — flexible database backend selection\n- **Audit trail module** — structured audit logging with composite writers\n- **Browser automation** — Chrome browser pool with lifecycle management, auto-restart on failure, graceful shutdown\n- **WAF bypass suite** — Cloudflare, Akamai, Alibaba Cloud, Tencent Cloud specialized bypass strategies\n- **Anti-detection** — stealth mode, WebDriver property removal, Chrome automation flag masking\n- **TLS fingerprint rotation** — JA3\u002FJA4 mimicry profiles for different browsers\n- **Behavior simulation** — mouse movement, reading pace, scroll 
patterns\n- **DASH media downloader** — HTTP range-based segment download with parallel workers, FFmpeg merge, retry logic\n- **Distributed node reverse** — Node.js bridge with managed subprocess lifecycle\n- **Task engine** — task creation, execution, status tracking, result storage\n- **Scheduler** — Cron-based scheduling, one-time tasks, interval tasks, concurrent management\n- **Event system** — structured events, priority queue, dispatcher, subscriber model\n- **Monitor suite** — performance monitoring, resource tracking, health checks, alerting\n- **Extractor framework** — XPath, CSS selector, regex, JSONPath extraction with validation\n- **Proxy rotation** — proxy pool with health checking, automatic failover\n- **Rate limiting** — token bucket, sliding window, adaptive rate control\n- **Config-driven crawling** — JSON-based spider configuration, template-driven execution\n\n### Install\n```bash\n# Windows\nscripts\\windows\\install-gospider.bat\n\n# Linux \u002F macOS\nbash scripts\u002Flinux\u002Finstall-gospider.sh\nbash scripts\u002Fmacos\u002Finstall-gospider.sh\n```\n\n**Output:** `gospider\u002Fgospider` binary — run `.\u002Fgospider\u002Fgospider --version` to verify\n\n---\n\n## 🦀 RustSpider — Performance-First Crawler\n\n**Best for:** Performance-sensitive deployments, strict resource boundaries, feature-gated release control\n\n### Unique Capabilities\n- **Feature-gated modules** — compile only what you need (browser, distributed, API, web)\n- **Native node+playwright process** — Playwright runs as a managed subprocess, not a wrapper\n- **Fantoccini Selenium facade** — async Rust WebDriver client\n- **Real captcha API flow** — async 2captcha\u002FAnti-Captcha with polling, not placeholder\n- **Driver-level DB adapters** — native Rust drivers for Postgres, MySQL, MongoDB\n- **Benchmark suite** — built-in performance benchmarks\n- **Preflight validation** — validate all config and dependencies before starting\n- **Encrypted site crawler** — 
HMAC-SHA256, AES-encrypted params, timestamp token generation\n- **Media downloader** — HLS\u002FDASH with FFmpeg integration, segment tracking, progress reporting\n- **Async runtime** — tokio-based async execution with cancellation support\n- **Distributed worker** — Redis-based task distribution with worker heartbeat\n- **Proxy rotation** — proxy pool with success rate scoring, automatic failover\n- **Task scheduler** — cron-based scheduling with execution history\n- **Performance monitor** — resource usage tracking, latency histograms, throughput metrics\n- **Transformer pipeline** — composable data transformation stages\n- **Node reverse** — Node.js subprocess management for JS signature execution\n- **FFI bindings** — C-compatible interface for embedding in other languages\n- **API server** — HTTP API for spider control and status querying\n- **Artifact storage** — file-based artifact persistence with metadata\n\n### Install\n```bash\n# Windows\nscripts\\windows\\install-rustspider.bat\n\n# Linux \u002F macOS\nbash scripts\u002Flinux\u002Finstall-rustspider.sh\nbash scripts\u002Fmacos\u002Finstall-rustspider.sh\n```\n\n**Output:** `rustspider\u002Ftarget\u002Frelease\u002Frustspider` — run `.\u002Frustspider\u002Ftarget\u002Frelease\u002Frustspider --version` to verify\n\n---\n\n## ☕ JavaSpider — Enterprise Java Crawler\n\n**Best for:** Enterprise Java environments, Maven\u002FJAR delivery, browser-heavy automation, audit-conscious execution\n\n### Unique Capabilities\n- **Maven profiles** — `lite \u002F ai \u002F browser \u002F distributed \u002F full` — build only what you need\n- **Dedicated audit trail** — the strongest audit support in the set: in-memory, file, JSONL, composite\n- **Broker-native queues** — RabbitMQ via `amqp-client`, Kafka via `kafka-clients`\n- **REST API server** — built-in `\u002Fhealth`, `\u002Fjobs`, `\u002Fjobs\u002F{id}`, `\u002Fjobs\u002F{id}\u002Fresult` endpoints\n- **Async spider runtime** — `AsyncSpiderRuntime` for 
non-blocking execution\n- **Workflow replay** — record and replay browser workflows\n- **Generic-parser fallback** — media parsing never fails silently; falls back to generic extraction\n- **Adaptive rate limiter** — AI-guided rate control with latency-based backpressure\n- **Batch media downloader** — concurrent download with progress tracking, retry, merge\n- **User-agent rotator** — browser-specific UA pools with header consistency\n- **Connector framework** — pluggable database connectors (SQLite, PostgreSQL, MySQL, MongoDB)\n- **CLI interface** — command-line spider execution with config loading\n- **Bridge module** — cross-runtime communication bridge\n- **Session management** — cookie jar, session persistence across requests\n- **Media pipeline** — YouTube, Bilibili, IQIYI, Tencent, Youku, Douyin with format detection\n- **Workflow engine** — DAG-based workflow execution with conditional branching\n\n### Maven Profiles\n```bash\n# Minimal (core crawling only)\nmvn -f javaspider\u002Fpom.xml -P lite -DskipTests package\n\n# With AI extraction\nmvn -f javaspider\u002Fpom.xml -P ai -DskipTests package\n\n# With browser automation\nmvn -f javaspider\u002Fpom.xml -P browser -DskipTests package\n\n# With distributed runtime\nmvn -f javaspider\u002Fpom.xml -P distributed -DskipTests package\n\n# Everything\nmvn -f javaspider\u002Fpom.xml -P full -DskipTests package\n```\n\n### Install\n```bash\n# Windows\nscripts\\windows\\install-javaspider.bat\n\n# Linux \u002F macOS\nbash scripts\u002Flinux\u002Finstall-javaspider.sh\nbash scripts\u002Fmacos\u002Finstall-javaspider.sh\n```\n\n**Output:** `javaspider\u002Ftarget\u002F` — run `java -jar javaspider\u002Ftarget\u002Fjavaspider-*.jar --version` to verify\n\n---\n\n## 📦 Install Prerequisites\n\n| Framework | Required |\n| --- | --- |\n| 🐍 PySpider | Python 3.10+ recommended, pip, venv |\n| 🐹 GoSpider | Go 1.24+ |\n| 🦀 RustSpider | Rust 1.70+ recommended, Cargo |\n| ☕ JavaSpider | Java 17 target, Maven 3.8+ 
|\n\nSupported installer operating systems: Windows 10\u002F11 or Windows Server 2022+, Ubuntu\u002FDebian\u002FRHEL-compatible Linux, and macOS 13+. The current Windows verification host is Microsoft Windows 11 Pro 10.0.28000, 64-bit.\n\n---\n\n## 🗺️ Quick Selection Guide\n\n| I need... | Use |\n| --- | --- |\n| AI-powered extraction with LLM | 🐍 PySpider |\n| High-concurrency binary deployment | 🐹 GoSpider |\n| Maximum performance, strict boundaries | 🦀 RustSpider |\n| Enterprise Java, Maven\u002FJAR, audit trail | ☕ JavaSpider |\n| Download YouTube \u002F Bilibili \u002F Douyin | any (all four) |\n| Crawl JS-encrypted sites | any (all four support node-reverse) |\n| Distributed worker cluster | 🐹 GoSpider or 🦀 RustSpider |\n| Rapid prototyping and research | 🐍 PySpider |\n| REST API to control crawlers | 🐍 PySpider (Flask) or ☕ JavaSpider |\n| Browser fingerprint management | 🐍 PySpider (Canvas, WebGL, fonts) |\n| Circuit breaker + retry strategies | 🐍 PySpider (4 strategies + circuit breaker) |\n| Cloudflare \u002F Akamai bypass | 🐍 PySpider or 🐹 GoSpider |\n| SSRF protection + input sanitization | 🐍 PySpider |\n| Workflow automation (DAG) | ☕ JavaSpider |\n| Feature-gated compilation | 🦀 RustSpider |\n| Single binary deployment | 🐹 GoSpider |\n| Prometheus \u002F OTel metrics export | 🐍 PySpider |\n| Graph crawling + relationship extraction | 🐍 PySpider |\n| Session pool management | 🐍 PySpider (32 sessions with fingerprint affinity) |\n| Autoscaled concurrency | 🐍 PySpider (frontier-based auto-scaling) |\n\n---\n\n## 📚 Documentation\n\n| Document | Description |\n| --- | --- |\n| [`docs\u002FDOCS_INDEX.md`](docs\u002FDOCS_INDEX.md) | Canonical documentation index and recommended reading order |\n| [`docs\u002FFRAMEWORK_CAPABILITIES.md`](docs\u002FFRAMEWORK_CAPABILITIES.md) | Detailed per-framework capability descriptions |\n| [`docs\u002FFRAMEWORK_CAPABILITY_MATRIX.md`](docs\u002FFRAMEWORK_CAPABILITY_MATRIX.md) | Full capability comparison tables |\n| 
[`docs\u002FACCESS_FRICTION_PLAYBOOK.md`](docs\u002FACCESS_FRICTION_PLAYBOOK.md) | High-friction crawl model, challenge handoff, and compliant recovery policy |\n| [`docs\u002FCRAWL_SCENARIO_GAP_MATRIX.md`](docs\u002FCRAWL_SCENARIO_GAP_MATRIX.md) | Real crawling scenarios that are still partial or missing across the four runtimes |\n| [`docs\u002FLATEST_SCENARIO_CASES.md`](docs\u002FLATEST_SCENARIO_CASES.md) | Latest practical scenario playbooks and recommended runtime choices |\n| [`docs\u002FCRAWLER_TYPE_PLAYBOOK.md`](docs\u002FCRAWLER_TYPE_PLAYBOOK.md) | Shared crawler types, runner-order guidance, and JobSpec template mapping |\n| [`docs\u002FSITE_PRESET_PLAYBOOK.md`](docs\u002FSITE_PRESET_PLAYBOOK.md) | Site-family starter presets for major marketplace and social-commerce domains |\n| [`examples\u002Fclass-kits\u002FREADME.md`](examples\u002Fclass-kits\u002FREADME.md) | Reusable spider class templates for all four runtimes |\n| [`docs\u002FSUPERSPIDER_INSTALLS.md`](docs\u002FSUPERSPIDER_INSTALLS.md) | Install instructions for Windows, Linux, and macOS |\n| [`docs\u002FFOUR_RUNTIME_HEALTH_REPORT.md`](docs\u002FFOUR_RUNTIME_HEALTH_REPORT.md) | Current compile, dependency, and test status for all four runtimes |\n| [`MEDIA_PARITY_REPORT.md`](MEDIA_PARITY_REPORT.md) | Media platform coverage evidence |\n| [`ADVANCED_USAGE_GUIDE.md`](ADVANCED_USAGE_GUIDE.md) | Advanced crawling scenarios |\n| [`ENCRYPTED_SITE_CRAWLING_GUIDE.md`](ENCRYPTED_SITE_CRAWLING_GUIDE.md) | JS-encrypted site crawling |\n| [`NODE_REVERSE_INTEGRATION_GUIDE.md`](NODE_REVERSE_INTEGRATION_GUIDE.md) | Node.js reverse engineering bridge |\n| [`ULTIMATE_ENHANCEMENT_GUIDE.md`](ULTIMATE_ENHANCEMENT_GUIDE.md) | Full capability enhancement reference |\n| [`PUBLISH_RELEASE_STATUS.md`](PUBLISH_RELEASE_STATUS.md) | Publish-time verification status and release notes |\n| [`CHANGELOG.md`](CHANGELOG.md) | Version history |\n| [`CONTRIBUTING.md`](CONTRIBUTING.md) | Contribution guide |\n\n---\n\n## ✅ 
Verification Snapshot\n\nChecked against the current workspace on **2026-04-25**:\n\n| Runtime | Verified command(s) | Current result |\n| --- | --- | --- |\n| 🐍 PySpider | `python -m pytest tests\\test_access_friction.py tests\\test_locator_analyzer.py tests\\test_super_framework.py tests\\test_api_server.py tests\\test_core_spider.py tests\\test_downloader.py -q` | Pass, 40 tests |\n| 🐹 GoSpider | `go test .\u002F...` | Pass |\n| 🦀 RustSpider | `cargo test --quiet --lib`, `cargo test --quiet --test access_friction` | Pass on checked slices; full suite is heavy and should be run in CI with a longer timeout window |\n| ☕ JavaSpider | `mvn -q test`, `mvn -q -Dtest=HtmlSelectorContractTest test` | Pass |\n\nNotes:\n\n- The four runtimes now share access-friction detection for high-risk pages, browser-upgrade planning, XPath\u002FCSS locator helpers, and browser\u002Fdevtools-oriented element analysis.\n- PySpider full-suite success is not claimed here because an earlier broad `pytest -q` run exceeded the local timeout window; use CI with longer timeouts for unrestricted release coverage.\n\n---\n\n## ⭐ If This Project Helped You\n\nIf SuperSpider saved you time, helped you build something, or taught you something new — please consider giving it a **⭐ star** on GitHub!\n\n**Why it matters:**\n- ⭐ Stars help other developers discover this project\n- ⭐ Stars motivate continued development and maintenance\n- ⭐ Stars show the community that multi-language crawler frameworks are valuable\n\n**Has SuperSpider helped you?**\n- 🎬 Downloaded videos from YouTube, Bilibili, or other platforms?\n- 🤖 Extracted structured data using AI\u002FLLM?\n- 🛡️ Bypassed anti-bot protection on a challenging site?\n- 🔐 Cracked a JS-encrypted API?\n- 🌍 Built a distributed crawler cluster?\n- 📊 Automated data collection for research or business?\n\nIf yes to any of the above — **[click the ⭐ Star button](https:\u002F\u002Fgithub.com\u002FLyx3314844-03\u002Fsuperspider)** at the top of this page. 
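It takes 2 seconds and means a lot! 🙏\n\n---\n\n## 📄 License\n\nMIT License — see [LICENSE](LICENSE) for details.\n","SuperSpider is a multi-language web crawler framework that supports four languages: Java, Go, Rust, and Python. Its core features include HTTP- and browser-based web crawling, dynamic page handling, AI extraction, media download, anti-bot techniques, and distributed execution. Each runtime is optimized for a different engineering environment: the Python version emphasizes rapid iteration and a project-oriented workflow, the Go version emphasizes concurrency and binary-first delivery, the Rust version focuses on performance and strong typing, and the Java version is better suited to enterprise applications, with comprehensive audit-trail support. The framework is particularly suitable for enterprises and individual developers who need large-scale web data collection, e-commerce data analysis, or automated content processing.",2,"2026-06-11 02:44:05","CREATED_QUERY"]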