[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-80802":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":15,"stars7d":15,"stars30d":13,"stars90d":15,"forks30d":15,"starsTrendScore":15,"compositeScore":16,"rankGlobal":10,"rankLanguage":10,"license":17,"archived":18,"fork":18,"defaultBranch":19,"hasWiki":20,"hasPages":18,"topics":21,"createdAt":10,"pushedAt":10,"updatedAt":29,"readmeContent":30,"aiSummary":31,"trendingCount":15,"starSnapshotCount":15,"syncStatus":32,"lastSyncTime":33,"discoverSource":34},80802,"acon","WillyEverGreen\u002Facon","WillyEverGreen","The intelligence layer for any web scraper. Pair with Scrapling, Playwright, or httpx to crawl smarter.","https:\u002F\u002Fpypi.org\u002Fproject\u002Facon-intel",null,"Python",39,1,38,0,0.9,"MIT License",false,"main",true,[22,23,24,25,26,27,28],"crawler","playwright","python","scraping","site-intelligence","spider","web-scraping","2026-06-12 02:04:07","\u003Cdiv align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Fraw.githubusercontent.com\u002FWillyEverGreen\u002Facon\u002Fmain\u002Flogo.png\" width=\"120\" alt=\"Acon Logo\">\n  \u003Ch1>Acon — The Intelligent Brain for Any Scraper\u003C\u002Fh1>\n  \u003Cp>Acon doesn't replace Scrapling or Firecrawl. It tells them where to look.\u003C\u002Fp>\n\u003C\u002Fdiv>\n\n---\n\n## Why Acon?\n\nMost crawlers are dumb. They follow links blindly, return raw HTML, and break the moment a site changes its structure. 
Before you can extract anything useful, you need to understand what you're dealing with.\n\n**Acon is a site intelligence engine.** It maps the structural \"skeleton\" of a website automatically — before any data extraction happens — so your scraper always knows where to look.\n\n---\n\n## 🏗️ The Core Thesis\n\nMost modern web scrapers suffer from **\"URL Exhaustion\"** — they spend 90% of their bandwidth fetching identical product or blog pages. Acon introduces a **Topology Orchestrator** that maps, classifies, and samples site structures, then **stops the moment it has fully learned the site's DNA** — no wasted requests.\n\n---\n\n## 📊 Real-World Benchmark Results (v0.1.2 - Final 10\u002F10 Polish)\n\n**The correct question**: How many pages does each engine need to fully map a site's structure?\n\nBoth crawlers given an **uncapped budget**. BFS runs until exhaustion. Acon stops the moment `low_information_gain` fires — meaning the site's structural DNA is fully mapped.\n\n### Comparison Summary (4 Representative Sites)\n\n| Site | BFS Pages | **Acon Pages** | **Request Reduction** | **Time Saved** | Stopped By |\n| :--- | :---: | :---: | :---: | :---: | :--- |\n| **books.toscrape.com** | 200 | **6** | **97.0%** | **93.7%** | `low_information_gain` |\n| **Hacker News** | 50 | **9** | **82.0%** | **89.0%** | `low_information_gain` |\n| **Wikipedia** | 100 | **8** | **92.0%** | **93.7%** | `low_information_gain` |\n| **PyPI** | 100 | **20** | **80.0%** | **93.4%** | `queue_exhausted` |\n\n---\n\n### Deep Dive: books.toscrape.com (E-Commerce)\n\n| | Blind BFS | Acon |\n| :--- | :---: | :---: |\n| **Pages Crawled** | 200 | **6** |\n| **Time Taken** | 54.1s | **3.4s** |\n| **Stopped by** | budget cap | `low_information_gain` |\n| **Topology Detected** | — | `deep_uniform` |\n\n**97% fewer requests. 
Acon stopped at 6 pages because it detected that the structural DNA (product pages, category pages) was already fully mapped.**\n\n---\n\n### Deep Dive: PyPI (Multi-Template Registry)\n\n| | Blind BFS | Acon |\n| :--- | :---: | :---: |\n| **Pages Crawled** | 100 | **20** |\n| **Time Taken** | 100.7s | **6.6s** |\n| **Stopped by** | budget cap | `queue_exhausted` |\n| **Topology Detected** | — | `thin` |\n\n**80% fewer requests. Acon classified the site and exhausted the relevant discovery queue in just 20 pages.**\n\n---\n\n> The key insight: a blind crawler keeps crawling because it doesn't know what it doesn't know. Acon tracks information gain in a sliding window — once new pages stop adding structural novelty, it stops and hands you the map.\n\n---\n\n## 🚀 Use Cases\n\n**Price Monitoring & E-Commerce Intelligence**\nAcon detects pagination patterns and repeating product templates automatically. No manual selector configuration per site.\n\n**Content Archival & Research**\nFeed Acon a publication's root URL. 
It identifies the site's content structure, prioritizes article pages over navigation noise, and hands you a clean discovery map.\n\n**Site Auditing & SEO Analysis**\nGet an instant structural report — template count, link depth, topology classification (SPA vs static vs paginated) — in a single run.\n\n---\n\n## ⚡ What Makes Acon Different\n\n| Capability | Typical Crawler | Acon |\n|---|---|---|\n| **JS-rendered sites** | Manual Playwright setup | **Autonomous escalation** |\n| **Site structure** | Unknown until scraped | **Detected before extraction** |\n| **Large site performance** | Degrades at scale | **O(log N) priority queue** |\n| **Bandwidth efficiency** | Downloads everything | **Asset blocking (Discovery mode)** |\n| **Discovery Latency** | Static only | **Static-First Hybrid Escalation** |\n| **Failed crawls** | Lost progress | **SQLite resumption (WAL)** |\n| **Budget waste** | Crawls until cap | **Stops when structure is learned** |\n\n---\n\n## 🏗️ The Efficiency Pillars\n\nAcon is optimized for production environments where every request costs money:\n\n*   ⚡ **Static-First Discovery**: Acon probes pages with raw HTTP first. It only launches a browser if the site is a SPA, saving 90% of compute on standard sites.\n*   🚫 **Intelligent Asset Blocking**: During discovery, Acon automatically aborts requests for images, fonts, and CSS to slash bandwidth and CPU usage.\n*   📉 **Adaptive Early Stop (`low_information_gain`)**: Acon tracks structural novelty across a sliding window. 
When new pages stop adding unique signal, crawling stops — before the budget is spent.\n*   🧬 **Debounced Topology Detection**: Structural analysis (DNA mapping) is throttled to key milestones (1, 10, 25, 50 pages) to ensure max throughput.\n\n---\n\n## 🏗️ The Unified Intelligence Stack (The Acon Alliance)\n\nAcon doesn't just map sites; it orchestrates the most powerful open-source scraping tools into a single, high-fidelity pipeline.\n\n*   **🕵️ Stealth (Camoufox)**: Enable `use_stealth=True` to launch an \"invisible\" browser engine that bypasses Cloudflare and Akamai automatically.\n*   **📄 Content (Trafilatura)**: Enable `extract_content=True` to get clean, LLM-ready Markdown from every discovered page natively.\n*   **🚀 Speed (Scrapling)**: Use the `scrapling_adapter` to export Acon's \"DNA Map\" into Scrapling for turbo-charged mass extraction.\n\n---\n\n## 🛠️ Installation\n\n```bash\npip install acon-intel\n\n# To enable the Alliance pillars (Highly Recommended)\npip install trafilatura camoufox scrapling\nplaywright install chromium\n```\n\n---\n\n## ⚡ Quick Start\n\n```python\nimport asyncio\nfrom acon import SiteCrawlOrchestrator, CrawlConfig\n\nasync def main():\n    config = CrawlConfig(\n        max_pages=50,          # Hard ceiling\n        extract_content=True,  # Trafilatura: clean Markdown per page\n        use_stealth=True       # Camoufox: bypass bot detection\n    )\n\n    brain = SiteCrawlOrchestrator()\n    result = await brain.crawl_site(\"https:\u002F\u002Fnews.ycombinator.com\", config)\n\n    print(f\"Topology: {result['topology']}\")\n    print(f\"Pages crawled: {result['pages_crawled']}\")\n    print(f\"Stopped by: {result['crawl_meta']['early_stop_reason']}\")\n\n    for page in result[\"page_summaries\"]:\n        print(f\"  {page['url']} — {page['page_type']}\")\n        if page['content']:\n            print(f\"    {page['content'][:80]}...\")\n\nif __name__ == \"__main__\":\n    asyncio.run(main())\n```\n\n---\n\n## 📦 The Output 
Shape\n\n```json\n{\n  \"topology\": \"multi_template\",\n  \"pages_crawled\": 12,\n  \"pages_failed\": 0,\n  \"page_summaries\": [\n    {\n      \"url\": \"https:\u002F\u002Fpypi.org\u002Fproject\u002Frequests\u002F\",\n      \"page_type\": \"standard\",\n      \"js_required\": false,\n      \"content\": \"# requests 2.31.0...\",\n      \"parent_url\": \"https:\u002F\u002Fpypi.org\"\n    }\n  ],\n  \"crawl_meta\": {\n    \"early_stop_reason\": \"low_information_gain\",\n    \"crawl_duration_s\": 29.5,\n    \"reflection\": {\n      \"intelligence_score\": 0.33,\n      \"failure_rate\": 0.0,\n      \"advice\": \"Continue current strategy.\"\n    }\n  }\n}\n```\n\n---\n\n## 🛣️ Roadmap\n- [x] **Stealth Integration**: Native support for **Camoufox** (Fingerprint bypass).\n- [x] **LLM-Ready Pipeline**: Native **Trafilatura** integration for high-fidelity Markdown output.\n- [x] **Speed Pillar**: Official **Scrapling** adapter for mass extraction.\n- [x] **Session Persistence**: SQLite WAL-mode crawl resumption across process restarts.\n- [x] **Adaptive Intelligence**: `low_information_gain` early stop — avoids burning crawl budgets.\n- [ ] **Discovery API**: Expose Acon as a standalone Discovery microservice.\n\n---\n\n*Acon: The connective tissue of the intelligent web.*\n","Acon is a project that provides an intelligence layer for any web scraping tool, designed to be paired with tools such as Scrapling, Playwright, or httpx for more efficient crawling. Its core functionality is automatic mapping of a website's structure: the Topology Orchestrator classifies and samples the site, then stops issuing unnecessary requests once the site's architecture is fully understood, which greatly reduces resource consumption during scraping. In real-world tests across several different types of websites, Acon significantly reduced the number of page requests (by up to 97%) and saved a substantial amount of time. It suits scenarios that require efficient, precise extraction of information from complex or large-scale websites, such as price monitoring and e-commerce intelligence.",2,"2026-06-11 04:02:23","CREATED_QUERY"]