[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-79982":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":14,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":15,"stars7d":15,"stars30d":14,"stars90d":15,"forks30d":15,"starsTrendScore":15,"compositeScore":16,"rankGlobal":10,"rankLanguage":10,"license":17,"archived":18,"fork":18,"defaultBranch":19,"hasWiki":20,"hasPages":18,"topics":21,"createdAt":10,"pushedAt":10,"updatedAt":35,"readmeContent":36,"aiSummary":37,"trendingCount":15,"starSnapshotCount":15,"syncStatus":38,"lastSyncTime":39,"discoverSource":40},79982,"SherlockMaps","Ayyouboss0011\u002FSherlockMaps","Ayyouboss0011","Powerful Google Maps Crawler \u002F Scraper tool with REST API, Docker support & multi-format export","",null,"Python",74,7,1,0,42.81,"MIT License",false,"main",true,[22,23,24,25,26,27,28,29,30,31,32,33,34],"browser-automation","data-extraction","docker","google","google-maps","maps","maps-api","playwright","python","rest-api","scrapi","scraping","web-crawler","2026-06-12 04:01:26","\u003Cdiv align=\"center\">\n  \u003Cimg src=\"public\u002FSherlockMaps.png\" alt=\"SherlockMaps Icon\" width=\"200\">\n\u003C\u002Fdiv>\n\n[![Python 3.9+](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpython-3.9+-blue.svg)](https:\u002F\u002Fwww.python.org\u002Fdownloads\u002F) [![License: MIT](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-MIT-yellow.svg)](https:\u002F\u002Fopensource.org\u002Flicenses\u002FMIT) [![Playwright](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fbrowser--automation-playwright-green.svg)](https:\u002F\u002Fplaywright.dev\u002F)\n\nA professional, open-source Google Maps web crawler that extracts company information from Google Maps. 
Built with [Playwright](https:\u002F\u002Fplaywright.dev\u002F) for browser automation.\n\n\u003Cdiv align=\"center\">\n  \u003Ch2>Sherlock Maps\u003C\u002Fh2>\n  \u003Cp>Open-Source Google Maps Webcrawler\u003C\u002Fp>\n\u003C\u002Fdiv>\n\n## Features\n\n- **Object-Oriented** - Cleanly structured with classes, dataclasses, and design patterns\n- **Search** - Search Google Maps with any search term\n- **Detailed Company Information** extraction:\n  - Company name\n  - Category \u002F Industry\n  - Address\n  - Phone number\n  - Website URL\n  - Rating (stars)\n  - Number of reviews\n  - Plus Code\n  - Opening hours\n  - Attributes (wheelchair accessibility, etc.)\n- **Deduplication** based on company name + website\n- **URL Validation** (filters out invalid websites)\n- **Multiple Output Formats**: JSON, CSV, Pretty-Print, File, Print\n- **REST API** - Asynchronous job queue server with full-featured endpoints\n- **Docker Support** - Containerized deployment\n- **Chrome Profile Persistence** - Session data persists between runs\n\n---\n\n## Quick Start\n\n### Option 1: Docker (Recommended)\n\nThe easiest way to get started. 
Docker handles all dependencies, Playwright, and browser setup automatically.\n\n#### Using docker-compose (Simplest)\n\n```bash\n# Clone the repository\ngit clone https:\u002F\u002Fgithub.com\u002FAyyouboss0011\u002FSherlockMaps.git\ncd SherlockMaps\n\n# Start the API server\ndocker compose up -d\n\n# The API is now running at http:\u002F\u002Flocalhost:8000\n# Interactive documentation: http:\u002F\u002Flocalhost:8000\u002Fdocs\n```\n\n#### Start a crawl via API\n\n```bash\ncurl -X POST http:\u002F\u002Flocalhost:8000\u002Fcrawl \\\n  -H \"Content-Type: application\u002Fjson\" \\\n  -d '{\"prompt\": \"restaurants berlin\"}'\n```\n\n#### Get results\n\n```bash\n# Check job status\ncurl http:\u002F\u002Flocalhost:8000\u002Fstatus\n\n# Get all results\ncurl http:\u002F\u002Flocalhost:8000\u002Fresults\n```\n\n#### Stop the container\n\n```bash\ndocker compose down\n```\n\n#### Using Docker CLI (without docker-compose)\n\n```bash\n# Clone the repository\ngit clone https:\u002F\u002Fgithub.com\u002FAyyouboss0011\u002FSherlockMaps.git\ncd SherlockMaps\n\n# Build the image\ncd core && docker build -t sherlock-maps . 
&& cd ..\n\n# Run as API server\ndocker run -d -p 8000:8000 --name sherlock-maps sherlock-maps\n\n# Run in CLI mode (one-time crawl)\ndocker run --rm -e PROMPT=\"restaurants berlin\" sherlock-maps python \u002Fapp\u002Fcore\u002Fmain_cli.py\n```\n\n---\n\n### Option 2: Without Docker\n\nInstall Python dependencies manually and run the crawler directly.\n\n#### Prerequisites\n\n- Python 3.9 or higher\n- Git\n\n#### Installation\n\n```bash\n# Clone the repository\ngit clone https:\u002F\u002Fgithub.com\u002FAyyouboss0011\u002FSherlockMaps.git\ncd SherlockMaps\n\n# Install Python dependencies\ncd core\npip install -r requirements.txt\n\n# Install Playwright browsers\nplaywright install chromium\n```\n\n#### Run the CLI crawler\n\n```bash\n# Set search term\nexport PROMPT=\"restaurants berlin\"\n\n# Run the crawler\npython main.py\n```\n\n#### Run the REST API server\n\n```bash\n# Start the API server on port 8000\npython api_main.py\n\n# The API is now running at http:\u002F\u002Flocalhost:8000\n# Interactive documentation: http:\u002F\u002Flocalhost:8000\u002Fdocs\n```\n\n#### Start a crawl via API\n\n```bash\ncurl -X POST http:\u002F\u002Flocalhost:8000\u002Fcrawl \\\n  -H \"Content-Type: application\u002Fjson\" \\\n  -d '{\"prompt\": \"restaurants berlin\"}'\n```\n\n#### Use as a Python library\n\n```python\nfrom core.crawler import run_crawler\n\nresults = run_crawler(\n    prompt=\"restaurants berlin\",\n    headless=False,\n    output_format=\"json\"\n)\n\nfor company in results:\n    print(f\"{company.name} - {company.website}\")\n```\n\n---\n\n## CLI Mode\n\nThe crawler can be used directly from the command line. 
All results are output to stdout as JSON by default.\n\n### Output Formats\n\n```bash\n# JSON to stdout (default)\nexport PROMPT=\"restaurants berlin\"\npython main.py\n\n# Save as JSON file\nexport PROMPT=\"restaurants berlin\"\nexport OUTPUT_FORMAT=\"file\"\npython main.py\n\n# Save as CSV file\nexport PROMPT=\"restaurants berlin\"\nexport OUTPUT_FORMAT=\"csv\"\npython main.py\n\n# Formatted output (one company per block)\nexport PROMPT=\"restaurants berlin\"\nexport OUTPUT_FORMAT=\"print\"\npython main.py\n\n# Human-readable output\nexport PROMPT=\"restaurants berlin\"\nexport OUTPUT_FORMAT=\"pretty\"\npython main.py\n\n# Headless mode (for production\u002Fservers)\nexport PROMPT=\"restaurants berlin\"\nexport HEADLESS=\"true\"\npython main.py\n```\n\n### Output Formats Overview\n\n| Format | Description |\n|---|---|\n| `json` | JSON array to stdout (default) |\n| `file` | Saves results as `sherlock-maps_YYYYMMDD_HHMMSS.json` |\n| `csv` | Saves results as `sherlock-maps_YYYYMMDD_HHMMSS.csv` |\n| `print` | Each company individually with separator |\n| `pretty` | Human-readable format with aligned fields |\n\n### Environment Variables\n\n| Variable | Description | Default |\n|---|---|---|\n| `PROMPT` | Search term for Google Maps | Required |\n| `OUTPUT_FORMAT` | Output format: `json`, `print`, `file`, `csv`, `pretty` | `json` |\n| `HEADLESS` | Run browser in headless mode | `false` |\n| `GOOGLE_API_KEY` | Optional Google API key | (empty) |\n\n---\n\n## Python Library\n\n### Simple (Convenience Function)\n\n```python\nfrom core.crawler import run_crawler\n\nresults = run_crawler(\n    prompt=\"restaurants berlin\",\n    headless=False,\n    output_format=\"json\"\n)\n\nfor company in results:\n    print(f\"{company.name} - {company.website}\")\n```\n\n### Complete (With Configuration)\n\n```python\nfrom core.models import CrawlerConfig, CompanyData\nfrom core.crawler import GoogleMapsCrawler\n\n# Create configuration\nconfig = CrawlerConfig(\n    
search_prompt=\"restaurants berlin\",\n    headless=False,\n    output_format=\"pretty\",\n)\n\n# Use crawler with context manager\nwith GoogleMapsCrawler(config) as crawler:\n    results = crawler.crawl()\n\n# Process results\nfor company in results:\n    if isinstance(company, CompanyData):\n        print(f\"{company.name}: {company.rating} stars ({company.reviews_count} reviews)\")\n```\n\n### Custom Search at Runtime\n\n```python\nfrom core.models import CrawlerConfig\nfrom core.crawler import GoogleMapsCrawler\n\nconfig = CrawlerConfig(\n    search_prompt=\"cafes berlin\",\n    output_format=\"json\",\n)\n\nwith GoogleMapsCrawler(config) as crawler:\n    # First search\n    results1 = crawler.crawl()\n\n    # Second search with different term\n    results2 = crawler.crawl(prompt=\"restaurants munich\")\n```\n\n---\n\n## REST API\n\nThe crawler can run as a persistent service with REST API. The container starts as an API server and can process multiple crawl jobs sequentially.\n\n### Start the API\n\n```bash\n# Build the image\ncd core\ndocker build -t sherlock-maps .\n\n# Start API server (port 8000)\ndocker run -p 8000:8000 sherlock-maps\n\n# With custom port\ndocker run -p 8080:8080 -e API_PORT=8080 sherlock-maps\n```\n\n### API Endpoints\n\n#### Health & Status\n\n| Method | Path | Description |\n|---|---|---|\n| GET | `\u002Fhealth` | Health check (for Docker orchestrators) |\n| GET | `\u002Fstatus` | Current status (idle\u002Fbusy), active jobs, queue length |\n| GET | `\u002Fstats` | Detailed statistics |\n\n#### Crawler Control\n\n| Method | Path | Description |\n|---|---|---|\n| POST | `\u002Fcrawl` | Start a new crawl job |\n| GET | `\u002Fcrawl\u002F{job_id}` | Get job status |\n| GET | `\u002Fcrawl\u002F{job_id}\u002Fresults` | Get job results |\n| DELETE | `\u002Fcrawl\u002F{job_id}` | Cancel a running job |\n| GET | `\u002Fcrawl\u002Fhistory` | List all jobs with pagination |\n\n#### Data Management\n\n| Method | Path | Description 
|\n|---|---|---|\n| GET | `\u002Fresults` | Get all results |\n| POST | `\u002Fresults\u002Fexport` | Export results |\n| DELETE | `\u002Fresults\u002Fclear` | Clear all results |\n\n#### Configuration\n\n| Method | Path | Description |\n|---|---|---|\n| GET | `\u002Fconfig` | Get current configuration |\n| PUT | `\u002Fconfig` | Update configuration |\n\n#### Browser\n\n| Method | Path | Description |\n|---|---|---|\n| GET | `\u002Fbrowser\u002Finfo` | Browser information |\n| POST | `\u002Fbrowser\u002Frestart` | Restart browser |\n\n### API Examples\n\n```bash\n# Start a new crawl job\ncurl -X POST http:\u002F\u002Flocalhost:8000\u002Fcrawl \\\n  -H \"Content-Type: application\u002Fjson\" \\\n  -d '{\"prompt\": \"restaurants berlin\", \"output_format\": \"json\"}'\n\n# Get job status\ncurl http:\u002F\u002Flocalhost:8000\u002Fcrawl\u002F\u003Cjob_id>\n\n# Get results\ncurl http:\u002F\u002Flocalhost:8000\u002Fcrawl\u002F\u003Cjob_id>\u002Fresults\n\n# Get all results as CSV\ncurl \"http:\u002F\u002Flocalhost:8000\u002Fresults?format=csv\"\n\n# Get status\ncurl http:\u002F\u002Flocalhost:8000\u002Fstatus\n\n# Health check\ncurl http:\u002F\u002Flocalhost:8000\u002Fhealth\n\n# Cancel job\ncurl -X DELETE http:\u002F\u002Flocalhost:8000\u002Fcrawl\u002F\u003Cjob_id>\n\n# Job history\ncurl \"http:\u002F\u002Flocalhost:8000\u002Fcrawl\u002Fhistory?limit=10&offset=0\"\n```\n\n### Request Example\n\n```json\n{\n  \"prompt\": \"restaurants berlin\",\n  \"output_format\": \"json\",\n  \"headless\": false,\n  \"locale\": \"de-DE\",\n  \"max_results\": 100\n}\n```\n\n### Response Example (Job Status)\n\n```json\n{\n  \"job_id\": \"abc-123-def\",\n  \"status\": \"completed\",\n  \"prompt\": \"restaurants berlin\",\n  \"created_at\": \"2026-01-15T10:30:00Z\",\n  \"completed_at\": \"2026-01-15T10:31:30Z\",\n  \"results_count\": 42,\n  \"error\": null\n}\n```\n\n### Job Status\n\n| Status | Description |\n|---|---|\n| `pending` | In the queue |\n| `running` | Currently running 
|\n| `completed` | Successfully completed |\n| `failed` | Failed |\n| `cancelled` | Cancelled |\n\n### Interactive API Documentation\n\nWhen the API server is running, interactive Swagger documentation is available:\n\n```\nhttp:\u002F\u002Flocalhost:8000\u002Fdocs\n```\n\n---\n\n## How It Works\n\n1. **Search** - Navigates to Google Maps with the search term\n2. **Scroll** - Loads all search results by scrolling\n3. **Extract** - Navigates to each result's detail page and extracts:\n   - Company name, category, address, phone, website\n   - Rating and number of reviews\n   - Opening hours\n   - Attributes\n4. **Filter** - Removes duplicates and validates website URLs\n5. **Output** - Outputs results in the desired format\n\n---\n\n## Architecture\n\n```\nSherlock Maps\u002F\n├── .gitignore\n├── docker-compose.yml\n├── README.md\n├── public\u002F\n│   └── SherlockMaps.png\n└── core\u002F\n    ├── __init__.py                   # Package exports\n    ├── main.py                       # CLI entry point\n    ├── main_cli.py                   # CLI logic\n    ├── crawler.py                    # Main crawler class\n    ├── requirements.txt              # Python dependencies\n    ├── api\u002F\n    │   ├── __init__.py\n    │   ├── models.py                 # API data models\n    │   ├── queue_manager.py          # Job queue management\n    │   └── server.py                 # FastAPI server\n    ├── browser\u002F\n    │   ├── __init__.py\n    │   └── browser_manager.py        # Browser lifecycle management\n    ├── exceptions\u002F\n    │   ├── __init__.py\n    │   └── crawler_exceptions.py     # Custom exceptions\n    ├── extractors\u002F\n    │   ├── __init__.py\n    │   └── maps_extractor.py         # Google Maps data extraction\n    ├── models\u002F\n    │   ├── __init__.py\n    │   ├── company.py                # CompanyData model\n    │   └── crawler_config.py         # CrawlerConfig model\n    ├── output\u002F\n    │   ├── __init__.py\n    │   └── output_handler.py  
       # Output formats\n    └── processors\u002F\n        ├── __init__.py\n        ├── url_validator.py          # URL validation\n        └── deduplication_processor.py # Deduplication\n```\n\n### Class Overview\n\n| Class | Module | Description |\n|---|---|---|\n| `Sherlock Maps` | | Open-Source Google Maps Webcrawler |\n| `GoogleMapsCrawler` | `crawler.py` | Main class, orchestrates the entire crawling process |\n| `BrowserManager` | `browser\u002Fbrowser_manager.py` | Manages Playwright browser lifecycle |\n| `MapsExtractor` | `extractors\u002Fmaps_extractor.py` | Extracts company data from Google Maps |\n| `CompanyData` | `models\u002Fcompany.py` | Data model for a company |\n| `CrawlerConfig` | `models\u002Fcrawler_config.py` | Crawler configuration |\n| `URLValidator` | `processors\u002Furl_validator.py` | Validates HTTP(S) URLs |\n| `DeduplicationProcessor` | `processors\u002Fdeduplication_processor.py` | Removes duplicates |\n| `OutputHandler` | `output\u002Foutput_handler.py` | Formats and outputs results |\n| `CrawlerBaseException` | `exceptions\u002Fcrawler_exceptions.py` | Base exception class |\n\n---\n\n## Configuration\n\n### CrawlerConfig Attributes\n\n| Attribute | Type | Default | Description |\n|---|---|---|---|\n| `search_prompt` | `str` | `\"\"` | The search term for Google Maps |\n| `headless` | `bool` | `False` | Run browser in headless mode |\n| `output_format` | `Literal` | `\"json\"` | Output format |\n| `chrome_profile_path` | `str` | `\"Chrome_Profile\"` | Path to Chrome user data directory |\n| `viewport` | `ViewPort` | `1920x1080` | Browser viewport dimensions |\n| `locale` | `str` | `\"de-DE\"` | Browser localization |\n| `page_timeout` | `int` | `30000` | Maximum navigation timeout in ms |\n| `selector_timeout` | `int` | `15000` | Maximum timeout for selectors in ms |\n| `scroll_timeout` | `int` | `45` | Maximum time for scrolling in seconds |\n| `max_scroll_attempts` | `int` | `5` | Number of scroll attempts before stop |\n| 
`max_retries` | `int` | `3` | Number of navigation retry attempts |\n| `request_timeout` | `int` | `25000` | Request timeout in ms |\n\n---\n\n## Example Output\n\n```json\n[\n  {\n    \"name\": \"Restaurant Name\",\n    \"category\": \"Restaurant\",\n    \"address\": \"Musterstrasse 1, 10115 Berlin\",\n    \"phone\": \"+49 30 12345678\",\n    \"website\": \"https:\u002F\u002Fwww.restaurant-example.de\",\n    \"rating\": \"4.5\",\n    \"reviews_count\": \"234\",\n    \"plus_code\": \"GVMF+8H Berlin\",\n    \"opening_hours\": \"Mon: 12:00-22:00, Tue: 12:00-22:00, ...\",\n    \"attributes\": [\"Wheelchair accessible entrance\"]\n  }\n]\n```\n\n---\n\n## Limitations\n\n- Google Maps UI changes may break selectors (CSS classes like `h1.DUwDvf` are Google-specific)\n- Rate limiting: Google may show CAPTCHAs when requests arrive too quickly\n- German localization is hardcoded (`hl=de`); for other languages, `browser_manager.py` must be modified\n- Requires a display or headless mode for Chromium\n\n---\n\n## Resources\n\n- [Playwright Documentation](https:\u002F\u002Fplaywright.dev)\n- [Contribution Guide](CONTRIBUTING.md)\n- [MIT License](LICENSE)\n\n---\n\n## License\n\nMIT License","SherlockMaps is a powerful Google Maps crawler that extracts company information from Google Maps. It uses Playwright for browser automation and supports a REST API, Docker deployment, and export to multiple formats. Its core features include searching by any keyword, extracting details such as company name, address, and phone number, and deduplication and URL validation. SherlockMaps also provides an asynchronous job queue server and Chrome session persistence, so that session data persists between runs. It suits scenarios that require bulk collection of business information from Google Maps, such as market research or data analysis projects.",2,"2026-06-11 03:58:46","CREATED_QUERY"]