[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-82138":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":8,"htmlUrl":8,"language":9,"languages":8,"totalLinesOfCode":8,"stars":10,"forks":11,"watchers":12,"openIssues":12,"contributorsCount":13,"subscribersCount":13,"size":13,"stars1d":14,"stars7d":14,"stars30d":15,"stars90d":13,"forks30d":13,"starsTrendScore":16,"compositeScore":17,"rankGlobal":8,"rankLanguage":8,"license":18,"archived":19,"fork":19,"defaultBranch":20,"hasWiki":21,"hasPages":19,"topics":22,"createdAt":8,"pushedAt":8,"updatedAt":23,"readmeContent":24,"aiSummary":25,"trendingCount":13,"starSnapshotCount":13,"syncStatus":26,"lastSyncTime":27,"discoverSource":28},82138,"vibe-scanner","safeboundai\u002Fvibe-scanner","safeboundai",null,"HTML",214,58,1,0,7,106,21,5.31,"Apache License 2.0",false,"main",true,[],"2026-06-12 02:04:23","# vibe-scanner\n\n[![License: Apache 2.0](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-Apache_2.0-blue.svg)](https:\u002F\u002Fopensource.org\u002Flicenses\u002FApache-2.0)\n[![Python 3.10+](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpython-3.10+-blue.svg)](https:\u002F\u002Fwww.python.org\u002Fdownloads\u002F)\n[![Node 20+](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fnode-20+-green.svg)](https:\u002F\u002Fnodejs.org\u002F)\n\nDiscovers and assesses **vibe-coded shadow apps** — internal tools deployed by employees, without IT review, on AI\u002Fno-code builders (Lovable, Replit, Base44), JAMstack\u002Fserverless hosts (Netlify, Vercel, Cloudflare Pages, Fly.io, Firebase Hosting), ML demo platforms (Hugging Face Spaces, Streamlit Cloud), and quick-prototype platforms (Glitch). Given a single target domain (e.g. 
`example.com`), it enumerates apps belonging to that organization across all 11 platforms, then probes each app for exposed authentication, hardcoded secrets, Supabase RLS-bypass conditions ([CVE-2025-48757](https:\u002F\u002Fnvd.nist.gov\u002Fvuln\u002Fdetail\u002FCVE-2025-48757)), and sensitive data classes.\n\nBuilt for **enterprise red teams** doing authorized shadow-IT discovery against their own organizations.\n\nThe CLI streams JSON-line events on stdout. The bundled Node dashboard forwards them as Server-Sent Events to a browser-based terminal UI at `\u002Fvibe-scan.html`.\n\n---\n\n## Quick start\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fsafeboundai\u002Fvibe-scanner.git\ncd vibe-scanner\ncp .env.example .env             # fill in SERPER_API_KEY (required)\ndocker build -t vibe-scanner .\ndocker run --rm -p 8080:8080 --env-file .env vibe-scanner\n# open http:\u002F\u002Flocalhost:8080\n```\n\nWithout Docker:\n\n```bash\npython3 -m venv venv && source venv\u002Fbin\u002Factivate\npip install -r requirements.txt\ncd server && npm install && cd ..\ncp .env.example .env             # fill in SERPER_API_KEY\nPYTHON_BIN=$(pwd)\u002Fvenv\u002Fbin\u002Fpython (cd server && npm start)\n# open http:\u002F\u002Flocalhost:8080\n```\n\nOr run the CLI directly:\n\n```bash\nvenv\u002Fbin\u002Fpython -m scans.vibe_scan_cli --domain example.com --name \"Example Co\"\n```\n\n### Test locally\n\nAfter the image is built, re-run with your local `.env` injected at run time:\n\n```bash\ndocker run --rm -p 8080:8080 --env-file .env vibe-scanner\n```\n\nThe container reads `SERPER_API_KEY` (and any other vars) from your `.env` via `--env-file`. The file itself is **not** baked into the image — `.dockerignore` excludes it on purpose so secrets don't leak into a shared registry.\n\n---\n\n## Configuration\n\nAll settings come from environment variables (auto-loaded from `.env`). 
See `.env.example` for the full list.\n\n| Variable | Required | Purpose |\n|---|---|---|\n| `SERPER_API_KEY` | **yes** | Drives the Google-index dork queries. Get one at [serper.dev](https:\u002F\u002Fserper.dev). |\n| `HF_TOKEN` | recommended | Speeds up the first GLiNER model download (~700 MB). |\n| `USE_AI` | optional | `true` to enable GPT-4o risk-assessment narratives. Default `false` (rules-engine fallback). |\n| `OPENAI_API_KEY` | optional | Required when `USE_AI=true`. |\n| `PORT` | optional | Default `8080`. |\n| `ALLOWED_ORIGINS` | optional | CORS allowlist (comma-separated). Blank = same-origin only. |\n| `SSM_PREFIX` | optional | Look up missing secrets in AWS SSM under this prefix. Requires `pip install vibe-scanner[ssm]`. |\n\n---\n\n## Algorithm\n\nThree sequential phases plus per-app post-processing.\n\n### Phase 0 — Target map (pre-discovery, optional)\n\n**Purpose:** widen the identity-token set fed to the dork generator. URL-derived identity (`example.com` → `Example` \u002F `example`) misses apps named after products rather than the parent organization — e.g. Example Corp's vibe-coded tool for \"ProductX\" won't surface for `site:vercel.app \"example\"`.\n\n**Code:** `scans\u002F_target_map.py`\n\n1. BFS crawl the target domain using `requests` + BeautifulSoup. Settings: `MAX_DEPTH=3`, `MAX_PAGES=50`, `TIME_BUDGET_S=20`, `REQUEST_TIMEOUT=8`. The time budget is a hard wall-clock cutoff so a slow or hostile target never blocks a scan.\n2. Same-host filter excludes external links and social media at link-extraction time.\n3. From each fetched page, pull `\u003Ctitle>`, `\u003Cmeta name=\"description\">`, `\u003Ch1\u002Fh2\u002Fh3>`, and body text.\n4. Run the concatenated text through GLiNER with entity labels `[\"organization\", \"product name\", \"brand name\"]` at threshold 0.5.\n5. Normalize each span: trim, drop length ≤ 3 or ≥ 50, strip stop tokens (`home`, `about`, `login`, …), require at least one ASCII letter.\n6. 
Filter out tokens already covered by the URL-derived identity, plus sub\u002Fsuperstrings thereof.\n7. Return sorted list. Attached to `identity[\"extra_tokens\"]` and used in (a) relevance scoring (+0.2 per match) and (b) one extra `site:{platform} \"{token}\"` dork per platform.\n\nSkipped entirely with `--no-target-map`. Gracefully degrades to `[]` when GLiNER isn't available.\n\n### Phase 1 — Discovery\n\n**Code:** `scans\u002Fvibe_code.py:discover`\n\nFor each platform × dork template (× extra token), call Serper's `\u002Fsearch`.\n\n**Dork templates** (11 total):\n\n| Family | Template | Purpose |\n|---|---|---|\n| Identity | `site:{platform} \"{company_name}\"` | Direct company-name mentions |\n| Identity | `site:{platform} \"{domain_root}\"` | Bare brand string |\n| Identity | `site:{platform} \"mailto:{domain}\"` | Pages with org-domain mailto links |\n| Identity | `site:{platform} \"@{domain}\"` | Pages mentioning org email addresses |\n| Identity | `site:{platform} \"{company_slug}-\"` | Hostname-style slug prefix |\n| Identity | `site:{platform} \"{company_slug} \"` | Slug as standalone token |\n| Intent | `site:{platform} \"{domain}\" \"Supabase\"` | Pages with Supabase config + org tie-in |\n| Intent | `site:{platform} \"{domain}\" \"firebaseConfig\"` | Pages with Firebase config + org tie-in |\n| Intent | `site:{platform} \"{company_name}\" \"dashboard\"` | Admin\u002Fdata UIs |\n| Intent | `site:{platform} \"{company_name}\" \"admin\"` | Admin surfaces |\n| Intent | `site:{platform} \"{company_name}\" \"login\"` | Auth surfaces |\n\n**Pagination:** loops up to `SERPER_PAGES=2` pages at `num=30`, stopping early when a page returns empty or fewer than `num` results. Per-scan upper bound: 11 dorks × 11 platforms × 2 pages = 242 Serper requests, plus one per extra-token dork × 11 platforms × 2 pages.\n\n**Per-hit filtering:**\n\n1. 
Drop if hostname matches `TEST_SLUG_PATTERNS` (`^(test|demo|example|my-first|hello-world|untitled)-`, `\\b(template|starter|boilerplate)\\b`).\n2. Compute relevance: +0.5 for full `domain.com` in title\u002Fsnippet\u002Flink, +0.3 for company slug, +0.3 for company name, +0.2 for domain root, +0.2 per extra-token match. Capped at 1.0.\n3. Compute hostname match: split the subdomain on `[^a-z0-9]+`, drop generic tokens (`app`, `dev`, `tools`, …), accept if any identity needle is in the resulting token set.\n4. Compute text match: full `domain.com` literal in title or snippet.\n5. **Keep only if** `relevance ≥ 0.3` **AND** (`host_match` **OR** `text_match`). The second condition is what kills false positives like \"company mentioned in passing in a pentest blog.\"\n6. Dedupe by URL keeping the highest-relevance copy.\n\nReturn the top `max_apps` (default 20) sorted by relevance descending.\n\n### Phase 2 — Probing\n\n**Code:** `scans\u002Fvibe_code.py:probe`. Runs in a 5-worker `ThreadPoolExecutor`.\n\nFor each surviving candidate:\n\n1. **HEAD** with 10s timeout, redirects followed. HTTP 401\u002F403 → `auth_status=\"secured\"`. HTTP ≥ 400 → return.\n2. **GET** with 10s timeout. If the final URL contains `login` and differs from the request URL → `auth_status=\"platform_auth\"`.\n3. Scan response body (≤ first 5000 chars):\n   - Password-input regex → `auth_status=\"trivial\"`. Absence + 200 OK → `auth_status=\"none\"`.\n   - Hardcoded password \u002F API-key regex → `hardcoded_credentials=True`.\n   - Supabase URL regex → `supabase_detected=True`. With an anon-JWT also present, calls `_check_supabase_rls`.\n4. **Supabase RLS probe** (CVE-2025-48757): for each guess in `(\"users\",\"profiles\",\"customers\",\"leads\",\"accounts\")`, GET `{supabase_url}\u002Frest\u002Fv1\u002F{table}?select=*&limit=1` with the anon JWT. The function asserts `r.request.method == \"GET\"` defensively — the probe is read-only by construction. 
Non-empty 200 → `supabase_rls_bypass=True`.\n5. **Attribution** — accepts only ownership-grade signals (a bare `domain.com` substring is *not* sufficient — see \"service-provider targets\" below). Signals in order:\n   - `mailto:*@domain` → `signal=\"mailto:@domain\"`\n   - `@domain\\b` literal → `signal=\"@domain\"`\n   - `©\u002F&copy;\u002F&#169;\u002Fcopyright` within 40 chars of `company_name` → `signal=\"copyright\"`\n   - `\u003Clink rel=\"canonical\" href=\"https:\u002F\u002Fdomain\">` → `signal=\"canonical\"`\n   - GLiNER `organization` span overlapping `company_name` → `signal=\"gliner:organization\"` (demoted when \"uses\" language is present)\n   - Hostname-only match (e.g. `acme-crm.vercel.app` for acme.com) → `signal=\"hostname\"` (also demoted by \"uses\" language)\n6. **\"Uses\" language detection** — regex over the body for `powered by X`, `built with X`, `made with X`, `via X`, `using X`, `integrates with X`, etc. When matched, the page frames the target as a consumed dependency, so the two weakest attribution paths (GLiNER organization span, hostname-only match) are blocked. Strong literals (mailto, @domain) and ownership-grade signals (copyright, canonical) override \"uses\" language.\n7. **SPA fallback** — if attribution still fails AND the body looks like an empty SPA shell, re-fetch via headless Chrome (`scans\u002F_browser.py`) and re-run attribution, credential scan, and Supabase detection against the rendered DOM.\n\nCandidates without `attribution_found=True` are dropped before classification — the CLI emits a `SKIP` phase event for each.\n\n### Classification\n\n**Code:** `scans\u002Fvibe_code.py:classify` + `scans\u002F_gliner.py:classify_text`.\n\n1. Strip HTML tags from the snippet.\n2. GLiNER with 22 entity labels at threshold 0.4. 
Input HTML-stripped, capped at 8000 chars, chunked at 1200 chars on sentence boundaries (or force-broken when there's no punctuation) to stay under gliner_medium-v2.1's 384-token per-sentence sequence limit.\n3. Map detected labels onto data classes: `{person name, email, phone, address}` → `pii_contact`; `{customer record}` → `crm`; `{employee record, salary}` → `hr`; `{go-to-market, competitive analysis, unreleased product}` → `strategy`; `{budget, vendor contract}` → `finance`; `{medical, health record}` → `healthcare`; `{api key, db connection string}` → `credentials`; `{source code}` → `source_code`.\n4. **Fallback when GLiNER unavailable:** regex pass over keyword patterns against the first 6000 chars.\n5. `sensitivity_score = min(1.0, 0.2 * len(data_classes))`.\n\n### Severity scoring\n\n| Condition | Severity |\n|---|---|\n| `supabase_rls_bypass` | **CRITICAL** |\n| `auth_status=\"none\"` AND `sensitivity > 0.6` | **CRITICAL** |\n| `auth_status=\"none\"` | **HIGH** |\n| `auth_status=\"trivial\"` AND `sensitivity > 0.4` | **HIGH** |\n| `hardcoded_credentials` | **HIGH** |\n| `auth_status=\"oauth_any\"` AND `sensitivity > 0.3` | **MEDIUM** |\n| `auth_status=\"platform_auth\"` | **LOW** |\n| else | **LOW** |\n\n### Regulatory mapping\n\nEach detected data class maps to relevant regulatory frameworks (e.g. `pii_contact` → `CCPA · GDPR · state breach notification laws`). The joined string is attached as `regulatory_exposure`.\n\n### Known limitation: service-provider targets\n\nThe attribution model is calibrated for **product companies** (clear product boundary, rare incidental mentions). For **service-provider targets** — `huggingface.co`, `github.com`, `openai.com`, `stripe.com`, `npmjs.com` — the world is full of third-party apps that legitimately reference them in product context. 
The ownership-grade rules (strong literals only, \"uses\" framing demotes weak attribution) cut most of that noise but cannot fully solve it: a third-party tool that ships with a `© Hugging Face` snippet copied from upstream will still slip through. For such targets, prefer a narrower identity (a specific product name as `--name`) and accept lower recall, or use a different tool that walks `*.target-domain` via DNS + cert-transparency logs (out of scope here).\n\n---\n\n## SSE event schema\n\nThe CLI emits one JSON object per line on stdout. The Node parent forwards each as an SSE `data:` frame.\n\n```\n{\"type\": \"phase\",  \"label\": \"...\", \"detail\": \"...\"}        progress in terminal animation\n{\"type\": \"app\",    \"app\":   {url, platform, severity, ...}} per-app probe result, streamed\n{\"type\": \"result\", \"result\": {identity, apps, summary}}    final payload after all probes\n{\"type\": \"error\",  \"message\": \"...\"}                       fatal error (e.g. missing API key)\n```\n\nPhase labels: `PHASE 0` (target map), `PHASE 1` (discovery), `PHASE 2` (probing), per-platform `Querying`, `SKIP`\u002F`WARN` per candidate, `FILTERED` summary, `SCAN COMPLETE`.\n\n---\n\n## Module layout\n\n```\nscans\u002F\n  vibe_code.py       ── pipeline (discover, probe, classify, calculate_severity)\n  vibe_scan_cli.py   ── CLI wrapper, emits SSE-shaped JSON lines on stdout\n  _target_map.py     ── Phase 0: BFS crawl + GLiNER token extraction\n  _gliner.py         ── lazy-loaded GLiNER singleton + label maps\n  _browser.py        ── headless-Chrome SPA fallback\nutils\u002F\n  secrets.py         ── env-var first, optional SSM backend\nserver\u002F\n  server.js          ── SSE endpoint \u002Fapi\u002Fvibe-scan, GPT-4o proxy \u002Fapi\u002Fassess\n  public\u002Fvibe-scan.html  ── dashboard terminal animation\n```\n\n---\n\n## Dependencies\n\n- **`requests`, `beautifulsoup4`, `python-dotenv`** — required.\n- **`gliner`** — required for high-accuracy 
classification and attribution. First call downloads `urchade\u002Fgliner_medium-v2.1` (~700 MB) into `~\u002F.cache\u002Fhuggingface\u002F`. Without it, both fall back to regex (lower recall).\n- **`selenium`** — required for the SPA-rendering fallback (`scans\u002F_browser.py`). The Dockerfile installs Chrome stable; outside Docker, ensure `google-chrome-stable` is on `PATH` or set `BROWSER_PATH`.\n- **`boto3`** — optional (`pip install vibe-scanner[ssm]`). Only needed if you set `SSM_PREFIX` to use AWS SSM as a secret backend.\n\n---\n\n## Ethics & authorization\n\nVibeScan is built for **authorized security testing**: scanning domains you own, or domains where you have explicit written permission from the owner. The Supabase RLS probe is read-only by construction (`assert r.request.method == \"GET\"`), but discovery-time dorks and SPA-rendering probes generate traffic to third-party platforms — use accordingly. Don't point this at organizations without authorization.\n\n---\n\n## Contributing\n\nIssues and PRs are welcome. For substantial changes, please open an issue first to discuss.\n\n---\n\n## License\n\nApache License 2.0 — see [LICENSE](LICENSE).\n","vibe-scanner is a security tool for discovering and assessing shadow applications inside an enterprise that were deployed without IT review. It scans apps across 11 platforms, including no-code builders, JAMstack\u002Fserverless hosts, machine-learning demo platforms, and quick-prototype platforms, and checks them for exposed authentication, hardcoded secrets, specific vulnerabilities (such as CVE-2025-48757), and sensitive data exposure. The tool is built with Python 3.10+ and Node.js 20+, can be operated through a command-line interface or a web interface, and can be deployed as a Docker container to simplify installation. It is intended for enterprise red teams performing authorized security audits and risk assessments of shadow IT resources within their own organizations.",2,"2026-06-11 04:07:51","CREATED_QUERY"]