[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-46":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":15,"compositeScore":20,"rankGlobal":10,"rankLanguage":10,"license":21,"archived":22,"fork":22,"defaultBranch":23,"hasWiki":24,"hasPages":22,"topics":25,"createdAt":10,"pushedAt":10,"updatedAt":26,"readmeContent":27,"aiSummary":28,"trendingCount":16,"starSnapshotCount":16,"syncStatus":29,"lastSyncTime":30,"discoverSource":31},46,"speca","NyxFoundation\u002Fspeca","NyxFoundation","SPECA: Specification-to-Checklist Agentic Auditing Framework","https:\u002F\u002Farxiv.org\u002Fabs\u002F2604.26495",null,"Python",429,27,269,9,0,3,7,50,4.34,"MIT License",false,"main",true,[],"2026-06-12 02:00:07","\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002Fspeca_logo.png\" alt=\"SPECA logo\" width=\"240\" \u002F>\n\u003C\u002Fp>\n\n\u003Ch1 align=\"center\">SPECA: A Specification-to-Checklist Agentic Auditing Framework\u003C\u002Fh1>\n\n\u003Cp align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2604.26495\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2604.26495-b31b1b.svg\" alt=\"arXiv\">\u003C\u002Fa>\n  \u003Ca href=\"LICENSE\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-MIT-blue.svg\" alt=\"License: MIT\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FNyxFoundation\u002Fspeca\u002Factions\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FCI-GitHub%20Actions-2088FF?logo=githubactions&logoColor=white\" alt=\"CI\">\u003C\u002Fa>\n  \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPython-3.11+-3776AB?logo=python&logoColor=white\" alt=\"Python 
3.11+\">\n\u003C\u002Fp>\n\n> **Paper:** Masato Kamba, Hirotake Murakami, Akiyoshi Sannai. *Beyond Code Reasoning: A Specification-Anchored Audit Framework for Expert-Augmented Security Verification.* arXiv preprint [arXiv:2604.26495](https:\u002F\u002Farxiv.org\u002Fabs\u002F2604.26495), 2026.\n\n### Abstract\n\nSecurity-critical software is routinely audited by tools that reason about vulnerabilities as repository-local code patterns. Yet specification-governed systems — protocol stacks, consensus implementations, cryptographic libraries — are constrained by invariants and correctness conditions defined in natural-language specifications. When a vulnerability arises from what the *specification requires* rather than how code is written, code-level approaches lack the representational vocabulary to detect it, and their false positives resist systematic diagnosis.\n\n**SPECA** is a specification-anchored security audit framework that derives explicit, typed security properties from natural-language specifications and audits implementations through structured **proof-attempt** reasoning grounded in each property. The framework yields three capabilities absent from code-driven auditing:\n\n1. **Spec-dependent detections** that no code-local pattern matcher can express.\n2. **Controlled cross-implementation comparison** under a shared property vocabulary.\n3. 
**False positives that decompose into interpretable, pipeline-phase-traceable root causes.**\n\n### Headline Results\n\n- **Sherlock Ethereum Fusaka Audit Contest** (366 submissions, 10 implementations): SPECA recovers **all 15** in-scope H\u002FM\u002FL vulnerabilities (5H\u002F2M\u002F8L) and independently discovers **4 bugs confirmed by developer fix commits** — including a cryptographic invariant violation absent from all 366 adjudicated contest submissions.\n- **RepoAudit C\u002FC++ benchmark** (15 projects, 35 non-disputed ground-truth bugs): SPECA matches the best published precision (**88.9%**, Sonnet 4.5) while surfacing **12 author-validated candidate bugs beyond the established ground truth** — two confirmed by upstream maintainers.\n- **All false positives** in the deep analysis (N=16) trace to **three interpretable root causes** — trust boundary misunderstanding (50%), code reading error (37.5%), specification misinterpretation (12.5%) — each mapped to a specific pipeline phase.\n\nSee [Evaluation](#evaluation) for full numbers and charts.\n\n## Table of Contents\n\n- [Why SPECA?](#why-speca)\n- [Quick Start](#quick-start)\n- [Demo](#demo)\n- [Architecture](#architecture)\n- [Phases](#phases)\n- [Running on GitHub Actions](#running-on-github-actions)\n- [Configuration](#configuration)\n- [Evaluation](#evaluation) — RQ1 Sherlock + RQ2 RepoAudit, with charts\n- [Reproducing the Benchmarks](#reproducing-the-benchmarks)\n- [Contributing](#contributing)\n- [Citation](#citation)\n- [License](#license)\n\n## Why SPECA?\n\nExisting LLM-based auditors begin from the *code* and work outward — scanning a repository for bug-pattern templates, dataflow anomalies, and API misuse. Specification-governed systems break this assumption: a vulnerability can arise from what the spec *requires* even when no local code pattern looks suspicious. 
The KZG batch-verification bug recovered by SPECA in [§Evaluation](#evaluation) is exactly this kind of issue — a violation of a mathematical invariant defined only in the specification, missed by all 366 contest auditors despite the code being open and well-reviewed.\n\nSPECA inverts the direction of analysis. It begins from the **specification** and derives a typed property vocabulary, then asks the implementation to *prove* each property. This shift produces three capabilities that code-driven tools cannot match:\n\n| | Code-driven auditing | SPECA (specification-anchored) |\n|---|---|---|\n| **Detection** | Finds defects that look like known bug patterns | Finds defects defined as violations of explicit, typed properties |\n| **Cross-implementation comparison** | Each codebase analyzed in isolation | Single property vocabulary applied uniformly across N implementations |\n| **False positive triage** | Opaque — \"the model thought this was a bug\" | FPs decompose into 3 root causes (trust boundary \u002F code reading \u002F spec misinterpretation), each tied to a pipeline phase |\n\nA second, often-overlooked benefit: because every finding is grounded in a specific property derived from a specific spec section, every detection has a **provenance chain** (`property → subgraph → spec section → INV-* label`). This makes findings auditable, not just generated.\n\n### Why \"proof-attempt\" instead of \"find bugs\"\n\nAn early prototype used the conventional adversarial framing — *\"find bugs in this code\"* — and produced an **88% false-positive rate**. Without a structured claim to disprove, the model emitted speculative findings with weak grounding. 
The proof-attempt framing forces the model to commit to a verifiable claim before reporting a gap, and the recall-safe 3-gate review filter (simplified down from a 5-gate prototype, after the dropped gates were shown to filter informational true positives at 0% precision) preserves H\u002FM\u002FL recall while filtering ~2\u002F3 of the remaining false positives.\n\n## Quick Start\n\n### Prerequisites\n\n- **Python 3.11+** and [`uv`](https:\u002F\u002Fgithub.com\u002Fastral-sh\u002Fuv) (`pip install uv`)\n- **Node.js 20+** (for the Claude Code CLI and MCP servers)\n- **Anthropic API access** — `ANTHROPIC_API_KEY` exported in your shell, or a logged-in [Claude Code](https:\u002F\u002Fdocs.claude.com\u002Fen\u002Fdocs\u002Fclaude-code) session\n- **`git`** — Phase 03 auto-clones the target repository at the commit pinned in `outputs\u002FTARGET_INFO.json`\n\n### Install\n\n```bash\n# 1. Clone\ngit clone https:\u002F\u002Fgithub.com\u002FNyxFoundation\u002Fspeca.git\ncd speca\n\n# 2. Install Claude Code CLI (used as the worker runtime)\nnpm install -g @anthropic-ai\u002Fclaude-code\n\n# 3. Install Python deps via uv (creates an isolated env)\nuv sync\n\n# 4. Register MCP servers (tree_sitter \u002F filesystem \u002F fetch)\nbash scripts\u002Fsetup_mcp.sh\nbash scripts\u002Fsetup_mcp.sh --verify\n```\n\n### Run a single phase\n\n```bash\n# Smoke-test: discover specs from a seed URL\nSPEC_URLS=\"https:\u002F\u002Fgithub.com\u002Fethereum\u002FEIPs\u002Fblob\u002Fmaster\u002FEIPS\u002Feip-7594.md\" \\\n  uv run python3 scripts\u002Frun_phase.py --phase 01a\n```\n\n### End-to-end audit\n\n```bash\n# Place these two files first:\n#   outputs\u002FBUG_BOUNTY_SCOPE.json   # required by Phase 01e\n#   outputs\u002FTARGET_INFO.json        # required by Phase 02c\u002F03\n\nuv run python3 scripts\u002Frun_phase.py --target 04 --workers 4 --max-concurrent 64\n```\n\nOutputs are written to `outputs\u002F\u003Cphase_id>_PARTIAL_*.json`. 
See the [Configuration](#configuration) section below for `BUG_BOUNTY_SCOPE.json` \u002F `TARGET_INFO.json` formats.\n\n### Run the test suite\n\n```bash\nuv run python3 -m pytest tests\u002F -v --tb=short\n```\n\n## Demo\n\nSee past and ongoing audit runs on the **GitHub Actions** page:\n\n**[View Actions Runs](https:\u002F\u002Fgithub.com\u002FNyxFoundation\u002Fspeca\u002Factions)**\n\nEach workflow step (01a through 04) can be triggered independently via `workflow_dispatch`. Results are committed to audit branches and can be reviewed as Pull Requests.\n\n## Architecture\n\nSPECA is organized as a **6-phase pipeline** in two stages: **Knowledge Structuring** (Phases 1–3) transforms natural-language specifications into explicit security properties, and **Systematic Auditing** (Phases 4–6) applies structured proof-attempt reasoning to check whether each implementation satisfies those properties.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002Fpipeline.png\" alt=\"SPECA pipeline\" width=\"900\" \u002F>\n\u003C\u002Fp>\n\nIn multi-implementation settings, the **left stage executes once** against the specification (producing a shared property vocabulary), and the **right stage executes per implementation** — enabling controlled cross-implementation security comparison by holding security expectations constant while varying the code under test.\n\n| Stage | Phase | Name | Purpose |\n|---|---|---|---|\n| **Knowledge Structuring** | 1 | Specification Discovery | Crawl spec documents into a structured index |\n|  | 2 | Subgraph Extraction | Decompose specs into [Nielson & Nielson](https:\u002F\u002Fwww.imm.dtu.dk\u002F~hrni\u002F) program graphs with RFC 2119–derived invariants |\n|  | 3 | Property Generation | STRIDE + CWE Top 25 threat model → typed security properties (Invariant \u002F Pre \u002F Post \u002F Assumption) |\n| **Systematic Auditing** | 4 | Code Pre-resolution | Tree-sitter symbol resolution links each property to source locations (40–60% 
audit-token reduction) |\n|  | 5 | Property-Grounded Audit | Per-property *Map → Prove → Stress-Test* — gaps in the proof are findings |\n|  | 6 | Severity-Preserving Review | Three narrow mechanical gates (Dead Code \u002F Trust Boundary \u002F Scope) preserve H\u002FM\u002FL recall |\n\n### The Audit Harness\n\nThe pipeline ships as a reusable **audit harness** under `scripts\u002Forchestrator\u002F` — not a one-off script. The harness provides the infrastructure that every phase needs (queueing, parallel worker dispatch, token-aware batching, resume on partial failure, per-phase budget enforcement, shared circuit-breaker logic, and structured log\u002Fcost telemetry); each phase plugs in a worker prompt and a Pydantic schema and inherits all of the above for free. This separation is what makes the framework reusable: you can drop in a new phase, target a new codebase, or swap a model backbone without touching the harness itself.\n\nConcretely, the harness:\n- **Drives the Claude Code CLI** as the worker runtime (one subprocess per batch, with `--prompt-path` and `--stream-json`), so each worker inherits Claude Code's tool sandbox (Read\u002FWrite\u002FGrep\u002FGlob, MCP servers when enabled).\n- **Resumes from `outputs\u002F*_PARTIAL_*.json`** so a 10-implementation RQ1 run that's interrupted at hour 4 picks up exactly where it left off without re-spending tokens.\n- **Enforces a per-phase budget** at the runner level (`BudgetExceeded` is raised, not logged) so a runaway prompt cannot burn the whole RQ1 budget on a single target.\n- **Validates leniently** — Pydantic schema mismatches generate warnings, not aborts; partial results are first-class and never blocked on validation failures.\n- **Shares one circuit breaker per phase** across all workers, so systemic issues (bad prompt, API outage, schema drift) trigger a fast abort instead of N parallel failures.\n\nIn other words: the harness handles the messy parts of running a 100-target audit at scale, leaving 
the per-phase prompts to focus on auditing.\n\n```\nscripts\u002F\n├── run_phase.py            # Entry point\n├── setup_mcp.sh            # MCP server registration\n└── orchestrator\u002F\n    ├── config.py            # Phase definitions (PhaseConfig)\n    ├── base.py              # BaseOrchestrator (async pipeline)\n    ├── runner.py            # ClaudeRunner + CircuitBreaker\n    ├── batch.py             # Token\u002Fcount-based batching\n    ├── queue.py             # Queue splitting & state\n    ├── collector.py         # Result parsing & aggregation\n    ├── resume.py            # Resume & cleanup manager\n    ├── watchdog.py          # LogWatcher + CostTracker\n    ├── schemas.py           # Pydantic data contracts\n    └── factory.py           # create_orchestrator()\n```\n\n> **Phase ID note.** The paper uses Phase 1–6 labels; the codebase uses the legacy IDs `01a → 01b → 01e → 02c → 03 → 04` (a one-to-one mapping). Phases 5–6 of the paper correspond to legacy `03` (Audit Map) and `04` (Audit Review). The remainder of this README uses the legacy IDs to match the file layout.\n\n## Phases\n\n### Phase 01a: Specification Discovery\n\n| | |\n|---|---|\n| **Prompt** | `prompts\u002F01a_crawl.md` |\n| **Skill** | `\u002Fspec-discovery` |\n| **Input** | Seed URLs (via `SPEC_URLS` env var) |\n| **Output** | `outputs\u002F01a_STATE.json` |\n\nCrawls seed URLs to discover all relevant technical specification documents. 
Uses the `mcp__fetch__fetch` tool to recursively follow links and build a catalog of specification pages.\n\n\u003Cdetails>\n\u003Csummary>Output example (\u003Ccode>outputs\u002F01a_STATE.json\u003C\u002Fcode>, from \u003Ccode>ethereum-fusaka-20260220\u003C\u002Fcode>)\u003C\u002Fsummary>\n\n```json\n{\n  \"start_url\": \"https:\u002F\u002Fgithub.com\u002Fethereum\u002FEIPs\u002Fblob\u002Fmaster\u002FEIPS\u002Feip-7594.md\",\n  \"found_specs\": [\n    {\n      \"url\": \"https:\u002F\u002Fgithub.com\u002Fethereum\u002FEIPs\u002Fblob\u002Fmaster\u002FEIPS\u002Feip-7594.md\",\n      \"title\": \"EIP-7594: PeerDAS - Peer Data Availability Sampling\",\n      \"category\": \"EIP\",\n      \"type\": \"Standards Track \u002F Core\",\n      \"status\": \"Final\",\n      \"layer\": \"consensus+networking\",\n      \"description\": \"Introducing simple DAS utilizing gossip distribution and peer requests...\"\n    },\n    {\n      \"url\": \"https:\u002F\u002Fgithub.com\u002Fethereum\u002FEIPs\u002Fblob\u002Fmaster\u002FEIPS\u002Feip-7823.md\",\n      \"title\": \"EIP-7823: Set Upper Bounds for MODEXP\",\n      \"category\": \"EIP\",\n      \"type\": \"Standards Track \u002F Core\",\n      \"status\": \"Final\",\n      \"layer\": \"execution\",\n      \"description\": \"Restricts each MODEXP precompile input field to a maximum of 8192 bits...\"\n    }\n  ],\n  \"metadata\": {\n    \"timestamp\": \"2026-02-05T12:00:00Z\",\n    \"keywords\": [\"ethereum\", \"fusaka\", \"fulu\", \"osaka\", \"...\"],\n    \"total_specs\": 28,\n    \"breakdown\": { \"eips\": 11, \"consensus_specs\": 7, \"execution_specs\": 9 }\n  }\n}\n```\n\u003C\u002Fdetails>\n\n### Phase 01b: Subgraph Extraction\n\n| | |\n|---|---|\n| **Prompt** | `prompts\u002F01b_extract_worker.md` |\n| **Skill** | `\u002Fsubgraph-extractor` |\n| **Input** | `outputs\u002F01a_STATE.json` |\n| **Output** | `outputs\u002F01b_PARTIAL_*.json` + `outputs\u002Fgraphs\u002F*\u002F*.mmd` |\n\nExtracts formal **Program Graphs** 
(following Nielson & Nielson's definition) from each specification document. Each subgraph is output as an enriched Mermaid state diagram (`.mmd`) with YAML frontmatter and inline invariant annotations. PARTIAL JSON files reference the `.mmd` paths for downstream consumption.\n\n\u003Cdetails>\n\u003Csummary>Output example — PARTIAL JSON (\u003Ccode>outputs\u002F01b_PARTIAL_W0B1_*.json\u003C\u002Fcode>, from \u003Ccode>ethereum-fusaka-20260220\u003C\u002Fcode>)\u003C\u002Fsummary>\n\n```json\n{\n  \"specs\": [\n    {\n      \"source_url\": \"https:\u002F\u002Fgithub.com\u002Fethereum\u002FEIPs\u002Fblob\u002Fmaster\u002FEIPS\u002Feip-7951.md\",\n      \"title\": \"EIP-7951: Precompile for secp256r1 Curve Support\",\n      \"sub_graphs\": [\n        {\n          \"id\": \"SG-001\",\n          \"name\": \"p256verify_main\",\n          \"mermaid_file\": \"outputs\u002Fgraphs\u002FW0B1_1770278556\u002FEIP-7951\u002FSG-001_p256verify_main.mmd\"\n        },\n        {\n          \"id\": \"SG-002\",\n          \"name\": \"input_validation\",\n          \"mermaid_file\": \"outputs\u002Fgraphs\u002FW0B1_1770278556\u002FEIP-7951\u002FSG-002_input_validation.mmd\"\n        },\n        {\n          \"id\": \"SG-003\",\n          \"name\": \"signature_verification\",\n          \"mermaid_file\": \"outputs\u002Fgraphs\u002FW0B1_1770278556\u002FEIP-7951\u002FSG-003_signature_verification.mmd\"\n        }\n      ]\n    }\n  ],\n  \"metadata\": {\n    \"phase\": \"01b\",\n    \"worker_id\": 0,\n    \"batch_index\": 1,\n    \"item_count\": 2,\n    \"timestamp\": 1770278944,\n    \"processed_ids\": [\"https:\u002F\u002Fgithub.com\u002Fethereum\u002FEIPs\u002Fblob\u002Fmaster\u002FEIPS\u002Feip-7951.md\"]\n  }\n}\n```\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>Output example — enriched Mermaid file (\u003Ccode>.mmd\u003C\u002Fcode>)\u003C\u002Fsummary>\n\n```mermaid\n---\ntitle: \"p256verify_main (EIP-7951: Precompile for secp256r1 Curve Support)\"\n---\nstateDiagram-v2\n  
  direction TB\n    [*] --> q_gas: charge 6900 gas\n    q_gas --> q_decode: decode input(h, r, s, qx, qy)\n    q_decode --> q_validate: input_validation(h, r, s, qx, qy)\n    q_validate --> q_fail_validation: validation failed\n    q_validate --> q_verify: validation passed\n    q_fail_validation --> [*]: return empty\n    q_verify --> q_check_result: signature_verification(h, r, s, qx, qy)\n    q_check_result --> q_success: verified = true\n    q_check_result --> q_fail_verify: verified = false\n    q_success --> [*]: return 0x01 (32 bytes)\n    q_fail_verify --> [*]: return empty\n\n    note right of q_fail_verify\n        INV-001: Precompile MUST NOT revert under any circumstances\n        INV-002: Gas cost is always 6900 regardless of execution path\n        INV-003: Output is exactly 32 bytes on success or 0 bytes on failure\n    end note\n```\n\u003C\u002Fdetails>\n\n### Phase 01e: Property Generation\n\n| | |\n|---|---|\n| **Prompt** | `prompts\u002F01e_prop_worker.md` (inlined — no skill fork) |\n| **Input** | `outputs\u002F01b_PARTIAL_*.json` + `outputs\u002FBUG_BOUNTY_SCOPE.json` (required) |\n| **Output** | `outputs\u002F01e_PARTIAL_*.json` |\n\nPerforms inline trust model analysis and generates formal security properties from subgraphs. Combines former phases 01d (Trust Model) and property generation into a single inlined prompt. Key features:\n\n- **Domain-agnostic STRIDE + CWE Top 25**: General STRIDE thinking framework augmented with CWE Top 25 patterns (CWE-22\u002F78\u002F89\u002F94\u002F200\u002F502\u002F639\u002F770\u002F862). 
No domain-specific hardcoding.\n- **Reachability classification**: `external-reachable`, `internal-only`, `api-only`\n- **Bug bounty scope determination**: Uses `severity_classification` from `BUG_BOUNTY_SCOPE.json` as authoritative severity definitions\n- **Slim output**: `covers` is a string (primary element ID), `reachability` has 4 fields only (`classification`, `entry_points`, `attacker_controlled`, `bug_bounty_scope`)\n\nThe orchestrator **requires** `outputs\u002FBUG_BOUNTY_SCOPE.json` and aborts if the file is missing.\n\n\u003Cdetails>\n\u003Csummary>Output example (\u003Ccode>outputs\u002F01e_PARTIAL_W0B1_*.json\u003C\u002Fcode>, from \u003Ccode>ethereum-fusaka-20260220\u003C\u002Fcode>)\u003C\u002Fsummary>\n\n```json\n{\n  \"properties\": [\n    {\n      \"property_id\": \"PROP-56ad1eb2-inv-001\",\n      \"text\": \"P256VERIFY must accept valid secp256r1 signatures and reject all invalid ones deterministically.\",\n      \"type\": \"invariant\",\n      \"assertion\": \"forall (h,r,s,qx,qy): p256verify(h,r,s,qx,qy) == true iff ECDSA_verify(h,r,s,(qx,qy)) == true\",\n      \"severity\": \"CRITICAL\",\n      \"covers\": \"SG-003\",\n      \"reachability\": {\n        \"classification\": \"external-reachable\",\n        \"entry_points\": [\"Transaction\", \"P2P\"],\n        \"attacker_controlled\": true,\n        \"bug_bounty_scope\": \"in-scope\"\n      },\n      \"bug_bounty_eligible\": true,\n      \"exploitability\": \"external-attack\"\n    },\n    {\n      \"property_id\": \"PROP-56ad1eb2-pre-001\",\n      \"text\": \"Execution payload parent_hash must chain to state.latest_execution_payload_header.block_hash.\",\n      \"type\": \"pre-condition\",\n      \"assertion\": \"forall payload p: p.parent_hash == state.latest_execution_payload_header.block_hash\",\n      \"severity\": \"HIGH\",\n      \"covers\": \"SG-002\",\n      \"reachability\": {\n        \"classification\": \"external-reachable\",\n        \"entry_points\": [\"P2P\"],\n        
\"attacker_controlled\": true,\n        \"bug_bounty_scope\": \"in-scope\"\n      },\n      \"bug_bounty_eligible\": true,\n      \"exploitability\": \"external-attack\"\n    }\n  ],\n  \"metadata\": {\n    \"timestamp\": \"1771748647\",\n    \"total_properties\": 45,\n    \"by_severity\": { \"CRITICAL\": 9, \"HIGH\": 18, \"MEDIUM\": 16, \"INFORMATIONAL\": 2 },\n    \"by_scope\": { \"in_scope\": 35, \"out_of_scope\": 10 },\n    \"bug_bounty_eligible_count\": 30\n  }\n}\n```\n\u003C\u002Fdetails>\n\n### Phase 02c: Code Location Pre-resolution\n\n| | |\n|---|---|\n| **Prompt** | `prompts\u002F02c_codelocation_worker.md` (inlined — no skill fork) |\n| **Input** | `outputs\u002F01e_PARTIAL_*.json` + `outputs\u002FTARGET_INFO.json` + `outputs\u002F01b_SUBGRAPH_INDEX.json` |\n| **Output** | `outputs\u002F02c_PARTIAL_*.json` |\n| **Model** | Sonnet |\n\nPre-resolves code locations for each property against the target repository using Tree-sitter MCP (primary) with Glob\u002FGrep fallback. Records file paths, symbol names, and line ranges without extracting code. Applies severity gating (drops `Informational` properties by default). Builds `outputs\u002F01b_SUBGRAPH_INDEX.json` from 01b partials for spec-level context. 
Reads `outputs\u002FTARGET_INFO.json` (created by the 02c workflow before the phase runs).\n\nReduces token consumption in Phase 03 by ~40-60%.\n\n\u003Cdetails>\n\u003Csummary>Output example — resolved (\u003Ccode>outputs\u002F02c_PARTIAL_W0B1_*.json\u003C\u002Fcode>)\u003C\u002Fsummary>\n\n```json\n{\n  \"properties_with_code\": [\n    {\n      \"property_id\": \"PROP-56ad1eb2-inv-001\",\n      \"text\": \"P256VERIFY must accept valid secp256r1 signatures and reject all invalid ones deterministically.\",\n      \"type\": \"invariant\",\n      \"assertion\": \"forall (h,r,s,qx,qy): p256verify(h,r,s,qx,qy) == true iff ECDSA_verify(h,r,s,(qx,qy)) == true\",\n      \"severity\": \"CRITICAL\",\n      \"covers\": \"SG-003\",\n      \"reachability\": { \"classification\": \"external-reachable\", \"entry_points\": [\"Transaction\", \"P2P\"], \"attacker_controlled\": true, \"bug_bounty_scope\": \"in-scope\" },\n      \"exploitability\": \"external-attack\",\n      \"code_scope\": {\n        \"locations\": [\n          {\n            \"file\": \"core\u002Fvm\u002Fcontracts.go\",\n            \"symbol\": \"p256Verify.Run\",\n            \"line_range\": { \"start\": 1433, \"end\": 1449 },\n            \"role\": \"primary\"\n          },\n          {\n            \"file\": \"crypto\u002Fsecp256r1\u002Fverifier.go\",\n            \"symbol\": \"Verify\",\n            \"line_range\": { \"start\": 27, \"end\": 27 },\n            \"role\": \"callee\"\n          }\n        ],\n        \"resolution_status\": \"resolved\",\n        \"resolution_error\": \"\",\n        \"resolution_method\": \"grep_fallback\"\n      }\n    }\n  ]\n}\n```\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>Output example — out-of-scope \u002F not-found\u003C\u002Fsummary>\n\n```json\n{\n  \"property_id\": \"PROP-56ad1eb2-inv-004\",\n  \"text\": \"Blob commitment count in block must not exceed get_blob_parameters(epoch).max_blobs_per_block.\",\n  \"code_scope\": {\n    \"locations\": [],\n    
\"resolution_status\": \"out_of_scope\",\n    \"resolution_error\": \"Property references get_blob_parameters (consensus-layer function). Target is ethereum\u002Fgo-ethereum (execution client) with no consensus-layer logic.\"\n  }\n}\n```\n\u003C\u002Fdetails>\n\n### Phase 03: Audit Map (Formal Audit)\n\n| | |\n|---|---|\n| **Prompt** | `prompts\u002F03_auditmap_worker_inline.md` (inlined — no skill fork) |\n| **Input** | `outputs\u002F02c_PARTIAL_*.json` + Target codebase (auto-cloned from `TARGET_INFO.json`) |\n| **Output** | `outputs\u002F03_PARTIAL_*.json` |\n| **Model** | Sonnet |\n\nPerforms a proof-based 3-sub-phase formal audit for each property against the target codebase. **The core method: try to prove the property holds; where the proof breaks, that gap is the bug.** This framing was chosen over an adversarial *\"find bugs\"* prompt after preliminary experiments showed the adversarial approach produced an **88% false positive rate** — without a structured claim to disprove, the model produced numerous speculative findings with weak grounding.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002Fphase5.png\" alt=\"Phase 5 — Property-Grounded Audit (Map \u002F Prove \u002F Stress-Test)\" width=\"700\" \u002F>\n\u003C\u002Fp>\n\n1. **Sub-phase 1 (Map):** Decompose the property's assertion into verifiable sub-claims, read the enforcement code completely (full function bodies plus callers\u002Fcallees), and link each sub-claim to the code responsible for satisfying it.\n2. **Sub-phase 2 (Prove):** Verify input coverage, path coverage, concurrency safety, temporal validity, and implementation-pattern obligations (e.g., cache keys and deduplication keys computed from complete inputs); gaps are recorded as findings.\n3. 
**Sub-phase 3 (Stress-Test):** Challenge the conclusion — re-examine every assumption (if the proof succeeded) or attempt to construct a concrete attack path (if it failed); findings without a plausible attack path are downgraded to `potential-vulnerability`.\n\n> \"Proof attempt\" is precise terminology: this is **LLM-driven evidence construction with structured reasoning steps, not formal verification**. The structure is what makes both detections and failures analyzable.\n\nCompact 6-field output per item: `property_id`, `classification`, `code_path`, `proof_trace`, `attack_scenario`, `checklist_id`.\n\n\u003Cdetails>\n\u003Csummary>Output example — vulnerability found (Sherlock #190: Prysm inclusion proof cache poisoning)\u003C\u002Fsummary>\n\n```json\n{\n  \"audit_items\": [\n    {\n      \"property_id\": \"PROP-6a4369e9-inv-042\",\n      \"classification\": \"vulnerability\",\n      \"code_path\": \"beacon-chain\u002Fverification\u002Fdata_column.go::inclusionProofKey::L527-547\",\n      \"proof_trace\": \"The cache key omits KzgCommitments (the data being proven), including only the inclusion proof and header hash. Two data columns with identical proofs\u002Fheaders but different commitments produce the same cache key, causing the second to skip verification and reuse the first's cached result.\",\n      \"attack_scenario\": \"Attacker sends valid DataColumnSidecar A, then sends forged DataColumnSidecar M with same inclusion proof and header but malicious KzgCommitments. 
Cache lookup succeeds on M's key, bypassing full Merkle verification and accepting invalid commitments.\",\n      \"checklist_id\": \"PROP-6a4369e9-inv-042\"\n    }\n  ],\n  \"metadata\": {\n    \"phase\": \"03\",\n    \"worker_id\": 0,\n    \"batch_index\": 81,\n    \"item_count\": 1,\n    \"timestamp\": 1771777036,\n    \"processed_ids\": [\"PROP-6a4369e9-inv-042\"]\n  }\n}\n```\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>Output example — not-a-vulnerability (proof succeeded)\u003C\u002Fsummary>\n\n```json\n{\n  \"audit_items\": [\n    {\n      \"property_id\": \"PROP-6a4369e9-inv-047\",\n      \"classification\": \"not-a-vulnerability\",\n      \"code_path\": \"eip_7594\u002Fsrc\u002Flib.rs::get_custody_groups::L52\",\n      \"proof_trace\": \"The loop at L67 is guarded by validation at L52 (ensure! custody_group_count \u003C= number_of_custody_groups). All call paths use local custody_group_count (validator-computed or config-derived), not peer-reported values.\",\n      \"attack_scenario\": \"\",\n      \"checklist_id\": \"PROP-6a4369e9-inv-047\"\n    }\n  ]\n}\n```\n\u003C\u002Fdetails>\n\n### Phase 04: Audit Review\n\n| | |\n|---|---|\n| **Prompt** | `prompts\u002F04_review_worker.md` (inlined — no skill fork) |\n| **Input** | `outputs\u002F03_PARTIAL_*.json` + `outputs\u002FBUG_BOUNTY_SCOPE.json` + `outputs\u002FTARGET_INFO.json` |\n| **Output** | `outputs\u002F04_PARTIAL_*.json` |\n| **Model** | Sonnet |\n\nFilters false positives from Phase 03 findings via a recall-safe 3-gate pipeline with early exit. **Only these 3 gates may produce DISPUTED_FP** — no other reasoning may dispute a finding:\n\n1. **Gate 1 (Dead Code):** Grep for callers — zero non-test callers → DISPUTED_FP. Public\u002Fexported API exception: passes gate regardless of internal caller count. Skipped for \"missing validation\" findings.\n2. 
**Gate 2 (Trust Boundary):** Look up the attack path's data source in `trust_assumptions` from BUG_BOUNTY_SCOPE.json — if trust level is TRUSTED\u002FSEMI_TRUSTED and no untrusted path also reaches the code → DISPUTED_FP. No code analysis; purely a lookup.\n3. **Gate 3 (Scope Check):** Check `out_of_scope`, `conditional_scope`, and `in_scope.scope_restriction` in BUG_BOUNTY_SCOPE.json — finding falls under an excluded category → DISPUTED_FP.\n\nItems that pass all gates undergo severity calibration against `severity_classification` thresholds (with optional network-share-based severity cap from `deployment_context.client_diversity`). Non-findings (not-a-vulnerability, out-of-scope, informational) early-exit as PASS_THROUGH. Verdicts: CONFIRMED_VULNERABILITY, CONFIRMED_POTENTIAL, DISPUTED_FP, DOWNGRADED, NEEDS_MANUAL_REVIEW, PASS_THROUGH.\n\n\u003Cdetails>\n\u003Csummary>Output example — CONFIRMED_VULNERABILITY\u003C\u002Fsummary>\n\n```json\n{\n  \"reviewed_items\": [\n    {\n      \"property_id\": \"PROP-6a4369e9-pre-009\",\n      \"review_verdict\": \"CONFIRMED_VULNERABILITY\",\n      \"original_classification\": \"vulnerability\",\n      \"adjusted_severity\": \"Medium\",\n      \"reviewer_notes\": \"Spec requires: 'data_column_sidecars_by_root must reject requests exceeding MAX_REQUEST_DATA_COLUMN_SIDECARS'. Code reading verified: codec.rs:562-570 validates number of identifiers \u003C=128, each identifier can have \u003C=128 columns, enabling 128x128=16384 total columns. Handler rpc_methods.rs:408-460 lacks total column validation. 
Severity calibrated to Medium per BUG_BOUNTY_SCOPE.json: client market share \u003C5%.\",\n      \"spec_reference\": \"01e property PROP-6a4369e9-pre-009: 'data_column_sidecars_by_root must reject requests exceeding MAX_REQUEST_DATA_COLUMN_SIDECARS'\"\n    }\n  ],\n  \"metadata\": { \"phase\": \"04\", \"worker_id\": 1, \"batch_index\": 2, \"item_count\": 1, \"timestamp\": 1771818928, \"processed_ids\": [\"PROP-6a4369e9-pre-009\"] }\n}\n```\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>Output example — DISPUTED_FP (Gate triggered)\u003C\u002Fsummary>\n\n```json\n{\n  \"reviewed_items\": [\n    {\n      \"property_id\": \"PROP-6a4369e9-inv-010\",\n      \"review_verdict\": \"DISPUTED_FP\",\n      \"original_classification\": \"vulnerability\",\n      \"adjusted_severity\": \"Informational\",\n      \"reviewer_notes\": \"Phase 03 misunderstood the validation architecture. The array length validation DOES exist and IS enforced on all paths (gossip, RPC, and database loads). The claim of 'out-of-bounds panic' is false — the length check at kzg_utils.rs:84-89 prevents any indexing operation.\",\n      \"spec_reference\": \"01e property: 'Column, kzg_commitments, and kzg_proofs arrays must all have equal length.' Code enforces this on all paths via kzg_utils.rs:84-89.\"\n    }\n  ]\n}\n```\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>Output example — DOWNGRADED (severity cap)\u003C\u002Fsummary>\n\n```json\n{\n  \"reviewed_items\": [\n    {\n      \"property_id\": \"PROP-57888860-inv-006\",\n      \"review_verdict\": \"CONFIRMED_POTENTIAL\",\n      \"original_classification\": \"vulnerability\",\n      \"adjusted_severity\": \"Low\",\n      \"reviewer_notes\": \"Code reading verified: reconstruction.go:79 iterates Go map (sidecarByIndex) which has randomized iteration order, building cellsIndices without sorting before passing to RecoverCellsAndKZGProofs (line 86). Spec SG-024 explicitly requires 'assert cell_indices == sorted(cell_indices)'. 
Downgraded from Medium to Low: single-client bug affecting Prysm (31% CL share), below the 33% threshold for Medium severity.\",\n      \"spec_reference\": \"Fulu Polynomial Commitments Sampling SG-024: INV requires cell indices unique and in ascending order\"\n    }\n  ]\n}\n```\n\u003C\u002Fdetails>\n\n### Phase 05: PoC Generation (Manual)\n\n| | |\n|---|---|\n| **Prompt** | `prompts\u002F05_poc.md` |\n| **Usage** | `\u002F05_poc TYPE=unit VULN_ID=... OUTPUT_PATH=...` |\n\nGenerates minimal, self-verifying Proof-of-Concept tests in the project's native stack (auto-detected language and test framework). Supports unit \u002F integration \u002F e2e scopes. Includes a self-repair loop (up to 4 attempts) and false-positive mitigation via guard assertions.\n\n### Phase 06: Bug-Bounty Report (Manual)\n\n| | |\n|---|---|\n| **Prompt** | `prompts\u002F06_report.md` |\n| **Usage** | `\u002F06_report VULN_ID=... REPORT_TYPE=ETHEREUM` |\n\nGenerates a platform-tailored Markdown bug-bounty report (CANTINA, CODE4RENA, ETHEREUM, IMMUNEFI, SHERLOCK). Fills template placeholders with sanitized data, embeds PoC code with run commands, and derives severity from bounty guidelines when not specified.\n\n### Phase 06b: Full Audit Report (Manual)\n\n| | |\n|---|---|\n| **Prompt** | `prompts\u002F06b_audit_report.md` |\n| **Usage** | `\u002F07_audit_report OUTPUT_PATH=outputs\u002FAUDIT_REPORT.md` |\n\nCompiles a publication-ready security assessment report covering all findings. Includes: Cover Page, Executive Summary, Scope, System Overview, Methodology, Specification Traceability, Finding Classification, Findings Summary, Detailed Findings, Re-Verification, Operational Recommendations, and Appendix. All internal IDs are sanitized to sequential labels (e.g., Finding-01, Gap-02).\n\n## Running on GitHub Actions\n\nAll pipeline phases are executed via **GitHub Actions workflows** with `workflow_dispatch` triggers:\n\n| Workflow | File | Description |\n|---|---|---|\n| 01a. 
Discovery | `01a-discovery.yml` | Crawl specification URLs |\n| 01b. Subgraph Extraction | `01b-subgraph.yml` | Extract program graphs |\n| 01e. Properties | `01e-properties.yml` | Trust model + property generation |\n| 02c. Code Resolution | `02c-enrich-code.yml` | Pre-resolve code locations |\n| 03. Audit Map | `03-audit-map.yml` | Proof-based 3-phase formal audit |\n| 04. Audit Review | `04-audit-review.yml` | 3-gate FP filter + severity calibration |\n\nEach workflow:\n1. Checks out the repository and syncs the latest `scripts\u002F`, `prompts\u002F`, `.claude\u002F` from the base branch.\n2. Installs Claude Code CLI and registers MCP servers via `scripts\u002Fsetup_mcp.sh`.\n3. Runs the orchestrator: `uv run python3 scripts\u002Frun_phase.py --phase \u003CID> --workers N`.\n4. Commits results to an audit branch and uploads logs as artifacts.\n\nFor local execution, see [Quick Start](#quick-start) above.\n\n### MCP Servers\n\nThe following MCP servers are registered by `scripts\u002Fsetup_mcp.sh`:\n\n| Server | Command | Used In |\n|---|---|---|\n| `tree_sitter` | `uvx mcp-server-tree-sitter` | 02c |\n| `filesystem` | `npx -y @modelcontextprotocol\u002Fserver-filesystem` | 01b, 02c |\n| `fetch` | `uvx mcp-server-fetch` | 01a |\n\nNote: Phases 01e, 03, and 04 use inlined prompts with no MCP servers (only built-in Read\u002FWrite\u002FGrep\u002FGlob tools).\n\n## Configuration\n\nSPECA expects two JSON files in `outputs\u002F` before running the audit phases:\n\n### `outputs\u002FBUG_BOUNTY_SCOPE.json` — *required by Phase 01e and Phase 04*\n\nDefines the trust model and severity rubric for the target. Phase 01e aborts (`sys.exit(1)`) if it is missing. 
Minimal shape:\n\n```json\n{\n  \"in_scope\":   { \"components\": [\"...\"], \"scope_restriction\": \"...\" },\n  \"out_of_scope\": [\"...\"],\n  \"conditional_scope\": [\"...\"],\n  \"trust_assumptions\": {\n    \"p2p_input\":      { \"trust_level\": \"UNTRUSTED\",   \"rationale\": \"...\" },\n    \"consensus_state\":{ \"trust_level\": \"TRUSTED\",     \"rationale\": \"...\" },\n    \"rpc_input\":      { \"trust_level\": \"SEMI_TRUSTED\",\"rationale\": \"...\" }\n  },\n  \"severity_classification\": {\n    \"CRITICAL\": \"Loss of funds \u002F consensus split \u002F mass DoS\",\n    \"HIGH\":     \"...\",\n    \"MEDIUM\":   \"...\",\n    \"LOW\":      \"...\"\n  },\n  \"deployment_context\": {\n    \"type\": \"multi-implementation\",\n    \"target_share\": { \"value\": 0.31, \"metric\": \"validator-share\" }\n  }\n}\n```\n\n`deployment_context.target_share.value` ∈ [0, 1] is used by Phase 04 as an optional severity cap (e.g. a single-client bug below a 33% network-share threshold gets downgraded).\n\n### `outputs\u002FTARGET_INFO.json` — *required by Phase 02c \u002F 03 \u002F 04*\n\nPins the target repository and commit. 
Phase 03 will `git clone` to this exact ref:\n\n```json\n{\n  \"name\":   \"go-ethereum\",\n  \"repo\":   \"https:\u002F\u002Fgithub.com\u002Fethereum\u002Fgo-ethereum\",\n  \"commit\": \"abc1234deadbeef...\",\n  \"language\": \"go\"\n}\n```\n\n### Environment Variables\n\n| Variable | Used By | Purpose |\n|---|---|---|\n| `ANTHROPIC_API_KEY` | All phases | Claude Code authentication |\n| `SPEC_URLS` | 01a | Comma-separated seed URLs to crawl |\n| `KEYWORDS` | 01a | Optional crawl keyword filter |\n| `FORCE_EXECUTE=1` | All phases | Bypass resume state (set automatically by `--force`) |\n| `CLAUDE_CODE_PERMISSIONS=bypassPermissions` | CI | Skip interactive permission prompts |\n| `CLAUDE_CODE_MAX_OUTPUT_TOKENS=100000` | CI | Raise output cap for long audit traces |\n| `GITHUB_PERSONAL_ACCESS_TOKEN` | Optional | Used by GitHub MCP server when enabled |\n\n## Evaluation\n\nSPECA is evaluated on two complementary benchmarks. **RQ1** measures effectiveness on a large multi-implementation security contest with 366 professional auditors; **RQ2** compares SPECA against published code-driven baselines on an established C\u002FC++ benchmark.\n\n> All numbers below are taken verbatim from the paper ([arXiv:2604.26495](https:\u002F\u002Farxiv.org\u002Fabs\u002F2604.26495)). Charts are reproducible via the scripts under [`benchmarks\u002F`](.\u002Fbenchmarks\u002FREADME.md); raw artifacts (logs, per-finding labels, model outputs) ship with this repository.\n\n### RQ1 — Sherlock Ethereum Fusaka Audit Contest\n\n**Benchmark.** 10 production Ethereum client implementations of EIP-7594 (PeerDAS) and EIP-7691, spanning **5 programming languages** (Go, Rust, Nim, TypeScript, C). 
366 submissions from professional auditors; 15 judged valid at H\u002FM\u002FL severity (5 High, 2 Medium, 8 Low).\n\n**Headline detection numbers (post-Phase 6, N=72):**\n\n| Metric | Value |\n|---|---|\n| Phase 5 findings (pre-review) | 102 |\n| Phase 6 findings (post-review) | 72 |\n| **H\u002FM\u002FL recovered (expert-augmented)** | **15 \u002F 15 (100%)** |\n| H\u002FM\u002FL recovered (automated-only) | 8 \u002F 15 (53%) |\n| **Novel bugs confirmed by fix commits** | **4** |\n| Confirmed FPs (post-review) | 24 (33.3%) |\n| Strict precision (H\u002FM\u002FL match) | 26.4% (19\u002F72) |\n| Confirmed-useful precision | 59.7% (43\u002F72) |\n| Broad precision (non-FP rate) | 66.7% (48\u002F72) |\n\n**Phase 6 lifts precision while preserving recall.** The severity-preserving review filter raises broad precision from **56.9% → 66.7%** while preserving 100% recall on H\u002FM\u002FL true positives, raising **F1 from 72.5% → 80.0%**:\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"benchmarks\u002Fresults\u002Frq1\u002Fsherlock_ethereum_audit_contest\u002Fchart_phase_comparison.png\" alt=\"Phase 5 vs Phase 6\" width=\"600\" \u002F>\n\u003C\u002Fp>\n\n**Property neighborhoods drive recall.** Many issues are recovered not by a single alert but by multiple complementary properties. Cluster-level strict precision (grouping all findings against the same issue into one cluster) is **48.7%** (vs. 
26.4% finding-level), confirming genuine redundancy rather than alert duplication.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"benchmarks\u002Fresults\u002Frq1\u002Fsherlock_ethereum_audit_contest\u002Fchart_findings_per_issue.png\" alt=\"Findings per issue\" width=\"600\" \u002F>\n\u003C\u002Fp>\n\nThe Sankey diagram below makes the same neighborhood structure visible from the property side: the horizontal density around a handful of issues shows which property families converge on the same root cause.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"benchmarks\u002Fresults\u002Frq1\u002Fsherlock_ethereum_audit_contest\u002Fchart_sankey_flow.png\" alt=\"Property family → ground-truth issue Sankey flow\" width=\"700\" \u002F>\n\u003C\u002Fp>\n\n**Per-repository finding distribution** across the 10 implementations (Phase 5 vs Phase 6):\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"benchmarks\u002Fresults\u002Frq1\u002Fsherlock_ethereum_audit_contest\u002Fchart_per_repo.png\" alt=\"Per-repository findings\" width=\"700\" \u002F>\n\u003C\u002Fp>\n\n**4 novel bugs absent from all 366 contest submissions, confirmed by developer fix commits:**\n\n| # | Target | Bug | Fix |\n|---|---|---|---|\n| A | **c-kzg-4844** | KZG batch-verification challenge hash uses the original commitment array rather than the deduplicated array — selective forgery against batch verify | `f18ba082` |\n| B | **Lodestar** | Inverted logic + missing validation on column sidecar cache | `3b98c59c` |\n| C | **Nimbus** | Unchecked array access reachable from RPC and Engine API | `b3a3f3f9` |\n| D | **Prysm** | Wrong subnet parameter computation + missing cell-count validation | `b5bdd65f` |\n\n> **Bug A** is particularly significant: a cryptographic correctness bug in a core library used by multiple Ethereum clients. 
It is a violation of an invariant that the specification defines but no code-level auditing tool was designed to check.\n\n**Recovered Sherlock H\u002FM\u002FL issues** (representative subset):\n\n| Severity | Target | Issue | Sherlock # |\n|---|---|---|---|\n| HIGH | Prysm | Inclusion proof cache key omits `KzgCommitments` → cache poisoning bypasses Merkle verification | #190 |\n| HIGH | Nethermind | Mismatched loop bounds between `BlobVersionedHashes` and `wrapper.Blobs` → extra hashes bypass commitment validation | #210 |\n| HIGH | c-kzg-4844 | Fiat-Shamir challenge hash uses original array instead of deduplicated commitments → selective forgery | #203 |\n| HIGH | Lighthouse | `get_beacon_proposer_indices` recomputes from active validators instead of reading `proposer_lookahead` → consensus split | #40 |\n| MEDIUM | Nimbus | `handle_custody_groups` loop terminates only when `HashSet.size == custody_group_count` → infinite-loop DoS via P2P metadata | #15 |\n| MEDIUM | Nimbus | 30-minute metadata refresh timer with no fork-aware acceleration → stale `custody_group_count=0` blocks data-column sync | #216 |\n| LOW | Grandine | `verify_kzg_proofs` returns `Ok(false)` but boolean is discarded by `.map_err()?` → invalid KZG proofs accepted | #376 |\n| LOW | Grandine | `get_blob_schedule_entry` assumes descending order but named-network constructors define ascending → wrong epoch match causes chain split | #319 |\n| LOW | Lodestar | Cache key `(blockRootHex, index)` excludes signature → attacker rebroadcasts invalid-signature sidecars via cache hit | #381 |\n| LOW | Reth\u002Falloy-evm | `next_block_excess_blob_gas_osaka()` receives child's base fee instead of parent's → invalid block proposals | #371 |\n\n**Phase 6 three-gate filter effectiveness** (N=30 `DISPUTED_FP`):\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"benchmarks\u002Fresults\u002Frq1\u002Fsherlock_ethereum_audit_contest\u002Fchart_gate_effectiveness.png\" alt=\"Gate effectiveness\" width=\"600\" 
\u002F>\n\u003C\u002Fp>\n\nThe current 3-gate design was **simplified from a 5-gate prototype** after empirical analysis showed two of the original gates (Spec Cross-Reference and Exploitability) filtered informational true positives at **0% precision**, providing no net benefit.\n\n#### Structured False-Positive Analysis\n\nA defining capability of SPECA: every false positive decomposes into a **traceable root cause**. Of 16 deeply analyzed FPs (drawn from a population of 44 total):\n\n| Root Cause | Phase Origin | N | % |\n|---|---|---|---|\n| Trust boundary misunderstanding | Phase 3 (Property Generation) | 8 | 50.0% |\n| Code reading error | Phase 5 (Audit) | 6 | 37.5% |\n| Specification misinterpretation | Phase 3 | 2 | 12.5% |\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"benchmarks\u002Fresults\u002Frq1\u002Fsherlock_ethereum_audit_contest\u002Fchart_fp_taxonomy.png\" alt=\"FP taxonomy\" width=\"600\" \u002F>\n\u003C\u002Fp>\n\nEach root cause maps to a **concrete, implementable improvement target**: explicit trust-boundary configuration, richer code-reading context, and enforced spec-section re-reading before classification. This is the property-centered representation's payoff: failures are diagnosable.\n\n#### Property-Type Ablation\n\nWhich parts of the property vocabulary actually drive detection?\n\n| Property Type | N | TP | FP | Precision |\n|---|---|---|---|---|\n| Invariant | 67 | 18 | 6 | **75.0%** |\n| Precondition | 11 | 4 | 0 | **100.0%** |\n| Postcondition | 5 | 1 | 1 | 50.0% |\n| Assumption | 5 | 0 | 1 | **0.0%** |\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"benchmarks\u002Fresults\u002Frq1\u002Fsherlock_ethereum_audit_contest\u002Fchart_property_type_ablation.png\" alt=\"Property type ablation\" width=\"600\" \u002F>\n\u003C\u002Fp>\n\nInvariants account for 76% of findings at 75% precision and dominate detection today. Assumption-type properties are too noisy for reliable auditing and are best treated as an exploratory mode. 
**Postcondition and assumption generation are the concrete research frontiers** for the automated-only configuration.\n\n#### Automated-Only vs. Expert-Augmented\n\n| Configuration | Properties | H\u002FM\u002FL | Coverage |\n|---|---|---|---|\n| Automated-only | Auto-generated (Phases 1–3) | 8 \u002F 15 | 53% |\n| **Expert-augmented** | Auto + 7 manual properties | **15 \u002F 15** | **100%** |\n\nThe 7 manual properties cluster in two domain-specific areas — **cryptographic invariants** (KZG polynomial commitment edge cases, BLS12-381 identity element handling) and **protocol-lifecycle rules** (custody group bounds, cache key completeness, fork-transition metadata refresh) — that require mathematical domain knowledge or multi-specification cross-referencing not yet reliably automated. They are authored once per spec corpus and **reused across all 10 implementations**, so expert-knowledge injection has high amortized leverage in multi-implementation settings.\n\n### RQ2 — RepoAudit C\u002FC++ Benchmark\n\n**Benchmark.** 15 open-source C\u002FC++ projects with 35 non-disputed ground-truth bugs (null-pointer dereferences, memory leaks, use-after-free) confirmed by developer fixes, plus 5 disputed bugs. Comparison: published RepoAudit baselines (4 model configurations) plus Meta Infer and Amazon CodeGuru.\n\n| Method | TP | FP | Precision | New cand. 
| Cost |\n|---|---|---|---|---|---|\n| _Partially controlled (DeepSeek R1)_ | | | | | |\n| RepoAudit (DeepSeek R1) | 41 | 6 | 87.2% | (in TP) | $8.55 |\n| **SPECA (DeepSeek R1)** | — | 15 | 72.7% | **7** | $93.51 |\n| _Latest models_ | | | | | |\n| RepoAudit (Claude 3.7 Sonnet) | 40 | 5 | 88.9% | (in TP) | $23.85 |\n| **SPECA (Sonnet 4.5)** | — | 6 | **88.9%** | **12** | $81.05 |\n| _Other configurations_ | | | | | |\n| Amazon CodeGuru | 0 | 18 | 0.0% | 0 | — |\n| Meta Infer | 7 | 2 | 77.8% | 0 | free |\n| RepoAudit (o3-mini) | 36 | 9 | 80.0% | (in TP) | $4.50 |\n| RepoAudit (Claude 3.5 Sonnet) | 40 | 11 | 78.4% | (in TP) | $38.10 |\n| **SPECA (Sonnet 4)** | — | 13 | 81.2% | **18** | $100.68 |\n\n> **New cand.** = author-validated candidate bugs *beyond* the established ground truth. Recall is not reported because the GT was constructed from RepoAudit's own discoveries (structurally unfair to compare).\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"benchmarks\u002Fresults\u002Frq2a\u002Ffigures\u002Frq2a_precision_comparison.png\" alt=\"Precision comparison\" width=\"600\" \u002F>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"benchmarks\u002Fresults\u002Frq2a\u002Ffigures\u002Frq2a_tp_fp_comparison.png\" alt=\"TP vs FP\" width=\"600\" \u002F>\n\u003C\u002Fp>\n\n**SPECA (Sonnet 4.5) matches the best published baseline precision (88.9%)** while uniquely surfacing 12 author-validated beyond-GT candidates. Two of those candidates are externally validated:\n\n- **`PROP-N3-npd-001` (coturn, NPD)** — confirmed at **Level A** (bug existed in the analyzed commit, independently fixed in a later release; PR #1841 self-withdrawn after discovering the fix).\n- **`PROP-U5-uaf-002` (ICU\u002Fi18n, UAF race condition)** — confirmed at **Level B** (ICU maintainer approved the corresponding Jira ticket; PR #3921).\n\n**Cost vs. 
detection performance:**\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"benchmarks\u002Fresults\u002Frq2a\u002Ffigures\u002Frq2a_cost_efficiency.png\" alt=\"Cost vs precision\" width=\"600\" \u002F>\n\u003C\u002Fp>\n\nAt the Sonnet 4.5 configuration, SPECA ties for the highest precision in the table (88.9%) while uniquely reporting double-digit beyond-GT candidates at a per-bug cost (~**$1.69\u002Fbug**) competitive with the best published baseline.\n\n**Symmetric cross-backbone comparison** (same-backbone DeepSeek R1 left, latest-models right):\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"benchmarks\u002Fresults\u002Frq2a\u002Ffigures\u002Frq2a_symmetric_comparison.png\" alt=\"Symmetric comparison\" width=\"700\" \u002F>\n\u003C\u002Fp>\n\n#### The Property Adherence Effect\n\nAn instructive non-monotonic pattern: Sonnet 4 discovers **18** beyond-GT candidates, Sonnet 4.5 discovers **12**, DeepSeek R1 discovers **7**. This is *not* a simple precision–discovery tradeoff — it reflects **increasing property adherence**. More capable models audit more faithfully against the stated property, checking exactly what the specification-derived property asserts and no more. 
Less capable models drift from the property scope during the proof-attempt phase, producing some genuine bugs (beyond-GT) and some false positives.\n\n> Engineering implication: **as models improve, property generation (Phases 1–3) becomes the binding constraint on detection coverage.** The model audits precisely what the properties tell it to audit; comprehensive property derivation is the primary lever for improving recall.\n\n### Cost & Throughput\n\n- **RQ1 (Sherlock):** ≈ $400–620 total API cost (10 implementations).\n- **RQ2 (RepoAudit, Sonnet 4.5):** $81.05 total = **$1.69 \u002F bug**.\n- Phases 1–3 use Claude **Opus** (specification understanding); Phases 4–6 use Claude **Sonnet** (code analysis & review).\n\n## Reproducing the Benchmarks\n\nAll evaluation scripts, per-repository outputs, and labeling artifacts ship with the repo:\n\n- [`benchmarks\u002Fresults\u002Frq1\u002Fsherlock_ethereum_audit_contest\u002F`](.\u002Fbenchmarks\u002Fresults\u002Frq1\u002Fsherlock_ethereum_audit_contest\u002F) — RQ1 raw outputs, labels, and chart-generation scripts.\n- [`benchmarks\u002Fresults\u002Frq2a\u002F`](.\u002Fbenchmarks\u002Fresults\u002Frq2a\u002F) — RQ2 RepoAudit outputs and figures.\n- [`benchmarks\u002FREADME.md`](.\u002Fbenchmarks\u002FREADME.md) — full reproduction instructions.\n\n## Contributing\n\nWe welcome issues and pull requests from the community.\n\n- **Bugs \u002F feature requests:** open a [GitHub issue](https:\u002F\u002Fgithub.com\u002FNyxFoundation\u002Fspeca\u002Fissues) with a minimal reproducer or a concrete use-case.\n- **Pull requests:**\n  1. Fork the repo and create a topic branch off `main`.\n  2. Run the test suite: `uv run python3 -m pytest tests\u002F -v --tb=short`.\n  3. Keep changes focused — pipeline phases are deliberately decoupled, so a PR should usually touch one phase at a time.\n  4. Open the PR with a brief description of *what* changed and *why*. 
If the change affects an inter-phase data contract, update `scripts\u002Forchestrator\u002Fschemas.py` and the relevant prompt under `prompts\u002F` together.\n- **New target domains:** SPECA is domain-agnostic by design. To onboard a new target, you typically only need to write a `BUG_BOUNTY_SCOPE.json` and a `TARGET_INFO.json` — no code change required.\n\n## Citation\n\nIf you use SPECA in academic work, please cite the accompanying paper:\n\n```bibtex\n@misc{kamba2026speca,\n  title         = {Beyond Code Reasoning: A Specification-Anchored Audit Framework for Expert-Augmented Security Verification},\n  author        = {Kamba, Masato and Murakami, Hirotake and Sannai, Akiyoshi},\n  year          = {2026},\n  eprint        = {2604.26495},\n  archivePrefix = {arXiv},\n  primaryClass  = {cs.CR},\n  url           = {https:\u002F\u002Farxiv.org\u002Fabs\u002F2604.26495}\n}\n```\n\n## License\n\nSPECA is released under the [MIT License](LICENSE). See the `LICENSE` file for full terms.\n\n> **Disclaimer.** SPECA is a research artifact. Findings produced by the pipeline are *candidate* vulnerabilities and **must** be validated by a human auditor before being reported to a vendor or bug-bounty program. The maintainers make no warranty as to the completeness or correctness of any audit produced by this software.\n","SPECA is a security audit framework that turns natural-language specifications into checklists. It audits software by extracting explicit, typed security properties from natural-language specifications and running structured proof-attempt reasoning over implementations against those properties. The project is written in Python and supports Python 3.11 and later. Its core capabilities are specification-dependent detection, cross-implementation comparison, and false-positive analysis traceable to a root cause. This makes it particularly well suited to security verification of systems that must conform to a specification, such as protocol stacks, consensus implementations, and cryptographic libraries, especially when a vulnerability stems from the specification rather than from the code itself.",2,"2026-06-11 02:30:36","CREATED_QUERY"]