[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-79874":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":10,"languages":10,"totalLinesOfCode":10,"stars":11,"forks":12,"watchers":13,"openIssues":13,"contributorsCount":14,"subscribersCount":14,"size":14,"stars1d":15,"stars7d":16,"stars30d":17,"stars90d":14,"forks30d":14,"starsTrendScore":18,"compositeScore":19,"rankGlobal":10,"rankLanguage":10,"license":20,"archived":21,"fork":21,"defaultBranch":22,"hasWiki":23,"hasPages":21,"topics":24,"createdAt":10,"pushedAt":10,"updatedAt":25,"readmeContent":26,"aiSummary":27,"trendingCount":14,"starSnapshotCount":14,"syncStatus":15,"lastSyncTime":28,"discoverSource":29},79874,"VulnGym","Tencent\u002FVulnGym","Tencent","VulnGym: A Real-World, Project-Level Vulnerability Benchmark for White-Box Vulnerability-Hunting Agents","",null,158,8,1,0,2,17,37,6,2.86,"Other",false,"main",true,[],"2026-06-12 02:03:55","\u003Cp align=\"center\">\n  \u003Cimg src=\".\u002Fimg\u002Fwukong_logo.png\" alt=\"VulnGym\" height=\"60\">\n\u003C\u002Fp>\n\n\u003Ch4 align=\"center\">\n    \u003Cp>\n        \u003Ca href=\".\u002FREADME_zh.md\">中文\u003C\u002Fa> |\n        \u003Ca href=\"#\">English\u003C\u002Fa>\n    \u003C\u002Fp>\n\u003C\u002Fh4>\n\n\u003Cp align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FTencent\u002FVulnGym\u002Fstargazers\">\u003Cimg alt=\"GitHub Stars\" src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTencent\u002FVulnGym?color=gold\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FTencent\u002FVulnGym\u002Fnetwork\u002Fmembers\">\u003Cimg alt=\"GitHub Forks\" src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fforks\u002FTencent\u002FVulnGym?color=gold\">\u003C\u002Fa>\n  \u003Ca href=\".\u002FLICENSE\">\u003Cimg alt=\"License\" 
src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-CC--BY--4.0-blue.svg\">\u003C\u002Fa>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  \u003Cb>A Real-World, Project-Level Vulnerability Benchmark for White-Box Vulnerability-Hunting Agents\u003C\u002Fb>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FTencent\u002FVulnGym\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F⭐-Give VulnGym a Star-yellow?style=flat&logo=github\" alt=\"Give VulnGym a Star\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Ftencent\u002FVulnGym\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🤗%20HuggingFace-Dataset-yellow?style=flat\" alt=\"HuggingFace Dataset\">\u003C\u002Fa>\n\u003C\u002Fp>\n\n**VulnGym** is a project-level benchmark for white-box vulnerability-hunting agents, designed to evaluate an agent's vulnerability detection capabilities within **real-world engineering contexts**, with **verifiable vulnerability trigger paths and code-semantic evidence chains**.\n\n**Three core design principles:**\n- **🏗️ Real project-level evaluation units** — every sample is bound to a specific vulnerable commit of a real repository, evaluating an agent's ability to discover and locate vulnerabilities inside real multi-file, multi-module engineering projects.\n- **🧠 Comprehensive vulnerability-type coverage** — the benchmark covers both business-logic defects that demand cross-module code-semantic reasoning (e.g., authorization bypass, broken authentication) and traditional security flaws (e.g., injection, path traversal), providing a comprehensive assessment of an agent's ability to discover diverse vulnerability classes.\n- **✅ Verifiable vulnerability paths** — each sample ships with a human-reviewed **reachable entry point** (`entry_point`), **critical operation** (`critical_operation`), and **cross-module reasoning chain** (`trace`), enabling reproducible, 
explainable, and deterministic evaluation.\n\n---\n\n## 📢 What's New\n- **2026-05-17** — 🔧 v0.1.1 data refresh: added a `verify` field on every entry to mark human-audit status; **113 \u002F 408 entries** (covering **61 \u002F 184 advisories**) are now human-verified. Selected `entry_point` \u002F `critical_operation` \u002F `trace` values were also refined.\n- **2026-05-15** — 🎉 VulnGym v0.1.0 officially open-sourced!\n\n\n\n## Table of Contents\n\n- [🔍 Why VulnGym](#-why-vulngym)\n- [✨ Dataset overview](#-dataset-overview)\n- [📈 Baseline evaluation results](#-baseline-evaluation-results)\n- [📦 Repository layout](#-repository-layout)\n- [🚀 Quick start](#-quick-start)\n- [📊 Evaluating your tool](#-evaluating-your-tool)\n- [📖 Citation](#-citation)\n- [🤝 Contribution Guide](#-contribution-guide)\n- [🙏 Acknowledgements](#-acknowledgements)\n- [📄 License](#-license)\n\n---\n\n## 🔍 Why VulnGym\n\nExisting vulnerability benchmarks have the following limitations when\nevaluating the real-world vulnerability-hunting capabilities of AI agents:\n\n| Limitation | Manifestation |\n|---|---|\n| **Insufficient evaluation granularity** | Most benchmarks use functions or diff snippets as the evaluation unit, failing to reflect an agent's ability to locate vulnerabilities within complete engineering projects |\n| **Narrow vulnerability types** | Over-emphasis on pattern-matchable CWE flaws such as SQL injection and buffer overflow, with little coverage of categories requiring deep contextual reasoning |\n| **Coarse-grained ground truth** | Typically binary labels (vulnerable \u002F not vulnerable) or patch diffs, unable to precisely verify whether the agent locates the correct entry point and defect site |\n\n\n## ✨ Dataset overview\n\nThis is the **v0.1.1 release** of VulnGym. 
Data is provided\nas two JSONL files under the `data\u002F` directory:\n\n- `reports.jsonl` — aggregated records at the GitHub Advisory granularity\n- `entries.jsonl` — annotated records at the reachable entry point granularity\n\nEach record contains `repo_url` and `commit`, allowing you to check out the\nfull vulnerable source tree for the corresponding version.\n\n### Data scale\n\n| Metric | Value |\n|---|---|\n| Advisories (reports) | **184** |\n| Reachable entry points (entries) | **408** |\n| Distinct projects | 38 |\n| Distinct repositories | 23 |\n| Human-audited entries (`verify = 1`) | **113 \u002F 408 (27.7 %)** |\n| Human-audited advisories (≥ 1 verified entry) | **61 \u002F 184 (33.2 %)** |\n\n### Human audit status\n\nStarting in v0.1.1, every row in `entries.jsonl` carries a `verify` field\n(`int`, `0` or `1`):\n\n- `verify == 1` — the entry's `entry_point`, `critical_operation`, and\n  `trace` have been reviewed and confirmed by a human annotator. These\n  rows form a high-confidence ground-truth subset and are recommended\n  for strict, reproducible benchmarking.\n- `verify == 0` — automatically annotated; not yet human-confirmed.\n  Useful for scale and recall studies, but values may still be refined\n  in future releases.\n\nOf the **184** advisories, **50** have all of their entries verified and\n**11** are partially verified, for a total of **61** advisories with at\nleast one human-audited entry. Future releases will continue to expand\nthe verified subset.\n\n### Vulnerability type distribution\n\nEvery entry carries a two-level classification: `vuln_category_l1`\n(coarse type) and `vuln_category_l2` (fine-grained sub-type). **71.2 %** of\nadvisories are business-logic vulnerabilities, classified with a\n**12-class + 1 fallback** taxonomy (see below). The remaining 28.8 %\ncover traditional vulnerability types. 
Full data model and field\ndefinitions are in [`SCHEMA.md`](SCHEMA.md).\n\nThe initial release (v0.1.0) draws primarily from recent high-star open-source projects and focuses on frequently occurring business-logic vulnerabilities; future releases will continue expanding vulnerability categories and project coverage.\n\n> Note: one advisory may map to multiple entries — the counts below\n> are by **advisory (vulnerability)**, not by entry.\n\n**Business-logic advisories (131 \u002F 184, 71.2 %) — `vuln_category_l2` breakdown:**\n\n| Sub-category | Advisories | % of BL |\n|---|---|---|\n| BL-AUTHZ-BROKEN — broken authorization logic | 31 | 23.7 % |\n| BL-AUTHZ-MISSING — missing authorization | 23 | 17.6 % |\n| BL-AGENT-CAPABILITY — AI \u002F Agent capability boundary bypass | 20 | 15.3 % |\n| BL-PRIV-ESC — privilege escalation | 13 | 9.9 % |\n| BL-AUTH-BYPASS — authentication bypass | 11 | 8.4 % |\n\n\u003Cdetails>\n\u003Csummary>7 more sub-categories (33 advisories, 25.2 % of BL)\u003C\u002Fsummary>\n\n| Sub-category | Advisories | % of BL |\n|---|---|---|\n| BL-ORIGIN-INTEGRITY — origin \u002F signature \u002F integrity check missing | 8 | 6.1 % |\n| BL-WORKFLOW-VIOLATION — workflow \u002F state-machine violation | 7 | 5.3 % |\n| BL-INSECURE-DEFAULT — insecure default configuration | 6 | 4.6 % |\n| BL-RACE-LOGIC — business-layer race condition | 4 | 3.1 % |\n| BL-MULTI-TENANT — multi-tenant \u002F isolation failure | 3 | 2.3 % |\n| BL-MASS-ASSIGNMENT — mass assignment \u002F parameter pollution | 3 | 2.3 % |\n| BL-TRUST-BOUNDARY — implicit trust in internal input | 2 | 1.5 % |\n\n\u003C\u002Fdetails>\n\n\u003Cbr>\n\n**Traditional vulnerability advisories (53 \u002F 184, 28.8 %) — top `vuln_category_l1`:**\n\n| Category | Advisories | % of Trad. 
|\n|---|---|---|\n| Code Injection | 12 | 22.6 % |\n| Path Traversal \u002F File ops | 9 | 17.0 % |\n| Command Injection | 8 | 15.1 % |\n| XSS | 5 | 9.4 % |\n| Sandbox Escape | 5 | 9.4 % |\n\n\u003Cdetails>\n\u003Csummary>4 more categories (14 advisories, 26.4 % of Trad.)\u003C\u002Fsummary>\n\n| Category | Advisories | % of Trad. |\n|---|---|---|\n| SSRF | 4 | 7.5 % |\n| Authentication Bypass | 3 | 5.7 % |\n| Deserialization | 2 | 3.8 % |\n| Other (Template Injection, RCE, Supply Chain, etc.) | 5 | 9.4 % |\n\n\u003C\u002Fdetails>\n\n> Future releases will continue expanding vulnerability categories and project coverage.\n\n\n\n## 📈 Baseline evaluation results\n\n> 🚧 **Coming soon** — We are systematically evaluating mainstream tools and AI agents. Results will be published alongside the technical report.\n\n\n## 📦 Repository layout\n\n```\nVulnGym\u002F\n├── README.md                    # English version\n├── README_zh.md                 # 中文版\n├── SCHEMA.md                    # field reference & validation invariants\n├── CHANGELOG.md\n├── CITATION.cff\n├── LICENSE                      # CC-BY-4.0\n├── data\u002F\n│   ├── reports.jsonl            # 184 rows — one GitHub Advisory per row\n│   └── entries.jsonl            # 408 rows — one entry point per row, with human-audit flag (verify)\n└── examples\u002F\n    ├── load_dataset.py          # stdlib \u002F pandas \u002F HuggingFace datasets loader\n    ├── example_result.jsonl     # illustrative tool-findings submission\n    └── evaluate.py              # coverage \u002F recall evaluator\n```\n\n---\n\n## 🚀 Quick start\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FTencent\u002FVulnGym.git\ncd VulnGym\npython3 examples\u002Fload_dataset.py\n```\n\nOr load directly in Python:\n\n```python\nimport json\nwith open(\"data\u002Fentries.jsonl\", encoding=\"utf-8\") as f:\n    entries = [json.loads(line) for line in f if line.strip()]\n\nxss = [e for e in entries if e[\"vuln_category_l1\"] == 
\"XSS\"]\nprint(len(xss), \"XSS entries\")\nprint(xss[0][\"entry_point\"], \"→\", xss[0][\"critical_operation\"])\n\n# Restrict to the human-audited high-confidence subset\nverified = [e for e in entries if e[\"verify\"] == 1]\nprint(len(verified), \"human-audited entries\")\n```\n\nPandas:\n\n```python\nimport pandas as pd\nreports = pd.read_json(\"data\u002Freports.jsonl\", lines=True)\nentries = pd.read_json(\"data\u002Fentries.jsonl\", lines=True)\n```\n\nHuggingFace `datasets`:\n\nVulnGym is also published on the HuggingFace Hub: [tencent\u002FVulnGym](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Ftencent\u002FVulnGym).\n\n```python\nfrom datasets import load_dataset\n\n# Load directly from the HuggingFace Hub\nds = load_dataset(\"tencent\u002FVulnGym\")\n\n# Or load from local JSONL files\nds = load_dataset(\"json\", data_files={\n    \"reports\": \"data\u002Freports.jsonl\",\n    \"entries\": \"data\u002Fentries.jsonl\",\n})\n```\n\n\n## 📊 Evaluating your tool\n\nWrite your tool's findings to a JSONL file (one finding per line) and run:\n\n```bash\npython3 examples\u002Fevaluate.py path\u002Fto\u002Fyour_findings.jsonl -v\n```\n\nEach finding must carry at least `repo_url`, `commit`, `entry_point`\n(reachable entry point), and `critical_operation` (core defect location).\n`trace` (cross-module reasoning chain) is optional and ignored by the\nmatcher. See `examples\u002Fexample_result.jsonl` for a working sample.\n\nThe script reports two metrics:\n\n- **Advisory-level recall** (primary) — `covered_advisories \u002F\n  usable_advisories`. 
An advisory is covered if **at least one** of its\n  entries is matched.\n- **Entry-level recall** (secondary) — `matched_entries \u002F usable_entries`.\n\n**Default matching policy**\n\n| Aspect | Default |\n|---|---|\n| Path match | normalized, exact |\n| Line tolerance | `\\|Δline\\| ≤ 5` on entry_point **and** critical_operation |\n| Direction | strict (entry_point-to-entry_point, critical_operation-to-critical_operation) |\n| `line == 0` in ground truth | excluded from numerator and denominator |\n\nAll policies are documented and configurable via CLI arguments\n(`--line-tolerance`, etc.).\n\n> **Note:** The current evaluator **only computes recall \u002F coverage** and\n> cannot penalize over-reporting. The resulting numbers should be\n> interpreted as coverage metrics, not a full precision-aware benchmark.\n\n\n## 📖 Citation\n\n> 📚 **A companion paper is in preparation.** Until it is released, please cite VulnGym using the dataset entry below; we will update this section once the paper is publicly available.\n\n```bibtex\n@misc{vulngym2026,\n  title        = {VulnGym: A Real-World, Project-Level Vulnerability Benchmark\n                  for White-Box Vulnerability-Hunting Agents},\n  author       = {{Tencent Wukong Code Security Team and contributors}},\n  year         = {2026},\n  version      = {0.1.1},\n  howpublished = {\\url{https:\u002F\u002Fgithub.com\u002FTencent\u002FVulnGym}},\n  note         = {Dataset. 
A companion paper is in preparation; please check\n                  the repository for the latest citation.}\n}\n```\n\nOnce the paper is public, the entry below will be filled in and should be preferred:\n\n```bibtex\n@inproceedings{vulngym2026paper,\n  title     = {TBA — A companion paper for VulnGym is in preparation.},\n  author    = {{To be announced}},\n  year      = {TBA},\n  note      = {Placeholder; will be replaced once the paper is publicly available.}\n}\n```\n\nSee `CITATION.cff` for the machine-readable form.\n\n---\n\n## 🤝 Contribution Guide\n\nVulnGym aims to be an **open, reproducible, and continuously evolving**\ncommunity benchmark. Contributions from both academia and industry are\nwarmly welcomed:\n\n- 🧠 **Dataset contributions** — new advisories, additional reachable\n  entry points for existing advisories, corrections to `entry_point` \u002F\n  `critical_operation` \u002F `trace`.\n- 🔧 **Evaluator improvements** — precision \u002F F1, per-category\n  breakdowns, statistical significance (bootstrap CI), alternative\n  matching policies.\n- 📊 **Evaluation result submissions** — submit your tool's evaluation\n  results via PR to be included in the baseline comparison.\n- 💬 **Discussions & feedback** — file an\n  [Issue](https:\u002F\u002Fgithub.com\u002FTencent\u002FVulnGym\u002Fissues) or start a\n  [Discussion](https:\u002F\u002Fgithub.com\u002FTencent\u002FVulnGym\u002Fdiscussions).\n\nPlease read `SCHEMA.md` before proposing data changes — all invariants\nlisted there are enforced at release time.\n\n---\n\n## 🙏 Acknowledgements\n\nVulnGym is jointly built by the **Tencent Wukong Security Team**\ntogether with the following academic partners (listed in no particular\norder, final order TBD):\n- ARISE Lab, The Chinese University of Hong Kong\n- Systems Software & Security Lab, Fudan University\n- JC STEM Lab of Intelligent Cybersecurity, The University of Hong Kong\n- Narwhal-Lab, Peking University\n- Network Threat Analysis Lab, Institute 
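of Information Engineering, Chinese Academy of Sciences\n\nMany thanks to all partners for their outstanding contributions to\nVulnGym.\n\n---\n\n## 📄 License\n\nThe dataset is released under **CC-BY-4.0**; see [`LICENSE`](LICENSE).\nYou may use it for commercial and academic purposes with attribution.\nSource code paths and commit hashes referenced in `entry_point` \u002F\n`critical_operation` \u002F `trace` fields belong to their respective upstream\nprojects under their original licenses; consult the referenced\nrepositories before reusing any quoted code fragment.\n","VulnGym is a real-world, project-level vulnerability benchmark for white-box vulnerability-hunting agents. Its core function is to evaluate an agent's vulnerability-discovery capability using verifiable vulnerability trigger paths and code-semantic evidence chains in real engineering environments. It covers vulnerability types ranging from business-logic defects such as authorization bypass and broken authentication to traditional security issues such as injection and path traversal, and it provides each sample with a human-reviewed reachable entry point, critical operation, and cross-module reasoning chain, which keeps evaluation results reproducible and explainable. VulnGym suits scenarios that require rigorous testing and validation of software security analysis tools or automated vulnerability scanners, particularly those that must locate potential threats in complex multi-file or multi-module projects.","2026-06-11 03:58:22","CREATED_QUERY"]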