[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-80058":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":11,"openIssues":12,"contributorsCount":12,"subscribersCount":12,"size":12,"stars1d":12,"stars7d":12,"stars30d":12,"stars90d":12,"forks30d":12,"starsTrendScore":12,"compositeScore":13,"rankGlobal":9,"rankLanguage":9,"license":14,"archived":15,"fork":15,"defaultBranch":16,"hasWiki":17,"hasPages":15,"topics":18,"createdAt":9,"pushedAt":9,"updatedAt":19,"readmeContent":20,"aiSummary":21,"trendingCount":12,"starSnapshotCount":12,"syncStatus":22,"lastSyncTime":23,"discoverSource":24},80058,"2026-AI-DETECTOR-BENCHMARK","mattc95\u002F2026-AI-DETECTOR-BENCHMARK","mattc95","Benchmarking AI text detectors (GPTHumanizer, GPTZero, ZeroGPT, Sapling) across multiple datasets to evaluate accuracy, human false positive rates, and risk trade-offs.",null,"Python",62,0,40,"MIT License",false,"main",true,[],"2026-06-12 04:01:26","# 2026 AI Detector Benchmark\n\nThis repository contains a 2026 benchmark of four AI-text detection systems on a balanced English dataset of 1,000 texts. The benchmark is designed to evaluate not only overall accuracy, but also the human false positive rate: the rate at which real human writing is incorrectly flagged as AI-generated.\n\nThe tested detectors are:\n\n- GPTHumanizer\n- GPTZero\n- ZeroGPT\n- Sapling AI Detector\n\nAll detector runs in this repository were completed on May 14, 2026. The repository includes the benchmark input data, evaluation scripts, and aggregate metrics. 
Because the complete per-item detector outputs are large, they are hosted as public Google Drive artifacts and linked below for learning and research use.\n\n## Related Links\n\n- [GPTHumanizer Official Website](https:\u002F\u002Fwww.gpthumanizer.ai\u002F)\n- [Related benchmark blog article](https:\u002F\u002Fwww.gpthumanizer.ai\u002Fblog\u002F2026-ai-detector-benchmark)\n\n## Key Result\n\n![Overall AI detector benchmark performance](img\u002Foverall_performance.png)\n\nGPTZero achieved the highest overall accuracy in this run, while GPTHumanizer had the lowest human false positive risk.\n\n| Detector | Total Items | Evaluable Items | Overall Accuracy | AI Detection Rate | Human False Positive Rate | AI Miss Rate | TP | FP | FN | TN |\n|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|\n| GPTHumanizer | 1,000 | 1,000 | 98.00% | 96.00% | 0.00% | 4.00% | 480 | 0 | 20 | 500 |\n| GPTZero | 1,000 | 998 | 98.70% | 99.60% | 2.20% | 0.40% | 497 | 11 | 2 | 488 |\n| ZeroGPT | 1,000 | 1,000 | 88.20% | 94.80% | 18.40% | 5.20% | 474 | 92 | 26 | 408 |\n| Sapling | 1,000 | 1,000 | 88.60% | 96.60% | 19.40% | 3.40% | 483 | 97 | 17 | 403 |\n\nThe confusion matrix treats AI as the positive class:\n\n- TP: AI text correctly classified as AI\n- FP: human text incorrectly classified as AI\n- FN: AI text incorrectly classified as human\n- TN: human text correctly classified as human\n\nGPTZero returned two API errors during the completed run, so its accuracy and rates are calculated on 998 evaluable records. Those two non-ok records are still preserved in the output file.\n\n## Dataset\n\nThe benchmark uses 1,000 English texts:\n\n| Split | File | Count | Label | Sampling Notes |\n|---|---|---:|---|---|\n| Human | `data\u002Fhuman_detection_test.json` | 500 | `human` | Randomly sampled from Pile-small across different source domains. 
|\n| AI | `data\u002Fai_detection_test.json` | 500 | `ai` | Random sample of 500 texts from a 2,600-item AI generation pool created from prompts on February 5, 2026. |\n\nThe word-count distribution is balanced across the human and AI splits:\n\n| Word Count Bucket | Human Items | AI Items | Total Items |\n|---|---:|---:|---:|\n| 50-200 words | 150 | 150 | 300 |\n| 200-500 words | 200 | 200 | 400 |\n| 500-1000 words | 150 | 150 | 300 |\n\n### Human Data Credibility\n\nThe human benchmark set contains 500 human-written texts sampled from Pile-small. The final test file stores each sample with its text, label, source, style field, and perplexity score.\n\nHuman source coverage in the final benchmark set:\n\n| Source | Count |\n|---|---:|\n| Wikipedia (en) | 73 |\n| OpenWebText2 | 61 |\n| Pile-CC | 59 |\n| USPTO Backgrounds | 55 |\n| StackExchange | 54 |\n| NIH ExPorter | 40 |\n| HackerNews | 39 |\n| FreeLaw | 36 |\n| PubMed Abstracts | 36 |\n| Enron Emails | 26 |\n| PubMed Central | 11 |\n| ArXiv | 7 |\n| YoutubeSubtitles | 3 |\n\nWhy this matters:\n\n- The human data is not drawn from a single writing style or domain.\n- The sources include encyclopedic, web, legal, technical, biomedical, email, forum, patent, academic, and subtitle-style text.\n- Every final human benchmark record has a `human` label, non-empty text, a source field, and a retained per-item record in the detector outputs.\n\n![Human data credibility and detector performance](img\u002Fhuman_performance.png)\n\n### AI Data Credibility\n\nThe AI benchmark set contains 500 AI-generated texts randomly sampled from a larger pool of 2,600 prompted generations. The AI generation pool was produced on February 5, 2026 by prompting large language models directly. 
The final benchmark file stores each sampled item with its text, label, prompt, source model, and theme.\n\nAI source model coverage in the final benchmark set:\n\n| Source Model | Count |\n|---|---:|\n| claude-sonnet-4-20250514 | 46 |\n| gpt-3.5-turbo-0613 | 46 |\n| gpt-4.1 | 42 |\n| claude-3-7-sonnet-20250219 | 42 |\n| o3 | 42 |\n| deepseek-chat | 42 |\n| kimi-k2-0905-preview | 40 |\n| gpt-4o | 36 |\n| grok-4 | 36 |\n| claude-sonnet-4-5-20250929 | 36 |\n| gpt-5-chat-latest | 34 |\n| gpt-5-mini | 31 |\n| claude-3-5-sonnet-20241022 | 27 |\n\nWhy this matters:\n\n- The AI split is not produced by a single model family.\n- The model source is retained for every final benchmark item.\n- The prompts are preserved in `data\u002Fai_detection_test.json`, making the sampled AI data auditable at the item level.\n\n## Detector Output Files\n\nAll four detector runs have complete per-item classification data. Because these output JSON files are large, the full artifacts are stored on Google Drive instead of being committed directly to the GitHub repository. They are publicly shared and may be used for learning or research purposes.\n\nEach output JSON contains aggregate metrics and an `items` object with detailed per-item results, including text, source metadata, true label, predicted label, correctness, word count, word bucket, request status, HTTP status where available, and the detector response fields.\n\n| Detector | Full Classification Data | Original Output File | Items | Status Summary | Notes |\n|---|---|---|---:|---|---|\n| GPTHumanizer | [Google Drive](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1iOvYJBHBS-8ulAN7DxQ_XUu69dmyfokC\u002Fview?usp=sharing) | `output\u002Fdetection_eval_output.json` | 1,000 | 1,000 ok | Full per-item classifier output. 
|\n| GPTZero | [Google Drive](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1j8xFQ3Rgrg2Xl7W6NpdoMSjMXCiYY9jp\u002Fview?usp=sharing) | `output\u002Fgptzero_detection_eval_output.json` | 1,000 | 998 ok, 2 error | Two API 403 errors are preserved and excluded from evaluable metrics. |\n| ZeroGPT | [Google Drive](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1Qyyc39NX2CprZAXXkviLkwmkdZySC0yQ\u002Fview?usp=sharing) | `output\u002Fzerogpt_detection_eval_output.json` | 1,000 | 1,000 ok | Full per-item feedback and response data. |\n| Sapling | [Google Drive](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1oy9nFC5gLsT81erGc2BF_PLSFqC-BUZz\u002Fview?usp=sharing) | `output\u002Fsapling_detection_eval_output.json` | 1,000 | 1,000 ok | Full per-item Sapling response data. |\n\nGPTZero non-ok records from the completed run:\n\n| Item Key | Status | Error |\n|---|---|---|\n| `human_detection_test:33` | error | 403 Client Error: Forbidden |\n| `ai_detection_test:350` | error | 403 Client Error: Forbidden |\n\n## Prediction Rules\n\nDifferent detectors return different output formats. The benchmark normalizes each detector into a binary `human` or `ai` prediction using explicit rules stored in the output metadata.\n\n| Detector | Normalization Rule |\n|---|---|\n| GPTHumanizer | `human`, `light_edited`, and `lightly_edited` classes are treated as human; all other classes are treated as AI. |\n| GPTZero | `predicted_class = human` is treated as human; `ai` or `mixed` is treated as AI. |\n| ZeroGPT | Feedback containing `Human written` is treated as human; all other feedback is treated as AI. |\n| Sapling | AI score greater than 50% is treated as AI; score less than or equal to 50% is treated as human. 
|\n\nThese rules are implemented in the evaluation scripts and recorded in each output file's `meta.prediction_rule` field.\n\n## Metrics\n\nThe benchmark reports the following metrics:\n\n- Overall accuracy: `(TP + TN) \u002F evaluable_items`\n- AI detection rate: `TP \u002F (TP + FN)`\n- Human false positive rate: `FP \u002F (FP + TN)`\n- AI miss rate: `FN \u002F (TP + FN)`\n- Evaluable items: records with a valid normalized `human` or `ai` prediction\n\nThe human false positive rate is a central metric because false accusations against human-written text can create serious academic, professional, or institutional risk.\n\n## Performance by Text Length\n\nShorter text is harder to classify because detectors have less linguistic evidence. The benchmark therefore reports results by word-count bucket.\n\n### 50-200 Words\n\n| Detector | Accuracy | Human False Positive Rate | AI Miss Rate | TP | FP | FN | TN |\n|---|---:|---:|---:|---:|---:|---:|---:|\n| GPTHumanizer | 95.33% | 0.00% | 9.33% | 136 | 0 | 14 | 150 |\n| GPTZero | 96.67% | 6.00% | 0.67% | 149 | 9 | 1 | 141 |\n| ZeroGPT | 85.33% | 20.67% | 8.67% | 137 | 31 | 13 | 119 |\n| Sapling | 85.33% | 24.67% | 4.67% | 143 | 37 | 7 | 113 |\n\n### 200-500 Words\n\n| Detector | Accuracy | Human False Positive Rate | AI Miss Rate | TP | FP | FN | TN |\n|---|---:|---:|---:|---:|---:|---:|---:|\n| GPTHumanizer | 99.00% | 0.00% | 2.00% | 196 | 0 | 4 | 200 |\n| GPTZero | 99.25% | 1.00% | 0.50% | 199 | 2 | 1 | 198 |\n| ZeroGPT | 91.00% | 17.50% | 0.50% | 199 | 35 | 1 | 165 |\n| Sapling | 89.25% | 18.50% | 3.00% | 194 | 37 | 6 | 163 |\n\n### 500-1000 Words\n\n| Detector | Accuracy | Human False Positive Rate | AI Miss Rate | TP | FP | FN | TN |\n|---|---:|---:|---:|---:|---:|---:|---:|\n| GPTHumanizer | 99.33% | 0.00% | 1.33% | 148 | 0 | 2 | 150 |\n| GPTZero | 100.00% | 0.00% | 0.00% | 149 | 0 | 0 | 149 |\n| ZeroGPT | 87.33% | 17.33% | 8.00% | 138 | 26 | 12 | 124 |\n| Sapling | 91.00% | 15.33% | 2.67% | 146 | 23 | 4 | 127 
|\n\nGPTZero has 298 evaluable records in the 500-1000 word bucket because two records in that bucket returned API errors.\n\n## Main Findings\n\n1. GPTHumanizer produced zero human false positives in this benchmark.\n\n   It classified all 500 evaluable human-written texts as human. This does not prove a universal 0% false positive rate, but it is the strongest human-safety result observed in this dataset.\n\n2. GPTZero produced the strongest raw accuracy and AI recall.\n\n   GPTZero reached 98.70% accuracy on 998 evaluable records and detected 497 of 499 evaluable AI texts. It also incorrectly flagged 11 of 499 evaluable human texts as AI.\n\n3. ZeroGPT and Sapling showed high human false positive risk.\n\n   ZeroGPT flagged 92 of 500 human texts as AI. Sapling flagged 97 of 500 human texts as AI using the benchmark rule of score > 50%.\n\n4. Short text is the most difficult bucket.\n\n   All detectors showed weaker behavior on 50-200 word samples. GPTHumanizer missed more short AI texts, while GPTZero, ZeroGPT, and Sapling showed higher human false positive rates in that range.\n\n## Repository Structure\n\n```text\n.\n|-- data\u002F\n|   |-- human_detection_test.json\n|   |-- ai_detection_test.json\n|-- evaluate_detection_datasets.py\n|-- evaluate_gptzero_datasets.py\n|-- evaluate_zerogpt_datasets.py\n|-- evaluate_sapling_datasets.py\n|-- requirements.txt\n`-- README.md\n```\n\n## Reproducing or Extending the Benchmark\n\nInstall dependencies:\n\n```bash\npython -m pip install -r requirements.txt\n```\n\nRun the GPTHumanizer evaluation:\n\n```bash\nexport GPTHUMANIZER_API_KEY=\"your_key_here\"\npython evaluate_detection_datasets.py --restart\n```\n\nRun GPTZero:\n\n```bash\nexport GPTZERO_API_KEY=\"your_key_here\"\npython evaluate_gptzero_datasets.py --restart\n```\n\nRun ZeroGPT:\n\n```bash\nexport ZEROGPT_API_KEY=\"your_key_here\"\npython evaluate_zerogpt_datasets.py --restart\n```\n\nRun Sapling:\n\n```bash\nexport SAPLING_API_KEY=\"your_key_here\"\npython 
evaluate_sapling_datasets.py --restart --drop-token-fields\n```\n\nOn Windows PowerShell, set environment variables with `$env:NAME = \"value\"` before running the scripts.\n\nThe scripts also support explicit headers or API keys through command-line options. Do not commit private API keys to the repository.\n\n## Auditability\n\nThis repository is intended to make the benchmark auditable rather than aggregate-only.\n\nEvidence available in this repository and the linked public artifacts:\n\n- The final 500 human test records.\n- The final 500 AI test records.\n- Source metadata for human samples.\n- Prompt, theme, and model-source metadata for AI samples.\n- Evaluation scripts for all four detectors.\n- Public Google Drive links to the complete per-item classifier outputs for all four detectors.\n- Aggregate metrics embedded in each linked output file.\n\nThis means the headline results can be checked against the item-level classifications by downloading the linked output artifacts instead of relying only on summary tables.\n\n## Limitations\n\n- The benchmark is English-only.\n- The sample size is 1,000 texts, balanced as 500 human and 500 AI.\n- Detector APIs can change over time. These results describe the detector behavior observed in the completed May 14, 2026 runs.\n- GPTZero had two API errors in the completed run. 
They are preserved in the output and excluded from evaluable metrics.\n- The benchmark uses explicit binary normalization rules, but some detector outputs are more nuanced than binary human\u002FAI labels.\n- A measured 0% false positive rate on 500 human samples should be interpreted as an observed benchmark result, not as proof that future false positives cannot occur.\n- Exact regeneration of the original random samples may require the original upstream Pile-small snapshot, AI generation pool, sampling code, and random seeds if those are not otherwise archived.\n\n## Responsible Use\n\nAI detectors should support review, not replace human judgment. A detector result should not be used as the only evidence in high-stakes academic, employment, publishing, or compliance decisions.\n\nThis benchmark suggests that detector trust should be evaluated with special attention to human false positives. Catching AI-generated text is useful, but falsely accusing human writers is often the more serious risk.","This project benchmarks four AI-text detection systems (GPTHumanizer, GPTZero, ZeroGPT, Sapling) to evaluate their accuracy, human false positive rates, and risk trade-offs across multiple datasets. Written in Python, it provides a balanced English dataset of 1,000 texts and includes the benchmark input data, evaluation scripts, and aggregate metrics. It is suited to researchers and developers who need to evaluate AI-text detection tools, particularly where false positive rates and overall accuracy matter. By comparing how the detectors perform, users can better understand each tool's strengths and limitations and choose accordingly.",2,"2026-06-11 03:59:04","CREATED_QUERY"]