[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-74220":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":25,"hasPages":25,"topics":26,"createdAt":10,"pushedAt":10,"updatedAt":27,"readmeContent":28,"aiSummary":29,"trendingCount":16,"starSnapshotCount":16,"syncStatus":30,"lastSyncTime":31,"discoverSource":32},74220,"bullshit-benchmark","petergpt\u002Fbullshit-benchmark","petergpt","BullshitBench measures whether AI models challenge nonsensical prompts instead of confidently answering them, created by Peter Gostev.","https:\u002F\u002Fx.com\u002Fpetergostev",null,"Python",1705,65,12,13,0,11,27,90,33,95.96,"MIT License",false,"main",true,[],"2026-06-12 04:01:13","\u003Ch1>\n  \u003Cimg src=\"docs\u002Fimages\u002Fbsbench.png\" alt=\"BullshitBench logo\" width=\"64\" \u002F>\n  BullshitBench v2\n\u003C\u002Fh1>\n\nBullshitBench measures whether models detect nonsense, call it out clearly, and avoid confidently continuing with invalid assumptions.\n\n- Public viewer (latest): https:\u002F\u002Fpetergpt.github.io\u002Fbullshit-benchmark\u002Fviewer\u002Findex.v2.html\n- Updated: 2026-05-07\n\n## Latest Changelog Entry (2026-05-07)\n\n- Added GPT-5.5 chat benchmark results to both published tracks: `v1` with `55` questions and `v2` with `100` questions.\n- Published:\n  - `openai\u002Fgpt-5.5-chat@reasoning=default`\n- v1 score: `0.6303` average, `8` Clear Pushback, `20` Partial Challenge, `27` Accepted Nonsense.\n- v2 score: `1.0133` average, `34` Clear Pushback, `39` Partial Challenge, `27` Accepted Nonsense.\n- Recorded `openai\u002Fgpt-5.5-chat` as 
the benchmark display row for OpenAI's `chat-latest` API slug, since the slug does not expose the GPT-5.5 chat-family name directly.\n- Updated durable v1\u002Fv2 config coverage and refreshed the published leaderboard, release-date, reasoning-token\u002Fcost, and model-size chart data from completed 3-judge panels.\n- Full details: [CHANGELOG.md](CHANGELOG.md)\n\n## v2 Changelog Highlights\n\n- `100` new nonsense questions in the v2 set.\n- Domain-specific question coverage across `5` domains: `software` (40), `finance` (15), `legal` (15), `medical` (15), `physics` (15).\n- New visualizations in the v2 viewer, including:\n  - Detection Rate by Model (stacked mix bars)\n  - Domain Landscape (overall vs domain detection mix)\n  - Detection Rate Over Time\n  - Do Newer Models Perform Better?\n  - Does Thinking Harder Help? (tokens\u002Fcost toggle)\n  - Model Size and Weights (total\u002Factive parameter scatter views)\n\n## Viewer Walkthrough (v2)\n\nThe screenshots below follow the same flow as `viewer\u002Findex.v2.html`, starting with the main chart.\n\n### 1. Detection Rate by Model (Main Chart)\n\nPrimary leaderboard-style view showing each model's green\u002Famber\u002Fred split.\n\n![BullshitBench v2 - Detection Rate by Model](docs\u002Fimages\u002Fv2-detection-rate-by-model.png)\n\n### 2. Domain Landscape\n\nDetection mix by domain to compare overall performance vs each domain at a glance.\n\n![BullshitBench v2 - Domain Landscape](docs\u002Fimages\u002Fv2-domain-landscape.png)\n\n### 3. Detection Rate Over Time\n\nRelease-date trend view focused on Anthropic, OpenAI, and Google.\n\n![BullshitBench v2 - Detection Rate Over Time](docs\u002Fimages\u002Fv2-detection-rate-over-time.png)\n\n### 4. Do Newer Models Perform Better?\n\nAll-model scatter by release date vs. green rate.\n\n![BullshitBench v2 - Do Newer Models Perform Better](docs\u002Fimages\u002Fv2-do-newer-models-perform-better.png)\n\n### 5. 
Does Thinking Harder Help?\n\nReasoning scatter (tokens\u002Fcost toggle in the viewer) vs. green rate.\n\n![BullshitBench v2 - Does Thinking Harder Help](docs\u002Fimages\u002Fv2-does-thinking-harder-help.png)\n\n### 6. Model Size and Weights\n\nTotal and active parameter scatter views for models with public size metadata.\n\n![BullshitBench v2 - Model Size and Weights](docs\u002Fimages\u002Fv2-model-size-scatters.png)\n\n## Benchmark Scope (v2)\n\n- `100` nonsense prompts total.\n- `5` domain groups: `software` (40), `finance` (15), `legal` (15), `medical` (15), `physics` (15).\n- `13` nonsense techniques (for example: `plausible_nonexistent_framework`, `misapplied_mechanism`, `nested_nonsense`, `specificity_trap`).\n- `3`-judge panel aggregation (`anthropic\u002Fclaude-sonnet-4.6`, `openai\u002Fgpt-5.2`, `google\u002Fgemini-3.1-pro-preview`) using `full` panel mode + `mean` aggregation.\n- Published v2 leaderboard currently includes `156` model\u002Freasoning rows.\n\n## What This Measures\n\n- `Clear Pushback`: the model clearly rejects the broken premise.\n- `Partial Challenge`: the model flags issues but still engages the bad premise.\n- `Accepted Nonsense`: the model treats the nonsense as valid.\n\n## Quick Start\n\n1. Set API keys:\n\n```bash\nexport OPENROUTER_API_KEY=your_key_here\nexport OPENAI_API_KEY=your_openai_key_here  # required only for models routed to OpenAI\nexport OPENAI_PROJECT=proj_xxx              # optional: force OpenAI requests to a specific project\nexport OPENAI_ORGANIZATION=org_xxx          # optional: force organization context\n```\n\nProvider routing is configured per model via `collect.model_providers` and\n`grade.model_providers` in config (default is OpenRouter), for example:\n`{\"*\":\"openrouter\",\"gpt-5.3\":\"openai\"}`.\n\n2. Run collection + primary judge (Claude by default):\n\n```bash\n.\u002Fscripts\u002Frun_end_to_end.sh\n```\n\n3. 
Run v2 end-to-end and publish into the dedicated v2 dataset:\n\n```bash\n.\u002Fscripts\u002Frun_end_to_end.sh --config config.v2.json --viewer-output-dir data\u002Fv2\u002Flatest --with-additional-judges\n```\n\n4. Optionally run the default config end-to-end (publishes to `data\u002Flatest`):\n\n```bash\n.\u002Fscripts\u002Frun_end_to_end.sh --with-additional-judges\n```\n\n5. Open the viewer:\n\n- Published viewer (latest): https:\u002F\u002Fpetergpt.github.io\u002Fbullshit-benchmark\u002Fviewer\u002Findex.v2.html\n- Local viewer (optional):\n\n```bash\n.\u002Fscripts\u002Frun_end_to_end.sh --with-additional-judges --serve --port 8877\n```\n\nThen open `http:\u002F\u002Flocalhost:8877\u002Fviewer\u002Findex.v2.html`.\nUse the `Benchmark Version` dropdown in the filters panel to switch between published datasets (for example `v1` and `v2`).\n\n## Published Datasets\n\n- v1 dataset remains in `data\u002Flatest`.\n- v2 dataset is published in `data\u002Fv2\u002Flatest`.\n- v2 question set comes from `drafts\u002Fnew-questions.md` via `scripts\u002Fbuild_questions_v2_from_draft.py`.\n- Canonical judging is now fixed to exactly 3 judges on every row with mean aggregation (legacy disagreement-tiebreak mode is retired from the main pipeline).\n- Release notes and notable changes are tracked in `CHANGELOG.md`.\n\n## Documentation\n\n- [Technical Guide](docs\u002FTECHNICAL.md): pipeline operations, publishing artifacts, launch-date metadata workflow, repo layout, env vars.\n- [Changelog](CHANGELOG.md): v1 to v2 release notes and publish-history highlights.\n- [Question Set](questions.json): benchmark questions and scoring metadata.\n- [Question Set v2](questions.v2.json): v2 question pool generated from `drafts\u002Fnew-questions.md`.\n- [Config](config.json): default model\u002Fpipeline settings.\n- [Config v2](config.v2.json): v2-ready config (uses `questions.v2.json`).\n\n## Notes\n\n- This README is intentionally audience-facing.\n- Technical and maintainer-oriented 
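content lives in `docs\u002FTECHNICAL.md`.\n\n## License\n\nMIT. See [LICENSE](LICENSE).\n\n## Star History \n\n\u003Cpicture>\n  \u003Csource media=\"(prefers-color-scheme: dark)\" srcset=\"https:\u002F\u002Fapi.star-history.com\u002Fsvg?repos=petergpt\u002Fbullshit-benchmark&type=Date&theme=dark&cachebust=20260517\" \u002F>\n  \u003Csource media=\"(prefers-color-scheme: light)\" srcset=\"https:\u002F\u002Fapi.star-history.com\u002Fsvg?repos=petergpt\u002Fbullshit-benchmark&type=Date&cachebust=20260517\" \u002F>\n  \u003Cimg alt=\"Star History Chart\" src=\"https:\u002F\u002Fapi.star-history.com\u002Fsvg?repos=petergpt\u002Fbullshit-benchmark&type=Date&cachebust=20260517\" \u002F>\n\u003C\u002Fpicture>\n","BullshitBench is a tool for evaluating whether AI models can recognize and reject nonsensical prompts. The project uses a set of carefully designed questions to test how AI models behave when faced with illogical or mistaken assumptions, and it is written primarily in Python. It quantifies how often models detect nonsensical content across domains (such as software, finance, law, medicine, and physics) and provides several visualizations to help users interpret the results, including detection performance by model, by domain, and over time. BullshitBench is suited to researchers and developers who need to assess the critical-thinking ability of AI systems, and it is valuable for ensuring the quality of generated content.",2,"2026-06-11 03:49:34","high_star"]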