[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-83297":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":13,"openIssues":13,"contributorsCount":14,"subscribersCount":14,"size":14,"stars1d":14,"stars7d":15,"stars30d":15,"stars90d":14,"forks30d":14,"starsTrendScore":14,"compositeScore":16,"rankGlobal":10,"rankLanguage":10,"license":17,"archived":18,"fork":18,"defaultBranch":19,"hasWiki":20,"hasPages":18,"topics":21,"createdAt":10,"pushedAt":10,"updatedAt":24,"readmeContent":25,"aiSummary":10,"trendingCount":14,"starSnapshotCount":14,"syncStatus":15,"lastSyncTime":26,"discoverSource":27},83297,"laibench-public","laudos-ai\u002Flaibench-public","laudos-ai","A public reference benchmark for radiology finding-to-report generation.","https:\u002F\u002Flaibench.vercel.app",null,"TypeScript",54,1,0,2,0.9,"Other",false,"main",true,[22,23],"benchmark","radiology","2026-06-12 02:04:33","# LAIBench\n\nLAIBench is a governance-oriented benchmark framework for AI-assisted radiology reporting.\n\n**LAIBench is a technical benchmark framework, not a medical device, not regulatory approval, and not clinical validation. It must not be used as the sole basis for clinical deployment decisions. All references below are to that technical scope.**\n\nWebsite: [laibench.laudos.ai](https:\u002F\u002Flaibench.vercel.app)  \nCompanion paper (conceptual): *Beyond Templates: A Compositional Model and Lower Bound for Radiology Report Variability* — [PDF](site\u002Flaibench-preprint.pdf), [arXiv source](submissions\u002Farxiv-laibench\u002F)  \nBy: [Laudos.AI](https:\u002F\u002Flaudos.ai)\n\n## Current Status\n\nThis repository is a public-safe technical preview export. 
It contains code, schemas, documentation, site assets, paper draft materials, and a tiny synthetic demo suite.\n\nIt does not include:\n\n- raw clinical reports;\n- the full clinical corpus;\n- clinical CSV\u002FXLSX\u002FDICOM\u002FNIfTI files;\n- hidden test sets;\n- answer keys;\n- private scoring criteria;\n- gated evaluation artifacts.\n\n## What It Evaluates\n\nLAIBench evaluates reporting behavior from provided text evidence. The current public harness focuses on whether a system can convert an exam descriptor and concise findings into a faithful radiology report under the public contract. It is **not** primary image interpretation.\n\nThe framework is intended to make failure modes visible:\n\n- clinically relevant omissions;\n- hallucinated or unsupported findings;\n- factual contradictions;\n- critical-finding preservation;\n- structured-report compliance;\n- privacy hygiene;\n- auditability of submissions and leaderboard rows.\n\nThe current public demo does not evaluate primary image interpretation from DICOM studies. It is not a diagnostic accuracy study and does not prove clinical safety.\n\n## Public Data Boundary\n\nThe only public cases in this repository are synthetic demo cases under [cases\u002Fpublic\u002Fsynthetic-demo.pt-BR.json](cases\u002Fpublic\u002Fsynthetic-demo.pt-BR.json). They are for installation checks, smoke tests, and harness verification.\n\nThe full clinical corpus is private\u002Fgated. The hidden test set is private. Official evaluation requires hosted evaluation or controlled access under written terms. See [DATA_ACCESS_POLICY.md](DATA_ACCESS_POLICY.md).\n\n## Scoring\n\nLAIBench reports weighted finding-to-report scores and strict gate outcomes. A high average score should not hide critical failures. 
Critical finding omissions, unsafe negations, contradictions, unsupported normalcy, and structural errors trigger failure gates.\n\n| Dimension | Weight | Purpose |\n| --- | ---: | --- |\n| CRIT | 30% | Critical finding preservation and unsafe-negation checks |\n| QUAL | 25% | Clinical quality, finding preservation, hallucination resistance |\n| TERM | 20% | Locale, modality, section, and report terminology |\n| GUIDE | 15% | Guideline and anatomical coverage expectations |\n| RAG | 10% | Evidence fidelity, section order, laterality, levels, and measurements |\n\n## Quickstart\n\n```bash\nnpm ci\nnpm test\nnpm run typecheck\nnpm run smoke:mock\nnpm run smoke:leaderboard\n```\n\nRun the synthetic demo suite with a local command adapter:\n\n```bash\nnpm run bench -- suite \\\n  --suite suites\u002Flite-public.pt-BR.json \\\n  --provider command \\\n  --cmd \"node examples\u002Fmock-agent.mjs\" \\\n  --run-name mock-agent \\\n  --track mini-agent \\\n  --out runs\u002Fmock-agent.json\n```\n\nA parallel English demo suite is available at `suites\u002Flite-public.en-US.json` (cases in `cases\u002Fpublic\u002Fsynthetic-demo.en-US.json`) — both are synthetic-only.\n\nBuild a local leaderboard from run artifacts:\n\n```bash\nnpm run bench -- leaderboard \\\n  --inputs runs\u002Fmock-agent.json \\\n  --out runs\u002Fleaderboard.json \\\n  --markdown runs\u002Fleaderboard.md\n```\n\n### Reliability (pass^k)\n\nA single-shot critical-finding pass-rate saturates and is gameable by verbose \"restate everything\" reports. 
The `reliability` command measures **consistency** instead: run the same system on the same suite three times and report how often each case receives the same verdict across all runs.\n\n```bash\nnpm run bench -- reliability \\\n  --inputs runs\u002Frun-1.json runs\u002Frun-2.json runs\u002Frun-3.json \\\n  --out runs\u002Freliability.json \\\n  --markdown runs\u002Freliability.md\n```\n\n## Frozen Predictions\n\nUse predictions mode when reports were generated outside the harness. The public submission contract is documented in [docs\u002Fpublic-submissions.md](docs\u002Fpublic-submissions.md); each JSONL line follows the `PredictionInput` schema.\n\n```bash\nnpm run bench -- validate-submission \\\n  --suite suites\u002Flite-public.pt-BR.json \\\n  --predictions predictions\u002Fmy-agent.jsonl\n\nnpm run bench -- eval-submission \\\n  --suite suites\u002Flite-public.pt-BR.json \\\n  --predictions predictions\u002Fmy-agent.jsonl \\\n  --run-name my-agent \\\n  --model-label my-agent \\\n  --track mini-agent \\\n  --out runs\u002Fmy-agent.json\n```\n\n## Leaderboard Governance\n\nLeaderboard rows should disclose benchmark version, suite hash, track, scaffold class, judged\u002Ffrozen status, evaluated entity, validation status, cost, latency, and the scoring mode used. Incompatible runs are separated by track, scaffold, locale, and suite hash.\n\nPublic artifacts must not include private prompts, product routes, credentials, private file paths, raw validation ID lists, private case content, hidden judge configuration, answer keys, or proprietary schemas beyond the public contract.\n\n## arXiv Status\n\nThe paper material is draft-ready for human review, not automatic submission. 
arXiv submission remains blocked until authors, affiliations, corresponding contact, conflicts, ethics\u002FIRB\u002FCEP language, and license choice are finalized.\n\n## License\n\nThis public LAIBench framework repository is released under the MIT License.\n\nThe MIT License applies to the public code, schemas, documentation, examples, synthetic demo cases, and tooling included in this repository. It does not apply to the private clinical corpus, gated datasets, hidden test sets, answer keys, private scoring criteria, or protected evaluation artifacts that are not included in this repository.\n","2026-06-11 04:10:51","CREATED_QUERY"]