[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-80922":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":16,"stars7d":15,"stars30d":17,"stars90d":16,"forks30d":16,"starsTrendScore":16,"compositeScore":18,"rankGlobal":10,"rankLanguage":10,"license":19,"archived":20,"fork":20,"defaultBranch":21,"hasWiki":22,"hasPages":20,"topics":23,"createdAt":10,"pushedAt":10,"updatedAt":28,"readmeContent":29,"aiSummary":30,"trendingCount":16,"starSnapshotCount":16,"syncStatus":14,"lastSyncTime":31,"discoverSource":32},80922,"chi-bench","actava-ai\u002Fchi-bench","actava-ai","Χ-Bench: Can AI Agents Automate End-to-End, Long-Horizon, Policy-Rich Healthcare Workflows?","https:\u002F\u002Factava.ai\u002Fbenchmarks",null,"Python",39,6,2,1,0,4,43.44,"Apache License 2.0",false,"main",true,[24,25,26,27],"benchmark","care-management","healthcare-ai","prior-authorization","2026-06-11 04:07:22","\u003Cdiv align=\"center\">\n  \u003Cimg src=\"assets\u002Ffigures\u002FX-bench-color@600x.png\" alt=\"χ-Bench\" width=\"300\"\u002F>\n  \u003Ch1>\u003Cins>C\u003C\u002Fins>linical \u003Cins>H\u003C\u002Fins>ealthcare \u003Cins>I\u003C\u002Fins>n-Situ Environment\u003C\u002Fh1>\n  \u003Cp>\u003Cb>Benchmark for long-horizon, policy-rich healthcare workflow 
agents\u003C\u002Fb>\u003C\u002Fp>\n\n[![Leaderboard](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLeaderboard-chi--bench-blue?style=for-the-badge)](https:\u002F\u002Factava.ai\u002Fbenchmarks\u002Fleaderboards)\n[![Docs](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDocs-chi--bench-ff5baf?style=for-the-badge&logo=readthedocs&logoColor=white)](https:\u002F\u002Factava.ai\u002Fbenchmarks\u002Fdocs)\n[![arXiv](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2605.16679-b31b1b?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2605.16679)\n[![Dataset](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDataset-chi--bench-yellow?style=for-the-badge&logo=huggingface&logoColor=white)](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Factava\u002Fchi-bench)\n[![Handbook](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FSkill-managed--care--operations--handbook-yellow?style=for-the-badge&logo=huggingface&logoColor=white)](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Factava\u002Fmanaged-care-operations-handbook)\n[![Harbor 
hub](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FHarbor_hub-actava--ai\u002Fchi--bench-00b8d4?style=for-the-badge)](https:\u002F\u002Fhub.harborframework.com\u002Fdatasets\u002Factava-ai\u002Fchi-bench)\n\n[![Discord](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FJoin_Our_Discord-5865F2?style=for-the-badge&logo=discord&logoColor=white)](https:\u002F\u002Fdiscord.gg\u002FeQfMpUQtda)\n[![Slack](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FJoin_Our_Slack-4A154B?style=for-the-badge&logo=slack&logoColor=white)](https:\u002F\u002Fjoin.slack.com\u002Fshare\u002FenQtMTExMTE4MDYyNTMzOTktMzZiMGE2MjYxYjRmNzYyMTFiMDVkZmJiNzZiYWUwNWMwNzJkMGRiZDIwYmU5ZWM5NDQyY2E2ZDEyNTcxZWQ1ZA)\n[![WeChat](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FJoin_Our_WeChat-07C160?style=for-the-badge&logo=wechat&logoColor=white)](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1FD93bxx4E9C9FZDCQW0o_KoQGi-i8WOa\u002Fview?usp=sharing)\n[![LinkedIn](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLinkedIn-actava-0A66C2?style=for-the-badge&logo=linkedin&logoColor=white)](https:\u002F\u002Fwww.linkedin.com\u002Fcompany\u002Factava\u002F)\n\n\u003C\u002Fdiv>\n\n## What this benchmark measures\n\n$\\chi$-Bench evaluates AI agents on end-to-end U.S. healthcare workflows across three long-horizon domains: provider prior authorization, payer utilization management, and population care management. 
Each task hands the agent a clinical case inside a high-fidelity simulator of 20 healthcare apps exposed over MCP, supplies the 1,279-document Managed-Care Operations Handbook as a skill, and asks it to drive the case through tool calls and artifact authoring.\n\n> [!TIP]\n> Reading on the web: **[Overview & authors](https:\u002F\u002Factava.ai\u002Fbenchmarks\u002Fchi-bench)** · **[Live leaderboard](https:\u002F\u002Factava.ai\u002Fbenchmarks\u002Fleaderboards)** · **[All 75 tasks](https:\u002F\u002Factava.ai\u002Fbenchmarks\u002Ftasks)** · **[Docs](https:\u002F\u002Factava.ai\u002Fbenchmarks\u002Fdocs)**.\n\n> [!NOTE]\n> **Headline numbers from the paper:**\n>\n> - Best agent (Claude Code + Claude Opus 4.6): **28.0%** overall pass@1\n> - No agent clears **20%** on strict pass^3\n> - Marathon (all 25 tasks in one session): **3.8%** overall\n> - End-to-end provider–payer arena: **0%** on the best PA agents\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002Ffigures\u002Fmain_pass_at_1.png\" alt=\"pass@1 across the three χ-Bench environments\" width=\"780\"\u002F>\n\u003C\u002Fp>\n\n| Domain                               | Tasks | What the agent does                                                                                        |\n| ------------------------------------ | ----- | ---------------------------------------------------------------------------------------------------------- |\n| **Prior Authorization — Provider**   | 25    | Verify coverage, gather evidence, submit the PA packet, work the response (RFIs, peer-to-peer, appeals)    |\n| **Prior Authorization — UM (Payer)** | 25    | Intake the request, check plan policy, escalate through nurse and physician reviewers, issue determination |\n| **Care Management**                  | 25    | Review the chart, contact the patient, administer assessments, author a care plan                          |\n\n## Setup (one-time)\n\n**Prereqs:** Python 3.12+, Docker, 
[uv](https:\u002F\u002Fgithub.com\u002Fastral-sh\u002Fuv).\n\n**1. Clone and install.**\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Factava-ai\u002Fchi-bench && cd chi-bench\nuv sync --extra dev\n```\n\n**2. API keys.** Copy `.env.example` to `.env` and fill in:\n\n- `ANTHROPIC_API_KEY` — **required**. The workspace judge (`claude-opus-4-7`) grades every trial, and this key is also the default credential for the Claude Code agent harness.\n- `OPENAI_API_KEY` — required for Codex and OAI Agents rows.\n- `GEMINI_API_KEY` — required for Gemini CLI rows.\n- `OPENROUTER_API_KEY` — required for the open-stack rows (Hermes \u002F OpenClaw \u002F OAI Agents \u002F DeepAgents on open-weight models).\n- `CLAUDE_CODE_OAUTH_TOKEN` — _optional_, cheaper alternative for smoke-testing the Claude Code harness. When set, Claude Code authenticates via OAuth instead of `ANTHROPIC_API_KEY`.\n\nProvide whichever provider keys you need for the rows you intend to run. Hugging Face and Modal credentials are handled by their respective CLIs (see step 3 and the Modal note below) — no tokens go in `.env`.\n\n**3. Task fixtures from Hugging Face.** Authenticate once with the CLI, then download the gated dataset:\n\n```bash\nuv run huggingface-cli login\n\nREV=chi-bench-v1.0.0\nuv run huggingface-cli download actava\u002Fchi-bench --repo-type dataset --revision \"$REV\" --local-dir data\u002F\necho \"$REV\" > data\u002F.chi-bench-version\n```\n\nThe `data\u002F.chi-bench-version` pin is what submission preflight verifies against your config's `dataset.version`; write it whenever you change revisions.\n\n**4. Managed-Care Operations Handbook (gated, request access).**\n\nThe handbook (1,279 markdown documents) is distributed separately as the **gated** Hugging Face dataset **[actava\u002Fmanaged-care-operations-handbook](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Factava\u002Fmanaged-care-operations-handbook)** (due to its size and its curation provenance with clinical collaborators). 
Request access on that repo's page; once approved, download it into `data\u002Fskills\u002F` with your HF token:\n\n```bash\nuv run huggingface-cli download actava\u002Fmanaged-care-operations-handbook \\\n    --repo-type dataset --local-dir data\u002Fskills\u002F\n# -> data\u002Fskills\u002Fmanaged-care-operations-handbook\u002F{SKILL.md,references\u002F}\n```\n\n**5. Build the Docker image** (~5 min, one-time).\n\n```bash\nuv run cb docker build\n```\n\n> [!NOTE]\n> `cb` is the short alias for `chi-bench`; both commands resolve to the same CLI. Pick whichever you prefer (the rest of this README uses `cb`). If your shell already aliases `cb` to something else (e.g. a clipboard tool), use `chi-bench`. For the full command surface and flag reference, read [`docs\u002Fcli.md`](docs\u002Fcli.md).\n\nThe image bundles the FastAPI server, the workspace judge, the agent harness, and per-task fixtures.\n\n**Verify setup:**\n\n```bash\nuv run cb data verify\n```\n\nA clean run means you're ready for the quickstart.\n\n> [!TIP]\n> **Modal (optional, recommended).** Modal parallelizes trials across remote sandboxes. Set it up now and you won't have to later:\n>\n> ```bash\n> uv run modal setup                            # default profile, or:\n> uv run modal token set --profile chi-bench    # (optional) named profile\n> ```\n>\n> If you use a named profile, export `MODAL_PROFILE=chi-bench` in your shell before running the matrix.\n\n## Quickstart: run one task\n\nSmoke-test that everything is wired up with a single UM medical-director-review task:\n\n```bash\nuv run cb experiment run \\\n  --dataset data\u002Fprior_auth_um\u002Ftasks\u002Fpa_t008_t008_o002_p01_mdreview_payer \\\n  --agent codex --model openai\u002Fgpt-5.5\n```\n\nTrial output lands under `logs\u002Fexperiments\u002F...\u002Ftrial_*\u002F`. 
Read `result.json` for the verifier reward and `verifier\u002Fscorecard.json` for per-check verdicts.\n\nFull flag-by-flag CLI reference: [`docs\u002Fcli.md`](docs\u002Fcli.md). Web walkthrough of the same flow: **[actava.ai\u002Fbenchmarks\u002Fdocs\u002Fquickstart](https:\u002F\u002Factava.ai\u002Fbenchmarks\u002Fdocs\u002Fquickstart)**.\n\n## Run from the Harbor hub (no source checkout)\n\nchi-Bench is also published to the [Harbor hub](https:\u002F\u002Fhub.harborframework.com\u002Fdatasets\u002Factava-ai\u002Fchi-bench) as `actava-ai\u002Fchi-bench` — **78 single-agent tasks** (75 single-domain + 3 marathon). Each task ships a self-contained Dockerfile that Harbor builds on demand — cloning this repo and downloading the [fixtures dataset](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Factava\u002Fchi-bench) at build — so you can run a trial without cloning anything yourself.\n\n> **The 23 provider↔payer E2E tasks are *not* on the Harbor hub.** The E2E arena needs the two-agent `dual-pa-e2e` harness (provider phase → relay → payer phase), which a stock single-agent `harbor run` can't drive. 
Run E2E from this repo \u002F the [HF dataset](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Factava\u002Fchi-bench) with the `cb` CLI: `cb experiment run --dataset data\u002Fprior_auth_e2e\u002Ftasks\u002F\u003Cid> --agent dual-pa-e2e --provider-model … --payer-model …` (see [`configs\u002Fexperiments\u002Ftable2_e2e_arena.yaml`](configs\u002Fexperiments\u002Ftable2_e2e_arena.yaml)).\n\n**Prerequisites:** Docker + the [Harbor CLI](https:\u002F\u002Fgithub.com\u002Fharbor-framework\u002Fharbor), and an **approved HF token** for the gated [handbook](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Factava\u002Fmanaged-care-operations-handbook) (the container downloads it at start; the fixtures dataset itself is public).\n\n```bash\nHF_TOKEN=\u003Cyour-approved-hf-token> harbor run \\\n    -d actava-ai\u002Fchi-bench@v1.0.1 \\\n    -i actava-ai\u002Fpa_t016_t016_o001_p01_p2p_payer \\\n    -a claude-code -m claude-opus-4-7 -y\n```\n\n`HF_TOKEN` is read from your shell (or `--env-file`) and forwarded into the container — `-y` auto-confirms that prompt. Drop `-i …` to run all 78 tasks. Without an approved token the container exits early with a clear message. See [`docs\u002Fharbor-hub.md`](docs\u002Fharbor-hub.md) for how the fetch-at-build environment works and how the listing is regenerated\u002Fpublished. 
This path is for ad-hoc runs and discovery; the **paper-reproduction and leaderboard-submission flows use the `cb` CLI** described above.\n\n### Reading the verifier output\n\nEach trial's `verifier\u002F` directory has three files: `reward.json` (the single binary `{\"reward\": 0.0 | 1.0}` used for pass@1), `scorecard.json` (per-check breakdown — read this to see _why_ a trial passed or failed), and `exported_state.json` (the world snapshot the verifier scored against).\n\n`scorecard.json` carries two reward axes: **`binary_reward`** (strict — `1.0` only when every non-N\u002FA check passes; this is what the leaderboard publishes) and **`fractional_reward = passed_checks \u002F total_checks`** (partial credit for diagnostics; never published). A `0.0 \u002F 0.91` split means a near-miss. Checks are grouped under `stages` (`md_review`, `outcome`, `cross_stage`, `intake`, `nurse_review`, `p2p`, `appeal`, `provider_*`, `cm_*`, `e2e_consistency`); `failed_checks` lists what broke.\n\n> [!TIP]\n> Field-by-field walkthrough — including check-name namespaces, the `judge.*` LLM rubric format, three check states, the Care Management two-axis schema, a worked example, and `cb verifier rejudge` — lives at **[actava.ai\u002Fbenchmarks\u002Fdocs\u002Fscorecard](https:\u002F\u002Factava.ai\u002Fbenchmarks\u002Fdocs\u002Fscorecard)**.\n\nIf you see a scorecard, you're ready to [submit your agent](#submit-your-agent) or [reproduce the paper](#reproduce-paper-tables).\n\n## Submit your agent\n\n> [!TIP]\n> **Bringing your own agent harness or model endpoint?** The end-to-end recipe lives in [`docs\u002Fextending.md`](docs\u002Fextending.md) and on the web at **[actava.ai\u002Fbenchmarks\u002Fdocs\u002Fextending](https:\u002F\u002Factava.ai\u002Fbenchmarks\u002Fdocs\u002Fextending)**. 
The rest of this section is identical regardless of whether you submit a built-in agent or a custom one — the packet shape is unchanged.\n\nSubmitting to the [leaderboard](https:\u002F\u002Fgithub.com\u002Factava-ai\u002Fleaderboard) is a 5-command flow: 4 against chi-bench (validate, run, status, prepare) and the final step against the leaderboard repo (commit + open PR). Prefer reading on the web? See the **[in-app submission walkthrough](https:\u002F\u002Factava.ai\u002Fbenchmarks\u002Fsubmit)** for the same flow with collapsible step UI.\n\n**1. Configure.** Copy `configs\u002Fsubmission_example.yaml` to `configs\u002Fsubmissions\u002F\u003Cyour-id>.yaml` and edit `id`, `team`, `contact`, `agent`, `model`; optionally `notes` and `run.*`.\n\n**2. Run trials and prepare a packet.**\n\n```bash\n# Schema + preflight: dataset pin, Modal token \u002F Docker image, agent name.\nuv run cb submission validate -f configs\u002Fsubmissions\u002F\u003Cyour-id>.yaml\n\n# Run all 3 domains. Default: one trial per task (pass@1).\nuv run cb submission run      -f configs\u002Fsubmissions\u002F\u003Cyour-id>.yaml\n\n# Check progress; safe to run while `submission run` is in flight.\nuv run cb submission status   -f configs\u002Fsubmissions\u002F\u003Cyour-id>.yaml\n\n# Curate the leaderboard-ready packet (a directory you `cp` into the leaderboard repo).\nuv run cb submission prepare  -f configs\u002Fsubmissions\u002F\u003Cyour-id>.yaml\n```\n\nThe final command writes to `logs\u002Fsubmissions\u002F\u003Cid>\u002Fpacket\u002FYYYY-MM-DD-\u003Cid>\u002F`, containing:\n\n```\nsubmission.json                # manifest: agent, model, results, provenance\nresults.csv                    # leaderboard rows (one per domain + overall)\nsub.yaml                       # frozen copy of your config\nprovenance.json                # git SHA, image digest, timestamps\nREADME.md                      # auto-generated headline summary\ntrials\u002F\u003Cdomain>\u002F\u003Ctrial_id>\u002F\n    
result.json                # Harbor reward + agent metadata\n    verifier\u002Fscorecard.json    # per-check verdicts\n    verifier\u002Freward.json       # verifier's reward breakdown\n    agent\u002Ftrajectory.jsonl.zst # full agent trace (zstd-compressed; inspect with `zstdcat | jq .`)\n```\n\nWorkspace artifacts and Harbor scratch files are deliberately excluded so the packet stays small (typically \u003C100 MB total).\n\n**3. Submit the packet.** Follow the instructions at **\u003Chttps:\u002F\u002Fgithub.com\u002Factava-ai\u002Fleaderboard>** — either the one-command helper (`python scripts\u002Fsubmit.py \u003Cpacket-path>`) or the manual `cp` + `git` + `gh pr create` flow. Either way, the packet is identical; the leaderboard repo owns the submission workflow.\n\nPacket contract (for benchmark authors building their own producers): [`docs\u002Fsubmission-packet.md`](docs\u002Fsubmission-packet.md).\n\n**Policy notes.**\n\n- **Partial submissions** (`--domain pa | um | cm` on `submission run`) are accepted but flagged as partial on the leaderboard.\n- **Leaderboard is pass@1 only.** Set `run.n_attempts: 3` to keep extra trials on disk for your own pass@3 \u002F pass^3 analysis — the manifest still publishes pass@1.\n\n## Reproduce paper tables\n\n| Result                           | Config                       | Command                         |\n| -------------------------------- | ---------------------------- | ------------------------------- |\n| Main matrix (paper Table 2)      | `table1_main_matrix.yaml`    | `.\u002Fscripts\u002Frun_table.sh table1` |\n| E2E arena (paper Table 3)        | `table2_e2e_arena.yaml`      | `.\u002Fscripts\u002Frun_table.sh table2` |\n| Marathon (paper Table 4)         | `table3_marathon.yaml`       | `.\u002Fscripts\u002Frun_table.sh table3` |\n| Skill ablation (paper Figure 12) | `table4_skill_ablation.yaml` | `.\u002Fscripts\u002Frun_table.sh table4` |\n| MCP vs CLI (paper Table 5)       | `table5_mcp_vs_cli.yaml`     | 
`.\u002Fscripts\u002Frun_table.sh table5` |\n\nAfter all slices finish, aggregate:\n\n```bash\nuv run python scripts\u002Faggregate.py \\\n  --trials-dir logs\u002Fexperiments\u002Ftable1_main_matrix \\\n  --prices configs\u002Fprices.yaml \\\n  --out-csv logs\u002Ftable1.csv\n```\n\nCSV columns: `agent, model, n_trials, n_tasks, pass_at_1, pass_at_1_lo, pass_at_1_hi, pass_at_3, ..., pass_pow_3, pass_pow_3_hi, mean_cost_usd, mean_walltime_s` with task-level percentile bootstrap 95% CIs (1,000 iterations, seed `0` — matches paper Table 2 \u002F Figure 3 captions; override with `--bootstrap-iters` \u002F `--bootstrap-seed`). v1 emits the numeric tables; paper figures are out of scope — plot from the CSV. See [`docs\u002Freproduce.md`](docs\u002Freproduce.md) for the figure scripts we used.\n\n> [!TIP]\n> Add `--modal` to `run_table.sh` for parallel execution on Modal — matrix reproduction on a single host takes days.\n\nWeb walkthrough of the same flow (single trial, submission lifecycle, paper-table reproduction, Modal vs Docker): **[actava.ai\u002Fbenchmarks\u002Fdocs\u002Frun](https:\u002F\u002Factava.ai\u002Fbenchmarks\u002Fdocs\u002Frun)**.\n\n## Supported agents\n\n| `--agent`       | Example `--model`             | Paper rows  |\n| --------------- | ----------------------------- | ----------- |\n| `claude-code`   | `anthropic\u002Fclaude-opus-4-7`   | Claude Code |\n| `codex`         | `openai\u002Fgpt-5.5`              | Codex       |\n| `gemini-cli`    | `gemini\u002Fgemini-3-pro-preview` | Gemini CLI  |\n| `openclaw`      | `anthropic\u002Fclaude-opus-4-7`   | OpenClaw    |\n| `hermes`        | `openrouter\u002Fz-ai\u002Fglm-5.1`     | Hermes      |\n| `openai-agents` | `deepseek\u002Fdeepseek-v4-pro`    | OAI Agents  |\n| `deepagents`    | `openrouter\u002Fx-ai\u002Fgrok-4.3`    | DeepAgents  |\n\nThe full 30-row matrix (every model × harness reported in the main results table) lives in 
[`configs\u002Fexperiments\u002Ftable1_main_matrix.yaml`](configs\u002Fexperiments\u002Ftable1_main_matrix.yaml). Browse all 75 tasks at **[actava.ai\u002Fbenchmarks\u002Ftasks](https:\u002F\u002Factava.ai\u002Fbenchmarks\u002Ftasks)**.\n\nSee [`docs\u002Fextending.md`](docs\u002Fextending.md) (or **[the web version](https:\u002F\u002Factava.ai\u002Fbenchmarks\u002Fdocs\u002Fextending)**) to plug in your own.\n\n## Architecture\n\nA single Python package (`chi_bench`) hosts a FastAPI server, three MCP servers (provider :8020, payer :8100, CM :8200), and an LLM-based workspace judge. Each trial runs in a fresh Docker container that bundles the server, the judge, the agent harness, and the per-task fixtures. The Managed-Care Operations Handbook (1,279 markdown documents) is mounted into the agent's skill directory at trial start.\n\nSystem diagram and module boundaries: [`docs\u002Farchitecture.md`](docs\u002Farchitecture.md) (web: **[actava.ai\u002Fbenchmarks\u002Fdocs\u002Farchitecture](https:\u002F\u002Factava.ai\u002Fbenchmarks\u002Fdocs\u002Farchitecture)**). Verifier details: [`docs\u002Fjudge.md`](docs\u002Fjudge.md). Full CLI reference: [`docs\u002Fcli.md`](docs\u002Fcli.md). Environment chapter from the paper: [`chi-bench-arxiv-submission\u002Fsections\u002Fapproach.tex`](chi-bench-arxiv-submission\u002Fsections\u002Fapproach.tex).\n\n## Citation\n\nIf you use $\\chi$-Bench, please cite:\n\n```bibtex\n@misc{chen2026chibenchaiagentsautomate,\n      title={CHI-Bench: Can AI Agents Automate End-to-End, Long-Horizon, Policy-Rich Healthcare Workflows?},\n      author={Haolin Chen and Deon Metelski and Leon Qi and Tao Xia and Joonyul Lee and Steve Brown and Kevin Riley and Frank Wang and T. Y. 
Alvin Liu and Hank Capps MD and Zeyu Tang and Xiangchen Song and Lingjing Kong and Fan Feng and Tianyi Zeng and Zhiwei Liu and Zixian Ma and Hang Jiang and Fangli Geng and Yuan Yuan and Chenyu You and Qingsong Wen and Hua Wei and Yanjie Fu and Yue Zhao and Carl Yang and Biwei Huang and Kun Zhang and Caiming Xiong and Sanmi Koyejo and Eric P. Xing and Philip S. Yu and Weiran Yao},\n      year={2026},\n      eprint={2605.16679},\n      archivePrefix={arXiv},\n      primaryClass={cs.CL},\n      url={https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.16679},\n}\n```\n\n## License\n\nCode: Apache-2.0 (see [`LICENSE`](LICENSE)). Data licensing on the [HF dataset card](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Factava\u002Fchi-bench).\n","Χ-Bench is a benchmark for evaluating AI agents on end-to-end, long-horizon, policy-rich U.S. healthcare workflows. It uses a high-fidelity simulator of 20 healthcare apps, together with the 1,279-document Managed-Care Operations Handbook skill, to test how well agents carry out tasks in three domains: provider prior authorization, payer utilization management, and population care management. The project is written in Python, offers a highly realistic environment and extensive documentation, and is suited to researching and building intelligent systems that automate complex healthcare business processes.","2026-06-11 04:02:50","CREATED_QUERY"]