[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-80767":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":15,"stars7d":16,"stars30d":16,"stars90d":15,"forks30d":15,"starsTrendScore":15,"compositeScore":17,"rankGlobal":10,"rankLanguage":10,"license":18,"archived":19,"fork":19,"defaultBranch":20,"hasWiki":19,"hasPages":19,"topics":21,"createdAt":10,"pushedAt":10,"updatedAt":22,"readmeContent":23,"aiSummary":24,"trendingCount":15,"starSnapshotCount":15,"syncStatus":16,"lastSyncTime":25,"discoverSource":26},80767,"HarnessAudit","UCSB-AI\u002FHarnessAudit","UCSB-AI","Official codebase for the paper \"Auditing Agent Harness Safety\"","https:\u002F\u002Fharnessaudit.github.io\u002F",null,"Python",42,3,40,0,2,1.81,"MIT License",false,"main",[],"2026-06-12 02:04:06","# \u003Cimg src=\"assets\u002Flogo.png\" width=\"85\" align=\"center\"> HarnessAudit: Auditing Agent Harness Safety\n\n[![Project](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject-Website-blue.svg)](https:\u002F\u002Fharnessaudit.github.io\u002F)\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPaper-arXiv-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2605.14271)\n[![Dataset](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDataset-Hugging%20Face-yellow.svg)](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FLCZZZZ\u002FHarnessAudit)\n[![Python](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPython-3.11%2B-blue.svg)](https:\u002F\u002Fwww.python.org\u002F)\n[![License](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-MIT-green.svg)](LICENSE)\n\n\n> HarnessAudit is an evaluation harness for auditing whether agent systems obey\n> tool, resource, and information-flow boundaries while still completing useful\n> tasks.\n\nHarnessAudit evaluates full 
execution trajectories rather than final answers\nonly. It supports multi-agent runs, single-agent control runs, stateful domain\nbanks, native CLI harnesses, and LLM-as-judge scoring for completion and action\nvalidity.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002Ftitle2_01.png\" alt=\"HarnessAudit overview\" width=\"92%\">\n\u003C\u002Fp>\n\n## 💡 Update\n\n- [ ] Add integrations for more multi-agent frameworks, e.g.,\n  [JiuwenClaw](https:\u002F\u002Fgithub.com\u002FopenJiuwen-ai\u002Fjiuwenclaw).\n- [ ] Add support for additional harnesses, e.g.,\n  [Nanobot](https:\u002F\u002Fgithub.com\u002FHKUDS\u002Fnanobot) and\n  [Hermes-Agent](https:\u002F\u002Fgithub.com\u002FNousResearch\u002Fhermes-agent).\n- [x] HarnessAudit framework released.\n- [x] HarnessAudit task release is available on [Hugging Face](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Feric-ai-lab\u002FHarnessAudit).\n\n---\n\nThis GitHub repository contains the runner, framework adapters, stateful mock\nservices, schemas, and reproducibility scripts. 
Local traces, result JSONs,\nSQLite bank snapshots, workspaces, private keys, and paper build artifacts are\nnot committed.\n\n---\n\n## 📊 What HarnessAudit Measures\n\nHarnessAudit reports paper-facing metrics from normalized traces:\n\n| Metric | Meaning |\n| --- | --- |\n| `sar_tool` | Safety adherence for out-of-scope tool use |\n| `sar_resource` | Safety adherence for protected-resource access |\n| `sar_flow` | Safety adherence for information-flow constraints |\n| `sar_avg` | Average Safety Adherence Rate across the three L1 channels |\n| `avs` | Action Validity Score for L2 execution fidelity |\n| `tcr` | Task Completion Rate from deterministic and LLM completion checks |\n\nThe trace schema records normalized tool calls, communications, access\ndecisions, completion scores, operational judgments, and run-level metadata.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002Fpipline.png\" width=\"92%\">\n\u003C\u002Fp>\n\n---\n\n## 🚀 Quick Start\n\nUse Python 3.11 or newer. A clean virtual environment is recommended.\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Feric-ai-lab\u002FHarnessAudit.git\ncd HarnessAudit\n\npython -m venv .venv\nsource .venv\u002Fbin\u002Factivate\npython -m pip install --upgrade pip\npython -m pip install -e \".[dev,oai]\"\n```\n\nIf you use ClawTeam-backed harnesses, install the vendored ClawTeam copy in the\nsame environment:\n\n```bash\npython -m pip install -e vendor\u002Fclawteam\n```\n\nThis repository temporarily vendors the working ClawTeam version under\n`vendor\u002Fclawteam\u002F` while the upstream changes are pending merge.\n\nInstall and authenticate any native harness CLIs you plan to evaluate:\n\n- OpenClaw for `HARNESS=openclaw`\n- Codex CLI for `HARNESS=codex`\n- Claude Code CLI for `HARNESS=claude`\n\nOpenClaw requires a recent Node.js runtime. 
If your system `node` is old, put a\nnewer Node binary earlier in `PATH` before running OpenClaw experiments.\n\n---\n\n## 🔐 Configure Secrets\n\nThe CLIs load `.env` from the repository root. Keep it local; it is ignored by\nGit.\n\n```bash\n# Used by judges, OAI\u002FADK adapters, and Codex API-key mode.\nOPENAI_API_KEY=...\nOPENAI_BASE_URL=...              # optional\n\n# Codex can use either OPENAI_API_KEY or local codex CLI login state.\nCODEX_AUTH_MODE=auto             # auto | api_key | cli_login\n\n# Claude Code can use either ANTHROPIC_API_KEY or local claude CLI login state.\nANTHROPIC_API_KEY=...\nANTHROPIC_BASE_URL=...           # optional\nCLAUDE_CODE_AUTH_MODE=auto       # auto | api_key | cli_login\n\n# OpenClaw model routing through OpenRouter.\nOPENROUTER_API_KEY=...\nOPENROUTER_BASE_URL=https:\u002F\u002Fopenrouter.ai\u002Fapi\u002Fv1\n```\n\nOptional debugging controls:\n\n```bash\nMASP_WORKSPACES_ROOT=\u002Ftmp\u002Fharnessaudit_workspaces\nMASP_KEEP_CLAUDE_ISOLATION=1\nMASP_KEEP_CODEX_ISOLATION=1\nMASP_KEEP_OPENCLAW_ISOLATION=1\n```\n\n---\n\n## 🤝 Reproduce Multi-Agent Results\n\nRun a single task first to validate the environment:\n\n```bash\npython -m multi_agent run multi_agent\u002Ftasks\u002Fdaily_life\u002Fwellness\u002Fdl-t4.yaml \\\n  --framework clawteam \\\n  --harness openclaw \\\n  --model gpt-5.4 \\\n  --judge-model gpt-5.4 \\\n  --judge-workers 4 \\\n  --trace-dir multi_agent\u002Ftraces \\\n  --output multi_agent\u002Fresults\n```\n\nRun the full multi-agent task suite:\n\n```bash\nHARNESS=openclaw \\\nMODEL=gpt-5.4 \\\nTASK_WORKERS=4 \\\nJUDGE_MODEL=gpt-5.4 \\\nJUDGE_WORKERS=4 \\\nbash multi_agent\u002Frun_ma.sh\n```\n\n`run_ma.sh` supports:\n\n```text\nHARNESS=openclaw|claude|codex|oai|adk\nFRAMEWORK=clawteam|oai|adk\nMODEL=\u003Charness model name>\nTASK_WORKERS=\u003Cconcurrent task runs>\nJUDGE_MODEL=gpt-5.4\nJUDGE_WORKERS=\u003Cconcurrent judge calls per task>\nSKIP_JUDGE=1\nSKIP_EXISTING=1\nTASK_FILE=\u003Cone 
yaml>\nTASK_LIST=\u003Cnewline-delimited yaml list>\nPYTHON_CMD=\"conda run -n harnessaudit python\"\nEXTRA_ARGS=\"...\"\n```\n\nWhen `HARNESS=openclaw`, the script first records the enabled plugin snapshot:\n\n```bash\nopenclaw plugins list --enabled --json > \u002Ftmp\u002Fopenclaw_plugins_enabled.json\n```\n\nArtifacts are written under harness\u002Fmodel-scoped directories, for example:\n\n```text\nmulti_agent\u002Ftraces\u002Fopenclaw\u002Fgpt-5.4\u002F*.jsonl\nmulti_agent\u002Fresults\u002Fopenclaw\u002Fgpt-5.4\u002F*.json\nmulti_agent\u002Fresults\u002Fopenclaw\u002Fgpt-5.4\u002F*.sqlite\n```\n\n---\n\n## 🧍 Reproduce Single-Agent Results\n\nRun a single control task:\n\n```bash\npython -m single_agent run single_agent\u002Ftasks\u002Ffinance\u002Fsa-fin-t1.yaml \\\n  --framework openclaw_local \\\n  --model gpt-5.4 \\\n  --judge-model gpt-5.4 \\\n  --judge-workers 4 \\\n  --trace-dir single_agent\u002Ftraces \\\n  --output single_agent\u002Fresults\n```\n\nRun the full single-agent task suite:\n\n```bash\nMODEL=gpt-5.4 \\\nTASK_WORKERS=1 \\\nJUDGE_MODEL=gpt-5.4 \\\nJUDGE_WORKERS=4 \\\nbash single_agent\u002Frun_sa.sh\n```\n\n`run_sa.sh` supports the same `TASK_WORKERS`, `JUDGE_MODEL`, `JUDGE_WORKERS`,\n`SKIP_JUDGE`, `SKIP_EXISTING`, `TASK_FILE`, `TASK_LIST`, `PYTHON_CMD`, and\n`EXTRA_ARGS` controls. 
Single-agent evaluation intentionally supports only the\n`openclaw_local` framework:\n\n```text\nFRAMEWORK=openclaw_local\nMODEL=\u003COpenClaw model name>\n```\n\n`run_sa.sh` writes `\u002Ftmp\u002Fopenclaw_plugins_enabled.json` before launching tasks.\n\n---\n\n## 📁 Output Files\n\nEach completed run writes:\n\n| File | Purpose |\n| --- | --- |\n| `*.jsonl` | Normalized append-only trajectory trace |\n| `*.json` | Run summary with SAR, AVS, TCR, violations, warnings, and errors |\n| `*.sqlite` | Per-run stateful bank snapshot for post-hoc inspection |\n| `_run_logs\u002F\u003Charness>\u002F\u003Cmodel>\u002F*.log` | Per-task stdout\u002Fstderr from full-suite scripts |\n\nLocal output directories are ignored by Git.\n\n---\n\n## 🧪 How the Evaluation Works\n\nHarnessAudit evaluates a harness from the full trajectory it produces, not from\nthe final response alone. A run has five stages:\n\n1. **Load a task and tool catalog.** Each task defines agent roles, a user goal,\n   domain tools, boundary rules, completion checkpoints, and optional\n   perturbation variants. The tool catalog defines the callable task tools and\n   which tools expose protected resources.\n2. **Instantiate run state.** Most domains receive an isolated SQLite-backed\n   mock service, called a domain bank. Tool calls mutate this bank, and the final\n   database snapshot is saved for auditing. SDE tasks are handled separately:\n   each run receives its own disposable git worktree of the target fixture\n   repository, so code edits, tests, and file-system changes are isolated from\n   the source fixture.\n3. **Run the target harness.** The selected harness, for example OpenClaw,\n   Claude Code, or Codex, executes the task through the configured framework.\n   HarnessAudit records normalized observable actions: tool calls,\n   communications, tool arguments, tool results, final output, and harness\n   metadata.\n4. 
**Score the trajectory.**\n   - **L1 Boundary Compliance** applies deterministic access rules to every\n     normalized action. It reports `sar_tool`, `sar_resource`, `sar_flow`, and\n     `sar_avg`.\n   - **L2 Execution Fidelity** reports `avs` and `tcr`. `avs` is a post-hoc\n     operational judge over tool-path and scoped-resource behavior. `tcr` is the\n     original completion score: deterministic rule checkpoints keep their YAML\n     weights, and LLM completion checkpoints are pooled into one\n     trajectory-level judge.\n   - **L3 Perturbation Stability** reruns selected tasks under perturbations\n     such as indirect injection, ambiguous goals, and robustness failures, then\n     scores whether the harness preserves safe and useful behavior.\n5. **Write reproducibility artifacts.** Each run writes a JSONL trace, a JSON\n   report, and, when applicable, a SQLite bank snapshot. SDE reports also record\n   the per-run workspace path so code diffs and test-side effects can be audited.\n   These artifacts are sufficient to inspect violations, recompute metrics,\n   debug task completion, and audit tool-side state transitions.\n\n---\n\n## 🧮 Metric Calculation and Re-scoring\n\nHarnessAudit writes per-run metrics into each result JSON during\n`python -m multi_agent run`. The main metric sources are:\n\n| Metric | How it is computed | Implementation |\n| --- | --- | --- |\n| `sar_tool`, `sar_resource`, `sar_flow`, `sar_avg` | Deterministic rule matching over normalized actions. Tool violations, protected-resource violations, and communication\u002Fdata-leak violations are counted from access decisions. | `multi_agent\u002Fchecker.py`, `multi_agent\u002Fschemas\u002Ftrace.py` |\n| Aggregate SAR tables | Post-hoc aggregation over existing `multi_agent\u002Ftraces\u002F` and paired `multi_agent\u002Fresults\u002F`. This helper recomputes channel-level SAR series for trace groups. 
| `multi_agent\u002Fsar_calculate.py` |\n| `avs` | LLM-as-judge score over each role's actual tool path, compared against task `ground_truth_tool_paths` and resource-scope constraints from access rules. Role scores are averaged. | `multi_agent\u002Foperational_judge.py` |\n| `tcr` | Weighted task-completion score. Rule checkpoints are evaluated deterministically; LLM checkpoints are pooled into one trajectory-level completion judge. | `multi_agent\u002Fcompletion_judge.py` |\n| L3 perturbation score | Perturbed runs are scored for delivery, hard safety caps, perturbation-specific rubrics, and optional LLM stability judgment. | `multi_agent\u002Fperturbation_eval.py` |\n\nTo inspect aggregate SAR and the paired AVS\u002FTCR means for a completed\nharness\u002Fmodel group:\n\n```bash\npython multi_agent\u002Fsar_calculate.py --group openclaw\u002Fgpt-5.4\n```\n\nThe helper reads from the default artifact roots:\n\n```text\nmulti_agent\u002Ftraces\u002F\u003Charness>\u002F\u003Cmodel>\u002F*.jsonl\nmulti_agent\u002Fresults\u002F\u003Charness>\u002F\u003Cmodel>\u002F*.json\n```\n\nThe public CLI computes AVS\u002FTCR as part of a run and does not overwrite old\nresult JSONs in place. 
To refresh judge-dependent scores, run the task again\nwith the desired judge model and write to a clean output directory:\n\n```bash\npython -m multi_agent run multi_agent\u002Ftasks\u002Fdaily_life\u002Fwellness\u002Fdl-t4.yaml \\\n  --framework clawteam \\\n  --harness openclaw \\\n  --model gpt-5.4 \\\n  --judge-model gpt-5.4 \\\n  --judge-workers 4 \\\n  --trace-dir \u002Ftmp\u002Fharnessaudit_rejudge\u002Ftraces \\\n  --output \u002Ftmp\u002Fharnessaudit_rejudge\u002Fresults\n```\n\nFor a full-suite refresh, use the sweep script with a clean output location or\nwith `SKIP_EXISTING=0`:\n\n```bash\nHARNESS=openclaw \\\nMODEL=gpt-5.4 \\\nJUDGE_MODEL=gpt-5.4 \\\nTASK_WORKERS=4 \\\nJUDGE_WORKERS=4 \\\nOUTPUT_DIR=\u002Ftmp\u002Fharnessaudit_rejudge\u002Fresults \\\nTRACE_DIR=\u002Ftmp\u002Fharnessaudit_rejudge\u002Ftraces \\\nbash multi_agent\u002Frun_ma.sh\n```\n\nUse `--skip-judge` or `SKIP_JUDGE=1` only for cheap smoke tests. In that mode,\ndeterministic completion checks still run, but AVS is not evaluated and LLM\ncompletion checkpoints contribute no score.\n\nTo run one perturbation variant:\n\n```bash\npython -m multi_agent run multi_agent\u002Ftasks\u002Fdaily_life\u002Fdining\u002Fdl-t11.yaml \\\n  --framework clawteam \\\n  --harness openclaw \\\n  --model gpt-5.4 \\\n  --judge-model gpt-5.4 \\\n  --perturbation-id dl-t11-inj-1 \\\n  --trace-dir multi_agent\u002Ftraces\u002Fperturbations \\\n  --output multi_agent\u002Fresults\u002Fperturbations\n```\n\nPerturbation result JSONs include a `perturbation` object with fields such as\n`attack_type`, `delivered`, `stable`, `stability_score`, and `pb`. 
Aggregate L3\nnumbers are computed by grouping those reports by `attack_type` and averaging\nthe non-null `stability_score` \u002F `pb` values.\n\n## 🗂️ Repository Layout\n\n```text\n.\n├── multi_agent\u002F             # Multi-agent runner, tasks, banks, adapters, traces\n│   ├── run_ma.sh            # Full multi-agent sweep script\n│   ├── banks\u002F               # Stateful SQLite-backed domain services\n│   ├── frameworks\u002F          # ClawTeam, OpenClaw, Claude Code, Codex, OAI, ADK\n│   ├── schemas\u002F             # Task, action, access-rule, and trace schemas\n│   ├── tasks\u002F               # Materialized task YAMLs when using the HF release\n│   └── tools\u002F               # Domain tool catalogs\n├── single_agent\u002F            # Single-agent control runner and task format\n│   ├── run_sa.sh            # Full single-agent sweep script\n│   ├── banks\u002F               # Single-agent fixture-aware bank factory\n│   ├── tasks\u002F               # Materialized task YAMLs when using the HF release\n│   └── tools\u002F               # Single-agent tool catalogs\n├── fixtures\u002F                # SDE workspace fixtures used by task runs\n├── pyproject.toml           # Package metadata and CLI entry points\n└── README.md\n```\n\n---\n\n## 📚 Citation\n\nIf you use HarnessAudit in research, please cite:\n\n```bibtex\n@misc{liu2026auditingagentharnesssafety,\n      title={Auditing Agent Harness Safety}, \n      author={Chengzhi Liu and Yichen Guo and Yepeng Liu and Yuzhe Yang and Qianqi Yan and Xuandong Zhao and Wenyue Hua and Sheng Liu and Sharon Li and Yuheng Bu and Xin Eric Wang},\n      year={2026},\n      eprint={2605.14271},\n      archivePrefix={arXiv},\n      primaryClass={cs.CL},\n      url={https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.14271}, \n}\n```\n\n##  💬 Acknowledgments\nWe thank the contributors of the open-source project [ClawTeam](https:\u002F\u002Fgithub.com\u002FHKUDS\u002FClawTeam).\n","HarnessAudit 
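is a safety evaluation framework for auditing whether agent systems obey tool, resource, and information-flow boundaries. It does this by analyzing full execution trajectories rather than final answers alone, and it supports multi-agent runs, single-agent control runs, stateful domain banks, native CLI harnesses, and LLM-as-judge scoring for task completion and action validity. The project is written in Python and ships with detailed documentation and a quick-start guide. It suits scenarios that need to ensure AI agents follow safety rules while executing tasks, such as validating the safety of multi-agent systems in development and test environments.","2026-06-11 04:01:57","CREATED_QUERY"]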