[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-84144":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":13,"stars7d":17,"stars30d":17,"stars90d":16,"forks30d":16,"starsTrendScore":18,"compositeScore":19,"rankGlobal":10,"rankLanguage":10,"license":20,"archived":21,"fork":21,"defaultBranch":22,"hasWiki":23,"hasPages":23,"topics":24,"createdAt":10,"pushedAt":10,"updatedAt":32,"readmeContent":33,"aiSummary":10,"trendingCount":16,"starSnapshotCount":16,"syncStatus":15,"lastSyncTime":34,"discoverSource":35},84144,"PawBench","agentscope-ai\u002FPawBench","agentscope-ai","A benchmark for evaluating LLM × harness performance.","https:\u002F\u002Fagentscope-ai.github.io\u002FPawBench\u002F",null,"Python",56,3,52,2,0,4,10,1.81,"Apache License 2.0",false,"main",true,[25,26,27,28,29,30,31],"agent","benchmark","harness","hermes","llm","openclaw","qwenpaw","2026-06-12 02:04:38","\u003Ch1 align=\"center\">🐾 PawBench\u003C\u002Fh1>\n\n\u003Cp align=\"center\">\n  \u003Ca href=\"README.md\">\u003Cstrong>English\u003C\u002Fstrong>\u003C\u002Fa> ·\n  \u003Ca href=\"README.zh-CN.md\">简体中文\u003C\u002Fa>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  \u003Ca href=\"#tasks\">\n    \u003Cimg alt=\"tasks\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Ftasks-150-2ea44f\">\n  \u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fagentscope-ai.github.io\u002FPawBench\u002F\">\n    \u003Cimg alt=\"models\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fmodels-9-0969da\">\n  \u003C\u002Fa>\n  \u003Ca href=\"#harnesses\">\n    \u003Cimg alt=\"harnesses\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fharnesses-3-8250df\">\n  \u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fagentscope-ai.github.io\u002FPawBench\u002F\">\n    \u003Cimg alt=\"leaderboard\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fleaderboard-live-cf222e\">\n  \u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fagentscope-ai\u002FOpenJudge\">\n    \u003Cimg alt=\"OpenJudge Ecosystem\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fecosystem-OpenJudge-blue?logo=github&color=0969da\">\n  \u003C\u002Fa>\n  \u003Ca href=\"LICENSE\">\n    \u003Cimg alt=\"license\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Flicense-Apache%202.0-blue\">\n  \u003C\u002Fa>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  \u003Cstrong>A Model × Harness co-evaluation benchmark for agentic AI.\u003C\u002Fstrong>\u003Cbr>\n  150 agent tasks · 9 models · 3 harnesses · task slices · diagnostic traces\n\u003C\u002Fp>\n\n---\n\nThe same model can behave very differently once it is placed inside a real agent runtime. A failure may come from model reasoning, missing tools, weak skill discovery, poor workspace awareness, brittle web access, or a completion check that is too loose. A single final pass rate cannot separate these causes.\n\nPawBench is built around one claim:\n\n$$\\text{Agent Performance} = f(\\text{Model}, \\text{Harness})$$\n\n> [!NOTE]\n> PawBench is part of the [OpenJudge](https:\u002F\u002Fgithub.com\u002Fagentscope-ai\u002FOpenJudge) ecosystem. It shares OpenJudge's philosophy of evaluation-driven optimization, but focuses specifically on the interaction between LLMs and agent harnesses.\n\nIt evaluates **the model and the harness together** while keeping enough metadata to read both dimensions independently. v1.0 covers **9 models × 3 harnesses × 150 tasks**, with public prompts, graders, task labels, submissions, and leaderboard slices.\n\n![PawBench overview and taxonomy](site\u002Fpublic\u002Fpawbench-overview-taxonomy.png)\n\nWith PawBench, you can:\n\n- **Select models & harnesses** for text, multimodal, skill-heavy, and web-search workloads.\n- **Diagnose** whether a regression comes from the model, the harness, or the grader.\n- **Iterate** on a harness change, rerun the same task slice, and check whether the targeted score actually moves.\n- **Contribute** new harnesses, tasks, graders, submissions, and bug fixes back into a shared evaluation loop.\n\n## Core Findings\n\nThe initial PawBench v1.0 runs show that harness design is not a minor implementation detail. It can change the realized capability of the same model by a margin comparable to many model upgrades.\n\nUnless otherwise noted, the numbers below come from this evaluation setting: **150 PawBench v1.0 tasks**, **9 models**, **3 harnesses** (`qwenpaw`, `openclaw`, `hermes`), and **claude opus 4.6 as judge**. Scores are reported as overall percentages.\n\n![Harness gap analysis](site\u002Fpublic\u002Fpawbench-harness-gap.png)\n\n- **Harness gaps are visible even when the model is fixed.** With the same `qwen3.6-35b-a3b` model on the same 150 tasks, QwenPaw scores **68.3**, OpenClaw **68.2**, and Hermes **56.7**, leaving an **11.5-point** spread. This is not isolated to one model: `qwen3.6-max-preview` has a **10.3-point** harness spread, `glm-5.1` has a **9.9-point** spread, and six of the nine tested models move by more than three points across harnesses.\n- **Average performance differs across harnesses.** Averaged across the 27 model × harness submissions in this run, QwenPaw scores **74.9**, OpenClaw **72.9**, and Hermes **69.3**. The overall leaderboard is only the first view; slice analysis is what shows which harness is brittle on which capability, source, scenario, or modality.\n\n![Slice diagnostics](site\u002Fpublic\u002Fpawbench-slice-diagnostics.png)\n\nSlice numbers below are macro-averages across the same 27 model × harness submissions. They point to several high-value improvement areas:\n\n- **Skill-heavy tasks are the hardest.** `Skill_Use` averages **47.2**, and `skillsbench` tasks average **40.9**, suggesting that skill discovery, skill loading, and procedural execution are still fragile.\n- **Multimodal tasks remain harder than text.** Text-only tasks average **74.1**, while multimodal tasks average **64.0**.\n- **Open environments add real friction.** Closed, reproducible tasks average **72.9**; open-environment tasks average **68.9**.\n- **Some domains expose much larger harness differences than the overall score.** Finance, information retrieval, manufacturing quality control, and software-engineering slices are useful targets for harness debugging.\n\nSee the [live leaderboard](https:\u002F\u002Fagentscope-ai.github.io\u002FPawBench\u002F) for the full Model × Harness matrix and all slice views.\n\n## Evaluation Workflows\n\nPawBench is intended to be used as a diagnostic benchmark, not just a ranking table.\n\n| Goal | Recommended setup | What to inspect |\n| :--- | :--- | :--- |\n| Choose a model | Fix one harness, run multiple models | Overall score, text\u002Fmultimodal split, cost and trace quality |\n| Choose a harness | Fix one model, run multiple harnesses | Harness gap, task errors, tool-use traces, workspace artifacts |\n| Debug a harness | Rerun targeted slices after a change | Capability\u002Fsource\u002Fscenario deltas, failed graders, transcripts |\n| Add a dataset | Add tasks with the five-label taxonomy | Coverage balance, grader reliability, task detail page |\n| Submit results | Aggregate run logs into `submissions\u002F*.json` | Leaderboard row, slice payloads, task error count |\n\n> **💡 Optimize Your Evaluation Logic with OpenJudge**\n> To build your own evaluation system beyond the LLM × Harness vertical, you can leverage **[OpenJudge](https:\u002F\u002Fgithub.com\u002Fagentscope-ai\u002FOpenJudge)**'s 50+ production-ready graders (relevance, tool selection, trajectory, etc.) to evaluate and optimize your custom agents.\n\n## Quick Start\n\n### Requirements\n\nPython 3.11+ and Docker are required. Node.js 20+ is only needed for the leaderboard site.\n\nInstall dependencies and add credentials. DashScope is the recommended provider for the default setup:\n\n```bash\npip install -r requirements.txt\n\ncat > .env \u003C\u003C'EOF'\nDASHSCOPE_API_KEY=...\nJUDGE_API_KEY=...\nJUDGE_BASE_URL=...\nEOF\n```\n\nFor OpenAI-compatible or custom providers, set `OPENAI_API_KEY` \u002F `OPENAI_BASE_URL` or `CUSTOM_API_KEY` \u002F `CUSTOM_BASE_URL` as needed.\n\n### Run Evaluation\n\nBefore the first run, build the default Docker harness image:\n\n```bash\ndocker build -f docker\u002FDockerfile.pawbench-qwenpaw -t qwenclawbench-qwenpaw:latest .\n```\n\n```bash\n# Smoke test: run one PawBench v1.0 task with the default qwenpaw harness\npython run_bench.py --tasks T053 --model dashscope\u002Fqwen3.6-plus\n\n# Pick a different harness\npython run_bench.py --agents openclaw --tasks T053 --model dashscope\u002Fqwen3.6-plus\n\n# Compare harnesses on a task subset\npython run_bench.py \\\n  --agents qwenpaw openclaw hermes \\\n  --model dashscope\u002Fqwen3.6-plus \\\n  --tasks T002 T006\n\n# Sequentially evaluate multiple models\npython run_bench.py \\\n  --model dashscope\u002Fqwen3.6-plus \\\n  --model anthropic\u002Fclaude-sonnet-4-6\n```\n\nSee `python run_bench.py --help` for all flags, including `--no-results-version-path`, `--save-workspace`, and `--save-docker-image`.\n\n### View the Leaderboard\n\nThe website exposes the Model × Harness matrix, sortable leaderboard, slice analyzer, task library, and per-task pages.\n\n```bash\ncd site\nnpm install\nnpm run build:data    # aggregate raw run logs into submissions\u002F and JSON for the UI\nnpm run dev           # http:\u002F\u002Flocalhost:4321\u002FPawBench\u002F\n```\n\nFor submission formats and site data generation details, see [site\u002FREADME.md](site\u002FREADME.md).\n\n## PawBench Design\n\n### Tasks\n\nPawBench follows a **Reuse & Tag** methodology. Instead of writing every task from scratch, it pulls tasks from established agent benchmark suites, normalizes them into one format, and tags each task across five orthogonal dimensions.\n\n| Dimension | Field | Values |\n| :--- | :--- | :--- |\n| Scenario | `scenario` | L1 categories such as `Office_Productivity`, `Software_Engineering`, `Safety_Alignment` |\n| Capability | `capabilities` | `Logic_Reasoning`, `Math_Computation`, `Code_Manipulation`, `Tool_Use`, `Skill_Use`, `Planning`, `Self_Verification` |\n| Complexity | `complexity` | `L1` (1-2 steps), `L2` (3-5 steps), `L3` (>5 steps with branches or backtracking) |\n| Modality | `modality` | `text` or `multimodal` (`image`, `audio`, `video`) |\n| Environment | `environment` | `closed` (offline, reproducible) or `open` (live internet \u002F SaaS APIs) |\n\nv1.0 contains **150 tasks** from `claweval`, `qwenclawbench`, `pinchbench`, PawBench self-built tasks, `skillsbench`, and `wildclawbench`.\n\n| Source                                                           | # | Main coverage |\n|:-----------------------------------------------------------------| ---: | :--- |\n| `self-built`                                                     | 21 | Self-built tasks covering automation, information retrieval, and safety alignment |\n| [`claweval`](https:\u002F\u002Fgithub.com\u002Fclaw-eval\u002Fclaw-eval)             | 52 | Office productivity, data analytics, content creation |\n| [`qwenclawbench`](https:\u002F\u002Fgithub.com\u002FSKYLENAGE-AI\u002FQwenClawBench) | 29 | Automation, software engineering, safety alignment |\n| [`pinchbench`](https:\u002F\u002Fgithub.com\u002Fpinchbench\u002Fskill)              | 23 | Office workflows, software engineering, information retrieval |\n| [`skillsbench`](https:\u002F\u002Fgithub.com\u002Fbenchflow-ai\u002Fskillsbench)     | 15 | Long-horizon skills, domain automation |\n| [`wildclawbench`](https:\u002F\u002Fgithub.com\u002FInternLM\u002FWildClawBench)     | 10 | Office workflows, safety alignment |\n\nEach task page on the site shows its prompt, expected behavior, grading criteria, automated checker code, LLM judge rubric, workspace files, and metadata.\n\n### Harnesses\n\n| Harness | Link | Current role |\n| :--- | :--- | :--- |\n| QwenPaw | [agentscope-ai\u002FQwenPaw](https:\u002F\u002Fgithub.com\u002Fagentscope-ai\u002FQwenPaw) | Default PawBench harness and primary baseline |\n| OpenClaw | [openclaw\u002Fopenclaw](https:\u002F\u002Fgithub.com\u002Fopenclaw\u002Fopenclaw) | General-purpose open agent runtime |\n| Hermes | [NousResearch\u002Fhermes-agent](https:\u002F\u002Fgithub.com\u002FNousResearch\u002Fhermes-agent) | Alternative community agent harness |\n\nHarnesses are treated as first-class benchmark subjects. A harness contribution should preserve the same task prompt, workspace contract, timeout behavior, transcript format, and result schema so model and harness effects remain comparable.\n\n### Grading\n\nEach task declares one of three grading modes:\n\n- `automated`: task-specific checks and assertions.\n- `llm_judge`: LLM-as-judge for semantic outputs.\n- `hybrid`: automated checks plus LLM judgment.\n\nRuns can be sliced by source, scenario, capability, complexity, modality, environment, grading type, model, and harness. PawBench also stores transcripts and metrics for each task. With `--save-workspace` and `--save-docker-image`, it can preserve the agent workspace and final Docker image for deeper replay.\n\n## Roadmap\n\n- [ ] **Harness coverage:** add Claude Code, Cursor Agent, CoPaw, and more community scaffolds.\n- [ ] **Dataset expansion:** add more open-environment, multimodal, skill-heavy, long-horizon, and real-world SaaS\u002FAPI tasks.\n- [ ] **Controlled studies:** turn the current findings into experiments around tool count, workspace awareness, skill discovery, web tools, and artifact-level completion checks.\n- [ ] **Diagnostics:** improve trace replay, workspace diffs, failure attribution, and slice-level regression reports.\n- [ ] **Evaluation reliability:** calibrate LLM judge prompts, strengthen automated graders, and document known failure modes.\n\n## Contributing\n\nWe welcome contributions that make PawBench a better shared testbed for Model × Harness evaluation.\n\n| Contribution | What to add |\n| :--- | :--- |\n| New harness | Agent adapter, Dockerfile if needed, environment setup, transcript capture, result normalization |\n| New tasks | Task markdown, workspace assets, five-label taxonomy, automated checks and\u002For LLM judge rubric |\n| New results | Raw run logs or `submissions\u002F*.json` with overall and slice scores |\n| Grader fixes | More deterministic checks, clearer rubrics, bug fixes for false positives\u002Ffalse negatives |\n| Site improvements | Better leaderboard views, slice analysis, task explorer, trace replay, and documentation |\n\nGood first contributions include adding missing task labels, improving task rubrics, reproducing a failed slice, integrating a new harness behind `--agents`, or submitting evaluation results for an untested model × harness pair.\n\n## Citation\n\nIf you use PawBench in your research or project, please cite it as:\n\n```bibtex\n@misc{pawbench,\n  title  = {PawBench: A benchmark for evaluating LLM × harness performance},\n  author = {The OpenJudge Team},\n  url    = {https:\u002F\u002Fgithub.com\u002Fagentscope-ai\u002FPawBench},\n  month  = {06},\n  year   = {2026}\n}\n```\n\n## Acknowledgments\n\nPawBench is built on top of the open-source agent evaluation community, including [Claw-Eval](https:\u002F\u002Fgithub.com\u002Fclaw-eval\u002Fclaw-eval), [QwenClawBench](https:\u002F\u002Fgithub.com\u002FSKYLENAGE-AI\u002FQwenClawBench), [WildClawBench](https:\u002F\u002Fgithub.com\u002FInternLM\u002FWildClawBench), [PinchBench](https:\u002F\u002Fgithub.com\u002Fpinchbench\u002Fskill), [skillsbench](https:\u002F\u002Fgithub.com\u002Fbenchflow-ai\u002Fskillsbench), and others.\n","2026-06-11 04:12:24","CREATED_QUERY"]