[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-74286":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":23,"hasPages":23,"topics":25,"createdAt":10,"pushedAt":10,"updatedAt":26,"readmeContent":27,"aiSummary":28,"trendingCount":16,"starSnapshotCount":16,"syncStatus":29,"lastSyncTime":30,"discoverSource":31},74286,"skill","pinchbench\u002Fskill","pinchbench","PinchBench is a benchmarking system for evaluating LLM models as OpenClaw coding agents. Made with 🦀 by the humans at https:\u002F\u002Fkilo.ai","https:\u002F\u002Fpinchbench.com",null,"Python",1226,138,10,18,0,3,16,89,9,19.43,"MIT License",false,"main",[],"2026-06-12 02:03:25","# 🦀 PinchBench\n\n**Real-world benchmarks for AI coding agents**\n\n[![Leaderboard](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fleaderboard-pinchbench.com-blue)](https:\u002F\u002Fpinchbench.com)\n[![License](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Flicense-MIT-green)](LICENSE)\n\u003C!-- task-count-badge -->![Tasks](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Ftasks-53-orange)\u003C!-- \u002Ftask-count-badge -->\n\n> **Note:** This repository contains the benchmark skill\u002Ftasks. It is NOT the source of official leaderboard results. To add models to the official results, modify [pinchbench\u002Fscripts\u002Fdefault-models.yml](https:\u002F\u002Fgithub.com\u002Fpinchbench\u002Fscripts\u002Fblob\u002Fmain\u002Fdefault-models.yml).\n\nPinchBench measures how well LLM models perform as the brain of an [OpenClaw](https:\u002F\u002Fgithub.com\u002Fopenclaw\u002Fopenclaw) agent. 
Instead of synthetic tests, we throw real tasks at agents: scheduling meetings, writing code, triaging email, researching topics, and managing files.\n\nResults are collected on a public leaderboard at **[pinchbench.com](https:\u002F\u002Fpinchbench.com)**.\n\n![PinchBench](pinchbench.png)\n\n## Why PinchBench?\n\nMost LLM benchmarks test isolated capabilities. PinchBench tests what actually matters for coding agents:\n\n- **Tool usage** — Can the model call the right tools with the right parameters?\n- **Multi-step reasoning** — Can it chain together actions to complete complex tasks?\n- **Real-world messiness** — Can it handle ambiguous instructions and incomplete information?\n- **Practical outcomes** — Did it actually create the file, send the email, or schedule the meeting?\n\n## Quick Start\n\n```bash\n# Clone the skill\ngit clone https:\u002F\u002Fgithub.com\u002Fpinchbench\u002Fskill.git\ncd skill\n\n# Run benchmarks with your model of choice\n.\u002Fscripts\u002Frun.sh --model openrouter\u002Fanthropic\u002Fclaude-sonnet-4\n\n# Or run specific tasks\n.\u002Fscripts\u002Frun.sh --model openrouter\u002Fopenai\u002Fgpt-4o --suite task_calendar,task_stock\n```\n\n> **Note:** Model IDs must include their provider prefix (e.g. `openrouter\u002F`, `anthropic\u002F`). 
[OpenRouter](https:\u002F\u002Fopenrouter.ai) is the default provider used for routing.\n\n**Requirements:**\n\n- Python 3.10+\n- [uv](https:\u002F\u002Fdocs.astral.sh\u002Fuv\u002F) package manager\n- A running OpenClaw instance\n\n## What Gets Tested\n\n\u003C!-- task-count-text -->PinchBench includes 53 tasks across real-world categories:\u003C!-- \u002Ftask-count-text -->\n\n| Category         | Tasks                                   | What's tested                            |\n| ---------------- | --------------------------------------- | ---------------------------------------- |\n| **Productivity** | Calendar, daily summaries               | Event creation, time parsing, scheduling |\n| **Research**     | Stock prices, conferences, markets      | Web search, data extraction, synthesis   |\n| **Writing**      | Blog posts, emails, humanization        | Content generation, tone, formatting     |\n| **Coding**       | Weather scripts, file structures        | Code generation, file operations         |\n| **Analysis**     | Spreadsheets, PDFs, documents           | Data processing, summarization           |\n| **Email**        | Triage, search                          | Inbox management, filtering              |\n| **Memory**       | Context retrieval, knowledge management | Long-term memory, recall                 |\n| **Skills**       | ClawHub, skill discovery                | OpenClaw ecosystem integration           |\n\nEach task is graded automatically, by an LLM judge, or both — ensuring both objective and nuanced evaluation.\n\n## Submitting Results\n\nTo get your results on the leaderboard:\n\n```bash\n# Register for an API token (one-time)\n.\u002Fscripts\u002Frun.sh --register\n\n# Run benchmark — results auto-upload with your token\n.\u002Fscripts\u002Frun.sh --model openrouter\u002Fanthropic\u002Fclaude-sonnet-4\n```\n\nSkip uploading with `--no-upload` if you just want local results.\n\n### Official Results\n\nTo submit an official run (marked on 
the leaderboard):\n\n```bash\n# Using environment variable\nexport PINCHBENCH_OFFICIAL_KEY=your_official_key\n.\u002Fscripts\u002Frun.sh --model anthropic\u002Fclaude-sonnet-4\n\n# Using command line flag\n.\u002Fscripts\u002Frun.sh --model anthropic\u002Fclaude-sonnet-4 --official-key your_official_key\n```\n\n## Command Reference\n\n| Flag                     | Description                                                                   |\n| ------------------------ | ----------------------------------------------------------------------------- |\n| `--model MODEL`          | Model to test (e.g., `openrouter\u002Fanthropic\u002Fclaude-sonnet-4`)                  |\n| `--judge MODEL`          | Judge model for LLM grading; uses direct API when set (see below)             |\n| `--suite SUITE`          | `all`, `automated-only`, or comma-separated task IDs                          |\n| `--runs N`               | Number of runs per task for averaging                                         |\n| `--timeout-multiplier N` | Scale timeouts for slower models                                              |\n| `--thinking LEVEL`       | Reasoning depth: `off`, `minimal`, `low`, `medium`, `high`, `xhigh`, `adaptive` |\n| `--output-dir DIR`       | Where to save results (default: `results\u002F`)                                   |\n| `--no-upload`            | Skip uploading to leaderboard                                                 |\n| `--register`             | Request an API token for submissions                                          |\n| `--upload FILE`          | Upload a previous results JSON                                                |\n| `--official-key KEY`     | Mark submission as official (or use `PINCHBENCH_OFFICIAL_KEY` env var)        |\n\n### Judge\n\nBy default (no `--judge` flag), the LLM judge runs as an OpenClaw agent session. 
When `--judge` is specified, it calls the model API directly instead, bypassing OpenClaw personality injection.\n\n```bash\n# Default: OpenClaw agent session (no --judge needed)\n.\u002Fscripts\u002Frun.sh --model openrouter\u002Fanthropic\u002Fclaude-sonnet-4\n\n# Direct API via OpenRouter\n.\u002Fscripts\u002Frun.sh --model openai\u002Fgpt-4o --judge openrouter\u002Fanthropic\u002Fclaude-sonnet-4-5\n\n# Direct API via Kilo Gateway\n.\u002Fscripts\u002Frun.sh --model openai\u002Fgpt-4o --judge kilo\u002Fanthropic\u002Fclaude-sonnet-4-5\n\n# Direct API via Anthropic\n.\u002Fscripts\u002Frun.sh --model openai\u002Fgpt-4o --judge anthropic\u002Fclaude-sonnet-4-5-20250514\n\n# Direct API via OpenAI\n.\u002Fscripts\u002Frun.sh --model openai\u002Fgpt-4o --judge openai\u002Fgpt-4o\n\n# Headless Claude CLI\n.\u002Fscripts\u002Frun.sh --model openai\u002Fgpt-4o --judge claude\n```\n\nRequired env vars: `OPENROUTER_API_KEY`, `KILO_API_KEY`, `ANTHROPIC_API_KEY`, or `OPENAI_API_KEY` depending on the judge model prefix.\n\n## Contributing Tasks\n\nWe welcome new tasks! Check out [`tasks\u002FTASK_TEMPLATE.md`](tasks\u002FTASK_TEMPLATE.md) for the format. Good tasks are:\n\n- **Real-world** — Something an actual user would ask an agent to do\n- **Measurable** — Clear success criteria that can be graded\n- **Reproducible** — Same task should produce consistent grading\n- **Challenging** — Tests agent capabilities, not just LLM knowledge\n\n### Transcript Archive\n\nSession transcripts are automatically saved to `results\u002F{run_id}_transcripts\u002F` alongside the results JSON. Each task's full agent conversation is preserved as a JSONL file (e.g. 
`task_calendar.jsonl`) for post-run analysis.\n\n## Links\n\n- **Leaderboard:** [pinchbench.com](https:\u002F\u002Fpinchbench.com)\n- **OpenClaw:** [github.com\u002Fopenclaw\u002Fopenclaw](https:\u002F\u002Fgithub.com\u002Fopenclaw\u002Fopenclaw)\n- **Issues:** [github.com\u002Fpinchbench\u002Fskill\u002Fissues](https:\u002F\u002Fgithub.com\u002Fpinchbench\u002Fskill\u002Fissues)\n\n## Star History\n\n\u003Ca href=\"https:\u002F\u002Fwww.star-history.com\u002F?repos=pinchbench%2Fskill&type=date&logscale=&legend=top-left\">\n \u003Cpicture>\n \u003Csource media=\"(prefers-color-scheme: dark)\" srcset=\"https:\u002F\u002Fapi.star-history.com\u002Fchart?repos=pinchbench\u002Fskill&type=date&theme=dark&legend=top-left\" \u002F>\n \u003Csource media=\"(prefers-color-scheme: light)\" srcset=\"https:\u002F\u002Fapi.star-history.com\u002Fchart?repos=pinchbench\u002Fskill&type=date&legend=top-left\" \u002F>\n \u003Cimg alt=\"Star History Chart\" src=\"https:\u002F\u002Fapi.star-history.com\u002Fchart?repos=pinchbench\u002Fskill&type=date&legend=top-left\" \u002F>\n \u003C\u002Fpicture>\n\u003C\u002Fa>\n\n## License\n\nMIT — see [LICENSE](LICENSE) for details.\n\n---\n\n_Claw-some AI agent testing_ 🦞\n","PinchBench is a benchmarking system for evaluating how well large language models perform as OpenClaw coding agents. It uses real tasks to measure a model's ability in tool usage, multi-step reasoning, handling real-world messiness, and achieving practical outcomes, covering 53 specific tasks such as scheduling, writing code, triaging email, researching topics, and managing files. The system is written in Python, supports multiple models, and requires a running OpenClaw instance. It is suited to researchers and developers who want to understand how different AI coding assistants perform in real-world use.",2,"2026-06-11 03:49:49","high_star"]