[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-11185":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":25,"hasPages":25,"topics":26,"createdAt":10,"pushedAt":10,"updatedAt":38,"readmeContent":39,"aiSummary":40,"trendingCount":16,"starSnapshotCount":16,"syncStatus":41,"lastSyncTime":42,"discoverSource":43},11185,"agent-skills-eval","darkrishabh\u002Fagent-skills-eval","darkrishabh","A test runner for agentskills.io-style AI agent skills","https:\u002F\u002Fdarkrishabh.github.io\u002Fagent-skills-eval\u002F",null,"TypeScript",581,30,3,1,0,16,19,137,48,87.97,"MIT License",false,"main",true,[27,28,29,30,31,32,33,34,35,36,37],"agent-evals","agent-skills","agentskills","ai-agents","cli","jsonl","llm-evals","llm-evaluation","openai-compatible","typescript","yaml","2026-06-12 04:00:54","\u003Cdiv align=\"center\">\n\n\u003Cimg src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F094b8e11-e19e-4c96-ae82-ba701cfcf7e3\" alt=\"agent-skills-eval — a test runner for Agent Skills\" width=\"100%\" \u002F>\n\n\u003Cbr \u002F>\n\n# agent-skills-eval\n\n**A test runner for [Agent Skills](https:\u002F\u002Fagentskills.io).**\n\nWrite a `SKILL.md`, drop in some evals, and find out — empirically — whether your skill actually makes the model better at the task.\n\n[![npm 
version](https:\u002F\u002Fimg.shields.io\u002Fnpm\u002Fv\u002Fagent-skills-eval.svg?style=flat-square&logo=npm&label=npm)](https:\u002F\u002Fwww.npmjs.com\u002Fpackage\u002Fagent-skills-eval)\n[![CI](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Factions\u002Fworkflow\u002Fstatus\u002Fdarkrishabh\u002Fagent-skills-eval\u002Fci.yml?style=flat-square&logo=github&label=ci)](https:\u002F\u002Fgithub.com\u002Fdarkrishabh\u002Fagent-skills-eval\u002Factions\u002Fworkflows\u002Fci.yml)\n[![license: MIT](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Flicense-MIT-green?style=flat-square)](LICENSE)\n[![node](https:\u002F\u002Fimg.shields.io\u002Fnode\u002Fv\u002Fagent-skills-eval.svg?style=flat-square&logo=nodedotjs&logoColor=white)](package.json)\n[![docs](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fdocs-GitHub%20Pages-0f766e?style=flat-square)](https:\u002F\u002Fdarkrishabh.github.io\u002Fagent-skills-eval\u002F)\n[![TypeScript](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FTypeScript-3178C6?style=flat-square&logo=typescript&logoColor=white)](https:\u002F\u002Fwww.typescriptlang.org\u002F)\n\n[Documentation](https:\u002F\u002Fdarkrishabh.github.io\u002Fagent-skills-eval\u002F) · [Quickstart](#quickstart) · [SDK](#sdk) · [agentskills.io](https:\u002F\u002Fagentskills.io)\n\n\u003C\u002Fdiv>\n\n---\n\n## Why this exists\n\n[Agent Skills](https:\u002F\u002Fagentskills.io) — the open standard from Anthropic for giving agents domain knowledge — make it easy to ship a `SKILL.md` and assume your agent is now better at the task. The hard part is *proving* it.\n\n`agent-skills-eval` is the missing piece. It runs your skill against the same prompts twice — once `with_skill` loaded into context, once `without_skill` (baseline) — has a judge model grade both outputs, and gives you a side-by-side report. If the skill doesn't make a measurable difference, you'll see it. 
If it does, you have receipts.\n\nIt's the test framework for the Agent Skills ecosystem, separated from any specific agent runtime so it works wherever your skills do.\n\n## Quickstart\n\n```bash\nnpx agent-skills-eval .\u002Fskills \\\n  --target gpt-4o-mini \\\n  --judge gpt-4o-mini \\\n  --baseline \\\n  --strict\n```\n\nThat's it. Point it at a folder of skills, give it a target model and a judge model, and it produces a workspace with full artifacts and a static HTML report.\n\n```text\nagent-skills-workspace\u002F\n└── iteration-1\u002F\n    ├── meta.json            # run metadata\n    ├── benchmark.json       # rolled-up pass\u002Ffail per skill\n    ├── eval-basic\u002F\n    │   ├── with_skill\u002F      # output, timing, judge grading\n    │   └── without_skill\u002F   # ↑ same, with the skill stripped\n    └── report\u002F\n        └── index.html       # the visual report\n```\n\nOpen `iteration-1\u002Freport\u002Findex.html` and you have a real, evidence-backed answer to \"is my skill working?\"\n\n## What you get\n\n|  |  |\n|---|---|\n| **`with_skill` vs `without_skill`** | Every eval runs both ways so you can see the actual lift from the skill — or its absence. |\n| **Judge-graded outputs** | Use any chat model as a judge. Pass\u002Ffail with cited assertions, not vibes. |\n| **TypeScript SDK + CLI** | One-liner CLI for CI, full SDK for custom pipelines, custom providers, and dashboards. |\n| **OpenAI-compatible by default** | Works out of the box with OpenAI, Together, Groq, Anthropic via OpenAI-compat layers, local Llama servers — anything that speaks the OpenAI chat API. |\n| **Tool-call assertions** | Deterministic checks for agents that call tools, not just generate text. |\n| **Portable artifacts** | JSON + JSONL all the way down. Run today, diff tomorrow. Plug into your own dashboard. |\n| **Static HTML reports** | A drop-in report site you can publish anywhere — no infrastructure. 
|\n| **Fully spec-compliant** | Implements the full [agentskills.io specification](https:\u002F\u002Fagentskills.io\u002Fspecification): `SKILL.md` validation, `evals\u002Fevals.json`, official `iteration-N` artifact layout, frontmatter rules. |\n\n## Install\n\n```bash\nnpm install agent-skills-eval\n```\n\nOr run directly without installing:\n\n```bash\nnpx agent-skills-eval --help\n```\n\n## How it works\n\nThe mental model is straightforward. For every eval defined in your skill:\n\n```\n                ┌─────────────────────────────┐\n                │       same prompt           │\n                └───────────────┬─────────────┘\n                                │\n                ┌───────────────┴─────────────┐\n                ▼                             ▼\n        ┌──────────────┐              ┌──────────────┐\n        │ with_skill   │              │without_skill │\n        │ SKILL.md in  │              │ baseline,    │\n        │ context      │              │ no skill     │\n        └──────┬───────┘              └──────┬───────┘\n               │                             │\n               ▼                             ▼\n          target model                  target model\n               │                             │\n               ▼                             ▼\n            output                        output\n               │                             │\n               └──────────┬──────────────────┘\n                          ▼\n                   ┌─────────────┐\n                   │  judge      │  scores both against\n                   │  model      │  the same assertions\n                   └──────┬──────┘\n                          ▼\n                  pass \u002F fail per side\n```\n\nThe judge sees the eval's `expected_output` and `assertions` and grades each side independently. 
The `--baseline` flag is what enables the comparison; without it you only get the `with_skill` run.\n\n## YAML config\n\nFor anything beyond a quick command, drop a config file at the root of your project:\n\n```yaml\n# agent-skills-eval.yaml\nroot: .\u002Fskills\nworkspace: .\u002Fagent-skills-workspace\nbaseline: true\ntarget: gpt-4o-mini\njudge: gpt-4o-mini\nbaseUrl: https:\u002F\u002Fapi.openai.com\u002Fv1\napiKeyEnv: OPENAI_API_KEY\ninclude:\n  - \"skills\u002F**\"\nexclude:\n  - \"**\u002Fdraft-*\"\nconcurrency: 4\nlayout: iteration\nstrict: true\nreport:\n  enabled: true\n  title: Agent Skills Report\nlogging:\n  format: pretty   # pretty | jsonl | silent\n  verbose: false\n  color: auto\ntargetParams:\n  temperature: 0\njudgeParams:\n  temperature: 0\n```\n\n```bash\nOPENAI_API_KEY=... npx agent-skills-eval --config agent-skills-eval.yaml\n```\n\nCLI flags always override config values.\n\n## SDK\n\nFor programmatic use — CI pipelines, custom dashboards, multi-skill rollups — drive the evaluator from TypeScript:\n\n```ts\nimport {\n  OpenAICompatibleProvider,\n  consoleReporter,\n  evaluateSkills,\n} from \"agent-skills-eval\";\n\nconst provider = new OpenAICompatibleProvider({\n  baseUrl: \"https:\u002F\u002Fapi.openai.com\u002Fv1\",\n  apiKey: process.env.OPENAI_API_KEY!,\n  model: \"gpt-4o-mini\",\n  providerName: \"openai\",\n});\n\nconst result = await evaluateSkills({\n  root: \".\u002Fskills\",\n  workspace: \".\u002Fagent-skills-workspace\",\n  baseline: true,\n  concurrency: 4,\n  workspaceLayout: \"iteration\",\n  strict: true,\n  target: { model: provider.model, provider },\n  judge: { model: provider.model, provider },\n  onEvent: consoleReporter(),\n});\n\nconsole.log(result);\n```\n\nStream events to a file as JSONL for downstream analysis:\n\n```ts\nimport { jsonlReporter } from \"agent-skills-eval\";\n\nconst reporter = jsonlReporter({ file: \".\u002Fevents.jsonl\" });\n\nawait evaluateSkills({ \u002F* ... 
*\u002F onEvent: reporter.onEvent });\nawait reporter.close();\n```\n\nLoad YAML config programmatically:\n\n```ts\nimport { loadConfigFile } from \"agent-skills-eval\";\n\nconst config = loadConfigFile(\".\u002Fagent-skills-eval.yaml\");\n```\n\n## Custom providers\n\nBring any backend by implementing the `Provider` interface — five fields, one method:\n\n```ts\nimport type { Provider, ProviderResult } from \"agent-skills-eval\";\n\nexport const provider: Provider = {\n  name: \"my-provider\",\n  model: \"my-model\",\n  async complete(prompt: string): Promise\u003CProviderResult> {\n    return {\n      provider: \"my-provider\",\n      model: \"my-model\",\n      output: \"model output\",\n      latencyMs: 0,\n      inputTokens: 0,\n      outputTokens: 0,\n      costUsd: 0,\n    };\n  },\n};\n```\n\nUseful for: local model servers (Ollama, vLLM, llama.cpp), proprietary internal APIs, mock providers in unit tests, or routing layers in front of multiple providers.\n\n## Skill layout\n\nA skill is a folder. The minimum is a `SKILL.md`. 
Add `evals\u002Fevals.json` and you can evaluate it.\n\n```text\nmy-skill\u002F\n├── SKILL.md\n├── references\u002F\n│   └── notes.md\n├── scripts\u002F\n│   └── helper.sh\n└── evals\u002F\n    ├── evals.json\n    └── files\u002F\n        └── input.csv\n```\n\n`SKILL.md`:\n\n```markdown\n---\nname: my-skill\ndescription: Analyze small CSV files.\nlicense: MIT\ncompatibility: Works with text-capable chat models.\n---\n\nWhen given a CSV file, identify the most important trend and cite the\nrelevant rows.\n```\n\n`evals\u002Fevals.json`:\n\n```json\n{\n  \"skill_name\": \"my-skill\",\n  \"evals\": [\n    {\n      \"id\": \"basic\",\n      \"name\": \"basic behavior\",\n      \"prompt\": \"Use the attached data to summarize revenue.\",\n      \"files\": [\"evals\u002Ffiles\u002Finput.csv\"],\n      \"expected_output\": \"The response identifies the highest revenue month.\",\n      \"assertions\": [\n        \"The output identifies the highest revenue month.\"\n      ]\n    }\n  ]\n}\n```\n\nIf you skip `assertions` but provide `expected_output`, the SDK promotes the expected output into a judge assertion automatically — so a minimal agentskills.io eval file produces meaningful pass\u002Ffail grading without extra work.\n\n## CLI options\n\n```bash\nnpx agent-skills-eval [root] \\\n  --config agent-skills-eval.yaml \\\n  --workspace .\u002Fagent-skills-workspace \\\n  --baseline \\\n  --target gpt-4o-mini \\\n  --judge gpt-4o-mini \\\n  --base-url https:\u002F\u002Fapi.openai.com\u002Fv1 \\\n  --api-key-env OPENAI_API_KEY \\\n  --include \"skills\u002F**\" \\\n  --exclude \"**\u002Fdraft-*\" \\\n  --concurrency 4 \\\n  --layout iteration \\\n  --strict \\\n  --log-format pretty \\\n  --report\n```\n\n**Logging modes**: `pretty` for humans, `jsonl` for machines, `silent` for quiet CI.\n\n## Reports\n\nThe static HTML report is built from disk artifacts and shows everything you'd want for skill iteration:\n\n- Pass rate by skill and by eval\n- Assertion-by-assertion 
grading evidence with judge reasoning\n- Full target output, side by side for `with_skill` and `without_skill`\n- Prompt and judge prompt details\n- Timing and token usage\n- Tool calls when present\n\nUse `--report-output` (or `report.output` in YAML) to choose where the report lands.\n\n## agentskills.io compatibility\n\nImplements the [agentskills.io](https:\u002F\u002Fagentskills.io) specification end to end:\n\n- `SKILL.md` YAML frontmatter — required `name` and `description`, optional `license`, `compatibility`, `metadata`, `allowed-tools`\n- Strict validation: name length, lowercase-hyphenated format, parent-directory match, description length, compatibility length\n- Optional `scripts\u002F`, `references\u002F`, and `assets\u002F` directories — markdown references included in skill context, scripts exposed by manifest\n- `evals\u002Fevals.json` schema: `skill_name`, `evals[].id`, `prompt`, `expected_output`, `files`, `assertions`\n- Official artifact layout: `iteration-N\u002F\u003Ceval>\u002F\u003Cmode>\u002Foutputs`, `timing.json`, `grading.json`, `benchmark.json`\n- Baseline comparison via `with_skill` and `without_skill`\n\nBeyond the spec, this SDK adds: per-eval `defaults`, model `params`, tool definitions, deterministic `tool_assertions`, and a flat `workspaceLayout: \"flat\"` for multi-skill dashboards.\n\n## Examples\n\nSee [`examples\u002Fbasic-skill`](examples\u002Fbasic-skill) for a complete skill folder, and [`examples\u002Fagent-skills-eval.yaml`](examples\u002Fagent-skills-eval.yaml) for a reference config.\n\n## Development\n\n```bash\nnpm ci\nnpm test\nnpm pack --dry-run\n```\n\n## Documentation\n\nFull docs live at **[darkrishabh.github.io\u002Fagent-skills-eval](https:\u002F\u002Fdarkrishabh.github.io\u002Fagent-skills-eval\u002F)** (sources in [`docs\u002F`](docs)). Local preview:\n\n```bash\npython3 -m http.server 8080 --directory docs\n```\n\n## Contributing\n\nIssues, PRs, and skill examples are all welcome. 
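See [CONTRIBUTING.md](CONTRIBUTING.md), [CODE_OF_CONDUCT.md](CODE_OF_CONDUCT.md), and [SECURITY.md](SECURITY.md).\n\n## License\n\nMIT. See [LICENSE](LICENSE).\n\n---\n\n\u003Cdiv align=\"center\">\n\nBuilt for the [Agent Skills](https:\u002F\u002Fagentskills.io) ecosystem.\n\n\u003C\u002Fdiv>\n","`agent-skills-eval` is a test runner for evaluating AI agent skills. Its core features are comparing model outputs with and without a skill loaded, grading the results with a judge model, and generating a detailed side-by-side report, which helps developers verify whether a given skill actually improves the model's performance on a specific task. The project is written in TypeScript, supports YAML configuration files and JSONL event logs, is compatible with the OpenAI API, and suits scenarios that require validating and optimizing AI agent skills, such as natural language processing and dialogue systems. Its concise command-line interface makes it easy to integrate into existing development workflows.",2,"2026-06-11 03:31:18","CREATED_QUERY"]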