[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-70519":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":9,"totalLinesOfCode":9,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":9,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":9,"rankLanguage":9,"license":9,"archived":22,"fork":22,"defaultBranch":23,"hasWiki":22,"hasPages":22,"topics":9,"createdAt":9,"pushedAt":9,"updatedAt":24,"readmeContent":25,"aiSummary":26,"trendingCount":16,"starSnapshotCount":16,"syncStatus":14,"lastSyncTime":27,"discoverSource":28},70519,"waza","microsoft\u002Fwaza","microsoft","CLI \u002F Framework for Agent Skills - create, test, measure and improve skill quality and effectiveness",null,"https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Fwaza","Go",990,56,2,13,0,14,36,314,42,9.27,false,"main","2026-06-12 02:02:34","# Waza\n\nA Go CLI for evaluating AI agent skills — scaffold eval suites, run benchmarks, and compare results across models.\n\n📖 **[Getting Started \u002F Docs](https:\u002F\u002Fmicrosoft.github.io\u002Fwaza\u002F)**\n\n## Installation\n\n### Binary Install (recommended)\n\nDownload and install the latest pre-built binary with the install script:\n\n```bash\ncurl -fsSL https:\u002F\u002Fraw.githubusercontent.com\u002Fmicrosoft\u002Fwaza\u002Fmain\u002Finstall.sh | bash\n```\n\nThe script auto-detects your OS and architecture (linux\u002Fdarwin\u002Fwindows, amd64\u002Farm64), downloads the binary, verifies the checksum, and installs to `\u002Fusr\u002Flocal\u002Fbin` (or `~\u002Fbin` if not writable).\n\nOr download binaries directly from the [latest release](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Fwaza\u002Freleases\u002Flatest).\n\n### Install from Source\n\nRequires Go 1.26+:\n\nNOTE, due to the use of LFS artifacts you cannot install waza using `go install`. To install waza, outside of a normal release, you'll need to clone the repository:\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Fwaza.git\ncd waza\n\n# ensure git LFS-based artifacts are available (for embedded copilot binaries)\ngit lfs install\ngit lfs pull\n\ngo build -o waza .\u002Fcmd\u002Fwaza\n.\u002Fwaza \u003Cwaza command line>\n```\n\n### Azure Developer CLI (azd) Extension\n\nWaza is also available as an [azd extension](https:\u002F\u002Flearn.microsoft.com\u002Fazure\u002Fdeveloper\u002Fazure-developer-cli\u002Fextensions\u002Foverview):\n\n```bash\n# Add the waza extension registry\nazd ext source add -n waza -t url -l https:\u002F\u002Fraw.githubusercontent.com\u002Fmicrosoft\u002Fwaza\u002Fmain\u002Fregistry.json\n\n# Install the extension\nazd ext install microsoft.azd.waza\n\n# Verify it's working\nazd waza --help\n```\n\nOnce installed, all waza commands are available under `azd waza`. For example:\n\n```bash\nazd waza init my-eval --interactive\nazd waza run examples\u002Fcode-explainer\u002Feval.yaml -v\n```\n\n## Update Notifications\n\nWaza automatically checks for new versions in the background. If an update is available, a notice appears after command output:\n\n```\nA newer version of waza is available: v0.24.0 → v0.28.0. Run: curl -fsSL ... | bash\n```\n\nThe check is non-blocking (never slows commands), cached for 24 hours, and can be disabled with `--no-update-check` or `WAZA_NO_UPDATE_CHECK=1`.\n\n## Quick Start\n\n### For New Users: Get Started in 5 Minutes\n\nSee **[Getting Started Guide](docs\u002FGETTING-STARTED.md)** for a complete walkthrough:\n\n```bash\n# Initialize a new project\nwaza init my-project && cd my-project\n\n# Create a new skill\nwaza new skill my-skill\n\n# Define the skill in skills\u002Fmy-skill\u002FSKILL.md\n# Write evaluation tasks in evals\u002Fmy-skill\u002Ftasks\u002F\n# Add test fixtures in evals\u002Fmy-skill\u002Ffixtures\u002F\n\n# Run evaluations\nwaza run my-skill\n\n# Check skill readiness\nwaza check my-skill\n```\n\n### All Commands\n\n```bash\n# Build\nmake build\n\n# Initialize a project workspace\nwaza init [directory]\n\n# Create a new skill\nwaza new skill skill-name\n\n# Create a new eval scaffold from an existing SKILL.md\nwaza new eval skill-name\n\n# Generate a task YAML by recording a prompt run\nwaza new task from-prompt \"Explain this code and suggest fixes\" evals\u002Fcode-explainer\u002Ftasks\u002Frecorded-task.yaml\n\n# Check if a skill is ready for submission\nwaza check skills\u002Fmy-skill\n\n# Suggest an eval suite from SKILL.md\nwaza suggest skills\u002Fmy-skill --dry-run\nwaza suggest skills\u002Fmy-skill --apply\n\n# Note: 'generate' is available as an alias for 'new' (see below for new command)\n# Note: Custom agents (.agent.md) are supported — see https:\u002F\u002Fmicrosoft.github.io\u002Fwaza\u002Fguides\u002Fcustom-agents\u002F\n\n# Run evaluations (works with both skills and custom agents)\nwaza run examples\u002Fcode-explainer\u002Feval.yaml --context-dir examples\u002Fcode-explainer\u002Ffixtures -v\n\n# Grade output from a previous `waza run --output results.json ...`\nwaza grade eval.yaml --results results.json\n\n# Compare results across models\nwaza compare results-gpt4.json results-sonnet.json\n\n# Generate eval coverage grid\nwaza coverage --format markdown\n\n# Count tokens in skill files\nwaza tokens count skills\u002F\n\n# Compare skill token budgets vs main\nwaza tokens compare main --skills --threshold 10\n\n# Suggest token optimizations\nwaza tokens suggest skills\u002F\n```\n\n## Commands\n\n### `waza init [directory]`\n\nInitialize a waza project workspace with separated `skills\u002F` and `evals\u002F` directories. Idempotent — creates only missing files.\n\n| Flag | Description |\n|------|-------------|\n| `--no-skill` | Skip the first-skill creation prompt |\n\nCreates:\n- `skills\u002F` — Skill definitions directory\n- `evals\u002F` — Evaluation suites directory\n- `.github\u002Fworkflows\u002Feval.yml` — CI\u002FCD pipeline for running evals on PR\n- `.gitignore` — Waza-specific exclusions\n- `README.md` — Getting started guide for your project\n\n**Example:**\n```bash\nwaza init my-project\n# Optionally creates first skill interactively\n\nwaza init my-project --no-skill\n# Skip skill creation prompt\n```\n\n### `waza new skill \u003Cskill-name>`\n\nCreate a new skill with scaffolded structure and evaluation suite. Detects workspace context and adapts output.\n\n| Flag | Short | Description |\n|------|-------|-------------|\n| `--template` | `-t` | Template pack (coming soon) |\n\n**Modes:**\n\n*Project mode* (detects `skills\u002F` directory):\n```\nproject\u002F\n├── skills\u002F{skill-name}\u002FSKILL.md\n└── evals\u002F{skill-name}\u002F\n    ├── eval.yaml\n    ├── tasks\u002F*.yaml\n    └── fixtures\u002F\n```\n\n*Standalone mode* (no `skills\u002F` detected):\n```\n{skill-name}\u002F\n├── SKILL.md\n├── evals\u002F\n│   ├── eval.yaml\n│   ├── tasks\u002F*.yaml\n│   └── fixtures\u002F\n├── .github\u002Fworkflows\u002Feval.yml\n├── .gitignore\n└── README.md\n```\n\n**Example:**\n```bash\n# In project mode (explained Modes section, above): creates skills\u002Fcode-explainer\u002FSKILL.md + evals\u002Fcode-explainer\u002F\nwaza new skill code-explainer\n\n# In standalone mode (explained Modes section, above): creates code-explainer\u002F self-contained directory\nwaza new skill code-explainer\n```\n\n### `waza new eval \u003Cskill-name>`\n\nScaffold an eval suite from an existing `SKILL.md` (reads frontmatter trigger hints from `USE FOR` and `DO NOT USE FOR`).\n\nCreates:\n- `evals\u002F\u003Cskill-name>\u002Feval.yaml`\n- `evals\u002F\u003Cskill-name>\u002Ftasks\u002Fpositive-trigger-1.yaml`\n- `evals\u002F\u003Cskill-name>\u002Ftasks\u002Fpositive-trigger-2.yaml`\n- `evals\u002F\u003Cskill-name>\u002Ftasks\u002Fnegative-trigger-1.yaml`\n\n| Flag | Description |\n|------|-------------|\n| `--output \u003Cpath>` | Custom path for `eval.yaml` (tasks are generated under sibling `tasks\u002F`) |\n\n**Example:**\n```bash\n# Default output location\nwaza new eval code-explainer\n\n# Custom eval path\nwaza new eval code-explainer --output evals\u002Fcustom-code-explainer\u002Feval.yaml\n```\n\n### `waza new task from-prompt \u003Cprompt> \u003Ctask-path>`\n\nRun a prompt through Copilot and generate a task YAML with inferred validators based on observed behavior (response text, tool usage, and invoked skills).\n\n| Flag | Description |\n|------|-------------|\n| `--model \u003Cname>` | Copilot model to run for recording (default: `claude-sonnet-4.5`) |\n| `--testname \u003Cname>` | Test name and ID written into the generated task (default: `auto-generated-test`) |\n| `--tags \u003Ca,b,...>` | Comma-separated tags to attach to the generated task |\n| `--timeout \u003Cduration>` | Max time for prompt execution (default: `5m`) |\n| `--overwrite` | Overwrite the output task file if it already exists |\n| `--root \u003Cdir>` | Root directory used for skill discovery (default: `.`) |\n\n**Example:**\n```bash\n# Record a prompt and generate a reusable task YAML\nwaza new task from-prompt \"Refactor this function for readability\" evals\u002Fcode-explainer\u002Ftasks\u002Frefactor-readability.yaml\n\n# Add metadata and overwrite an existing file\nwaza new task from-prompt \"Explain this diff and risks\" evals\u002Fcode-explainer\u002Ftasks\u002Fdiff-analysis.yaml \\\n  --testname diff-analysis \\\n  --tags recorded,regression \\\n  --overwrite\n```\n\n### `waza run \u003Ceval.yaml>`\n\nRun an evaluation benchmark from a spec file.\n\n| Flag | Short | Description |\n|------|-------|-------------|\n| `--context-dir \u003Cdir>` | | Fixture directory (default: `.\u002Ffixtures` relative to spec) |\n| `--output \u003Cfile>` | `-o` | Save results to JSON |\n| `--output-dir \u003Cdir>` | | Directory for structured output; each run creates a UTC-timestamped subdirectory of `\u003Cdir>`. Mutually exclusive with `--output`. |\n| `--verbose` | `-v` | Detailed progress output |\n| `--transcript-dir \u003Cdir>` | | Save per-task transcript JSON files |\n| `--task \u003Cglob>` | | Filter tasks by name\u002FID pattern (repeatable) |\n| `--parallel` | | Run tasks concurrently |\n| `--workers \u003Cn>` | | Concurrent workers (default: 4, requires `--parallel`) |\n| `--trials \u003Cn>` | | Run each task `n` times to detect flakiness (omit to use `config.trials_per_task`; if provided, `n` must be >= 1) |\n| `--interpret` | | Print plain-language result interpretation |\n| `--format \u003Cfmt>` | | Output format: `default` or `github-comment` (default: `default`) |\n| `--cache` | | Enable result caching to speed up repeated runs |\n| `--no-cache` | | Explicitly disable result caching |\n| `--cache-dir \u003Cdir>` | | Cache directory (default: `.waza-cache`) |\n| `--reporter \u003Cspec>` | | Output reporters: `json` (default), `junit:\u003Cpath>` (repeatable) |\n| `--baseline` | | A\u002FB testing mode — runs each task twice (without skill = baseline, with skill = normal) and computes improvement scores |\n| `--discover` | | Auto skill discovery — walks directory tree for SKILL.md + eval.yaml (root\u002Ftests\u002Fevals) |\n| `--strict` | | Fail if any SKILL.md lacks eval coverage (use with `--discover`) |\n| `--suggest` | | Generate a Copilot suggestion report based on test outcomes (`mock` engine emits a deterministic fake report) |\n| `--output-dir \u003Cdir>` | | Directory for structured output; each run creates a UTC timestamped subdirectory. Mutually exclusive with `--output`. |\n| `--tags \u003Cpatterns>` | | Filter tasks by tags, using glob patterns (repeatable) |\n| `--model \u003Cname>` | | Override model (repeatable for multi-model comparison) |\n| `--recommend` | | Generate heuristic recommendation after multi-model run |\n| `--judge-model \u003Cmodel>` | | Model for LLM-as-judge graders (overrides execution model) |\n| `--session-log` | | Enable session event logging (NDJSON) |\n| `--session-dir \u003Cdir>` | | Directory for session log files (default: current directory) |\n| `--no-summary` | | Skip writing combined summary.json for multi-skill runs |\n| `--update-snapshots` | | Update or create diff grader snapshot files to match current output |\n| `--skip-graders` | | Skip grading (execution only); grade later with `waza grade` |\n| `--keep-workspace` | | Preserve temp workspaces after execution for debugging |\n\n**Result Caching**\n\nEnable caching with `--cache` to store test results and skip re-execution on repeated runs:\n\n```bash\n# First run executes all tests and caches results\nwaza run eval.yaml --cache\n\n# Second run uses cached results (much faster)\nwaza run eval.yaml --cache\n\n# Clear the cache when needed\nwaza cache clear\n```\n\nCached results are automatically invalidated when:\n- Spec configuration changes (model, timeout, graders, etc.)\n- Task definitions change\n- Fixture files change\n\n**Note:** Caching is automatically disabled for evaluations using non-deterministic graders (`behavior`, `prompt`).\n\n**Exit Codes**\n\nThe `run` command uses exit codes to enable CI\u002FCD integration:\n\n| Exit Code | Condition | Description |\n|-----------|-----------|-------------|\n| `0` | Success | All tests passed |\n| `1` | Test failure | One or more tests failed validation |\n| `2` | Configuration error | Invalid spec, missing files, or runtime error |\n\nExample CI usage:\n\n```bash\n# Fail the build if any tests fail\nwaza run eval.yaml || exit $?\n\n# Capture specific exit codes\nwaza run eval.yaml\nEXIT_CODE=$?\nif [ $EXIT_CODE -eq 1 ]; then\n  echo \"Tests failed - check results\"\nelif [ $EXIT_CODE -eq 2 ]; then\n  echo \"Configuration error\"\nfi\n\n# Post results as PR comment (GitHub Actions)\nwaza run eval.yaml --format github-comment > comment.md\ngh pr comment $PR_NUMBER --body-file comment.md\n\n# Generate JUnit XML for CI test reporting\nwaza run eval.yaml --reporter junit:results.xml\n\n# Both JSON output and JUnit XML\nwaza run eval.yaml -o results.json --reporter junit:results.xml\n```\n\n**Note:** `waza generate` is an alias for `waza new`. Both commands support the same functionality with the `--output-dir` flag for specifying custom output locations.\n\n### `waza compare \u003Cfile1> \u003Cfile2> [files...]`\n\nCompare results from multiple evaluation runs side by side — per-task score deltas, pass rate differences, and aggregate statistics.\n\n| Flag | Short | Description |\n|------|-------|-------------|\n| `--format \u003Cfmt>` | `-f` | Output format: `table` or `json` (default: `table`) |\n\n### `waza coverage [root]`\n\nGenerate a skill-to-eval coverage grid showing which skills are fully covered, partially covered, or missing evals.\n\n**Note**: Full coverage requires tasks (via `tasks:` or `tasks_from:`) and 2+ grader types. The coverage percentage reflects only fully covered skills.\n\n| Flag | Short | Description |\n|------|-------|-------------|\n| `--format \u003Cfmt>` | `-f` | Output format: `text`, `markdown`, or `json` (default: `text`) |\n| `--path \u003Cdir>` | | Additional directory to scan for skills\u002Fevals (repeatable) |\n\n### `waza models`\n\nList models available for evaluation via the Copilot SDK. Shows model IDs and metadata that can be used with `--model` flags in `waza run`, `waza quality`, and other commands.\n\nRequires authentication via `copilot login`.\n\n| Flag | Description |\n|------|-------------|\n| `--json` | Output as JSON |\n\n**Examples:**\n\n```bash\n# List available models in table format\nwaza models\n\n# Output available models as JSON\nwaza models --json\n```\n\n### `waza cache clear`\n\nClear all cached evaluation results to force re-execution on the next run.\n\n| Flag | Description |\n|------|-------------|\n| `--cache-dir \u003Cdir>` | Cache directory to clear (default: `.waza-cache`) |\n\n### `waza dev [skill-path]`\n\nIteratively score and improve skill frontmatter in a SKILL.md file.\n\nUse `--copilot` for a non-interactive, single-pass markdown report that:\n1. Summarizes current skill details and token usage\n2. Loads trigger test prompts as examples (when `trigger_tests.yaml` exists)\n3. Requests Copilot suggestions for improving skill selection\n4. Prints the report to stdout without applying any changes\n\nWhen `--copilot` is set, iterative mode flags (`--target`, `--max-iterations`, `--auto`) are invalid.\n\n| Flag | Description |\n|------|-------------|\n| `--target \u003Clevel>` | Target adherence level for iterative mode: `low`, `medium`, `medium-high`, `high` (default: `medium-high`) |\n| `--max-iterations \u003Cn>` | Maximum improvement iterations for iterative mode (default: 5) |\n| `--auto` | Apply improvements without prompting in iterative mode |\n| `--copilot` | Generate a non-interactive markdown report with Copilot suggestions |\n| `--model \u003Cid>` | Model to use with `--copilot` |\n\n### `waza check [skill-path]`\n\nCheck if a skill is ready for submission with a comprehensive readiness report.\n\nPerforms five types of checks:\n1. **Compliance scoring** — Validates frontmatter adherence (Low\u002FMedium\u002FMedium-High\u002FHigh)\n2. **Token budget** — Checks if SKILL.md is within token limits (configurable in `.waza.yaml` `tokens.limits`)\n3. **Evaluation suite** — Checks for the presence of eval.yaml\n4. **Spec compliance** — Validates the skill against the agentskills.io spec (frontmatter structure, required fields, naming rules, directory match, description length, compatibility, license, and version)\n5. **Advisory checks** — Detects quality and maintainability issues (reference module count, complexity classification, negative delta risk patterns, procedural content, and over-specificity)\n\nProvides a plain-language summary and actionable next steps to improve the skill.\n\n**Example output:**\n```\n🔍 Skill Readiness Check\n━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\n\nSkill: code-explainer\n\n📋 Compliance Score: High\n   ✅ Excellent! Your skill meets all compliance requirements.\n\n📊 Token Budget: 450 \u002F 500 tokens\n   ✅ Within budget (50 tokens remaining).\n\n🧪 Evaluation Suite: Found\n   ✅ eval.yaml detected. Run 'waza run eval.yaml' to test.\n\n📐 Spec Compliance (agentskills.io)\n   ✅ spec-frontmatter    Frontmatter structure valid with required fields\n   ✅ spec-allowed-fields All frontmatter fields are spec-allowed\n   ✅ spec-name           Name follows spec naming rules\n   ✅ spec-dir-match      Directory name matches skill name\n   ✅ spec-description    Description is valid\n   ✅ spec-license        License field present\n   ✅ spec-version        metadata.version present\n\n🔬 Advisory Checks\n   ✅ module-count        Found 2 reference modules (2-3 is optimal)\n   ✅ complexity          Complexity: detailed (350 tokens, 2 modules)\n   ✅ negative-delta-risk No negative delta risk patterns detected\n   ✅ procedural-content  Description contains procedural language\n   ✅ over-specificity    No over-specificity patterns detected\n\n━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\n📈 Overall Readiness\n━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\n\n✅ Your skill is ready for submission!\n\n🎯 Next Steps\n━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\n\n✨ No action needed! Your skill looks great.\n\nConsider:\n  • Running 'waza run eval.yaml' to verify functionality\n  • Sharing your skill with the community\n```\n\n**Usage:**\n```bash\n# Check current directory\nwaza check\n\n# Check specific skill\nwaza check skills\u002Fmy-skill\n\n# Suggested workflow\nwaza check skills\u002Fmy-skill     # Check readiness\nwaza dev skills\u002Fmy-skill       # Improve compliance if needed\nwaza check skills\u002Fmy-skill     # Verify improvements\n```\n\n### `waza quality \u003Cskill-path>`\n\nUse an LLM-as-Judge to evaluate skill content quality across five dimensions:\nclarity, completeness, trigger precision, scope coverage, and anti-patterns.\n\n| Flag | Description |\n|------|-------------|\n| `--model \u003Cmodel>` | Model to use as judge (default: project default model) |\n| `--format table\\|json` | Output format (default: `table`) |\n| `--rubric \u003Cpath>` | Path to custom rubric file (reserved for future use) |\n\n**Examples:**\n```bash\n# Evaluate skill quality (table output)\nwaza quality skills\u002Fcode-explainer\n\n# JSON output for CI integration\nwaza quality skills\u002Fcode-explainer --format json\n\n# Use a specific model as judge\nwaza quality skills\u002Fcode-explainer --model gpt-4o\n```\n\n### `waza suggest \u003Cskill-path>`\n\nUse an LLM to analyze `SKILL.md` and generate suggested evaluation artifacts.\n\n| Flag | Description |\n|------|-------------|\n| `--model \u003Cmodel>` | Model to use for suggestions (default: project default model) |\n| `--dry-run` | Print suggested output to stdout (default) |\n| `--apply` | Write files to disk |\n| `--output-dir \u003Cdir>` | Output directory (default: `\u003Cskill-path>\u002Fevals`) |\n| `--format yaml\\|json` | Output format (default: `yaml`) |\n\n**Examples:**\n```bash\n# Preview generated eval\u002Ftask\u002Ffixture files as YAML\nwaza suggest skills\u002Fcode-explainer --dry-run\n\n# Write generated files to disk\nwaza suggest skills\u002Fcode-explainer --apply\n\n# Print JSON-formatted suggestion payload\nwaza suggest skills\u002Fcode-explainer --format json\n```\n\n### `waza tokens count [paths...]`\n\nCount tokens in markdown files. Paths may be files or directories (scanned recursively for `.md`\u002F`.mdx`).\n\n| Flag | Description |\n|------|-------------|\n| `--format \u003Cfmt>` | Output format: `table` or `json` (default: `table`) |\n| `--sort \u003Cfield>` | Sort by: `tokens`, `name`, or `path` (default: `path`) |\n| `--min-tokens \u003Cn>` | Filter files below n tokens |\n| `--no-total` | Hide total row in table output |\n\n### `waza tokens compare [refs...]`\n\nCompare markdown token counts between git refs.\n\nWith no arguments, compares HEAD to the working tree.\nWith one ref, compares that ref to the working tree.\nWith two refs, compares the first ref to the second.\n\n| Flag | Description |\n|------|-------------|\n| `--format \u003Cfmt>` | Output format: `table` or `json` (default: `table`) |\n| `--show-unchanged` | Include unchanged files in output |\n| `--strict` | Exit with code 1 if any file exceeds its absolute token limit |\n| `--skills` | Only compare SKILL.md files under configured skill roots |\n| `--threshold \u003Cn>` | Fail when any existing file increases by more than n percent (0 = disabled) |\n\nUse `--skills` to restrict comparison to SKILL.md files under configured skill\nroots (`skills\u002F`, `.github\u002Fskills\u002F`, and `paths.skills` from `.waza.yaml`). In\nskills mode the default base ref is `origin\u002Fmain` (falling back to `main`).\n\nUse `--threshold` for CI gating — newly added files are exempt from threshold\nchecks (no baseline) but still subject to absolute limit checks with `--strict`.\n\n```bash\n# Compare all markdown tokens between HEAD and working tree\nwaza tokens compare\n\n# Skill-aware comparison vs main with CI threshold\nwaza tokens compare main --skills --threshold 10\n\n# JSON output for CI pipelines\nwaza tokens compare main --skills --threshold 10 --strict --format json\n```\n\n### `waza tokens profile [skill-name | path]`\n\nStructural analysis of SKILL.md files — reports token count, section count, code block count, and workflow step detection with a one-line summary and warnings.\n\n| Flag | Description |\n|------|-------------|\n| `--format \u003Cfmt>` | Output format: `text` or `json` (default: `text`) |\n| `--tokenizer \u003Ct>` | Tokenizer: `bpe` or `estimate` (default: `bpe`) |\n\n**Example output:**\n```\n📊 my-skill: 1,722 tokens (detailed ✓), 8 sections, 4 code blocks\n   ⚠️  no workflow steps detected\n```\n\n### `waza tokens suggest [paths...]`\n\nSuggest ways to reduce token usage in markdown files. Paths may be files or\ndirectories (scanned recursively for `.md`\u002F`.mdx`).\n\n| Flag | Description |\n|------|-------------|\n| `--format \u003Cfmt>` | Output format: `text` or `json` (default: `text`) |\n| `--min-savings \u003Cn>` | Minimum estimated token savings for heuristic suggestions |\n| `--copilot` | Enable Copilot-powered suggestions |\n| `--model \u003Cid>` | Model to use with `--copilot` |\n\n### `waza serve`\n\nStart the waza dashboard server to visualize evaluation results. The HTTP server opens in your browser automatically and scans the specified directory for `.json` result files.\n\nOptionally, run a JSON-RPC 2.0 server (for IDE integration) instead of the HTTP dashboard using the `--tcp` flag.\n\n| Flag | Default | Description |\n|------|---------|-------------|\n| `--port \u003Cport>` | `3000` | HTTP server port |\n| `--no-browser` | `false` | Don't auto-open the browser |\n| `--results-dir \u003Cdir>` | `.` | Directory to scan for result files |\n| `--tcp \u003Caddr>` | (off) | TCP address for JSON-RPC (e.g., `:9000`); defaults to loopback for security |\n| `--tcp-allow-remote` | `false` | Allow TCP binding to non-loopback addresses (⚠️ no authentication) |\n\n**Examples:**\n\nStart the HTTP dashboard on port 3000:\n```bash\nwaza serve\n```\n\nStart the HTTP dashboard on a custom port and scan a results directory:\n```bash\nwaza serve --port 8080 --results-dir .\u002Fresults\n```\n\nStart the dashboard without auto-opening the browser:\n```bash\nwaza serve --no-browser\n```\n\nStart a JSON-RPC server for IDE integration:\n```bash\nwaza serve --tcp :9000\n```\n\n**Dashboard Views:**\n\nThe dashboard displays evaluation results with:\n- Task-level pass\u002Ffail status\n- Score distributions across trials\n- Model comparisons\n- Aggregated metrics and trends\n\nFor detailed documentation on the dashboard and result visualization, see [docs\u002FGUIDE.md](docs\u002FGUIDE.md).\n\n### `waza results`\n\nManage evaluation results stored in cloud or local storage.\n\n#### `waza results list`\n\nList all evaluation runs from configured cloud storage or local results directory.\n\n| Flag | Description |\n|------|-------------|\n| `--limit \u003Cn>` | Maximum results to display (default: 20) |\n| `--format \u003Cfmt>` | Output format: `table` or `json` (default: `table`) |\n\n```bash\n# List recent results\nwaza results list\n\n# List with custom limit\nwaza results list --limit 20\n\n# Output as JSON\nwaza results list --format json\n```\n\n#### `waza results compare \u003Cid1> \u003Cid2>`\n\nCompare two evaluation runs side by side. Displays per-task score deltas, pass rate differences, and key metrics.\n\n| Flag | Description |\n|------|-------------|\n| `--format \u003Cfmt>` | Output format: `table` or `json` (default: `table`) |\n\n```bash\n# Compare two runs\nwaza results compare run-20250226-001 run-20250226-002\n\n# Output as JSON for further processing\nwaza results compare run-20250226-001 run-20250226-002 --format json\n```\n\n### `waza grade \u003Ceval.yaml>`\n\nRun graders against agent output without executing an agent. Designed for standalone grading of previous eval runs.\n\n| Flag | Description |\n|------|-------------|\n| `--task \u003Cid>` | Task ID to grade |\n| `--results \u003Cfile>` | Path to waza run output JSON |\n| `--workspace \u003Cdir>` | Agent workspace directory for file-based graders; must point to the agent's actual workspace (default: `.`) |\n| `--judge-model \u003Cmodel>` | Model for prompt graders |\n| `-o, --output \u003Cfile>` | Write full EvaluationOutcome JSON (compatible with `waza compare`) |\n| `-v, --verbose` | Verbose output |\n\n```bash\nwaza run eval.yaml --output results.json\nwaza grade eval.yaml --results results.json\n```\n\n### `waza session list`\n\nList session event logs in a directory.\n\n| Flag | Description |\n|------|-------------|\n| `--dir \u003Cdir>` | Directory to search for session logs (default: `.`) |\n\n```bash\nwaza session list\nwaza session list --dir .\u002Fsessions\n```\n\n### `waza session view \u003Csession-file>`\n\nRender a session timeline from an NDJSON event log.\n\n```bash\nwaza session view session-2025-06-15.ndjson\n```\n\n## Cloud Storage\n\nWaza can automatically upload evaluation results to Azure Blob Storage for team collaboration and historical tracking.\n\n### Configuration\n\nAdd a `storage:` section to your `.waza.yaml`:\n\n```yaml\nstorage:\n  provider: azure-blob\n  accountName: \"myteamwaza\"\n  containerName: \"waza-results\"\n  enabled: true\n```\n\n| Field | Description | Required |\n|-------|-------------|----------|\n| `provider` | Cloud provider (`azure-blob` currently supported) | Yes |\n| `accountName` | Azure Storage account name | Yes |\n| `containerName` | Blob container name (default: `waza-results`) | No |\n| `enabled` | Enable\u002Fdisable uploads (default: `true` when configured) | No |\n\n### Authentication\n\nWaza uses **DefaultAzureCredential** — it automatically detects and uses available credentials in this order:\n\n1. **Environment variables** (`AZURE_CLIENT_ID`, `AZURE_CLIENT_SECRET`, `AZURE_TENANT_ID`)\n2. **Managed Identity** (on Azure services)\n3. **Azure CLI** (`az login`)\n4. **Visual Studio Code** (if signed in)\n5. **Azure PowerShell** (if signed in)\n\nIn most cases, running `az login` is all you need:\n\n```bash\naz login\nwaza run eval.yaml  # Results auto-upload to Azure Storage\n```\n\n### How It Works\n\n1. **Auto-upload on run:** When `storage:` is configured, `waza run` automatically uploads results to Azure Blob Storage\n2. **Organized by skill:** Results are stored as `{skill-name}\u002F{run-id}.json`\n3. **Local copy kept:** Results are also saved locally (via `-o` flag)\n4. **List remote results:** Use `waza results list` to browse uploaded runs\n5. **Compare runs:** Use `waza results compare` to diff two remote results\n\n### Example Workflow\n\n```bash\n# Configure once (edit .waza.yaml)\ncat > .waza.yaml \u003C\u003CEOF\nstorage:\n  provider: azure-blob\n  accountName: \"myteamwaza\"\n  containerName: \"waza-results\"\n  enabled: true\nEOF\n\n# Authenticate\naz login\n\n# Run evaluations — results auto-upload\nwaza run evals\u002Fmy-skill\u002Feval.yaml -v\n\n# Browse uploaded results\nwaza results list\n\n# Compare two runs\nwaza results compare run-id-1 run-id-2\n```\n\nFor step-by-step setup and troubleshooting, see [Getting Started with Azure Storage](..\u002Fdocs\u002Fguides\u002Fazure-storage\u002F) guide.\n\n## Building\n\n```bash\nmake build          # Compile binary to .\u002Fwaza\nmake test           # Run tests with coverage\nmake lint           # Run golangci-lint\nmake fmt            # Format code and tidy modules\nmake install        # Install to GOPATH\n```\n\n## Project Structure\n\n```\ncmd\u002Fwaza\u002F              CLI entrypoint and command definitions\n  tokens\u002F              Token counting subcommand\ninternal\u002F\n  config\u002F              Configuration with functional options\n  execution\u002F           AgentEngine interface (mock, copilot)\n  graders\u002F             Validator registry and built-in graders\n  metrics\u002F             Scoring metrics\n  models\u002F              Data structures (EvalSpec, TestCase, EvaluationOutcome)\n  orchestration\u002F       EvalRunner for coordinating execution\n  reporting\u002F           Result formatting and output\n  transcript\u002F          Per-task transcript capture\n  wizard\u002F              Interactive init wizard\nexamples\u002F              Example eval suites\nskills\u002F                Example skills\n```\n\n## Eval Spec Format\n\n```yaml\nname: my-eval\nskill: my-skill\nversion: \"1.0\"\n\nconfig:\n  trials_per_task: 3\n  max_attempts: 3          # Retry failed graders up to 3 times (default: 1, no retries)\n  timeout_seconds: 300\n  parallel: false\n  executor: mock          # or copilot-sdk\n  model: claude-sonnet-4-20250514\n  group_by: model          # Group results by model (or other dimension)\n\n# Custom input variables available as {{.Vars.key}} in tasks and hooks\ninputs:\n  api_version: v2\n  environment: production\n  max_retries: 3\n\nhooks:\n  before_run:\n    - command: \"echo 'Starting evaluation'\"\n      working_directory: \".\"\n      exit_codes: [0]\n      error_on_fail: false\n\n  after_run:\n    - command: \"echo 'Evaluation complete'\"\n      working_directory: \".\"\n      exit_codes: [0]\n      error_on_fail: false\n\n  before_task:\n    - command: \"echo 'Running task: {{.TaskName}}'\"\n      working_directory: \".\"\n      exit_codes: [0]\n      error_on_fail: false\n\n  after_task:\n    - command: \"echo 'Task {{.TaskName}} completed'\"\n      working_directory: \".\"\n      exit_codes: [0]\n      error_on_fail: false\n\ngraders:\n  - type: text\n    name: pattern_check\n    config:\n      regex_match: [\"\\\\d+ tests passed\"]\n\n  - type: behavior\n    name: efficiency\n    config:\n      max_tool_calls: 20\n      max_duration_ms: 300000\n\n  - type: action_sequence\n    name: workflow_check\n    config:\n      matching_mode: in_order_match\n      expected_actions: [\"bash\", \"edit\", \"report_progress\"]\n\n# Task definitions: glob patterns or CSV dataset\ntasks:\n  - \"tasks\u002F*.yaml\"\n\n# Optional: Generate tasks from CSV dataset\n# tasks_from: .\u002Ftest-cases.csv\n# range: [1, 10]  # Only include rows 1-10 (0-indexed, skips header)\n```\n\n### Custom Input Variables\n\nUse the `inputs` section to define key-value variables available throughout your evaluation as `{{.Vars.key}}`:\n\n```yaml\ninputs:\n  api_endpoint: https:\u002F\u002Fapi.example.com\n  timeout: 30\n  environment: staging\n\nhooks:\n  before_run:\n    - command: \"echo 'Testing against {{.Vars.environment}}'\"\n      working_directory: \".\"\n      exit_codes: [0]\n      error_on_fail: false\n```\n\nVariables are accessible in:\n- Hook commands\n- Task prompts and fixtures (via template rendering)\n- Grader configurations\n\n### CSV Dataset Support\n\nGenerate tasks dynamically from a CSV file using `tasks_from`:\n\n```yaml\n# eval.yaml\ntasks_from: .\u002Ftest-cases.csv\nrange: [0, 50]  # Optional: limit to rows 0-50 (skip header at 0)\n```\n\n**CSV Format:**\n```csv\nprompt,expected_output,language\n\"Explain this function\",\"Function explanation\",python\n\"Review this code\",\"Code review\",javascript\n```\n\n**Task Generation:**\n- **First row** is treated as column headers\n- **Each subsequent row** becomes a task\n- **Column values** are available as `{{.Vars.column_name}}`\n- **Range filtering** (optional) allows limiting to a subset of rows\n\n**Example task prompt using CSV variables:**\n\nIn your task file or inline prompt:\n```yaml\nprompt: \"{{.Vars.prompt}}\"\nexpected_output: \"{{.Vars.expected_output}}\"\nlanguage: \"{{.Vars.language}}\"\n```\n\nTasks can also be mixed — use both explicit task files and CSV-generated tasks:\n\n```yaml\ntasks:\n  - \"tasks\u002F*.yaml\"        # Explicit tasks\n\ntasks_from: .\u002Ftest-cases.csv    # CSV-generated tasks\nrange: [0, 20]                  # Only first 20 rows\n```\n\n**CSV vs Inputs:**\n- `inputs`: Static key-value pairs defined once in eval.yaml\n- `tasks_from`: Generates multiple tasks from CSV rows\n- **Conflict resolution**: CSV column values override `inputs` for the same key\n\n### Retry\u002FAttempts\n\nUse `max_attempts` to retry failed grader validations within each trial:\n\n```yaml\nconfig:\n  max_attempts: 3  # Retry failed graders up to 3 times (default: 1, no retries)\n```\n\nWhen a grader fails, waza will retry the task execution up to `max_attempts` times. The evaluation outcome includes an `attempts` field showing how many executions were needed to pass. This is useful for handling transient failures in external services or non-deterministic grader behavior.\n\n**Output:** JSON results include `attempts` per task showing the number of executions performed.\n\n### Grouping Results\n\nUse `group_by` to organize results by a dimension (e.g., model, environment). Results are grouped in CLI output and JSON results include group statistics:\n\n```yaml\nconfig:\n  group_by: model\n```\n\nGrouped results in JSON output include `GroupStats`:\n```json\n{\n  \"group_stats\": [\n    {\n      \"name\": \"claude-sonnet-4-20250514\",\n      \"passed\": 8,\n      \"total\": 10,\n      \"avg_score\": 0.85\n    }\n  ]\n}\n```\n\n### Lifecycle Hooks\n\nUse `hooks` to run commands before\u002Fafter evaluations and tasks:\n\n```yaml\nhooks:\n  before_run:\n    - command: \"npm install\"\n      working_directory: \".\"\n      exit_codes: [0]\n      error_on_fail: true\n\n  after_run:\n    - command: \"rm -rf node_modules\"\n      working_directory: \".\"\n      exit_codes: [0]\n      error_on_fail: false\n\n  before_task:\n    - command: \"echo 'Task: {{.TaskName}}'\"\n      working_directory: \".\"\n      exit_codes: [0]\n      error_on_fail: false\n\n  after_task:\n    - command: \"echo 'Done: {{.TaskName}}'\"\n      working_directory: \".\"\n      exit_codes: [0]\n      error_on_fail: false\n```\n\n**Hook Fields:**\n- `command` — Shell command to execute\n- `working_directory` — Directory to run command in (relative to eval.yaml)\n- `exit_codes` — List of acceptable exit codes (default: `[0]`)\n- `error_on_fail` — Fail entire evaluation if hook fails (default: `false`)\n\n**Lifecycle Points:**\n- `before_run` — Execute once before all tasks\n- `after_run` — Execute once after all tasks\n- `before_task` — Execute before each task\n- `after_task` — Execute after each task\n\n**Template Variables in Hooks and Commands:**\n\nAvailable variables in hook commands and task execution contexts:\n- `{{.JobID}}` — Unique evaluation run identifier\n- `{{.TaskName}}` — Name\u002FID of the current task (available in `before_task`\u002F`after_task` only)\n- `{{.Iteration}}` — Current trial number (1-indexed)\n- `{{.Attempt}}` — Current attempt number (1-indexed, used for retries)\n- `{{.Timestamp}}` — ISO 8601 timestamp of execution\n- `{{.Vars.key}}` — User-defined variables from the `inputs` section or CSV columns\n\nCustom variables can be defined in the `inputs` section and referenced in hooks:\n\n```yaml\ninputs:\n  environment: production\n  api_version: v2\n  debug_mode: \"true\"\n\nhooks:\n  before_run:\n    - command: \"echo 'Starting eval {{.JobID}} in {{.Vars.environment}}'\"\n      working_directory: \".\"\n      exit_codes: [0]\n      error_on_fail: false\n```\n\nWhen using CSV-generated tasks, each row's column values are also available as `{{.Vars.column_name}}`.\n\n## CI\u002FCD Integration\n\nWaza is designed to work seamlessly with CI\u002FCD pipelines.\n\n### Integrating Waza in CI\n\nWaza can validate your skill in CI before publishing:\n\n#### Installation in CI\n\n**Option 1: Binary install (recommended)**\n```bash\ncurl -fsSL https:\u002F\u002Fraw.githubusercontent.com\u002Fmicrosoft\u002Fwaza\u002Fmain\u002Finstall.sh | bash\n```\n\n**Option 2: Install from source**\n```bash\n# Requires Go 1.26+\ngo install github.com\u002Fmicrosoft\u002Fwaza\u002Fcmd\u002Fwaza@latest\n```\n\n**Option 3: Use Docker**\n```bash\ndocker build -t waza:local .\ndocker run -v $(pwd):\u002Fworkspace waza:local run eval\u002Feval.yaml\n```\n\n#### Quick Workflow Setup\n\nCopy [`.github\u002Fworkflows\u002Fskills-ci-example.yml`](.github\u002Fworkflows\u002Fskills-ci-example.yml) to your skill repository:\n\n```yaml\njobs:\n  evaluate-skill:\n    runs-on: ubuntu-latest\n    steps:\n      - uses: actions\u002Fcheckout@v4\n      - name: Install waza\n        run: curl -fsSL https:\u002F\u002Fraw.githubusercontent.com\u002Fmicrosoft\u002Fwaza\u002Fmain\u002Finstall.sh | bash\n      - run: waza run eval\u002Feval.yaml --verbose --output results.json\n      - uses: actions\u002Fupload-artifact@v4\n        with:\n          name: waza-evaluation-results\n          path: results.json\n```\n\n#### Environment Requirements\n\n| Requirement | Details |\n|-------------|---------|\n| **Go Version** | 1.26 or higher |\n| **Executor** | Use `mock` executor for CI (no API keys needed) |\n| **GitHub Token** | Only required for `copilot-sdk` executor: set `GITHUB_TOKEN` env var |\n| **Exit Codes** | 0=success, 1=test failure, 2=config error |\n\n#### Expected Skill Structure\n\n```\nyour-skill\u002F\n├── SKILL.md              # Skill definition\n└── eval\u002F                 # Evaluation suite\n    ├── eval.yaml         # Benchmark spec\n    ├── tasks\u002F            # Task definitions\n    │   └── *.yaml\n    └── fixtures\u002F         # Context files\n        └── *.txt\n```\n\n### For Waza Repository\n\nThis repository includes reusable workflows:\n\n1. **[`.github\u002Fworkflows\u002Fwaza-eval.yml`](.github\u002Fworkflows\u002Fwaza-eval.yml)** - Reusable workflow for running evals\n   ```yaml\n   jobs:\n     eval:\n       uses: .\u002F.github\u002Fworkflows\u002Fwaza-eval.yml\n       with:\n         eval-yaml: 'examples\u002Fcode-explainer\u002Feval.yaml'\n         verbose: true\n   ```\n\n2. **[`examples\u002Fci\u002Feval-on-pr.yml`](examples\u002Fci\u002Feval-on-pr.yml)** - Matrix testing across models\n\n3. **[`examples\u002Fci\u002Fbasic-example.yml`](examples\u002Fci\u002Fbasic-example.yml)** - Minimal workflow example\n\nSee [`examples\u002Fci\u002FREADME.md`](examples\u002Fci\u002FREADME.md) for detailed documentation and more examples.\n\n### Available Grader Types\n\nWaza supports multiple grader types for comprehensive evaluation:\n\n| Grader | Purpose | Documentation |\n|--------|---------|---------------|\n| `code` | Python\u002FJavaScript assertion-based validation | [docs\u002FGRADERS.md](docs\u002FGRADERS.md#code---assertion-based-grader) |\n| `text` | Substring and pattern matching in output | [docs\u002FGRADERS.md](docs\u002FGRADERS.md#text---text-matching-grader) |\n| `file` | File existence and content validation | [docs\u002FGRADERS.md](docs\u002FGRADERS.md#file---file-system-validation) |\n| `diff` | Workspace file comparison with snapshots and fragments | [docs\u002FGRADERS.md](docs\u002FGRADERS.md#diff---workspace-file-comparison) |\n| `behavior` | Agent behavior constraints (tool calls, tokens, duration) | [docs\u002FGRADERS.md](docs\u002FGRADERS.md#behavior---agent-behavior-validation) |\n| `action_sequence` | Tool call sequence validation with F1 scoring | [docs\u002FGRADERS.md](docs\u002FGRADERS.md#action_sequence---tool-call-sequence-validation) |\n| `skill_invocation` | Skill orchestration sequence validation | [docs\u002FGRADERS.md](docs\u002FGRADERS.md#skill_invocation---skill-invocation-sequence-validation) |\n| `prompt` | LLM-as-judge evaluation with rubrics | [docs\u002FGRADERS.md](docs\u002FGRADERS.md#prompt---llm-based-evaluation) |\n| `trigger_tests` | Prompt trigger accuracy detection | [docs\u002FGRADERS.md](docs\u002FGRADERS.md#trigger-tests) |\n\nSee the complete [Grader Reference](docs\u002FGRADERS.md) for detailed configuration options and examples.\n\n## Documentation\n\n- **[Getting Started](docs\u002FGETTING-STARTED.md)** - Complete walkthrough: init → new → run → check\n- **[Demo Guide](docs\u002FDEMO-GUIDE.md)** - 7 live demo scenarios for presentations\n- **[Grader Reference](docs\u002FGRADERS.md)** - Complete grader types and configuration\n- **[Tutorial](docs\u002FTUTORIAL.md)** - Getting started with writing skill evals\n- **[CI Integration](docs\u002FSKILLS_CI_INTEGRATION.md)** - GitHub Actions workflows for skill evaluation\n- **[Token Management](docs\u002FTOKEN-LIMITS.md)** - Tracking and optimizing skill context size\n\n## Contributing\n\nSee [AGENTS.md](AGENTS.md) for coding guidelines.\n\n- Use [conventional commits](https:\u002F\u002Fwww.conventionalcommits.org\u002F) (`feat:`, `fix:`, `docs:`, etc.)\n- Go CI is required: `Build and Test Go Implementation` and `Lint Go Code` must pass\n- Add tests for new features\n- Update docs when changing CLI surface\n\n## Legacy Python Implementation\n\nThe Python implementation has been superseded by the Go CLI. The last Python release is available at [v0.3.2](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Fwaza\u002Freleases\u002Ftag\u002Fv0.3.2). Starting with v0.4.0-alpha.1, waza is distributed exclusively as pre-built Go binaries.\n\n## License\n\nSee [LICENSE](LICENSE).\n","Waza 是一个用于评估AI代理技能的Go语言CLI工具，支持创建、测试、度量并改进技能的质量和效果。其核心功能包括构建评估套件、运行基准测试及跨模型比较结果，通过简洁的命令行界面实现对AI技能的全面管理和优化。该项目特别适用于需要系统性地提升AI代理能力的开发者或团队，无论是进行初步技能开发还是持续性能监控都能提供强大支持。此外，Waza还提供了多种安装方式以适应不同用户环境，并且可以作为Azure Developer CLI的一个扩展来使用，进一步增强了其在云开发场景下的灵活性与便捷性。","2026-06-11 03:32:36","trending"]