[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-11359":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":8,"htmlUrl":8,"language":9,"languages":8,"totalLinesOfCode":8,"stars":10,"forks":11,"watchers":12,"openIssues":13,"contributorsCount":14,"subscribersCount":14,"size":14,"stars1d":15,"stars7d":16,"stars30d":17,"stars90d":14,"forks30d":14,"starsTrendScore":18,"compositeScore":19,"rankGlobal":8,"rankLanguage":8,"license":20,"archived":21,"fork":21,"defaultBranch":22,"hasWiki":23,"hasPages":21,"topics":24,"createdAt":8,"pushedAt":8,"updatedAt":25,"readmeContent":26,"aiSummary":27,"trendingCount":14,"starSnapshotCount":14,"syncStatus":12,"lastSyncTime":28,"discoverSource":29},11359,"cwc-long-running-agents","anthropics\u002Fcwc-long-running-agents","anthropics",null,"Shell",408,46,2,1,0,14,15,133,42,5.02,"Apache License 2.0",false,"main",true,[],"2026-06-12 02:02:31","# Harness Primitives for Long-Running Claude Agents\n\nClaude Code's built-in [`\u002Fgoal`](https:\u002F\u002Fcode.claude.com\u002Fdocs\u002Fen\u002Fgoal) command gives you a generator\u002Fevaluator loop out of the box: set a completion condition and a separate fast model checks it after every turn until it's met. This repo ships the same underlying primitives as short, readable [hooks](https:\u002F\u002Fcode.claude.com\u002Fdocs\u002Fen\u002Fhooks) and a [subagent](https:\u002F\u002Fcode.claude.com\u002Fdocs\u002Fen\u002Fsub-agents), so you can see how each mechanism works and assemble a harness tuned to your project. The patterns come from [Effective Harnesses for Long-Running Agents](https:\u002F\u002Fwww.anthropic.com\u002Fengineering\u002Feffective-harnesses-for-long-running-agents) (Nov 2025) and [Harness Design for Long-Running Application Development](https:\u002F\u002Fwww.anthropic.com\u002Fengineering\u002Fharness-design-long-running-apps) (Mar 2026). We recommend trying both the in-product features and a custom harness to see which fits your workflow.\n\n| | In-product | Custom harness (this repo) |\n|---|---|---|\n| **What runs the loop** | [`\u002Fgoal`](https:\u002F\u002Fcode.claude.com\u002Fdocs\u002Fen\u002Fgoal) | the [primitives below](#the-quality-loop) + a [loop you write](#running-the-loop) |\n| **Who judges \"done\"** | a separate fast model checking your condition | your [`agents\u002Fevaluator.md`](.\u002Fclaude-code-config\u002F.claude\u002Fagents\u002Fevaluator.md) with your prompt |\n| **Where it works** | Claude Code interactive, [`-p`](https:\u002F\u002Fcode.claude.com\u002Fdocs\u002Fen\u002Fheadless), Remote Control | Claude Code, headless, or [Agent SDK](https:\u002F\u002Fdocs.claude.com\u002Fen\u002Fdocs\u002Fclaude-code\u002Fsdk) |\n\nThree primitives form the quality loop:\n\n- **Default-FAIL contract.** Every criterion starts `false`; the agent can't mark it passing without opening evidence first.\n- **Fresh-context evaluator.** A separate agent with no Write\u002FEdit tools grades the work from a context window that never saw the build.\n- **Agent-maintained handoff.** The agent writes its own progress notes and commits to git so the next session picks up cleanly.\n\nTwo more operator-control hooks are included for when you want to watch or intervene. The same patterns translate directly to `PreToolUse`\u002F`Stop` callbacks in the [Agent SDK](https:\u002F\u002Fdocs.claude.com\u002Fen\u002Fdocs\u002Fclaude-code\u002Fsdk).\n\n> Built as the take-home for the Long-Running Agents station at Code with Claude 2026. **These are example ingredients, not a turnkey harness.** Event demo; not maintained and not accepting contributions.\n\n## How to use this repo\n\n**Read and cherry-pick.** Each primitive is one standalone file with no dependency on the others. The [quality-loop table](#the-quality-loop) below maps every one to its example here and its Agent SDK equivalent. Open the file, see how the mechanism works, copy what fits.\n\n**Or copy all of them as a starting point.** In your project, run `claude` and paste:\n\n> Clone github.com\u002Fanthropics\u002Fcwc-long-running-agents into \u002Ftmp, copy its `claude-code-config\u002F.claude\u002F` directory into this project's root, make the hook scripts executable, then walk me through what each hook and the evaluator subagent does and what I'll need to adapt for this project.\n\nor do it yourself:\n\n```bash\ncp -r claude-code-config\u002F.claude \u002Fpath\u002Fto\u002Fyour\u002Fproject\u002F\nchmod +x \u002Fpath\u002Fto\u002Fyour\u002Fproject\u002F.claude\u002Fhooks\u002F*.sh\n```\n\nEither way, this gives you all the examples wired into `.claude\u002F` at once. Before relying on them: point `RESULTS_FILE` at your project's actual results file, adjust the evidence-file pattern in `track-read.sh`, and run `claude` from the directory that contains `.claude\u002F` (hooks are not loaded when launching from a subdirectory).\n\n**If you're on the Agent SDK,** this repo is a pattern reference. The shell hooks here translate one-to-one to `PreToolUse`\u002F`Stop` callbacks. To scaffold an SDK agent from inside Claude Code, install the [`agent-sdk-dev` plugin](https:\u002F\u002Fgithub.com\u002Fanthropics\u002Fclaude-plugins-official\u002Ftree\u002Fmain\u002Fplugins\u002Fagent-sdk-dev) and ask Claude to build an agent that implements whichever of these primitives you want. For a hand-written starting point, see the [autonomous-coding quickstart](https:\u002F\u002Fgithub.com\u002Fanthropics\u002Fclaude-quickstarts\u002Ftree\u002Fmain\u002Fautonomous-coding) or the [agent-sdk-workshop](https:\u002F\u002Fgithub.com\u002Fanthropics\u002Fagent-sdk-workshop) curriculum.\n\n## The quality loop\n\n| Primitive | Claude Code example | Agent SDK equivalent | Enforcement |\n|---|---|---|---|\n| **Default-FAIL contract** | [`hooks\u002Ftrack-read.sh`](.\u002Fclaude-code-config\u002F.claude\u002Fhooks\u002Ftrack-read.sh) + [`hooks\u002Fverify-gate.sh`](.\u002Fclaude-code-config\u002F.claude\u002Fhooks\u002Fverify-gate.sh) | `PreToolUse` callback | hook |\n| **Fresh-context evaluator** | [`agents\u002Fevaluator.md`](.\u002Fclaude-code-config\u002F.claude\u002Fagents\u002Fevaluator.md) subagent (no Write\u002FEdit) | [`evaluator_optimizer.ipynb`](https:\u002F\u002Fgithub.com\u002Fanthropics\u002Fclaude-cookbooks\u002Fblob\u002Fmain\u002Fpatterns\u002Fagents\u002Fevaluator_optimizer.ipynb) | you invoke it |\n| **Agent-maintained handoff** | [`CLAUDE.md`](.\u002Fclaude-code-config\u002F.claude\u002FCLAUDE.md) + [`hooks\u002Fcommit-on-stop.sh`](.\u002Fclaude-code-config\u002F.claude\u002Fhooks\u002Fcommit-on-stop.sh) | system prompt + `Stop` callback | convention + hook backstop |\n\n### Default-FAIL contract\n\nAgents will mark a feature \"passing\" after a unit test or a curl when the UI is visibly broken. Asking nicely in the prompt doesn't reliably stop this. The harness makes \"done\" structural. Every feature is a row in a `test-results.json` file you create in your project:\n\n```json\n{ \"feature-1\": { \"passes\": false }, \"feature-2\": { \"passes\": false } }\n```\n\nThe only evidence that counts is a file matching the patterns in `track-read.sh` (screenshots, console logs, result files), and a `PreToolUse` hook denies any write to the results file unless the agent has first opened one with the Read tool. The agent can't claim success it hasn't observed. (The shipped hook is intentionally simple; see the comments in `verify-gate.sh` for the gaps a production version would close.)\n\n### Fresh-context evaluator\n\nThe builder shouldn't grade its own work. After each feature, you (or your wrapper script) invoke a separate subagent (`agents\u002Fevaluator.md`) with no Write\u002FEdit tools that reviews the diff and the screenshots from a context window that never saw the build, then returns `PASS` or `NEEDS_WORK` with specific findings. On `NEEDS_WORK` the findings become the next builder session's starting prompt, closing the build\u002Fevaluate\u002Frebuild loop. Invoke it from your wrapper with `claude --agent evaluator -p \"\u003Creview prompt>\"`; what your loop does with the verdict is up to you.\n\n### Agent-maintained handoff\n\nA fresh session has no memory of what the previous one did, and when a long session fills its context window Claude Code summarizes the history, which loses detail. So the agent maintains the handoff itself: it scopes each session to one feature, writes to a structured `PROGRESS.md` as it works and re-reads it first thing on every restart, and `git add`s and commits at meaningful checkpoints so `git log` is a second record. `commit-on-stop.sh` is the backstop that catches whatever's still uncommitted at session end. This is the layer most sensitive to model capability; newer models drift less and self-scope better, so re-evaluate how much of `CLAUDE.md` you still need after each model release (see [Re-simplify on model upgrades](#going-further)).\n\nA fourth core piece, a **rubric for subjective work**, isn't shipped here because it's project-specific; see [Going further](#going-further) for how to add one to the evaluator.\n\n## Running the loop\n\nTwo ways to keep the build → evaluate → rebuild cycle going. `\u002Fgoal` is built into Claude Code and works with or without this repo's primitives; the second path wires the contract file and your own `evaluator.md` directly into the loop.\n\n### `\u002Fgoal`: built-in completion checker\n\n```\n\u002Fgoal every feature in PROGRESS.md is implemented, committed, and its tests pass\n```\n\nAfter every turn a separate fast model checks the condition and keeps the session going until it's met. One line, no contract file or hooks. Works the same in interactive Claude Code, [`claude -p`](https:\u002F\u002Fcode.claude.com\u002Fdocs\u002Fen\u002Fheadless), and Remote Control. See the docs for [writing an effective condition](https:\u002F\u002Fcode.claude.com\u002Fdocs\u002Fen\u002Fgoal#write-an-effective-condition) and how `\u002Fgoal` [compares to `\u002Floop` and Stop hooks](https:\u002F\u002Fcode.claude.com\u002Fdocs\u002Fen\u002Fgoal#compare-to-other-autonomous-workflows).\n\n### Your `evaluator.md` as the gate\n\nWhen you want the default-FAIL contract (`test-results.json` + `verify-gate`) enforcing evidence and your own evaluator prompt deciding `PASS`\u002F`NEEDS_WORK`, the loop lives in your code. How depends on where you're running:\n\n| Surface | How |\n|---|---|\n| **Claude Code** | custom [Stop hook](https:\u002F\u002Fcode.claude.com\u002Fdocs\u002Fen\u002Fhooks#stop) runs the evaluator as a fresh `claude` process after each turn and blocks on anything but `PASS` |\n| **Headless** ([`claude -p`](https:\u002F\u002Fcode.claude.com\u002Fdocs\u002Fen\u002Fheadless)) | wrapper script calls `claude --agent evaluator -p` between builds (example below) |\n| [**Agent SDK**](https:\u002F\u002Fdocs.claude.com\u002Fen\u002Fdocs\u002Fclaude-code\u002Fsdk) | separate generator and evaluator `query()` calls; see [`evaluator_optimizer`](https:\u002F\u002Fgithub.com\u002Fanthropics\u002Fclaude-cookbooks\u002Fblob\u002Fmain\u002Fpatterns\u002Fagents\u002Fevaluator_optimizer.ipynb) |\n\n```bash\nwhile grep -q '\"passes\": false' test-results.json; do\n  claude -p \"Read PROGRESS.md and build the next unfinished feature per CLAUDE.md.\"\n  VERDICT=$(claude --agent evaluator -p \"Review the most recent commit against its spec.\")\n  [ \"$(echo \"$VERDICT\" | head -1)\" = \"PASS\" ] || echo \"$VERDICT\" > NEXT_FINDINGS.md\ndone\n```\n\nEach pass is a fresh context. Exit when the contract file has nothing left failing, a cycle makes no changes, or a budget is hit; `touch AGENT_STOP` to stop early. The builder writes `test-results.json` (verify-gate enforces evidence-read first) and the evaluator returns a separate verdict; have your wrapper write the contract on `PASS` instead if you want \"all true\" to mean \"independently confirmed.\"\n\n## Operator controls\n\nTwo more hooks for when you want to watch the loop run or intervene mid-stream:\n\n| Hook | What it does |\n|---|---|\n| [`hooks\u002Fkill-switch.sh`](.\u002Fclaude-code-config\u002F.claude\u002Fhooks\u002Fkill-switch.sh) | Halts every tool call while an `AGENT_STOP` file exists at the project root. |\n| [`hooks\u002Fsteer.sh`](.\u002Fclaude-code-config\u002F.claude\u002Fhooks\u002Fsteer.sh) | Surfaces the contents of `STEER.md` (project root) to the agent once and clears it, so you can redirect mid-run without restarting. |\n\n## Watching it work\n\nEverything the agent does lands on disk, so you can observe a long run from a terminal without any dashboard code:\n\n```bash\nwatch -n 2 'tail -20 PROGRESS.md'                          # its own notes\nwatch -n 5 'git log --oneline -8'                          # work saved\nwatch -n 5 'find screenshots -name \"*.png\" | tail -5'      # what it sees\nwatch -n 2 'wc -l \u003C .claude\u002F.evidence-reads 2>\u002Fdev\u002Fnull'   # evidence reads (resets to 0 after each gated write)\n```\n\nRun them in a `tmux` grid and you have a live view of every primitive.\n\n## Going further\n\nThe next layer of patterns, each covered in depth in [Harness Design for Long-Running Application Development](https:\u002F\u002Fwww.anthropic.com\u002Fengineering\u002Fharness-design-long-running-apps):\n\n| Pattern | What it adds | Reference |\n|---|---|---|\n| **Unattended loop** | Cap session length and have an outer script start the next one (pick next feature, build, evaluate, reset) | [Why naive implementations fall short](https:\u002F\u002Fwww.anthropic.com\u002Fengineering\u002Fharness-design-long-running-apps#why-naive-implementations-fall-short); [`ralph-loop`](https:\u002F\u002Fgithub.com\u002Fanthropics\u002Fclaude-plugins-official\u002Ftree\u002Fmain\u002Fplugins\u002Fralph-loop) plugin |\n| **Planner agent** | A first session that expands a one-line ask into a `BUILD_PLAN.md` the loop then runs against | [The architecture](https:\u002F\u002Fwww.anthropic.com\u002Fengineering\u002Fharness-design-long-running-apps#the-architecture) |\n| **Sprint contracts** | Builder and evaluator agree per-feature on what \"done\" means and write it to a file the hook enforces | [The architecture](https:\u002F\u002Fwww.anthropic.com\u002Fengineering\u002Fharness-design-long-running-apps#the-architecture) and [Removing the sprint construct](https:\u002F\u002Fwww.anthropic.com\u002Fengineering\u002Fharness-design-long-running-apps#removing-the-sprint-construct) |\n| **Grading rubrics** | Give the evaluator scoring principles (functionality, design, craft, originality) with few-shot examples instead of binary pass\u002Ffail | [Frontend design: making subjective quality gradable](https:\u002F\u002Fwww.anthropic.com\u002Fengineering\u002Fharness-design-long-running-apps#frontend-design-making-subjective-quality-gradable); [`frontend-design`](https:\u002F\u002Fgithub.com\u002Fanthropics\u002Fclaude-plugins-official\u002Ftree\u002Fmain\u002Fplugins\u002Ffrontend-design) plugin |\n| **Browser-verified evaluator** | Let the evaluator open the running app itself instead of trusting the builder's screenshots | [Frontend design](https:\u002F\u002Fwww.anthropic.com\u002Fengineering\u002Fharness-design-long-running-apps#frontend-design-making-subjective-quality-gradable) (Playwright MCP usage); add [`@playwright\u002Fmcp`](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Fplaywright-mcp) or Claude in Chrome to `tools:` in `agents\u002Fevaluator.md` |\n| **Re-simplify on model upgrades** | After each model release, comment out harness pieces one at a time and see what's still load-bearing | [What comes next](https:\u002F\u002Fwww.anthropic.com\u002Fengineering\u002Fharness-design-long-running-apps#what-comes-next); [Harnessing Claude's Intelligence](https:\u002F\u002Fclaude.com\u002Fblog\u002Fharnessing-claudes-intelligence) |\n| **Hosted runtime** | Anthropic hosts the loop, sandbox, and scheduling so you don't run any of this yourself | [Claude Managed Agents](https:\u002F\u002Fdocs.claude.com\u002Fen\u002Fdocs\u002Fmanaged-agents) |\n\n---\n","该项目提供了用于构建长时间运行的Claude代理的基础组件。核心功能包括默认失败契约、新鲜上下文评估器和代理维护的手动交接，这些机制确保了任务执行的质量与一致性。技术上，通过简洁可读的钩子和子代理形式提供上述功能，支持在Claude Code内交互式或无头模式下使用，同时也兼容Agent SDK。适用于需要定制化控制循环逻辑、评估标准及状态传递的应用场景，如复杂任务自动化处理等。项目示例源自Anthropic关于有效长周期代理框架的研究成果，旨在为开发者提供灵活且强大的构建基础而非现成解决方案。","2026-06-11 03:31:44","CREATED_QUERY"]