[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-2232":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":25,"hasPages":23,"topics":26,"createdAt":10,"pushedAt":10,"updatedAt":27,"readmeContent":28,"aiSummary":29,"trendingCount":16,"starSnapshotCount":16,"syncStatus":14,"lastSyncTime":30,"discoverSource":31},2232,"agentic-harness-engineering","china-qijizhifeng\u002Fagentic-harness-engineering","china-qijizhifeng","Official AHE code — Agentic Harness Engineering: observability-driven automatic evolution of coding-agent harnesses (concurrent w\u002F meta-harness). NexAU-AHE reaches 84.7% ± 2.1 pass@1 on Terminal-Bench 2 (GPT-5.5). 
Lifts GPT-5.4 69.7→77.0% over 10 iters, beats Codex\u002FACE\u002FTraining-Free GRPO; frozen harness transfers to SWE-bench-Verified.","https:\u002F\u002Farxiv.org\u002Fabs\u002F2604.25850",null,"Python",538,61,2,1,0,24,59,375,72,9.38,"MIT License",false,"main",true,[],"2026-06-12 02:00:39","# Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses\n\n\u003Cdiv align=\"left\">\n\n\u003Cp align=\"left\">\n  \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2604.25850\">\u003Cimg alt=\"Paper\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPaper-arXiv-b31b1b.svg?logo=arxiv&logoColor=white\">\u003C\u002Fa>\n  \u003Ca href=\"agentic_harness_engineering.pdf\">\u003Cimg alt=\"PDF\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPDF-Download-ec1c24.svg?logo=adobeacrobatreader&logoColor=white\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fdawning-road.github.io\u002Fblog\u002Fagentic-harness-engineering\">\u003Cimg alt=\"Blog\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FBlog-Dawning_Road-ff7e1b.svg?logo=readthedocs&logoColor=white\">\u003C\u002Fa>\n  \u003Cimg alt=\"License: MIT\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-MIT-yellow.svg\">\n  \u003Cimg alt=\"Python\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpython-%E2%89%A53.13-blue.svg\">\n  \u003Cimg alt=\"Managed with uv\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fmanaged_with-uv-261230?logo=python&logoColor=white\">\n\u003C\u002Fp>\n\n\u003C\u002Fdiv>\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002Ffigures\u002Fbanner.jpg\" alt=\"Agentic Harness Engineering\" width=\"100%\">\n\u003C\u002Fp>\n\n\u003Cp align=\"left\">\n  English | \u003Ca href=\"README_zh.md\">简体中文\u003C\u002Fa>\n\u003C\u002Fp>\n\n---\n\n## 📰 News\n\n- **[2026-05-14]** 🏆 AHE (on GPT-5.5) ranked **#3** on the [Terminal-Bench 2.0 
leaderboard](https:\u002F\u002Fwww.tbench.ai\u002Fleaderboard\u002Fterminal-bench\u002F2.0) with **84.7%** — ranking as of 2026-05-15\n- **[2026-04-30]** ✍️ Blog post on Dawning Road (English & Chinese) — a more detailed account of the exploration behind AHE: [Agentic Harness Engineering](https:\u002F\u002Fdawning-road.github.io\u002Fblog\u002Fagentic-harness-engineering)\n- **[2026-04-28]** 📄 Paper released on arXiv: [Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses](https:\u002F\u002Farxiv.org\u002Fabs\u002F2604.25850)\n- **[2026-04]** 🎉 Framework released\n\n---\n\n## 🎯 Overview\n\n**AHE (Agentic Harness Engineering)** is an open **observability system** for automatically evolving the harness around a coding agent. The base model is held fixed; what evolves are the harness components — system prompts, tool descriptions, tool implementations, middleware, skills, sub-agents, and long-term memory.\n\nAHE rests on three observability layers:\n\n- **Component observability** — [**NexAU**](https:\u002F\u002Fgithub.com\u002Fnex-agi\u002FNexAU.git) decomposes the harness into seven orthogonal, file-level components, each git-tracked so every edit is auditable and revertible.\n- **Experience observability** — *Agent Debugger* distills ~10M-token raw traces into layered, sourced reports; the optimizer reads digests by default but can always drill back to any rollout's raw trace.\n- **Decision observability** — *Evolve Agent* proposes evidence-backed edits, predicts their impact, and is automatically falsified by the next iteration's flipped tasks.\n\nAcross ten `evaluate → analyze → improve` iterations, **AHE (Agentic Harness Engineering)** lifts Terminal-Bench 2 pass@1 from **69.7% to 77.0%** on GPT-5.4, surpasses the hand-written Codex (71.9%) and the self-evolving ACE and TF-GRPO baselines, and produces a frozen harness that transfers without re-evolution to SWE-bench-verified and to four alternate base models, indicating 
that the evolved components encode general engineering experience rather than benchmark-specific tuning.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002Ffigures\u002Ftransfer_model.png\" alt=\"Cross-Model Transfer\" width=\"28%\">\n  \u003Cimg src=\"assets\u002Ffigures\u002Fcase_study.png\" alt=\"Case Study\" width=\"31%\">\n  \u003Cimg src=\"assets\u002Ffigures\u002Ftraining_curve.png\" alt=\"Training Curve\" width=\"39%\">\n\u003C\u002Fp>\n\n---\n\n## 🚀 Quick Start\n\n### 0. Prerequisites\n\n- Python ≥ 3.13\n- [uv](https:\u002F\u002Fdocs.astral.sh\u002Fuv\u002F)\n- tmux\n\n```bash\n# macOS\nbrew install uv tmux\n\n# Linux\ncurl -LsSf https:\u002F\u002Fastral.sh\u002Fuv\u002Finstall.sh | sh\nsudo apt install -y tmux\n```\n\n### 1. Clone + install dependencies\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FCurry09\u002Fagentic-harness-engineering.git\ncd agentic-harness-engineering\nuv sync\n```\n\n> `uv sync` installs every dependency declared in `pyproject.toml`.\n\n### 2. Configure environment variables\n\n```bash\ncp .env.example .env\n```\n\nEdit `.env`. At minimum, set:\n\n| Variable | Purpose |\n|---|---|\n| `LLM_API_KEY` \u002F `LLM_BASE_URL` | Main LLM endpoint (`code_agent` and `evolve_agent` both consume it) |\n| `E2B_API_KEY` | [E2B](https:\u002F\u002Fe2b.dev\u002F) sandbox — see the next subsection for SaaS vs. self-hosted |\n| `SERPER_API_KEY` | Web search used by `evolve_agent` |\n\n`ADB_LLM_*` and `GPT54_LLM_*` are optional — leave them unset to fall back to `LLM_*`, or set them to point ADB \u002F the gpt-5.4 experiment at a stronger model. `LANGFUSE_*`, `BP_HTML_PARSER_*`, and `FEISHU_WEBHOOK` are all optional observability \u002F convenience hooks; see `.env.example` for the full list.\n\n#### E2B sandbox: SaaS vs. self-hosted\n\nAHE runs every rollout inside an E2B sandbox. 
Two deployment modes are supported:\n\n- **SaaS E2B (default).** Set **only** `E2B_API_KEY` and leave `E2B_API_URL` \u002F `E2B_DOMAIN` unset (or commented out). The SDK talks to `e2b.dev` automatically.\n\n  > ⚠️ **Concurrency cap.** SaaS E2B enforces a per-account **concurrent sandbox limit** tied to your tier. If harbor tries to spawn more sandboxes than the cap allows, the extra sandboxes fail to start and the iteration stalls. Before raising parallelism in your harbor \u002F experiment config, check your tier's quota and stay safely under it.\n\n- **Self-hosted E2B cluster.** Set `E2B_API_KEY` **and** point the SDK at your cluster:\n\n  ```dotenv\n  E2B_API_KEY=\"your_e2b_key\"\n  E2B_API_URL=\"https:\u002F\u002Fyour-e2b-host.example.com\"\n  E2B_DOMAIN=\"your-e2b-host.example.com\"\n  ```\n\n  No shared concurrency cap applies, but the cluster's hardware capacity still does.\n\n### 3. Build E2B templates (one-time per dataset)\n\nThe dataset here is a pack from [`laude-institute\u002Fharbor-datasets`](https:\u002F\u002Fgithub.com\u002Flaude-institute\u002Fharbor-datasets) — clone the subset you need and point `--dataset-dir` at its directory.\n\nEvery rollout runs inside an E2B sandbox spawned from a prebuilt template that already has `uv` and the NexAU\u002Fharbor venv at `\u002Fopt\u002Fnexau-venv`. 
Build those templates once before launching:\n\n```bash\n# Build every template declared by the dataset, 16 in parallel\nuv run python scripts\u002Fbuild_templates.py --dataset-dir \u002Fpath\u002Fto\u002Fdataset -j 16\n\n# Resume after a failure: only retry tasks whose latest E2B build status is ERROR\nuv run python scripts\u002Fbuild_templates.py --dataset-dir \u002Fpath\u002Fto\u002Fdataset --retry-failed\n\n# Build a specific subset of tasks\nuv run python scripts\u002Fbuild_templates.py --dataset-dir \u002Fpath\u002Fto\u002Fdataset task_a task_b\n```\n\nThe dataset directory must contain one subdir per task with a `task.toml` declaring `[environment].docker_image` (or an `environment\u002FDockerfile` fallback). Each task's template alias is `\u003Ctask_name>` with `.` replaced by `-`.\n\nThe default packages baked into each template come from `scripts\u002Fbuild_templates.py:DEFAULT_NEXAU_PACKAGES` (a public NexAU + the in-sandbox `NexAU-harbor` variant, intentionally distinct from the host-side `harbor-LJH` in `pyproject.toml`). Override with one or more `--nexau-package \u003Cgit-or-pip-spec>` flags if you need a different revision in the sandbox.\n\nIf your tasks pull from a private Docker registry, also export `DOCKER_REGISTRY_USERNAME` and `DOCKER_REGISTRY_PASSWORD` before invoking the script.\n\n### 4. 
Launch\n\n```bash\n# Run a single experiment in the background via tmux\n.\u002Fscripts\u002Fevolve.sh configs\u002Fexperiments\u002Fexp-003-simple-code-gpt54.yaml\n\n# Launch and auto-attach to the log stream\n.\u002Fscripts\u002Fevolve.sh --attach configs\u002Fexperiments\u002Fexp-003-simple-code-gpt54.yaml\n\n# Batch: launch every experiment under configs\u002Fexperiments\u002F\n.\u002Fscripts\u002Fevolve.sh --batch\n```\n\nCommon tmux operations after launch:\n\n```bash\ntmux ls                         # list sessions\ntmux attach -t \u003Csession>        # attach to a session\n# Ctrl-b d                      # detach (keeps running in background)\ntmux kill-session -t \u003Csession>  # terminate\n```\n\n---\n\n## 🔧 How It Works\n\nThe base model is held fixed; what evolves is the **harness around it**. Each outer iteration is `evaluate → analyze → improve`, built on the three observability layers from the Overview.\n\n### 1. Evaluate — emit traces, not just scores\n\n`harbor` runs the current `code_agent` over the dataset inside isolated E2B sandboxes. Per task it writes:\n\n- `agent\u002Fnexau_in_memory_tracer.cleaned.json` — full step-level trace (messages, tool calls, middleware events)\n- `agent\u002Fnexau.txt` — runtime log (middleware errors, crashes, warnings)\n- `verifier\u002Freward.txt` — pass\u002Ffail outcome\n\nThe **trace, not the pass rate**, is the unit every later step operates on.\n\n### 2. 
Analyze — distill ~10M-token traces into sourced evidence\n\n*Agent Debugger* compresses each iteration's raw traces (routinely >10M tokens) into layered reports:\n\n- `analysis\u002Foverview.md` — cross-task root-cause summary\n- `analysis\u002Fdetail\u002F{task}.md` — per-task deep analysis\n\nThe optimizer reads digests by default, but every claim links back to the originating raw trace, so it can drill down before committing to a change.\n\n> **Note on Agent Debugger licensing.** The current release ships a *partially* open-sourced Agent Debugger; due to company strategy, it cannot be fully open-sourced at this time.\n\n### 3. Improve — evidence-backed, falsifiable edits\n\n*Evolve Agent* may only write inside `workspace\u002F`, which exposes the seven NexAU components: `systemprompt.md`, `code_agent.yaml`, `tool_descriptions\u002F`, `tools\u002F`, `middleware\u002F`, `skills\u002F`, `sub_agents\u002F` (plus `LongTermMEMORY.md`). For every edit it must commit four fields:\n\n1. **Failure evidence** — the failing tasks and trace excerpts that motivate the change\n2. **Root cause** — *why* it failed, not just *what* failed\n3. **Targeted fix** — the change that directly addresses that cause\n4. **Predicted impact** — which tasks should flip to pass, and which are at risk\n\n### 4. Loop — staggered generations enable falsification\n\nEach `runs\u002Fiteration_NNN\u002F` mixes two generations: `input\u002F` holds the workspace produced by loop `NNN-1` (just evaluated), `evolve\u002F` holds what loop `NNN` writes (evaluated next loop). Flips (pass↔fail) on the next eval are attributed back to this loop's edits in `change_evaluation.json` — predictions that don't hold get rolled back or revised. 
The loop terminates on `target_pass_rate` or `max_iterations`.\n\n### Main components\n\n| Component | Role |\n|---|---|\n| `evolve.py` | Main-loop orchestrator |\n| `agents\u002Fcode_agent_simple\u002F` | The coding agent that is being evaluated and evolved |\n| `agents\u002Fevolve_agent\u002F` | The meta-agent that performs the improvement step (built on the [NexAU](https:\u002F\u002Fgithub.com\u002Fnex-agi\u002FNexAU.git) framework) |\n| `agents\u002Fexplore_agent\u002F` | Upstream dataset \u002F source-code exploration agent |\n| `configs\u002F` | `base.yaml` (shared defaults) + `experiments\u002F` (per-experiment overlays) |\n| `scripts\u002F` | tmux launcher wrappers (`evolve.sh`, `evolve-resume.sh`) |\n\n### Directory layout\n\n```\nagentic-harness-engineering\u002F\n├── evolve.py                       # main loop\n├── trace_converter.py              # rollout trace → debugger-friendly JSON\n├── agents\u002F\n│   ├── code_agent_simple\u002F          # the coding agent under evolution\n│   ├── evolve_agent\u002F               # the evolution meta-agent\n│   │   ├── evolve_prompt.md\n│   │   ├── middleware\u002F             # context compaction \u002F failover \u002F ralph loop …\n│   │   ├── skills\u002F                 # agent-debugger-cli \u002F nexau-evolution-guide\n│   │   └── tools\u002F                  # file \u002F shell \u002F web \u002F session tools\n│   └── explore_agent\u002F              # exploration agent (sources + web)\n├── configs\u002F\n│   ├── base.yaml                   # shared defaults\n│   └── experiments\u002F                # one overlay per experiment\n├── scripts\u002F\n│   ├── evolve.sh                   # tmux launcher\n│   └── evolve-resume.sh            # resume helper\n└── .env.example\n```\n\n---\n\n## Configuration (base + overlay)\n\n`configs\u002Fbase.yaml` holds the shared defaults. 
Each `configs\u002Fexperiments\u002Fexp-*.yaml` inherits it via a leading `_base: ..\u002Fbase.yaml` line and overrides only the fields that differ. Any `${ENV_NAME}` reference inside a YAML file is substituted from `.env`.\n\n**Key fields in `base.yaml`:**\n\n| Field | Description |\n|---|---|\n| `path` | Dataset path |\n| `target_pass_rate` | Stop once reached (default 0.95) |\n| `max_iterations` | Maximum number of iterations (default 100) |\n| `harbor_job_timeout_minutes` | Per-harbor-evaluation timeout (0 = unlimited) |\n| `experiment_timeout_minutes` | Total wall-clock budget for the experiment (0 = unlimited) |\n| `llm.api_key \u002F base_url \u002F model` | Main LLM config (usually left as `${LLM_*}`) |\n| `agent_debugger.llm` | Dedicated LLM for ADB (can use a stronger model for debugging) |\n| `notify.feishu_webhook` | Optional Feishu webhook for experiment milestones |\n\n### Dataset configuration\n\nAn experiment's data source is specified via `path` **or** `dataset` — pick one:\n\n| Form | Meaning | Example |\n|---|---|---|\n| `path: \".\u002Fdataset\u002Fxxx\"` | Local dataset directory (relative to the AHE root) | `.\u002Fdataset\u002Fterminal-bench-2` |\n| `path: \"\u002Fabs\u002Fpath\u002Fxxx\"` | Local dataset directory (absolute path) | `\u002Froot\u002Fdataset\u002Fterminal-bench-2` |\n| `dataset: \"\u003Cname>@\u003Cver>\"` | Reference a harbor built-in dataset (no local files required) | `terminal-bench@2.0` |\n\nPublic dataset packs (in the layout AHE expects under `path:`) are published at [`laude-institute\u002Fharbor-datasets`](https:\u002F\u002Fgithub.com\u002Flaude-institute\u002Fharbor-datasets) — clone or download the subset you need and point `path` at its directory.\n\nThe default `path` values in `base.yaml` and `configs\u002Fexperiments\u002F*.yaml` are **placeholders only** — adjust them for your environment, or comment out `path` and uncomment the `dataset` line to use a harbor built-in dataset instead.\n\n---\n\n## CLI 
reference\n\n### `python evolve.py`\n\n| Flag | Description |\n|---|---|\n| `--config \u003Cfile>` | Config file (overlay takes precedence) |\n| `--batch [dir\\|files...]` | Batch mode; defaults to scanning `configs\u002Fexperiments\u002F` |\n| `--experiment \u003Cname>` | Resume an existing experiment (pass the directory name under `experiments\u002F`) |\n| `--start-iteration N` | Start from iteration N (default 1) |\n| `--skip-eval` | Skip evaluation and reuse existing rollouts (for debugging) |\n\n### `.\u002Fscripts\u002Fevolve.sh`\n\nA thin wrapper around `uv run python evolve.py` + tmux.\n\n| Flag | Description |\n|---|---|\n| `\u003Cconfig_file>` | Positional argument: path to the config file |\n| `--experiment \u003Cname>` | Resume an existing experiment |\n| `--start-iteration N` | Starting iteration |\n| `--skip-eval` | Skip evaluation |\n| `--session \u003Cname>` | Custom tmux session name |\n| `--batch` | Launch every overlay in batch mode |\n| `--attach` | Auto-attach after launch |\n\n---\n\n## Common scenarios\n\n**Resume an interrupted experiment from iteration 16:**\n\n```bash\n.\u002Fscripts\u002Fevolve.sh \\\n  --experiment 2026-04-10__23-20-14__gpt54 \\\n  --start-iteration 16 \\\n  configs\u002Fexperiments\u002Fexp-003-simple-code-gpt54.yaml\n```\n\n**Run only evolve_agent without re-running evaluation:**\n\n```bash\n.\u002Fscripts\u002Fevolve.sh \\\n  --experiment \u003Cexisting-exp-dir> \\\n  --skip-eval \\\n  configs\u002Fexperiments\u002Fexp-003-simple-code-gpt54.yaml\n```\n\n---\n\n## License\n\nMIT\n\n---\n\n## Star History\n\n\u003Cpicture>\n  \u003Csource media=\"(prefers-color-scheme: dark)\" srcset=\"https:\u002F\u002Fapi.star-history.com\u002Fsvg?repos=china-qijizhifeng\u002Fagentic-harness-engineering&type=Date&theme=dark\" \u002F>\n  \u003Csource media=\"(prefers-color-scheme: light)\" srcset=\"https:\u002F\u002Fapi.star-history.com\u002Fsvg?repos=china-qijizhifeng\u002Fagentic-harness-engineering&type=Date\" \u002F>\n  \u003Cimg 
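alt=\"Star History Chart\" src=\"https:\u002F\u002Fapi.star-history.com\u002Fsvg?repos=china-qijizhifeng\u002Fagentic-harness-engineering&type=Date\" \u002F>\n\u003C\u002Fpicture>\n","Agentic Harness Engineering (AHE) is an observability system for automatically evolving the harness around a coding agent. Its core function is to optimize the components surrounding a fixed base model, such as system prompts, tool descriptions, and tool implementations, through three observability layers (component, experience, and decision). AHE uses NexAU to decompose the harness into seven orthogonal, file-level components tracked in Git, so every edit is auditable and revertible. It also provides an Agent Debugger that distills large volumes of raw trace data into layered reports, helping the optimizer better understand and improve model performance. AHE suits scenarios that call for continuously optimizing an existing AI coding assistant to improve its performance on specific tasks, particularly in areas such as software engineering verification.","2026-06-11 02:48:58","CREATED_QUERY"]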