[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-2275":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":12,"openIssues":13,"contributorsCount":14,"subscribersCount":14,"size":14,"stars1d":14,"stars7d":15,"stars30d":16,"stars90d":14,"forks30d":14,"starsTrendScore":14,"compositeScore":17,"rankGlobal":9,"rankLanguage":9,"license":18,"archived":19,"fork":19,"defaultBranch":20,"hasWiki":21,"hasPages":19,"topics":22,"createdAt":9,"pushedAt":9,"updatedAt":23,"readmeContent":24,"aiSummary":25,"trendingCount":14,"starSnapshotCount":14,"syncStatus":26,"lastSyncTime":27,"discoverSource":28},2275,"retort","adrianco\u002Fretort","adrianco","Platform Evolution Engine. Distill the best from the combinatorial mess.",null,"Python",117,6,1,0,4,12,45.74,"Apache License 2.0",false,"main",true,[],"2026-06-12 04:00:14","# Retort\n\n**Platform Evolution Engine** — Distill the best from the combinatorial mess.\n\nRetort applies statistical Design of Experiments (DoE) to systematically evaluate AI-assisted development tooling stacks. It generates fractional factorial designs across languages, coding agents, and frameworks, executes experiments in isolated playpens, scores the results, and promotes or retires stacks based on measured confidence.\n\n## Status: 1.0 Beta\n\n> Retort 1.0 beta is **feature-complete for single-agent `claude-code` experiments** with the `LocalRunner`. The CLI surface, scoring metrics, and storage schema are stable. 
This is the version we used to run three complete experiments totaling 111 runs and $110 in API costs.\n>\n> **What works:** `LocalRunner`, all 8 built-in scorers, fractional-factorial design generation, ANOVA + effects reporting, SQLite storage, resumable sharded runs, parallel bulk evaluation, auto-evaluation skills pipeline, `cost_limit_usd` budget enforcement, and the full experiment lifecycle (screening → trial → production).\n>\n> **What is not yet implemented:** `DockerRunner` (skeleton only — `LocalRunner` is the supported path), agents other than `claude-code` (unsupported agents now raise an error at startup), the `intake`\u002F`scheduler` paths.\n>\n> **Scoring gate:** A run where tests don't execute scores **0 across all metrics** — a Starlette-incompatible Python run that writes perfect code still fails if pytest can't import. `test_coverage == 0` vetoes the entire `ScoreVector`. The `findings` scorer reads `assessment.json` produced by the `evaluate-run` + `file-run-issues` skill pipeline and applies a weighted penalty for critical\u002Fhigh\u002Fmedium\u002Flow findings.\n\n## Experiment 1 Results\n\n📊 **[Browse the live web report →](https:\u002F\u002Frawcdn.githack.com\u002Fadrianco\u002Fretort\u002Fmain\u002Fexperiment-1\u002Freports\u002Fweb\u002Findex.html)** (sortable leaderboard with per-stack drill-downs, token\u002Fcost data, and links to per-run code reviews)\n\nFull data is also in [`experiment-1\u002Freports\u002F`](experiment-1\u002Freports\u002F) — ANOVA, per-stack maturity, full CSV, and the same static-HTML web report. See [`experiment-1\u002Freports\u002Fcomparison.md`](experiment-1\u002Freports\u002Fcomparison.md) for the complete per-run analysis.\n\n**Setup:** 6 languages (python, typescript, go, rust, java, clojure) × 2 models (opus, sonnet) × 2 tooling (none, beads) × 2–3 replicates = 73 runs against the bundled `rest-api-crud` task (CRUD book collection API). **Final tally: 67 of 73 runs completed, 6 failed. 
Total cost ≈ $25, ≈ 25.8M tokens.**\n\n**Evaluation scores** (from `retort evaluate` + `evaluate-run` skill): `PenScore` = 1.0 minus weighted findings penalty (critical×0.25, high×0.10, medium×0.03, low×0.01); `ReqCov` = fraction of TASK.md requirements implemented. Runs where tests didn't execute score 0.\n\n**Multiplicative ANOVA** (default: log10 transform, since cost\u002Ftokens\u002Fduration scale by ratios not constants):\n\n| Response | Significant factors |\n|---|---|\n| `code_quality` | language only (p \u003C 1e-18) |\n| `_tokens` | language + model + tooling |\n| `_cost_usd` | language + model + tooling |\n| `_duration_seconds` | language + model + tooling |\n\nSwitching from additive to multiplicative ANOVA surfaced model + tooling effects on the cost-like metrics that the additive model treated as noise. See [`reports\u002Fanova.txt`](experiment-1\u002Freports\u002Fanova.txt).\n\n**Top stacks by maturity:** **Java sweeps quality 1.000** in all four model\u002Ftooling combinations. Go's `sonnet\u002Fbeads` ties at quality 1.000 too. Generate the full maturity report with `retort maturity --db experiment-1\u002Fretort.db`.\n\n### Per-stack means (live data)\n\nSortable + drill-downable in the [web report](https:\u002F\u002Frawcdn.githack.com\u002Fadrianco\u002Fretort\u002Fmain\u002Fexperiment-1\u002Freports\u002Fweb\u002Findex.html). `PenScore` and `ReqCov` are from the `evaluate-run` bulk evaluation. 
Bold = perfect penalty score.\n\n| Language | Model | Tooling | n | Quality (mean) | Tokens (mean) | Cost (mean) | PenScore | ReqCov |\n|---|---|---|---|---|---|---|---|---|\n| clojure | opus | beads | 2\u002F3 | 0.556 | 723,724 | $0.762 | **1.000** | 1.000 |\n| clojure | opus | none | 3\u002F3 | 0.833 | 409,366 | $0.579 | **1.000** | 1.000 |\n| clojure | sonnet | beads | 3\u002F3 | 0.556 | 722,939 | $0.520 | 0.830 | 0.939 |\n| clojure | sonnet | none | 3\u002F3 | 0.556 | 665,636 | $0.575 | 0.967 | 0.972 |\n| go | opus | beads | 3\u002F3 | 0.985 | 346,215 | $0.491 | **1.000** | 1.000 |\n| go | opus | none | 3\u002F3 | 0.963 | 230,498 | $0.361 | **1.000** | 1.000 |\n| go | sonnet | beads | 3\u002F3 | **1.000** | 476,955 | $0.311 | 0.995 | 0.500 |\n| go | sonnet | none | 3\u002F3 | 0.956 | 435,373 | $0.303 | **1.000** | 1.000 |\n| java | opus | beads | 3\u002F3 | **1.000** | 325,112 | $0.552 | **1.000** | 1.000 |\n| java | opus | none | 3\u002F3 | **1.000** | 217,162 | $0.436 | **1.000** | — |\n| java | sonnet | beads | 3\u002F3 | **1.000** | 611,395 | $0.365 | **1.000** | 1.000 |\n| java | sonnet | none | 3\u002F3 | **1.000** | 494,115 | $0.326 | **1.000** | 1.000 |\n| python | opus | beads | 3\u002F3 | 0.672 | 280,359 | $0.373 | **1.000** | 1.000 |\n| python | opus | none | 3\u002F3 | 0.582 | 91,698 | $0.203 | 0.500 | 0.727 |\n| python | sonnet | beads | 3\u002F3 | 0.474 | 436,753 | $0.262 | 0.400 | 0.800 |\n| python | sonnet | none | 3\u002F3 | 0.430 | 332,390 | $0.226 | 0.995 | 1.000 |\n| rust | opus | beads | 3\u002F3 | 0.833 | 355,099 | $0.481 | **1.000** | 1.000 |\n| rust | opus | none | 3\u002F3 | 0.833 | 150,702 | $0.331 | **1.000** | 0.500 |\n| rust | sonnet | beads | 3\u002F3 | 0.556 | 643,793 | $0.414 | **1.000** | 1.000 |\n| rust | sonnet | none | 3\u002F3 | 0.833 | 395,257 | $0.355 | **1.000** | 1.000 |\n| typescript | opus | beads | 3\u002F3 | 0.733 | 454,220 | $0.512 | **1.000** | 1.000 |\n| typescript | opus | none | 3\u002F3 | 0.733 | 168,703 | 
$0.319 | **1.000** | 1.000 |\n| typescript | sonnet | beads | 3\u002F3 | 0.489 | 637,682 | $0.381 | 0.900 | 0.950 |\n| typescript | sonnet | none | 3\u002F3 | 0.489 | 835,319 | $0.531 | 0.500 | 0.591 |\n\n**Headlines:**\n- **Java, Go, Rust, and Clojure\u002Fopus consistently hit PenScore 1.000** on this task — no findings above threshold.\n- **Python is the outlier.** `python\u002Fopus\u002Fnone` and `python\u002Fsonnet\u002Fbeads` scored 0.50 and 0.40 due to a Starlette 1.0 compatibility break that prevented tests from executing (which zeroes all scores under the new test-gate rule). Other Python cells pass cleanly.\n- **Requirement coverage diverges from penalty score.** `go\u002Fsonnet\u002Fbeads` has PenScore 0.995 but ReqCov 0.500 — the code is high quality but only implements half the spec. The old `code_quality` scorer missed this.\n- **Beads helps only for Go.** Pattern consistent with experiment-1 ANOVA findings.\n\n## Experiment 2 Results — brazil-bench (cross-task)\n\n📊 **[Web report →](https:\u002F\u002Frawcdn.githack.com\u002Fadrianco\u002Fretort\u002Fmain\u002Fexperiment-2\u002Freports\u002Fweb\u002Findex.html)**\n\nFull per-run analysis: [`experiment-2\u002Freports\u002Fcomparison.md`](experiment-2\u002Freports\u002Fcomparison.md).\n\nA second experiment run against [`brazil-bench\u002Fbenchmark-template`](https:\u002F\u002Fgithub.com\u002Fbrazil-bench\u002Fbenchmark-template) — a much harder task: MCP server, CSV ingest of Kaggle data, BDD tests with 16 canonical requirements. **24 completed runs (24 cells), 1 replicate each, screening pass. 
Total cost $29.85, 33.6M tokens (avg $1.24\u002Frun).**\n\n**Single-task ANOVA on `code_quality`:** only language significant (consistent with experiment-1).\n\n**Cross-task ANOVA** (91 rows = experiment-1's 67 + experiment-2's 24, `task` as a factor):\n\n| Response | Significant factors |\n|---|---|\n| `code_quality` | language |\n| `_tokens` | language + model + tooling + **task** + language:tooling + **language:task** + model:tooling + **model:task** |\n| `_cost_usd` | similar + model:tooling |\n| `_duration_seconds` | every main effect + 5 interactions (incl. **model:task**, **tooling:task**) |\n\n**The `model:task` interaction is the headline finding.** Opus vs sonnet behaves *differently* on hard (brazil-bench) vs easy (rest-api-crud) tasks. The simple \"best stack on rest-api-crud is best everywhere\" assumption from experiment-1 doesn't fully generalize for the resource-cost dimensions.\n\n### Experiment-2 evaluation scores (brazil-bench)\n\n`PenScore` and `ReqCov` from `evaluate-run` bulk evaluation. 
Brazil-bench is a harder task (MCP server + CSV ingest + BDD tests) so scores spread more widely.\n\n| Language | Model | Tooling | Quality | PenScore | ReqCov | Notes |\n|---|---|---|---|---|---|---|\n| clojure | opus | beads | 0.833 | **1.000** | — | |\n| clojure | opus | none | 0.833 | **1.000** | — | |\n| clojure | sonnet | beads | 0.833 | **1.000** | 0.889 | |\n| clojure | sonnet | none | 0.833 | **1.000** | — | |\n| go | opus | beads | 1.000 | **1.000** | 1.000 | |\n| go | opus | none | 1.000 | 0.620 | 0.667 | |\n| go | sonnet | beads | 1.000 | 0.650 | 0.000 | 1 critical — BDD scaffold with no data |\n| go | sonnet | none | 1.000 | 0.900 | 0.500 | |\n| java | opus | beads | 1.000 | 0.940 | 0.900 | |\n| java | opus | none | 1.000 | **1.000** | — | |\n| java | sonnet | beads | 1.000 | **1.000** | 1.000 | |\n| java | sonnet | none | 1.000 | 0.000 | 0.067 | **11 critical** — catastrophic failure |\n| python | opus | beads | 0.667 | **1.000** | — | |\n| python | opus | none | 0.667 | **1.000** | 1.000 | |\n| python | sonnet | beads | 0.667 | **1.000** | — | |\n| python | sonnet | none | 0.667 | **1.000** | 0.917 | |\n| rust | opus | beads | 0.833 | 0.450 | 0.143 | 1 critical |\n| rust | opus | none | 0.833 | 0.750 | — | 1 critical |\n| rust | sonnet | beads | 0.833 | 0.400 | 0.143 | |\n| rust | sonnet | none | 0.833 | 0.990 | 1.000 | |\n| typescript | opus | beads | 0.000 | **1.000** | — | |\n| typescript | opus | none | 0.000 | **1.000** | 1.000 | |\n| typescript | sonnet | beads | 0.733 | — | — | |\n| typescript | sonnet | none | 0.733 | **1.000** | 1.000 | |\n\n**Headlines for brazil-bench:**\n- **Python sweeps clean** — all four cells hit PenScore 1.000, reversing the experiment-1 pattern (Starlette issue is task-specific, not language-specific).\n- **Java\u002Fsonnet\u002Fnone catastrophically fails** (11 criticals, PenScore 0.000) — a single cell failure, not a model trend; the same model\u002Ftooling is fine in all other combinations.\n- **Rust struggles on 
the harder task** — all four cells below 1.000, consistent with Rust's complexity wall on multi-component tasks.\n- **`model:task` interaction confirmed by evaluation scores**: opus outperforms sonnet on brazil-bench for Go and Java; sonnet leads on rust\u002Fnone.\n\n**Pareto frontier across both tasks** (`retort report pareto --data combined.csv --metric code_quality --metric -_cost_usd`):\n\n| | |\n|---|---|\n| **Rank 0 (Pareto-optimal)** | `go \u002F sonnet \u002F beads` — quality 1.000, cost $0.311 |\n| Rank 1 | `java \u002F sonnet \u002F none` — quality 1.000, cost $0.326 |\n| Rank 2 | `go \u002F sonnet \u002F none`, `java \u002F opus \u002F none`, `python \u002F opus \u002F none`, `rust \u002F opus \u002F none` |\n\nEvery other stack is dominated. **`go \u002F sonnet \u002F beads` is the only stack that no other stack beats on both quality AND cost simultaneously.**\n\n## Experiment 3 Results — Model Version Comparison (claude-opus-4-6 vs claude-opus-4-7)\n\n📊 **[Full report →](experiment-3\u002Freports\u002Fcomparison.md)**\n\nA quarter-fraction screening experiment on the same brazil-bench task, designed to estimate the effect of upgrading from `claude-opus-4-6` to `claude-opus-4-7`. **6 cells × 2 replicates = 12 run-slots, executed across 4 parallel polecats (May 2026).**\n\n**Model version provenance:** Experiment-2 used model alias `\"opus\"` which resolved to `claude-opus-4-6` via the Claude CLI (claude-opus-4-7 did not yet exist in April 2026). 
Experiment-3 uses explicit versioned model IDs: `claude-opus-4-6` and `claude-opus-4-7`.\n\n**Design (Resolution III quarter-fraction):** Each language is assigned to one model to maximize coverage; model main effect is aliased with the compiled-vs-scripted language contrast.\n\n**Total: 14 runs, $54.94, 52.2M tokens** (avg $3.92\u002Frun, about 3.2× experiment-2's $1.24\u002Frun avg)\n\n| Language | Model | Tooling | test_coverage | code_quality | avg duration | avg tokens | avg cost |\n|---|---|---|---|---|---|---|---|\n| go | claude-opus-4-7 | none | **0.813** | **1.000** | 23.1m | 7,612,020 | $8.13 |\n| java | claude-opus-4-6 | none | 1.000 | **1.000** | 12.9m | 2,472,358 | $2.87 |\n| rust | claude-opus-4-7 | beads | 1.000 | 0.833 | 25.0m | 7,573,138 | $7.34 |\n| python | claude-opus-4-6 | none | 0.897 | 0.667 | 5.2m | 750,075 | $0.98 |\n| typescript | claude-opus-4-6 | beads | 1.000 | 0.733 | 6.7m | 1,545,433 | $1.56 |\n| clojure | claude-opus-4-7 | beads | 1.000 | 0.833 | 20.9m | 5,007,051 | $5.30 |\n\n**Headlines:**\n- **Go + claude-opus-4-7 achieves 81% test coverage vs 42% for claude-opus-4-6 on the same task** — the clearest model-version signal in the dataset. Code quality is identical (1.000 = zero high-severity findings). The 5× longer runtime in experiment-3 correlates with more thorough test writing.\n- **Java and Rust hit 100% test coverage regardless of model version.** Java scores code_quality 1.000 in both experiments; consistent with experiment-2.\n- **TypeScript + beads tooling enables test frameworks.** Experiment-2 typescript\u002Fopus scores 0.0 (no test framework generated); experiment-3 typescript\u002Fclaude-opus-4-6\u002Fbeads scores 1.000 — the `model:tooling` interaction matters more than model version alone.\n- **Runs take 2–9× longer in experiment-3.** Compiled languages (Go, Rust, Clojure) now use a 45-minute budget vs 25 minutes in experiment-2. 
The adaptive timeout system (`_estimate_run_timeout` in `cli.py`) learns per-cell timing from history and sets future budgets automatically.\n- **Same model (claude-opus-4-6), same task, same quality.** Java and Python show identical `code_quality` across the April→May gap, suggesting model quality is stable.\n\n**Scorer fixes (applied in this experiment, rescored across all experiments):** Java MVN `-q` flag silenced surefire output (removed); Clojure test alias was wrong (`-X:test` → `-M:test`); Rust lacked a coverage-command path (added tests-only fallback); TypeScript vitest invoked via broken `.bin\u002F` wrapper (switched to direct `node` invocation with test-pass-rate fallback).\n\n## Experiment Summary\n\n| Experiment | Task | Runs | Cost | Tokens | Key Finding |\n|---|---|---|---|---|---|\n| 1 | rest-api-crud | 67\u002F73 | $25.07 | 25.8M | Language dominates: Java=1.0, Go=0.98, Rust=0.83 |\n| 2 | brazil-bench | 24\u002F24 | $29.85 | 33.6M | model×task interaction; TypeScript model-sensitive |\n| 3 | brazil-bench (model versions) | 14\u002F14 | $54.94 | 52.2M | Opus-4.7 adds 2× test coverage on Go |\n| **Total** | | **105\u002F111** | **$109.86** | **111.6M** | Java\u002FGo are production-ready; Opus-4.7 improves coverage |\n\n## Installation\n\n### Prerequisites\n\n`pip install` only fetches the Python deps. 
To actually run experiments you also need:\n\n| Requirement | Why | Install |\n|---|---|---|\n| **Python 3.11+** | Runtime | https:\u002F\u002Fwww.python.org\u002Fdownloads\u002F |\n| **C\u002FC++ toolchain + cmake** | `OApackage` (orthogonal arrays) is a C++ extension; no manylinux wheel on every platform | `apt install build-essential cmake` (Debian\u002FUbuntu) \u002F `xcode-select --install` (macOS) |\n| **`claude` CLI, authenticated** | The only currently-implemented agent runner shells out to `claude -p ...` | https:\u002F\u002Fdocs.claude.com\u002Fclaude-code · run `claude` once to log in |\n| **`bd` (beads) CLI** | Required only if any factor uses `tooling: beads` (the bundled examples do). The agent runs `bd init`\u002F`bd create` inside its playpen | https:\u002F\u002Fgithub.com\u002Fsteveyegge\u002Fbeads |\n| **Per-language toolchains** | The scorer builds and tests the generated code. Install the toolchain for every language you list as a factor level. | `python` (already), `node` ≥ 20 + `npm` for typescript, `go` ≥ 1.22, `rustup` for rust |\n| **Docker** *(optional, future)* | `DockerRunner` is a skeleton; `LocalRunner` is the supported path today. Only install if you plan to develop the Docker path. | https:\u002F\u002Fdocs.docker.com\u002Fget-docker\u002F |\n\n### Install retort\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fadrianco\u002Fretort.git\ncd retort\npip install -e \".[dev,test]\"\n```\n\n### Devcontainer \u002F Codespaces\n\n`.devcontainer\u002F` provisions Python 3.12 + the C++ toolchain + Node + Go + Rust + the `claude` and `bd` CLIs via `post-create.sh`. Open the repo in GitHub Codespaces or VS Code \"Dev Containers: Reopen in Container\" and the prereqs are handled automatically. 
You'll still need to authenticate `claude` interactively the first time.\n\n### Verify\n\n```bash\nretort --help                # CLI loads → Python deps OK\nclaude --version             # Claude CLI present\nbd --version                 # Beads present (only needed for tooling=beads)\n```\n\n## Quick Start\n\n```bash\n# Initialize a workspace\nretort init my-eval\ncd my-eval\n\n# Edit workspace.yaml to define your factors, responses, and tasks.\n# Add design.fraction: 0.25 to run a quarter-fraction instead of the full design.\n# Then generate a screening design matrix (optional — run does this automatically)\nretort design generate --phase screening --config workspace.yaml -o design.csv\n\n# Execute experiment runs (uses design.fraction from workspace.yaml automatically)\nretort run --phase screening --config workspace.yaml\n\n# Predict unrun cells from a fractional run\nretort analyze --data results.csv -r code_quality -f language -f model -f tooling --predict\n\n# Or pass a hand-edited CSV to run any arbitrary subset\nretort run --phase screening --config workspace.yaml --design design.csv\n\n# Compute main effects and interactions\nretort report effects --db retort.db --matrix-id 1 --metric code_quality\n\n# Evaluate a promotion gate\nretort promote my-stack --from screening --to trial \\\n    --evidence '{\"p_value\": 0.05}' --config workspace.yaml\n```\n\n## Running at scale with Gas Town (optional)\n\nRetort runs standalone — `pip install` + `claude` CLI is enough to drive every command above. There is **no Gas Town dependency**.\n\nThat said, real experiments are long-running, parallelizable, and benefit from an orchestrator. [Gas Town](https:\u002F\u002Fgithub.com\u002Fsteveyegge\u002Fgastown) is the orchestrator we use during development; it adds:\n\n- **Parallel execution.** `gt sling` dispatches a slice of the design to a polecat (a worker agent in its own git worktree). Multiple polecats share one `retort.db` via `retort run --shard N\u002FM --resume`. 
The `--shard` partition is a deterministic hash, so two polecats never both pick the same cell, and per-run sqlite commits keep concurrent writers safe.\n- **Patrol + escalation.** `witness` watches the merge queue; `refinery` patrols it; mail\u002Fescalations route to the mayor agent if anything sticks.\n- **Auto-evaluation.** With gt + `bd` (beads) installed, the `evaluate-run` and `file-run-issues` skills file findings as tracked beads in your project — survives session resets and shows up in queries.\n\nPattern:\n\n```bash\ngt sling re-ucc retort --crew alpha   --args \"retort run --phase screening --config experiment-2\u002Fworkspace.yaml --resume --shard 0\u002F4\"\ngt sling re-ucc retort --crew bravo   --args \"retort run --phase screening --config experiment-2\u002Fworkspace.yaml --resume --shard 1\u002F4\"\ngt sling re-ucc retort --crew charlie --args \"retort run --phase screening --config experiment-2\u002Fworkspace.yaml --resume --shard 2\u002F4\"\ngt sling re-ucc retort --crew delta   --args \"retort run --phase screening --config experiment-2\u002Fworkspace.yaml --resume --shard 3\u002F4\"\n```\n\n**Without Gas Town, the same parallelism works in plain bash:**\n\n```bash\nfor s in 0 1 2 3; do\n    nohup retort run --phase screening --config experiment-2\u002Fworkspace.yaml \\\n        --resume --shard $s\u002F4 > shard-$s.log 2>&1 &\ndone\nwait\n```\n\n**Caveats if you choose the gt path:**\n\n- Concurrent runs multiply per-second token usage. 
Keep the shard count to a value the Anthropic API tier comfortably supports — start at 2× and monitor rate-limit headers before going higher.\n\n## CLI Commands\n\n| Command | Description |\n|---------|-------------|\n| `retort init \u003Cname>` | Create a new workspace with config template and SQLite database |\n| `retort design generate` | Generate a fractional factorial design matrix (screening or characterization) |\n| `retort run` | Execute experiment runs: design matrix, playpen provisioning, scoring, storage. Honors `design.fraction` from config; `--design \u003Ccsv>` overrides with a manually-edited design file |\n| `retort promote` | Evaluate promotion gates for stack lifecycle transitions |\n| `retort report effects` | Compute and export main effects and interaction effects (text, JSON, CSV) |\n| `retort export csv` | Export experiment runs + scores to CSV for `retort analyze` and external tools |\n| `retort maturity` | Score each stack's maturity (replicate agreement, completion rate, score level, coverage) and suggest a lifecycle phase |\n| `retort report web` | Generate static HTML reports (sortable leaderboard + per-stack drill-downs); respects experiment.visibility for private experiments |\n| `retort export merge` | Combine multiple experiment CSVs into one with an experiment-tag column, for cross-experiment ANOVA |\n| `retort report pareto` | Identify Pareto-optimal stacks across multiple objectives; minimize cost-like metrics with `-` prefix |\n| `retort run --shard N\u002FM` | Run only the slice of cells owned by shard N (of M); deterministic partition for parallel polecats sharing one retort.db |\n| `retort analyze` | Run ANOVA analysis on experiment data with optional residual diagnostics |\n| `retort intake` | Ingest a new candidate (factor level) and generate D-optimal augmentation runs |\n| `retort report dashboard` | Show full workspace status dashboard (experiments, lifecycle, budget) |\n| `retort plugin list` | List installed retort 
plugins and their scorer\u002Frunner contributions |\n| `retort plugin show \u003Cname>` | Show details for a specific scorer or runner |\n\n## Configuration\n\nRetort workspaces are configured via `workspace.yaml`:\n\n```yaml\nfactors:\n  language:\n    levels: [python, typescript, go]\n  agent:\n    levels: [claude-code, cursor, copilot]\n  framework:\n    levels: [fastapi, nextjs, stdlib]\n\nresponses:\n  - code_quality\n  - token_efficiency\n  - build_time\n  - test_coverage\n\ntasks:\n  - source: bundled:\u002F\u002Frest-api-crud\n\nplaypen:\n  runner: local            # 'local' is the supported path; 'docker' is a skeleton\n  replicates: 3\n  timeout_minutes: 30\n  cost_limit_usd: 50.00   # optional: abort experiment if accumulated cost exceeds this\n\ndesign:\n  screening_resolution: 3\n  significance_threshold: 0.10\n  fraction: 0.25            # optional: quarter-fraction (6 cells from 24); omit for full fractional factorial\n\npromotion:\n  screening_to_trial: { p_value: 0.10 }\n  trial_to_production: { posterior_confidence: 0.80 }\n```\n\nWhen `fraction` is set, `retort run` automatically uses a balanced subset that covers every factor level at least once. After the run, use `retort analyze --predict` to project point estimates and 95% CIs for every unrun cell. 
You can also override the fraction with a manually-edited CSV via `retort run --design design.csv`.\n\n## Architecture\n\n```\nsrc\u002Fretort\u002F\n├── cli.py              # Click-based CLI entry point\n├── analysis\u002F           # ANOVA, Bayesian updating (conjugate NIG), Pareto frontier, residual diagnostics\n├── config\u002F             # Pydantic config schema and YAML loader\n├── design\u002F             # Factor registry and fractional factorial generator (pyDOE3)\n├── playpen\u002F            # Isolated execution: DockerRunner, task loading, prompt building\n├── plugins.py          # Pluggy-based plugin system for custom scorers and runners\n├── scoring\u002F            # Score collection with pluggable scorers (code_quality, token_efficiency, build_time)\n├── promotion\u002F          # Lifecycle state machine, configurable gates, immutable changelog\n├── reporting\u002F          # Effects computation, export (text, JSON, CSV), and status dashboard\n├── scheduler\u002F          # Candidate intake, D-optimal augmentation, budget tracking, run queue\n└── storage\u002F            # SQLAlchemy models and Alembic migrations (SQLite)\n```\n\n### Key Concepts\n\n- **Factors**: Variables under test (language, agent, framework) with discrete levels\n- **Design Matrix**: Fractional factorial design that efficiently covers the factor space\n- **Playpen**: Isolated execution environment where each experiment run executes\n- **Scoring**: Pluggable metrics collected from run artifacts\n- **Promotion**: Evidence-based lifecycle transitions (candidate → screening → trial → production → retired)\n\n## Implementation Status\n\nHonest accounting of what is tested end-to-end versus implemented but not exercised.\n\n| Area | Status | Notes |\n|------|--------|-------|\n| **Design generation** | ✅ Working | Fractional factorial (pyDOE3), mixed-level support, `design.fraction` config, `--design \u003Ccsv>` override, `DesignMatrix.from_csv()`, `retort analyze --predict` for 
unrun cells. Full unit test coverage. |\n| **LocalRunner + scoring** | ✅ Working | Exercised end-to-end across 111 runs covering 6 languages. 8 scorers: `code_quality`, `test_coverage`, `test_quality`, `token_efficiency`, `defect_rate`, `maintainability`, `idiomatic`, `findings`. Scoring gate: `test_coverage == 0` vetoes all scores. `evaluate-run` + `file-run-issues` skills auto-invoked after each run. `retort evaluate --workers N` for bulk re-evaluation. |\n| **Resume \u002F sharding** | ✅ Working | `--resume` skips recorded `(config, replicate)` pairs; `--retry-failed` retries failures. `--shard N\u002FM` deterministic partition for parallel polecats. Per-run DB commit = at most one lost run on interrupt. Run artifacts archived to `runs\u002F\u003Ccell>\u002Frep\u003CN>\u002F`. |\n| **Factor system** | ✅ Working | `language`, `model` (with alias table, versioned IDs), `tooling` (beads instructions), `prompt` (named `.md` files in `prompts\u002F`), `org_context`. Any additional factor flows through `stack.extra` automatically. |\n| **Budget enforcement** | ✅ Working | `cost_limit_usd` in config is enforced during `retort run` — experiment aborts if accumulated cost exceeds the limit. Error surfaced immediately via `click.ClickException`. |\n| **Agent validation** | ✅ Working | Unsupported agents raise a clear error at experiment startup, before any runs execute. Only `claude-code` is implemented. |\n| **MLflow sink** | ✅ Implemented | Logs factor levels, scores, and telemetry per run. Enabled by `mlflow:` block in `workspace.yaml`. Not covered by integration tests. |\n| **Local inference cost** | ✅ Working | `local_inference_cost` block computes `_cost_usd` from wall-clock duration × (electricity + amortized hardware) for local\u002Foffline models. |\n| **DockerRunner** | 🟡 Implemented | `provision()` and `execute()` implemented with timeout and teardown. **Not validated end-to-end** — use `runner: local` for now. 
|\n| **Promotion + lifecycle** | 🟡 Code present | State machine, gates, changelog, `retort promote`. Exercised in unit tests; not yet driven by a real promotion decision. |\n| **ANOVA \u002F analysis** | 🟡 Lightly exercised | `retort analyze` with additive and multiplicative transforms, residual diagnostics, `--predict` for fractional designs. Verified on real experiment-1\u002F2\u002F3 data for main effects; interaction + Bayesian paths lightly tested. |\n| **Reporting** | 🟡 Mostly working | `retort report effects`, `report web`, `report pareto`, `report wardley`, `report aliasing`, `report dashboard` all implemented. `wardley` and `aliasing` verified in code; not exercised against live experiment data. |\n| **Scheduler \u002F intake** | 🔴 Stub | `retort intake` (D-optimal augmentation) implemented but untested against real candidates. |\n| **Multi-agent** | 🔴 Not implemented | Only `claude-code` is wired in `LocalRunner`. Unsupported agents raise a `click.ClickException` at experiment startup — no silent skipping. |\n\n## Development\n\n```bash\n# Install dev dependencies\npip install -e \".[dev,test]\"\n\n# Run tests\npytest\n\n# Lint\nruff check src\u002F tests\u002F\n\n# Type check\nmypy src\u002Fretort\u002F\n```\n\n## License\n\nApache-2.0 — see [LICENSE](LICENSE).\n","Retort is a platform evolution engine that applies statistical Design of Experiments (DoE) to systematically evaluate AI-assisted development tooling stacks. Its core functions include generating fractional factorial designs across languages, coding agents, and frameworks, executing experiments in isolated environments, and promoting or retiring stacks based on scoring results. Technical features include full lifecycle management for single-agent `claude-code` experiments, multiple built-in scorers, and ANOVA and effects reporting. The current version implements the local runner, SQLite storage, and related features, but some parts such as the Docker runner are not yet complete. The project suits scenarios that need to optimize and select the best development tooling stack, especially those involving combinations of multiple languages or frameworks.",2,"2026-06-11 02:49:14","CREATED_QUERY"]