[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-77931":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":10,"languages":10,"totalLinesOfCode":10,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":16,"stars7d":17,"stars30d":18,"stars90d":15,"forks30d":15,"starsTrendScore":19,"compositeScore":20,"rankGlobal":10,"rankLanguage":10,"license":21,"archived":22,"fork":22,"defaultBranch":23,"hasWiki":22,"hasPages":22,"topics":24,"createdAt":10,"pushedAt":10,"updatedAt":30,"readmeContent":31,"aiSummary":32,"trendingCount":15,"starSnapshotCount":15,"syncStatus":16,"lastSyncTime":33,"discoverSource":34},77931,"distributed-system-testing","shenli\u002Fdistributed-system-testing","shenli","AI-agent skills for distributed-systems testing","",null,217,12,3,1,0,2,4,62,6,55.54,"MIT License",false,"main",[25,26,27,28,29],"agent-skills","ai-agents","chaos-engineering","distributed-systems","testing","2026-06-11 04:06:44","# Distributed Systems Testing Skills\n\n**Two skills for AI coding agents that design and run claim-driven\ntests for distributed and stateful systems.** Together they produce a\nstructured Markdown test plan and a findings report with 9-state\nverdicts and an explicit SUT \u002F harness \u002F checker \u002F environment blame\nclassification. A reviewer reads the two artifacts and decides\nwhether to ship; nothing else has to be re-run.\n\nWorks with Claude Code, Codex, Copilot CLI, Cursor, Gemini, or any\nagent that reads Markdown and runs shell. The skills are plain\nSKILL.md files. The agent executes them; the plan and findings\nreport are the output.\n\nOne skill designs the plan. The other runs it. A plan starts from\nthe product's claims, generates hypotheses tied to those claims, and\nwrites scenarios named after the claim each tries to falsify. 
For\nconsistency-critical scenarios, each scenario also binds an abstract\nmodel (`register | queue | log | lock | lease | ledger | …`) to an\noperation-history schema, a named checker, and a nemesis with\nobservable landing evidence. The plan ends with a coverage adequacy\nargument and a conservative confidence statement.\n\n## Why\n\nThe default for testing distributed and stateful systems — write a\nfew integration tests and call it done — finds a small fraction of\nthe bugs that actually break these systems in production: partial\nnetwork partitions, non-deterministic concurrency, crash-recovery,\nupgrade\u002Frollback, idempotency under replay, timing-sensitive ordering.\n\nThese skills enforce an opinionated workflow that pulls from the\nfield's hard-won knowledge:\n\n- **Claim-driven, not test-driven.** Start from what the product\n  promises. Every scenario falsifies one claim under one fault. A\n  test named after its claim is harder to weaken than one named\n  after its setup.\n- **Coverage adequacy is a deliverable.** The plan ends with an\n  argument that the chosen scenarios are *enough* to ship, plus an\n  honest list of what stays unverified.\n- **Reuse the SUT's own toolbox.** The execute skill discovers\n  existing tests, runbooks, and fault-injection scaffolding before\n  inventing anything new.\n- **Model + history + checker, not just chaos.** For safety,\n  durability, idempotency, isolation, ordering, or membership\n  claims, every scenario declares an abstract model, an\n  operation-history schema, a named checker (linearizability,\n  serializability, session-consistency, no-lost-ack, exactly-once,\n  …), and how it treats ambiguous outcomes (timeouts, unknown\n  commits, retries). Chaos plus a model and a checker, not chaos\n  alone.\n- **No silent passes.** Every PASS cites oracle execution evidence\n  *and* the signal proving the fault actually fired. 
Verdicts come\n  from a 9-state set, so \"the chaos script ran cleanly\" can't be\n  read as \"the claim survived the fault.\" Every FAIL carries a\n  SUT \u002F harness \u002F checker \u002F environment blame tag so reproducers\n  reach the right queue.\n\n## What you get\n\nEnd-to-end, the two skills produce:\n\n```\ntesting-plans\u002F\u003Cslug>.md             ← plan with §0–§9 (see below)\ntest-sessions\u002F\u003CUTC>\u002F\n  ├── session-log.md                 ← timeline + toolbox + env probe\n  ├── logs\u002F                          ← per-scenario stdout\u002Fstderr\n  ├── metrics\u002F                       ← metric snapshots\n  ├── artifacts\u002F                     ← ephemeral harnesses, dumps\n  └── findings\u002F\n      ├── \u003Cscenario>.md              ← per-scenario verdict (written as run proceeds)\n      └── report.md                  ← summary + adequacy + confidence delta\n```\n\nThe plan structure (a reviewer can read this and decide whether to\nship without re-running the tests):\n\n```\n0. Architectural summary       — system as it actually exists\n1. Scope\n1b. Claims under test          — the spine\n1c. Missing claims discovered  — docs ↔ code drift\n2. SUT model\n3. Existing test inventory     — what's already covered\n4. Failure-mode hypotheses     — tied to claim IDs\n5. Coverage matrix             — claim × hypothesis\n6. Technique selection         — from the catalog\n6b. Environment requirements\n7. 
Scenarios                   — each named after the claim, with\n                                  Target test file + Skeleton\n   7.M Model \u002F history \u002F       — mandatory when the scenario falsifies\n       checker discipline        a claim in {safety, durability,\n                                  idempotency, isolation, ordering,\n                                  membership}: model under test,\n                                  operation-history schema, named\n                                  checker, nemesis + landing evidence,\n                                  ambiguous-outcome handling, reduction\n                                  plan (SUT\u002Fharness\u002Fchecker\u002Fenv blame)\n7b. Coverage adequacy argument — why these tests are enough\n7c. Residual uncertainty       — what stays unverified, and why ok\n7d. Confidence statement       — the reviewer's verdict\n8. What this plan does NOT cover\n9. Open questions \u002F followups\n```\n\n### Example §7.M block (excerpt from a plan)\n\n```\n### Scenario S3: linearizable_append_under_partition\n- Falsifies if it FAILs: C1 (every acknowledged append is durable\n  and linearisable), C5 (leader election completes within 5s)\n- Workload: 8 clients, 70% append \u002F 30% read, 5min, key-skew zipf\n- Faults: asymmetric partition isolating current leader at T+60s\n  for 30s\n- Oracle: linearizability via Porcupine over per-key histories\n\n§7.M (model \u002F history \u002F checker discipline)\n- Model under test:    log\n- Operation history:   default 11-field schema (op id, process id,\n                       invoke\u002Fcomplete ts, op type, key, input,\n                       output, error, timeout marker, node seen,\n                       fault epoch). 
Recorded in-process + server-\n                       side audit.\n- Checker:             linearizability (Porcupine) per-key, then\n                       no-lost-ack against final state\n- Nemesis + landing:   asymmetric-partition (iptables drop one\n                       direction). Landing evidence = iptables drop\n                       counter goes 0 → 14,712 over the 30s window\n                       AND raft log emits \"leader-lost; starting\n                       election\" within 2s of injection.\n- Ambiguous outcomes:  timeouts → timeout_marker=true, complete_ts\n                       =null, treated as could-have-succeeded;\n                       retries are separate ops sharing input\n- Reduction plan:      if FAIL, bisect fault window + fix seed, then\n                       classify SUT \u002F harness \u002F checker \u002F environment\n                       per references\u002Ftest-case-reduction.md\n```\n\n### Example findings-report row\n\n| ID | Verdict             | Nemesis landing evidence                              | Reduction class |\n|----|---------------------|-------------------------------------------------------|-----------------|\n| S3 | `PASS-hardening`    | iptables ctr 0→14,712; raft re-election at T+1.8s     | n\u002Fa             |\n| S4 | `FAIL-reproducible` | partition landed; Elle: G2-item anomaly on key K17    | SUT             |\n| S7 | `INCONCLUSIVE-fault-not-proven` | iptables rule installed but counter stayed 0 — wrong chain | harness |\n| S9 | `PARTIAL-model`     | landing ok; checker covered per-key, not cross-key    | n\u002Fa             |\n\n(The full findings template carries Oracle, Oracle execution evidence,\nartifact links, an adequacy-vs-plan section, and a confidence delta —\nsee `skills\u002Fexecuting-distributed-system-tests\u002Fassets\u002Ffindings-report-template.md`.)\n\n## Install (one line, any agent)\n\nPaste this at any AI coding agent (Claude Code, Codex, Copilot CLI,\nCursor, Gemini, or anything 
else that reads Markdown and runs shell):\n\n```\nRead https:\u002F\u002Fraw.githubusercontent.com\u002Fshenli\u002Fdistributed-system-testing\u002Fmain\u002FINSTALL.md\nand follow the instructions to install and configure\ndistributed-testing-skills for this agent.\n```\n\nThe agent fetches [`INSTALL.md`](INSTALL.md), clones the repo to\n`~\u002F.local\u002Fshare\u002Fdistributed-testing-skills\u002F`, and wires the skills in\n(symlinks under `~\u002F.claude\u002Fskills\u002F` for Claude Code, a pointer block\nin `~\u002FAGENTS.md` for other agents).\n\nAfter that, ask any agent on the machine to \"design a test plan for\nthis system\" or \"execute the plan at X\" and it'll follow the\nSKILL.md workflow.\n\n### Update\n\n**Paste the same one-liner again.** `INSTALL.md` is idempotent:\nif the install path exists, it does `git pull --ff-only`; if not,\nit does `git clone`. Symlinks always point at the cloned content\nso they pick up the new version automatically. The `~\u002FAGENTS.md`\npointer block uses HTML markers and is replaced cleanly on each\nrun — no duplication.\n\nIf you have local edits to the cloned skills, `git pull --ff-only`\nwill fail; the agent will stop and ask before discarding them.\n\n### Manual install (if you'd rather see what's happening)\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fshenli\u002Fdistributed-system-testing.git \\\n    ~\u002F.local\u002Fshare\u002Fdistributed-testing-skills\n\n# Claude Code: symlink under ~\u002F.claude\u002Fskills\u002F\nmkdir -p ~\u002F.claude\u002Fskills\nln -snf ~\u002F.local\u002Fshare\u002Fdistributed-testing-skills\u002Fskills\u002Fdesigning-distributed-system-tests \\\n    ~\u002F.claude\u002Fskills\u002Fdesigning-distributed-system-tests\nln -snf ~\u002F.local\u002Fshare\u002Fdistributed-testing-skills\u002Fskills\u002Fexecuting-distributed-system-tests \\\n    ~\u002F.claude\u002Fskills\u002Fexecuting-distributed-system-tests\n\n# Codex \u002F Copilot CLI \u002F Cursor \u002F Gemini \u002F 
others: see INSTALL.md\n```\n\n## Usage\n\nOnce the skills are installed, you have two ways to drive them:\n\n**Casual ask (Claude Code with auto-trigger):**\n\n```\nDesign a project-wide test plan for this codebase.\n```\n\n```\nExecute the plan at .\u002Ftesting-plans\u002F\u003Cslug>.md against this codebase.\n```\n\nThe skill descriptions pick up natural phrasing like \"design a\ntest plan\", \"execute the plan\", \"run stability tests\", \"design a\nrelease validation plan\", etc.\n\nFor a specific mode, output path, or a non-auto-trigger agent,\n[`USAGE.md`](USAGE.md) has copy\u002Fpaste prompts for every workflow\n(design and execute, in their respective modes) plus tips on scope,\nenv probing, and long-run checkpointing.\n\n## The two skills\n\n### `designing-distributed-system-tests`\n\nWalks the repo, extracts the claims the product makes, generates\nhypotheses tied to those claims, picks techniques from the catalog,\nand writes a structured Markdown plan with a coverage adequacy\nargument and a confidence statement. For consistency-critical\nscenarios, the plan fills a §7.M block per scenario: model under\ntest, operation-history schema, named checker, nemesis + landing\nevidence, ambiguous-outcome handling, reduction plan. Details:\n[`history-discipline.md`](skills\u002Fdesigning-distributed-system-tests\u002Freferences\u002Fhistory-discipline.md).\n\nTwo modes: **change-scoped** (a specific commit or PR) and\n**project-wide** (a holistic plan with existing-test inventory and\ngap analysis).\n\n### `executing-distributed-system-tests`\n\nReads the plan, discovers the SUT's toolbox, probes the environment,\nand runs scenarios with checkpoint discipline. 
Per scenario: captures\nlanding evidence for the fault, runs the green-but-broken and\nweak-oracle audits, assigns a verdict from the 9-state taxonomy in\n[`verdict-taxonomy.md`](skills\u002Fexecuting-distributed-system-tests\u002Freferences\u002Fverdict-taxonomy.md),\nand classifies every FAIL into SUT \u002F harness \u002F checker \u002F environment\nbefore filing. Produces a findings report with adequacy-vs-plan\nassessment and confidence delta.\n\nTwo modes: **default** (read-only on the SUT, ephemeral harnesses\nunder the session dir) and **author mode** (writes scenario skeletons\ndeclared in the plan's §7 into the SUT for review).\n\n## Technique catalog\n\nEight reference files distilled from the field's literature:\n\n| File | When to reach for it |\n|---|---|\n| [`catalog-index.md`](skills\u002Fdesigning-distributed-system-tests\u002Freferences\u002Fcatalog-index.md) | Selector page — start here |\n| [`jepsen-and-elle.md`](skills\u002Fdesigning-distributed-system-tests\u002Freferences\u002Fjepsen-and-elle.md) | Linearizability \u002F serializability under faults |\n| [`deterministic-simulation.md`](skills\u002Fdesigning-distributed-system-tests\u002Freferences\u002Fdeterministic-simulation.md) | Reproducible bugs from a seed; async heavy code |\n| [`chaos-and-fault-injection.md`](skills\u002Fdesigning-distributed-system-tests\u002Freferences\u002Fchaos-and-fault-injection.md) | Real-cluster partial \u002F asymmetric faults |\n| [`fuzzing.md`](skills\u002Fdesigning-distributed-system-tests\u002Freferences\u002Ffuzzing.md) | Input or concurrency fuzzing under sanitizers |\n| [`formal-methods-tla.md`](skills\u002Fdesigning-distributed-system-tests\u002Freferences\u002Fformal-methods-tla.md) | Protocol correctness at design time |\n| [`property-and-metamorphic.md`](skills\u002Fdesigning-distributed-system-tests\u002Freferences\u002Fproperty-and-metamorphic.md) | Algebraic-law \u002F metamorphic-relation testing |\n| 
[`performance-and-benchmarking.md`](skills\u002Fdesigning-distributed-system-tests\u002Freferences\u002Fperformance-and-benchmarking.md) | Tail latency \u002F throughput \u002F fairness |\n| [`crash-recovery-and-upgrade.md`](skills\u002Fdesigning-distributed-system-tests\u002Freferences\u002Fcrash-recovery-and-upgrade.md) | Durability, replay, idempotency, mixed-version |\n\nEach follows the same shape: when to reach for it, what it detects\nwell, what it misses, concrete tools, papers, cost signal, plan\nchecklist. The catalog index pairs symptoms to references.\n\n## Repo layout\n\n```\n.\n├── plugin.json                                 ← optional plugin manifest\n├── README.md                                   ← this file\n├── INSTALL.md                                  ← idempotent install \u002F update (paste-this)\n├── USAGE.md                                    ← copy\u002Fpaste prompts for every workflow\n├── LICENSE\n├── skills\u002F\n│   ├── designing-distributed-system-tests\u002F\n│   │   ├── SKILL.md                            ← the design workflow\n│   │   ├── assets\u002Fplan-template.md             ← §0–§9 incl. 
gated §7.M\n│   │   └── references\u002F                         ← 8-file technique catalog + index,\n│   │                                             common-distributed-systems-pitfalls,\n│   │                                             history-discipline\n│   └── executing-distributed-system-tests\u002F\n│       ├── SKILL.md                            ← the execute workflow\n│       ├── assets\u002F\n│       │   ├── session-log-template.md\n│       │   └── findings-report-template.md     ← 9-state verdicts + landing evidence\n│       └── references\u002F                         ← oracle-patterns (checker picker + 13\n│                                                 patterns), fault-injection-howto\n│                                                 (22-row nemesis taxonomy),\n│                                                 test-case-reduction (with blame\n│                                                 classification), green-but-broken-\n│                                                 red-flags (incl. weak-oracle audit),\n│                                                 finding-classification (TaxDC),\n│                                                 verdict-taxonomy (9-state)\n├── evals\u002F                                      ← eval suites for both skills\n├── verification\u002F                               ← real runs against AgentDB (concrete output)\n└── specs\u002F                                      ← original design spec\n```\n\n## Status\n\nEarly but exercised. Both skills have been driven against AgentDB\n(a distributed agent runtime in Rust) end-to-end multiple times,\nsurfacing six findings (one P0-candidate now closed, two P1s shipped\nas a PR, two open). 
The skill bodies evolve as harness experience\naccumulates; expect minor updates to the SKILL.mds and templates\nover the next few iterations.\n\nReal plan outputs, session directories, and findings reports from\nthose runs live under [`verification\u002F`](verification\u002F), one\nsubdirectory per run, each with a `README.md` describing what\npassed, what failed, and what the skill surfaced about itself in\nthe process. Notable runs:\n\n- [`verification\u002Fagentdb-fab7d9d\u002F`](verification\u002Fagentdb-fab7d9d\u002F) —\n  change-scoped plan + execution for AgentDB commit `fab7d9d` (durable\n  idempotent append replay); 670-line plan with 16 hypotheses across\n  all eight failure-mode categories.\n- [`verification\u002Fagentdb-jepsen\u002F`](verification\u002Fagentdb-jepsen\u002F) —\n  consistency + crash-recovery run with linearizability checking.\n- [`verification\u002Fagentdb-projectwide-lidev\u002F`](verification\u002Fagentdb-projectwide-lidev\u002F)\n  and `-v2` — project-wide plans with full coverage matrix +\n  adequacy argument + confidence statement.\n\nThere is also an eval suite under [`evals\u002F`](evals\u002F) (separate\n`evals.json` for the design and execute skills) — used to validate\nbehavioural changes to the SKILL.md bodies between iterations.\n\n## Acknowledgements\n\nThe technique catalog is distilled from Andrey Satarin's comprehensive\n[testing-distributed-systems](https:\u002F\u002Fgithub.com\u002Fasatarin\u002Ftesting-distributed-systems)\ncatalog. 
Seminal papers anchoring the catalog include:\n\n- Yuan et al., \"Simple Testing Can Prevent Most Critical Failures\" (OSDI'14)\n- Gunawi et al., \"What Bugs Live in the Cloud?\" (SoCC'14)\n- Zheng et al., \"Torturing Databases for Fun and Profit\" (OSDI'14)\n- Kingsbury & Alvaro, \"Elle: Inferring Isolation Anomalies from\n  Experimental Observations\" (VLDB'20)\n- Alfatafta et al., \"Toward a Generic Fault Tolerance Technique for\n  Partial Network Partitioning\" (OSDI'20)\n- Lou et al., \"Understanding, Detecting and Localizing Partial Failures\n  in Large System Software\" (NSDI'20)\n- Gao et al., \"An Empirical Study on Crash Recovery Bugs in Large-Scale\n  Distributed Systems\" (FSE'18)\n- Zhang et al., \"Understanding and Detecting Software Upgrade Failures\n  in Distributed Systems\" (SOSP'21)\n- Bornholt et al., \"Using Lightweight Formal Methods to Validate a\n  Key-Value Storage Node in Amazon S3\" (SOSP'21)\n- Newcombe et al., \"How Amazon Web Services Uses Formal Methods\" (CACM'15)\n\n## License\n\nMIT.\n","This project provides a set of AI-agent skills for distributed-systems testing that design and run claim-driven tests. Its core functions are generating a structured Markdown test plan and a findings report with 9-state verdicts that explicitly attributes each problem to the system, the test harness, the checker, or the environment. Technically, it supports multiple AI coding agents, such as Claude Code and Codex, and defines the testing logic in plain SKILL.md files. It is especially suited to scenarios that require in-depth validation of complex distributed and stateful systems, and it can surface problems that traditional integration tests struggle to find, such as network partitions and non-deterministic concurrency, thereby improving system reliability and stability.","2026-06-11 03:56:14","CREATED_QUERY"]