[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-82886":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":8,"htmlUrl":8,"language":9,"languages":8,"totalLinesOfCode":8,"stars":10,"forks":11,"watchers":12,"openIssues":13,"contributorsCount":13,"subscribersCount":13,"size":13,"stars1d":13,"stars7d":14,"stars30d":15,"stars90d":13,"forks30d":13,"starsTrendScore":13,"compositeScore":16,"rankGlobal":8,"rankLanguage":8,"license":17,"archived":18,"fork":18,"defaultBranch":19,"hasWiki":18,"hasPages":18,"topics":20,"createdAt":8,"pushedAt":8,"updatedAt":21,"readmeContent":22,"aiSummary":23,"trendingCount":13,"starSnapshotCount":13,"syncStatus":24,"lastSyncTime":25,"discoverSource":26},82886,"pi-agent-observability","disler\u002Fpi-agent-observability","disler",null,"TypeScript",97,31,52,0,19,42,4.52,"MIT License",false,"main",[],"2026-06-12 02:04:29","\u003Cp align=\"center\">\n  \u003Cimg src=\"apps\u002Fobservability\u002Fpublic\u002Flogo.svg\" alt=\"Pi Observability logo\" width=\"80\">\n\u003C\u002Fp>\n\n\u003Ch1 align=\"center\">Pi Observability\u003C\u002Fh1>\n\n> **Stop guessing what your Pi agent is doing. Watch every turn, every tool call, every token, live.**\n> A local observability stack for the [pi coding agent](https:\u002F\u002Fgithub.com\u002Fearendil-works\u002Fpi-mono), plus a product-agent demo that proves the telemetry holds up inside a real app workflow.\n\n📺 Watch this video to get the full breakdown of this codebase: **[Pi Observability: full breakdown](https:\u002F\u002Fyoutu.be\u002Fo4KZH_KSqYQ)**\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"images\u002F01_hero.png\" alt=\"Pi Observability hero: three coordinated views over a live agent stream\" width=\"850\">\n\u003C\u002Fp>\n\n## Four Tools In One\n> Four Tools. One Theme. 
_Agent Observability_\n>\n> Pi Extension, Observability Dashboard, Steelman Product Agent, Plan Prompts\n\n\n**If you don't measure, you can't improve.** This repo is built around one thesis: the way to win with agents (engineering agents *and* product agents) is to measure their trade-off trifecta of **performance, speed, and cost**, because more *useful* tokens beat fewer tokens every time. You can't make that call by vibes. You make it by watching every turn, every tool call, every token. Four tools give you that:\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"images\u002Ftriangle_v1_wireframe.svg\" alt=\"The trade-off trifecta: a triangle with performance, speed, and cost at its vertices\" width=\"780\">\n\u003C\u002Fp>\n\n1. **Pi extension** (`extension\u002F`): drop it into any Pi agent with `-e` and it streams canonical lifecycle events (every turn, tool call, model change, cost line) to the server. Zero changes to the agent itself.\n2. **Observability dashboard** (`apps\u002Fobservability\u002F`): a Bun + SQLite server that ingests and persists those events, plus three browser views (**single** for one agent with full payloads, **swimlane** for N agents compared turn-by-turn, and **race** for who finished which step first) so you can A\u002FB prompts and weigh the trifecta side by side.\n3. **Steelman product agent** (`apps\u002Fsteelman\u002F`): a real product app (investment bear-thesis analysis) running on an *observed* Pi agent, proving the telemetry holds up where it matters most: a product agent executing for real users, real money, real tools.\n4. **Plan prompts** (`.claude\u002Fskills\u002F`): four spec skills that turn a prompt into an implementation plan with more *useful* tokens. 
The same skills your agents run under observation, so you can measure the trifecta across spec formats and pick the right one for the job:\n   - `\u002Fspec`: markdown\n   - `\u002Fhtmlspec`: HTML\n   - `\u002Fhtmlvspec`: HTML + visuals (`gpt-image-2`)\n   - `\u002Fvspec`: markdown + visuals (`gpt-image-2`)\n\nThe whole thing fits on `127.0.0.1` and runs from a single `just all`.\n\n> Using Claude Code? Check out the original [Claude Code Observability](https:\u002F\u002Fgithub.com\u002Fdisler\u002Fclaude-code-hooks-multi-agent-observability) codebase and [video breakdown](https:\u002F\u002Fyoutu.be\u002F9ijnN985O_c)\n\n---\n\n## Install\n\n### Agentic Install\n\n```bash\njust install        # runs the \u002Finstall slash command in Claude Code (or Pi, or your favorite agentic coding tool)\n```\n\nThe `\u002Finstall` command lives at `.claude\u002Fcommands\u002Finstall.md` and handles toolchain checks, executable bits, and any project-specific setup.\n\n### Manual Install\n\n**Prereqs:** [`bun`](https:\u002F\u002Fbun.sh) ≥ 1.1, [`pi`](https:\u002F\u002Fgithub.com\u002Fearendil-works\u002Fpi-mono) on `$PATH`, [`just`](https:\u002F\u002Fjust.systems), `sqlite3` (system).\n\n```bash\ngit clone \u003Cthis-repo> pi-agent-observability     # clone the repo\ncd pi-agent-observability                         # enter it\ncp .env.sample .env || true                       # optional: set OBS_AUTH_TOKEN, GEMINI_API_KEY (default model), etc.\njust all                                          # clears pinned ports, boots obs server + Steelman backend + Steelman web\n```\n\nDefault endpoints after `just all`:\n\n```txt\nobservability: http:\u002F\u002F127.0.0.1:43190\u002F?token=devtoken\nSteelman API:  http:\u002F\u002F127.0.0.1:45210\nSteelman web:  http:\u002F\u002F127.0.0.1:51730\n```\n\n---\n\n## Why this exists\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"images\u002F02_observe_or_guess.png\" alt=\"Two-panel infographic: GUESS shows an opaque Pi agent with floating 
question marks; OBSERVE shows the same agent streaming labeled events into a store and a live UI\" width=\"750\">\n\u003C\u002Fp>\n\nPi runs fast and ships a single transcript. That's enough to debug a hello-world. It is **not** enough the moment you put an agent in front of real users, real money, real tools. You want to know: which model fired which tool with which args at which cost, how many retries happened, where the compaction event hit, when the branch nav cleared the context, why turn 14 spent eight seconds in `thinking`.\n\nWithout telemetry you guess. With this stack you **watch**. Three views answer three different questions:\n\n- **Single**: what is this one agent doing right now?\n- **Swimlane**: how do these N agents compare turn-by-turn?\n- **Race**: which agent finished which step first, and what did they do at that step?\n\n> *Measure to improve. The clarity of your measurement determines the clarity of the actions you can take.*\n\n---\n\n## How it works\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"images\u002F03_extension_server_ui.png\" alt=\"Three-column architecture: Pi extension emits ObsEvents, Bun server persists to SQLite and broadcasts SSE, browser UI renders single\u002Fswimlane\u002Frace views\" width=\"780\">\n\u003C\u002Fp>\n\nThree components, one wire format, one canonical event store:\n\n1. **Pi observability extension**: `extension\u002Fpi-observability.ts`\n   - Subscribes to every Pi lifecycle hook (`session_start`, `turn_start`, `tool_call`, `tool_result`, `model_change`, `compaction`, `branch_nav`, `error`, …).\n   - Emits canonical `ObsEvent` envelopes defined in `shared\u002Ftypes.ts`.\n   - Batches up to 50 events, applies backpressure when the queue overflows, retries on transient HTTP failure.\n\n2. 
**Bun + SQLite observability server**: `apps\u002Fobservability\u002Fserver.ts`, `apps\u002Fobservability\u002Fdb.ts`\n   - `POST \u002Fevents`: idempotent ingest (`INSERT OR IGNORE`, `(session_id, seq)` unique index).\n   - SQLite via `bun:sqlite`, WAL mode, zero migrations.\n   - REST + Server-Sent Events. UI clients subscribe live and resync on reconnect.\n   - Hosts the browser UI as static files, no separate frontend server.\n\n3. **Vanilla-JS browser UI**: `apps\u002Fobservability\u002Fpublic\u002F`\n   - `index.html` + `app.js`: single-session timeline, URL hash state, search, type filters, keyboard nav, cost\u002Ftoken rollups, scroll-pause autoresume.\n   - `swimlane.js`: N sticky lanes side by side, live slide-in + per-event-type color pulse.\n   - `race.js`: horizontal turn-grouped race view for side-by-side step comparison.\n\nThe event flows left to right: extension → server → UI. Backpressure flows the other way: the server NACKs duplicates, the extension never overwrites with stale data thanks to `COALESCE`-based UPSERT.\n\n---\n\n## The three views\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"images\u002F04_three_views.png\" alt=\"Three labeled UI tiles: SINGLE shows a vertical event timeline with an amber glow on the top live row, SWIMLANE shows three side-by-side session lanes, RACE shows three horizontal lanes with turn boundary ticks\" width=\"780\">\n\u003C\u002Fp>\n\nEach view answers a different question. Switch with the top-right toggle; the URL hash carries view + selection so the link is shareable.\n\n| View | Question it answers | Density | When to use it |\n|---|---|---|---|\n| **Single** | What is *this* session doing right now? | Vertical event-per-row stream with an amber slide-in pulse on every live row | Debugging one specific agent, reading the full payload of any event, copying event JSON |\n| **Swimlane** | How do these N sessions compare, turn-by-turn? 
| One sticky lane per session, identical row format, lanes scroll independently | Comparing a fleet, watching a swarm, A\u002FB-ing two prompts side by side |\n| **Race** | Who finished which step first, and what did they actually do at that step? | Horizontal lanes with turn boundaries as vertical ticks, events as arrows along the lane | Benchmarking, post-mortem, showing off |\n\nAll three views consume the same SSE stream from `server.ts`. The same `ObsEvent` rows render in all three, with no schema fanout.\n\n---\n\n## Wire format\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"images\u002F05_event_taxonomy.png\" alt=\"ObsEvent hub-and-spoke: central ObsEvent connects to three cluster boxes labeled lifecycle, turn, and meta, each containing the event-type chips\" width=\"780\">\n\u003C\u002Fp>\n\n`shared\u002Ftypes.ts` is the single source of truth. Every event carries:\n\n- **identity**: `session_id`, `cwd`, `pool`, `tags`, `agent_name`, `provider`, `model`\n- **ordering**: monotonic `seq` per session (and `(session_id, seq)` is `UNIQUE` in the DB, so the extension's own bugs surface immediately)\n- **payload**: a discriminated union keyed by `type`\n\nThe 16 supported event types:\n\n```txt\nsession_start  session_shutdown  agent_start  agent_end\nturn_start     turn_end          user_message assistant_message\nthinking       tool_call         tool_result  model_change\ncompaction     branch_nav        error        custom\n```\n\n`payload_json` is stored as raw JSON text, with no normalized payload columns. Cost and token rollups use `json_extract(payload_json, '$.usage.total_tokens')` so the schema stays stable as event types evolve. Add a new event type, add it to the discriminated union, and ingest works the same day.\n\nOne thing the opaque payload buys you that a Pi transcript will never give you: the first `agent_start` of every session carries a **full boot snapshot** of exactly what the agent was just told. 
The fully assembled system prompt verbatim, plus a structured digest of `BuildSystemPromptOptions`: selected tools, prompt guidelines, `--system-prompt` \u002F `--append-system-prompt` overrides, every context file pi loaded (path + bytes + `sha256` + full content), and every skill pi loaded (file body + `sha256` for drift detection). Uncapped, once per session. If you ever need to prove which skills made it into the system prompt for a given run (and which ones quietly didn't), that event is the receipt.\n\n---\n\n## Folder structure\n\n```txt\n.\n├── README.md\n├── LICENSE\n├── justfile                          # all commands, start here\n├── shared\u002F\n│   └── types.ts                      # canonical ObsEvent wire format\n├── extension\u002F\n│   └── pi-observability.ts           # Pi telemetry extension\n├── scripts\u002F\n│   ├── smoke-server.sh\n│   ├── spawn-fleet.sh                # launch N observed Pi agents for fleet tests\n│   └── validate-swimlane.ts          # observability\u002FUI regression suite\n├── apps\u002F\n│   ├── observability\u002F                # Bun HTTP + SSE + SQLite server + static UI\n│   │   ├── server.ts                 # Bun HTTP + SSE + static UI server\n│   │   ├── db.ts                     # SQLite schema + prepared queries\n│   │   └── public\u002F                   # index.html · app.js · swimlane.js · race.js\n│   └── steelman\u002F                     # product-agent demo (see below)\n│       ├── extension\u002Fsteelman-product.ts\n│       ├── server\u002Fsrc\u002Fserver.ts\n│       ├── web\u002Fsrc\u002FApp.vue\n│       └── scripts\u002Fvalidate-steelman.ts\n├── docs\u002F                             # SPEC, V2\u002FV3 status docs\n└── db\u002F                               # gitignored: obs.db + WAL files land here\n```\n\n---\n\n## Commands\n\nThe `justfile` is the surface area. 
Every recipe clears its pinned port before booting, so re-running is always safe.\n\n```bash\njust all                       # boot obs + Steelman backend + Steelman web (default flow)\njust all watch                 # same, with --watch on the Steelman backend\njust obs                       # boot only the observability server\njust steelman-server           # boot only the Steelman backend (real Pi RPC mode)\njust steelman-web              # boot only the Vite frontend\njust agent                     # interactive Pi agent with the observability extension attached\njust backup                    # timestamped backup of db\u002Fobs.db\njust validate-steelman         # validation run for the Steelman backend\njust specping                  # ping the \u002Fspec skill (smoke test)\njust htmlping                  # same for \u002Fhtmlspec\njust htmlvping                 # same for \u002Fhtmlvspec\n```\n\n---\n\n## The Steelman demo\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"images\u002F06_steelman_layered.png\" alt=\"Two-layer diagram: product surface on top with chat bubbles and an artifact card, Pi internals below as a small swimlane, cyan arrows flowing both ways between the layers\" width=\"780\">\n\u003C\u002Fp>\n\n`apps\u002Fsteelman\u002F` is a real product app (investment-thesis analysis) built directly on top of an observed Pi agent. It exists to prove the telemetry survives a non-trivial workload, not just a hello-world.\n\n1. Browser posts a thesis to `POST \u002Fapi\u002Fruns`.\n2. Backend launches `pi --mode rpc --no-builtin-tools` with **two** extensions:\n   - `extension\u002Fpi-observability.ts`: the telemetry side\n   - `apps\u002Fsteelman\u002Fextension\u002Fsteelman-product.ts`: the product tools (`steelman_research`, `steelman_emit_artifact`)\n3. Product backend streams chat, tool calls, status, and artifacts to the Vue frontend over `\u002Fapi\u002Fruns\u002F:id\u002Fstream`.\n4. 
Observability server independently captures the Pi-internal lifecycle under `pool=product-steelman`, `tag=run-\u003Cid>`, on the same database, same UI, no special-casing.\n\nThe Vue UI renders a two-pane product experience: dynamic artifacts on the left (`table`, `bar-chart`, `pie-chart`, `trend`, `scorecard`, `risk-map`, `text`, sandboxed `html`) and a streaming chat with clickable `@artifact-ref` links on the right. From the operator side, you watch the same run in the observability UI (every tool call, every model change, every cost line) without touching the product code.\n\nA validated real run used:\n\n```txt\nprovider\u002Fmodel: google \u002F gemini-3.5-flash\nproduct tools:  steelman_research, steelman_emit_artifact\nartifacts:      table, bar-chart, pie-chart, trend, scorecard, risk-map, text, html\nobservability:  pool=product-steelman, tag=run-\u003Cid>\n```\n\nThe agent's model and provider are set via `STEELMAN_AGENT_MODEL` \u002F `STEELMAN_AGENT_MODEL_PROVIDER` (defaults `gemini-3.5-flash` \u002F `google`). Authenticate with `GEMINI_API_KEY` or `pi \u002Flogin`.\n\n---\n\n## More useful tokens: the planning phase\n\nAgentic engineering has two hard constraints: **planning** and **reviewing**. The model does the work in between; you live or die by how well you frame it going in and verify it coming out. This section is about the front end of that (the plan) and the single lever that moves it most.\n\nThe lever is **more useful tokens**. Anthropic's \"[unreasonable effectiveness of HTML](https:\u002F\u002Fclaude.com\u002Fblog\u002Fusing-claude-code-the-unreasonable-effectiveness-of-html)\" post landed on the same idea from the structure side: give the agent richer, more structured context and it performs better. The keyword is **useful**, not *more*. A wall of boilerplate is more tokens and worse plans. A diagram of the data model, a mocked-up component, a labeled before\u002Fafter: those are tokens that change what the agent builds. 
You are spending context to buy precision. When you combine this with OpenAI's [GPT Image 2.0](https:\u002F\u002Fopenai.com\u002Findex\u002Fintroducing-chatgpt-images-2-0\u002F) model for image generation, you can generate structured, information-rich visual prompts, what I like to call **VSpecs**.\n\nThe four `\u002Fplan` prompts are four points on the tokens-vs-precision curve, cheapest to richest:\n\n| Prompt | Format | Tokens | Best when |\n|---|---|---|---|\n| `\u002Fspec` | Markdown | Lowest | Text-first work, tight context budgets, the plan is mostly prose and file lists |\n| `\u002Fhtmlspec` | HTML | Mid | You want structure and inline prototypes (a mocked component, a comparison table) without image cost |\n| `\u002Fvspec` | Markdown + AI visuals | Mid-High | You want image-enriched plans but prefer plain markdown over HTML scaffolding (`gpt-image-2`) |\n| `\u002Fhtmlvspec` | HTML + AI visuals | High | UI\u002Ffront-end work where a rendered diagram per section earns its tokens (`gpt-image-2`) |\n\nWhy visuals at all? Because modern models are multimodal, and an image is one of the most token-dense, lowest-ambiguity ways to communicate intent. A single diagram of \"these three components, wired this way\" replaces paragraphs of prose the agent would otherwise have to reconstruct, and reconstruct *its way*, not yours. The agent reads the plan and executes it, so an image embedded in the plan is an instruction with far less room to drift.\n\nThere's a real cost, and it's worth naming: visual specs are slower and more expensive to produce, and the observability stack here **does not** meter image-generation cost; that spend happens outside the Pi event stream. 
So the question is never \"which spec is best,\" it's the trifecta question this whole repo is built to answer: **for this task, what's the trade-off between performance, speed, and cost?** Run the same prompt through two spec formats, watch both agents in swimlane or race, and let the turn counts, token totals, and costs decide. Measure first; then turn the winner into an eval and scale it.\n\n---\n\n## Improvements & Failure Modes\n\nThings this stack does well, things it doesn't try to do, and the failure modes that are honestly worth knowing.\n\n- **Single-host only.** SQLite + a single Bun process. No multi-node ingest, no Postgres, no S3 archive. Scale path is intentional: spin up a second instance under a different port and namespace by pool\u002Ftag.\n- **Devtoken in the URL.** The default `OBS_AUTH_TOKEN=devtoken` is fine on `127.0.0.1`. Anywhere else, set a real token and don't share screenshots that include `?token=…`.\n- **Extension batches up to 50.** Burst-heavy agents can lag the UI by ~1s under load. The queue drops oldest on overflow (logged, never silent). Tune `EVT_BATCH_MAX` in the extension if your workload needs it.\n- **No retroactive backfill.** Events are stored on arrival. If your extension was disabled mid-run, that turn is gone; there's no Pi-session-log replayer (yet).\n- **WAL files in `db\u002F`.** `obs.db-wal` and `obs.db-shm` are real files. `.gitignore` covers them via `*.db*`. If you want a portable snapshot, `just backup` does the right thing.\n- **SSE reconnect is best-effort.** On reconnect the client refetches the latest N events for every active lane and dedupes by `event_id`. If you lose the network for an hour, you get the last hour's tail, not the gap.\n- **`~TPS` is an estimate.** The single-mode `~TPS` pill is `usage.output × 1000 \u002F generation_ms` (post-prefill, real streaming rate). 
For batched-delta turns where `generation_ms \u003C 50ms` it's suppressed; the math is honest, the millisecond timer is the noisy part. Renders as `—` when the window is too small to mean anything.\n- **Project-local skills follow pi's convention, not Claude's.** The boot snapshot reflects whatever pi actually loaded. Pi discovers project skills at `\u003Ccwd>\u002F.pi\u002Fskills\u002F`, not `.claude\u002Fskills\u002F`, so if you keep skills under `.claude\u002F`, point pi at them explicitly with `--skill .claude\u002Fskills` or symlink `.pi\u002Fskills`. The extension faithfully reports whatever pi finds.\n\n---\n\n## License\n\nMIT. See [`LICENSE`](LICENSE).\n\n---\n\n## Master Agentic Coding\n\nPrepare for the future of software engineering.\n\nLearn tactical agentic coding patterns with [Tactical Agentic Coding](https:\u002F\u002Fagenticengineer.com\u002Ftactical-agentic-coding?y=piobs).\n\nFollow the [IndyDevDan YouTube channel](https:\u002F\u002Fwww.youtube.com\u002F@indydevdan) to improve your agentic coding advantage.\n\n---\n\nStay Focused and Keep Building\n\n- IndyDevDan\n","Pi Observability is a local observability stack for monitoring and analyzing the behavior of Pi coding agents. It provides comprehensive monitoring through four tools: a Pi extension, an observability dashboard, the Steelman product agent, and plan prompts. The Pi extension integrates seamlessly into any Pi agent and streams lifecycle events in real time without changes to the agent's code; the observability dashboard uses a Bun + SQLite server to process those events and presents the behavior of one or more agents for comparison through three views; the Steelman product agent demonstrates how the monitoring system tracks performance under real user interaction in an actual application; and the plan prompts provide four spec skills that help developers choose the right implementation plan for their needs. The project suits scenarios that require understanding and optimizing AI-agent-based applications in depth, especially where the balance of performance, speed, and cost matters.",2,"2026-06-11 04:09:31","CREATED_QUERY"]