[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-25":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":16,"stars7d":17,"stars30d":18,"stars90d":15,"forks30d":15,"starsTrendScore":19,"compositeScore":20,"rankGlobal":9,"rankLanguage":9,"license":21,"archived":22,"fork":22,"defaultBranch":23,"hasWiki":24,"hasPages":22,"topics":25,"createdAt":9,"pushedAt":9,"updatedAt":26,"readmeContent":27,"aiSummary":28,"trendingCount":15,"starSnapshotCount":15,"syncStatus":16,"lastSyncTime":29,"discoverSource":30},25,"RunbookHermes","Tommy-yw\u002FRunbookHermes","Tommy-yw","Hermes-native AIOps agent for evidence-driven incident response, approval-gated remediation, and runbook learning.",null,"Python",561,37,19,3,0,2,5,38,6,8.74,"MIT License",false,"main",true,[],"2026-06-12 02:00:06","# RunbookHermes\n\n**Hermes-native AIOps Agent for payment incident response, evidence-driven root-cause analysis, approval-gated remediation, and runbook learning.**\n\nRunbookHermes is built by adapting the official **Hermes Agent** runtime into a production-oriented incident-response system. It keeps Hermes Agent's strengths—runtime loop, provider routing, tool system, memory, context engine, skills, gateway, and safety boundaries—and specializes them for AIOps workflows such as payment-system failures, observability evidence collection, approval, checkpoint, rollback, recovery verification, and runbook knowledge accumulation.\n\n> RunbookHermes is not a separate toy dashboard beside Hermes Agent. It is a Hermes-native vertical extension: Hermes provides the agent foundation; RunbookHermes adds the incident-response domain layer.\n\n---\n\n## Product Screenshots\n\nThe screenshots below show the current RunbookHermes Web Console. 
Put these images under `docs\u002Fassets\u002F` and keep the file names consistent with the Markdown paths.\n\n### AIOps Console Overview\n\n![AIOps Console Overview](docs\u002Fassets\u002Foverview.png)\n\nThe overview page shows the high-level AIOps control plane: incident count, pending approvals, generated skills, critical services, recommended operation flow, current capability boundaries, and a live monitoring preview.\n\n### Realtime Monitoring System\n\n![Realtime Monitoring System](docs\u002Fassets\u002Fmonitoring-overview.png)\n\nThe monitoring page provides a multi-dimensional service health view for `payment-service`, `coupon-service`, and `order-service`, including HTTP status signals, QPS, p95 latency, service topology, backend mode, and deployment state.\n\n![Monitoring Logs and Trace Signals](docs\u002Fassets\u002Fmonitoring-signals.png)\n\nThe lower section of the monitoring page shows log signals and trace signals. This is where RunbookHermes connects observability data to incident diagnosis instead of relying only on model guesses.\n\n### Incident Command Center\n\n![Incident Command Center](docs\u002Fassets\u002Fincidents.png)\n\nThe incident list page normalizes incidents created from Web, Alertmanager, Feishu, WeCom, or API entry points. It shows service, status, severity, root cause, creation time, and quick incident creation actions.\n\n### Incident Detail: Evidence and Executive Summary\n\n![Incident Evidence](docs\u002Fassets\u002Fincident-evidence.png)\n\nThe incident detail page displays evidence cards from metrics, logs, and traces, plus an executive summary with root cause, recommended action, evidence IDs, confidence, and approval status.\n\n### Incident Detail: Root Cause and Model-Assisted Summary\n\n![Incident Root Cause](docs\u002Fassets\u002Fincident-root-cause.png)\n\nThe root-cause tab separates deterministic evidence from optional model-assisted explanation. 
The model summary is only enabled when a model provider is configured.\n\n### Incident Detail: Actions, Approvals, and Checkpoints\n\n![Incident Actions](docs\u002Fassets\u002Fincident-actions.png)\n\nRisky actions are not executed blindly. RunbookHermes places write or destructive actions behind approval, checkpoint, dry-run, controlled execution, and recovery verification.\n\n### Incident Detail: Timeline\n\n![Incident Timeline](docs\u002Fassets\u002Fincident-timeline.png)\n\nThe timeline records the full incident lifecycle, including incident creation, evidence collection, hypothesis generation, action planning, checkpoint creation, approval request, approval decision, skill generation, and execution result.\n\n### Incident Detail: Generated Runbook Skill\n\n![Generated Runbook Skill](docs\u002Fassets\u002Fincident-skill.png)\n\nAfter an incident is processed, RunbookHermes can turn the operational experience into a reusable runbook skill. This is how incident handling becomes accumulated operational knowledge rather than a one-off response.\n\n### Approval Center\n\n![Approval Center](docs\u002Fassets\u002Fapproval-center.png)\n\nThe approval center is the human-in-the-loop safety gate. Operators can review the action, risk level, checkpoint, and payload before approving or rejecting execution.\n\n### Digests and Skills\n\n![Digests and Skills](docs\u002Fassets\u002Fdigests-skills.png)\n\nThe digest page summarizes recent incidents, high-frequency faults, and generated runbook skills, making RunbookHermes useful for both incident response and operational review.\n\n### Integration Readiness and Interface Status\n\n![Settings and Interface Status](docs\u002Fassets\u002Fsettings-interface-status.png)\n\nThe settings page shows whether model, observability, execution, Feishu, WeCom, and other production integration interfaces are configured. 
It also documents the environment variables needed to connect real systems.\n\n---\n\n## Why RunbookHermes\n\nMost AI Agent projects stop at chat, retrieval, or simple workflow automation. Real incident response requires much more:\n\n* reliable evidence collection from monitoring, logs, traces, and deployments;\n* context compression so models reason over evidence instead of raw log noise;\n* memory that remembers useful operational experience without stuffing every history item into the prompt;\n* tools that are governed by schemas, allowlists, and safety policies;\n* approval and checkpoint before risky production actions;\n* recovery verification after remediation;\n* runbook skill generation so successful operations become reusable knowledge.\n\nRunbookHermes was created to turn Hermes Agent into this kind of incident-response agent.\n\n---\n\n## What RunbookHermes Inherits from Hermes Agent\n\nRunbookHermes is valuable because it is not built from scratch as a simple rule engine. It is based on Hermes Agent's architecture and adapts those capabilities into the AIOps domain.\n\n| Hermes Agent capability   | RunbookHermes adaptation                                                                                                         |\n| ------------------------- | -------------------------------------------------------------------------------------------------------------------------------- |\n| Agent runtime \u002F loop      | Used as the core agent foundation for the `runbook-hermes` profile.                                                              |\n| Provider \u002F model routing  | Keeps Hermes-style model provider flexibility and adds OpenAI-compatible model-summary integration for incident analysis.        |\n| Tool system               | Adds incident-response tools for Prometheus, Loki, Jaeger\u002FTrace, deploy history, approval, rollback, and recovery verification.  
|\n| Memory provider           | Adds `IncidentMemory` for service profiles, incident summaries, team preferences, and skill index.                               |\n| Context engine            | Adds `EvidenceStack`, an evidence-centric context engine for alert, evidence, hypotheses, actions, and final answer compression. |\n| Skills                    | Adds runbook skills such as payment HTTP 503 triage and common incident triage.                                                  |\n| Gateway architecture      | Adds Alertmanager, Feishu, WeCom, and Web\u002FAPI entry paths for incident workflows.                                                |\n| Safety boundary           | Adds approval, checkpoint, dry-run, controlled execution, and recovery verification around risky actions.                        |\n| Execution backend concept | Adds local reference rollback plus production executor interfaces such as `custom_http`, Kubernetes, and Argo CD style adapters. |\n\nThe goal is not to clone every Hermes feature into a dashboard. The goal is to preserve Hermes Agent's strengths and turn them into an operationally meaningful AIOps system.\n\n---\n\n## Core Capabilities\n\n### 1. Incident Intake\n\nRunbookHermes can receive incident signals through multiple entry points:\n\n* Web Console\n* Alertmanager webhook\n* Feishu event and card callback shells\n* WeCom event and card callback shells\n* Hermes profile entry via `runbook-hermes`\n* API endpoints for incident creation and replay\n\nAll entries are normalized into an incident command so different sources can flow into the same agent workflow.\n\n### 2. Evidence Collection\n\nRunbookHermes collects evidence from:\n\n* Prometheus metrics\n* Loki logs\n* Jaeger \u002F Trace backend\n* deployment records\n* service-specific profiles\n* prior incident summaries\n* runbook skills\n\nThe current code includes real adapter interfaces and a local reference payment environment for validating the integration path.\n\n### 3. 
EvidenceStack Context Engine\n\nIncident response produces too much raw context: logs, metric samples, traces, tool outputs, deployment records, approvals, and timelines. RunbookHermes does not dump all of that into the prompt.\n\nInstead, `EvidenceStack` organizes context into:\n\n* alert summary\n* key evidence\n* hypotheses\n* action plan\n* final answer\n\nIt keeps evidence IDs and summaries, while avoiding large raw logs and trace payloads in the long-running reasoning context.\n\n### 4. IncidentMemory\n\nRunbookHermes uses a domain-specific memory provider for incident response.\n\nIt remembers stable operational knowledge such as:\n\n* service profiles;\n* team preferences;\n* incident summaries;\n* recurring root causes;\n* generated runbook skills;\n* approval requirements for risky actions.\n\nIt does not treat memory as “save the whole chat history.” It is designed to recall the right operational facts at the right time.\n\n### 5. Model-Assisted Analysis\n\nRunbookHermes supports model-assisted incident summaries through OpenAI-compatible endpoints.\n\nThe model is used to improve analysis readability and operator-facing summaries, while the evidence chain and safety gates remain explicit.\n\nTypical model-assisted outputs:\n\n* incident summary;\n* most likely root cause explanation;\n* evidence chain explanation;\n* operator-facing action summary;\n* postmortem draft material.\n\n### 6. Approval-Gated Remediation\n\nRunbookHermes treats destructive actions as controlled operations.\n\nHigh-risk actions such as rollback, restart, or configuration mutation should pass through:\n\n1. action policy check;\n2. approval request;\n3. checkpoint creation;\n4. dry-run;\n5. controlled execution;\n6. recovery verification;\n7. audit timeline.\n\nThis is one of the main reasons RunbookHermes is built on a Hermes-style safety boundary instead of being a simple script runner.\n\n### 7. 
Realtime Monitoring Dashboard\n\nThe Web Console includes a realtime monitoring view for:\n\n* service health matrix;\n* HTTP 503 \u002F 504 \u002F 429 signals;\n* p95 latency;\n* QPS;\n* log signals;\n* trace signals;\n* deployment status;\n* topology view;\n* backend status for Prometheus, Loki, Trace, Deploy, model, Feishu, WeCom, and controlled execution.\n\n---\n\n## Repository Layout\n\n```text\nrunbook-hermes\u002F\n├── agent\u002F                              # Hermes Agent upstream runtime code\n├── gateway\u002F                            # Hermes upstream gateway foundation\n├── hermes_cli\u002F                         # Hermes CLI components\n├── profiles\u002Frunbook-hermes\u002F            # RunbookHermes Hermes profile and persona\n├── plugins\u002Frunbook-hermes\u002F             # RunbookHermes tool plugin\n├── plugins\u002Fmemory\u002Fincident_memory\u002F     # IncidentMemory provider\n├── plugins\u002Fcontext_engine\u002Fevidence_stack\u002F # EvidenceStack context engine\n├── runbook_hermes\u002F                     # RunbookHermes domain logic\n├── apps\u002Frunbook_api\u002F                   # FastAPI Web\u002FAPI service\n├── web\u002Fstatic\u002F                         # Web Console pages\n├── integrations\u002Fobservability\u002F         # Prometheus \u002F Loki \u002F Trace \u002F Deploy adapters\n├── toolservers\u002Fobservability_mcp\u002F      # Observability toolserver boundary\n├── skills\u002Frunbooks\u002F                    # Runbook skills\n├── demo\u002Fpayment_system\u002F                # Local reference payment environment\n├── data\u002Fpayment_demo\u002F                  # Reference deploy state and runtime version\n├── data\u002Frunbook_mock\u002F                  # Mock observability data for local fallback\n├── scripts\u002F                            # Validation and smoke scripts\n└── docs\u002F                               # Architecture, deployment, integration, operations docs\n```\n\n---\n\n## Deployment 
Modes\n\nRunbookHermes should be understood as one merged codebase:\n\n```text\nHermes Agent upstream source\n+ RunbookHermes AIOps extension layer\n= RunbookHermes\n```\n\nYou do **not** deploy “official Hermes Agent first” and then deploy RunbookHermes as a separate unrelated app. You deploy the merged RunbookHermes repository and run the entry points you need.\n\n### Mode A: Web\u002FAPI Only\n\nUse this mode to inspect the Web Console, incident pages, approvals, monitoring UI, settings, and API surface.\n\n```bash\nset PYTHONPATH=.\npython -m uvicorn apps.runbook_api.app.main:app --host 127.0.0.1 --port 8000\n```\n\nOpen:\n\n```text\nhttp:\u002F\u002F127.0.0.1:8000\u002Fweb\u002Findex.html\nhttp:\u002F\u002F127.0.0.1:8000\u002Fweb\u002Fmonitoring.html\nhttp:\u002F\u002F127.0.0.1:8000\u002Fweb\u002Fincidents.html\nhttp:\u002F\u002F127.0.0.1:8000\u002Fweb\u002Fapprovals.html\nhttp:\u002F\u002F127.0.0.1:8000\u002Fdocs\n```\n\n### Mode B: Local Reference Payment Environment\n\nUse this mode to validate the full incident-response path with a local payment system and observability stack.\n\n```bash\ncd demo\u002Fpayment_system\ndocker compose up --build\n```\n\nThis starts a local reference environment containing:\n\n* payment-service\n* order-service\n* coupon-service\n* MySQL\n* Redis\n* Prometheus\n* Loki\n* Promtail\n* Jaeger\n* Grafana\n\nThen configure RunbookHermes to use real local observability adapters:\n\n```bash\nset OBS_BACKEND=real\nset DEPLOY_BACKEND=demo_file\nset TRACE_BACKEND=jaeger\nset TRACE_PROVIDER_KIND=jaeger\nset ROLLBACK_BACKEND_KIND=demo_file\nset RUNBOOK_CONTROLLED_EXECUTION_ENABLED=true\n\nset PROMETHEUS_BASE_URL=http:\u002F\u002F127.0.0.1:9090\nset LOKI_BASE_URL=http:\u002F\u002F127.0.0.1:3100\nset TRACE_BASE_URL=http:\u002F\u002F127.0.0.1:16686\n\nset DEMO_DEPLOY_STATE_FILE=data\u002Fpayment_demo\u002Fdeployments.json\nset DEMO_VERSION_FILE=data\u002Fpayment_demo\u002Fruntime\u002Fpayment-service-version.txt\n```\n\nStart the 
Web\u002FAPI service:\n\n```bash\nset PYTHONPATH=.\npython -m uvicorn apps.runbook_api.app.main:app --host 127.0.0.1 --port 8000\n```\n\nGenerate reference traffic:\n\n```bash\ncd demo\u002Fpayment_system\npython scripts\u002Fgenerate_traffic.py --fault PAYMENT_503_AFTER_DEPLOY --requests 60\npython scripts\u002Fgenerate_traffic.py --fault COUPON_504_TIMEOUT --requests 40\npython scripts\u002Fgenerate_traffic.py --fault ORDER_429_RATE_LIMIT --requests 40\n```\n\nThese scenarios are not the final goal. They are a local reference environment for proving how RunbookHermes connects to real systems.\n\n### Mode C: Production-Oriented Deployment\n\nIn a production-oriented deployment, RunbookHermes should run as a set of services:\n\n```text\n[Alertmanager]\n     |\n     v\n[RunbookHermes API \u002F Gateway]\n     |\n     +--> Hermes Agent Runner with runbook-hermes profile\n     +--> Model Provider\n     +--> Prometheus\n     +--> Loki\n     +--> Jaeger \u002F Tempo\n     +--> Deploy \u002F Rollback System\n     +--> Feishu \u002F WeCom\n     +--> Incident Store\n     +--> Redis \u002F Queue\n     +--> Audit Log\n```\n\nRecommended production components:\n\n* `runbookhermes-api`: FastAPI Web\u002FAPI and webhook service;\n* `runbookhermes-agent`: Hermes runner using `runbook-hermes` profile;\n* `incident-store`: SQLite \u002F MySQL \u002F PostgreSQL, replacing local JSON store;\n* `redis`: queue \u002F cache \u002F approval state support;\n* `model-provider`: OpenAI-compatible or internal model endpoint;\n* `observability`: Prometheus, Loki, Jaeger \u002F Tempo;\n* `messaging`: Feishu \u002F WeCom callbacks;\n* `executor`: controlled remediation adapter such as custom HTTP, Kubernetes, or Argo CD.\n\n---\n\n## Where Do I Chat with the Agent?\n\nRunbookHermes has different interaction surfaces.\n\n### 1. 
Hermes CLI \u002F Agent Profile\n\nFor direct agent interaction:\n\n```bash\nhermes --profile runbook-hermes\n```\n\nUse this when you want the Hermes-native conversation loop.\n\nExample prompt:\n\n```text\npayment-service HTTP 503 is rising after release. Please collect evidence first, then explain the most likely root cause and propose a safe action plan.\n```\n\n### 2. Web Console\n\nThe Web Console is not primarily a chat UI. It is the operator control plane:\n\n* incident list;\n* realtime monitoring;\n* evidence cards;\n* RCA results;\n* action plans;\n* approvals;\n* checkpoints;\n* recovery verification;\n* generated skills;\n* model-assisted summaries.\n\n### 3. Feishu \u002F WeCom\n\nFeishu and WeCom adapters are intended for production messaging integration:\n\n* create incident from message or alert;\n* show RCA card;\n* approve or reject risky action;\n* link back to Web Console.\n\n### 4. Alertmanager \u002F API\n\nAlertmanager and API entry points are designed for system-to-agent incident intake.\n\n---\n\n## Model Provider Setup\n\nRunbookHermes can use an OpenAI-compatible endpoint for model-assisted summaries.\n\nExample with OpenRouter or any compatible model provider:\n\n```bash\nset RUNBOOK_MODEL_ENABLED=true\nset RUNBOOK_MODEL_BASE_URL=https:\u002F\u002Fopenrouter.ai\u002Fapi\u002Fv1\nset RUNBOOK_MODEL_API_KEY=your_api_key\nset RUNBOOK_MODEL_NAME=your_model_name\n```\n\nModel output is used for readable incident summaries and operator-facing explanations. 
Evidence collection, approval boundaries, and remediation policies remain explicit and inspectable.\n\n---\n\n## Observability Integration\n\nConfigure real observability backends:\n\n```bash\nset OBS_BACKEND=real\nset PROMETHEUS_BASE_URL=http:\u002F\u002Fprometheus.example.com\nset LOKI_BASE_URL=http:\u002F\u002Floki.example.com\nset TRACE_BACKEND=jaeger\nset TRACE_PROVIDER_KIND=jaeger\nset TRACE_BASE_URL=http:\u002F\u002Fjaeger.example.com\n```\n\nRunbookHermes uses these adapters:\n\n* `integrations\u002Fobservability\u002Fprometheus_backend.py`\n* `integrations\u002Fobservability\u002Floki_backend.py`\n* `integrations\u002Fobservability\u002Ftrace_backend.py`\n* `integrations\u002Fobservability\u002Fdeploy_backend.py`\n\n---\n\n## Feishu \u002F WeCom Integration\n\nRunbookHermes includes gateway shells for Feishu and WeCom.\n\nFeishu environment variables:\n\n```bash\nset FEISHU_APP_ID=\nset FEISHU_APP_SECRET=\nset FEISHU_VERIFICATION_TOKEN=\nset FEISHU_ENCRYPT_KEY=\nset FEISHU_CALLBACK_BASE_URL=\nset FEISHU_BOT_WEBHOOK_URL=\nset FEISHU_BOT_SECRET=\n```\n\nWeCom environment variables:\n\n```bash\nset WECOM_CORP_ID=\nset WECOM_AGENT_ID=\nset WECOM_SECRET=\nset WECOM_TOKEN=\nset WECOM_ENCODING_AES_KEY=\nset WECOM_CALLBACK_BASE_URL=\n```\n\nProduction use requires public callback routing, signature verification, encryption handling, permission setup, and card callback validation.\n\n---\n\n## Controlled Remediation\n\nRunbookHermes is designed around safe production execution, not blind automation.\n\nSupported remediation boundary:\n\n```text\naction policy\n→ approval\n→ checkpoint\n→ dry-run\n→ controlled execution\n→ recovery verification\n→ audit timeline\n```\n\nLocal reference execution is available through the payment reference environment. 
Production execution should be connected through a controlled executor:\n\n```bash\nset ACTION_EXECUTION_BACKEND=custom_http\nset ACTION_EXECUTION_API_BASE_URL=https:\u002F\u002Fexecutor.example.com\nset ACTION_EXECUTION_API_TOKEN=your_token\nset ACTION_EXECUTION_TIMEOUT_SECONDS=5\n```\n\nOther possible executor types:\n\n* Kubernetes controlled API\n* Argo CD\n* Argo Rollouts\n* internal release platform\n* custom HTTP remediation gateway\n\n---\n\n## Validation\n\nRun validation scripts from the repository root:\n\n```bash\nset PYTHONPATH=.\npython -S scripts\u002Frunbook_validate.py\npython -S scripts\u002Frunbook_gateway_smoke.py\npython -S scripts\u002Frunbook_no_legacy_imports.py\npython -S scripts\u002Frunbook_monitoring_validate.py\npython -S scripts\u002Frunbook_stage8_validate.py\n```\n\n---\n\n## Current Status\n\nRunbookHermes currently provides:\n\n* Hermes-native RunbookHermes profile;\n* incident-response tool plugin;\n* IncidentMemory provider;\n* EvidenceStack context engine;\n* Web Console and monitoring dashboard;\n* local reference payment environment;\n* Prometheus \u002F Loki \u002F Jaeger adapter layer;\n* Feishu \u002F WeCom gateway shells;\n* model-assisted summary shell;\n* approval + checkpoint + controlled local rollback;\n* production-oriented executor interfaces.\n\nRecommended next hardening steps:\n\n* replace local JSON store with SQLite \u002F MySQL \u002F PostgreSQL;\n* add Memory Browser page;\n* add Skill Forge page;\n* complete Feishu \u002F WeCom production callback verification;\n* connect a real model provider;\n* connect a real production deploy \u002F rollback executor;\n* add Kubernetes \u002F Docker Compose production deployment manifests;\n* add RBAC and audit persistence.\n\n---\n\n## Roadmap\n\nSee [ROADMAP.md](ROADMAP.md).\n\nHigh-level roadmap:\n\n* v0.1: Hermes-native incident-response foundation\n* v0.2: stronger memory, skill, and monitoring UI\n* v0.3: production observability integrations\n* v0.4: Feishu \u002F 
WeCom production messaging workflow\n* v0.5: controlled Kubernetes \u002F Argo remediation reference\n* v1.0: production reference architecture\n\n---\n\n## Acknowledgements\n\nRunbookHermes is built on top of **Hermes Agent** by Nous Research.\n\nThis project preserves the Hermes Agent foundation and adds an AIOps \u002F incident-response layer for payment-system troubleshooting, observability integration, approval-gated remediation, and runbook learning.\n\nThe upstream Hermes README and release notes are kept under `docs\u002Fupstream\u002F` for attribution and reference.\n\n---\n\n## License\n\nThis repository preserves the upstream Hermes Agent license. See [LICENSE](LICENSE).\n\nRunbookHermes additions follow the same repository license unless otherwise stated.\n\n","RunbookHermes is an AIOps agent built on Hermes Agent, designed for payment-system incident response, evidence-driven root-cause analysis, approval-gated remediation, and runbook learning. The project is written in Python; its core features include realtime monitoring, multi-dimensional service health checks, root-cause analysis, and automated or manual intervention within safety boundaries. It integrates observability data (such as metrics, logs, and trace signals) to support more accurate problem diagnosis, and provides an intuitive Web Console for managing the incident-response workflow. RunbookHermes also supports triggering incident handling from multiple channels, which suits enterprise scenarios that need to locate problems quickly and act to reduce business downtime.","2026-06-11 02:30:30","CREATED_QUERY"]