[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-83807":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":12,"contributorsCount":12,"subscribersCount":12,"size":12,"stars1d":14,"stars7d":15,"stars30d":15,"stars90d":12,"forks30d":12,"starsTrendScore":16,"compositeScore":12,"rankGlobal":9,"rankLanguage":9,"license":9,"archived":17,"fork":17,"defaultBranch":18,"hasWiki":19,"hasPages":17,"topics":20,"createdAt":9,"pushedAt":9,"updatedAt":21,"readmeContent":22,"aiSummary":9,"trendingCount":12,"starSnapshotCount":12,"syncStatus":23,"lastSyncTime":24,"discoverSource":25},83807,"AgentEval","canwhite\u002FAgentEval","canwhite","The agent responsible for conducting the agent evaluation",null,"Rust",102,0,52,13,39,54,false,"main",true,[],"2026-06-12 02:04:35","# AgentEval\n\nA transparent HTTP proxy that captures Agent ↔ LLM API traffic, auto-splits sessions, builds structured conversation views, **grades** with multi-dimensional scoring, **diagnoses** behavioral issues via rule engine, and **probes** agent configuration for root causes — all with a built-in web dashboard.\n\n*透明 HTTP 代理，捕获 Agent ↔ LLM 的 API 流量，自动切分 session、构建结构化视图、多维自动评分、规则诊断行为问题、探针审查配置根因 —— 内置 Web 评测面板。*\n\n## How It Works \u002F 工作原理\n\n```\n                         ┌──────────────────────────────┐\n                         │   AgentEval (127.0.0.1:57633) │\n                         └──────────────┬───────────────┘\n                                        │\nAgent ── HTTP ──► Proxy ── forward ──► Upstream LLM API\n                    │\n                    ├─ Raw traffic → logs\u002F{stem}.jsonl\n                    │  原始流量记录\n                    ├─ Session detection (message rollback \u002F idle timeout)\n                    │  实时检测 session 边界\n                    ├─ SessionView → logs\u002F{session}.view.json\n                    │  结构化会话视图\n           
         ├─ Auto-grade (rules + LLM judge) → logs\u002F{session}.grade.json\n                    │  自动评分（规则 + LLM 评审）\n                    ├─ Diagnose (10 rule-based checks + LLM summary) → logs\u002F{session}.diagnose.json\n                    │  行为诊断（10条规则 + LLM 自然语言总结）\n                    ├─ Probe (LLM agent reviews agent source config) → logs\u002F{session}.probe.json\n                    │  探针审查（LLM agent 审查被评测 agent 的 prompt\u002Fskills\u002Ftools 配置）\n                    └─ Web Dashboard → http:\u002F\u002F127.0.0.1:57633\u002Fdashboard\u002F\n                       Web 评测面板\n```\n\n## Screenshots \u002F 界面展示\n\n### Dashboard \u002F 面板列表\n\n![Dashboard](src\u002Fassets\u002Fdashboard.png)\n\n### Grader \u002F 评分详情\n\n![Grader](src\u002Fassets\u002Fgrader.png)\n\n### Diagnosis \u002F 诊断结果\n\n![Diagnosis](src\u002Fassets\u002Fdiagnosis.png)\n\n### Probe \u002F 探针审查\n\n![Probe](src\u002Fassets\u002Fprobe.png)\n\n## Quick Start \u002F 快速开始\n\n### 1. Configure `.env` \u002F 配置\n\n```bash\n# Upstream LLM API \u002F 上游 LLM 地址\nAGENTEVAL_UPSTREAM=https:\u002F\u002Fapi.edgefn.net\n\n# Proxy port \u002F 代理监听端口\nAGENTEVAL_PORT=57633\n\n# Log directory \u002F 日志目录\nAGENTEVAL_LOG_DIR=.\u002Flogs\n\n# Judge LLM for grading, diagnosis summary, and probing \u002F 评测 LLM（评分+诊断总结+探针共用）\nAGENTEVAL_JUDGE_API_BASE=https:\u002F\u002Fapi.deepseek.com\nAGENTEVAL_JUDGE_MODEL=deepseek-chat\nAGENTEVAL_JUDGE_API_KEY=sk-xxx\n\n# Source project directory for probe \u002F 被探针审查的 agent 项目路径\nPROBE_SOURCE_PROJECT_DIR=\u002Fpath\u002Fto\u002Fyour\u002Fagent\u002Fproject\n```\n\n### 2. Start the proxy \u002F 启动代理\n\n```bash\ncargo run\n# listening http:\u002F\u002F127.0.0.1:57633 -> https:\u002F\u002Fapi.edgefn.net\n# dashboard http:\u002F\u002F127.0.0.1:57633\u002Fdashboard\u002F\n```\n\n### 3. 
Configure your Agent \u002F 配置 Agent\n\nPoint your Agent's `BASE_URL` to the proxy.\n\n*将 Agent 的 `BASE_URL` 指向代理地址。*\n\n```bash\n# Claude Code\nexport ANTHROPIC_BASE_URL=http:\u002F\u002F127.0.0.1:57633\n\n# OpenAI SDK\nexport OPENAI_BASE_URL=http:\u002F\u002F127.0.0.1:57633\u002Fv1\n\n# Generic\nMODEL_BASE_URL=http:\u002F\u002F127.0.0.1:57633\u002Fv1 your-agent-command\n```\n\n### 4. View results \u002F 查看结果\n\nOpen **http:\u002F\u002F127.0.0.1:57633\u002Fdashboard\u002F** in your browser.\n\n*浏览器打开上述地址查看评测面板。*\n\n## Web Dashboard \u002F Web 面板\n\n### Session List \u002F 会话列表\n\n| Feature | Description |\n|---------|-------------|\n| Filter bar \u002F 过滤 | All, \u003C6, 6-8, >8; low scores first by default \u002F 全部、\u003C6 分、6-8 分、>8 分，默认低分优先 |\n| Pagination \u002F 分页 | Auto-paginates beyond 10 entries, with smart page-number navigation \u002F 超过 10 条自动分页，智能页码导航 |\n| Score columns \u002F 评分列 | Overall score + 4 dimension mini-bars \u002F 总分 + 四维迷你进度条 |\n| Grade button \u002F 评分按钮 | Ungraded sessions show inline `[Grade]` button \u002F 未评分会话可直接触发 |\n| Diagnose badge \u002F 诊断标记 | `⚠ N issues` \u002F `✓ clean` \u002F `[Diagnose]` button |\n| Probe badge \u002F 探针标记 | `🔍 N findings` \u002F `✓ no findings` \u002F `[Probe]` button |\n\n### Detail View \u002F 详情页\n\n| Section | Description |\n|---------|-------------|\n| Grade \u002F 评分 | Large overall score + 4 dimension cards with LLM judge reasons |\n| Diagnose \u002F 诊断 | **AI Summary** (LLM natural-language summary at top) + issue list (severity + category + detail + evidence) |\n| Probe \u002F 探针 | **AI Summary** (LLM overall assessment at top) + findings list (confidence + root cause + recommendation + evidence) |\n| Conversation \u002F 对话 | Expandable turns: user input + reasoning + text + tool calls + results |\n| Scroll anchor \u002F 锚点跳转 | Clicking diagnose\u002Fprobe badge from list auto-scrolls to that panel |\n\n## CLI \u002F 命令行\n\n```bash\n# Run diagnosis on a session \u002F 对某个 session 运行诊断\ncargo run -- diagnose \u003Csession_id> [--format terminal|json]\n\n# Run probe on a session (requires prior diagnosis) 
\u002F 对某个 session 运行探针\ncargo run -- probe \u003Csession_id>\n```\n\n## Session Splitting \u002F Session 自动切分\n\n| Trigger \u002F 触发条件 | Behavior \u002F 行为 |\n|---|---|\n| New conversation (message array rollback) \u002F 用户开新对话 | Seal old session → background grade → start new session |\n| 2-minute idle timeout \u002F 2 分钟无新请求 | Same as above |\n| Proxy shutdown \u002F 进程退出 | Flush last session (synchronous grade) |\n\n**Detection logic:** Normal conversations grow messages turn-by-turn. A new conversation \"shrinks\" back to just the system prompt + new question. When `common_prefix_len \u003C= 1`, it's treated as a new session.\n\n*正常对话 messages 逐轮增长，新对话 messages 会\"回缩\"。当 `common_prefix_len \u003C= 1` 时判定为新 session。*\n\n## Grading \u002F 评分\n\nFour weighted dimensions → overall 0–1 score. Rule metrics + LLM judge. Falls back to rule-based estimates if LLM is unavailable.\n\n*四个加权维度 → 0-1 总分。规则统计 + LLM 评审。LLM 不可用时自动降级。*\n\n| Dimension \u002F 维度 | Source \u002F 来源 | Weight \u002F 权重 | What it measures \u002F 衡量内容 |\n|---|---|---|---|\n| `task_completion` | LLM judge | 0.35 | Did the agent complete the user's task? \u002F 是否完成用户任务 |\n| `tool_efficiency` | Rule-based | 0.30 | Tool errors, duplicate calls, call patterns \u002F 工具调用成功率、重复惩罚 |\n| `response_quality` | LLM judge | 0.20 | Accuracy, conciseness, substance \u002F 回复准确性、简洁性、实质性 |\n| `performance` | Rule-based | 0.15 | Token efficiency, latency, turn count \u002F Token 效率、耗时、turn 数 |\n\n## Diagnose \u002F 行为诊断\n\n10 rule-based checks across 4 categories. Pure rule engine (no LLM). 
After rules run, an LLM generates a 2-3 sentence natural-language summary (best-effort, skipped if no API key).\n\n*10 条规则检查，覆盖 4 个类别。纯规则引擎（不依赖 LLM）。规则运行后，LLM 生成 2-3 句自然语言总结（best-effort，无 API key 时跳过）。*\n\n| Category \u002F 类别 | Rules \u002F 规则 | What it detects \u002F 检测内容 |\n|---|---|---|\n| Tool (4 rules) | result_missing, result_error, duplicate_3plus, result_empty | Broken tool chain, retry loops, silent failures \u002F 工具链断裂、重试循环、静默失败 |\n| Prompt (2 rules) | bloat, context_overflow | Overlong system prompts, orphan tool call IDs \u002F 系统提示过长、孤儿 tool call ID |\n| Token (3 rules) | empty_response, waste, excessive_input | Empty responses, token waste, oversized input \u002F 空响应、token 浪费、超大输入 |\n| View (1 rule) | mismatch | Data integrity: turn count vs jsonl entries \u002F 数据一致性校验 |\n\n## Probe \u002F 探针审查\n\nAn LLM agent with 4 read-only file tools enters the agent's source project directory (`PROBE_SOURCE_PROJECT_DIR`), reviews configuration files (CLAUDE.md, skills, prompts, tools), and identifies root causes for each diagnose issue. 
All recommendations are written to the report — **never auto-applied**.\n\n*携带 4 个只读文件工具的 LLM agent，进入被评测 agent 的项目目录，审查配置文件（CLAUDE.md、skills、prompts、tools），找到每个 diagnose issue 的配置根因。所有改进建议写入 report — 绝不自动修改代码。*\n\n| Tool | What it does | Safety limit |\n|------|-------------|-------------|\n| `read_file` | Read file contents \u002F 读取文件 | 1MB truncation |\n| `grep` | Search with regex \u002F 正则搜索 | 1000 lines |\n| `list_dir` | List directory entries \u002F 列出目录 | — |\n| `glob` | Find files by pattern \u002F 按模式查找文件 | 2000 entries |\n\n**Safety \u002F 安全机制:**\n- All paths sandboxed: `..` rejected, canonicalize verification \u002F 路径沙箱：拒绝 `..` + 二次校验\n- Loop detection: 3 identical calls → warning injected \u002F 循环检测：3 次重复调用 → 注入警告\n- Max 30 steps \u002F 最多 30 步\n- Read-only: no write\u002Fexecute tools \u002F 只读不写\n- LLM timeout: 300s \u002F LLM 超时 300 秒\n\n## Output Files \u002F 输出文件\n\n```\n{AGENTEVAL_LOG_DIR}\u002F\n├── {stem}.jsonl                 ← Raw recorded traffic \u002F 原始流量\n├── {stem}_{N}.view.json         ← Session structured view \u002F 结构化视图\n├── {stem}_{N}.grade.json        ← Grade report \u002F 评分报告\n├── {stem}_{N}.diagnose.json     ← Diagnosis report \u002F 诊断报告\n└── {stem}_{N}.probe.json        ← Probe report \u002F 探针报告\n```\n\n## Configuration Reference \u002F 配置参考\n\n| Variable \u002F 变量 | Default \u002F 默认值 | Description \u002F 说明 |\n|---|---|---|\n| `AGENTEVAL_UPSTREAM` | `https:\u002F\u002Fapi.deepseek.com` | Target LLM API \u002F 上游 API 地址 |\n| `AGENTEVAL_PORT` | `57633` | Local proxy port \u002F 代理端口 |\n| `AGENTEVAL_LOG_DIR` | `~\u002F.agenteval\u002Flogs` | Log output directory \u002F 日志目录 |\n| `AGENTEVAL_VERBOSE` | `false` | Print request bodies \u002F 打印请求体 |\n| `AGENTEVAL_UI_ENABLED` | `true` | Enable web dashboard \u002F 启用 Web 面板 |\n| `AGENTEVAL_JUDGE_API_BASE` | same as upstream | Judge LLM URL \u002F 评测 LLM 地址（评分+诊断总结+探针共用） |\n| `AGENTEVAL_JUDGE_MODEL` | `MiniMax-M2.5` | Judge LLM model \u002F 评测 LLM 模型 |\n| 
`AGENTEVAL_JUDGE_API_KEY` | (empty) | Judge LLM API key \u002F 评测 LLM API Key |\n| `PROBE_SOURCE_PROJECT_DIR` | (empty) | Agent source dir for probe \u002F 被探针审查的项目目录 |\n\n## Docs \u002F 文档\n\n| Document \u002F 文档 | Content \u002F 内容 |\n|---|---|\n| [proxy.md](docs\u002Fproxy.md) | Proxy architecture \u002F 代理架构 |\n| [eval-design.md](docs\u002Feval-design.md) | Eval module design \u002F Eval 模块设计 |\n| [eval-impl.md](docs\u002Feval-impl.md) | Eval module implementation \u002F Eval 模块实现 |\n| [grader-design.md](docs\u002Fgrader-design.md) | Grader design \u002F Grader 方案设计 |\n| [grader-impl.md](docs\u002Fgrader-impl.md) | Grader implementation \u002F Grader 实现细节 |\n| [diagnose-design.md](docs\u002Fdiagnose-design.md) | Diagnose design \u002F Diagnose 方案设计 |\n| [diagnose-impl.md](docs\u002Fdiagnose-impl.md) | Diagnose implementation \u002F Diagnose 实现记录 |\n| [probe-design.md](docs\u002Fprobe-design.md) | Probe design \u002F Probe 方案设计 |\n| [probe-impl.md](docs\u002Fprobe-impl.md) | Probe implementation \u002F Probe 实现记录 |\n| [llm-summary-impl.md](docs\u002Fllm-summary-impl.md) | LLM summary for diagnose + probe \u002F LLM 总结实现 |\n| [web-ui-plan.md](docs\u002Fweb-ui-plan.md) | Web UI design plan \u002F Web UI 设计计划 |\n| [web-ui-impl.md](docs\u002Fweb-ui-impl.md) | Web UI implementation \u002F Web UI 实现记录 |\n| [dataflow.md](docs\u002Fdataflow.md) | Data flow \u002F 数据流 |\n",2,"2026-06-11 04:11:31","CREATED_QUERY"]