[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-2487":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":14,"subscribersCount":14,"size":14,"stars1d":14,"stars7d":14,"stars30d":15,"stars90d":14,"forks30d":14,"starsTrendScore":14,"compositeScore":16,"rankGlobal":9,"rankLanguage":9,"license":17,"archived":18,"fork":18,"defaultBranch":19,"hasWiki":20,"hasPages":18,"topics":21,"createdAt":9,"pushedAt":9,"updatedAt":27,"readmeContent":28,"aiSummary":29,"trendingCount":14,"starSnapshotCount":14,"syncStatus":30,"lastSyncTime":31,"discoverSource":32},2487,"agent-safety-eval-lab","YutoTerashima\u002Fagent-safety-eval-lab","YutoTerashima","Agent trace and tool-use safety evaluation lab.",null,"Python",358,18,12,0,172,3.84,"MIT License",false,"main",true,[22,23,24,25,26],"ai-agents","evals","llm-safety","red-teaming","tool-use","2026-06-12 02:00:41","# Agent Safety Eval Lab\n\nA reproducible lab for evaluating LLM agents as systems: messages, tool calls,\npolicy boundaries, traces, and safety outcomes.\n\nThis repository is designed to run in **mock mode by default**. Real OpenAI,\nHugging Face, or LiteLLM adapters can be added later without changing the eval\nschema.\n\n## Why It Matters\n\nAgent failures are often workflow failures, not single-message failures. A useful\nevaluation needs to inspect the trajectory: what the agent saw, which tools it\ncalled, whether the calls were allowed, and how the final answer handled risk.\n\n## Architecture\n\n```mermaid\nflowchart LR\n  A[\"Eval Case\"] --> B[\"Mock \u002F Model Adapter\"]\n  B --> C[\"Agent Trace Recorder\"]\n  C --> D[\"Tool Policy Grader\"]\n  C --> E[\"Safety Rubric Grader\"]\n  D --> F[\"Risk Report\"]\n  E --> F\n```\n\n## Quick Start\n\n```bash\npython -m venv .venv\n. 
.venv\u002FScripts\u002Factivate\npip install -e \".[dev]\"\npython examples\u002Frun_mock_eval.py\npytest\n```\n\n## Example Output\n\n```text\ncases=3 passed=2 failed=1 high_risk=1\nC-002: fail | tool_policy_violation | blocked_tool=file.delete\n```\n\n## Repository Layout\n\n- `src\u002Fagent_safety_eval_lab\u002F`: schema, mock runner, trace grader\n- `datasets\u002F`: small public\u002Fmock eval cases\n- `evals\u002F`: rubric and policy definitions\n- `reports\u002F`: paper-style mini report\n- `docs\u002Farchitecture.md`: implementation notes\n- `docs\u002Fresearch_brief.md`: problem framing, method, limitations, next experiments\n\n## Integration Points\n\nAdapters should return a normalized `AgentTrace`. The grader does not care whether\nthe trace came from OpenAI Agents SDK, LangGraph, a local model, or a replayed JSONL\nfile.\n\n## Portfolio Notes\n\nThis is the flagship project: it ties together agent traces, tool policy, and safety rubrics in one replayable mock pipeline.\n\n## Deeper Analysis\n\n`examples\u002Frun_trace_analytics.py` generates `reports\u002Ftrace_analytics.json` and\n`reports\u002Ftrace_analytics_report.md`, adding per-trace risk scores, denied-tool\ncounts, latency totals, pass rate, and review queue analysis.\n\n## Experiment Artifacts\n\n- Dataset: [`datasets\u002Fagent_trace_eval_cases.json`](datasets\u002Fagent_trace_eval_cases.json)\n- Results: [`reports\u002Fagent_trace_eval_results.csv`](reports\u002Fagent_trace_eval_results.csv), [`reports\u002Fagent_trace_eval_results.json`](reports\u002Fagent_trace_eval_results.json)\n- Analysis: [`reports\u002Fexperiment_analysis.md`](reports\u002Fexperiment_analysis.md)\n\n## CLI\n\n```bash\npython -m agent_safety_eval_lab.cli run-demo\npython -m agent_safety_eval_lab.cli replay examples\u002Ftraces --out reports\u002Freplay_results.json\n```\n\nThe replay command evaluates stored JSON traces against a deterministic policy\nlayer, producing risk scores, per-tool decisions, and 
pass\u002Freview\u002Ffail verdicts.\n\n## Full Trace Suite\n\nThe repository includes a 24-case trace suite in\n[`datasets\u002Ffull_trace_suite.json`](datasets\u002Ffull_trace_suite.json) with generated\nresults and analysis in [`reports\u002Ffull_trace_suite_analysis.md`](reports\u002Ffull_trace_suite_analysis.md).\n\n## Suite Coverage\n\n`examples\u002Fcheck_suite_coverage.py` validates that the eval suite covers pass,\nunsafe content, tool-policy violation, tool-budget violation, and at least five\ntool types.\n\n## Real Public Dataset Experiment\n\nThis repository now includes a sanitized feature sample from\n[PKU-Alignment\u002FBeaverTails](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FPKU-Alignment\u002FBeaverTails)\nin `datasets\u002Fexternal\u002Fbeavertails_feature_sample.jsonl`, plus a reproducible analysis in\n`reports\u002Freal_beavertails_analysis.md`. The data is stored as hashes, lengths, safety labels,\nand category features to support real safety-risk coverage analysis without publishing raw\nharmful generations.\n\n## GPU-Backed Real Experiment\n\nThis repository now includes a reproducible GPU-backed experiment using `PKU-Alignment\u002FBeaverTails`.\nThe smoke path runs on the local RTX 5090 Laptop GPU through the `Transformers` conda\nenvironment and writes metrics, figures, and a markdown report.\n\n```powershell\nconda run -n Transformers python scripts\u002Fdownload_data.py --smoke\nconda run -n Transformers python scripts\u002Fpreprocess_data.py --max-samples 384\nconda run -n Transformers python scripts\u002Frun_experiment.py --device cuda --smoke\nconda run -n Transformers python scripts\u002Fmake_report.py\n```\n\nMain report: `reports\u002Fagent_safety_gpu_benchmark.md`.\n\n\u003C!-- V2_RESEARCH_UPGRADE -->\n## Publishable V2 Research Results\n\nThis repository now includes a full V2 research suite with real data, multiple baselines, ablations, result artifacts, figures, and 
failure analysis. The README summarizes the measured run so the project can be judged from results, not just project intent.\n\n### Dataset And Scale\n\nBeaverTails safety conversations, processed from the larger `330k_train` split; the full V2 run evaluates 50,000 prompt\u002Fresponse examples.\n\n- Full-profile result rows: `4`\n- Experiment profile: `full`\n- Experiment index: [`reports\u002Fresults\u002Fexperiment_index.json`](reports\u002Fresults\u002Fexperiment_index.json)\n- Full report: [`reports\u002Fagent_safety_v2_research_report.md`](reports\u002Fagent_safety_v2_research_report.md)\n\n### Main Results\n\n| experiment_id | accuracy | macro_f1 | unsafe_recall | unsafe_precision | auroc | runtime_seconds |\n| --- | --- | --- | --- | --- | --- | --- |\n| rule_safety_keywords | 0.4889 | 0.4481 | 0.1958 | 0.6237 | 0.5245 | 0.2180 |\n| tfidf_word_lr_prompt_response | 0.7752 | 0.7743 | 0.7557 | 0.8240 | 0.8593 | 4.2690 |\n| tfidf_char_lr_prompt_response | 0.7589 | 0.7580 | 0.7403 | 0.8085 | 0.8406 | 14.3220 |\n| gpu_tfidf_mlp_prompt_response | 0.6861 | 0.6576 | 0.8793 | 0.6636 | 0.7994 | 5.3940 |\n\n### Analysis\n\n- The word TF-IDF logistic baseline is the strongest measured classifier in this matrix, reaching macro-F1 around 0.774 and AUROC around 0.859 on the 50k run.\n- The keyword rule baseline has high safe recall but misses many unsafe cases, which is exactly the failure mode that motivates trace-aware grading rather than simple blocklists.\n- The GPU MLP over TF-IDF features increases unsafe recall relative to safe recall, showing a recall-oriented operating point that would need calibration before production use.\n- Failure examples are intentionally redacted in public artifacts; the casebook preserves labels, scores, error type, and size metadata without publishing unsafe instructions.\n\n### Failure Analysis\n\n- `false_negative`: 67 records\n- `false_positive`: 13 records\n\nThe public failure artifacts use redacted previews or structured metadata 
where source examples may contain harmful, private, or otherwise sensitive text. This keeps the analysis reproducible without turning the README into a prompt-injection or unsafe-content corpus.\n\n### Key Artifacts\n\n- [`reports\u002Fresults\u002Fv2_main_results.csv`](reports\u002Fresults\u002Fv2_main_results.csv)\n- [`reports\u002Fresults\u002Fv2_ablation_results.csv`](reports\u002Fresults\u002Fv2_ablation_results.csv)\n- [`reports\u002Fresults\u002Fv2_failure_cases.json`](reports\u002Fresults\u002Fv2_failure_cases.json)\n- [`reports\u002Ffigures\u002Fv2_accuracy_by_experiment.png`](reports\u002Ffigures\u002Fv2_accuracy_by_experiment.png)\n- [`reports\u002Ffigures\u002Fv2_confusion_matrix.png`](reports\u002Ffigures\u002Fv2_confusion_matrix.png)\n- [`reports\u002Ffigures\u002Fv2_model_macro_f1.png`](reports\u002Ffigures\u002Fv2_model_macro_f1.png)\n\n### Reproduction\n\n```powershell\nconda run -n Transformers python scripts\u002Frun_matrix.py --device cuda --profile full\nconda run -n Transformers python scripts\u002Fanalyze_failures.py\nconda run -n Transformers python scripts\u002Fmake_report.py\nconda run -n Transformers python -m pytest\n```\n","Agent Safety Eval Lab is a lab for evaluating the safety of large language model (LLM) agent systems, focused on analyzing messages, tool calls, policy boundaries, traces, and safety outcomes. Its core features include mock-mode execution, agent trace recording, tool-policy grading, and safety evaluation, and it generates risk reports. Its technical distinction is support for multiple model adapters (such as OpenAI and Hugging Face) while keeping the evaluation framework consistent. It suits scenarios that require understanding or testing how AI agents behave on specific tasks and what risks they pose, such as developing safe, compliant AI assistants.",2,"2026-06-11 02:50:04","CREATED_QUERY"]