[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-80909":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":13,"stars7d":17,"stars30d":18,"stars90d":16,"forks30d":16,"starsTrendScore":19,"compositeScore":20,"rankGlobal":10,"rankLanguage":10,"license":21,"archived":22,"fork":22,"defaultBranch":23,"hasWiki":24,"hasPages":24,"topics":25,"createdAt":10,"pushedAt":10,"updatedAt":30,"readmeContent":31,"aiSummary":32,"trendingCount":16,"starSnapshotCount":16,"syncStatus":33,"lastSyncTime":34,"discoverSource":35},80909,"STATE-Bench","microsoft\u002FSTATE-Bench","microsoft","Benchmark AI Agents on Enterprise Workflows","https:\u002F\u002Fmicrosoft.github.io\u002FSTATE-Bench\u002Fleaderboard\u002F",null,"Python",45,6,33,1,0,11,12,18,2.54,"MIT License",false,"main",true,[26,27,28,29,7],"ai","ai-agents","benchmark","benchmark-framework","2026-06-12 02:04:08","\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002Fstate-bench-banner.svg\" alt=\"STATE-Bench: Benchmark For Enterprise Workflows\" width=\"100%\" \u002F>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Fmicrosoft.github.io\u002FSTATE-Bench\u002Fleaderboard\u002F\">\u003Cimg src=\"assets\u002Fleaderboard-live-badge.svg\" alt=\"Leaderboard Live\" \u002F>\u003C\u002Fa>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-MIT-green.svg\" alt=\"License\" \u002F>\n  \u003Ca href=\"https:\u002F\u002Fopensource.microsoft.com\u002Fblog\u002F2026\u002F05\u002F19\u002Fintroducing-state-bench-a-benchmark-for-ai-agent-memory\u002F\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FBlog-Read-blue\" alt=\"Blog\" \u002F>\u003C\u002Fa>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  \u003Ca 
href=\"docs\u002FRUN_BENCHMARK.md\">Main Track\u003C\u002Fa> &nbsp;·&nbsp; \u003Ca href=\"docs\u002FAGENT_LEARNING_TRACK.md\">Agent Learning Track\u003C\u002Fa>\n\u003C\u002Fp>\n\nSTATE-Bench evaluates AI agents on realistic, multi-step enterprise workflows across three domains: **travel**, **customer support**, and **shopping assistant**.\n\nEach task gives the agent a task-local sandbox database, domain-specific tools, and a simulated user. To pass a task, the agent must do multi-step reasoning by gathering the right information with domain tools, applying the correct policy, taking actions to update the database to the right final state when needed, and following the required procedure in conversation.\n\n## Overview\n\nSTATE-Bench includes 450 challenging enterprise tasks across three domains.\n\n| Domain | Tasks | Description |\n| --- | ---: | --- |\n| **Travel** | 150 | Flight, hotel, and car rental bookings; cancellations, updates, fee and policy reasoning, cross-product trip planning |\n| **Customer Support** | 150 | Returns, refunds, exchanges, warranty claims, cancellations, shipping issues, and order changes |\n| **Shopping Assistant** | 150 | Product search, cart updates, applying promos, loyalty redemption, shipping options, and compatibility checks |\n\n## Choose Your Benchmark Track\n\nStart with the track that matches what you want to evaluate. Each track guide links to the setup and reference docs only when you need them.\n\n| Goal | Start here |\n| --- | --- |\n| Evaluate an agent or model directly on the provided enterprise benchmark tasks | **[Main Track](docs\u002FRUN_BENCHMARK.md)** |\n| Evaluate agentic memory, skills, or prompt optimization | **[Agent Learning Track](docs\u002FAGENT_LEARNING_TRACK.md)** |\n\nThe **Main Track** is the default benchmark path. 
The **Agent Learning Track** uses the same simulator, domain tools, judges, and metrics, but adds train trajectories and a retrieval hook for reusable learnings such as memories, skills, or prompt optimizations.\n\n\u003Cbr\u002F>\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002Fchat_bubble_2.svg\" alt=\"Sample task trajectory from the Travel domain\" width=\"55%\" \u002F>\n  \u003Cbr\u002F>\n  \u003Cem>Sample task trajectory from the Travel domain.\u003C\u002Fem>\n\u003C\u002Fp>\n\n## Metrics\n\nSTATE-Bench reports four headline metrics:\n\n| Metric | What it measures |\n| --- | --- |\n| **Task Completion pass@1** | Average task completion rate across five runs per task. |\n| **Task Completion pass^5** | Percentage of tasks completed successfully on all five runs. |\n| **UX Score** | LLM-judged conversation quality on a 1-5 scale. |\n| **Cost Per Task** | Average agent cost from user-reported token usage and pricing. |\n\n## License\n\nSTATE-Bench is released under the MIT License. See [LICENSE](LICENSE).\n\n## Trademarks\n\nThis project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.\n\n## Disclosures\n\nDatasets provided in this benchmark were synthetically generated using large language models. 
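The benchmark is intended for research purposes, and users should exercise caution and consider the limitations of synthetic data when interpreting results.\n","STATE-Bench is a benchmark framework for evaluating how AI agents perform in enterprise workflows. It provides 450 multi-step tasks across three domains (travel, customer support, and shopping assistant), each equipped with a local sandbox database, domain-specific tools, and a simulated user. Core capabilities include evaluating multi-step reasoning, the accuracy of policy application, and procedural compliance in user interactions. It is written in Python and released under the MIT License. The project is particularly suited to enterprises or researchers who want to test their AI solutions in realistic business scenarios, especially in applications that require deep contextual understanding and complex decision-making.",2,"2026-06-11 04:02:47","CREATED_QUERY"]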