[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-80800":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":13,"contributorsCount":13,"subscribersCount":13,"size":13,"stars1d":13,"stars7d":15,"stars30d":16,"stars90d":13,"forks30d":13,"starsTrendScore":13,"compositeScore":17,"rankGlobal":10,"rankLanguage":10,"license":18,"archived":19,"fork":19,"defaultBranch":20,"hasWiki":19,"hasPages":21,"topics":22,"createdAt":10,"pushedAt":10,"updatedAt":23,"readmeContent":24,"aiSummary":25,"trendingCount":13,"starSnapshotCount":13,"syncStatus":16,"lastSyncTime":26,"discoverSource":27},80800,"Pi-Bench","Simplified-Reasoning\u002FPi-Bench","Simplified-Reasoning","Benchmark for proactive personal assistant agents in long-horizon workflows.","",null,"Python",44,0,38,1,2,40.7,"Apache License 2.0",false,"main",true,[],"2026-06-11 04:07:14","\u003Ch1 align=\"center\">π-Bench: Evaluating Proactive Personal Assistant Agents in Long-Horizon Workflow\u003C\u002Fh1>\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\".\u002Fassets\u002Fpi-bench-overview.png\" alt=\"Pi-Bench Overview\" width=\"100%\" \u002F>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.14678\">\n    \u003Cimg alt=\"arXiv\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2605.14678-B31B1B?style=for-the-badge&logo=arxiv&logoColor=white\" \u002F>\n  \u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fsimplified-reasoning.github.io\u002FPi-Bench\u002F\">\n    \u003Cimg alt=\"Project Page\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPROJECT_PAGE-3B82F6?style=for-the-badge&logo=googlechrome&logoColor=white\" \u002F>\n  \u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FSimplified-Reasoning\u002FPi-Bench\">\n    \u003Cimg alt=\"GitHub\" 
src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FGITHUB-181717?style=for-the-badge&logo=github&logoColor=white\" \u002F>\n  \u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fzzzhr97\u002FPi-Bench\">\n    \u003Cimg alt=\"Dataset\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDATASET-FFD21E?style=for-the-badge&logo=huggingface&logoColor=black\" \u002F>\n  \u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fpapers\u002F2605.14678\">\n    \u003Cimg alt=\"HF Daily Paper\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FHF--DAILY--PAPER-FFD21E?style=for-the-badge&logo=huggingface&logoColor=black\" \u002F>\n  \u003C\u002Fa>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  \u003Ca href=\"#news\">📢 News\u003C\u002Fa> •\n  \u003Ca href=\"#introduction\">🧭 Introduction\u003C\u002Fa> •\n  \u003Ca href=\"#leaderboard\">🏆 Leaderboard\u003C\u002Fa> •\n  \u003Ca href=\"#setup\">🚀 Getting Started\u003C\u002Fa>\n\u003C\u002Fp>\n\u003Cp align=\"center\">\n  \u003Ca href=\"#run\">🛠️ Run\u003C\u002Fa> •\n  \u003Ca href=\"#outputs\">📦 Outputs\u003C\u002Fa> •\n  \u003Ca href=\"#acknowledgement\">🙏 Acknowledgement\u003C\u002Fa> •\n  \u003Ca href=\"#citation\">📚 Citation\u003C\u002Fa>\n\u003C\u002Fp>\n\n---\n\n\u003Ca id=\"news\">\u003C\u002Fa>\n## 📢 News\n\n- [May 2026] `π-BENCH` is available on arXiv: [2605.14678](https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.14678).\n- [May 2026] Project page is online: https:\u002F\u002Fsimplified-reasoning.github.io\u002FPi-Bench\u002F\n\n\u003Ca id=\"introduction\">\u003C\u002Fa>\n## 🧭 Introduction\n\n`π-BENCH` is a benchmark for **proactive personal assistant agents** in\nlong-horizon workflows, where users start with underspecified requests and\nimportant requirements emerge across interaction. 
It contains **100 multi-turn\ntasks** across **5 domain-specific personas** (`researcher`, `marketer`,\n`pharmacist`, `law_trainee`, `financier`) and organizes them as multi-session\nepisodes in persistent workspaces.\n\nThe benchmark jointly measures **Proactivity (PROC)** and **Completeness\n(COMP)**. PROC evaluates whether an agent resolves hidden intents early (through\ninference or focused elicitation) to reduce avoidable user burden, while COMP\nevaluates whether final deliverables satisfy checklist requirements and\nartifact-level obligations. Scoring combines rubric-based hidden-intent\njudgment and checklist validation, and audit results show low\njudge disagreement (**\u003C4%**), which supports evaluation reliability.\n\nCompared with benchmarks focused mainly on short-horizon tasks, GUI\u002Fmobile\ninteractions, or memory retrieval alone, `π-BENCH` emphasizes **persistent,\nartifact-centric workflows** with **hidden intents**, **inter-task\ndependencies**, and **cross-session continuity**, enabling clearer separation\nbetween reactive task completion and proactive assistance quality.\n\n\u003Ca id=\"leaderboard\">\u003C\u002Fa>\n## 🏆 Leaderboard\n\nOverall results for `Proc \u002F Comp` (%). 
Results are averaged over three runs,\nwith subscripts denoting standard deviation.\n\n| Model | Average&nbsp;Proc | Average&nbsp;Comp | Researcher | Marketer | Pharmacist | Law&nbsp;Trainee | Financier |\n| --- | --- | --- | --- | --- | --- | --- | --- |\n| GPT-5.4 | **67.0**\u003Csub>2.1\u003C\u002Fsub> | 65.6\u003Csub>1.8\u003C\u002Fsub> | 46.0&nbsp;\u002F&nbsp;66.4 | **78.2**&nbsp;\u002F&nbsp;67.1 | 75.9&nbsp;\u002F&nbsp;71.5 | **56.9**&nbsp;\u002F&nbsp;**61.9** | 78.1&nbsp;\u002F&nbsp;61.2 |\n| Gemini&nbsp;3.1&nbsp;Pro | 57.1\u003Csub>0.9\u003C\u002Fsub> | 60.0\u003Csub>0.8\u003C\u002Fsub> | 41.1&nbsp;\u002F&nbsp;59.2 | 65.0&nbsp;\u002F&nbsp;62.1 | 71.0&nbsp;\u002F&nbsp;72.1 | 50.0&nbsp;\u002F&nbsp;55.3 | 58.6&nbsp;\u002F&nbsp;51.1 |\n| Claude&nbsp;Opus&nbsp;4.6 | 65.5\u003Csub>1.4\u003C\u002Fsub> | **67.6**\u003Csub>1.5\u003C\u002Fsub> | **50.3**&nbsp;\u002F&nbsp;**74.5** | 75.0&nbsp;\u002F&nbsp;**74.6** | **82.8**&nbsp;\u002F&nbsp;68.6 | 45.7&nbsp;\u002F&nbsp;57.2 | 73.8&nbsp;\u002F&nbsp;**63.2** |\n| DeepSeek&nbsp;V3.2 | 53.3\u003Csub>1.9\u003C\u002Fsub> | 57.8\u003Csub>3.0\u003C\u002Fsub> | 29.0&nbsp;\u002F&nbsp;66.9 | 69.1&nbsp;\u002F&nbsp;59.4 | 75.9&nbsp;\u002F&nbsp;62.6 | 33.2&nbsp;\u002F&nbsp;51.1 | 59.1&nbsp;\u002F&nbsp;48.9 |\n| MiniMax&nbsp;M2.7 | 55.6\u003Csub>3.2\u003C\u002Fsub> | 60.0\u003Csub>1.8\u003C\u002Fsub> | 33.4&nbsp;\u002F&nbsp;63.9 | 71.9&nbsp;\u002F&nbsp;61.9 | 77.1&nbsp;\u002F&nbsp;63.6 | 38.6&nbsp;\u002F&nbsp;52.5 | 57.2&nbsp;\u002F&nbsp;58.1 |\n| Kimi&nbsp;K2.5 | 61.4\u003Csub>2.1\u003C\u002Fsub> | 53.9\u003Csub>0.8\u003C\u002Fsub> | 39.4&nbsp;\u002F&nbsp;52.6 | 68.2&nbsp;\u002F&nbsp;59.7 | 81.8&nbsp;\u002F&nbsp;78.3 | 46.5&nbsp;\u002F&nbsp;44.4 | 71.1&nbsp;\u002F&nbsp;34.4 |\n| Kimi&nbsp;K2.6 | 63.8\u003Csub>1.3\u003C\u002Fsub> | 62.0\u003Csub>1.2\u003C\u002Fsub> | 43.9&nbsp;\u002F&nbsp;60.3 | 69.5&nbsp;\u002F&nbsp;69.6 | 77.8&nbsp;\u002F&nbsp;**85.3** | 48.7&nbsp;\u002F&nbsp;55.5 | **79.2**&nbsp;\u002F&nbsp;39.4 |\n| 
Seed2.0&nbsp;Pro | 58.4\u003Csub>0.9\u003C\u002Fsub> | 52.1\u003Csub>3.8\u003C\u002Fsub> | 38.9&nbsp;\u002F&nbsp;59.6 | 71.4&nbsp;\u002F&nbsp;44.2 | 77.0&nbsp;\u002F&nbsp;67.6 | 46.0&nbsp;\u002F&nbsp;44.7 | 58.7&nbsp;\u002F&nbsp;44.5 |\n| GLM-5.1 | 58.4\u003Csub>0.8\u003C\u002Fsub> | 63.6\u003Csub>2.9\u003C\u002Fsub> | 41.8&nbsp;\u002F&nbsp;61.6 | 62.6&nbsp;\u002F&nbsp;69.1 | 75.2&nbsp;\u002F&nbsp;70.3 | 45.5&nbsp;\u002F&nbsp;57.3 | 66.7&nbsp;\u002F&nbsp;59.8 |\n| Qwen3.6&nbsp;Plus | 64.0\u003Csub>1.1\u003C\u002Fsub> | 64.1\u003Csub>0.6\u003C\u002Fsub> | 40.1&nbsp;\u002F&nbsp;70.0 | 77.5&nbsp;\u002F&nbsp;66.6 | 79.7&nbsp;\u002F&nbsp;70.2 | 45.7&nbsp;\u002F&nbsp;60.2 | 77.1&nbsp;\u002F&nbsp;53.6 |\n\n\u003Ca id=\"setup\">\u003C\u002Fa>\n## 🧰 Setup\n\n1. Create and activate a Python environment:\n\n```bash\nconda create -n pi-bench python=3.11\nconda activate pi-bench\n```\n\n2. Install local dependencies and prepare AppWorld data:\n\n```bash\npip install -e .\npip install -e third_party\u002Fnanobot\nbash scripts\u002Fsetup_appworld.sh\n```\n\n3. Create a local environment file and fill in the provider credentials:\n\n```bash\ncp env.example.sh env.sh\n```\n\nThe template leaves all values empty. Edit `env.sh` with your local credentials,\nthen source it in every shell where you run the benchmark:\n\n```bash\nsource env.sh\n```\n\n`env.sh` is ignored by git. The default model configs read credentials from\n`MODEL_BASE_URL`, `MODEL_API_KEY`, `USER_BASE_URL`, `USER_API_KEY`,\n`JUDGER_BASE_URL`, `JUDGER_API_KEY`, and `BRAVE_SEARCH_API_KEY`. These values\nfill the placeholders in `config\u002Fmodels\u002F*.yaml`, such as\n`config\u002Fmodels\u002Fexample.full.yaml`.\n\n4. Pull the benchmark Docker image:\n\n```bash\ndocker pull zzzhr97\u002Fpi-bench:latest\n```\n\n5. Optionally edit the target model file under `config\u002Fmodels\u002F`.\n\nMost users only need to configure `env.sh`. 
Edit `config\u002Fmodels\u002F\u003Cmodel-id>.yaml`\nonly when you need to change model names, endpoints, proxy settings, timeouts,\nor other per-model overrides. The YAML filename stem is the model id passed to\n`pibench`; see `config\u002Fmodels\u002Fexample.full.yaml` for the complete schema.\n\n\u003Ca id=\"run\">\u003C\u002Fa>\n## ▶️ Run\n\nRun from the repository root. For benchmark reporting, use three repeated trials\nas the default run pattern:\n\n```bash\npibench --model-id deepseek-v3.2 --run 3\n```\n\nEach repeat is written to a separate output directory with a `__runNN` suffix.\n\nAdditional examples:\n\n| Goal | Command |\n| --- | --- |\n| Single trial | `pibench --model-id deepseek-v3.2` |\n| Specific user | `pibench --user-id law_trainee --model-id deepseek-v3.2` |\n| Multiple models | `pibench --model-id deepseek-v3.2,MiniMax-M2.5` |\n| Multiple users and models | `pibench --user-id researcher,law_trainee --model-id deepseek-v3.2,MiniMax-M2.5` |\n\n\u003Ca id=\"outputs\">\u003C\u002Fa>\n## 📦 Outputs\n\nResults and logs are written under:\n\n```text\noutputs\u002F\u003Cmodel-id>\u002F\u003Cuser-id>\u002F\n```\n\nRuntime logs for each container run are under:\n\n```text\noutputs\u002F\u003Cmodel-id>\u002F\u003Cuser-id>\u002Frun\u002F\u003Ctimestamp>-runtime\u002F\n```\n\n\u003Ca id=\"acknowledgement\">\u003C\u002Fa>\n## 🙏 Acknowledgement\n\nPi-Bench is built on top of AppWorld and NanoBot. 
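We thank the contributors to these open-source projects.\n\n\u003Ca id=\"citation\">\u003C\u002Fa>\n## 📚 Citation\n\n```bibtex\n@misc{zhang2026pibenchevaluatingproactivepersonal,\n  title={$\\pi$-Bench: Evaluating Proactive Personal Assistant Agents in Long-Horizon Workflows},\n  author={Haoran Zhang and Luxin Xu and Zhilin Wang and Runquan Gui and Shunkai Zhang and Haodi Lei and Zihao He and Bingsu He and Chicheng Qin and Tong Zhu and Xiaoye Qu and Yang Yang and Yu Cheng and Yafu Li},\n  year={2026},\n  eprint={2605.14678},\n  archivePrefix={arXiv},\n  primaryClass={cs.AI},\n  url={https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.14678}\n}\n```\n","π-Bench is a benchmark for evaluating proactive personal assistant agents in long-horizon workflows. It uses 100 multi-turn tasks spanning 5 domain-specific personas (such as researcher and marketer) to measure how agents handle underspecified requests and sustained interaction. It scores two things: Proactivity (PROC), whether the agent identifies and resolves hidden intents early to reduce user burden, and Completeness (COMP), whether the final deliverables satisfy all requirements. π-Bench is written in Python and uses a detailed scoring scheme to keep evaluation results consistent and reliable. It suits scenarios where complex tasks require extended collaboration, especially multi-step tasks where the information the user provides at the start is incomplete.","2026-06-11 04:02:23","CREATED_QUERY"]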