[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-80000":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":15,"stars7d":13,"stars30d":16,"stars90d":15,"forks30d":15,"starsTrendScore":15,"compositeScore":17,"rankGlobal":9,"rankLanguage":9,"license":18,"archived":19,"fork":19,"defaultBranch":20,"hasWiki":21,"hasPages":19,"topics":22,"createdAt":9,"pushedAt":9,"updatedAt":23,"readmeContent":24,"aiSummary":25,"trendingCount":15,"starSnapshotCount":15,"syncStatus":26,"lastSyncTime":27,"discoverSource":28},80000,"SaaS-Bench","UniPat-AI\u002FSaaS-Bench","UniPat-AI","Official repository for SaaS-Bench: realistic, locally deployable SaaS workflows for GUI agent evaluation.",null,"Python",81,10,1,4,0,9,3.12,"Apache License 2.0",false,"main",true,[],"2026-06-12 02:03:56","\u003Cdiv align=\"center\">\n\n\u003Cimg src=\"assets\u002Fbanner.svg\" alt=\"SaaS-Bench\" width=70%\u002F>\n\n\u003Ch1>SaaS-Bench: Can Computer-Use Agents Leverage Real-World SaaS to Solve Professional 
Workflows?\u003C\u002Fh1>\n\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-Paper-b91c1c?style=for-the-badge&logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.15777)\n[![Blog](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FBlog-Read_Post-f59e0b?style=for-the-badge&logo=substack&logoColor=white)](https:\u002F\u002Funipat.ai\u002Fblog\u002FSaaS-Bench)\n[![Leaderboard](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLeaderboard-Results-2563eb?style=for-the-badge&logo=googleanalytics&logoColor=white)](https:\u002F\u002Funipat.ai\u002Fbenchmarks\u002FSaaS-Bench)\n[![GitHub](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FGitHub-Code-181717?style=for-the-badge&logo=github&logoColor=white)](https:\u002F\u002Fgithub.com\u002FUniPat-AI\u002FSaaS-Bench)\n\n\n\u003C\u002Fdiv>\n\n---\n\n## Overview\n\nA benchmark for evaluating LLM agents on **real, self-hosted SaaS\napplications**. Each task asks the agent to drive a browser through a\nmulti-step business workflow (project management, accounting, HR, document\nauthoring, etc.); a per-task verifier inspects the running application's\nstate to score the result.\n\nThe bench currently ships **106 task instances across 6 domains** (split\ninto a text-only **uni-m** track and a multimodal **multi-m** track) and\n**23 self-hosted SaaS apps**:\n\n| Track   | Domain      | Tasks | Representative apps                              |\n| ------- | ----------- | ----- | ------------------------------------------------ |\n| uni-m   | Business    | 15    | Twenty, Bigcapital, HRMS, Pretix                 |\n| uni-m   | Healthcare  | 16    | OpenEMR, OnlyOffice, OpnForm                     |\n| uni-m   | Software    | 31    | Baserow, OpenProject, code-server, Metabase      |\n| uni-m   | Teamwork    | 12    | OnlyOffice, Mattermost, RoundcubeMail, ownCloud  |\n| multi-m | Agriculture | 12    | Grocy, farmOS, Recipya, e-label                  |\n| multi-m | Media       | 20    | SiYuan, 
Watcharr, BookLore, PhotoPrism, MediaCMS |\n\nMulti-m tasks consume image \u002F audio \u002F PDF inputs from\n`tasks\u002Fmulti-m\u002Finputs\u002F`; verifiers locate them via paths relative to\n`verify.py`. Use `scripts\u002Ffetch_multimodal_assets.sh` to check that the\nexpected input files are present before running the multi-m suite.\n\nThe reference agent is built on [browser-use](https:\u002F\u002Fgithub.com\u002Fbrowser-use\u002Fbrowser-use)\nand talks to any OpenAI-compatible chat-completions endpoint. You can swap\nin your own agent — only the `verify.py` contract is load-bearing.\n\n## Repository layout\n\n```\nsaas_bench\u002F         Eval harness (Python package)\n  run.py            CLI entry: orchestrates concurrent task execution\n  agent.py          browser-use reference agent\n  slot.py           Per-slot Docker container manager\n  loader.py         Task discovery + prompt builder\n  verify_runner.py  Runs verify.py and parses results\n  apps.yaml         App registry (ports, start commands, health probes)\n\ndocker\u002F             Compose templates + image archives (download separately)\ntasks\u002F\n  uni-m\u002F            Text-only tasks (Software, Business, Healthcare, Teamwork)\n  multi-m\u002F          Multimodal tasks (Agriculture, Media) + inputs\u002F assets\nscripts\u002F            run.sh, stop_all.sh, load_images.sh, fetch_multimodal_assets.sh\ndocs\u002F               Verify protocol and task format specifications\n```\n\n## Prerequisites\n\n- Linux host (tested on Ubuntu 22.04 \u002F Alibaba Cloud Linux)\n- Docker 24+ with the `compose` plugin\n- Python ≥ 3.10\n- ~100 GB free disk for the SaaS app images\n- Outbound network access (for first-time pull of compose-stack auxiliary\n  images and for `pip install scipy numpy` inside the code-server container)\n\n## Setup\n\n```bash\n# 1. 
Clone and install the Python package\ngit clone \u003Cthis-repo>.git SaaS-Bench\ncd SaaS-Bench\npip install -e .\nplaywright install chromium\npip install socksio\n\n# 2. Download the SaaS-Bench Docker images from Hugging Face (see docker\u002FREADME.md\n#    for the URL) and place the .tar files under docker\u002Fimages\u002F, then:\nbash scripts\u002Fload_images.sh\n\n# 3. Configure your LLM endpoint\ncp .env.example .env\n$EDITOR .env   # set LLM_API_KEY, LLM_BASE_URL, LLM_MODEL\n```\n\n## Running the eval\n\n**We recommend running the evaluation on a machine with more than 500 GB of RAM to support parallel SaaS environment deployment and long-horizon agent execution.**\n\nRun all tasks with 4 concurrent workers:\n\n```bash\nbash scripts\u002Frun.sh\n```\n\nUseful flags:\n\n```bash\nbash scripts\u002Frun.sh --workers 8                                 # bump concurrency\nbash scripts\u002Frun.sh --tasks-dir tasks\u002Funi-m\u002FBusiness                  # one domain\nbash scripts\u002Frun.sh --task-ids business_023 software_004             # cherry-pick\nbash scripts\u002Frun.sh --max-steps 200                             # tighter step budget\nbash scripts\u002Frun.sh --result-dir results\u002Frun_2026_05_05         # custom output dir\nbash scripts\u002Frun.sh --no-isolation                              # reuse already-running containers\nbash scripts\u002Frun.sh --log results\u002Frun.log                       # also tee to a file\n```\n\nFor each worker, the harness:\n1. Picks a slot id and computes app ports as `30000 + slot_id*20 + app_index`.\n2. Starts the Docker containers \u002F compose stacks for that task's `sites`.\n3. Launches a headless Chrome and a fresh browser-use Agent.\n4. Saves the agent trajectory to `\u003Cresult_dir>\u002F\u003Ctask_id>.json`.\n5. Runs `verify.py` and saves the score to `\u003Cresult_dir>\u002F\u003Ctask_id>_verify.json`.\n6. Tears down the containers and temporary directories.\n\nAggregated stats land in `\u003Cresult_dir>\u002Fsummary.json`. 
Errors are appended\nto `\u003Cresult_dir>\u002Ferrors.log` without aborting the run.\n\nWhen in doubt, you can purge stale containers from a previous (crashed) run:\n\n```bash\nbash scripts\u002Fstop_all.sh\n```\n\n## Bring your own agent\n\nThe harness invokes a single async function `run_task(task, model_name,\nprompt, result_dir, max_steps, slot_id, todo_md) -> dict`. Implement that\nin a module of your choice and have `saas_bench.run` import it instead of\n`saas_bench.agent`. The contract is intentionally tiny: return a dict with\n`status` (`completed` \u002F `error`), `agent_output` (string), and\n`trajectory` (list of step dicts). The verifier runs against the live\ndocker state; what the agent does to get there is up to you.\n\n## Adding a new task\n\nSee [docs\u002Ftask_format.md](docs\u002Ftask_format.md) and\n[docs\u002Fverify_protocol.md](docs\u002Fverify_protocol.md).\n\n## Citation\n\n```bibtex\n@misc{saasbench2026,\n  title  = {SaaS-Bench: Can Computer-Use Agents Leverage Real-World SaaS to Solve Professional Workflows?},\n  author = {UniPat AI},\n  year   = {2026}\n}\n```\n","SaaS-Bench is a benchmark for evaluating how well large language model agents carry out multi-step business workflows on real, locally deployed SaaS applications. The agent drives a browser through professional workflows (such as project management, accounting, and human resources), and a per-task verifier inspects the state of the running application to score the result, which allows agent performance to be assessed accurately. The project covers 106 task instances and 23 self-hosted SaaS apps across 6 domains, split into a text-only track (uni-m) and a multimodal track (multi-m). It suits research institutions, development teams, or anyone who wants to test how well an AI assistant solves problems in real software environments. It is written in Python and is easy to extend and customize with new agents for testing.",2,"2026-06-11 03:58:50","CREATED_QUERY"]