[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-80860":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":13,"openIssues":13,"contributorsCount":14,"subscribersCount":14,"size":14,"stars1d":14,"stars7d":15,"stars30d":16,"stars90d":14,"forks30d":14,"starsTrendScore":17,"compositeScore":18,"rankGlobal":10,"rankLanguage":10,"license":19,"archived":20,"fork":20,"defaultBranch":21,"hasWiki":22,"hasPages":22,"topics":23,"createdAt":10,"pushedAt":10,"updatedAt":29,"readmeContent":30,"aiSummary":31,"trendingCount":14,"starSnapshotCount":14,"syncStatus":17,"lastSyncTime":32,"discoverSource":33},80860,"ToolCUA","X-PLUG\u002FToolCUA","X-PLUG","ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents","https:\u002F\u002Fx-plug.github.io\u002FToolCUA\u002F",null,"Python",51,1,0,12,16,2,48.5,"MIT License",false,"main",true,[24,25,26,27,28],"agentic-rl","computer-use-agent","gui-agent","mllm","sandbox-environment","2026-06-12 04:01:30","\n\u003Ch1 style=\"\n  font-family:-apple-system,BlinkMacSystemFont,'Segoe UI',Helvetica,Arial,sans-serif;\n  font-size:48px;\n  font-weight:700;\n  line-height:1.25;\n  text-align:center;\n  margin:0 0 24px;\">\n  \u003Cimg src=\"assets\u002Ftongyi.png\" width=\"30px\" style=\"vertical-align: middle; margin-right: 10px;\">\n  ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents\n\u003C\u002Fh1>\n\n\u003Cp align=\"center\">\n&nbsp&nbsp🌐 \u003Ca href=\"https:\u002F\u002Fx-plug.github.io\u002FToolCUA\u002F\">Website\u003C\u002Fa>&nbsp&nbsp | &nbsp&nbsp📑 \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.12481\">Paper\u003C\u002Fa>&nbsp&nbsp | &nbsp&nbsp🤗 \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FmPLUG\u002FToolCUA-8B\">ToolCUA-8B\u003C\u002Fa>&nbsp&nbsp | &nbsp&nbsp📄 \u003Ca 
href=\"https:\u002F\u002Fx-plug.github.io\u002FToolCUA\u002F#case-study\">Cases\u003C\u002Fa>\n\u003C\u002Fp>\n\n\u003Cdiv align=\"center\">\n  \u003Cimg src=\"assets\u002Fmain_teaser.png\" width=\"760\" alt=\"ToolCUA overview\">\n\u003C\u002Fdiv>\n\n\u003Cdiv style=\"max-width:900px;margin:0 auto;\">\n\n## 📢 Updates\n- 2026-05-12: 🎉 **Thrilled to release ToolCUA** with the ToolCUA-8B model, evaluation code, and OSWorld-MCP benchmark results.\n\n## 📚 Table of Contents\n- [🌟 Introduction](#-introduction)\n    - [🔍 Path Selection Confusion Under Hybrid Actions](#-path-selection-confusion-under-hybrid-actions)\n    - [🧠 Method Overview](#-method-overview)\n    - [Installation \\& Download](#installation--download)\n    - [🚀 vLLM Serve](#-vllm-serve)\n    - [🖥️ Evaluation](#️-evaluation)\n- [📊 Performance](#-performance)\n- [Acknowledge](#acknowledge)\n- [Citation](#citation)\n\n## TODO\n- [x] **ToolCUA Model Released**\n- [ ] **Data Pipeline**: GUI-Tool interleaved trajectory scaling pipeline\n- [ ] **Training Infra**: Asynchronous training-rollout decoupled agentic RL in sandbox\n\n\n\u003Ca id=\"introduction\">\u003C\u002Fa>\n# 🌟 Introduction\n\u003Cdiv style=\"\n  max-width: 880px;              \u002F* adjust the overall width as needed *\u002F\n  margin: 0 auto;               \u002F* center the container *\u002F\n  text-align: justify;          \u002F* key setting: justify the text *\u002F\n  text-justify: inter-word;     \u002F* improve justification for English text *\u002F\n  line-height: 1.6;\">\n  \n**ToolCUA** is an end-to-end Computer Use Agent (CUA) designed for **optimal GUI-Tool path orchestration**. Modern CUAs can act through both atomic GUI actions, such as clicking, typing, and scrolling, and high-level tool calls, such as API-based file or application operations. However, simply exposing a model to both action spaces does not make it a reliable desktop agent: the model must learn **when to continue with GUI actions, when to invoke tools, and when to switch back**.\n\nToolCUA addresses this challenge with a staged training pipeline. 
We first scale interleaved GUI-Tool trajectories from existing GUI-only data through trajectory-aware tool synthesis. Then, we use Tool-Bootstrapped GUI RFT to acquire tool-calling knowledge and calibrate critical switching decisions. Finally, we optimize the agent with Online Agentic RL in a GUI-Tool environment using a Tool-Efficient Path Reward, encouraging appropriate tool use and shorter execution paths.\n\n\u003Ca id=\"path-selection-confusion-under-hybrid-actions\">\u003C\u002Fa>\n### 🔍 Path Selection Confusion Under Hybrid Actions\n\nGiving agents both GUI actions and tool calls does not automatically make them better. In our diagnostic study, hybrid actions introduce a clear **path selection confusion** problem: some models stay GUI-centric and almost never invoke tools, while stronger models may overuse tools, shorten trajectories, and still lose task success. The bottleneck is therefore not tool availability itself, but whether the agent can choose the right GUI-Tool execution path at each state.\n\n\u003Cdiv align=\"center\">\n  \u003Cimg src=\"assets\u002Fgui_tool_confusion.png\" width=\"900\" alt=\"GUI-Tool path confusion\">\n\u003C\u002Fdiv>\n\n\u003Ca id=\"method-overview\">\u003C\u002Fa>\n### 🧠 Method Overview\n\nToolCUA learns GUI-Tool orchestration through three tightly connected stages: (1) scalable interleaved GUI-Tool trajectory construction from existing GUI corpora, (2) Tool-Bootstrapped GUI RFT for tool knowledge and local switching calibration, and (3) Online Agentic RL with Tool-Efficient Path Reward for trajectory-level optimization.\n\n\u003Cdiv align=\"center\">\n  \u003Cimg src=\"assets\u002Fmethod_overview.png\" width=\"900\" alt=\"ToolCUA method overview\">\n\u003C\u002Fdiv>\n\n\u003C\u002Fdiv>\n\n\n\n\u003Ca id=\"installation-download\">\u003C\u002Fa>\n### Installation & Download\n\nFirst, install the required transformers dependencies:\n\n```bash\npip install -r requirement.txt\n```\n\nDownload the model weight from 
Hugging Face:\n```python\nfrom huggingface_hub import snapshot_download\nsnapshot_download(\n    repo_id=\"mPLUG\u002FToolCUA-8B\",\n    local_dir=\"ToolCUA-8B\",                \n    local_dir_use_symlinks=False  \n)\n```\n\n\u003Ca id=\"vllm-serve\">\u003C\u002Fa>\n### 🚀 vLLM Serve\n\nWe recommend using vLLM for production deployment. Requires **vllm>=0.12.0** with `--trust-remote-code`.\n\n```bash\n# 8B (single GPU)\n\nMAX_IMAGE=${MAX_IMAGE:-5}\nIMAGE_LIMIT_ARGS='{\"image\": '\"$MAX_IMAGE\"'}'\n\nPIXEL_ARGS='{\"size\": {\"longest_edge\": 3072000, \"shortest_edge\": 65536}}' # 3000*32*32\n\nvllm serve mPLUG\u002FToolCUA-8B \\\n    --max-model-len 32768 \\\n    --mm-processor-kwargs \"$PIXEL_ARGS\" \\\n    --limit-mm-per-prompt \"$IMAGE_LIMIT_ARGS\" \\\n    --tensor-parallel-size 1 \\\n    --allowed-local-media-path '\u002F' \\\n    --port 4243 \\\n    --gpu-memory-utilization 0.85 \\\n    --mm-processor-cache-gb 0 \\\n    --no-enable-prefix-caching \\\n    --enforce-eager \\\n    --max-logprobs 50\n```\nBecause ToolCUA-8B is based on Qwen3-VL-8B-Instruct, you can follow a similar implementation.\n\n\n\n\u003Ca id=\"evaluation\">\u003C\u002Fa>\n### 🖥️ Evaluation\n\nFirst, set up the complete evaluation environment from [OSWorld](https:\u002F\u002Fgithub.com\u002Fxlang-ai\u002FOSWorld) (for pure GUI settings) or [OSWorld-MCP](https:\u002F\u002Fgithub.com\u002FX-PLUG\u002FOSWorld-MCP) (for GUI-Tool settings).\n\nThen place the following key files in the corresponding locations of the OSWorld \u002F OSWorld-MCP evaluation directory:\n\n- eval_data with tool_beneficial label: `.\u002Feval\u002Fevaluaton_data`\n- desktop_env: `.\u002Feval\u002Fdesktop_env.py`\n- agents_implementation: `.\u002Feval\u002Fqwen3vl_toolcua_aget_mcp.py`\n- eval_main: `.\u002Feval\u002Frun_multienv_qwen3vl_toolcua_mcp_eval.py`\n\nRun the evaluation in `average@k` mode and compute the results:\n```\nbash .\u002Feval\u002Fpassk_run_new.sh\n\npython .\u002Feval\u002Fpass_k_results.py --root_path 
${RESULT_DIR} --trials 0 1 2\n```\n\n---\n\n\u003Ca id=\"performance\">\u003C\u002Fa>\n## 📊 Performance\n\nResults are reported on the feasible tasks of **OSWorld-MCP**. We list the **Overall** metrics from the main paper table: Accuracy, Tool Invocation Rate (TIR), and Average Completion Steps (ACS).\n\n| Agent Model | Accuracy ↑ | TIR ↑ | ACS ↓ |\n|:--|--:|--:|--:|\n| Gemini-2.5-Pro | 20.22 | 17.22 | 29.97 |\n| OpenAI o3 | 20.62 | 18.22 | 31.87 |\n| Seed1.5-VL | 34.53 | 26.83 | 20.69 |\n| Claude-4-Sonnet | 43.54 | 35.74 | 19.76 |\n| Gemini-3.1-Pro | 41.14 | 34.23 | 25.40 |\n| Claude-4-5-Sonnet | 48.35 | 40.24 | 19.07 |\n| Qwen3-VL-235B-A22B | 38.14 | 28.63 | 17.95 |\n| Qwen3.5-397B-A17B | 40.84 | 11.71 | 21.86 |\n| UI-Tars-1.5-7B | 12.31 | 4.50 | 37.11 |\n| EvoCUA-8B | 35.74 | 13.81 | 26.77 |\n| EvoCUA-32B | 40.54 | 22.52 | 26.16 |\n| GUI-Owl-1.5-8B | 43.84 | 36.04 | 21.19 |\n| GUI-Owl-1.5-32B | 48.05 | 41.14 | 24.19 |\n| Qwen3-VL-8B-Instruct | 28.23 | 8.41 | 19.34 |\n| **ToolCUA-8B** | **46.85** | **24.32** | **14.93** |\n\nCompared with the Qwen3-VL-8B-Instruct baseline, **ToolCUA-8B** improves Accuracy by **+18.62** and TIR by **+15.91**, and reduces ACS by **4.41** steps (19.34 to 14.93).\n\n\u003Cdiv align=\"center\">\n  \u003Cimg src=\"assets\u002Fapp_results.png\" width=\"900\" alt=\"ToolCUA results across applications\">\n\u003C\u002Fdiv>\n\u003C\u002Fdiv>\n  \n\n\u003C!-- ## Star History\n\n[![Star History Chart](https:\u002F\u002Fapi.star-history.com\u002Fsvg?repos=xlang-ai\u002FOpenCUA&type=date&legend=top-left)](https:\u002F\u002Fwww.star-history.com\u002F#xlang-ai\u002FOpenCUA&type=date&legend=top-left) -->\n\n## Acknowledge\n\nWe thank Zhaoqing Zhu, Junyang Wang, Jitong Liao and Haowei Liu for their support with training infrastructure, sandbox construction and evaluation.\n\nOur work is motivated by [OpenCUA](https:\u002F\u002Fgithub.com\u002Fxlang-ai\u002FOpenCUA), [ScaleCUA](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FScaleCUA), 
[AutoGLM](https:\u002F\u002Fgithub.com\u002Fzai-org\u002FOpen-AutoGLM), [CUA-skill](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Fcua_skill), [EvoCUA](https:\u002F\u002Fgithub.com\u002Fmeituan\u002FEvoCUA), and [Mobile-Agent Series](https:\u002F\u002Fgithub.com\u002FX-PLUG\u002FMobileAgent). Thanks for their wonderful work.\n\n\n## Citation\n\nIf you use ToolCUA in your research or project, please cite our work:\n\n```bibtex\n@article{hu2026toolcua,\n  title={ToolCUA: Towards Optimal GUI-Tool Path Orchestration for Computer Use Agents},\n  author={Hu, Xuhao and Zhang, Xi and Xu, Haiyang and Qiao, Kyle and Yang, Jingyi and Huang, Xuanjing and Shao, Jing and Yan, Ming and Ye, Jieping},\n  journal={arXiv preprint arXiv:2605.12481},\n  year={2026}\n}\n```\n\n\n\u003C\u002Fdiv>\n","ToolCUA is an end-to-end system for Computer Use Agents (CUAs), designed to achieve optimal orchestration of graphical user interface (GUI) and tool paths. The project uses a staged training pipeline to address the problem of choosing the right action path in a hybrid action space: it generates interleaved GUI-Tool trajectories from existing GUI data, acquires tool-calling knowledge through Tool-Bootstrapped GUI RFT, and applies Online Agentic RL to encourage appropriate tool use and shorter execution paths. It suits scenarios that automate desktop tasks and must decide when to use GUI actions and when to call high-level tools, such as file management and application operations. The project is written in Python and released as open source under the MIT License.","2026-06-11 04:02:35","CREATED_QUERY"]