[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-73964":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":25,"hasPages":25,"topics":26,"createdAt":10,"pushedAt":10,"updatedAt":27,"readmeContent":28,"aiSummary":29,"trendingCount":16,"starSnapshotCount":16,"syncStatus":30,"lastSyncTime":31,"discoverSource":32},73964,"OmniParser","microsoft\u002FOmniParser","microsoft","A simple screen parsing tool towards pure vision based GUI agent","",null,"Jupyter Notebook",24884,2182,182,173,0,23,52,139,69,120,"Creative Commons Attribution 4.0 International",false,"master",true,[],"2026-06-12 04:01:12","# OmniParser: Screen Parsing tool for Pure Vision Based GUI Agent\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"imgs\u002Flogo.png\" alt=\"Logo\">\n\u003C\u002Fp>\n\u003C!-- \u003Ca href=\"https:\u002F\u002Ftrendshift.io\u002Frepositories\u002F12975\" target=\"_blank\">\u003Cimg src=\"https:\u002F\u002Ftrendshift.io\u002Fapi\u002Fbadge\u002Frepositories\u002F12975\" alt=\"microsoft%2FOmniParser | Trendshift\" style=\"width: 250px; height: 55px;\" width=\"250\" height=\"55\"\u002F>\u003C\u002Fa> -->\n\n[![arXiv](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPaper-green)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.00203)\n[![License](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-MIT-yellow.svg)](https:\u002F\u002Fopensource.org\u002Flicenses\u002FMIT)\n\n📢 [[Project Page](https:\u002F\u002Fmicrosoft.github.io\u002FOmniParser\u002F)] [[V2 Blog 
Post](https:\u002F\u002Fwww.microsoft.com\u002Fen-us\u002Fresearch\u002Farticles\u002Fomniparser-v2-turning-any-llm-into-a-computer-use-agent\u002F)] [[Models V2](https:\u002F\u002Fhuggingface.co\u002Fmicrosoft\u002FOmniParser-v2.0)] [[Models V1.5](https:\u002F\u002Fhuggingface.co\u002Fmicrosoft\u002FOmniParser)] [[HuggingFace Space Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fmicrosoft\u002FOmniParser-v2)]\n\n**OmniParser** is a comprehensive method for parsing user interface screenshots into structured and easy-to-understand elements, which significantly enhances the ability of GPT-4V to generate actions that can be accurately grounded in the corresponding regions of the interface. \n\n## News\n- [2025\u002F3] We support local logging of trajectories so that you can use OmniParser+OmniTool to build a training data pipeline for your favorite agent in your domain. [Documentation WIP]\n- [2025\u002F3] We are gradually adding multi-agent orchestration and improving the user interface in OmniTool for a better experience.\n- [2025\u002F2] We release OmniParser V2 [checkpoints](https:\u002F\u002Fhuggingface.co\u002Fmicrosoft\u002FOmniParser-v2.0). [Watch Video](https:\u002F\u002F1drv.ms\u002Fv\u002Fc\u002F650b027c18d5a573\u002FEWXbVESKWo9Buu6OYCwg06wBeoM97C6EOTG6RjvWLEN1Qg?e=alnHGC)\n- [2025\u002F2] We introduce OmniTool: Control a Windows 11 VM with OmniParser + your vision model of choice. OmniTool supports the following large language models out of the box: OpenAI (4o\u002Fo1\u002Fo3-mini), DeepSeek (R1), Qwen (2.5VL) or Anthropic Computer Use. [Watch Video](https:\u002F\u002F1drv.ms\u002Fv\u002Fc\u002F650b027c18d5a573\u002FEehZ7RzY69ZHn-MeQHrnnR4BCj3by-cLLpUVlxMjF4O65Q?e=8LxMgX)\n- [2025\u002F1] V2 is coming. We achieve a new state-of-the-art result of 39.5% on the new grounding benchmark [Screen Spot Pro](https:\u002F\u002Fgithub.com\u002Flikaixin2000\u002FScreenSpot-Pro-GUI-Grounding\u002Ftree\u002Fmain) with OmniParser v2 (will be released soon)! 
Read more details [here](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FOmniParser\u002Ftree\u002Fmaster\u002Fdocs\u002FEvaluation.md).\n- [2024\u002F11] We release an updated version, OmniParser V1.5, which features 1) more fine-grained\u002Fsmall icon detection, 2) prediction of whether each screen element is interactable or not. Examples are in demo.ipynb. \n- [2024\u002F10] OmniParser was the #1 trending model on the Hugging Face model hub (starting 10\u002F29\u002F2024). \n- [2024\u002F10] Feel free to check out our demo on [huggingface space](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fmicrosoft\u002FOmniParser)! (stay tuned for OmniParser + Claude Computer Use)\n- [2024\u002F10] Both the Interactive Region Detection model and the Icon functional description model are released! [Hugging Face models](https:\u002F\u002Fhuggingface.co\u002Fmicrosoft\u002FOmniParser)\n- [2024\u002F09] OmniParser achieves the best performance on [Windows Agent Arena](https:\u002F\u002Fmicrosoft.github.io\u002FWindowsAgentArena\u002F)! \n\n## Install \nFirst clone the repo, and then install the environment:\n```python\ncd OmniParser\nconda create -n \"omni\" python==3.12\nconda activate omni\npip install -r requirements.txt\n```\n\nEnsure you have the V2 weights downloaded in the weights folder (ensure the caption weights folder is called icon_caption_florence). 
If not, download them with:\n```\n   # download the model checkpoints to local directory OmniParser\u002Fweights\u002F\n   for f in icon_detect\u002F{train_args.yaml,model.pt,model.yaml} icon_caption\u002F{config.json,generation_config.json,model.safetensors}; do huggingface-cli download microsoft\u002FOmniParser-v2.0 \"$f\" --local-dir weights; done\n   mv weights\u002Ficon_caption weights\u002Ficon_caption_florence\n```\n\n\u003C!-- ## [deprecated]\nThen download the model ckpts files in: https:\u002F\u002Fhuggingface.co\u002Fmicrosoft\u002FOmniParser, and put them under weights\u002F, default folder structure is: weights\u002Ficon_detect, weights\u002Ficon_caption_florence, weights\u002Ficon_caption_blip2. \n\nFor v1: \nconvert the safetensor to .pt file. \n```python\npython weights\u002Fconvert_safetensor_to_pt.py\n\nFor v1.5: \ndownload 'model_v1_5.pt' from https:\u002F\u002Fhuggingface.co\u002Fmicrosoft\u002FOmniParser\u002Ftree\u002Fmain\u002Ficon_detect_v1_5, make a new dir: weights\u002Ficon_detect_v1_5, and put it inside the folder. No weight conversion is needed. \n``` -->\n\n## Examples:\nWe put together a few simple examples in demo.ipynb. \n\n## Gradio Demo\nTo run the Gradio demo, simply run:\n```python\npython gradio_demo.py\n```\n\n## Model Weights License\nFor the model checkpoints on the Hugging Face model hub, please note that the icon_detect model is under the AGPL license, which it inherits from the original YOLO model. The icon_caption_blip2 and icon_caption_florence models are under the MIT license. 
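Please refer to the LICENSE file in the folder of each model: https:\u002F\u002Fhuggingface.co\u002Fmicrosoft\u002FOmniParser.\n\n## 📚 Citation\nOur technical report can be found [here](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.00203).\nIf you find our work useful, please consider citing our work:\n```\n@misc{lu2024omniparserpurevisionbased,\n      title={OmniParser for Pure Vision Based GUI Agent}, \n      author={Yadong Lu and Jianwei Yang and Yelong Shen and Ahmed Awadallah},\n      year={2024},\n      eprint={2408.00203},\n      archivePrefix={arXiv},\n      primaryClass={cs.CV},\n      url={https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.00203}, \n}\n```\n","OmniParser is a tool for parsing user interface screenshots that converts screen content into structured, easy-to-understand elements. Its core features include fine-grained small-icon detection, interactability prediction, and integration with a range of large language models (such as GPT-4V), which significantly improves the ability of vision-based GUI agents to generate accurate actions. The project is written in Jupyter Notebook and released under the Creative Commons Attribution 4.0 International license. OmniParser is particularly well suited to scenarios that require controlling graphical user interfaces through visual recognition, such as automated testing, assistive technology development, or any task that involves analyzing and operating on a UI.",2,"2026-06-11 03:48:08","high_star"]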