[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-71093":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":23,"hasPages":23,"topics":25,"createdAt":10,"pushedAt":10,"updatedAt":32,"readmeContent":33,"aiSummary":34,"trendingCount":16,"starSnapshotCount":16,"syncStatus":35,"lastSyncTime":36,"discoverSource":37},71093,"AppAgent","TencentQQGYLab\u002FAppAgent","TencentQQGYLab","AppAgent: Multimodal Agents as Smartphone Users, an LLM-based multimodal agent framework designed to operate smartphone apps.","https:\u002F\u002Fappagent-official.github.io\u002F",null,"Python",6770,752,76,86,0,4,10,37,12,39.63,"MIT License",false,"main",[26,27,28,29,30,31],"agent","chatgpt","generative-ai","gpt4","gpt4v","llm","2026-06-12 02:02:47","# [CHI 2025] AppAgent \n\n\u003Cdiv align=\"center\">\n\n\u003Ca href='https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.13771'>\u003Cimg src='https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2312.13771-b31b1b.svg'>\u003C\u002Fa> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;\n \u003Ca href='https:\u002F\u002Fappagent-official.github.io'>\u003Cimg src='https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject-Page-Green'>\u003C\u002Fa> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;\n \u003Ca href='https:\u002F\u002Fgithub.com\u002Fbuaacyw\u002FGaussianEditor\u002Fblob\u002Fmaster\u002FLICENSE.txt'>\u003Cimg src='https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-MIT-blue'>\u003C\u002Fa> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;\n \u003Ca href=\"https:\u002F\u002Ftwitter.com\u002Fdr_chizhang\">\u003Cimg 
src=\"https:\u002F\u002Fimg.shields.io\u002Ftwitter\u002Ffollow\u002Fdr_chizhang?style=social\" alt=\"Twitter Follow\">\u003C\u002Fa> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;\n \u003Cbr>\u003Cbr>\n \u003C!-- [![Model](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%A4%97%20Hugging%20Face-Model-blue)](https:\u002F\u002Fhuggingface.co\u002Flisten2you002\u002FChartLlama-13b) &nbsp;&nbsp;&nbsp;&nbsp;&nbsp; \n[![Dataset](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%A4%97%20Hugging%20Face-Dataset-blue)](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Flisten2you002\u002FChartLlama-Dataset) -->\n\n[**Chi Zhang***†](https:\u002F\u002Ficoz69.github.io\u002F), [**Zhao Yang***](https:\u002F\u002Fgithub.com\u002Fyz93), [**Jiaxuan Liu***](https:\u002F\u002Fwww.linkedin.com\u002Fin\u002Fjiaxuan-liu-9051b7105\u002F), [Yucheng Han](http:\u002F\u002Ftingxueronghua.github.io), [Xin Chen](https:\u002F\u002Fchenxin.tech\u002F), [Zebiao Huang](),\n\u003Cbr>\n[Bin Fu](https:\u002F\u002Fopenreview.net\u002Fprofile?id=~BIN_FU2), [Gang Yu✦](https:\u002F\u002Fwww.skicyyu.org\u002F)\n\u003Cbr>\n(* equal contribution, † Project Leader, ✦ Corresponding Author )\n\u003C\u002Fdiv>\n\n\n![](.\u002Fassets\u002Fteaser.png)\n\nℹ️ 🔥🔥🔥 [AppAgentX](https:\u002F\u002Fappagentx.github.io\u002F) is released, the next-generation GUI agent with an evolving mechanism.\n\nℹ️Should you encounter any issues⚠️ while using our project, please feel free to report them on [GitHub Issues](https:\u002F\u002Fgithub.com\u002Fmnotgod96\u002FAppAgent\u002Fissues) or reach out to [Dr. 
Chi Zhang](https:\u002F\u002Ficoz69.github.io\u002F) via email at dr.zhang.chi@outlook.com.\n\nℹ️This project will be synchronously updated on the official [TencentQQGYLab](https:\u002F\u002Fgithub.com\u002FTencentQQGYLab\u002FAppAgent) GitHub page.\n\n## 📝 Changelog\n\n- __[2025.3.5]__: 🔥🔥🔥[AppAgentX](https:\u002F\u002Fappagentx.github.io\u002F) is released, the next-generation GUI agent with an evolving mechanism.\n- __[2024.2.8]__: Added `qwen-vl-max` (通义千问-VL) as an alternative multi-modal model. The model is currently free to use but performs worse than GPT-4V.\n- __[2024.1.31]__: Released the [evaluation benchmark](https:\u002F\u002Fgithub.com\u002Fmnotgod96\u002FAppAgent\u002Fblob\u002Fmain\u002Fassets\u002Ftestset.md) used during our testing of AppAgent.\n- __[2024.1.2]__: Added an optional method for the agent to bring up a grid overlay on the screen to **tap\u002Fswipe anywhere** on the screen.\n- __[2023.12.26]__: Added a [Tips](#tips) section for a better user experience; added instructions for using the **Android Studio emulator** for\n  users who do not have Android devices.\n- __[2023.12.21]__:  Open-sourced the git repository, including the detailed configuration steps to implement our AppAgent!\n\n\n## 🔆 Introduction\n\nWe introduce a novel LLM-based multimodal agent framework designed to operate smartphone applications. \n\nOur framework enables the agent to operate smartphone applications through a simplified action space, mimicking human-like interactions such as tapping and swiping. This novel approach bypasses the need for system back-end access, thereby broadening its applicability across diverse apps.\n\nCentral to our agent's functionality is its innovative learning method. The agent learns to navigate and use new apps either through autonomous exploration or by observing human demonstrations. 
This process generates a knowledge base that the agent refers to for executing complex tasks across different applications.\n\n\n## ✨ Demo\n\nThe demo video shows the process of using AppAgent to follow a user on X (Twitter) in the deployment phase.\n\nhttps:\u002F\u002Fgithub.com\u002Fmnotgod96\u002FAppAgent\u002Fassets\u002F40715314\u002Fdb99d650-dec1-4531-b4b2-e085bfcadfb7\n\nAn interesting experiment showing AppAgent's ability to pass CAPTCHA.\n\nhttps:\u002F\u002Fgithub.com\u002Fmnotgod96\u002FAppAgent\u002Fassets\u002F27103154\u002F5cc7ba50-dbab-42a0-a411-a9a862482548\n\nAn example of using the grid overlay to locate a UI element that is not labeled with a numeric tag.\n\nhttps:\u002F\u002Fgithub.com\u002Fmnotgod96\u002FAppAgent\u002Fassets\u002F27103154\u002F71603333-274c-46ed-8381-2f9a34cdfc53\n\n## 🚀 Quick Start\n\nThis section shows how to quickly use `gpt-4-vision-preview` (or `qwen-vl-max`) as an agent to complete specific tasks for you in\nyour Android apps.\n\n### ⚙️ Step 1. Prerequisites\n\n1. On your PC, download and install [Android Debug Bridge](https:\u002F\u002Fdeveloper.android.com\u002Ftools\u002Fadb) (adb), a\n   command-line tool that lets you communicate with your Android device from the PC.\n\n2. Get an Android device and enable USB debugging, which can be found under Developer Options in Settings.\n\n3. Connect your device to your PC using a USB cable.\n\n4. (Optional) If you do not have an Android device but still want to try AppAgent, we recommend you download\n   [Android Studio](https:\u002F\u002Fdeveloper.android.com\u002Fstudio\u002Frun\u002Femulator) and use the emulator that comes with it.\n   The emulator can be found in the device manager of Android Studio. 
You can install apps on an emulator by\n   downloading APK files from the internet and dragging them to the emulator.\n   AppAgent can detect the emulated device and operate apps on it just as it would on a real device.\n\n   \u003Cimg width=\"570\" alt=\"Screenshot 2023-12-26 at 22 25 42\" src=\"https:\u002F\u002Fgithub.com\u002Fmnotgod96\u002FAppAgent\u002Fassets\u002F27103154\u002F5d76b810-1f42-44c8-b024-d63ec7776789\">\n\n5. Clone this repo and install the dependencies. All scripts in this project are written in Python 3, so make sure you\n   have installed it.\n\n```bash\ncd AppAgent\npip install -r requirements.txt\n```\n\n### 🤖 Step 2. Configure the Agent\n\nAppAgent needs to be powered by a multi-modal model that can receive both text and visual inputs. During our experiments,\nwe used `gpt-4-vision-preview` as the model to decide which actions to take to complete a task on the smartphone.\n\nTo configure your requests to GPT-4V, you should modify `config.yaml` in the root directory.\nThere are two key parameters that must be configured to try AppAgent:\n1. OpenAI API key: you must purchase an eligible API key from OpenAI so that you can have access to GPT-4V.\n2. Request interval: this is the time interval in seconds between consecutive GPT-4V requests, which controls the frequency \nof your requests to GPT-4V. Adjust this value according to the status of your account.\n\nOther parameters in `config.yaml` are well commented. Modify them as you need.\n\n> Be aware that GPT-4V is not free. Each request\u002Fresponse pair involved in this project costs around $0.03. Use it wisely.\n\nYou can also try `qwen-vl-max` (通义千问-VL) as an alternative multi-modal model to power AppAgent. 
The model is currently \nfree to use, but its performance in the context of AppAgent is poorer than that of GPT-4V.\n\nTo use it, you should create an Alibaba Cloud account and [create a Dashscope API key](https:\u002F\u002Fhelp.aliyun.com\u002Fzh\u002Fdashscope\u002Fdeveloper-reference\u002Factivate-dashscope-and-create-an-api-key?spm=a2c4g.11186623.0.i1) to fill in the `DASHSCOPE_API_KEY` field \nin the `config.yaml` file. Change the `MODEL` field from `OpenAI` to `Qwen` as well.\n\nIf you want to test AppAgent using your own models, you should write a new model class in `scripts\u002Fmodel.py` accordingly.\n\n### 🔍 Step 3. Exploration Phase\n\nOur paper proposed a novel solution that involves two phases, exploration and deployment, to turn GPT-4V into a capable \nagent that can help users operate their Android phones when a task is given. The exploration phase starts with a task \ngiven by you, and you can choose to let the agent either explore the app on its own or learn from your demonstration. \nIn both cases, the agent generates documentation for the elements it interacted with during the exploration\u002Fdemonstration and \nsaves it for use in the deployment phase.\n\n#### Option 1: Autonomous Exploration\n\nThis solution features fully autonomous exploration, which allows the agent to explore the use of the app by attempting\nthe given task without any intervention from humans.\n\nTo start, run `learn.py` in the root directory. Follow the prompted instructions to select `autonomous exploration` \nas the operating mode and provide the app name and task description. Then, your agent will do the job for you. In \nthis mode, AppAgent will reflect on its previous action to make sure it adheres to the given task, and will generate \ndocumentation for the elements explored.\n\n```bash\npython learn.py\n```\n\n#### Option 2: Learning from Human Demonstrations\n\nThis solution requires users to demonstrate a similar task first. 
AppAgent will learn from the demo and generate \ndocumentation for UI elements seen during the demo.\n\nTo start a human demonstration, you should run `learn.py` in the root directory. Follow the prompted instructions to select \n`human demonstration` as the operating mode and provide the app name and task description. A screenshot of your phone \nwill be captured and all interactive elements shown on the screen will be labeled with numeric tags. You need to follow \nthe prompts to determine your next action and the target of the action. When you believe the demonstration is finished, \ntype `stop` to end the demo.\n\n```bash\npython learn.py\n```\n\n![](.\u002Fassets\u002Fdemo.png)\n\n### 📱 Step 4. Deployment Phase\n\nAfter the exploration phase finishes, you can run `run.py` in the root directory. Follow the prompted instructions to enter \nthe name of the app, select the appropriate documentation base you want the agent to use, and provide the task \ndescription. Then, your agent will do the job for you. The agent will automatically detect whether a documentation \nbase has been generated before for the app; if no documentation is found, you can also choose to run the agent without any \ndocumentation (success rate not guaranteed).\n\n```bash\npython run.py\n```\n\n## 💡 Tips\u003Ca name=\"tips\">\u003C\u002Fa>\n- For an improved experience, you might permit AppAgent to undertake a broader range of tasks through autonomous exploration, or you can directly demonstrate more app functions to enhance the app documentation. Generally, the more extensive the documentation provided to the agent, the higher the likelihood of successful task completion.\n- It is always a good practice to inspect the documentation generated by the agent. 
If some documentation does not accurately\n  describe the function of an element, you can revise it manually.\n\n\n## 📊 Evaluation\nPlease refer to the [evaluation benchmark](https:\u002F\u002Fgithub.com\u002Fmnotgod96\u002FAppAgent\u002Fblob\u002Fmain\u002Fassets\u002Ftestset.md).\n\n\n## 📖 To-Do List\n- [ ] Incorporate more LLM APIs into the project.\n- [x] Open source the Benchmark.\n- [x] Open source the configuration.\n\n## 😉 Citation\n```bib\n@misc{yang2023appagent,\n      title={AppAgent: Multimodal Agents as Smartphone Users}, \n      author={Chi Zhang and Zhao Yang and Jiaxuan Liu and Yucheng Han and Xin Chen and Zebiao Huang and Bin Fu and Gang Yu},\n      year={2023},\n      eprint={2312.13771},\n      archivePrefix={arXiv},\n      primaryClass={cs.CV}\n}\n```\n\n## Star History\n\n[![Star History Chart](https:\u002F\u002Fapi.star-history.com\u002Fsvg?repos=mnotgod96\u002FAppAgent&type=Date)](https:\u002F\u002Fstar-history.com\u002F#mnotgod96\u002FAppAgent&Date)\n\n\n## License\nThe [MIT license](.\u002Fassets\u002Flicense.txt).\n","AppAgent is a multimodal agent framework based on large language models, designed to operate smartphone applications. By combining text and image processing capabilities, it enables the agent to understand and execute complex user instructions, automating the operation of mobile apps. The project uses generative AI models such as GPT-4V, supports a choice of multimodal models, and provides an extensible architecture that can adapt to different application scenarios. In addition, AppAgent provides developers with detailed documentation and example code for getting started quickly. The tool is well suited to scenarios such as mobile app automation testing, user experience research, and accessibility feature development.",2,"2026-06-11 03:35:51","high_star"]