[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-1743":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":8,"language":10,"languages":8,"totalLinesOfCode":8,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":15,"stars7d":15,"stars30d":16,"stars90d":15,"forks30d":15,"starsTrendScore":15,"compositeScore":17,"rankGlobal":8,"rankLanguage":8,"license":18,"archived":19,"fork":19,"defaultBranch":20,"hasWiki":21,"hasPages":19,"topics":22,"createdAt":8,"pushedAt":8,"updatedAt":23,"readmeContent":24,"aiSummary":25,"trendingCount":15,"starSnapshotCount":15,"syncStatus":16,"lastSyncTime":26,"discoverSource":27},1743,"TaskMatrix","chenfei-wu\u002FTaskMatrix","chenfei-wu",null,"","Python",34085,3221,291,225,0,2,45,"Other",false,"main",true,[],"2026-06-12 02:00:32","# TaskMatrix\n\n**TaskMatrix** connects ChatGPT and a series of Visual Foundation Models to enable **sending** and **receiving** images during chatting.\n\nSee our paper: [\u003Cfont size=5>Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models\u003C\u002Ffont>](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.04671)\n\n\u003Ca src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%A4%97-Open%20in%20Spaces-blue\" href=\"https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fmicrosoft\u002Fvisual_chatgpt\">\n    \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%A4%97-Open%20in%20Spaces-blue\" alt=\"Open in Spaces\">\n\u003C\u002Fa>\n\n\u003Ca src=\"https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg\" href=\"https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1P3jJqKEWEaeNcZg8fODbbWeQ3gxOHk2-?usp=sharing\">\n    \u003Cimg src=\"https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg\" alt=\"Open in Colab\">\n\u003C\u002Fa>\n\n## Updates:\n- Now TaskMatrix supports 
[GroundingDINO](https:\u002F\u002Fgithub.com\u002FIDEA-Research\u002FGroundingDINO) and [segment-anything](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fsegment-anything)! Thanks to **@jordddan** for his efforts. For the image editing case, `GroundingDINO` is first used to locate bounding boxes guided by the given text, then `segment-anything` is used to generate the corresponding mask, and finally stable diffusion inpainting is used to edit the image based on the mask. \n    - Firstly, run `python visual_chatgpt.py --load \"Text2Box_cuda:0,Segmenting_cuda:0,Inpainting_cuda:0,ImageCaptioning_cuda:0\"`\n    - Then, say `find xxx in the image` or `segment xxx in the image`, where `xxx` is an object. TaskMatrix will return the detection or segmentation result!\n\n\n- Now TaskMatrix can support Chinese! Thanks to **@Wang-Xiaodong1899** for his efforts.\n- We propose the **template** idea in TaskMatrix!\n    - A template is a **pre-defined execution flow** that assists ChatGPT in assembling complex tasks involving multiple foundation models. \n    - A template contains the **experiential solution** to complex tasks as determined by humans. 
\n    - A template can **invoke multiple foundation models** or even **establish a new ChatGPT session**\n    - To define a **template**, simply add a class with the attribute `template_model = True`\n- Thanks to **@ShengmingYin** and **@thebestannie** for providing a template example in the `InfinityOutPainting` class (see the following gif)\n    - Firstly, run `python visual_chatgpt.py --load \"Inpainting_cuda:0,ImageCaptioning_cuda:0,VisualQuestionAnswering_cuda:0\"`\n    - Secondly, say `extend the image to 2048x1024` to TaskMatrix!\n    - By simply creating an `InfinityOutPainting` template, TaskMatrix can seamlessly extend images to any size through collaboration with existing `ImageCaptioning`, `Inpainting`, and `VisualQuestionAnswering` foundation models, **without the need for additional training**.\n- **TaskMatrix needs the effort of the community! We welcome your contributions of new and interesting features!**\n\u003Cimg src=\".\u002Fassets\u002Fdemo_inf.gif\" width=\"750\">\n\n\n## Insight & Goal:\nOn the one hand, **ChatGPT (or LLMs)** serves as a **general interface** that provides a broad and diverse understanding of a\nwide range of topics. 
On the other hand, **Foundation Models** serve as **domain experts** by providing deep knowledge in specific domains.\nBy leveraging **both general and deep knowledge**, we aim to build an AI that is capable of handling various tasks.\n\n\n## Demo \n\u003Cimg src=\".\u002Fassets\u002Fdemo_short.gif\" width=\"750\">\n\n##  System Architecture \n\n \n\u003Cp align=\"center\">\u003Cimg src=\".\u002Fassets\u002Ffigure.jpg\" alt=\"Logo\">\u003C\u002Fp>\n\n\n## Quick Start\n\n```\n# clone the repo\ngit clone https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FTaskMatrix.git\n\n# Go to directory\ncd TaskMatrix\n\n# create a new environment\nconda create -n visgpt python=3.8\n\n# activate the new environment\nconda activate visgpt\n\n#  prepare the basic environments\npip install -r requirements.txt\npip install  git+https:\u002F\u002Fgithub.com\u002FIDEA-Research\u002FGroundingDINO.git\npip install  git+https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fsegment-anything.git\n\n# prepare your private OpenAI key (for Linux)\nexport OPENAI_API_KEY={Your_Private_Openai_Key}\n\n# prepare your private OpenAI key (for Windows)\nset OPENAI_API_KEY={Your_Private_Openai_Key}\n\n# Start TaskMatrix !\n# You can specify the GPU\u002FCPU assignment with \"--load\"; the parameter indicates which \n# Visual Foundation Models to use and where each will be loaded\n# The model and device are separated by an underscore '_', and different models are separated by a comma ','\n# The available Visual Foundation Models can be found in the following table\n# For example, if you want to load ImageCaptioning to cpu and Text2Image to cuda:0\n# You can use: \"ImageCaptioning_cpu,Text2Image_cuda:0\"\n\n# Advice for CPU Users\npython visual_chatgpt.py --load ImageCaptioning_cpu,Text2Image_cpu\n\n# Advice for 1 Tesla T4 15GB  (Google Colab)                       \npython visual_chatgpt.py --load \"ImageCaptioning_cuda:0,Text2Image_cuda:0\"\n                                \n# Advice for 4 Tesla V100 
32GB                            \npython visual_chatgpt.py --load \"Text2Box_cuda:0,Segmenting_cuda:0,\n    Inpainting_cuda:0,ImageCaptioning_cuda:0,\n    Text2Image_cuda:1,Image2Canny_cpu,CannyText2Image_cuda:1,\n    Image2Depth_cpu,DepthText2Image_cuda:1,VisualQuestionAnswering_cuda:2,\n    InstructPix2Pix_cuda:2,Image2Scribble_cpu,ScribbleText2Image_cuda:2,\n    SegText2Image_cuda:2,Image2Pose_cpu,PoseText2Image_cuda:2,\n    Image2Hed_cpu,HedText2Image_cuda:3,Image2Normal_cpu,\n    NormalText2Image_cuda:3,Image2Line_cpu,LineText2Image_cuda:3\"\n\n```\n\n## GPU memory usage\nHere we list the GPU memory usage of each visual foundation model, you can specify which one you like:\n\n| Foundation Model        | GPU Memory (MB) |\n|------------------------|-----------------|\n| ImageEditing           | 3981            |\n| InstructPix2Pix        | 2827            |\n| Text2Image             | 3385            |\n| ImageCaptioning        | 1209            |\n| Image2Canny            | 0               |\n| CannyText2Image        | 3531            |\n| Image2Line             | 0               |\n| LineText2Image         | 3529            |\n| Image2Hed              | 0               |\n| HedText2Image          | 3529            |\n| Image2Scribble         | 0               |\n| ScribbleText2Image     | 3531            |\n| Image2Pose             | 0               |\n| PoseText2Image         | 3529            |\n| Image2Seg              | 919             |\n| SegText2Image          | 3529            |\n| Image2Depth            | 0               |\n| DepthText2Image        | 3531            |\n| Image2Normal           | 0               |\n| NormalText2Image       | 3529            |\n| VisualQuestionAnswering| 1495            |\n\n## Acknowledgement\nWe appreciate the open source of the following projects:\n\n[Hugging Face](https:\u002F\u002Fgithub.com\u002Fhuggingface) &#8194;\n[LangChain](https:\u002F\u002Fgithub.com\u002Fhwchase17\u002Flangchain) &#8194;\n[Stable 
Diffusion](https:\u002F\u002Fgithub.com\u002FCompVis\u002Fstable-diffusion) &#8194; \n[ControlNet](https:\u002F\u002Fgithub.com\u002Flllyasviel\u002FControlNet) &#8194; \n[InstructPix2Pix](https:\u002F\u002Fgithub.com\u002Ftimothybrooks\u002Finstruct-pix2pix) &#8194; \n[CLIPSeg](https:\u002F\u002Fgithub.com\u002Ftimojl\u002Fclipseg) &#8194;\n[BLIP](https:\u002F\u002Fgithub.com\u002Fsalesforce\u002FBLIP) &#8194;\n\n## Contact Information\nFor help or issues using TaskMatrix, please submit a GitHub issue.\n\nFor other communications, please contact Chenfei WU (chewu@microsoft.com) or Nan DUAN (nanduan@microsoft.com).\n\n## Trademark Notice\n\nThis project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow [Microsoft’s Trademark & Brand Guidelines](https:\u002F\u002Fwww.microsoft.com\u002Fen-us\u002Flegal\u002Fintellectualproperty\u002Ftrademarks). Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties’ policies.\n\n## Disclaimer\nThe recommended models in this Repo are just examples, used for scientific research exploring the concept of task automation and benchmarking with the paper published at [Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.04671). Users can replace the models in this Repo according to their research needs. When using the recommended models in this Repo, you need to comply with the licenses of these models respectively. Microsoft shall not be held liable for any infringement of third-party rights resulting from your usage of this repo. Users agree to defend, indemnify and hold Microsoft harmless from and against all damages, costs, and attorneys' fees in connection with any claims arising from this Repo. 
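If you believe that this Repo infringes on your rights, please notify the project owner by [email](mailto:chewu@microsoft.com).\n","TaskMatrix is a platform that connects ChatGPT with a series of visual foundation models, supporting sending and receiving images during chat. By integrating tools such as GroundingDINO and segment-anything, the project implements text-guided object localization, segmentation, and image editing. It also introduces the concept of templates, which let users define preset flows that combine multiple models to complete complex tasks, such as extending an image's size without additional training. TaskMatrix suits applications that combine natural language processing with computer vision, such as image recognition, image editing, and conversation-based creative design.","2026-06-11 02:45:48","top_all"]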