[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-72471":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":16,"stars7d":17,"stars30d":17,"stars90d":16,"forks30d":16,"starsTrendScore":16,"compositeScore":18,"rankGlobal":10,"rankLanguage":10,"license":19,"archived":20,"fork":20,"defaultBranch":21,"hasWiki":20,"hasPages":20,"topics":22,"createdAt":10,"pushedAt":10,"updatedAt":27,"readmeContent":28,"aiSummary":29,"trendingCount":16,"starSnapshotCount":16,"syncStatus":30,"lastSyncTime":31,"discoverSource":32},72471,"CogVLM2","zai-org\u002FCogVLM2","zai-org","GPT4V-level open-source multi-modal model based on Llama3-8B","",null,"Python",2438,163,28,58,0,1,59.24,"Apache License 2.0",false,"main",[23,24,25,26],"cogvlm","language-model","multi-modal","pretrained-models","2026-06-12 04:01:05","# CogVLM2 & CogVLM2-Video\n\n[中文版README](.\u002FREADME_zh.md)\n\n\u003Cdiv align=\"center\">\n\u003Cimg src=resources\u002Flogo.svg width=\"40%\"\u002F> \n\u003C\u002Fdiv>\n\n\u003Cp align=\"center\">\n    👋 Join our \u003Ca href=\"resources\u002FWECHAT.md\" target=\"_blank\">Wechat\u003C\u002Fa> · 💡Try CogVLM2 \u003Ca href=\"http:\u002F\u002Fcogvlm2-online.cogviewai.cn:7861\u002F\" target=\"_blank\">Online\u003C\u002Fa> 💡Try CogVLM2-Video \u003Ca href=\"http:\u002F\u002Fcogvlm2-online.cogviewai.cn:7868\u002F\" target=\"_blank\">Online\u003C\u002Fa>\n\u003C\u002Fp>\n\u003Cp align=\"center\">\n📍Experience the larger-scale CogVLM model on the \u003Ca href=\"https:\u002F\u002Fopen.bigmodel.cn\u002F?utm_campaign=open&_channel_track_key=OWTVNma9\">ZhipuAI Open Platform\u003C\u002Fa>.\n\u003C\u002Fp>\n\n## Recent updates\n- 🔥 **News**: ``2024\u002F8\u002F30``: The [CogVLM2 paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.16500) has been published on arXiv.\n- 🔥 
**News**: ``2024\u002F7\u002F12``: We have released the CogVLM2-Video [online web demo](http:\u002F\u002Fcogvlm2-online.cogviewai.cn:7868\u002F); you are welcome to try it.\n- 🔥 **News**: ``2024\u002F7\u002F8``: We released the video understanding version of the CogVLM2 model, the CogVLM2-Video model.\n  By extracting keyframes, it can interpret continuous images. The model supports videos of up to 1 minute. See more\n  in our [blog](https:\u002F\u002Fcogvlm2-video.github.io\u002F).\n- 🔥 **News**: ``2024\u002F6\u002F8``: We released the [CogVLM2 TGI Weight](https:\u002F\u002Fhuggingface.co\u002FTHUDM\u002Fcogvlm2-llama3-chat-19B-tgi),\n  a model that can be served with [TGI](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftext-generation-inference\u002Fen\u002Findex). See\n  the inference code [here](https:\u002F\u002Fgithub.com\u002Fleizhao1234\u002Fcogvlm2).\n- 🔥 **News**: ``2024\u002F6\u002F5``: We released [GLM-4V-9B](https:\u002F\u002Fhuggingface.co\u002FTHUDM\u002Fglm-4v-9b), which uses the same data and\n  training recipes as CogVLM2 but with GLM-9B as the language backbone. We removed visual experts to reduce the model\n  size to 13B. More details are in the [GLM-4 repo](https:\u002F\u002Fgithub.com\u002FTHUDM\u002FGLM-4\u002F).\n- 🔥 **News**: ``2024\u002F5\u002F24``: We have released\n  the [Int4 version model](https:\u002F\u002Fhuggingface.co\u002FTHUDM\u002Fcogvlm2-llama3-chat-19B-int4), which requires only 16GB of video\n  memory for inference. You can also run the on-the-fly int4 version by passing `--quant 4`.\n- 🔥 **News**: ``2024\u002F5\u002F20``: We released the next generation model CogVLM2, which is based on llama3-8b and is\n  on par with (or better than) GPT-4V in most cases! You are welcome to download it!\n\n## Model introduction\n\nWe are launching the new-generation **CogVLM2** series of models and open-sourcing two models based\non [Meta-Llama-3-8B-Instruct](https:\u002F\u002Fhuggingface.co\u002Fmeta-llama\u002FMeta-Llama-3-8B-Instruct). 
Compared with the previous\ngeneration of CogVLM open source models, the CogVLM2 series of open source models have the following improvements:\n\n1. Significant improvements in many benchmarks such as `TextVQA`, `DocVQA`.\n2. Support **8K** content length.\n3. Support image resolution up to **1344 * 1344**.\n4. Provide an open source model version that supports both **Chinese and English**.\n\nYou can see the details of the **CogVLM2** family of open source models in the table below:\n\n| Model Name       | cogvlm2-llama3-chat-19B                                                                                                                                                                                                                                  | cogvlm2-llama3-chinese-chat-19B                                                                                                                                                                                                                                          | cogvlm2-video-llama3-chat                                                                                                                                 | cogvlm2-video-llama3-base                                                                                                                                 |  
\n|------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------|\n| Base Model       | Meta-Llama-3-8B-Instruct                                                                                                                                                                                                                                 | Meta-Llama-3-8B-Instruct                                                                                                                                                                                                                                                 | Meta-Llama-3-8B-Instruct                                                                                                                                  | Meta-Llama-3-8B-Instruct                                                                                                                                  |\n| Language         | English                                                                                                                                                                                                                                                  | Chinese, English      
                                                                                                                                                                                                                                                   | English                                                                                                                                                   | English                                                                                                                                                   |\n| Task             | Image Understanding, Multi-turn Dialogue Model                                                                                                                                                                                                           | Image Understanding, Multi-turn Dialogue Model                                                                                                                                                                                                                           | Video Understanding, Single-turn Dialogue Model                                                                                                           | Video Understanding, Base Model, No Dialogue                                                                                                              |\n| Model Link       | [🤗 Huggingface](https:\u002F\u002Fhuggingface.co\u002FTHUDM\u002Fcogvlm2-llama3-chat-19B)  [🤖 ModelScope](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FZhipuAI\u002Fcogvlm2-llama3-chat-19B\u002F)  [💫 Wise Model](https:\u002F\u002Fwisemodel.cn\u002Fmodels\u002FZhipuAI\u002Fcogvlm2-llama3-chat-19B\u002F)                    | [🤗 Huggingface](https:\u002F\u002Fhuggingface.co\u002FTHUDM\u002Fcogvlm2-llama3-chinese-chat-19B) [🤖 ModelScope](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FZhipuAI\u002Fcogvlm2-llama3-chinese-chat-19B\u002F)  [💫 Wise 
Model](https:\u002F\u002Fwisemodel.cn\u002Fmodels\u002FZhipuAI\u002Fcogvlm2-llama3-chinese-chat-19B)              | [🤗 Huggingface](https:\u002F\u002Fhuggingface.co\u002FTHUDM\u002Fcogvlm2-video-llama3-chat)  [🤖 ModelScope](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FZhipuAI\u002Fcogvlm2-video-llama3-chat) | [🤗 Huggingface](https:\u002F\u002Fhuggingface.co\u002FTHUDM\u002Fcogvlm2-video-llama3-base)  [🤖 ModelScope](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FZhipuAI\u002Fcogvlm2-video-llama3-base) |\n| Experience Link  | [📙 Official Page](http:\u002F\u002F36.103.203.44:7861\u002F)                                                                                                                                                                                                           | [📙 Official Page](http:\u002F\u002F36.103.203.44:7861\u002F) [🤖 ModelScope](https:\u002F\u002Fmodelscope.cn\u002Fstudios\u002FZhipuAI\u002FCogvlm2-llama3-chinese-chat-Demo\u002Fsummary)                                                                                                                           | [📙 Official Page](http:\u002F\u002F36.103.203.44:7868\u002F)   [🤖 ModelScope](https:\u002F\u002Fmodelscope.cn\u002Fstudios\u002FZhipuAI\u002FCogvlm2-Video-Llama3-Chat-Demo)                    | \u002F                                                                                                                                                         |\n| Int4 Model       | [🤗 Huggingface](https:\u002F\u002Fhuggingface.co\u002FTHUDM\u002Fcogvlm2-llama3-chat-19B-int4)  [🤖 ModelScope](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FZhipuAI\u002Fcogvlm2-llama3-chat-19B-int4)       [💫 Wise Model](https:\u002F\u002Fwisemodel.cn\u002Fmodels\u002FZhipuAI\u002Fcogvlm2-llama3-chat-19B-int4\u002F) | [🤗 Huggingface](https:\u002F\u002Fhuggingface.co\u002FTHUDM\u002Fcogvlm2-llama3-chinese-chat-19B-int4) [🤖 
ModelScope](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FZhipuAI\u002Fcogvlm2-llama3-chinese-chat-19B-int4) [💫 Wise Model](https:\u002F\u002Fwisemodel.cn\u002Fmodels\u002FZhipuAI\u002Fcogvlm2-llama3-chinse-chat-19B-int4\u002F) | \u002F                                                                                                                                                         | \u002F                                                                                                                                                         |\n| Text Length      | 8K                                                                                                                                                                                                                                                       | 8K                                                                                                                                                                                                                                                                       | 2K                                                                                                                                                        | 2K                                                                                                                                                        |\n| Image Resolution | 1344 * 1344                                                                                                                                                                                                                                              | 1344 * 1344                                                                                                                                                                                                                                                              | 224 * 224 (Video, take the first 24 frames)          
                                                                                                     | 224 * 224 (Video, take the average 24 frames)                                                                                                             |\n\n## Benchmark\n\n### Image Understanding\n\nOur open source models achieve strong results on many benchmarks compared with the previous generation of CogVLM open\nsource models, and are competitive with some closed-source models, as shown in the table below:\n\n\n| Model                      | Open Source | LLM Size | TextVQA  | DocVQA   | ChartQA  | OCRbench | VCR_EASY | VCR_HARD | MMMU     | MMVet    | MMBench  |\n|----------------------------|-------------|----------|----------|----------|----------|----------|-------------|-------------|----------|----------|----------|\n| CogVLM1.1                  | ✅           | 7B       | 69.7     | -        | 68.3     | 590      | 73.9        | 34.6        | 37.3     | 52.0     | 65.8     |\n| LLaVA-1.5                  | ✅           | 13B      | 61.3     | -        | -        | 337      | -           | -           | 37.0     | 35.4     | 67.7     |\n| Mini-Gemini                | ✅           | 34B      | 74.1     | -        | -        | -        | -           | -           | 48.0     | 59.3     | 80.6     |\n| LLaVA-NeXT-LLaMA3          | ✅           | 8B       | -        | 78.2     | 69.5     | -        | -           | -           | 41.7     | -        | 72.1     |\n| LLaVA-NeXT-110B            | ✅           | 110B     | -        | 85.7     | 79.7     | -        | -           | -           | 49.1     | -        | 80.5     |\n| InternVL-1.5               | ✅           | 20B      | 80.6     | 90.9     | **83.8** | 720      | 14.7        | 2.0         | 46.8     | 55.4     | **82.3** |\n| QwenVL-Plus                | ❌           | -        | 78.9     | 91.4     | 78.1     | 726      | -           | -           | 51.4     | 55.7     | 67.0     |\n| 
Claude3-Opus               | ❌           | -        | -        | 89.3     | 80.8     | 694      | 63.85       | 37.8        | **59.4** | 51.7     | 63.3     |\n| Gemini Pro 1.5             | ❌           | -        | 73.5     | 86.5     | 81.3     | -        | 62.73       | 28.1        | 58.5     | -        | -        |\n| GPT-4V                     | ❌           | -        | 78.0     | 88.4     | 78.5     | 656      | 52.04       | 25.8        | 56.8     | **67.7** | 75.0     |\n| **CogVLM2-LLaMA3**         | ✅           | 8B       | 84.2     | **92.3** | 81.0     | 756      | **83.3**    | **38.0**        | 44.3     | 60.4     | 80.5     |\n| **CogVLM2-LLaMA3-Chinese** | ✅           | 8B       | **85.0** | 88.4     | 74.7     | **780**  | 79.9        | 25.1        | 42.8     | 60.5     | 78.9     |\n\nAll results were obtained without using any external OCR tools (\"pixel only\").\n\n### Video Understanding\n\nCogVLM2-Video achieves state-of-the-art performance on multiple video question answering tasks. The following figure\nshows the performance of CogVLM2-Video on\nthe [MVBench](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FAsk-Anything), [VideoChatGPT-Bench](https:\u002F\u002Fgithub.com\u002Fmbzuai-oryx\u002FVideo-ChatGPT)\nand Zero-shot VideoQA datasets (MSVD-QA, MSRVTT-QA, ActivityNet-QA). 
Where VCG-* refers to the VideoChatGPTBench, ZS-*\nrefers to Zero-Shot VideoQA datasets and MV-* refers to main categories in the MVBench.\n\n![Quantitative Evaluation](resources\u002Fcogvlm2_video_bench.jpeg)\n\n### Detailed performance\n\nPerformance on VideoChatGPT-Bench and Zero-shot VideoQA dataset:\n\n| Models                | VCG-AVG  | VCG-CI   | VCG-DO   | VCG-CU   | VCG-TU   | VCG-CO   | ZS-AVG    |\n|-----------------------|----------|----------|----------|----------|----------|----------|-----------|\n| IG-VLM GPT4V          | 3.17     | 3.40     | 2.80     | 3.61     | 2.89     | 3.13     | 65.70     |\n| ST-LLM                | 3.15     | 3.23     | 3.05     | 3.74     | 2.93     | 2.81     | 62.90     |\n| ShareGPT4Video        | N\u002FA      | N\u002FA      | N\u002FA      | N\u002FA      | N\u002FA      | N\u002FA      | 46.50     |\n| VideoGPT+             | 3.28     | 3.27     | 3.18     | 3.74     | 2.83     | **3.39** | 61.20     |\n| VideoChat2_HD_mistral | 3.10     | 3.40     | 2.91     | 3.72     | 2.65     | 2.84     | 57.70     |\n| PLLaVA-34B            | 3.32     | **3.60** | 3.20     | **3.90** | 2.67     | 3.25     | **68.10** | \n| CogVLM2-Video         | **3.41** | 3.49     | **3.46** | 3.87     | **2.98** | 3.23     | 66.60     |\n\nPerformance on MVBench dataset:\n\n| Models                | AVG      | AA       | AC       | AL       | AP       | AS       | CO       | CI       | EN       | ER       | FA       | FP       | MA       | MC       | MD       | OE       | OI       | OS       | ST       | SC       | UA       |\n|-----------------------|----------|----------|----------|----------|----------|----------|----------|----------|----------|----------|----------|----------|----------|----------|----------|----------|----------|----------|----------|----------|----------|\n| IG-VLM GPT4V          | 43.7     | 72.0     | 39.0     | 40.5     | 63.5     | 55.5     | 52.0     | 11.0     | 31.0     | 59.0     | 46.5     | 47.5     | 
22.5     | 12.0     | 12.0     | 18.5     | 59.0     | 29.5     | 83.5     | 45.0     | 73.5     |\n| ST-LLM                | 54.9     | 84.0     | 36.5     | 31.0     | 53.5     | 66.0     | 46.5     | 58.5     | 34.5     | 41.5     | 44.0     | 44.5     | 78.5     | 56.5     | 42.5     | 80.5     | 73.5     | 38.5     | 86.5     | 43.0     | 58.5     |\n| ShareGPT4Video        | 51.2     | 79.5     | 35.5     | 41.5     | 39.5     | 49.5     | 46.5     | 51.5     | 28.5     | 39.0     | 40.0     | 25.5     | 75.0     | 62.5     | 50.5     | 82.5     | 54.5     | 32.5     | 84.5     | 51.0     | 54.5     |\n| VideoGPT+             | 58.7     | 83.0     | 39.5     | 34.0     | 60.0     | 69.0     | 50.0     | 60.0     | 29.5     | 44.0     | 48.5     | 53.0     | 90.5     | 71.0     | 44.0     | 85.5     | 75.5     | 36.0     | 89.5     | 45.0     | 66.5     |\n| VideoChat2_HD_mistral | **62.3** | 79.5     | **60.0** | **87.5** | 50.0     | 68.5     | **93.5** | 71.5     | 36.5     | 45.0     | 49.5     | **87.0** | 40.0     | **76.0** | **92.0** | 53.0     | 62.0     | **45.5** | 36.0     | 44.0     | 69.5     |\n| PLLaVA-34B            | 58.1     | 82.0     | 40.5     | 49.5     | 53.0     | 67.5     | 66.5     | 59.0     | **39.5** | **63.5** | 47.0     | 50.0     | 70.0     | 43.0     | 37.5     | 68.5     | 67.5     | 36.5     | 91.0     | 51.5     | **79.0** |\n| CogVLM2-Video         | **62.3** | **85.5** | 41.5     | 31.5     | **65.5** | **79.5** | 58.5     | **77.0** | 28.5     | 42.5     | **54.0** | 57.0     | **91.5** | 73.0     | 48.0     | **91.0** | **78.0** | 36.0     | **91.5** | **47.0** | 68.5     |\n\n## Project structure\n\nThis open source repos will help developers to quickly get started with the basic calling methods of the CogVLM2 open\nsource model, fine-tuning examples, OpenAI API format calling examples, etc. 
The specific project structure is as\nfollows; click a folder name to open the corresponding tutorial:\n\n## [basic_demo](basic_demo\u002FREADME.md) folder includes:\n\n+ **CLI** demo for CogVLM2 model inference.\n+ **CLI** demo for CogVLM2 model inference using multiple GPUs.\n+ **Web** demo, provided by chainlit.\n+ **API** server, in OpenAI format.\n+ **Int4** quantization, easily enabled with `--quant 4`; memory usage is 16GB.\n\n## [finetune_demo](finetune_demo\u002FREADME.md) folder includes:\n\n+ An efficient fine-tuning example using the [**peft**](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fpeft) framework.\n\n## [video_demo](video_demo\u002FREADME.md) folder includes:\n\n+ **CLI** demo for CogVLM2-Video model inference.\n+ **Int4** quantization, easily enabled with `--quant 4`; memory usage is 16GB.\n+ RESTful **API** server.\n+ **Gradio** demo.\n\n## Useful Links\n\nIn addition to the official inference code, you can also refer to the following community-provided inference solutions:\n\n+ [**xinference**](https:\u002F\u002Fgithub.com\u002Fxorbitsai\u002Finference\u002Fpull\u002F1551)\n\n## License\n\nThis model is released under the [CogVLM2 LICENSE](MODEL_LICENSE). 
For models built with Meta Llama 3, please\nalso adhere to the [LLAMA3_LICENSE](https:\u002F\u002Fllama.meta.com\u002Fllama3\u002Flicense\u002F).\n\n## Citation\n\nIf you find our work helpful, please consider citing the following papers:\n\n```\n@article{hong2024cogvlm2,\n  title={CogVLM2: Visual Language Models for Image and Video Understanding},\n  author={Hong, Wenyi and Wang, Weihan and Ding, Ming and Yu, Wenmeng and Lv, Qingsong and Wang, Yan and Cheng, Yean and Huang, Shiyu and Ji, Junhui and Xue, Zhao and others},\n  journal={arXiv preprint arXiv:2408.16500},\n  year={2024}\n}\n```\n\n```\n@misc{wang2023cogvlm,\n      title={CogVLM: Visual Expert for Pretrained Language Models}, \n      author={Weihan Wang and Qingsong Lv and Wenmeng Yu and Wenyi Hong and Ji Qi and Yan Wang and Junhui Ji and Zhuoyi Yang and Lei Zhao and Xixuan Song and Jiazheng Xu and Bin Xu and Juanzi Li and Yuxiao Dong and Ming Ding and Jie Tang},\n      year={2023},\n      eprint={2311.03079},\n      archivePrefix={arXiv},\n      primaryClass={cs.CV}\n}\n```\n","CogVLM2 is an open-source multi-modal model based on Llama3-8B that reaches GPT4V-level performance. Its core features include support for 8K content length and image resolution up to 1344*1344, along with bilingual Chinese and English support. CogVLM2 also offers a video understanding version, CogVLM2-Video, which interprets continuous images by extracting keyframes and supports videos of up to one minute. Technically, the model shows significant improvements on multiple benchmarks such as TextVQA and DocVQA, and it provides several quantized versions to suit different hardware requirements. It is suited to applications that need high-quality image-text understanding and generation, such as intelligent customer service, content moderation, and educational assistance.",2,"2026-06-11 03:42:11","high_star"]