[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-10769":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":16,"stars7d":17,"stars30d":18,"stars90d":16,"forks30d":16,"starsTrendScore":16,"compositeScore":19,"rankGlobal":10,"rankLanguage":10,"license":20,"archived":21,"fork":21,"defaultBranch":22,"hasWiki":21,"hasPages":23,"topics":24,"createdAt":10,"pushedAt":10,"updatedAt":36,"readmeContent":37,"aiSummary":38,"trendingCount":16,"starSnapshotCount":16,"syncStatus":17,"lastSyncTime":39,"discoverSource":40},10769,"Video-ChatGPT","mbzuai-oryx\u002FVideo-ChatGPT","mbzuai-oryx","[ACL 2024 🔥] Video-ChatGPT is a video conversation model capable of generating meaningful conversation about videos. It combines the capabilities of LLMs with a pretrained visual encoder adapted for spatiotemporal video representation. 
We also introduce a rigorous 'Quantitative Evaluation Benchmarking' for video-based conversational models.","https:\u002F\u002Fmbzuai-oryx.github.io\u002FVideo-ChatGPT",null,"Python",1503,129,12,25,0,2,5,19.34,"Creative Commons Attribution 4.0 International",false,"main",true,[25,26,27,28,29,30,31,32,33,34,35],"chatbot","clip","gpt-4","llama","llava","mulit-modal","vicuna","video-chatboat","video-conversation","vision-language","vision-language-pretraining","2026-06-12 02:02:26","# Oryx Video-ChatGPT :movie_camera: :speech_balloon:\n\n\u003Cp align=\"center\">\n    \u003Cimg src=\"https:\u002F\u002Fi.imgur.com\u002FwaxVImv.png\" alt=\"Oryx Video-ChatGPT\">\n\u003C\u002Fp>\n\n### Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models [ACL 2024 🔥]\n\n#### [Muhammad Maaz](https:\u002F\u002Fwww.mmaaz60.com)* , [Hanoona Rasheed](https:\u002F\u002Fwww.hanoonarasheed.com\u002F)* , [Salman Khan](https:\u002F\u002Fsalman-h-khan.github.io\u002F) and [Fahad Khan](https:\u002F\u002Fsites.google.com\u002Fview\u002Ffahadkhans\u002Fhome)\n\\* Equally contributing first authors\n\n#### **Mohamed bin Zayed University of Artificial Intelligence**\n\n---\n#### **Diverse Video-based Generative Performance Benchmarking (VCGBench-Diverse)**\n[![PWC](https:\u002F\u002Fimg.shields.io\u002Fendpoint.svg?url=https:\u002F\u002Fpaperswithcode.com\u002Fbadge\u002Fvideo-chatgpt-towards-detailed-video\u002Fvcgbench-diverse-on-videoinstruct)](https:\u002F\u002Fpaperswithcode.com\u002Fsota\u002Fvcgbench-diverse-on-videoinstruct?p=video-chatgpt-towards-detailed-video)\n\n\n#### **Video-based Generative Performance Benchmarking**\n[![PWC](https:\u002F\u002Fimg.shields.io\u002Fendpoint.svg?url=https:\u002F\u002Fpaperswithcode.com\u002Fbadge\u002Fvideo-chatgpt-towards-detailed-video\u002Fvideo-based-generative-performance)](https:\u002F\u002Fpaperswithcode.com\u002Fsota\u002Fvideo-based-generative-performance?p=video-chatgpt-towards-detailed-video)\n\n\n#### **Zeroshot 
Question-Answer Evaluation**\n[![PWC](https:\u002F\u002Fimg.shields.io\u002Fendpoint.svg?url=https:\u002F\u002Fpaperswithcode.com\u002Fbadge\u002Fvideo-chatgpt-towards-detailed-video\u002Fzeroshot-video-question-answer-on-msvd-qa)](https:\u002F\u002Fpaperswithcode.com\u002Fsota\u002Fzeroshot-video-question-answer-on-msvd-qa?p=video-chatgpt-towards-detailed-video)\n[![PWC](https:\u002F\u002Fimg.shields.io\u002Fendpoint.svg?url=https:\u002F\u002Fpaperswithcode.com\u002Fbadge\u002Fvideo-chatgpt-towards-detailed-video\u002Fzeroshot-video-question-answer-on-msrvtt-qa)](https:\u002F\u002Fpaperswithcode.com\u002Fsota\u002Fzeroshot-video-question-answer-on-msrvtt-qa?p=video-chatgpt-towards-detailed-video)\n[![PWC](https:\u002F\u002Fimg.shields.io\u002Fendpoint.svg?url=https:\u002F\u002Fpaperswithcode.com\u002Fbadge\u002Fvideo-chatgpt-towards-detailed-video\u002Fzeroshot-video-question-answer-on-tgif-qa)](https:\u002F\u002Fpaperswithcode.com\u002Fsota\u002Fzeroshot-video-question-answer-on-tgif-qa?p=video-chatgpt-towards-detailed-video)\n[![PWC](https:\u002F\u002Fimg.shields.io\u002Fendpoint.svg?url=https:\u002F\u002Fpaperswithcode.com\u002Fbadge\u002Fvideo-chatgpt-towards-detailed-video\u002Fzeroshot-video-question-answer-on-activitynet)](https:\u002F\u002Fpaperswithcode.com\u002Fsota\u002Fzeroshot-video-question-answer-on-activitynet?p=video-chatgpt-towards-detailed-video)\n\n\n---\n\n| Demo | Paper | Demo Clips | Offline Demo | Training | Video Instruction Data | Quantitative Evaluation | Qualitative Analysis |\n| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |\n| [![Demo](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F-Try%20it%20out-green)](https:\u002F\u002Fwww.ival-mbzuai.com\u002Fvideo-chatgpt) [![YouTube](https:\u002F\u002Fbadges.aleen42.com\u002Fsrc\u002Fyoutube.svg)](https:\u002F\u002Fyoutu.be\u002FfRhm---HWJY) | 
[![paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPaper-\u003CCOLOR>.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.05424) | [![DemoClip-1](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F-DemoClip1-blue)](https:\u002F\u002Fyoutu.be\u002FR8qW5EJD2-k) [![DemoClip-2](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F-DemoClip2-yellow)](https:\u002F\u002Fyoutu.be\u002FujCxqxMXLVw) [![DemoClip-3](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F-DemoClip3-violet)](https:\u002F\u002Fyoutu.be\u002F97IWKMsbZ80) [![DemoClip-4](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F-DemoClip4-orange)](https:\u002F\u002Fyoutu.be\u002FZyJZfTg_Ttc) | [Offline Demo](#running-demo-offline-cd) | [Training](#training-train) | [Video Instruction Dataset](#video-instruction-dataset-open_file_folder) | [Quantitative Evaluation](#quantitative-evaluation-bar_chart) | [Qualitative Analysis](#qualitative-analysis-mag) |\n\n---\n\n## :loudspeaker: Latest Updates\n- **Mar-28-25**: *Mobile-VideoGPT* is released. It achieves excellent results on multiple benchmarks with 2x higher throughput. Check it out [Mobile-VideoGPT](https:\u002F\u002Fgithub.com\u002FAmshaker\u002FMobile-VideoGPT) :fire::fire:\n---\n\n- **Jun-14-24**: *VideoGPT+* is released. It achieves SoTA results on multiple benchmarks. Check it out at [VideoGPT+](https:\u002F\u002Fgithub.com\u002Fmbzuai-oryx\u002FVideoGPT-plus) :fire::fire:\n- **Jun-14-24**: *Semi-automatic video annotation pipeline* is released. Check it out at [GitHub](https:\u002F\u002Fgithub.com\u002Fmbzuai-oryx\u002FVideoGPT-plus), [HuggingFace](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FMBZUAI\u002Fvideo_annotation_pipeline). :fire::fire:\n- **Jun-14-24**: *VCGBench-Diverse Benchmarks* are released. It provides 4,354 human annotated QA pairs across 18 video categories to extensively evaluate the performance of a video-conversation model. 
Check it out at [GitHub](https:\u002F\u002Fgithub.com\u002Fmbzuai-oryx\u002FVideoGPT-plus), [HuggingFace](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FMBZUAI\u002FVCGBench-Diverse). :fire::fire:\n---\n\n- **May-16-24**: Video-ChatGPT is accepted at ACL 2024! 🎊🎊\n- **Sep-30-23**: Our VideoInstruct100K dataset can be downloaded from [HuggingFace\u002FVideoInstruct100K](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FMBZUAI\u002FVideoInstruct-100K). :fire::fire:\n- **Jul-15-23**: Our quantitative evaluation benchmark for Video-based Conversational Models now has its own dedicated website: [https:\u002F\u002Fmbzuai-oryx.github.io\u002FVideo-ChatGPT](https:\u002F\u002Fmbzuai-oryx.github.io\u002FVideo-ChatGPT). :fire::fire:\n- **Jun-28-23**: Updated GitHub readme featuring benchmark comparisons of Video-ChatGPT against recent models - Video Chat, Video LLaMA, and LLaMA Adapter. Amid these advanced conversational models, Video-ChatGPT continues to deliver state-of-the-art performance.:fire::fire:\n- **Jun-08-23** : Released the training code, offline demo, instructional data and technical report. \nAll the resources including models, datasets and extracted features are available \n[here](https:\u002F\u002Fmbzuaiac-my.sharepoint.com\u002F:f:\u002Fg\u002Fpersonal\u002Fhanoona_bangalath_mbzuai_ac_ae\u002FEudc2kLOX4hIuCenDmFe-UIBthkBQKpF9p6KrY2q_s9hwQ?e=zHKbTX). 
:fire::fire:\n- **May-21-23** : Video-ChatGPT: demo released.\n\n---\n\n## Online Demo :computer:\n\n:fire::fire: **You can try our demo using the provided examples or by uploading your own videos [HERE](https:\u002F\u002Fwww.ival-mbzuai.com\u002Fvideo-chatgpt).** :fire::fire:\n\n:fire::fire: **Or click the image to try the demo!** :fire::fire:\n[![demo](docs\u002Fimages\u002Fdemo_icon.png)](https:\u002F\u002Fwww.ival-mbzuai.com\u002Fvideo-chatgpt)\nYou can access all the videos we demonstrate [here](https:\u002F\u002Fmbzuaiac-my.sharepoint.com\u002F:f:\u002Fg\u002Fpersonal\u002Fhanoona_bangalath_mbzuai_ac_ae\u002FEqrZjHG0KoFNhx6nDcCmFU0BtRqWyg8_zUgzvNQDY5t_3Q?e=AoEdnI).\n\n---\n\n## Video-ChatGPT Overview :bulb:\n\nVideo-ChatGPT is a video conversation model capable of generating meaningful conversation about videos. \nIt combines the capabilities of LLMs with a pretrained visual encoder adapted for spatiotemporal video representation.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"docs\u002Fimages\u002FVideo-ChatGPT.gif\" alt=\"Video-ChatGPT Architectural Overview\">\n\u003C\u002Fp>\n\n---\n\n## Contributions :trophy:\n\n- We introduce 100K high-quality video-instruction pairs together with a novel, scalable annotation framework that generates a diverse range of high-quality, video-specific instruction sets.\n- We develop the first quantitative video conversation evaluation framework for benchmarking video conversation models.\n- Unique multimodal (vision-language) capability combining video understanding and language generation that is comprehensively \nevaluated using quantitative and qualitative comparisons on video reasoning, creativity, spatial and temporal understanding, and action recognition tasks.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"docs\u002Fimages\u002Fhightlights_video_chatgpt.png\" alt=\"Contributions\">\n\u003C\u002Fp>\n\n---\n\n## Installation :wrench:\n\nWe recommend setting up a conda environment for the 
project:\n```shell\nconda create --name=video_chatgpt python=3.10\nconda activate video_chatgpt\n\ngit clone https:\u002F\u002Fgithub.com\u002Fmbzuai-oryx\u002FVideo-ChatGPT.git\ncd Video-ChatGPT\npip install -r requirements.txt\n\nexport PYTHONPATH=\".\u002F:$PYTHONPATH\"\n```\nAdditionally, install [FlashAttention](https:\u002F\u002Fgithub.com\u002FHazyResearch\u002Fflash-attention) for training,\n```shell\npip install ninja\n\ngit clone https:\u002F\u002Fgithub.com\u002FHazyResearch\u002Fflash-attention.git\ncd flash-attention\ngit checkout v1.0.7\npython setup.py install\n```\n\n---\n\n## Running Demo Offline :cd:\n\nTo run the demo offline, please refer to the instructions in [offline_demo.md](docs\u002Foffline_demo.md).\n\n---\n\n## Training :train:\n\nFor training instructions, check out [train_video_chatgpt.md](docs\u002Ftrain_video_chatgpt.md).\n\n---\n\n## Video Instruction Dataset :open_file_folder:\n\nWe are releasing our 100,000 high-quality video instruction dataset that was used for training our Video-ChatGPT model. You can download the dataset from \n[here](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FMBZUAI\u002FVideoInstruct-100K). \nMore details on our human-assisted and semi-automatic annotation framework for generating the data are available at [VideoInstructionDataset.md](data\u002FREADME.md).\n\n---\n\n## Quantitative Evaluation :bar_chart:\nOur paper introduces a new Quantitative Evaluation Framework for Video-based Conversational Models. 
To explore our benchmarks and understand the framework in greater detail, \nplease visit our dedicated website: [https:\u002F\u002Fmbzuai-oryx.github.io\u002FVideo-ChatGPT](https:\u002F\u002Fmbzuai-oryx.github.io\u002FVideo-ChatGPT).\n\nFor detailed instructions on performing quantitative evaluation, please refer to [QuantitativeEvaluation.md](quantitative_evaluation\u002FREADME.md).\n\n**Video-based Generative Performance Benchmarking**  and **Zero-Shot Question-Answer Evaluation** tables are provided for a detailed performance overview. \n\n### Zero-Shot Question-Answer Evaluation\n\n| **Model** | **MSVD-QA** |  | **MSRVTT-QA** |  | **TGIF-QA** |  | **Activity Net-QA** |  |\n| --- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |\n| | **Accuracy** | **Score** | **Accuracy** | **Score** | **Accuracy** | **Score** | **Accuracy** | **Score** |\n| FrozenBiLM | 32.2 | -- | 16.8 | -- | 41.0 | -- | 24.7 | -- |\n| Video Chat | 56.3 | 2.8 | 45.0 | 2.5 | 34.4 | 2.3 | 26.5 | 2.2 |\n| LLaMA Adapter | 54.9 | 3.1 | 43.8 | 2.7 | - | - | 34.2 | 2.7 |\n| Video LLaMA | 51.6 | 2.5 | 29.6 | 1.8 | - | - | 12.4 | 1.1 |\n| Video-ChatGPT | **64.9** | **3.3** | **49.3** | **2.8** | **51.4** | **3.0** | **35.2** | **2.7** |\n\n\n---\n\n### Video-based Generative Performance Benchmarking\n\n| **Evaluation Aspect** | **Video Chat** | **LLaMA Adapter** | **Video LLaMA** | **Video-ChatGPT** |\n| --- |:--------------:|:-----------------:|:--------------:|:-----------------:|\n| Correctness of Information |      2.23      |       2.03        |      1.96      |       **2.40**        |\n| Detail Orientation |      2.50      |       2.32        |      2.18      |       **2.52**        |\n| Contextual Understanding |      2.53      |       2.30        |      2.16      |       **2.62**        |\n| Temporal Understanding |      1.94      |       **1.98**        |      1.82      |       **1.98**        |\n| Consistency |      2.24      |       2.15        |      1.79      |       
**2.37**        |\n\n---\n\n## Qualitative Analysis :mag:\nA Comprehensive Evaluation of Video-ChatGPT's Performance across Multiple Tasks.\n\n### Video Reasoning Tasks :movie_camera:\n![sample1](docs\u002Fdemo_samples\u002Fvideo_reasoning-min.png)\n\n---\n### Creative and Generative Tasks :paintbrush:\n![sample5](docs\u002Fdemo_samples\u002Fcreative_and_generative-min.png)\n\n---\n### Spatial Understanding :globe_with_meridians:\n![sample8](docs\u002Fdemo_samples\u002Fspatial_understanding-min.png)\n\n---\n### Video Understanding and Conversational Tasks :speech_balloon:\n![sample10](docs\u002Fdemo_samples\u002Fvideo_understanding_and_conversation-min.png)\n\n---\n### Action Recognition :runner:\n![sample22](docs\u002Fdemo_samples\u002Faction_recognition-min.png)\n\n---\n### Question Answering Tasks :question:\n![sample14](docs\u002Fdemo_samples\u002Fquestion_answering-min.png)\n\n---\n### Temporal Understanding :hourglass_flowing_sand:\n![sample18](docs\u002Fdemo_samples\u002Ftemporal_understanding-min.png)\n\n---\n\n## Acknowledgements :pray:\n\n+ [LLaMA](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fllama): A great attempt towards open and efficient LLMs!\n+ [Vicuna](https:\u002F\u002Fgithub.com\u002Flm-sys\u002FFastChat): Has amazing language capabilities!\n+ [LLaVA](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA): Our architecture is inspired by LLaVA.\n+ Thanks to our colleagues at MBZUAI for their essential contribution to the video annotation task, \nincluding Salman Khan, Fahad Khan, Abdelrahman Shaker, Shahina Kunhimon, Muhammad Uzair, Sanoojan Baliah, Malitha Gunawardhana, Akhtar Munir, \nVishal Thengane, Vignagajan Vigneswaran, Jiale Cao, Nian Liu, Muhammad Ali, Gayal Kurrupu, Roba Al Majzoub, \nJameel Hassan, Hanan Ghani, Muzammal Naseer, Akshay Dudhane, Jean Lahoud, Awais Rauf, Sahal Shaji, Bokang Jia,\nwithout which this project would not be possible.\n\nIf you're using Video-ChatGPT in your research or applications, please 
cite using this BibTeX:\n```bibtex\n@inproceedings{Maaz2023VideoChatGPT,\n    title={Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models},\n    author={Maaz, Muhammad and Rasheed, Hanoona and Khan, Salman and Khan, Fahad Shahbaz},\n    booktitle={Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024)},\n    year={2024}\n}\n```\n\n## License :scroll:\n\u003Ca rel=\"license\" href=\"http:\u002F\u002Fcreativecommons.org\u002Flicenses\u002Fby-nc-sa\u002F4.0\u002F\">\u003Cimg alt=\"Creative Commons License\" style=\"border-width:0\" src=\"https:\u002F\u002Fi.creativecommons.org\u002Fl\u002Fby-nc-sa\u002F4.0\u002F80x15.png\" \u002F>\u003C\u002Fa>\u003Cbr \u002F>This work is licensed under a \u003Ca rel=\"license\" href=\"http:\u002F\u002Fcreativecommons.org\u002Flicenses\u002Fby-nc-sa\u002F4.0\u002F\">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License\u003C\u002Fa>.\n\n\nLooking forward to your feedback, contributions, and stars! :star2:\nPlease raise any issues or questions [here](https:\u002F\u002Fgithub.com\u002Fmbzuai-oryx\u002FVideo-ChatGPT\u002Fissues). \n\n\n---\n[\u003Cimg src=\"docs\u002Fimages\u002FIVAL_logo.png\" width=\"200\" height=\"100\">](https:\u002F\u002Fwww.ival-mbzuai.com)\n[\u003Cimg src=\"docs\u002Fimages\u002FOryx_logo.png\" width=\"100\" height=\"100\">](https:\u002F\u002Fgithub.com\u002Fmbzuai-oryx)\n[\u003Cimg src=\"docs\u002Fimages\u002FMBZUAI_logo.png\" width=\"360\" height=\"85\">](https:\u002F\u002Fmbzuai.ac.ae)\n","Video-ChatGPT is a model that generates meaningful conversation about video content, combining large language models with a pretrained visual encoder adapted for spatiotemporal video representation. The project's core features include video-based conversation generation, multimodal understanding, and zero-shot question answering, and it introduces a rigorous quantitative evaluation benchmark for measuring the performance of video conversation models. It suits applications that require in-depth understanding of and interaction with video content, such as education, entertainment, and intelligent customer service.","2026-06-11 03:30:05","top_topic"]