[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-72456":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":16,"stars7d":17,"stars30d":18,"stars90d":16,"forks30d":16,"starsTrendScore":16,"compositeScore":19,"rankGlobal":10,"rankLanguage":10,"license":20,"archived":21,"fork":21,"defaultBranch":22,"hasWiki":21,"hasPages":21,"topics":23,"createdAt":10,"pushedAt":10,"updatedAt":29,"readmeContent":30,"aiSummary":31,"trendingCount":16,"starSnapshotCount":16,"syncStatus":32,"lastSyncTime":33,"discoverSource":34},72456,"VITA","VITA-MLLM\u002FVITA","VITA-MLLM","✨✨[NeurIPS 2025] VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction","",null,"Python",2513,181,46,58,0,1,4,59.68,"Other",false,"main",[24,25,26,27,28],"large-multimodal-models","multimodal-large-language-models","omni-language-model","omni-modal-video-understanding","omni-model","2026-06-12 04:01:05","# VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction\n\n\n\u003Cp align=\"center\">\n    \u003Cimg src=\".\u002Fasset\u002Fvita_newlog.jpg\" width=\"100%\" height=\"100%\">\n\u003C\u002Fp>\n\n\u003Cfont size=7>\u003Cdiv align='center' > [[📖 VITA-1.5 Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2501.01957)] [[🤖 Basic Demo](https:\u002F\u002Fmodelscope.cn\u002Fstudios\u002Fmodelscope\u002FVITA1.5_demo)] [[🍎 VITA-1.0](https:\u002F\u002Fvita-home.github.io\u002F)] [[💬 WeChat (微信)](.\u002Fasset\u002Fwechat-group.jpg)]\u003C\u002Fdiv>\u003C\u002Ffont>\n\n---\n\n\u003Cp align=\"center\">\n    \u003Cimg src=\".\u002Fasset\u002Fvita_demo.jpg\" width=\"80%\" height=\"80%\">\n\u003C\u002Fp>\n\n\u003Cfont size=7>\u003Cdiv align='center' > [[📽 VITA-1.5 Demo Show! Here We Go! 
🔥](https:\u002F\u002Fyoutu.be\u002Ftyi6SVFT5mM?si=fkMQCrwa5fVnmEe7)] \u003C\u002Fdiv>\u003C\u002Ffont>  \n\u003Cfont size=7>\u003Cdiv align='center' > VITA-1.5 supports both **English** and **Chinese**.🌟 \u003C\u002Fdiv>\u003C\u002Ffont>  \nYou can try our [Basic Demo](https:\u002F\u002Fmodelscope.cn\u002Fstudios\u002Fmodelscope\u002FVITA1.5_demo) directly on ModelScope. The Real-Time Interactive Demo must be configured according to the [instructions](#-real-time-interactive-demo).\n\n## 🔥 News\n* **`2025.01.17`** 🌟 ModelScope now supports VITA-1.5! You can try our [Basic Demo](https:\u002F\u002Fmodelscope.cn\u002Fstudios\u002Fmodelscope\u002FVITA1.5_demo) there!\n* **`2025.01.06`** 🌟 [VLMEvalKit](https:\u002F\u002Fgithub.com\u002Fopen-compass\u002FVLMEvalKit) of OpenCompass now supports both our VITA-1.5 and VITA-1.0 models!\n* **`2025.01.06`** 🌟 The [technical report](https:\u002F\u002Fhuggingface.co\u002FVITA-MLLM) of VITA-1.5 has been released!\n* **`2024.12.20`** 🌟 We are excited to introduce **VITA-1.5**, a more powerful and more real-time version!\n* **`2024.08.12`** 🌟 We are very proud to launch **VITA-1.0**, the first-ever open-source interactive omni multimodal LLM! We have submitted the open-source code, but it is still under internal review. 
We are moving the process forward as quickly as possible, stay tuned!\n\n\n## Contents \u003C!-- omit in toc -->\n\n- [VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction](#vita-15-towards-gpt-4o-level-real-time-vision-and-speech-interaction)\n  - [🔥 News](#-news)\n  - [👀 VITA-1.5 Overview](#-vita-15-overview)\n    - [🌟 What’s New in VITA-1.5?](#-whats-new-in-vita-15)\n  - [📈 Experimental Results](#-experimental-results)\n  - [⭐ Training](#-training)\n    - [Requirements and Installation](#requirements-and-installation)\n    - [Data Preparation](#data-preparation)\n    - [Continual Training](#continual-training)\n  - [📐 Inference](#-inference)\n    - [Quick Start](#quick-start)\n    - [Demo](#demo)\n      - [📍 Basic Demo](#-basic-demo)\n      - [📍 Real-Time Interactive Demo](#-real-time-interactive-demo)\n  - [📏Evaluating on MLLM Benchmarks](#evaluating-on-mllm-benchmarks)\n    - [VLMEvalKit](#vlmevalkit)\n    - [Video-MME](#video-mme)\n      - [Data Preparation](#data-preparation-1)\n      - [Evaluation](#evaluation)\n  - [✒️ Citation](#️-citation)\n  - [📣 Statement](#-statement)\n  - [📜 Related Works](#-related-works)\n  - [👍 Acknowledgement](#-acknowledgement)\n\n\n\n## 👀 VITA-1.5 Overview\nOn 2024.08.12, we launched **VITA-1.0**, the **first-ever open-source interactive omni-multimodal LLM**. Now (2024.12.20), we bring **a new version VITA-1.5**!\n\n### 🌟 What’s New in VITA-1.5?\n\nWe are excited to present **VITA-1.5**, which incorporates a series of advancements:\n\n1. **Significantly Reduced Interaction Latency**. The end-to-end speech interaction latency has been reduced from about **4 seconds** to **1.5 seconds**, enabling near-instant interaction and greatly improving user experience.  \n\n2. **Enhanced Multimodal Performance**.  The average performance on multimodal benchmarks such as *MME*, *MMBench*, and *MathVista* has been significantly increased from **59.8** to **70.8**.\n\n3. **Improvement in Speech Processing**. 
The speech processing capabilities have been substantially improved, with ASR WER (Word Error Rate, Test Other) reduced from **18.4** to **7.5**. In addition, we replaced the independent TTS module of VITA-1.0 with an **end-to-end TTS module**, which accepts the LLM's embedding as input.  \n\n4. **Progressive Training Strategy**. With this strategy, adding speech has little effect on other multimodal (vision-language) performance. The average image understanding performance drops only from 71.3 to 70.8.\n\n\n## 📈 Experimental Results\n\n- **Evaluation on image and video understanding benchmarks.**\n\n\u003Cp align=\"center\">\n    \u003Cimg src=\".\u002Fasset\u002Fvita_mllm_performance.png\" width=\"100%\" height=\"100%\">\n\u003C\u002Fp>\n\n- **VITA-1.5 outperforms professional speech models on ASR benchmarks.**\n\n\u003Cp align=\"center\">\n    \u003Cimg src=\".\u002Fasset\u002Fvita_15_audio_2.jpg\" width=\"96%\" height=\"96%\">\n\u003C\u002Fp>\n\n- **Adding the audio modality has little effect on image and video understanding capability**.\n\n\u003Cp align=\"center\">\n    \u003Cimg src=\".\u002Fasset\u002Fvita_15_audio_training.png\" width=\"68%\" height=\"50%\">\n\u003C\u002Fp>\n\n## ⭐ Training\n### Requirements and Installation\n```\ngit clone https:\u002F\u002Fgithub.com\u002FVITA-MLLM\u002FVITA\ncd VITA\nconda create -n vita python=3.10 -y\nconda activate vita\npip install --upgrade pip\npip install -r requirements.txt\npip install flash-attn --no-build-isolation\n```\n\n### Data Preparation\n- An example JSON file of the training data:\n```\n[\n    ...\n    {\n        \"set\": \"sharegpt4\",\n        \"id\": \"000000000164\",\n        \"conversations\": [\n            {\n                \"from\": \"human\",\n                \"value\": \"\u003Cimage>\\n\u003Caudio>\\n\"\n            },\n            {\n                \"from\": \"gpt\",  \u002F\u002F follows the LLaVA convention; \"gpt\" only indicates that this is the ground truth of the model 
output\n                \"value\": \"This is a well-organized kitchen with a clean, modern aesthetic. The kitchen features a white countertop against a white wall, creating a bright and airy atmosphere. \"\n            }\n        ],\n        \"image\": \"coco\u002Fimages\u002Ftrain2017\u002F000000000164.jpg\",\n        \"audio\": [\n            \"new_value_dict_0717\u002Foutput_wavs\u002Ff61cf238b7872b4903e1fc15dcb5a50c.wav\"\n        ]\n    },\n    ...\n]\n```\n\n- The `set` field is used to look up the image or video folder for data loading. Add its key-value pair to `FolderDict` in [.\u002Fvita\u002Fconfig\u002Fdataset_config.py](.\u002Fvita\u002Fconfig\u002Fdataset_config.py):\n```\nAudioFolder = \"\"\nFolderDict = {\n    #### NaturalCap\n    \"sharegpt4\": \"\",\n}\n#### NaturalCap\nShareGPT4V = {\"chat_path\": \"\"}\n```\n\n- Set the JSON path for `\"chat_path\"` in the corresponding dictionary in [.\u002Fvita\u002Fconfig\u002Fdataset_config.py](.\u002Fvita\u002Fconfig\u002Fdataset_config.py).\n- Set the audio folder path for `AudioFolder` in [.\u002Fvita\u002Fconfig\u002Fdataset_config.py](.\u002Fvita\u002Fconfig\u002Fdataset_config.py).\n- Add the data class to `DataConfig` in [.\u002Fvita\u002Fconfig\u002F\\_\\_init\\_\\_.py](.\u002Fvita\u002Fconfig\u002F__init__.py):\n```\nfrom .dataset_config import *\n\nNaturalCap = [ShareGPT4V]\n\nDataConfig = {\n    \"Pretrain_video\": NaturalCap,\n}\n```\n\n\n### Continual Training\n- Download the required weights: (1) [VITA-1.5 checkpoint](https:\u002F\u002Fhuggingface.co\u002FVITA-MLLM\u002FVITA-1.5\u002Ftree\u002Fmain), (2) [InternViT-300M-448px](https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternViT-300M-448px), and (3) [our pretrained audio encoder](https:\u002F\u002Fhuggingface.co\u002FVITA-MLLM\u002FVITA-1.5\u002Ftree\u002Fmain\u002Faudio-encoder-Qwen2-7B-1107-weight-base-11wh-tunning) from Stage-2 audio-language alignment (refer to Fig. 
3 in the paper).\n\n- Replace the paths in [.\u002Fscript\u002Ftrain\u002FfinetuneTaskNeg_qwen_nodes.sh](https:\u002F\u002Fgithub.com\u002FBradyFU\u002FVITA-Temp\u002Fblob\u002Fmain\u002Fscript\u002Ftrain\u002FfinetuneTaskNeg_qwen_nodes.sh):\n```\n    ...\n    --model_name_or_path VITA1.5_ckpt \\\n    ...\n    --vision_tower InternViT-300M-448px \\\n    ...\n    --audio_encoder audio-encoder-Qwen2-7B-1107-weight-base-11wh-tunning \\\n    ...\n```\n\n- Execute the following commands to start the training process:\n\n```\nexport PYTHONPATH=.\u002F\nexport PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True\nOUTPUT_DIR=\u002Fmnt\u002Fcfs\u002Flhj\u002Fvideomllm_ckpt\u002Foutputs\u002Fvita_video_audio\nbash script\u002Ftrain\u002FfinetuneTaskNeg_qwen_nodes.sh ${OUTPUT_DIR}\n```\n\n\n## 📐 Inference\n### Quick Start\n- Text query\n```\nCUDA_VISIBLE_DEVICES=2 python video_audio_demo.py \\\n    --model_path [vita\u002Fpath] \\\n    --image_path asset\u002Fvita_newlog.jpg \\\n    --model_type qwen2p5_instruct \\\n    --conv_mode qwen2p5_instruct \\\n    --question \"Describe this image.\"\n```\n\n- Audio query\n```\nCUDA_VISIBLE_DEVICES=4 python video_audio_demo.py \\\n    --model_path [vita\u002Fpath] \\\n    --image_path asset\u002Fvita_newlog.png \\\n    --model_type qwen2p5_instruct \\\n    --conv_mode qwen2p5_instruct \\\n    --audio_path asset\u002Fq1.wav\n```\n\n- Noisy audio query\n```\nCUDA_VISIBLE_DEVICES=4 python video_audio_demo.py \\\n    --model_path [vita\u002Fpath] \\\n    --image_path asset\u002Fvita_newlog.png \\\n    --model_type qwen2p5_instruct \\\n    --conv_mode qwen2p5_instruct \\\n    --audio_path asset\u002Fq2.wav\n```\n\n\n### Demo\n\nWe have accelerated the model using [vLLM](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fvllm). 
\nSince VITA has not yet been integrated into vLLM, you need to make some modifications to the vLLM code to adapt it for VITA.\n\n\n```bash\nconda create -n vita_demo python==3.10\nconda activate vita_demo\npip install -r web_demo\u002Fweb_demo_requirements.txt\n\n# Backup a new weight file\ncp -rL  VITA_ckpt\u002F demo_VITA_ckpt\u002F\n\nmv demo_VITA_ckpt\u002Fconfig.json demo_VITA_ckpt\u002Forigin_config.json\n\ncd .\u002Fweb_demo\u002Fvllm_tools\ncp -rf qwen2p5_model_weight_file\u002F*  ..\u002F..\u002Fdemo_VITA_ckpt\u002F\ncp -rf vllm_file\u002F*  your_anaconda\u002Fenvs\u002Fvita_demo\u002Flib\u002Fpython3.10\u002Fsite-packages\u002Fvllm\u002Fmodel_executor\u002Fmodels\u002F\n```\n\n\n\n\n#### 📍 Basic Demo\n\nhttps:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F43edd44a-8c8d-43ea-9d2b-beebe909377a\n\n\n\n```bash\npython -m web_demo.web_ability_demo  demo_VITA_ckpt\u002F\n```\n\n\n\n#### 📍 Real-Time Interactive Demo\n\nTo run the real-time interactive demo, you need to make the following preparations:\n\n- Make sure that you have executed the above instructions under the [Demo](#demo) section (`cp` files out from the `vllm_tools`).\n\n- Prepare a VAD (Voice Activity Detection) module. \nYou can choose to download [silero_vad.onnx](https:\u002F\u002Fgithub.com\u002Fsnakers4\u002Fsilero-vad\u002Ftree\u002Fv4.0\u002Ffiles) and [silero_vad.jit](https:\u002F\u002Fgithub.com\u002Fsnakers4\u002Fsilero-vad\u002Ftree\u002Fv4.0\u002Ffiles), and place these files in the `.\u002Fweb_demo\u002Fwakeup_and_vad\u002Fresource\u002F` directory.\n\n- For a better real-time interactive experience, you need to set `max_dynamic_patch` to 1 in `demo_VITA_ckpt\u002Fconfig.json`. 
\nWhen you run the basic demo, you can set it to the default value of 12 to enhance the model's visual capabilities.\n\n```bash\npip install flask==3.1.0 flask-socketio==5.5.0 cryptography==44.0.0 timm==1.0.12\npython -m web_demo.server --model_path demo_VITA_ckpt --ip 0.0.0.0 --port 8081\n```\n\n\n## 📏Evaluating on MLLM Benchmarks\n### [VLMEvalKit](https:\u002F\u002Fgithub.com\u002Fopen-compass\u002FVLMEvalKit)\nModify the model path of `vita_qwen2` in `VLMEvalKit\u002Fvlmeval\u002Fconfig.py`:\n```\nvita_series = { \n    'vita': partial(VITA, model_path='\u002Fpath\u002Fto\u002Fmodel'),\n    'vita_qwen2': partial(VITAQwen2, model_path='\u002Fpath\u002Fto\u002Fmodel'),\n}\n```\n\nFollow the [instructions in VLMEvalKit](https:\u002F\u002Fgithub.com\u002Fopen-compass\u002FVLMEvalKit\u002Fblob\u002Fmain\u002Fdocs\u002Fen\u002FQuickstart.md) to set GPT as the judge model.\n\nIf the OpenAI API is not available, you can use a local model as the judge. In our experiments, we found that a [Qwen1.5-1.8B-Chat](https:\u002F\u002Fhuggingface.co\u002FQwen\u002FQwen1.5-1.8B-Chat) judge works well compared to GPT-4, except on MM-Vet. 
To start the judge:\n```\nCUDA_VISIBLE_DEVICES=0 lmdeploy serve api_server \u002Fmnt\u002Fcfs\u002Flhj\u002Fmodel_weights\u002FQwen1.5-1.8B-Chat --server-port 23333\n```\nThen configure the `.env` file in the `VLMEvalKit` folder:\n```\nOPENAI_API_KEY=sk-123456\nOPENAI_API_BASE=http:\u002F\u002F0.0.0.0:23333\u002Fv1\u002Fchat\u002Fcompletions\nLOCAL_LLM=\u002Fmnt\u002Fcfs\u002Flhj\u002Fmodel_weights\u002FQwen1.5-1.8B-Chat\n```\nEvaluating on these benchmarks:\n```\nCUDA_VISIBLE_DEVICES=0 python run.py --data MMBench_TEST_EN_V11 MMBench_TEST_CN_V11 MMStar MMMU_DEV_VAL MathVista_MINI HallusionBench AI2D_TEST OCRBench MMVet MME --model vita_qwen2 --verbose\n```\n\n### Video-MME\n#### Data Preparation\nDownload the [Video-MME dataset](https:\u002F\u002Fgithub.com\u002FBradyFU\u002FVideo-MME) and extract the frames, saving them as images to improve IO efficiency.\n\n#### Evaluation\n```\ncd .\u002Fvideomme\n```\nRun the model on Video-MME in the setting of wo\u002F subtitles:\n```\nVIDEO_TYPE=\"s,m,l\"\nNAMES=(lyd jyg wzh wzz zcy by dyh lfy)\nfor((i=0; i\u003C${#NAMES[@]}; i++)) \ndo\n    CUDA_VISIBLE_DEVICES=6 python yt_video_inference_qa_imgs.py \\\n        --model-path [vita\u002Fpath] \\\n        --model_type qwen2p5_instruct \\\n        --conv_mode qwen2p5_instruct \\\n        --responsible_man ${NAMES[i]} \\\n        --video_type $VIDEO_TYPE \\\n        --output_dir qa_wo_sub \\\n        --video_dir [Video-MME-imgs] | tee logs\u002Finfer.log\ndone\n\n```\nRun the model on Video-MME in the setting of w\u002F subtitles:\n```\nVIDEO_TYPE=\"s,m,l\"\nNAMES=(lyd jyg wzh wzz zcy by dyh lfy)\nfor((i=0; i\u003C${#NAMES[@]}; i++)) \ndo\n    CUDA_VISIBLE_DEVICES=7 python yt_video_inference_qa_imgs.py \\\n        --model-path [vita\u002Fpath] \\\n        --model_type qwen2p5_instruct \\\n        --conv_mode qwen2p5_instruct \\\n        --responsible_man ${NAMES[i]} \\\n        --video_type $VIDEO_TYPE \\\n        --output_dir qa_w_sub \\\n        --video_dir [Video-MME-imgs] 
\\\n        --use_subtitles | tee logs\u002Finfer.log\ndone\n```\nParse the results:\n```\npython parse_answer.py --video_types \"s,m,l\" --result_dir qa_wo_sub\npython parse_answer.py --video_types \"s,m,l\" --result_dir qa_w_sub\n```\n## ✒️ Citation\n\nIf you find our work helpful for your research, please consider citing it.   \n\n```bibtex\n@article{fu2025vita,\n  title={VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction},\n  author={Fu, Chaoyou and Lin, Haojia and Wang, Xiong and Zhang, Yi-Fan and Shen, Yunhang and Liu, Xiaoyu and Li, Yangze and Long, Zuwei and Gao, Heting and Li, Ke and others},\n  journal={arXiv preprint arXiv:2501.01957},\n  year={2025}\n}\n\n@article{fu2024vita,\n  title={Vita: Towards open-source interactive omni multimodal llm},\n  author={Fu, Chaoyou and Lin, Haojia and Long, Zuwei and Shen, Yunhang and Zhao, Meng and Zhang, Yifan and Dong, Shaoqi and Wang, Xiong and Yin, Di and Ma, Long and others},\n  journal={arXiv preprint arXiv:2408.05211},\n  year={2024}\n}\n```\n\n\n## 📣 Statement\n\n**VITA is trained on a large-scale open-source corpus, and its output is stochastic. Content generated by VITA does not represent the views of the model developers. 
We are not responsible for any problems arising from the use, misuse, or dissemination of VITA, including but not limited to public opinion risks and data security issues.**\n\n\n## 📜 Related Works\n\nExplore our related research:\n-  **[VITA-1.0]** [VITA: Towards Open-Source Interactive Omni Multimodal LLM](https:\u002F\u002Fvita-home.github.io\u002F)\n-  **[Awesome-MLLM]** [A Survey on Multimodal Large Language Models](https:\u002F\u002Fgithub.com\u002FBradyFU\u002FAwesome-Multimodal-Large-Language-Models)\n-  **[MME]** [MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models](https:\u002F\u002Fgithub.com\u002FBradyFU\u002FAwesome-Multimodal-Large-Language-Models\u002Ftree\u002FEvaluation)\n-  **[Video-MME]** [Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis](https:\u002F\u002Fgithub.com\u002FBradyFU\u002FVideo-MME) \n\n\n## 👍 Acknowledgement\nVITA is built with reference to the following outstanding works: [LLaVA-1.5](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA), [Bunny](https:\u002F\u002Fgithub.com\u002FBAAI-DCAI\u002FBunny), [Chat-UniVi](https:\u002F\u002Fgithub.com\u002FPKU-YuanGroup\u002FChat-UniVi), [InternVL](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FInternVL), [InternViT](https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternViT-300M-448px), [Qwen-2.5](https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen2.5), [VLMEvalKit](https:\u002F\u002Fgithub.com\u002Fopen-compass\u002FVLMEvalKit), and [Mixtral 8*7B](https:\u002F\u002Fmistral.ai\u002Fnews\u002Fmixtral-of-experts\u002F).\nThanks!\n\n","VITA-1.5 is a multimodal large model for real-time vision and speech interaction that aims to reach GPT-4o-level performance. The project is developed in Python and supports both English and Chinese. Its core capabilities are real-time visual understanding and speech interaction: it can process video, images, and speech, and offers strong multimodal understanding and generation. The model is particularly suited to applications that need efficient, accurate processing of multimodal information, such as intelligent customer service, virtual assistants, and educational tools. Users can try VITA-1.5 directly through the Basic Demo on the ModelScope platform.",2,"2026-06-11 03:42:08","high_star"]