[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-70995":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":23,"hasPages":23,"topics":25,"createdAt":10,"pushedAt":10,"updatedAt":38,"readmeContent":39,"aiSummary":40,"trendingCount":16,"starSnapshotCount":16,"syncStatus":17,"lastSyncTime":41,"discoverSource":42},70995,"InternVL","OpenGVLab\u002FInternVL","OpenGVLab","[CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4o.  接近GPT-4o表现的开源多模态对话模型","https:\u002F\u002Finternvl.readthedocs.io\u002Fen\u002Flatest\u002F",null,"Python",10059,782,66,304,0,2,12,38,6,43.68,"MIT License",false,"main",[26,27,28,29,30,31,32,33,34,35,36,37],"gpt","gpt-4o","gpt-4v","image-classification","image-text-retrieval","llm","multi-modal","semantic-segmentation","video-classification","vision-language-model","vit-22b","vit-6b","2026-06-12 02:02:46","\u003Cdiv align=\"center\">\n\n# InternVL Family: Closing the Gap to Commercial Multimodal Models with Open-Source Suites —— A Pioneering Open-Source Alternative to GPT-5\n\n\u003Cdiv align=\"center\">\n  \u003Cimg width=\"500\" alt=\"image\" src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F930e6814-8a9f-43e1-a284-118a5732daa4\">\n  \u003Cbr>\n\u003C\u002Fdiv>\n\n[\\[🆕 Blog\\]](https:\u002F\u002Finternvl.github.io\u002Fblog\u002F)\n[\\[🤔 FAQs\\]](https:\u002F\u002Finternvl.readthedocs.io\u002Fen\u002Flatest\u002Ftutorials\u002Ffaqs.html)\n[\\[🗨️ Chat Demo\\]](https:\u002F\u002Fchat.intern-ai.org.cn\u002F)\n[\\[📖 
Document\\]](https:\u002F\u002Finternvl.readthedocs.io\u002Fen\u002Flatest\u002F)\n[\\[🌐 API\\]](https:\u002F\u002Finternlm.intern-ai.org.cn\u002Fapi\u002Fdocument)\n[\\[🚀 Quick Start\\]](#quick-start-with-huggingface)\n\n[\\[🔥 InternVL3.5 Report\\]](https:\u002F\u002Fhuggingface.co\u002Fpapers\u002F2508.18265)\n[\\[📜 InternVL3.0 Report\\]](https:\u002F\u002Fhuggingface.co\u002Fpapers\u002F2504.10479)\n[\\[📜 InternVL2.5 MPO\\]](https:\u002F\u002Fhuggingface.co\u002Fpapers\u002F2411.10442)\n[\\[📜 InternVL2.5 Report\\]](https:\u002F\u002Fhuggingface.co\u002Fpapers\u002F2412.05271)\n\n[\\[📜 Mini-InternVL Paper\\]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.16261)\n[\\[📜 InternVL2 Blog\\]](https:\u002F\u002Finternvl.github.io\u002Fblog\u002F2024-07-02-InternVL-2.0\u002F)\n[\\[📜 InternVL 1.5 Paper\\]](https:\u002F\u002Fhuggingface.co\u002Fpapers\u002F2404.16821)\n[\\[📜 InternVL 1.0 Paper\\]](https:\u002F\u002Fhuggingface.co\u002Fpapers\u002F2312.14238)\n\n[\\[📖 2.0 中文解读\\]](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F706547971)\n[\\[📖 1.5 中文解读\\]](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F699439759)\n[\\[📖 1.0 中文解读\\]](https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F702946079)\n\n[Switch to the Chinese version (切换至中文版)](\u002FREADME_zh.md)\n\n\u003Ca href=\"https:\u002F\u002Ftrendshift.io\u002Frepositories\u002F9803\" target=\"_blank\">\u003Cimg src=\"https:\u002F\u002Ftrendshift.io\u002Fapi\u002Fbadge\u002Frepositories\u002F9803\" alt=\"OpenGVLab%2FInternVL | Trendshift\" style=\"width: 250px; height: 55px;\" width=\"250\" height=\"55\"\u002F>\u003C\u002Fa>\n\u003Cimg height=\"55\" alt=\"image\" src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Fbd62ab46-f0ea-40c6-ab10-7fde671716cc\">\n\n![image\u002Fjpg](https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternVL3_5-241B-A28B\u002Fresolve\u002Fmain\u002Fimages\u002Fperformance.jpg)\n\n\u003C\u002Fdiv>\n\n## News 🚀🚀🚀\n\n- `2025\u002F08\u002F30`: 🔥 We open-source the training 
code of [InternVL3_5-GPT-OSS-20B-A4B](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FInternVL\u002Ftree\u002Fmain\u002Finternvl_chat_gpt_oss) and CascadeRL, which consists of an [offline RL stage](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FInternVL\u002Fblob\u002Fmain\u002Finternvl_chat_gpt_oss\u002Fshell\u002Finternvl3_5_gpt_oss\u002Finternvl3_5_gpt_oss_20b_stage3_mpo.sh) and an [online RL stage](https:\u002F\u002Fgithub.com\u002FWeiyun1025\u002Fverl-internvl). The training data for these two stages ([MMPR-v1.2](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FOpenGVLab\u002FMMPR-v1.2) and [MMPR-Tiny](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FOpenGVLab\u002FMMPR-Tiny)) are also open-sourced.\n- `2025\u002F08\u002F26`: 🚀 We introduce [InternVL3.5](https:\u002F\u002Fhuggingface.co\u002Fpapers\u002F2508.18265), a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency along the InternVL series. Our largest model, i.e., [InternVL3.5-241B-A28B](https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternVL3_5-241B-A28B), attains state-of-the-art results among open-source MLLMs across general multimodal, reasoning, text, and agentic tasks. We also provide a 20B-A4B version (i.e., [InternVL3_5-GPT-OSS-20B-A4B](https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternVL3_5-GPT-OSS-20B-A4B-Preview)), which is built upon GPT-OSS-20B-A4B. 
Notably, we provide two model formats: [the GitHub format](https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternVL3_5-GPT-OSS-20B-A4B-Preview#github-format), consistent with prior releases, and [the HF format](https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternVL3_5-GPT-OSS-20B-A4B-Preview#huggingface-format), aligned with the official `transformers` standard.\n- `2025\u002F04\u002F17`: We open-source the [data construction pipeline](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FInternVL\u002Ftree\u002Fmain\u002Finternvl_chat\u002Ftools\u002Freasoning_data_pipeline) and [training scripts](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FInternVL\u002Ftree\u002Fmain\u002Finternvl_chat\u002Fshell\u002Finternvl3.0\u002Fmpo) of [MPO](https:\u002F\u002Fhuggingface.co\u002Fpapers\u002F2411.10442) and [VisualPRM](https:\u002F\u002Fhuggingface.co\u002Fpapers\u002F2503.10291). Additionally, the data construction scripts for [MPO](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FInternVL\u002Ftree\u002Fmain\u002Finternvl_chat\u002Fshell\u002Finternvl3.0\u002Fmpo_data_construction) and [VisualPRM](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FInternVL\u002Ftree\u002Fmain\u002Finternvl_chat\u002Fshell\u002Finternvl3.0\u002Fvisualprm_data_construction) are also released for reference.\n- `2025\u002F04\u002F11`: We introduce [InternVL3](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002FOpenGVLab\u002Finternvl3-67f7f690be79c2fe9d74fe9d), an advanced multimodal large language model (MLLM) series that demonstrates superior overall performance. InternVL3-78B achieves SoTA performance in both [perception](https:\u002F\u002Frank.opencompass.org.cn\u002Fleaderboard-multimodal\u002F?m=REALTIME) and [reasoning performance](https:\u002F\u002Frank.opencompass.org.cn\u002Fleaderboard-multimodal-reasoning\u002F?m=REALTIME) among open-source MLLMs. 
The key designs of InternVL3-78B include [Variable Visual Position Encoding](https:\u002F\u002Fhuggingface.co\u002Fpapers\u002F2412.09616), [Native Multimodal Pre-Training](https:\u002F\u002Fhuggingface.co\u002Fpapers\u002F2504.10479), [Mixed Preference Optimization](https:\u002F\u002Fhuggingface.co\u002Fpapers\u002F2411.10442), and [Multimodal Test-Time Scaling](https:\u002F\u002Fhuggingface.co\u002Fpapers\u002F2503.10291).\n- `2025\u002F03\u002F13`: We introduce [VisualPRM](https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FVisualPRM-8B), an advanced multimodal Process Reward Model (PRM) with 8B parameters, which improves the overall reasoning performance of InternVL2.5-8B and InternVL2.5-78B by 8.4 and 5.9 points, respectively. The training data for this model, termed [VisualPRM400K](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FOpenGVLab\u002FVisualPRM400K), is also open-sourced. Please refer to our [paper](https:\u002F\u002Fhuggingface.co\u002Fpapers\u002F2503.10291) and [project page](https:\u002F\u002Finternvl.github.io\u002Fblog\u002F2025-03-13-VisualPRM\u002F) for more details.\n- `2024\u002F12\u002F20`: We release the [InternVL2.5-MPO](https:\u002F\u002Finternvl.github.io\u002Fblog\u002F2024-12-20-InternVL-2.5-MPO\u002F), which is finetuned with [Mixed Preference Optimization](https:\u002F\u002Fhuggingface.co\u002Fpapers\u002F2411.10442) on [MMPR-v1.1](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FOpenGVLab\u002FMMPR-v1.1). 
**The resulting models outperform their counterparts without MPO by an average of 2 points across all model scales on the OpenCompass leaderboard.** These models are available at [HF link](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002FOpenGVLab\u002Finternvl25-mpo-6753fed98cd828219b12f849).\n- `2024\u002F12\u002F17`: [InternVL2\u002F2.5](https:\u002F\u002Fgithub.com\u002FPaddlePaddle\u002FPaddleMIX\u002Ftree\u002Fdevelop\u002Fpaddlemix\u002Fexamples\u002Finternvl2) is supported in [PaddleMIX](https:\u002F\u002Fgithub.com\u002FPaddlePaddle\u002FPaddleMIX) by the Paddle Team.\n- `2024\u002F12\u002F05`: We release [InternVL2.5](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002FOpenGVLab\u002Finternvl-25-673e1019b66e2218f68d7c1c), an advanced multimodal large language model (MLLM) series with parameter coverage ranging from 1B to 78B. [InternVL2_5-78B](https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternVL2_5-78B) is the first open-source MLLM to achieve over **70%** on the **MMMU benchmark**, matching the performance of leading closed-source commercial models like GPT-4o. These models are available at [HF link](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002FOpenGVLab\u002Finternvl-25-673e1019b66e2218f68d7c1c).\n- `2024\u002F11\u002F14`: We introduce [MMPR](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FOpenGVLab\u002FMMPR), a high-quality, large-scale multimodal reasoning preference dataset, and [MPO](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FInternVL\u002Ftree\u002Fmain\u002Finternvl_chat\u002Fshell\u002Finternvl2.0_mpo), an effective preference optimization algorithm. The resulting model, [InternVL2-8B-MPO](https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternVL2-8B-MPO), achieves an accuracy of 67.0 on MathVista. 
Please refer to our [paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.10442), [project page](https:\u002F\u002Finternvl.github.io\u002Fblog\u002F2024-11-14-InternVL-2.0-MPO\u002F) and [document](https:\u002F\u002Finternvl.readthedocs.io\u002Fen\u002Flatest\u002Finternvl2.0\u002Fpreference_optimization.html) for more details.\n\n\u003Cdetails>\n\u003Csummary>More News\u003C\u002Fsummary>\n\n\n- `2024\u002F10\u002F21`: We release the Mini-InternVL series. These models achieve impressive performance with minimal size: the 4B model achieves 90% of the performance with just 5% of the model size. For more details, please check our [project page](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FInternVL\u002Ftree\u002Fmain\u002Finternvl_chat\u002Fshell\u002Fmini_internvl) and [document](https:\u002F\u002Finternvl.readthedocs.io\u002Fen\u002Flatest\u002Finternvl2.0\u002Fdomain_adaptation.html).\n- `2024\u002F08\u002F01`: The [Chartmimic](https:\u002F\u002Fchartmimic.github.io\u002F) team evaluated the InternVL2 series models on their benchmark. The InternVL2-26B and 76B models achieved the top two performances among open-source models, with the InternVL2 76B model surpassing GeminiProVision and exhibiting comparable results to Claude-3-opus.\n- `2024\u002F08\u002F01`: InternVL2-Pro achieved the SOTA performance among open-source models on the [CharXiv](https:\u002F\u002Fcharxiv.github.io\u002F#leaderboard) dataset, surpassing many closed-source models such as GPT-4V, Gemini 1.5 Flash, and Claude 3 Sonnet.\n- `2024\u002F07\u002F24`: The [MLVU](https:\u002F\u002Fgithub.com\u002FJUNJIE99\u002FMLVU) team evaluated InternVL-1.5 on their benchmark. The average performance on the multiple-choice task was 50.4%, while the performance on the generative tasks was 4.02. 
The performance on the multiple-choice task ranked #1 among all open-source MLLMs.\n- `2024\u002F07\u002F04`: We release the [InternVL2 series](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002FOpenGVLab\u002Finternvl-20-667d3961ab5eb12c7ed1463e). InternVL2-Pro achieved a 62.0% accuracy on the MMMU benchmark, matching the performance of leading closed-source commercial models like GPT-4o.\n- `2024\u002F07\u002F18`: InternVL2-40B achieved SOTA performance among open-source models on the [Video-MME](https:\u002F\u002Fgithub.com\u002FBradyFU\u002FVideo-MME) dataset, scoring 61.2 when inputting 16 frames and 64.4 when inputting 32 frames. It significantly outperforms other open-source models and is the closest open-source model to GPT-4o mini.\n- `2024\u002F07\u002F18`: InternVL2-Pro achieved the SOTA performance on the [DocVQA](https:\u002F\u002Frrc.cvc.uab.es\u002F?ch=17&com=evaluation&task=1) and [InfoVQA](https:\u002F\u002Frrc.cvc.uab.es\u002F?ch=17&com=evaluation&task=3) benchmarks.\n- `2024\u002F06\u002F19`: We propose Needle In A Multimodal Haystack ([MM-NIAH](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FMM-NIAH)), the first benchmark designed to systematically evaluate the capability of existing MLLMs to comprehend long multimodal documents.\n- `2024\u002F05\u002F30`: We release [ShareGPT-4o](https:\u002F\u002Fsharegpt4o.github.io\u002F), a large-scale dataset that we plan to open-source with 200K images, 10K videos, and 10K audios with detailed descriptions.\n- `2024\u002F05\u002F28`: Thanks to the [lmdeploy](https:\u002F\u002Fgithub.com\u002FInternLM\u002Flmdeploy) team for providing AWQ quantization support. 
The 4-bit model is available at [OpenGVLab\u002FInternVL-Chat-V1-5-AWQ](https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternVL-Chat-V1-5-AWQ).\n- `2024\u002F05\u002F13`: InternVL 1.0 can now be used as the [text encoder](https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternVL-14B-224px) for diffusion models to support multilingual generation natively in over 110 languages worldwide. See [MuLan](https:\u002F\u002Fgithub.com\u002Fmulanai\u002FMuLan) for more details.\n- `2024\u002F04\u002F18`: InternVL-Chat-V1-5 has been released at [HF link](https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternVL-Chat-V1-5), approaching the performance of GPT-4V and Gemini Pro on various benchmarks like MMMU, DocVQA, ChartQA, MathVista, etc.\n- `2024\u002F02\u002F27`: InternVL is accepted by CVPR 2024 (Oral)! 🎉\n- `2024\u002F02\u002F21`: [InternVL-Chat-V1-2-Plus](https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternVL-Chat-V1-2-Plus) achieved SOTA performance on MathVista (59.9), MMBench (83.8), and MMVP (58.7). See our [blog](https:\u002F\u002Finternvl.github.io\u002Fblog\u002F2024-02-21-InternVL-1.2\u002F) for more details.\n- `2024\u002F02\u002F12`: InternVL-Chat-V1-2 has been released. It achieves 51.6 on MMMU val and 82.3 on MMBench test. For more details, please refer to our [blog](https:\u002F\u002Finternvl.github.io\u002Fblog\u002F2024-02-21-InternVL-1.2\u002F) and [SFT data](.\u002Finternvl_chat#prepare-training-datasets). 
The model is now available on [HuggingFace](https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternVL-Chat-V1-2), and both training \u002F evaluation data and scripts are open-sourced.\n- `2024\u002F01\u002F24`: InternVL-Chat-V1-1 is released. It supports Chinese and has stronger OCR capability; see [here](https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternVL-Chat-V1-1).\n- `2024\u002F01\u002F16`: We release our [customized mmcv\u002Fmmsegmentation\u002Fmmdetection code](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FInternVL-MMDetSeg), integrated with DeepSpeed, which can be used for training large-scale detection and segmentation models.\n\n\u003C\u002Fdetails>\n\n## Documents\n\n### 🌟 **Get Started**\n\n- **Installation**: 🌱 [Installation Guide](https:\u002F\u002Finternvl.readthedocs.io\u002Fen\u002Flatest\u002Fget_started\u002Finstallation.html) | 📄 [requirements.txt](.\u002Frequirements.txt)\n- **Chat Data Format**: 📝 [Meta File](https:\u002F\u002Finternvl.readthedocs.io\u002Fen\u002Flatest\u002Fget_started\u002Fchat_data_format.html#meta-file) | ✏️ [Text](https:\u002F\u002Finternvl.readthedocs.io\u002Fen\u002Flatest\u002Fget_started\u002Fchat_data_format.html#pure-text-data) | 🖼️ [Single-Image](https:\u002F\u002Finternvl.readthedocs.io\u002Fen\u002Flatest\u002Fget_started\u002Fchat_data_format.html#single-image-data) | 🖼️🖼️ [Multi-Image](https:\u002F\u002Finternvl.readthedocs.io\u002Fen\u002Flatest\u002Fget_started\u002Fchat_data_format.html#multi-image-data) | 🎥 [Video](https:\u002F\u002Finternvl.readthedocs.io\u002Fen\u002Flatest\u002Fget_started\u002Fchat_data_format.html#video-data)\n- **Local Chat Demo**: 🤖 [Streamlit Demo](https:\u002F\u002Finternvl.readthedocs.io\u002Fen\u002Flatest\u002Fget_started\u002Flocal_chat_demo.html#streamlit-demo)\n- **InternVL-Chat API**: 🌐 [InternVL2.5 API](https:\u002F\u002Finternlm.intern-ai.org.cn\u002Fapi\u002Fdocument)\n- **Tutorials**: 🚀 [Enhancing InternVL2 on COCO Caption Using LoRA 
Fine-Tuning](https:\u002F\u002Finternvl.readthedocs.io\u002Fen\u002Flatest\u002Ftutorials\u002Fcoco_caption_finetune.html)\n\n### 🏆 **InternVL Family**\n\n- **InternVL 3.0**: 📖 [Intro](https:\u002F\u002Finternvl.readthedocs.io\u002Fen\u002Flatest\u002Finternvl3.0\u002Fintroduction.html) | ⚡ [Quick Start](https:\u002F\u002Finternvl.readthedocs.io\u002Fen\u002Flatest\u002Finternvl3.0\u002Fquick_start.html) | ✨ [Finetune](https:\u002F\u002Finternvl.readthedocs.io\u002Fen\u002Flatest\u002Finternvl3.0\u002Ffinetune.html) | 📊 [Evaluate](https:\u002F\u002Finternvl.readthedocs.io\u002Fen\u002Flatest\u002Finternvl3.0\u002Fevaluation.html) | 📦 [Deploy](https:\u002F\u002Finternvl.readthedocs.io\u002Fen\u002Flatest\u002Finternvl3.0\u002Fdeployment.html) | 🎯 [MPO](https:\u002F\u002Finternvl.readthedocs.io\u002Fen\u002Flatest\u002Finternvl3.0\u002Fpreference_optimization.html)\n- **InternVL 2.5**: 📖 [Intro](https:\u002F\u002Finternvl.readthedocs.io\u002Fen\u002Flatest\u002Finternvl2.5\u002Fintroduction.html) | ⚡ [Quick Start](https:\u002F\u002Finternvl.readthedocs.io\u002Fen\u002Flatest\u002Finternvl2.5\u002Fquick_start.html) | ✨ [Finetune](https:\u002F\u002Finternvl.readthedocs.io\u002Fen\u002Flatest\u002Finternvl2.5\u002Ffinetune.html) | 📊 [Evaluate](https:\u002F\u002Finternvl.readthedocs.io\u002Fen\u002Flatest\u002Finternvl2.5\u002Fevaluation.html) | 📦 [Deploy](https:\u002F\u002Finternvl.readthedocs.io\u002Fen\u002Flatest\u002Finternvl2.5\u002Fdeployment.html) | 🎯 [MPO](https:\u002F\u002Finternvl.readthedocs.io\u002Fen\u002Flatest\u002Finternvl2.5\u002Fpreference_optimization.html)\n- **InternVL 2.0**: 📖 [Intro](https:\u002F\u002Finternvl.readthedocs.io\u002Fen\u002Flatest\u002Finternvl2.0\u002Fintroduction.html) | ⚡ [Quick Start](https:\u002F\u002Finternvl.readthedocs.io\u002Fen\u002Flatest\u002Finternvl2.0\u002Fquick_start.html) | ✨ [Finetune](https:\u002F\u002Finternvl.readthedocs.io\u002Fen\u002Flatest\u002Finternvl2.0\u002Ffinetune.html) | 📊 
[Evaluate](https:\u002F\u002Finternvl.readthedocs.io\u002Fen\u002Flatest\u002Finternvl2.0\u002Fevaluation.html) | 📦 [Deploy](https:\u002F\u002Finternvl.readthedocs.io\u002Fen\u002Flatest\u002Finternvl2.0\u002Fdeployment.html) | 🎯 [MPO](https:\u002F\u002Finternvl.readthedocs.io\u002Fen\u002Flatest\u002Finternvl2.0\u002Fpreference_optimization.html)\n- **InternVL 1.5**: 📖 [Intro](https:\u002F\u002Finternvl.readthedocs.io\u002Fen\u002Flatest\u002Finternvl1.5\u002Fintroduction.html) | ⚡ [Quick Start](https:\u002F\u002Finternvl.readthedocs.io\u002Fen\u002Flatest\u002Finternvl1.5\u002Fquick_start.html) | ✨ [Finetune](https:\u002F\u002Finternvl.readthedocs.io\u002Fen\u002Flatest\u002Finternvl1.5\u002Ffinetune.html) | 📊 [Evaluate](https:\u002F\u002Finternvl.readthedocs.io\u002Fen\u002Flatest\u002Finternvl1.5\u002Fevaluation.html) | 📦 [Deploy](https:\u002F\u002Finternvl.readthedocs.io\u002Fen\u002Flatest\u002Finternvl1.5\u002Fdeployment.html)\n- **InternVL 1.2**: 📖 [Intro](https:\u002F\u002Finternvl.readthedocs.io\u002Fen\u002Flatest\u002Finternvl1.2\u002Fintroduction.html) | ⚡ [Quick Start](https:\u002F\u002Finternvl.readthedocs.io\u002Fen\u002Flatest\u002Finternvl1.2\u002Fquick_start.html) | ✨ [Finetune](https:\u002F\u002Finternvl.readthedocs.io\u002Fen\u002Flatest\u002Finternvl1.2\u002Ffinetune.html) | 📊 [Evaluate](https:\u002F\u002Finternvl.readthedocs.io\u002Fen\u002Flatest\u002Finternvl1.2\u002Fevaluation.html)\n- **InternVL 1.1**: 📖 [Intro](https:\u002F\u002Finternvl.readthedocs.io\u002Fen\u002Flatest\u002Finternvl1.1\u002Fintroduction.html) | ⚡ [Quick Start](https:\u002F\u002Finternvl.readthedocs.io\u002Fen\u002Flatest\u002Finternvl1.1\u002Fquick_start.html) | 📊 [Evaluation](https:\u002F\u002Finternvl.readthedocs.io\u002Fen\u002Flatest\u002Finternvl1.1\u002Fevaluation.html)\n- **InternVL 1.0**: 🖼️ [Classification](https:\u002F\u002Finternvl.readthedocs.io\u002Fen\u002Flatest\u002Finternvl1.0\u002Fclassification.html) | 📊 
[CLIP-Benchmark](https:\u002F\u002Finternvl.readthedocs.io\u002Fen\u002Flatest\u002Finternvl1.0\u002Fclip_benchmark.html) | 🎨 [Segmentation](https:\u002F\u002Finternvl.readthedocs.io\u002Fen\u002Flatest\u002Finternvl1.0\u002Fsegmentation.html) | 💬 [Chat-LLaVA](https:\u002F\u002Finternvl.readthedocs.io\u002Fen\u002Flatest\u002Finternvl1.0\u002Finternvl_chat_llava.html) | ✨ [InternVL-G](https:\u002F\u002Finternvl.readthedocs.io\u002Fen\u002Flatest\u002Finternvl1.0\u002Finternvl_g.html)\n\n## Model Zoo\n\n#### Multimodal Large Language Model (InternVL 3.5)\n\nTo maintain consistency with earlier generations, we provide two model formats: [the GitHub format](https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternVL3_5-241B-A28B), consistent with prior releases, and [the HF format](https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternVL3_5-241B-A28B-HF), aligned with the official Transformers standard.\n\n> If you want to convert the checkpoint between these two formats, please refer to the scripts about [custom2hf](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FInternVL\u002Fblob\u002Fmain\u002Finternvl_chat\u002Ftools\u002Finternvl_custom2hf.py) and [hf2custom](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FInternVL\u002Fblob\u002Fmain\u002Finternvl_chat\u002Ftools\u002Finternvl_hf2custom.py).\n\n**Github Format**\n| Model                 | #Vision Param | #Language Param | #Total Param | HF Link                                                                        | ModelScope Link                                                                          |\n| --------------------- | ------------- | --------------- | ------------ | ------------------------------------------------------------------------------ | ---------------------------------------------------------------------------------------- |\n| InternVL3.5-1B        | 0.3B          | 0.8B            | 1.1B         | [🤗 link](https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternVL3_5-1B)    
                  | [🤖 link](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002FOpenGVLab\u002FInternVL3_5-1B)                      |\n| InternVL3.5-2B        | 0.3B          | 2.0B            | 2.3B         | [🤗 link](https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternVL3_5-2B)                      | [🤖 link](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002FOpenGVLab\u002FInternVL3_5-2B)                      |\n| InternVL3.5-4B        | 0.3B          | 4.4B            | 4.7B         | [🤗 link](https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternVL3_5-4B)                      | [🤖 link](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002FOpenGVLab\u002FInternVL3_5-4B)                      |\n| InternVL3.5-8B        | 0.3B          | 8.2B            | 8.5B         | [🤗 link](https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternVL3_5-8B)                      | [🤖 link](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002FOpenGVLab\u002FInternVL3_5-8B)                      |\n| InternVL3.5-14B       | 0.3B          | 14.8B           | 15.1B        | [🤗 link](https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternVL3_5-14B)                     | [🤖 link](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002FOpenGVLab\u002FInternVL3_5-14B)                     |\n| InternVL3.5-38B       | 5.5B          | 32.8B           | 38.4B        | [🤗 link](https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternVL3_5-38B)                     | [🤖 link](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002FOpenGVLab\u002FInternVL3_5-38B)                     |\n| InternVL3.5-20B-A4B   | 0.3B          | 20.9B           | 21.2B-A4B    | [🤗 link](https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternVL3_5-GPT-OSS-20B-A4B-Preview) | [🤖 link](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002FOpenGVLab\u002FInternVL3_5-GPT-OSS-20B-A4B-Preview) |\n| InternVL3.5-30B-A3B   | 0.3B          | 30.5B           | 30.8B-A3B    | [🤗 
link](https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternVL3_5-30B-A3B)                 | [🤖 link](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002FOpenGVLab\u002FInternVL3_5-30B-A3B)                 |\n| InternVL3.5-241B-A28B | 5.5B          | 235.1B          | 240.7B-A28B  | [🤗 link](https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternVL3_5-241B-A28B)               | [🤖 link](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002FOpenGVLab\u002FInternVL3_5-241B-A28B)               |\n\n**HuggingFace Format**\n\n| Model                    | #Vision Param | #Language Param | #Total Param | HF Link                                                                           | ModelScope Link                                                                             |\n| ------------------------ | ------------- | --------------- | ------------ | --------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------- |\n| InternVL3.5-1B-HF        | 0.3B          | 0.8B            | 1.1B         | [🤗 link](https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternVL3_5-1B-HF)                      | [🤖 link](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002FOpenGVLab\u002FInternVL3_5-1B-HF)                      |\n| InternVL3.5-2B-HF        | 0.3B          | 2.0B            | 2.3B         | [🤗 link](https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternVL3_5-2B-HF)                      | [🤖 link](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002FOpenGVLab\u002FInternVL3_5-2B-HF)                      |\n| InternVL3.5-4B-HF        | 0.3B          | 4.4B            | 4.7B         | [🤗 link](https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternVL3_5-4B-HF)                      | [🤖 link](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002FOpenGVLab\u002FInternVL3_5-4B-HF)                      |\n| InternVL3.5-8B-HF        | 0.3B          
| 8.2B            | 8.5B         | [🤗 link](https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternVL3_5-8B-HF)                      | [🤖 link](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002FOpenGVLab\u002FInternVL3_5-8B-HF)                      |\n| InternVL3.5-14B-HF       | 0.3B          | 14.8B           | 15.1B        | [🤗 link](https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternVL3_5-14B-HF)                     | [🤖 link](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002FOpenGVLab\u002FInternVL3_5-14B-HF)                     |\n| InternVL3.5-38B-HF       | 5.5B          | 32.8B           | 38.4B        | [🤗 link](https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternVL3_5-38B-HF)                     | [🤖 link](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002FOpenGVLab\u002FInternVL3_5-38B-HF)                     |\n| InternVL3.5-20B-A4B-HF   | 0.3B          | 20.9B           | 21.2B-A4B    | [🤗 link](https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternVL3_5-GPT-OSS-20B-A4B-Preview-HF) | [🤖 link](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002FOpenGVLab\u002FInternVL3_5-GPT-OSS-20B-A4B-Preview-HF) |\n| InternVL3.5-30B-A3B-HF   | 0.3B          | 30.5B           | 30.8B-A3B    | [🤗 link](https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternVL3_5-30B-A3B-HF)                 | [🤖 link](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002FOpenGVLab\u002FInternVL3_5-30B-A3B-HF)                 |\n| InternVL3.5-241B-A28B-HF | 5.5B          | 235.1B          | 240.7B-A28B  | [🤗 link](https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternVL3_5-241B-A28B-HF)               | [🤖 link](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002FOpenGVLab\u002FInternVL3_5-241B-A28B-HF)               |\n\n\n#### Multimodal Large Language Model (InternVL 3.0)\n\u003Ctable>\n  \u003Ctr>\n    \u003Cth>Model Name\u003C\u002Fth>\n    \u003Cth>Vision Part\u003C\u002Fth>\n    \u003Cth>Language Part\u003C\u002Fth>\n    
\u003Cth>HF&nbsp;Link\u003C\u002Fth>\n    \u003Cth>MS&nbsp;Link\u003C\u002Fth>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd>InternVL3-1B\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternViT-300M-448px-V2_5\">InternViT&#8209;300M&#8209;448px&#8209;V2_5\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FQwen\u002FQwen2.5-0.5B\">Qwen2.5&#8209;0.5B\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternVL3-1B\">🤗 link\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FOpenGVLab\u002FInternVL3-1B\">🤖 link\u003C\u002Fa>\u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd>InternVL3-2B\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternViT-300M-448px-V2_5\">InternViT-300M-448px-V2_5\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FQwen\u002FQwen2.5-1.5B\">Qwen2.5-1.5B\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternVL3-2B\">🤗 link\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FOpenGVLab\u002FInternVL3-2B\">🤖 link\u003C\u002Fa>\u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd>InternVL3-8B\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternViT-300M-448px-V2_5\">InternViT-300M-448px-V2_5\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FQwen\u002FQwen2.5-7B\">Qwen2.5-7B\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternVL3-8B\">🤗 link\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca 
href=\"https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FOpenGVLab\u002FInternVL3-8B\">🤖 link\u003C\u002Fa>\u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd>InternVL3-9B\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternViT-300M-448px-V2_5\">InternViT-300M-448px-V2_5\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Finternlm\u002Finternlm3-8b-instruct\">internlm3-8b-instruct\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternVL3-9B\">🤗 link\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FOpenGVLab\u002FInternVL3-9B\">🤖 link\u003C\u002Fa>\u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd>InternVL3-14B\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternViT-300M-448px-V2_5\">InternViT-300M-448px-V2_5\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FQwen\u002FQwen2.5-14B\">Qwen2.5-14B\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternVL3-14B\">🤗 link\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FOpenGVLab\u002FInternVL3-14B\">🤖 link\u003C\u002Fa>\u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd>InternVL3-38B\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternViT-6B-448px-V2_5\">InternViT-6B-448px-V2_5\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FQwen\u002FQwen2.5-32B\">Qwen2.5-32B\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternVL3-38B\">🤗 link\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca 
href=\"https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FOpenGVLab\u002FInternVL3-38B\">🤖 link\u003C\u002Fa>\u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd>InternVL3-78B\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternViT-6B-448px-V2_5\">InternViT-6B-448px-V2_5\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FQwen\u002FQwen2.5-72B\">Qwen2.5-72B\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternVL3-78B\">🤗 link\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FOpenGVLab\u002FInternVL3-78B\">🤖 link\u003C\u002Fa>\u003C\u002Ftd>\n  \u003C\u002Ftr>\n\u003C\u002Ftable>\n\n#### Multimodal Large Language Model (InternVL 2.5)\n\n\u003Ctable>\n  \u003Ctr>\n    \u003Cth>Model Name\u003C\u002Fth>\n    \u003Cth>Vision Part\u003C\u002Fth>\n    \u003Cth>Language Part\u003C\u002Fth>\n    \u003Cth>HF&nbsp;Link\u003C\u002Fth>\n    \u003Cth>MS&nbsp;Link\u003C\u002Fth>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd>InternVL2_5-1B\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternViT-300M-448px-V2_5\">InternViT&#8209;300M&#8209;448px&#8209;V2_5\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FQwen\u002FQwen2.5-0.5B-Instruct\">Qwen2.5&#8209;0.5B&#8209;Instruct\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternVL2_5-1B\">🤗 link\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FOpenGVLab\u002FInternVL2_5-1B\">🤖 link\u003C\u002Fa>\u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd>InternVL2_5-2B\u003C\u002Ftd>\n    \u003Ctd>\u003Ca 
href=\"https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternViT-300M-448px-V2_5\">InternViT-300M-448px-V2_5\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Finternlm\u002Finternlm2_5-1_8b-chat\">internlm2_5-1_8b-chat\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternVL2_5-2B\">🤗 link\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FOpenGVLab\u002FInternVL2_5-2B\">🤖 link\u003C\u002Fa>\u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd>InternVL2_5-4B\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternViT-300M-448px-V2_5\">InternViT-300M-448px-V2_5\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FQwen\u002FQwen2.5-3B-Instruct\">Qwen2.5-3B-Instruct\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternVL2_5-4B\">🤗 link\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FOpenGVLab\u002FInternVL2_5-4B\">🤖 link\u003C\u002Fa>\u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd>InternVL2_5-8B\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternViT-300M-448px-V2_5\">InternViT-300M-448px-V2_5\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Finternlm\u002Finternlm2_5-7b-chat\">internlm2_5-7b-chat\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternVL2_5-8B\">🤗 link\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FOpenGVLab\u002FInternVL2_5-8B\">🤖 link\u003C\u002Fa>\u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    
\u003Ctd>InternVL2_5-26B\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternViT-6B-448px-V2_5\">InternViT-6B-448px-V2_5\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Finternlm\u002Finternlm2_5-20b-chat\">internlm2_5-20b-chat\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternVL2_5-26B\">🤗 link\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FOpenGVLab\u002FInternVL2_5-26B\">🤖 link\u003C\u002Fa>\u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd>InternVL2_5-38B\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternViT-6B-448px-V2_5\">InternViT-6B-448px-V2_5\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FQwen\u002FQwen2.5-32B-Instruct\">Qwen2.5-32B-Instruct\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternVL2_5-38B\">🤗 link\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FOpenGVLab\u002FInternVL2_5-38B\">🤖 link\u003C\u002Fa>\u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd>InternVL2_5-78B\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternViT-6B-448px-V2_5\">InternViT-6B-448px-V2_5\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FQwen\u002FQwen2.5-72B-Instruct\">Qwen2.5-72B-Instruct\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternVL2_5-78B\">🤗 link\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FOpenGVLab\u002FInternVL2_5-78B\">🤖 link\u003C\u002Fa>\u003C\u002Ftd>\n  
\u003C\u002Ftr>\n\u003C\u002Ftable>\n\n\u003Ctable>\n  \u003Ctr>\n    \u003Cth>Model Name\u003C\u002Fth>\n    \u003Cth>Vision Part\u003C\u002Fth>\n    \u003Cth>Language Part\u003C\u002Fth>\n    \u003Cth>HF&nbsp;Link\u003C\u002Fth>\n    \u003Cth>MS&nbsp;Link\u003C\u002Fth>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd>InternVL2_5-1B-MPO\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternViT-300M-448px-V2_5\">InternViT&#8209;300M&#8209;448px&#8209;V2_5\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FQwen\u002FQwen2.5-0.5B-Instruct\">Qwen2.5&#8209;0.5B&#8209;Instruct\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternVL2_5-1B-MPO\">🤗 link\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FOpenGVLab\u002FInternVL2_5-1B-MPO\">🤖 link\u003C\u002Fa>\u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd>InternVL2_5-2B-MPO\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternViT-300M-448px-V2_5\">InternViT-300M-448px-V2_5\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Finternlm\u002Finternlm2_5-1_8b-chat\">internlm2_5-1_8b-chat\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternVL2_5-2B-MPO\">🤗 link\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FOpenGVLab\u002FInternVL2_5-2B-MPO\">🤖 link\u003C\u002Fa>\u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd>InternVL2_5-4B-MPO\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternViT-300M-448px-V2_5\">InternViT-300M-448px-V2_5\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca 
href=\"https:\u002F\u002Fhuggingface.co\u002FQwen\u002FQwen2.5-3B-Instruct\">Qwen2.5-3B-Instruct\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternVL2_5-4B-MPO\">🤗 link\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FOpenGVLab\u002FInternVL2_5-4B-MPO\">🤖 link\u003C\u002Fa>\u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd>InternVL2_5-8B-MPO\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternViT-300M-448px-V2_5\">InternViT-300M-448px-V2_5\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Finternlm\u002Finternlm2_5-7b-chat\">internlm2_5-7b-chat\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternVL2_5-8B-MPO\">🤗 link\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FOpenGVLab\u002FInternVL2_5-8B-MPO\">🤖 link\u003C\u002Fa>\u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd>InternVL2_5-26B-MPO\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternViT-6B-448px-V2_5\">InternViT-6B-448px-V2_5\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Finternlm\u002Finternlm2_5-20b-chat\">internlm2_5-20b-chat\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternVL2_5-26B-MPO\">🤗 link\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FOpenGVLab\u002FInternVL2_5-26B-MPO\">🤖 link\u003C\u002Fa>\u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd>InternVL2_5-38B-MPO\u003C\u002Ftd>\n    \u003Ctd>\u003Ca 
href=\"https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternViT-6B-448px-V2_5\">InternViT-6B-448px-V2_5\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FQwen\u002FQwen2.5-32B-Instruct\">Qwen2.5-32B-Instruct\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternVL2_5-38B-MPO\">🤗 link\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FOpenGVLab\u002FInternVL2_5-38B-MPO\">🤖 link\u003C\u002Fa>\u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd>InternVL2_5-78B-MPO\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternViT-6B-448px-V2_5\">InternViT-6B-448px-V2_5\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FQwen\u002FQwen2.5-72B-Instruct\">Qwen2.5-72B-Instruct\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternVL2_5-78B-MPO\">🤗 link\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FOpenGVLab\u002FInternVL2_5-78B-MPO\">🤖 link\u003C\u002Fa>\u003C\u002Ftd>\n  \u003C\u002Ftr>\n\u003C\u002Ftable>\n\n#### Multimodal Large Language Model (InternVL 2.0)\n\n\u003Ctable>\n  \u003Ctr>\n    \u003Cth>Model Name\u003C\u002Fth>\n    \u003Cth>Vision Part\u003C\u002Fth>\n    \u003Cth>Language Part\u003C\u002Fth>\n    \u003Cth>HF&nbsp;Link\u003C\u002Fth>\n    \u003Cth>MS&nbsp;Link\u003C\u002Fth>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd>InternVL2-1B\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternViT-300M-448px\">InternViT-300M-448px\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FQwen\u002FQwen2-0.5B-Instruct\">Qwen2-0.5B-Instruct\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca 
href=\"https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternVL2-1B\">🤗 link\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FOpenGVLab\u002FInternVL2-1B\">🤖 link\u003C\u002Fa>\u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd>InternVL2-2B\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternViT-300M-448px\">InternViT-300M-448px\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Finternlm\u002Finternlm2-chat-1_8b\">internlm2-chat-1_8b\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternVL2-2B\">🤗 link\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FOpenGVLab\u002FInternVL2-2B\">🤖 link\u003C\u002Fa>\u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd>InternVL2-4B\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternViT-300M-448px\">InternViT-300M-448px\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fmicrosoft\u002FPhi-3-mini-128k-instruct\">Phi&#8209;3&#8209;mini&#8209;128k&#8209;instruct\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternVL2-4B\">🤗 link\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FOpenGVLab\u002FInternVL2-4B\">🤖 link\u003C\u002Fa>\u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd>InternVL2-8B\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternViT-300M-448px\">InternViT-300M-448px\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Finternlm\u002Finternlm2_5-7b-chat\">internlm2_5-7b-chat\u003C\u002Fa>\u003C\u002Ftd>\n    
\u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternVL2-8B\">🤗 link\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FOpenGVLab\u002FInternVL2-8B\">🤖 link\u003C\u002Fa>\u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd>InternVL2-26B\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternViT-6B-448px-V1-5\">InternViT-6B-448px-V1-5\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Finternlm\u002Finternlm2-chat-20b\">internlm2-chat-20b\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternVL2-26B\">🤗 link\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FOpenGVLab\u002FInternVL2-26B\">🤖 link\u003C\u002Fa>\u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd>InternVL2-40B\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternViT-6B-448px-V1-5\">InternViT&#8209;6B&#8209;448px&#8209;V1&#8209;5\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FNousResearch\u002FNous-Hermes-2-Yi-34B\">Nous&#8209;Hermes&#8209;2&#8209;Yi&#8209;34B\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternVL2-40B\">🤗 link\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FOpenGVLab\u002FInternVL2-40B\">🤖 link\u003C\u002Fa>\u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd>InternVL2&#8209;Llama3-76B\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternViT-6B-448px-V1-5\">InternViT-6B-448px-V1-5\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca 
href=\"https:\u002F\u002Fhuggingface.co\u002FNousResearch\u002FHermes-2-Theta-Llama-3-70B\">Hermes‑2‑Theta‑\u003Cbr>Llama‑3‑70B\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternVL2-Llama3-76B\">🤗 link\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FOpenGVLab\u002FInternVL2-Llama3-76B\">🤖 link\u003C\u002Fa>\u003C\u002Ftd>\n  \u003C\u002Ftr>\n\u003C\u002Ftable>\n\n#### Multimodal Large Language Model (InternVL 1.0-1.5)\n\n\u003Ctable>\n  \u003Ctr>\n    \u003Cth>Model\u003C\u002Fth>\n    \u003Cth>Date\u003C\u002Fth>\n    \u003Cth>HF&nbsp;Link\u003C\u002Fth>\n    \u003Cth>MS&nbsp;Link\u003C\u002Fth>\n    \u003Cth>Note\u003C\u002Fth>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd>Mini&#8209;InternVL&#8209;Chat&#8209;4B&#8209;V1&#8209;5\u003C\u002Ftd>\n    \u003Ctd>2024.05.28\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FMini-InternVL-Chat-4B-V1-5\">🤗 link\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FOpenGVLab\u002FMini-InternVL-Chat-4B-V1-5\">🤖 link\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>🚀🚀 16% of the model size, 90% of the performance\u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd>Mini-InternVL-Chat-2B-V1-5\u003C\u002Ftd>\n    \u003Ctd>2024.05.19\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FMini-InternVL-Chat-2B-V1-5\">🤗 link\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FOpenGVLab\u002FMini-InternVL-Chat-2B-V1-5\">🤖 link\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>🚀 8% of the model size, 80% of the performance\u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd>InternVL-Chat-V1-5\u003C\u002Ftd>\n    \u003Ctd>2024.04.18\u003C\u002Ftd>\n    \u003Ctd>\u003Ca 
href=\"https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternVL-Chat-V1-5\">🤗 link\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FOpenGVLab\u002FInternVL-Chat-V1-5\">🤖 link\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>supports 4K images; very strong OCR; approaches the performance of GPT-4V and Gemini Pro on benchmarks such as MMMU, DocVQA, ChartQA, and MathVista\u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd>InternVL-Chat-V1-2-Plus\u003C\u002Ftd>\n    \u003Ctd>2024.02.21\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternVL-Chat-V1-2-Plus\">🤗 link\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FOpenGVLab\u002FInternVL-Chat-V1-2-Plus\">🤖 link\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>more SFT data and stronger performance\u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd>InternVL-Chat-V1-2\u003C\u002Ftd>\n    \u003Ctd>2024.02.11\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternVL-Chat-V1-2\">🤗 link\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FOpenGVLab\u002FInternVL-Chat-V1-2\">🤖 link\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>scales the LLM up to 34B\u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd>InternVL-Chat-V1-1\u003C\u002Ftd>\n    \u003Ctd>2024.01.24\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternVL-Chat-V1-1\">🤗 link\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FOpenGVLab\u002FInternVL-Chat-V1-1\">🤖 link\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>supports Chinese and has stronger OCR\u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd>InternVL-Chat-19B\u003C\u002Ftd>\n    \u003Ctd>2023.12.25\u003C\u002Ftd>\n    
\u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternVL-Chat-ViT-6B-Vicuna-13B\">🤗 link\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FOpenGVLab\u002FInternVL-Chat-ViT-6B-Vicuna-13B\">🤖 link\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>English multimodal dialogue\u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd>InternVL-Chat-13B\u003C\u002Ftd>\n    \u003Ctd>2023.12.25\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternVL-Chat-ViT-6B-Vicuna-7B\">🤗 link\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FOpenGVLab\u002FInternVL-Chat-ViT-6B-Vicuna-7B\">🤖 link\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>English multimodal dialogue\u003C\u002Ftd>\n  \u003C\u002Ftr>\n\u003C\u002Ftable>\n\n#### CLIP-like Model (InternVL 1.0-2.5)\n\n\u003Ctable>\n  \u003Ctr>\n    \u003Cth>Model\u003C\u002Fth>\n    \u003Cth>Date\u003C\u002Fth>\n    \u003Cth>HF&nbsp;Link\u003C\u002Fth>\n    \u003Cth>MS&nbsp;Link\u003C\u002Fth>\n    \u003Cth>Note\u003C\u002Fth>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd>InternViT-300M-448px-V2_5\u003C\u002Ftd>\n    \u003Ctd>2024.12.05\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternViT-300M-448px-V2_5\">🤗 link\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FOpenGVLab\u002FInternViT-300M-448px-V2_5\">🤖 link\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>🚀🚀 A more powerful lightweight visual encoder. 
(🔥new)\u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd>InternViT-6B-448px-V2_5\u003C\u002Ftd>\n    \u003Ctd>2024.12.05\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternViT-6B-448px-V2_5\">🤗 link\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FOpenGVLab\u002FInternViT-6B-448px-V2_5\">🤖 link\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>🚀🚀 A stronger visual encoder to extract visual features. (🔥new)\u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd>InternViT-300M-448px\u003C\u002Ftd>\n    \u003Ctd>2024.05.25\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternViT-300M-448px\">🤗 link\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FOpenGVLab\u002FInternViT-300M-448px\">🤖 link\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>distilled small vision foundation model with 300M parameters \u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd>InternViT&#8209;6B&#8209;448px&#8209;V1&#8209;5\u003C\u002Ftd>\n    \u003Ctd>2024.04.20\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternViT-6B-448px-V1-5\">🤗 link\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FOpenGVLab\u002FInternViT-6B-448px-V1-5\">🤖 link\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>support dynamic resolution and super strong OCR feature extraction capability by incremental pre-training\u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd>InternViT-6B-448px-V1-2\u003C\u002Ftd>\n    \u003Ctd>2024.02.11\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternViT-6B-448px-V1-2\">🤗 link\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca 
href=\"https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FOpenGVLab\u002FInternViT-6B-448px-V1-2\">🤖 link\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>support 448 resolution by incremental pre-training\u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd>InternViT-6B-448px-V1-0\u003C\u002Ftd>\n    \u003Ctd>2024.01.30\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternViT-6B-448px-V1-0\">🤗 link\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FOpenGVLab\u002FInternViT-6B-448px-V1-0\">🤖 link\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>support 448 resolution by incremental pre-training\u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd>InternViT-6B-224px\u003C\u002Ftd>\n    \u003Ctd>2023.12.22\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternViT-6B-224px\">🤗 link\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FOpenGVLab\u002FInternViT-6B-224px\">🤖 link\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>the first version of InternViT-6B, extracted from InternVL‑14B‑224px\u003C\u002Ftd>\n  \u003C\u002Ftr>\n\u003C\u002Ftable>\n\n#### Vision-Language Foundation Model (InternVL 1.0)\n\n\u003Ctable>\n  \u003Ctr>\n    \u003Cth>Model\u003C\u002Fth>\n    \u003Cth>Date\u003C\u002Fth>\n    \u003Cth>HF&nbsp;Link\u003C\u002Fth>\n    \u003Cth>MS&nbsp;Link\u003C\u002Fth>\n    \u003Cth>Note\u003C\u002Fth>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd>InternVL&#8209;14B&#8209;224px\u003C\u002Ftd>\n    \u003Ctd>2023.12.22\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternVL-14B-224px\">🤗 link\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FOpenGVLab\u002FInternVL-14B-224px\">🤖 link\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>vision-language foundation 
model, InternViT-6B + QLLaMA, can be used for image-text retrieval like CLIP\u003C\u002Ftd>\n  \u003C\u002Ftr>\n\u003C\u002Ftable>\n\n## TODO List\n\n- [x] Release training \u002F evaluation code for InternVL2.5 series\n- [x] Support liger kernels to save GPU memory\n- [x] Release the code, model, and data of MPO\n- [x] Support multimodal packed dataset\n- [ ] Support vLLM and Ollama\n- [ ] Support video and PDF input in online demo\n- [ ] Release InternVL2 with VisionLLMv2 integration\n- [x] Rebuild documents using readthedocs\n- [x] Support fine-tuning different LLMs with LoRA\n- [x] Release `requirements.txt` for InternVL2\n- [x] Release training \u002F evaluation code for InternVL2 series\n- [x] Release Streamlit web UI for InternVL1.5 and InternVL2\n\n## What can InternVL do?\n\n\u003Cdetails>\n  \u003Csummary>Visual Perception (click to expand)\u003C\u002Fsummary>\n\n- Linear-Probe Image Classification [\\[see details\\]](.\u002Fclassification#-evaluation)\n\n  ViT-22B uses the private JFT-3B dataset.\n\n  | method              | #param | IN-1K | IN-ReaL | IN-V2 | IN-A  | IN-R  | IN-Sketch |\n  | ------------------- | :----: | :---: | :-----: | :---: | :---: | :---: | :-------: |\n  | OpenCLIP-G          |  1.8B  | 86.2  |  89.4   | 77.2  | 63.8  | 87.8  |   66.4    |\n  | DINOv2-g            |  1.1B  | 86.5  |  89.6   | 78.4  | 75.9  | 78.8  |   62.5    |\n  | EVA-01-CLIP-g       |  1.1B  | 86.5  |  89.3   | 77.4  | 70.5  | 87.7  |   63.1    |\n  | MAWS-ViT-6.5B       |  6.5B  | 87.8  |    -    |   -   |   -   |   -   |     -     |\n  | ViT-22B\\*           | 21.7B  | 89.5  |  90.9   | 83.2  | 83.8  | 87.4  |     -     |\n  | InternViT-6B (ours) |  5.9B  | 88.2  |  90.4   | 79.9  | 77.5  | 89.8  |   69.1    |\n\n- Semantic Segmentation [\\[see details\\]](.\u002Fsegmentation#-evaluation)\n\n  | method                | decoder | #param (train\u002Ftotal) | crop size | mIoU         |\n  | --------------------- | :-----: | :------------------: | :-------: | 
------------ |\n  | OpenCLIP-G (frozen)   | Linear  |     0.3M \u002F 1.8B      |    512    | 39.3         |\n  | ViT-22B (frozen)      | Linear  |     0.9M \u002F 21.7B     |    504    | 34.6         |\n  | InternViT-6B (frozen) | Linear  |     0.5M \u002F 5.9B      |    504    | 47.2 (+12.6) |\n  | ViT-22B (frozen)      | UperNet |     0.8B \u002F 22.5B     |    504    | 52.7         |\n  | InternViT-6B (frozen) | UperNet |     0.4B \u002F 6.3B      |    504    | 54.9 (+2.2)  |\n  | ViT-22B               | UperNet |    22.5B \u002F 22.5B     |    504    | 55.3         |\n  | InternViT-6B          | UperNet |     6.3B \u002F 6.3B      |    504    | 58.9 (+3.6)  |\n\n- Zero-Shot Image Classification [\\[see details\\]](.\u002Fclip_benchmark#imagenet-variants-and-objectnet)\n\n  | method            | IN-1K | IN-A  | IN-R  | IN-V2 | IN-Sketch | ObjectNet |\n  | ----------------- | :---: | :---: | :---: | :---: | :-------: | :-------: |\n  | OpenCLIP-G        | 80.1  | 69.3  | 92.1  | 73.6  |   68.9    |   73.0    |\n  | EVA-02-CLIP-E+    | 82.0  | 82.1  | 94.5  | 75.7  |   71.6    |   79.6    |\n  | ViT-22B\\*         | 85.9  | 90.1  | 96.0  | 80.9  |     -     |   87.6    |\n  | InternVL-C (ours) | 83.2  | 83.8  | 95.5  | 77.3  |   73.9    |   80.6    |\n\n- Multilingual Zero-Shot Image Classification [\\[see details\\]](.\u002Fclip_benchmark#multilingual-imagenet-1k)\n\n  EN: English, ZH: Chinese, JP: Japanese, AR: Arabic, IT: Italian\n\n  | method            | IN-1K (EN) | IN-1K (ZH) | IN-1K (JP) | IN-1K (AR) | IN-1K (IT) |\n  | ----------------- | :--------: | :--------: | :--------: | :--------: | :--------: |\n  | Taiyi-CLIP-ViT-H  |     -      |    54.4    |     -      |     -      |     -      |\n  | WuKong-ViT-L-G    |     -      |    57.5    |     -      |     -      |     -      |\n  | CN-CLIP-ViT-H     |     -      |    59.6    |     -      |     -      |     -      |\n  | AltCLIP-ViT-L     |    74.5    |    59.6    |     -      |     -      |     -      
|\n  | EVA-02-CLIP-E+    |    82.0    |     -      |     -      |     -      |    41.2    |\n  | OpenCLIP-XLM-R-H  |    77.0    |    55.7    |    53.1    |    37.0    |    56.8    |\n  | InternVL-C (ours) |    83.2    |    64.5    |    61.5    |    44.9    |    65.7    |\n\n- Zero-Shot Video Classification\n\n  | method            | #frame | K400  | K600  | K700  |\n  | ----------------- | :----: | :---: | :---: | :---: |\n  | OpenCLIP-G        |   1    | 65.9  | 66.1  | 59.2  |\n  | EVA-02-CLIP-E+    |   1    | 69.8  | 69.3  | 63.4  |\n  | InternVL-C (ours) |   1    | 71.0  | 71.3  | 65.7  |\n  | ViCLIP            |   8    | 75.7  | 73.5  | 66.4  |\n  | InternVL-C (ours) |   8    | 79.4  | 78.8  | 71.5  |\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>Cross-Modal Retrieval (click to expand)\u003C\u002Fsummary>\n\n- English Zero-Shot Image-Text Retrieval [\\[see details\\]](.\u002Fclip_benchmark#flickr30k--coco)\n\n  \u003Ctable>\n    \u003Ctr align=center>\n        \u003Ctd rowspan=\"3\" align=left>\u003Cb>model\u003C\u002Fb>\u003C\u002Ftd>\n        \u003Ctd colspan=\"6\" align=center>\u003Cb>Flickr30K\u003C\u002Fb>\u003C\u002Ftd>\n        \u003Ctd colspan=\"6\" align=center>\u003Cb>COCO\u003C\u002Fb>\u003C\u002Ftd>\n        \u003Ctd rowspan=\"3\" align=center>\u003Cb>avg\u003C\u002Fb>\u003C\u002Ftd>\n    \u003C\u002Ftr>\n     \u003Ctr align=center>\n        \u003Ctd colspan=\"3\" align=center>\u003Cb>image-to-text\u003C\u002Fb>\u003C\u002Ftd>\n        \u003Ctd colspan=\"3\" align=center>\u003Cb>text-to-image\u003C\u002Fb>\u003C\u002Ftd>\n         \u003Ctd colspan=\"3\" align=center>\u003Cb>image-to-text\u003C\u002Fb>\u003C\u002Ftd>\n        \u003Ctd colspan=\"3\" align=center>\u003Cb>text-to-image\u003C\u002Fb>\u003C\u002Ftd>\n     \u003C\u002Ftr>\n     \u003Ctr>\n        \u003Ctd>R@1\u003C\u002Ftd>\n        \u003Ctd>R@5\u003C\u002Ftd>\n        \u003Ctd>R@10\u003C\u002Ftd>\n        \u003Ctd>R@1\u003C\u002Ftd>\n        \u003Ctd>R@5\u003C\u002Ftd>\n     
   \u003Ctd>R@10\u003C\u002Ftd>\n        \u003Ctd>R@1\u003C\u002Ftd>\n        \u003Ctd>R@5\u003C\u002Ftd>\n        \u003Ctd>R@10\u003C\u002Ftd>\n        \u003Ctd>R@1\u003C\u002Ftd>\n        \u003Ctd>R@5\u003C\u002Ftd>\n        \u003Ctd>R@10\u003C\u002Ftd>\n     \u003C\u002Ftr>\n  \u003Ctr align=center>\n        \u003Ctd align=left>OpenCLIP-G\u003C\u002Ftd>\n        \u003Ctd>92.9\u003C\u002Ftd>\n        \u003Ctd>99.3\u003C\u002Ftd>\n        \u003Ctd>99.8\u003C\u002Ftd>\n        \u003Ctd>79.5\u003C\u002Ftd>\n        \u003Ctd>95.0\u003C\u002Ftd>\n        \u003Ctd>97.1\u003C\u002Ftd>\n        \u003Ctd>67.3\u003C\u002Ftd>\n        \u003Ctd>86.9\u003C\u002Ftd>\n        \u003Ctd>92.6\u003C\u002Ftd>\n        \u003Ctd>51.4\u003C\u002Ftd>\n        \u003Ctd>74.9\u003C\u002Ftd>\n        \u003Ctd>83.0\u003C\u002Ftd>\n        \u003Ctd>85.0\u003C\u002Ftd>\n     \u003C\u002Ftr>\n  \u003Ctr align=center>\n        \u003Ctd align=left>EVA-02-CLIP-E+\u003C\u002Ftd>\n        \u003Ctd>93.9\u003C\u002Ftd>\n        \u003Ctd>99.4\u003C\u002Ftd>\n        \u003Ctd>99.8\u003C\u002Ftd>\n        \u003Ctd>78.8\u003C\u002Ftd>\n        \u003Ctd>94.2\u003C\u002Ftd>\n        \u003Ctd>96.8\u003C\u002Ftd>\n        \u003Ctd>68.8\u003C\u002Ftd>\n        \u003Ctd>87.8\u003C\u002Ftd>\n        \u003Ctd>92.8\u003C\u002Ftd>\n        \u003Ctd>51.1\u003C\u002Ftd>\n        \u003Ctd>75.0\u003C\u002Ftd>\n        \u003Ctd>82.7\u003C\u002Ftd>\n        \u003Ctd>85.1\u003C\u002Ftd>\n     \u003C\u002Ftr>\n    \u003Ctr align=center>\n        \u003Ctd align=left>EVA-CLIP-8B\u003C\u002Ftd>\n        \u003Ctd>95.6\u003C\u002Ftd>\n        \u003Ctd>99.6\u003C\u002Ftd>\n        \u003Ctd>99.9\u003C\u002Ftd>\n        \u003Ctd>80.8\u003C\u002Ftd>\n        \u003Ctd>95.5\u003C\u002Ftd>\n        \u003Ctd>97.6\u003C\u002Ftd>\n        \u003Ctd>70.3\u003C\u002Ftd>\n        \u003Ctd>89.3\u003C\u002Ftd>\n        \u003Ctd>93.9\u003C\u002Ftd>\n        \u003Ctd>53.0\u003C\u002Ftd>\n        \u003Ctd>76.0\u003C\u002Ftd>\n        
\u003Ctd>83.4\u003C\u002Ftd>\n        \u003Ctd>86.2\u003C\u002Ftd>\n     \u003C\u002Ftr>\n  \u003Ctr align=center>\n        \u003Ctd align=left>InternVL-C (ours)\u003C\u002Ftd>\n        \u003Ctd>94.7\u003C\u002Ftd>\n        \u003Ctd>99.6\u003C\u002Ftd>\n        \u003Ctd>99.9\u003C\u002Ftd>\n        \u003Ctd>81.7\u003C\u002Ftd>\n        \u003Ctd>96.0\u003C\u002Ftd>\n        \u003Ctd>98.2\u003C\u002Ftd>\n        \u003Ctd>70.6\u003C\u002Ftd>\n        \u003Ctd>89.0\u003C\u002Ftd>\n        \u003Ctd>93.5\u003C\u002Ftd>\n        \u003Ctd>54.1\u003C\u002Ftd>\n        \u003Ctd>77.3\u003C\u002Ftd>\n        \u003Ctd>84.6\u003C\u002Ftd>\n        \u003Ctd>86.6\u003C\u002Ftd>\n     \u003C\u002Ftr>\n  \u003Ctr align=center>\n        \u003Ctd align=left>InternVL-G (ours)\u003C\u002Ftd>\n        \u003Ctd>95.7\u003C\u002Ftd>\n        \u003Ctd>99.7\u003C\u002Ftd>\n        \u003Ctd>99.9\u003C\u002Ftd>\n        \u003Ctd>85.0\u003C\u002Ftd>\n        \u003Ctd>97.0\u003C\u002Ftd>\n        \u003Ctd>98.6\u003C\u002Ftd>\n        \u003Ctd>74.9\u003C\u002Ftd>\n        \u003Ctd>91.3\u003C\u002Ftd>\n        \u003Ctd>95.2\u003C\u002Ftd>\n        \u003Ctd>58.6\u003C\u002Ftd>\n        \u003Ctd>81.3\u003C\u002Ftd>\n        \u003Ctd>88.0\u003C\u002Ftd>\n        \u003Ctd>88.8\u003C\u002Ftd>\n     \u003C\u002Ftr>\n\n  \u003C\u002Ftable>\n\n- Chinese Zero-Shot Image-Text Retrieval [\\[see details\\]](.\u002Fclip_benchmark#flickr30k-cn--coco-cn)\n\n  \u003Ctable>\n    \u003Ctr  align=center>\n        \u003Ctd rowspan=\"3\" align=left>\u003Cb>model\u003C\u002Fb>\u003C\u002Ftd>\n        \u003Ctd colspan=\"6\" align=center>\u003Cb>Flickr30K-CN\u003C\u002Fb>\u003C\u002Ftd>\n        \u003Ctd colspan=\"6\" align=center>\u003Cb>COCO-CN\u003C\u002Fb>\u003C\u002Ftd>\n        \u003Ctd rowspan=\"3\" align=center>\u003Cb>avg\u003C\u002Fb>\u003C\u002Ftd>\n\n  \u003C\u002Ftr>\n     \u003Ctr  align=center>\n        \u003Ctd colspan=\"3\" align=center>\u003Cb>image-to-text\u003C\u002Fb>\u003C\u002Ftd>\n        \u003Ctd 
colspan=\"3\" align=center>\u003Cb>text-to-image\u003C\u002Fb>\u003C\u002Ftd>\n         \u003Ctd colspan=\"3\" align=center>\u003Cb>image-to-text\u003C\u002Fb>\u003C\u002Ftd>\n        \u003Ctd colspan=\"3\" align=center>\u003Cb>text-to-image\u003C\u002Fb>\u003C\u002Ftd>\n     \u003C\u002Ftr>\n     \u003Ctr>\n        \u003Ctd>R@1\u003C\u002Ftd>\n        \u003Ctd>R@5\u003C\u002Ftd>\n        \u003Ctd>R@10\u003C\u002Ftd>\n        \u003Ctd>R@1\u003C\u002Ftd>\n        \u003Ctd>R@5\u003C\u002Ftd>\n        \u003Ctd>R@10\u003C\u002Ftd>\n        \u003Ctd>R@1\u003C\u002Ftd>\n        \u003Ctd>R@5\u003C\u002Ftd>\n        \u003Ctd>R@10\u003C\u002Ftd>\n        \u003Ctd>R@1\u003C\u002Ftd>\n        \u003Ctd>R@5\u003C\u002Ftd>\n        \u003Ctd>R@10\u003C\u002Ftd>\n     \u003C\u002Ftr>\n\n  \u003Ctr align=center>\n        \u003Ctd align=left>CN-CLIP-ViT-H\u003C\u002Ftd>\n        \u003Ctd>81.6\u003C\u002Ftd>\n        \u003Ctd>97.5\u003C\u002Ftd>\n        \u003Ctd>98.8\u003C\u002Ftd>\n        \u003Ctd>71.2\u003C\u002Ftd>\n        \u003Ctd>91.4\u003C\u002Ftd>\n        \u003Ctd>95.5\u003C\u002Ftd>\n        \u003Ctd>63.0\u003C\u002Ftd>\n        \u003Ctd>86.6\u003C\u002Ftd>\n        \u003Ctd>92.9\u003C\u002Ftd>\n        \u003Ctd>69.2\u003C\u002Ftd>\n        \u003Ctd>89.9\u003C\u002Ftd>\n        \u003Ctd>96.1\u003C\u002Ftd>\n        \u003Ctd>86.1\u003C\u002Ftd>\n     \u003C\u002Ftr>\n\n  \u003Ctr align=center>\n        \u003Ctd align=left>OpenCLIP-XLM-R-H\u003C\u002Ftd>\n        \u003Ctd>86.1\u003C\u002Ftd>\n        \u003Ctd>97.5\u003C\u002Ftd>\n        \u003Ctd>99.2\u003C\u002Ftd>\n        \u003Ctd>71.0\u003C\u002Ftd>\n        \u003Ctd>90.5\u003C\u002Ftd>\n        \u003Ctd>94.9\u003C\u002Ftd>\n        \u003Ctd>70.0\u003C\u002Ftd>\n        \u003Ctd>91.5\u003C\u002Ftd>\n        \u003Ctd>97.0\u003C\u002Ftd>\n        \u003Ctd>66.1\u003C\u002Ftd>\n        \u003Ctd>90.8\u003C\u002Ftd>\n        \u003Ctd>96.0\u003C\u002Ftd>\n        \u003Ctd>87.6\u003C\u002Ftd>\n     \u003C\u002Ftr>\n\n  \u003Ctr 
align=center>\n        \u003Ctd align=left>InternVL-C (ours)\u003C\u002Ftd>\n        \u003Ctd>90.3\u003C\u002Ftd>\n        \u003Ctd>98.8\u003C\u002Ftd>\n        \u003Ctd>99.7\u003C\u002Ftd>\n        \u003Ctd>75.1\u003C\u002Ftd>\n        \u003Ctd>92.9\u003C\u002Ftd>\n        \u003Ctd>96.4\u003C\u002Ftd>\n        \u003Ctd>68.8\u003C\u002Ftd>\n        \u003Ctd>92.0\u003C\u002Ftd>\n        \u003Ctd>96.7\u003C\u002Ftd>\n        \u003Ctd>68.9\u003C\u002Ftd>\n        \u003Ctd>91.9\u003C\u002Ftd>\n        \u003Ctd>96.5\u003C\u002Ftd>\n        \u003Ctd>89.0\u003C\u002Ftd>\n     \u003C\u002Ftr>\n  \u003Ctr align=center>\n        \u003Ctd align=left>InternVL-G (ours)\u003C\u002Ftd>\n        \u003Ctd>92.9\u003C\u002Ftd>\n        \u003Ctd>99.4\u003C\u002Ftd>\n        \u003Ctd>99.8\u003C\u002Ftd>\n        \u003Ctd>77.7\u003C\u002Ftd>\n        \u003Ctd>94.8\u003C\u002Ftd>\n        \u003Ctd>97.3\u003C\u002Ftd>\n        \u003Ctd>71.4\u003C\u002Ftd>\n        \u003Ctd>93.9\u003C\u002Ftd>\n        \u003Ctd>97.7\u003C\u002Ftd>\n        \u003Ctd>73.8\u003C\u002Ftd>\n        \u003Ctd>94.4\u003C\u002Ftd>\n        \u003Ctd>98.1\u003C\u002Ftd>\n        \u003Ctd>90.9\u003C\u002Ftd>\n     \u003C\u002Ftr>\n\n  \u003C\u002Ftable>\n\n- Multilingual Zero-Shot Image-Text Retrieval on XTD [\\[see details\\]](.\u002Fclip_benchmark#xtd)\n\n  | method            |  EN   |  ES   |  FR   |  ZH   |  IT   |  KO   |  RU   |  JP   | average |\n  | ----------------- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :-----: |\n  | AltCLIP           | 95.4  | 94.1  | 92.9  | 95.1  | 94.2  | 94.4  | 91.8  | 91.7  |  93.7   |\n  | OpenCLIP-XLM-R-H  | 97.3  | 96.1  | 94.5  | 94.7  | 96.0  | 90.2  | 93.9  | 94.0  |  94.6   |\n  | InternVL-C (ours) | 97.3  | 95.7  | 95.1  | 95.6  | 96.0  | 92.2  | 93.3  | 95.5  |  95.1   |\n  | InternVL-G (ours) | 98.6  | 97.7  | 96.5  | 96.7  | 96.9  | 95.1  | 94.8  | 96.1  |  96.6   |\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>Multimodal 
Dialogue\u003C\u002Fsummary>\n\n\u003C\u002Fdetails>\n\n## Quick Start with HuggingFace\n\n\u003Cdetails>\n  \u003Csummary>using InternViT-6B for visual feature extraction (click to expand)\u003C\u002Fsummary>\n\n```python\nimport torch\nfrom PIL import Image\nfrom transformers import AutoModel, CLIPImageProcessor\n\nmodel = AutoModel.from_pretrained(\n    'OpenGVLab\u002FInternViT-6B-448px-V2_5',\n    torch_dtype=torch.bfloat16,\n    low_cpu_mem_usage=True,\n    trust_remote_code=True).cuda().eval()\n\nimage = Image.open('.\u002Fexamples\u002Fimage1.jpg').convert('RGB')\n\nimage_processor = CLIPImageProcessor.from_pretrained('OpenGVLab\u002FInternViT-6B-448px-V2_5')\n\npixel_values = image_processor(images=image, return_tensors='pt').pixel_values\npixel_values = pixel_values.to(torch.bfloat16).cuda()\n\noutputs = model(pixel_values)\n```\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>using InternVL-C(ontrastive) and InternVL-G(enerative) for cross-modal retrieval (click to expand)\u003C\u002Fsummary>\n\n```python\nimport torch\nfrom PIL import Image\nfrom transformers import AutoModel, CLIPImageProcessor\nfrom transformers import AutoTokenizer\n\n\nmodel = AutoModel.from_pretrained(\n    'OpenGVLab\u002FInternVL-14B-224px',\n    torch_dtype=torch.bfloat16,\n    low_cpu_mem_usage=True,\n    trust_remote_code=True).cuda().eval()\n\nimage_processor = CLIPImageProcessor.from_pretrained('OpenGVLab\u002FInternVL-14B-224px')\n\ntokenizer = AutoTokenizer.from_pretrained(\n    'OpenGVLab\u002FInternVL-14B-224px', use_fast=False, add_eos_token=True)\ntokenizer.pad_token_id = 0  # set pad_token_id to 0\n\nimages = [\n    Image.open('.\u002Fexamples\u002Fimage1.jpg').convert('RGB'),\n    Image.open('.\u002Fexamples\u002Fimage2.jpg').convert('RGB'),\n    Image.open('.\u002Fexamples\u002Fimage3.jpg').convert('RGB')\n]\nprefix = 'summarize:'\ntexts = [\n    prefix + 'a photo of a red panda',  # English\n    prefix + '一张熊猫的照片',  # Chinese\n    prefix + '二匹の猫の写真'  # 
Japanese\n]\n\npixel_values = image_processor(images=images, return_tensors='pt').pixel_values\npixel_values = pixel_values.to(torch.bfloat16).cuda()\ninput_ids = tokenizer(texts, return_tensors='pt', max_length=80,\n                      truncation=True, padding='max_length').input_ids.cuda()\n\n# InternVL-C\nlogits_per_image, logits_per_text = model(\n    image=pixel_values, text=input_ids, mode='InternVL-C')\nprobs = logits_per_image.softmax(dim=-1)\n# tensor([[9.9609e-01, 5.2185e-03, 6.0070e-08],\n#         [2.2949e-02, 9.7656e-01, 5.9903e-06],\n#         [3.2932e-06, 7.4863e-05, 1.0000e+00]], device='cuda:0',\n#        dtype=torch.bfloat16, grad_fn=\u003CSoftmaxBackward0>)\n\n# InternVL-G\nlogits_per_image, logits_per_text = model(\n    image=pixel_values, text=input_ids, mode='InternVL-G')\nprobs = logits_per_image.softmax(dim=-1)\n# tensor([[9.9609e-01, 3.1738e-03, 3.6322e-08],\n#         [8.6060e-03, 9.9219e-01, 2.8759e-06],\n#         [1.7583e-06, 3.1233e-05, 1.0000e+00]], device='cuda:0',\n#        dtype=torch.bfloat16, grad_fn=\u003CSoftmaxBackward0>)\n\n# please set add_eos_token to False for generation\ntokenizer.add_eos_token = False\nimage = Image.open('.\u002Fexamples\u002Fimage1.jpg').convert('RGB')\npixel_values = image_processor(images=image, return_tensors='pt').pixel_values\npixel_values = pixel_values.to(torch.bfloat16).cuda()\n\ntokenized = tokenizer(\"English caption:\", return_tensors='pt')\npred = model.generate(\n    pixel_values=pixel_values,\n    input_ids=tokenized.input_ids.cuda(),\n    attention_mask=tokenized.attention_mask.cuda(),\n    num_beams=5,\n    min_new_tokens=8,\n)\ncaption = tokenizer.decode(pred[0].cpu(), skip_special_tokens=True).strip()\n# English caption: a red panda sitting on top of a wooden platform\n```\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>using InternVL 2.5 for multimodal chat (click to expand)\u003C\u002Fsummary>\n\nHere, we take the smaller `OpenGVLab\u002FInternVL2_5-8B` as an 
example:\n\n```python\nimport numpy as np\nimport torch\nimport torchvision.transforms as T\nfrom decord import VideoReader, cpu\nfrom PIL import Image\nfrom torchvision.transforms.functional import InterpolationMode\nfrom transformers import AutoModel, AutoTokenizer\n\nIMAGENET_MEAN = (0.485, 0.456, 0.406)\nIMAGENET_STD = (0.229, 0.224, 0.225)\n\ndef build_transform(input_size):\n    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD\n    transform = T.Compose([\n        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),\n        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),\n        T.ToTensor(),\n        T.Normalize(mean=MEAN, std=STD)\n    ])\n    return transform\n\ndef find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):\n    best_ratio_diff = float('inf')\n    best_ratio = (1, 1)\n    area = width * height\n    for ratio in target_ratios:\n        target_aspect_ratio = ratio[0] \u002F ratio[1]\n        ratio_diff = abs(aspect_ratio - target_aspect_ratio)\n        if ratio_diff \u003C best_ratio_diff:\n            best_ratio_diff = ratio_diff\n            best_ratio = ratio\n        elif ratio_diff == best_ratio_diff:\n            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:\n                best_ratio = ratio\n    return best_ratio\n\ndef dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):\n    orig_width, orig_height = image.size\n    aspect_ratio = orig_width \u002F orig_height\n\n    # calculate the existing image aspect ratio\n    target_ratios = set(\n        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if\n        i * j \u003C= max_num and i * j >= min_num)\n    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])\n\n    # find the closest aspect ratio to the target\n    target_aspect_ratio = find_closest_aspect_ratio(\n        aspect_ratio, target_ratios, orig_width, 
orig_height, image_size)\n\n    # calculate the target width and height\n    target_width = image_size * target_aspect_ratio[0]\n    target_height = image_size * target_aspect_ratio[1]\n    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]\n\n    # resize the image\n    resized_img = image.resize((target_width, target_height))\n    processed_images = []\n    for i in range(blocks):\n        box = (\n            (i % (target_width \u002F\u002F image_size)) * image_size,\n            (i \u002F\u002F (target_width \u002F\u002F image_size)) * image_size,\n            ((i % (target_width \u002F\u002F image_size)) + 1) * image_size,\n            ((i \u002F\u002F (target_width \u002F\u002F image_size)) + 1) * image_size\n        )\n        # split the image\n        split_img = resized_img.crop(box)\n        processed_images.append(split_img)\n    assert len(processed_images) == blocks\n    if use_thumbnail and len(processed_images) != 1:\n        thumbnail_img = image.resize((image_size, image_size))\n        processed_images.append(thumbnail_img)\n    return processed_images\n\ndef load_image(image_file, input_size=448, max_num=12):\n    image = Image.open(image_file).convert('RGB')\n    transform = build_transform(input_size=input_size)\n    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)\n    pixel_values = [transform(image) for image in images]\n    pixel_values = torch.stack(pixel_values)\n    return pixel_values\n\n# If you have an 80G A100 GPU, you can put the entire model on a single GPU.\n# Otherwise, you need to load a model using multiple GPUs, please refer to the `Multiple GPUs` section.\npath = 'OpenGVLab\u002FInternVL2_5-8B'\nmodel = AutoModel.from_pretrained(\n    path,\n    torch_dtype=torch.bfloat16,\n    low_cpu_mem_usage=True,\n    trust_remote_code=True).eval().cuda()\ntokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)\n\n# set the max number of tiles in 
`max_num`\npixel_values = load_image('.\u002Fexamples\u002Fimage1.jpg', max_num=12).to(torch.bfloat16).cuda()\ngeneration_config = dict(max_new_tokens=1024, do_sample=False)\n\n# pure-text conversation (纯文本对话)\nquestion = 'Hello, who are you?'\nresponse, history = model.chat(tokenizer, None, question, generation_config, history=None, return_history=True)\nprint(f'User: {question}\\nAssistant: {response}')\n\nquestion = 'Can you tell me a story?'\nresponse, history = model.chat(tokenizer, None, question, generation_config, history=history, return_history=True)\nprint(f'User: {question}\\nAssistant: {response}')\n\n# single-image single-round conversation (单图单轮对话)\nquestion = '\u003Cimage>\\nPlease describe the image shortly.'\nresponse = model.chat(tokenizer, pixel_values, question, generation_config)\nprint(f'User: {question}\\nAssistant: {response}')\n\n# single-image multi-round conversation (单图多轮对话)\nquestion = '\u003Cimage>\\nPlease describe the image in detail.'\nresponse, history = model.chat(tokenizer, pixel_values, question, generation_config, history=None, return_history=True)\nprint(f'User: {question}\\nAssistant: {response}')\n\nquestion = 'Please write a poem according to the image.'\nresponse, history = model.chat(tokenizer, pixel_values, question, generation_config, history=history, return_history=True)\nprint(f'User: {question}\\nAssistant: {response}')\n\n# multi-image multi-round conversation, combined images (多图多轮对话，拼接图像)\npixel_values1 = load_image('.\u002Fexamples\u002Fimage1.jpg', max_num=12).to(torch.bfloat16).cuda()\npixel_values2 = load_image('.\u002Fexamples\u002Fimage2.jpg', max_num=12).to(torch.bfloat16).cuda()\npixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)\n\nquestion = '\u003Cimage>\\nDescribe the two images in detail.'\nresponse, history = model.chat(tokenizer, pixel_values, question, generation_config,\n                               history=None, return_history=True)\nprint(f'User: {question}\\nAssistant: 
{response}')\n\nquestion = 'What are the similarities and differences between these two images?'\nresponse, history = model.chat(tokenizer, pixel_values, question, generation_config,\n                               history=history, return_history=True)\nprint(f'User: {question}\\nAssistant: {response}')\n\n# multi-image multi-round conversation, separate images (多图多轮对话，独立图像)\npixel_values1 = load_image('.\u002Fexamples\u002Fimage1.jpg', max_num=12).to(torch.bfloat16).cuda()\npixel_values2 = load_image('.\u002Fexamples\u002Fimage2.jpg', max_num=12).to(torch.bfloat16).cuda()\npixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)\nnum_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]\n\nquestion = 'Image-1: \u003Cimage>\\nImage-2: \u003Cimage>\\nDescribe the two images in detail.'\nresponse, history = model.chat(tokenizer, pixel_values, question, generation_config,\n                               num_patches_list=num_patches_list,\n                               history=None, return_history=True)\nprint(f'User: {question}\\nAssistant: {response}')\n\nquestion = 'What are the similarities and differences between these two images?'\nresponse, history = model.chat(tokenizer, pixel_values, question, generation_config,\n                               num_patches_list=num_patches_list,\n                               history=history, return_history=True)\nprint(f'User: {question}\\nAssistant: {response}')\n\n# batch inference, single image per sample (单图批处理)\npixel_values1 = load_image('.\u002Fexamples\u002Fimage1.jpg', max_num=12).to(torch.bfloat16).cuda()\npixel_values2 = load_image('.\u002Fexamples\u002Fimage2.jpg', max_num=12).to(torch.bfloat16).cuda()\nnum_patches_list = [pixel_values1.size(0), pixel_values2.size(0)]\npixel_values = torch.cat((pixel_values1, pixel_values2), dim=0)\n\nquestions = ['\u003Cimage>\\nDescribe the image in detail.'] * len(num_patches_list)\nresponses = model.batch_chat(tokenizer, pixel_values,\n                          
   num_patches_list=num_patches_list,\n                             questions=questions,\n                             generation_config=generation_config)\nfor question, response in zip(questions, responses):\n    print(f'User: {question}\\nAssistant: {response}')\n\n# video multi-round conversation (视频多轮对话)\ndef get_index(bound, fps, max_frame, first_idx=0, num_segments=32):\n    if bound:\n        start, end = bound[0], bound[1]\n    else:\n        start, end = -100000, 100000\n    start_idx = max(first_idx, round(start * fps))\n    end_idx = min(round(end * fps), max_frame)\n    seg_size = float(end_idx - start_idx) \u002F num_segments\n    frame_indices = np.array([\n        int(start_idx + (seg_size \u002F 2) + np.round(seg_size * idx))\n        for idx in range(num_segments)\n    ])\n    return frame_indices\n\ndef load_video(video_path, bound=None, input_size=448, max_num=1, num_segments=32):\n    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)\n    max_frame = len(vr) - 1\n    fps = float(vr.get_avg_fps())\n\n    pixel_values_list, num_patches_list = [], []\n    transform = build_transform(input_size=input_size)\n    frame_indices = get_index(bound, fps, max_frame, first_idx=0, num_segments=num_segments)\n    for frame_index in frame_indices:\n        img = Image.fromarray(vr[frame_index].asnumpy()).convert('RGB')\n        img = dynamic_preprocess(img, image_size=input_size, use_thumbnail=True, max_num=max_num)\n        pixel_values = [transform(tile) for tile in img]\n        pixel_values = torch.stack(pixel_values)\n        num_patches_list.append(pixel_values.shape[0])\n        pixel_values_list.append(pixel_values)\n    pixel_values = torch.cat(pixel_values_list)\n    return pixel_values, num_patches_list\n\nvideo_path = '.\u002Fexamples\u002Fred-panda.mp4'\npixel_values, num_patches_list = load_video(video_path, num_segments=8, max_num=1)\npixel_values = pixel_values.to(torch.bfloat16).cuda()\nvideo_prefix = ''.join([f'Frame-{i+1}: 
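\u003Cimage>\\n' for i in range(len(num_patches_list))])\nquestion = video_prefix + 'What is the red panda doing?'\n# Frame-1: \u003Cimage>\\nFrame-2: \u003Cimage>\\n...\\nFrame-8: \u003Cimage>\\n{question}\nresponse, history = model.chat","InternVL is an open-source multimodal dialogue model that aims to offer an alternative with performance close to GPT-4o. Its core capabilities include image classification, image-text retrieval, semantic segmentation, and video classification. It is built on several vision-language models (such as VIT-22B and VIT-6B) and can handle complex multimodal tasks. Technically, InternVL uses large-scale pre-training and reinforcement learning methods to improve the model's reasoning ability and efficiency. It suits applications that need high-performance multimodal understanding and generation, such as intelligent customer service, content moderation, and automated image-and-text creation.","2026-06-11 03:35:21","high_star"]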