[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-72713":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":10,"languages":10,"totalLinesOfCode":10,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":16,"stars7d":17,"stars30d":18,"stars90d":15,"forks30d":15,"starsTrendScore":19,"compositeScore":20,"rankGlobal":10,"rankLanguage":10,"license":10,"archived":21,"fork":21,"defaultBranch":22,"hasWiki":21,"hasPages":21,"topics":23,"createdAt":10,"pushedAt":10,"updatedAt":37,"readmeContent":38,"aiSummary":39,"trendingCount":15,"starSnapshotCount":15,"syncStatus":40,"lastSyncTime":41,"discoverSource":42},72713,"Awesome-Multimodal-Large-Language-Models","BradyFU\u002FAwesome-Multimodal-Large-Language-Models","BradyFU",":sparkles::sparkles:Latest Advances on Multimodal Large Language Models","",null,17870,1126,290,47,0,8,25,96,24,44.16,false,"main",[24,25,26,27,28,29,30,31,32,33,34,35,36],"chain-of-thought","in-context-learning","instruction-following","instruction-tuning","large-language-models","large-vision-language-model","large-vision-language-models","multi-modality","multimodal-chain-of-thought","multimodal-in-context-learning","multimodal-instruction-tuning","multimodal-large-language-models","visual-instruction-tuning","2026-06-12 02:03:07","# Awesome-Multimodal-Large-Language-Models\n\n\u003Cp align=\"center\">\n    \u003Cimg src=\".\u002Fimages\u002Fmig_logo.png\" width=\"90%\" height=\"90%\">\n\u003C\u002Fp>\n\n## ✨ Highlights of NJU-MiG\n\n> 🔥🔥 **Surveys of MLLMs**  |  **[💬 WeChat (MLLM微信交流群)](.\u002Fimages\u002Fwechat-group.png)**\n\n- 🌟 **MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs**  \narXiv 2025, [Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2411.15296.pdf), [Project](https:\u002F\u002Fgithub.com\u002FBradyFU\u002FAwesome-Multimodal-Large-Language-Models\u002Ftree\u002FBenchmarks) \n\n- 🌟 **A Survey of 
Unified Multimodal Understanding and Generation: Advances and Challenges**  \narXiv 2025, [Paper](https:\u002F\u002Fwww.techrxiv.org\u002Fdoi\u002Fpdf\u002F10.36227\u002Ftechrxiv.176289261.16802577), [Project](https:\u002F\u002Fgithub.com\u002FBradyFU\u002FAwesome-Multimodal-Large-Language-Models\u002Ftree\u002FUnified) \n\n- **A Survey on Multimodal Large Language Models**  \nNSR 2024, [Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.13549.pdf), [Project](https:\u002F\u002Fgithub.com\u002FBradyFU\u002FAwesome-Multimodal-Large-Language-Models)\n\n\n---\n\n\n> 🔥🔥 **VITA Series Omni MLLMs** | **[💬 WeChat (VITA微信交流群)](https:\u002F\u002Fgithub.com\u002FVITA-MLLM\u002FVITA\u002Fblob\u002Fmain\u002Fasset\u002Fwechat-group.jpg)**\n\n- **VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction**  \nNeurIPS 2025 Highlight, [Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2501.01957.pdf), [Project](https:\u002F\u002Fgithub.com\u002FVITA-MLLM\u002FVITA)\n\n- **VITA: Towards Open-Source Interactive Omni Multimodal LLM**  \narXiv 2024, [Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.05211.pdf), [Project](https:\u002F\u002Fvita-home.github.io\u002F)\n\n- **VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model**  \nNeurIPS 2025, [Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.03739.pdf), [Project](https:\u002F\u002Fgithub.com\u002FVITA-MLLM\u002FVITA-Audio)\n\n\n---\n\n\n> 🔥🔥 **MME Series MLLM Benchmarks**\n\n- 🔥 **Video-MME-v2: Towards the Next Stage in Video Understanding Evaluation**\n\n\u003Cp align=\"center\">\n    \u003Cimg src=\".\u002Fimages\u002Fvideo-mme-v2-logo.png\" width=\"100%\" height=\"100%\">\n\u003C\u002Fp>\n\n\u003Cfont size=7>\u003Cdiv align='center' > [[🍎 Project Page](https:\u002F\u002Fvideo-mme-v2.netlify.app\u002F)] [[📖 Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2604.05015)] [[🤗 
Dataset](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FMME-Benchmarks\u002FVideo-MME-v2)] [[🏆 Leaderboard](https:\u002F\u002Fvideo-mme-v2.netlify.app\u002F#leaderboard)]  \u003C\u002Fdiv>\u003C\u002Ffont>\n\n- 🌟 **MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs**  \narXiv 2025, [Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2411.15296.pdf), [Project](https:\u002F\u002Fgithub.com\u002FBradyFU\u002FAwesome-Multimodal-Large-Language-Models\u002Ftree\u002FBenchmarks)\n\n- **MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models**  \nNeurIPS 2025 DB Highlight, [Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.13394.pdf), [Dataset](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Flmms-lab\u002FMME), [Eval Tool](https:\u002F\u002Fgithub.com\u002FBradyFU\u002FAwesome-Multimodal-Large-Language-Models\u002Fblob\u002FEvaluation\u002Ftools\u002Feval_tool.zip), [✒️ Citation](.\u002Fimages\u002Fbib_mme.txt)\n\n- **Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis**  \nCVPR 2025, [Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.21075.pdf), [Project](https:\u002F\u002Fvideo-mme.github.io\u002F), [Dataset](https:\u002F\u002Fgithub.com\u002FBradyFU\u002FVideo-MME?tab=readme-ov-file#-dataset)\n\n\n---\n\n\u003Cfont size=5>\u003Ccenter>\u003Cb> Table of Contents \u003C\u002Fb> \u003C\u002Fcenter>\u003C\u002Ffont>\n- [Awesome Papers](#awesome-papers)\n  - [Multimodal Instruction Tuning (& Latest Works)](#multimodal-instruction-tuning--latest-works)\n  - [Multimodal Hallucination](#multimodal-hallucination)\n  - [Multimodal In-Context Learning](#multimodal-in-context-learning)\n  - [Multimodal Chain-of-Thought](#multimodal-chain-of-thought)\n  - [LLM-Aided Visual Reasoning](#llm-aided-visual-reasoning)\n  - [Foundation Models](#foundation-models)\n  - [Evaluation](#evaluation)\n  - [Multimodal RLHF](#multimodal-rlhf)\n  - [Others](#others)\n- [Awesome 
Datasets](#awesome-datasets)\n  - [Datasets of Pre-Training for Alignment](#datasets-of-pre-training-for-alignment)\n  - [Datasets of Multimodal Instruction Tuning](#datasets-of-multimodal-instruction-tuning)\n  - [Datasets of In-Context Learning](#datasets-of-in-context-learning)\n  - [Datasets of Multimodal Chain-of-Thought](#datasets-of-multimodal-chain-of-thought)\n  - [Datasets of Multimodal RLHF](#datasets-of-multimodal-rlhf)\n  - [Benchmarks for Evaluation](#benchmarks-for-evaluation)\n  - [Others](#others-1)\n---\n\n# Awesome Papers\n\n## Multimodal Instruction Tuning (& Latest Works)\n|  Title  |   Venue  |   Date   |   Code   |   Demo   |\n|:--------|:--------:|:--------:|:--------:|:--------:|\n| [**DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence**](https:\u002F\u002Fhuggingface.co\u002Fdeepseek-ai\u002FDeepSeek-V4-Pro\u002Fblob\u002Fmain\u002FDeepSeek_V4.pdf) | DeepSeek | 2026-04-24 | [Huggingface](https:\u002F\u002Fhuggingface.co\u002Fdeepseek-ai\u002FDeepSeek-V4-Pro) | - |\n| [**Qwen3.6-27B: Flagship-Level Coding in a 27B Dense Model**](https:\u002F\u002Fqwen.ai\u002Fblog?id=qwen3.6-27b) | Blog | 2026-04-22 | [Huggingface](https:\u002F\u002Fhuggingface.co\u002FQwen\u002FQwen3.6-27B) | [Demo](https:\u002F\u002Fchat.qwen.ai\u002F) |\n| [**Xiaomi MiMo-V2.5**](https:\u002F\u002Fmimo.xiaomi.com\u002Fmimo-v2-5) | Blog | 2026-04-22 | - | [Demo](https:\u002F\u002Faistudio.xiaomimimo.com\u002F) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FMME-Benchmarks\u002FVideo-MME-v2.svg?style=social&label=Star) \u003Cbr> [**Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2604.05015) \u003Cbr> | arXiv | 2026-04-06 | [Github](https:\u002F\u002Fgithub.com\u002FMME-Benchmarks\u002FVideo-MME-v2) | [Demo](https:\u002F\u002Fvideo-mme-v2.netlify.app\u002F) |\n| [**Introducing Muse Spark: Scaling Towards Personal 
Superintelligence**](https:\u002F\u002Fai.meta.com\u002Fblog\u002Fintroducing-muse-spark-msl\u002F) | Blog | 2026-04-08 | - | [Demo](https:\u002F\u002Fmeta.ai\u002F) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FVITA-MLLM\u002FVITA-QinYu.svg?style=social&label=Star) \u003Cbr> [**VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing**](https:\u002F\u002Fgithub.com\u002FVITA-MLLM\u002FVITA-QinYu) \u003Cbr> | arXiv | 2026-04-03 | [Github](https:\u002F\u002Fgithub.com\u002FVITA-MLLM\u002FVITA-QinYu) | Local Demo |\n| [**Gemma 4: Byte for byte, the most capable open models**](https:\u002F\u002Fdeepmind.google\u002Fmodels\u002Fgemma\u002Fgemma-4\u002F) | Blog | 2026-04-02 | - | [Demo](https:\u002F\u002Faistudio.google.com\u002Fprompts\u002Fnew_chat?model=gemma-4-31b-it&utm_source=deepmind.google&utm_medium=referral&utm_campaign=gdm&utm_content=) |\n| [**Qwen3.6-Plus: Towards Real World Agents**](https:\u002F\u002Fqwen.ai\u002Fblog?id=qwen3.6) | Blog | 2026-04-02 | - | - |\n| [**Qwen3.5-Omni: Scaling Up, Toward Native Omni-Modal AGI**](https:\u002F\u002Fqwen.ai\u002Fblog?id=qwen3.5-omni) | Blog | 2026-03-30 | - | [Demo](https:\u002F\u002Fchat.qwen.ai\u002F?spm=a2ty_o06.30285417.0.0.6d26c921GDrWrb) |\n| [**Xiaomi MiMo-V2-Omni**](https:\u002F\u002Fmimo.xiaomi.com\u002Fmimo-v2-omni) | Blog | 2026-03-18 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenGVLab\u002FInternVL-U.svg?style=social&label=Star) \u003Cbr> [**InternVL-U: Democratizing Unified Multimodal Models for Understanding, Reasoning, Generation and Editing**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2603.09877) \u003Cbr> | arXiv | 2026-03-10 | [Github](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FInternVL-U) | Local Demo | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FVITA-MLLM\u002FOmni-Diffusion.svg?style=social&label=Star) \u003Cbr> [**Omni-Diffusion: Unified Multimodal Understanding and 
Generation with Masked Discrete Diffusion**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2603.06577) \u003Cbr> | arXiv | 2026-03-06 | [Github](https:\u002F\u002Fgithub.com\u002FVITA-MLLM\u002FOmni-Diffusion) | - |\n| [**Beyond Language Modeling: An Exploration of Multimodal Pretraining**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2603.03276) | arXiv | 2026-03-03 | - | - |\n| [**Gemini 3.1 Pro: A smarter model for your most complex tasks**](https:\u002F\u002Fblog.google\u002Finnovation-and-ai\u002Fmodels-and-research\u002Fgemini-models\u002Fgemini-3-1-pro\u002F) | Blog | 2026-02-19 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FQwenLM\u002FQwen3.5.svg?style=social&label=Star) \u003Cbr> [**Qwen3.5: Towards Native Multimodal Agents**](https:\u002F\u002Fqwen.ai\u002Fblog?id=qwen3.5) \u003Cbr> | Blog | 2026-02-16 | [Github](https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen3.5) | [Demo](https:\u002F\u002Fchat.qwen.ai\u002F) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenBMB\u002FMiniCPM-o.svg?style=social&label=Star) \u003Cbr> [**MiniCPM-o 4.5**](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM-o-4_5) \u003Cbr> | Blog | 2026-02-06 | [Github](https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FMiniCPM-o) | [Demo](https:\u002F\u002Fminicpm-omni.openbmb.cn\u002F) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FMoonshotAI\u002FKimi-K2.5.svg?style=social&label=Star) \u003Cbr> [**Kimi K2.5: Visual Agentic Intelligence**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2602.02276) \u003Cbr> | arXiv | 2026-02-02 | [Github](https:\u002F\u002Fgithub.com\u002FMoonshotAI\u002FKimi-K2.5) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdeepseek-ai\u002FDeepSeek-OCR-2.svg?style=social&label=Star) \u003Cbr> [**DeepSeek-OCR 2: Visual Causal Flow**](https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FDeepSeek-OCR-2\u002Fblob\u002Fmain\u002FDeepSeek_OCR2_paper.pdf) 
\u003Cbr> | DeepSeek | 2026-01-27 | [Github](https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FDeepSeek-OCR-2) | - |\n| [**Seed1.8 Model Card: Towards Generalized Real-World Agency**](https:\u002F\u002Flf3-static.bytednsdoc.com\u002Fobj\u002Feden-cn\u002Flapzild-tss\u002FljhwZthlaukjlkulzlp\u002Fresearch\u002FSeed-1.8-Modelcard.pdf) | Bytedance Seed | 2025-12-18 | - | - |\n| [**Introducing GPT-5.2**](https:\u002F\u002Fopenai.com\u002Findex\u002Fintroducing-gpt-5-2\u002F) | OpenAI | 2025-12-11 | - | - |\n| [**Introducing Mistral 3**](https:\u002F\u002Fmistral.ai\u002Fnews\u002Fmistral-3) | Blog | 2025-12-02 | [Huggingface](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Fmistralai\u002Fmistral-large-3) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FQwenLM\u002FQwen3-VL.svg?style=social&label=Star) \u003Cbr> [**Qwen3-VL Technical Report**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2511.21631) \u003Cbr> | arXiv | 2025-11-26 | [Github](https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen3-VL) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FQwen\u002FQwen3-VL-Demo) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fbaaivision\u002FEmu3.5.svg?style=social&label=Star) \u003Cbr> [**Emu3.5: Native Multimodal Models are World Learners**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2510.26583) \u003Cbr> | arXiv | 2025-10-30 | [Github](https:\u002F\u002Fgithub.com\u002Fbaaivision\u002FEmu3.5) | - | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTencent\u002FVITA.svg?style=social&label=Star) \u003Cbr> [**VITA-E: Natural Embodied Interaction with Concurrent Seeing, Hearing, Speaking, and Acting**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2510.21817.pdf) \u003Cbr> | arXiv | 2025-10-21 | [Github](https:\u002F\u002Fgithub.com\u002FTencent\u002FVITA\u002Ftree\u002FVITA-E) | Local Demo |\n| 
![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdeepseek-ai\u002FDeepSeek-OCR.svg?style=social&label=Star) \u003Cbr> [**DeepSeek-OCR: Contexts Optical Compression**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2510.18234) \u003Cbr> | arXiv | 2025-10-21 | [Github](https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FDeepSeek-OCR) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FNVlabs\u002FOmniVinci.svg?style=social&label=Star) \u003Cbr> [**OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2510.15870) \u003Cbr> | arXiv | 2025-10-17 | [Github](https:\u002F\u002Fgithub.com\u002FNVlabs\u002FOmniVinci) | - |\n| [**NExT-OMNI: Towards Any-to-Any Omnimodal Foundation Models with Discrete Flow Matching**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2510.13721) | arXiv | 2025-10-16 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FSenseTime-FVG\u002FInteractiveOmni.svg?style=social&label=Star) \u003Cbr> [**InteractiveOmni: A Unified Omni-modal Model for Audio-Visual Multi-turn Dialogue**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2510.13747) | arXiv | 2025-10-15 | [Github](https:\u002F\u002Fgithub.com\u002FSenseTime-FVG\u002FInteractiveOmni) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTencent\u002FVITA.svg?style=social&label=Star) \u003Cbr> [**VITA-VLA: Efficiently Teaching Vision-Language Models to Act via Action Expert Distillation**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2510.09607) \u003Cbr> | arXiv | 2025-10-10 | [Github](https:\u002F\u002Fgithub.com\u002FTencent\u002FVITA\u002Ftree\u002FVITA-VLA) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FEvolvingLMMs-Lab\u002FLLaVA-OneVision-1.5.svg?style=social&label=Star) \u003Cbr> [**LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal 
Training**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2509.23661) \u003Cbr> | arXiv | 2025-10-09 | [Github](https:\u002F\u002Fgithub.com\u002FEvolvingLMMs-Lab\u002FLLaVA-OneVision-1.5) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Flmms-lab\u002FLLaVA-OneVision-1.5) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FQwenLM\u002FQwen3-Omni.svg?style=social&label=Star) \u003Cbr> [**Qwen3-Omni Technical Report**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2509.17765) \u003Cbr> | arXiv | 2025-09-22 | [Github](https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen3-Omni) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FQwen\u002FQwen3-Omni-Demo) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenGVLab\u002FInternVL.svg?style=social&label=Star) \u003Cbr> [**InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2508.18265) \u003Cbr> | arXiv | 2025-08-27 | [Github](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FInternVL) | [Demo](https:\u002F\u002Fchat.intern-ai.org.cn\u002F) |\n| **MiniCPM-V 4.5: A GPT-4o Level MLLM for Single Image, Multi Image and Video Understanding on Your Phone** | - | 2025-08-26 | [Github](https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FMiniCPM-o) | [Demo](http:\u002F\u002F101.126.42.235:30910\u002F) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fyfzhang114\u002FThyme.svg?style=social&label=Star) \u003Cbr> [**Thyme: Think Beyond Images**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2508.11630) \u003Cbr> | arXiv | 2025-08-18 | [Github](https:\u002F\u002Fgithub.com\u002Fyfzhang114\u002FThyme) | [Demo](https:\u002F\u002Fthyme-vl.github.io\u002F) |\n| [**Introducing GPT-5**](https:\u002F\u002Fopenai.com\u002Findex\u002Fintroducing-gpt-5\u002F) | OpenAI | 2025-08-07 | - | - |\n| 
![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Frednote-hilab\u002Fdots.vlm1.svg?style=social&label=Star) \u003Cbr> [**dots.vlm1**](https:\u002F\u002Fgithub.com\u002Frednote-hilab\u002Fdots.vlm1) \u003Cbr> | rednote-hilab | 2025-08-06 | [Github](https:\u002F\u002Fgithub.com\u002Frednote-hilab\u002Fdots.vlm1) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Frednote-hilab\u002Fdots-vlm1-demo) | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fstepfun-ai\u002FStep3.svg?style=social&label=Star) \u003Cbr> [**Step3: Cost-Effective Multimodal Intelligence**](https:\u002F\u002Fstepfun.ai\u002Fresearch\u002Fstep3) \u003Cbr> | StepFun | 2025-07-31 | [Github](https:\u002F\u002Fgithub.com\u002Fstepfun-ai\u002FStep3) | [Demo](https:\u002F\u002Fstepfun.com\u002F) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTHUDM\u002FGLM-4.1V-Thinking.svg?style=social&label=Star) \u003Cbr> [**GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2507.01006) \u003Cbr> | arXiv | 2025-07-02 | [Github](https:\u002F\u002Fgithub.com\u002FTHUDM\u002FGLM-4.1V-Thinking) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FTHUDM\u002FGLM-4.1V-9B-Thinking-API-Demo) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FlxtGH\u002FDenseWorld-1M.svg?style=social&label=Star) \u003Cbr> [**DenseWorld-1M: Towards Detailed Dense Grounded Caption in the Real World**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2506.24102) \u003Cbr> | arXiv | 2025-06-30 | [Github](https:\u002F\u002Fgithub.com\u002FlxtGH\u002FDenseWorld-1M) | - |\n| [**Qwen VLo: From \"Understanding\" the World to \"Depicting\" It**](https:\u002F\u002Fqwenlm.github.io\u002Fblog\u002Fqwen-vlo\u002F) | Qwen | 2025-06-26 | - | [Demo](https:\u002F\u002Fchat.qwen.ai\u002F) |\n| 
![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FEvolvingLMMs-Lab\u002Fmultimodal-search-r1.svg?style=social&label=Star) \u003Cbr> [**MMSearch-R1: Incentivizing LMMs to Search**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2506.20670) \u003Cbr> | arXiv | 2025-06-25 | [Github](https:\u002F\u002Fgithub.com\u002FEvolvingLMMs-Lab\u002Fmultimodal-search-r1) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fshowlab\u002FShow-o.svg?style=social&label=Star) \u003Cbr> [**Show-o2: Improved Native Unified Multimodal Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2506.15564) \u003Cbr> | arXiv | 2025-06-18 | [Github](https:\u002F\u002Fgithub.com\u002Fshowlab\u002FShow-o) | - |\n| [**Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities**](https:\u002F\u002Fstorage.googleapis.com\u002Fdeepmind-media\u002Fgemini\u002Fgemini_v2_5_report.pdf) | Google | 2025-06-17 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fegolife-ai\u002FEgo-R1.svg?style=social&label=Star) \u003Cbr> [**Ego-R1: Chain-of-Tool-Thought for Ultra-Long Egocentric Video Reasoning**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2506.13654) \u003Cbr> | arXiv | 2025-06-16 | [Github](https:\u002F\u002Fgithub.com\u002Fegolife-ai\u002FEgo-R1) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FXiaomiMiMo\u002FMiMo-VL.svg?style=social&label=Star) \u003Cbr> [**MiMo-VL Technical Report**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2506.03569) \u003Cbr> | arXiv | 2025-06-04 | [Github](https:\u002F\u002Fgithub.com\u002FXiaomiMiMo\u002FMiMo-VL) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fwusize\u002FOpenUni.svg?style=social&label=Star) \u003Cbr> [**OpenUni: A Simple Baseline for Unified Multimodal Understanding and Generation**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.23661) \u003Cbr> | arXiv | 2025-05-29 | 
[Github](https:\u002F\u002Fgithub.com\u002Fwusize\u002FOpenUni) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fbytedance-seed\u002FBAGEL.svg?style=social&label=Star) \u003Cbr> [**Emerging Properties in Unified Multimodal Pretraining**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.14683) \u003Cbr> | arXiv | 2025-05-23 | [Github](https:\u002F\u002Fgithub.com\u002Fbytedance-seed\u002FBAGEL) | [Demo](https:\u002F\u002Fdemo.bagel-ai.org\u002F) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FGen-Verse\u002FMMaDA.svg?style=social&label=Star) \u003Cbr> [**MMaDA: Multimodal Large Diffusion Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.15809) \u003Cbr> | arXiv | 2025-05-21 | [Github](https:\u002F\u002Fgithub.com\u002FGen-Verse\u002FMMaDA) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FGen-Verse\u002FMMaDA) |\n| [**UniGen: Enhanced Training & Test-Time Strategies for Unified Multimodal Understanding and Generation**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.14682) | arXiv | 2025-05-20 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FJiuhaiChen\u002FBLIP3o.svg?style=social&label=Star) \u003Cbr> [**BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.09568) \u003Cbr> | arXiv | 2025-05-14 | [Github](https:\u002F\u002Fgithub.com\u002FJiuhaiChen\u002FBLIP3o) | Local Demo |\n| [**Seed1.5-VL Technical Report**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.07062) | arXiv | 2025-05-11 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FHITsz-TMG\u002FAwesome-Large-Multimodal-Reasoning-Models.svg?style=social&label=Star) \u003Cbr> [**Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.04921) \u003Cbr> | arXiv | 2025-05-08 | 
[Github](https:\u002F\u002Fgithub.com\u002FHITsz-TMG\u002FAwesome-Large-Multimodal-Reasoning-Models) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FVITA-MLLM\u002FVITA-Audio.svg?style=social&label=Star) \u003Cbr> [**VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.03739) \u003Cbr> | arXiv | 2025-05-06 | [Github](https:\u002F\u002Fgithub.com\u002FVITA-MLLM\u002FVITA-Audio) | Local Demo |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FSkyworkAI\u002FSkywork-R1V.svg?style=social&label=Star) \u003Cbr> [**Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2504.16656) \u003Cbr> | arXiv | 2025-04-23 | [Github](https:\u002F\u002Fgithub.com\u002FSkyworkAI\u002FSkywork-R1V) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FNVlabs\u002FEAGLE.svg?style=social&label=Star) \u003Cbr> [**Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2504.15271) \u003Cbr> | arXiv | 2025-04-21 | [Github](https:\u002F\u002Fgithub.com\u002FNVlabs\u002FEAGLE) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fquicksviewer\u002Fquicksviewer.svg?style=social&label=Star) \u003Cbr> [**An LMM for Efficient Video Understanding via Reinforced Compression of Video Cubes**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2504.15270) \u003Cbr> | arXiv | 2025-04-21 | [Github](https:\u002F\u002Fgithub.com\u002Fquicksviewer\u002Fquicksviewer) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenGVLab\u002FInternVL.svg?style=social&label=Star) \u003Cbr> [**InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.10479) \u003Cbr> | arXiv | 2025-04-14 | 
[Github](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FInternVL) | [Demo](https:\u002F\u002Finternvl.opengvlab.com\u002F) |\n| [**Introducing GPT-4.1 in the API**](https:\u002F\u002Fopenai.com\u002Findex\u002Fgpt-4-1\u002F) | OpenAI | 2025-04-14 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FMoonshotAI\u002FKimi-VL.svg?style=social&label=Star) \u003Cbr> [**Kimi-VL Technical Report**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2504.07491) \u003Cbr> | arXiv | 2025-04-10 | [Github](https:\u002F\u002Fgithub.com\u002FMoonshotAI\u002FKimi-VL) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fmoonshotai\u002FKimi-VL-A3B-Thinking) |\n| [**The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation**](https:\u002F\u002Fai.meta.com\u002Fblog\u002Fllama-4-multimodal-intelligence\u002F) | Meta | 2025-04-05 | [Hugging Face](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Fmeta-llama\u002Fllama-4-67f0c30d9fe03840bc9d0164) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FQwenLM\u002FQwen2.5-Omni.svg?style=social&label=Star) \u003Cbr> [**Qwen2.5-Omni Technical Report**](https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen2.5-Omni\u002Fblob\u002Fmain\u002Fassets\u002FQwen2.5_Omni.pdf) \u003Cbr> | Qwen | 2025-03-26 | [Github](https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen2.5-Omni) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FQwen\u002FQwen2.5-Omni-7B-Demo) |\n| [**Addendum to GPT-4o System Card: Native image generation**](https:\u002F\u002Fcdn.openai.com\u002F11998be9-5319-4302-bfbf-1167e093f1fb\u002FNative_Image_Generation_System_Card.pdf) | OpenAI | 2025-03-25 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FVITA-MLLM\u002FSparrow.svg?style=social&label=Star) \u003Cbr> [**Sparrow: Data-Efficient Video-LLM with Text-to-Image Augmentation**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2411.19951) \u003Cbr> | arXiv | 2025-03-17 | 
[Github](https:\u002F\u002Fgithub.com\u002FVITA-MLLM\u002FSparrow) | - |\n| [**Nexus-O: An Omni-Perceptive And -Interactive Model for Language, Audio, And Vision**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2503.01879) | arXiv | 2025-03-07 | - | - |\n| [**Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2503.01743) | arXiv | 2025-03-03 | [Hugging Face](https:\u002F\u002Fhuggingface.co\u002Fmicrosoft\u002FPhi-4-multimodal-instruct) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fmicrosoft\u002Fphi-4-multimodal) | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FVITA-MLLM\u002FLong-VITA.svg?style=social&label=Star) \u003Cbr> [**Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2502.05177) \u003Cbr> | arXiv | 2025-02-19 | [Github](https:\u002F\u002Fgithub.com\u002FVITA-MLLM\u002FLong-VITA) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FQwenLM\u002FQwen2.5-VL.svg?style=social&label=Star) \u003Cbr> [**Qwen2.5-VL Technical Report**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2502.13923) \u003Cbr> | arXiv | 2025-02-19 | [Github](https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen2.5-VL) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FQwen\u002FQwen2.5-VL) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fbaichuan-inc\u002FBaichuan-Omni-1.5.svg?style=social&label=Star) \u003Cbr> [**Baichuan-Omni-1.5 Technical Report**](https:\u002F\u002Fgithub.com\u002Fbaichuan-inc\u002FBaichuan-Omni-1.5\u002Fblob\u002Fmain\u002Fbaichuan_omni_1_5.pdf) \u003Cbr> | Tech Report | 2025-01-26 | [Github](https:\u002F\u002Fgithub.com\u002Fbaichuan-inc\u002FBaichuan-Omni-1.5) | Local Demo |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmbzuai-oryx\u002FLlamaV-o1.svg?style=social&label=Star) 
\u003Cbr> [**LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2501.06186) \u003Cbr> | arXiv | 2025-01-10 | [Github](https:\u002F\u002Fgithub.com\u002Fmbzuai-oryx\u002FLlamaV-o1) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FVITA-MLLM\u002FVITA.svg?style=social&label=Star) \u003Cbr> [**VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2501.01957) \u003Cbr> | arXiv | 2025-01-03 | [Github](https:\u002F\u002Fgithub.com\u002FVITA-MLLM\u002FVITA) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FQwenLM\u002FQwen2-VL.svg?style=social&label=Star) \u003Cbr> [**QVQ: To See the World with Wisdom**](https:\u002F\u002Fqwenlm.github.io\u002Fblog\u002Fqvq-72b-preview\u002F) \u003Cbr> | Qwen | 2024-12-25 | [Github](https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen2-VL) | [Demo](https:\u002F\u002Fqwenlm.github.io\u002Fblog\u002Fqvq-72b-preview\u002F) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdeepseek-ai\u002FDeepSeek-VL2.svg?style=social&label=Star) \u003Cbr> [**DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.10302) \u003Cbr> | arXiv | 2024-12-13 | [Github](https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FDeepSeek-VL2) | - |\n| [**Apollo: An Exploration of Video Understanding in Large Multimodal Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.10360) | arXiv | 2024-12-13 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FInternLM\u002FInternLM-XComposer.svg?style=social&label=Star) \u003Cbr> [**InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.09596) \u003Cbr> | arXiv | 2024-12-12 | 
[Github](https:\u002F\u002Fgithub.com\u002FInternLM\u002FInternLM-XComposer\u002Ftree\u002Fmain\u002FInternLM-XComposer-2.5-OmniLive) | Local Demo |\n| [**StreamChat: Chatting with Streaming Video**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.08646) | arXiv | 2024-12-11 | Coming soon | - |\n| [**CompCap: Improving Multimodal Large Language Models with Composite Captions**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.05243) | arXiv | 2024-12-06 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fgls0425\u002FLinVT.svg?style=social&label=Star) \u003Cbr> [**LinVT: Empower Your Image-level Large Language Model to Understand Videos**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.05185) \u003Cbr> | arXiv | 2024-12-06 | [Github](https:\u002F\u002Fgithub.com\u002Fgls0425\u002FLinVT) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenGVLab\u002FInternVL.svg?style=social&label=Star) \u003Cbr> [**Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.05271) \u003Cbr> | arXiv | 2024-12-06 | [Github](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FInternVL) | [Demo](https:\u002F\u002Finternvl.opengvlab.com) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FNVlabs\u002FVILA.svg?style=social&label=Star) \u003Cbr> [**NVILA: Efficient Frontier Visual Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.04468) \u003Cbr> | arXiv | 2024-12-05 | [Github](https:\u002F\u002Fgithub.com\u002FNVlabs\u002FVILA) | [Demo](https:\u002F\u002Fvila.mit.edu) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Finst-it\u002Finst-it.svg?style=social&label=Star) \u003Cbr> [**Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.03565) \u003Cbr> | arXiv | 2024-12-04 | 
[Github](https:\u002F\u002Fgithub.com\u002Finst-it\u002Finst-it) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTimeMarker-LLM\u002FTimeMarker.svg?style=social&label=Star) \u003Cbr> [**TimeMarker: A Versatile Video-LLM for Long and Short Video Understanding with Superior Temporal Localization Ability**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2411.18211) \u003Cbr> | arXiv | 2024-11-27 | [Github](https:\u002F\u002Fgithub.com\u002FTimeMarker-LLM\u002FTimeMarker\u002F) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FIDEA-Research\u002FChatRex.svg?style=social&label=Star) \u003Cbr> [**ChatRex: Taming Multimodal LLM for Joint Perception and Understanding**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2411.18363) \u003Cbr> | arXiv | 2024-11-27 | [Github](https:\u002F\u002Fgithub.com\u002FIDEA-Research\u002FChatRex) | Local Demo | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FVision-CAIR\u002FLongVU.svg?style=social&label=Star) \u003Cbr> [**LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2410.17434) \u003Cbr> | arXiv | 2024-10-22 | [Github](https:\u002F\u002Fgithub.com\u002FVision-CAIR\u002FLongVU) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FVision-CAIR\u002FLongVU) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fshikiw\u002FModality-Integration-Rate.svg?style=social&label=Star) \u003Cbr> [**Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2410.07167) \u003Cbr> | arXiv | 2024-10-09 | [Github](https:\u002F\u002Fgithub.com\u002Fshikiw\u002FModality-Integration-Rate) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Frese1f\u002Faurora.svg?style=social&label=Star) \u003Cbr> [**AuroraCap: Efficient, Performant Video Detailed Captioning and a New 
Benchmark**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2410.03051) \u003Cbr> | arXiv | 2024-10-04 | [Github](https:\u002F\u002Fgithub.com\u002Frese1f\u002Faurora) | Local Demo |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Femova-ollm\u002FEMOVA.svg?style=social&label=Star) \u003Cbr> [**EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2409.18042) \u003Cbr> | CVPR | 2024-09-26 | [Github](https:\u002F\u002Fgithub.com\u002Femova-ollm\u002FEMOVA) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FEmova-ollm\u002FEMOVA-demo) | \n| [**Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2409.17146) | arXiv | 2024-09-25 | [Huggingface](https:\u002F\u002Fhuggingface.co\u002Fallenai\u002FMolmoE-1B-0924) | [Demo](https:\u002F\u002Fmolmo.allenai.org) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FQwenLM\u002FQwen2-VL.svg?style=social&label=Star) \u003Cbr> [**Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2409.12191) \u003Cbr> | arXiv | 2024-09-18 | [Github](https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen2-VL) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FQwen\u002FQwen2-VL) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FIDEA-FinAI\u002FChartMoE.svg?style=social&label=Star) \u003Cbr> [**ChartMoE: Mixture of Expert Connector for Advanced Chart Understanding**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2409.03277) \u003Cbr> | ICLR | 2024-09-05 | [Github](https:\u002F\u002Fgithub.com\u002FIDEA-FinAI\u002FChartMoE) | Local Demo |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FFreedomIntelligence\u002FLongLLaVA.svg?style=social&label=Star) \u003Cbr> [**LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid 
Architecture**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2409.02889) \u003Cbr> | arXiv | 2024-09-04 | [Github](https:\u002F\u002Fgithub.com\u002FFreedomIntelligence\u002FLongLLaVA) | - | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FNVlabs\u002FEagle.svg?style=social&label=Star) \u003Cbr> [**EAGLE: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.15998) \u003Cbr> | arXiv | 2024-08-28 | [Github](https:\u002F\u002Fgithub.com\u002FNVlabs\u002FEagle) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FNVEagle\u002FEagle-X5-13B-Chat) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fshufangxun\u002FLLaVA-MoD.svg?style=social&label=Star) \u003Cbr> [**LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.15881) \u003Cbr> | arXiv | 2024-08-28 | [Github](https:\u002F\u002Fgithub.com\u002Fshufangxun\u002FLLaVA-MoD) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FX-PLUG\u002FmPLUG-Owl.svg?style=social&label=Star) \u003Cbr> [**mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models**](https:\u002F\u002Fwww.arxiv.org\u002Fpdf\u002F2408.04840) \u003Cbr> | arXiv | 2024-08-09 | [Github](https:\u002F\u002Fgithub.com\u002FX-PLUG\u002FmPLUG-Owl) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FVITA-MLLM\u002FVITA.svg?style=social&label=Star) \u003Cbr> [**VITA: Towards Open-Source Interactive Omni Multimodal LLM**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.05211) \u003Cbr> | arXiv | 2024-08-09 | [Github](https:\u002F\u002Fgithub.com\u002FVITA-MLLM\u002FVITA) | - | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FLLaVA-VL\u002FLLaVA-NeXT.svg?style=social&label=Star) \u003Cbr> [**LLaVA-OneVision: Easy Visual Task Transfer**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.03326) 
\u003Cbr> | arXiv | 2024-08-06 | [Github](https:\u002F\u002Fgithub.com\u002FLLaVA-VL\u002FLLaVA-NeXT) | [Demo](https:\u002F\u002Fllava-onevision.lmms-lab.com) | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenBMB\u002FMiniCPM-V.svg?style=social&label=Star) \u003Cbr> [**MiniCPM-V: A GPT-4V Level MLLM on Your Phone**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.01800) \u003Cbr> | arXiv | 2024-08-03 | [Github](https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FMiniCPM-V) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fopenbmb\u002FMiniCPM-Llama3-V-2_5) |\n| [**VILA^2: VILA Augmented VILA**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2407.17453) | arXiv | 2024-07-24 | - | - |\n| [**SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2407.15841) | arXiv | 2024-07-22 | - | - |\n| [**EVLM: An Efficient Vision-Language Model for Visual Understanding**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2407.14177) | arXiv | 2024-07-19 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fjiyt17\u002FIDA-VLM.svg?style=social&label=Star) \u003Cbr> [**IDA-VLM: Towards Movie Understanding via ID-Aware Large Vision-Language Model**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2407.07577) \u003Cbr> | arXiv | 2024-07-10 | [Github](https:\u002F\u002Fgithub.com\u002Fjiyt17\u002FIDA-VLM) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FInternLM\u002FInternLM-XComposer.svg?style=social&label=Star) \u003Cbr> [**InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2407.03320) \u003Cbr> | arXiv | 2024-07-03 | [Github](https:\u002F\u002Fgithub.com\u002FInternLM\u002FInternLM-XComposer) | [Demo](https:\u002F\u002Fopenxlab.org.cn\u002Fapps\u002Fdetail\u002FWillowBreeze\u002FInternLM-XComposer) |\n| 
![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FlxtGH\u002FOMG-Seg.svg?style=social&label=Star) \u003Cbr> [**OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.19389) \u003Cbr> | arXiv | 2024-06-27 | [Github](https:\u002F\u002Fgithub.com\u002FlxtGH\u002FOMG-Seg) | Local Demo |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FZZZHANG-jx\u002FDocKylin.svg?style=social&label=Star) \u003Cbr> [**DocKylin: A Large Multimodal Model for Visual Document Understanding with Efficient Visual Slimming**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.19101) \u003Cbr> | AAAI | 2024-06-27 | [Github](https:\u002F\u002Fgithub.com\u002FZZZHANG-jx\u002FDocKylin) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fcambrian-mllm\u002Fcambrian.svg?style=social&label=Star) \u003Cbr> [**Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.16860) \u003Cbr> | arXiv | 2024-06-24 | [Github](https:\u002F\u002Fgithub.com\u002Fcambrian-mllm\u002Fcambrian) | Local Demo |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FEvolvingLMMs-Lab\u002FLongVA.svg?style=social&label=Star) \u003Cbr> [**Long Context Transfer from Language to Vision**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.16852) \u003Cbr> | arXiv | 2024-06-24 | [Github](https:\u002F\u002Fgithub.com\u002FEvolvingLMMs-Lab\u002FLongVA) | Local Demo |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fbytedance\u002FSALMONN.svg?style=social&label=Star) \u003Cbr> [**video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.15704) \u003Cbr> | ICML | 2024-06-22 | [Github](https:\u002F\u002Fgithub.com\u002Fbytedance\u002FSALMONN) | - |\n| 
![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FByungKwanLee\u002FTroL.svg?style=social&label=Star) \u003Cbr> [**TroL: Traversal of Layers for Large Language and Vision Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.12246) \u003Cbr> | EMNLP | 2024-06-18 | [Github](https:\u002F\u002Fgithub.com\u002FByungKwanLee\u002FTroL) | Local Demo |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fbaaivision\u002FEVE.svg?style=social&label=Star) \u003Cbr> [**Unveiling Encoder-Free Vision-Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.11832) \u003Cbr> | arXiv | 2024-06-17 | [Github](https:\u002F\u002Fgithub.com\u002Fbaaivision\u002FEVE) | Local Demo |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fshowlab\u002FVideoLLM-online.svg?style=social&label=Star) \u003Cbr> [**VideoLLM-online: Online Video Large Language Model for Streaming Video**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.11816) \u003Cbr> | CVPR | 2024-06-17 | [Github](https:\u002F\u002Fgithub.com\u002Fshowlab\u002FVideoLLM-online) | Local Demo |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fwentaoyuan\u002FRoboPoint.svg?style=social&label=Star) \u003Cbr> [**RoboPoint: A Vision-Language Model for Spatial Affordance Prediction for Robotics**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.10721) \u003Cbr> | CoRL | 2024-06-15 | [Github](https:\u002F\u002Fgithub.com\u002Fwentaoyuan\u002FRoboPoint) | [Demo](https:\u002F\u002F007e03d34429a2517b.gradio.live\u002F) | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fwlin-at\u002FCaD-VI.svg?style=social&label=Star) \u003Cbr> [**Comparison Visual Instruction Tuning**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.09240) \u003Cbr> | arXiv | 2024-06-13 | [Github](https:\u002F\u002Fwlin-at.github.io\u002Fcad_vi) | Local Demo |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fyfzhang114\u002FSliME.svg?style=social&label=Star) \u003Cbr> 
[**Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.08487) \u003Cbr> | arXiv | 2024-06-12 | [Github](https:\u002F\u002Fgithub.com\u002Fyfzhang114\u002FSliME) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FDAMO-NLP-SG\u002FVideoLLaMA2.svg?style=social&label=Star) \u003Cbr> [**VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.07476) \u003Cbr> | arXiv | 2024-06-11 | [Github](https:\u002F\u002Fgithub.com\u002FDAMO-NLP-SG\u002FVideoLLaMA2) | Local Demo |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAIDC-AI\u002FParrot.svg?style=social&label=Star) \u003Cbr> [**Parrot: Multilingual Visual Instruction Tuning**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.02539) \u003Cbr> | arXiv | 2024-06-04 | [Github](https:\u002F\u002Fgithub.com\u002FAIDC-AI\u002FParrot) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAIDC-AI\u002FOvis.svg?style=social&label=Star) \u003Cbr> [**Ovis: Structural Embedding Alignment for Multimodal Large Language Model**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.20797) \u003Cbr> | arXiv | 2024-05-31 | [Github](https:\u002F\u002Fgithub.com\u002FAIDC-AI\u002FOvis\u002F) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fgordonhu608\u002FMQT-LLaVA.svg?style=social&label=Star) \u003Cbr> [**Matryoshka Query Transformer for Large Vision-Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.19315) \u003Cbr> | arXiv | 2024-05-29 | [Github](https:\u002F\u002Fgithub.com\u002Fgordonhu608\u002FMQT-LLaVA) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fgordonhu\u002FMQT-LLaVA) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Falibaba\u002Fconv-llava.svg?style=social&label=Star) \u003Cbr> [**ConvLLaVA: Hierarchical Backbones as Visual Encoder for 
Large Multimodal Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.15738) \u003Cbr> | arXiv | 2024-05-24 | [Github](https:\u002F\u002Fgithub.com\u002Falibaba\u002Fconv-llava) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FByungKwanLee\u002FMeteor.svg?style=social&label=Star) \u003Cbr> [**Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.15574) \u003Cbr> | arXiv | 2024-05-24 | [Github](https:\u002F\u002Fgithub.com\u002FByungKwanLee\u002FMeteor) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FBK-Lee\u002FMeteor) | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FYifanXu74\u002FLibra.svg?style=social&label=Star) \u003Cbr> [**Libra: Building Decoupled Vision System on Large Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.10140) \u003Cbr> | ICML | 2024-05-16 | [Github](https:\u002F\u002Fgithub.com\u002FYifanXu74\u002FLibra) | Local Demo |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FSHI-Labs\u002FCuMo.svg?style=social&label=Star) \u003Cbr> [**CuMo: Scaling Multimodal LLM with Co-Upcycled Mixture-of-Experts**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.05949) \u003Cbr> | arXiv | 2024-05-09 | [Github](https:\u002F\u002Fgithub.com\u002FSHI-Labs\u002FCuMo) | Local Demo |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenGVLab\u002FInternVL.svg?style=social&label=Star) \u003Cbr> [**How Far Are We to GPT-4V? 
Closing the Gap to Commercial Multimodal Models with Open-Source Suites**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2404.16821) \u003Cbr> | arXiv | 2024-04-25 | [Github](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FInternVL) | [Demo](https:\u002F\u002Finternvl.opengvlab.com) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fgraphic-design-ai\u002Fgraphist.svg?style=social&label=Star) \u003Cbr> [**Graphic Design with Large Multimodal Model**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2404.14368) \u003Cbr> | arXiv | 2024-04-22 | [Github](https:\u002F\u002Fgithub.com\u002Fgraphic-design-ai\u002Fgraphist) | - |\n| [**BRAVE: Broadening the visual encoding of vision-language models**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.07204) | ECCV | 2024-04-10 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FInternLM\u002FInternLM-XComposer.svg?style=social&label=Star) \u003Cbr> [**InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2404.06512.pdf) \u003Cbr> | arXiv | 2024-04-09 | [Github](https:\u002F\u002Fgithub.com\u002FInternLM\u002FInternLM-XComposer) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FWillow123\u002FInternLM-XComposer) |\n| [**Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2404.05719.pdf) | arXiv | 2024-04-08 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fboheumd\u002FMA-LMM.svg?style=social&label=Star) \u003Cbr> [**MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2404.05726.pdf) \u003Cbr> | CVPR | 2024-04-08 | [Github](https:\u002F\u002Fgithub.com\u002Fboheumd\u002FMA-LMM) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FSkyworkAI\u002FVitron.svg?style=social&label=Star) 
\u003Cbr> [**VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing**](https:\u002F\u002Fhaofei.vip\u002Fdownloads\u002Fpapers\u002FSkywork_Vitron_2024.pdf) \u003Cbr> | NeurIPS | 2024-04-04 | [Github](https:\u002F\u002Fgithub.com\u002FSkyworkAI\u002FVitron) | Local Demo |\n| [**TOMGPT: Reliable Text-Only Training Approach for Cost-Effective Multi-modal Large Language Model**](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002Fpdf\u002F10.1145\u002F3654674) | ACM TKDD | 2024-03-28 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FNVlabs\u002FLITA.svg?style=social&label=Star) \u003Cbr> [**LITA: Language Instructed Temporal-Localization Assistant**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2403.19046) | arXiv | 2024-03-27 | [Github](https:\u002F\u002Fgithub.com\u002FNVlabs\u002FLITA) | Local Demo |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdvlab-research\u002FMiniGemini.svg?style=social&label=Star) \u003Cbr> [**Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2403.18814.pdf) \u003Cbr> | arXiv | 2024-03-27 | [Github](https:\u002F\u002Fgithub.com\u002Fdvlab-research\u002FMiniGemini) | [Demo](http:\u002F\u002F103.170.5.190:7860) |\n| [**MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2403.09611.pdf) | arXiv | 2024-03-14 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FByungKwanLee\u002FMoAI.svg?style=social&label=Star) \u003Cbr> [**MoAI: Mixture of All Intelligence for Large Language and Vision Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2403.07508.pdf) \u003Cbr> | arXiv | 2024-03-12 | [Github](https:\u002F\u002Fgithub.com\u002FByungKwanLee\u002FMoAI) | Local Demo |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdeepseek-ai\u002FDeepSeek-VL.svg?style=social&label=Star) \u003Cbr> 
[**DeepSeek-VL: Towards Real-World Vision-Language Understanding**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2403.05525) \u003Cbr> | arXiv | 2024-03-08 | [Github](https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FDeepSeek-VL) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fdeepseek-ai\u002FDeepSeek-VL-7B) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FYuliang-Liu\u002FMonkey.svg?style=social&label=Star) \u003Cbr> [**TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2403.04473.pdf) \u003Cbr> | arXiv | 2024-03-07 | [Github](https:\u002F\u002Fgithub.com\u002FYuliang-Liu\u002FMonkey) | [Demo](http:\u002F\u002Fvlrlab-monkey.xyz:7684) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenGVLab\u002Fall-seeing.svg?style=social&label=Star) \u003Cbr> [**The All-Seeing Project V2: Towards General Relation Comprehension of the Open World**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.19474.pdf) | arXiv | 2024-02-29 | [Github](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002Fall-seeing) | - |\n| [**GROUNDHOG: Grounding Large Language Models to Holistic Segmentation**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.16846.pdf) | CVPR | 2024-02-26 | Coming soon | Coming soon |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenMOSS\u002FAnyGPT.svg?style=social&label=Star) \u003Cbr> [**AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.12226.pdf) \u003Cbr> | arXiv | 2024-02-19 | [Github](https:\u002F\u002Fgithub.com\u002FOpenMOSS\u002FAnyGPT) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FDCDmllm\u002FMomentor.svg?style=social&label=Star) \u003Cbr> [**Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.11435.pdf) \u003Cbr> | arXiv | 2024-02-18 | 
[Github](https:\u002F\u002Fgithub.com\u002FDCDmllm\u002FMomentor) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FFreedomIntelligence\u002FALLaVA.svg?style=social&label=Star) \u003Cbr> [**ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.11684.pdf) \u003Cbr> | arXiv | 2024-02-18 | [Github](https:\u002F\u002Fgithub.com\u002FFreedomIntelligence\u002FALLaVA) | [Demo](https:\u002F\u002Fhuggingface.co\u002FFreedomIntelligence\u002FALLaVA-3B) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FByungKwanLee\u002FCoLLaVO-Crayon-Large-Language-and-Vision-mOdel.svg?style=social&label=Star) \u003Cbr> [**CoLLaVO: Crayon Large Language and Vision mOdel**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.11248.pdf) \u003Cbr> | arXiv | 2024-02-17 | [Github](https:\u002F\u002Fgithub.com\u002FByungKwanLee\u002FCoLLaVO-Crayon-Large-Language-and-Vision-mOdel) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTRI-ML\u002Fprismatic-vlms.svg?style=social&label=Star) \u003Cbr> [**Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.07865) \u003Cbr> | ICML | 2024-02-12 | [Github](https:\u002F\u002Fgithub.com\u002FTRI-ML\u002Fprismatic-vlms) | - | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTHUDM\u002FCogCoM.svg?style=social&label=Star) \u003Cbr> [**CogCoM: Train Large Vision-Language Models Diving into Details through Chain of Manipulations**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.04236.pdf) \u003Cbr> | arXiv | 2024-02-06 | [Github](https:\u002F\u002Fgithub.com\u002FTHUDM\u002FCogCoM) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FMeituan-AutoML\u002FMobileVLM.svg?style=social&label=Star) \u003Cbr> [**MobileVLM V2: Faster and Stronger Baseline for Vision Language 
Model**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.03766.pdf) \u003Cbr> | arXiv | 2024-02-06 | [Github](https:\u002F\u002Fgithub.com\u002FMeituan-AutoML\u002FMobileVLM) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FWEIYanbin1999\u002FGITA.svg?style=social&label=Star) \u003Cbr> [**GITA: Graph to Visual and Textual Integration for Vision-Language Graph Reasoning**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.02130) \u003Cbr> | NeurIPS | 2024-02-03 | [Github](https:\u002F\u002Fgithub.com\u002FWEIYanbin1999\u002FGITA\u002F) | - |\n| [**Enhancing Multimodal Large Language Models with Vision Detection Models: An Empirical Study**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2401.17981.pdf) | arXiv | 2024-01-31 | Coming soon | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fhaotian-liu\u002FLLaVA.svg?style=social&label=Star) \u003Cbr> [**LLaVA-NeXT: Improved reasoning, OCR, and world knowledge**](https:\u002F\u002Fllava-vl.github.io\u002Fblog\u002F2024-01-30-llava-next\u002F) | Blog | 2024-01-30 | [Github](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA) | [Demo](https:\u002F\u002Fllava.hliu.cc) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FPKU-YuanGroup\u002FMoE-LLaVA.svg?style=social&label=Star) \u003Cbr> [**MoE-LLaVA: Mixture of Experts for Large Vision-Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2401.15947.pdf) \u003Cbr> | arXiv | 2024-01-29 | [Github](https:\u002F\u002Fgithub.com\u002FPKU-YuanGroup\u002FMoE-LLaVA) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FLanguageBind\u002FMoE-LLaVA) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FInternLM\u002FInternLM-XComposer.svg?style=social&label=Star) \u003Cbr> [**InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2401.16420.pdf) \u003Cbr> | arXiv | 
2024-01-29 | [Github](https:\u002F\u002Fgithub.com\u002FInternLM\u002FInternLM-XComposer) | [Demo](https:\u002F\u002Fopenxlab.org.cn\u002Fapps\u002Fdetail\u002FWillowBreeze\u002FInternLM-XComposer) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002F01-ai\u002FYi.svg?style=social&label=Star) \u003Cbr> [**Yi-VL**](https:\u002F\u002Fgithub.com\u002F01-ai\u002FYi\u002Ftree\u002Fmain\u002FVL) \u003Cbr> | - | 2024-01-23 | [Github](https:\u002F\u002Fgithub.com\u002F01-ai\u002FYi\u002Ftree\u002Fmain\u002FVL) | Local Demo |\n| [**SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2401.12168.pdf) | arXiv | 2024-01-22 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenGVLab\u002FChartAst.svg?style=social&label=Star) \u003Cbr> [**ChartAssisstant: A Universal Chart Multimodal Language Model via Chart-to-Table Pre-training and Multitask Instruction Tuning**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2401.02384) \u003Cbr> | ACL | 2024-01-04 | [Github](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FChartAst) | Local Demo | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FMeituan-AutoML\u002FMobileVLM.svg?style=social&label=Star) \u003Cbr> [**MobileVLM : A Fast, Reproducible and Strong Vision Language Assistant for Mobile Devices**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.16886.pdf) \u003Cbr> | arXiv | 2023-12-28 | [Github](https:\u002F\u002Fgithub.com\u002FMeituan-AutoML\u002FMobileVLM) | - | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenGVLab\u002FInternVL.svg?style=social&label=Star) \u003Cbr> [**InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.14238.pdf) \u003Cbr> | CVPR | 2023-12-21 | [Github](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FInternVL) | 
[Demo](https:\u002F\u002Finternvl.opengvlab.com) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FCircleRadon\u002FOsprey.svg?style=social&label=Star) \u003Cbr> [**Osprey: Pixel Understanding with Visual Instruction Tuning**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.10032.pdf) \u003Cbr> | CVPR | 2023-12-15 | [Github](https:\u002F\u002Fgithub.com\u002FCircleRadon\u002FOsprey) | [Demo](http:\u002F\u002F111.0.123.204:8000\u002F) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTHUDM\u002FCogVLM.svg?style=social&label=Star) \u003Cbr> [**CogAgent: A Visual Language Model for GUI Agents**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.08914.pdf) \u003Cbr> | arXiv | 2023-12-14 | [Github](https:\u002F\u002Fgithub.com\u002FTHUDM\u002FCogVLM) | Coming soon |\n| [**Pixel Aligned Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.09237.pdf) | arXiv | 2023-12-14 | Coming soon | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FNVlabs\u002FVILA.svg?style=social&label=Star) \u003Cbr> [**VILA: On Pre-training for Visual Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.07533) \u003Cbr> | CVPR | 2023-12-13 | [Github](https:\u002F\u002Fgithub.com\u002FNVlabs\u002FVILA) | Local Demo |\n| [**See, Say, and Segment: Teaching LMMs to Overcome False Premises**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.08366.pdf) | arXiv | 2023-12-13 | Coming soon | - | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FUcas-HaoranWei\u002FVary.svg?style=social&label=Star) \u003Cbr> [**Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.06109.pdf) \u003Cbr> | ECCV | 2023-12-11 | [Github](https:\u002F\u002Fgithub.com\u002FUcas-HaoranWei\u002FVary) | [Demo](http:\u002F\u002Fregion-31.seetacloud.com:22701\u002F) |\n| 
![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fkakaobrain\u002Fhoneybee.svg?style=social&label=Star) \u003Cbr> [**Honeybee: Locality-enhanced Projector for Multimodal LLM**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.06742.pdf) \u003Cbr> | CVPR | 2023-12-11 | [Github](https:\u002F\u002Fgithub.com\u002Fkakaobrain\u002Fhoneybee) | - |\n| [**Gemini: A Family of Highly Capable Multimodal Models**](https:\u002F\u002Fstorage.googleapis.com\u002Fdeepmind-media\u002Fgemini\u002Fgemini_1_report.pdf) | Google | 2023-12-06 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fcsuhan\u002FOneLLM.svg?style=social&label=Star) \u003Cbr> [**OneLLM: One Framework to Align All Modalities with Language**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.03700.pdf) \u003Cbr> | arXiv | 2023-12-06 | [Github](https:\u002F\u002Fgithub.com\u002Fcsuhan\u002FOneLLM) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fcsuhan\u002FOneLLM) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FMeituan-AutoML\u002FLenna.svg?style=social&label=Star) \u003Cbr> [**Lenna: Language Enhanced Reasoning Detection Assistant**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.02433.pdf) \u003Cbr> | arXiv | 2023-12-05 | [Github](https:\u002F\u002Fgithub.com\u002FMeituan-AutoML\u002FLenna) | - | \n| [**VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.02310.pdf) | arXiv | 2023-12-04 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FRenShuhuai-Andy\u002FTimeChat.svg?style=social&label=Star) \u003Cbr> [**TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.02051.pdf) \u003Cbr> | arXiv | 2023-12-04 | [Github](https:\u002F\u002Fgithub.com\u002FRenShuhuai-Andy\u002FTimeChat) | Local Demo | \n| 
![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmu-cai\u002Fvip-llava.svg?style=social&label=Star) \u003Cbr> [**Making Large Multimodal Models Understand Arbitrary Visual Prompts**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.00784.pdf) \u003Cbr> | CVPR | 2023-12-01 | [Github](https:\u002F\u002Fgithub.com\u002Fmu-cai\u002Fvip-llava) | [Demo](https:\u002F\u002Fpages.cs.wisc.edu\u002F~mucai\u002Fvip-llava.html) | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fvlm-driver\u002FDolphins.svg?style=social&label=Star) \u003Cbr> [**Dolphins: Multimodal Language Model for Driving**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.00438.pdf) \u003Cbr> | arXiv | 2023-12-01 | [Github](https:\u002F\u002Fgithub.com\u002Fvlm-driver\u002FDolphins) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpen3DA\u002FLL3DA.svg?style=social&label=Star) \u003Cbr> [**LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.18651.pdf) \u003Cbr> | arXiv | 2023-11-30 | [Github](https:\u002F\u002Fgithub.com\u002FOpen3DA\u002FLL3DA) | Coming soon |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fhuangb23\u002FVTimeLLM.svg?style=social&label=Star) \u003Cbr> [**VTimeLLM: Empower LLM to Grasp Video Moments**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.18445.pdf) \u003Cbr> | arXiv | 2023-11-30 | [Github](https:\u002F\u002Fgithub.com\u002Fhuangb23\u002FVTimeLLM\u002F) | Local Demo |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FX-PLUG\u002FmPLUG-DocOwl.svg?style=social&label=Star) \u003Cbr> [**mPLUG-PaperOwl: Scientific Diagram Analysis with the Multimodal Large Language Model**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.18248.pdf) \u003Cbr> | arXiv | 2023-11-30 | [Github](https:\u002F\u002Fgithub.com\u002FX-PLUG\u002FmPLUG-DocOwl\u002Ftree\u002Fmain\u002FPaperOwl) | - |\n| 
![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdvlab-research\u002FLLaMA-VID.svg?style=social&label=Star) \u003Cbr> [**LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.17043.pdf) \u003Cbr> | arXiv | 2023-11-28 | [Github](https:\u002F\u002Fgithub.com\u002Fdvlab-research\u002FLLaMA-VID) | [Coming soon]() |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fdvlab-research\u002FLLMGA.svg?style=social&label=Star) \u003Cbr> [**LLMGA: Multimodal Large Language Model based Generation Assistant**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.16500.pdf) \u003Cbr> | arXiv | 2023-11-27 | [Github](https:\u002F\u002Fgithub.com\u002Fdvlab-research\u002FLLMGA) | [Demo](https:\u002F\u002Fbaa55ef8590b623f18.gradio.live\u002F) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Ftingxueronghua\u002FChartLlama-code.svg?style=social&label=Star) \u003Cbr> [**ChartLlama: A Multimodal LLM for Chart Understanding and Generation**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.16483.pdf) \u003Cbr> | arXiv | 2023-11-27 | [Github](https:\u002F\u002Fgithub.com\u002Ftingxueronghua\u002FChartLlama-code) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FInternLM\u002FInternLM-XComposer.svg?style=social&label=Star) \u003Cbr> [**ShareGPT4V: Improving Large Multi-Modal Models with Better Captions**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.12793.pdf) \u003Cbr> | arXiv | 2023-11-21 | [Github](https:\u002F\u002Fgithub.com\u002FInternLM\u002FInternLM-XComposer\u002Ftree\u002Fmain\u002Fprojects\u002FShareGPT4V) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FLin-Chen\u002FShareGPT4V-7B) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Frshaojimmy\u002FJiuTian.svg?style=social&label=Star) \u003Cbr> [**LION : Empowering Multimodal Large Language Model with Dual-Level Visual 
Knowledge**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.11860.pdf) \u003Cbr> | arXiv | 2023-11-20 | [Github](https:\u002F\u002Fgithub.com\u002Frshaojimmy\u002FJiuTian) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fembodied-generalist\u002Fembodied-generalist.svg?style=social&label=Star) \u003Cbr> [**An Embodied Generalist Agent in 3D World**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.12871.pdf) \u003Cbr> | arXiv | 2023-11-18 | [Github](https:\u002F\u002Fgithub.com\u002Fembodied-generalist\u002Fembodied-generalist) | [Demo](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=mlnjz4eSjB4) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FPKU-YuanGroup\u002FVideo-LLaVA.svg?style=social&label=Star) \u003Cbr> [**Video-LLaVA: Learning United Visual Representation by Alignment Before Projection**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.10122.pdf) \u003Cbr> | arXiv | 2023-11-16 | [Github](https:\u002F\u002Fgithub.com\u002FPKU-YuanGroup\u002FVideo-LLaVA) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FLanguageBind\u002FVideo-LLaVA) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FPKU-YuanGroup\u002FChat-UniVi.svg?style=social&label=Star) \u003Cbr> [**Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.08046) \u003Cbr> | CVPR | 2023-11-14 | [Github](https:\u002F\u002Fgithub.com\u002FPKU-YuanGroup\u002FChat-UniVi) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FX2FD\u002FLVIS-INSTRUCT4V.svg?style=social&label=Star) \u003Cbr> [**To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.07574.pdf) \u003Cbr> | arXiv | 2023-11-13 | [Github](https:\u002F\u002Fgithub.com\u002FX2FD\u002FLVIS-INSTRUCT4V) | - |\n| 
![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAlpha-VLLM\u002FLLaMA2-Accessory.svg?style=social&label=Star) \u003Cbr> [**SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.07575.pdf) \u003Cbr> | arXiv | 2023-11-13 | [Github](https:\u002F\u002Fgithub.com\u002FAlpha-VLLM\u002FLLaMA2-Accessory) | [Demo](http:\u002F\u002Fimagebind-llm.opengvlab.com\u002F) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FYuliang-Liu\u002FMonkey.svg?style=social&label=Star) \u003Cbr> [**Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.06607.pdf) \u003Cbr> | CVPR | 2023-11-11 | [Github](https:\u002F\u002Fgithub.com\u002FYuliang-Liu\u002FMonkey) | [Demo](http:\u002F\u002F27.17.184.224:7681\u002F) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FLLaVA-VL\u002FLLaVA-Plus-Codebase.svg?style=social&label=Star) \u003Cbr> [**LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.05437.pdf) \u003Cbr> | arXiv | 2023-11-09 | [Github](https:\u002F\u002Fgithub.com\u002FLLaVA-VL\u002FLLaVA-Plus-Codebase) | [Demo](https:\u002F\u002Fllavaplus.ngrok.io\u002F) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FNExT-ChatV\u002FNExT-Chat.svg?style=social&label=Star) \u003Cbr> [**NExT-Chat: An LMM for Chat, Detection and Segmentation**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.04498.pdf) \u003Cbr> | arXiv | 2023-11-08 | [Github](https:\u002F\u002Fgithub.com\u002FNExT-ChatV\u002FNExT-Chat) | Local Demo | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FX-PLUG\u002FmPLUG-Owl.svg?style=social&label=Star) \u003Cbr> [**mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality 
Collaboration**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.04257.pdf) \u003Cbr> | arXiv | 2023-11-07 | [Github](https:\u002F\u002Fgithub.com\u002FX-PLUG\u002FmPLUG-Owl\u002Ftree\u002Fmain\u002FmPLUG-Owl2) | [Demo](https:\u002F\u002Fmodelscope.cn\u002Fstudios\u002Fdamo\u002FmPLUG-Owl2\u002Fsummary) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FLuodian\u002FOtter.svg?style=social&label=Star) \u003Cbr> [**OtterHD: A High-Resolution Multi-modality Model**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.04219.pdf) \u003Cbr> | arXiv | 2023-11-07 | [Github](https:\u002F\u002Fgithub.com\u002FLuodian\u002FOtter) | - |\n| [**CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.03354.pdf) | arXiv | 2023-11-06 | [Coming soon]() | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmbzuai-oryx\u002FgroundingLMM.svg?style=social&label=Star) \u003Cbr> [**GLaMM: Pixel Grounding Large Multimodal Model**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.03356.pdf) \u003Cbr> | CVPR | 2023-11-06 | [Github](https:\u002F\u002Fgithub.com\u002Fmbzuai-oryx\u002FgroundingLMM) | [Demo](https:\u002F\u002Fglamm.mbzuai-oryx.ngrok.app\u002F) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FRUCAIBox\u002FComVint.svg?style=social&label=Star) \u003Cbr> [**What Makes for Good Visual Instructions? 
Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.01487.pdf) \u003Cbr> | arXiv | 2023-11-02| [Github](https:\u002F\u002Fgithub.com\u002FRUCAIBox\u002FComVint) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FVision-CAIR\u002FMiniGPT-4.svg?style=social&label=Star) \u003Cbr> [**MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.09478.pdf) \u003Cbr> | arXiv | 2023-10-14 | [Github](https:\u002F\u002Fgithub.com\u002FVision-CAIR\u002FMiniGPT-4) | Local Demo | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fbytedance\u002FSALMONN.svg?style=social&label=Star) \u003Cbr> [**SALMONN: Towards Generic Hearing Abilities for Large Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.13289) \u003Cbr> | ICLR | 2023-10-20 | [Github](https:\u002F\u002Fgithub.com\u002Fbytedance\u002FSALMONN) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fapple\u002Fml-ferret.svg?style=social&label=Star) \u003Cbr> [**Ferret: Refer and Ground Anything Anywhere at Any Granularity**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.07704.pdf) \u003Cbr> | arXiv | 2023-10-11 | [Github](https:\u002F\u002Fgithub.com\u002Fapple\u002Fml-ferret) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTHUDM\u002FCogVLM.svg?style=social&label=Star) \u003Cbr> [**CogVLM: Visual Expert For Large Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.03079.pdf) \u003Cbr> | arXiv | 2023-10-09 | [Github](https:\u002F\u002Fgithub.com\u002FTHUDM\u002FCogVLM) | [Demo](http:\u002F\u002F36.103.203.44:7861\u002F) | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fhaotian-liu\u002FLLaVA.svg?style=social&label=Star) \u003Cbr> [**Improved Baselines with Visual Instruction 
Tuning**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.03744.pdf) \u003Cbr> | arXiv | 2023-10-05 | [Github](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA) | [Demo](https:\u002F\u002Fllava.hliu.cc\u002F) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FPKU-YuanGroup\u002FLanguageBind.svg?style=social&label=Star) \u003Cbr> [**LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.01852.pdf) \u003Cbr> | ICLR | 2023-10-03 | [Github](https:\u002F\u002Fgithub.com\u002FPKU-YuanGroup\u002FLanguageBind) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FLanguageBind\u002FLanguageBind) | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FSY-Xuan\u002FPink.svg?style=social&label=Star) \u003Cbr> [**Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.00582.pdf) \u003Cbr> | arXiv | 2023-10-01 | [Github](https:\u002F\u002Fgithub.com\u002FSY-Xuan\u002FPink) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fthunlp\u002FMuffin.svg?style=social&label=Star) \u003Cbr> [**Reformulating Vision-Language Foundation Models and Datasets Towards Universal Multimodal Assistants**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.00653.pdf) \u003Cbr> | arXiv | 2023-10-01 | [Github](https:\u002F\u002Fgithub.com\u002Fthunlp\u002FMuffin) | Local Demo | \n| [**AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2309.16058.pdf) | arXiv | 2023-09-27 | - | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FInternLM\u002FInternLM-XComposer.svg?style=social&label=Star) \u003Cbr> [**InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2309.15112.pdf) \u003Cbr> | arXiv | 
2023-09-26 | [Github](https:\u002F\u002Fgithub.com\u002FInternLM\u002FInternLM-XComposer) | Local Demo |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FRunpeiDong\u002FDreamLLM.svg?style=social&label=Star) \u003Cbr> [**DreamLLM: Synergistic Multimodal Comprehension and Creation**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2309.11499.pdf) \u003Cbr> | ICLR | 2023-09-20 | [Github](https:\u002F\u002Fgithub.com\u002FRunpeiDong\u002FDreamLLM) | [Coming soon]() |\n| [**An Empirical Study of Scaling Instruction-Tuned Large Multimodal Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2309.09958.pdf) | arXiv | 2023-09-18 | [Coming soon]() | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FSihengLi99\u002FTextBind.svg?style=social&label=Star) \u003Cbr> [**TextBind: Multi-turn Interleaved Multimodal Instruction-following**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2309.08637.pdf) \u003Cbr> | arXiv | 2023-09-14 | [Github](https:\u002F\u002Fgithub.com\u002FSihengLi99\u002FTextBind) | [Demo](https:\u002F\u002Failabnlp.tencent.com\u002Fresearch_demos\u002Ftextbind\u002F) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FNExT-GPT\u002FNExT-GPT.svg?style=social&label=Star) \u003Cbr> [**NExT-GPT: Any-to-Any Multimodal LLM**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2309.05519.pdf) \u003Cbr> | arXiv | 2023-09-11 | [Github](https:\u002F\u002Fgithub.com\u002FNExT-GPT\u002FNExT-GPT) | [Demo](https:\u002F\u002Ffc7a82a1c76b336b6f.gradio.live\u002F) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FUCSC-VLAA\u002FSight-Beyond-Text.svg?style=social&label=Star) \u003Cbr> [**Sight Beyond Text: Multi-Modal Training Enhances LLMs in Truthfulness and Ethics**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2309.07120.pdf) \u003Cbr> | arXiv | 2023-09-13 | [Github](https:\u002F\u002Fgithub.com\u002FUCSC-VLAA\u002FSight-Beyond-Text) | - |\n| 
![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenGVLab\u002FLLaMA-Adapter.svg?style=social&label=Star) \u003Cbr> [**ImageBind-LLM: Multi-modality Instruction Tuning**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2309.03905.pdf) \u003Cbr> | arXiv | 2023-09-07 | [Github](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FLLaMA-Adapter) | [Demo](http:\u002F\u002Fimagebind-llm.opengvlab.com\u002F) |\n| [**Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2309.02591.pdf) | arXiv | 2023-09-05 | - | - | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenRobotLab\u002FPointLLM.svg?style=social&label=Star) \u003Cbr> [**PointLLM: Empowering Large Language Models to Understand Point Clouds**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2308.16911.pdf) \u003Cbr> | arXiv | 2023-08-31 | [Github](https:\u002F\u002Fgithub.com\u002FOpenRobotLab\u002FPointLLM) | [Demo](http:\u002F\u002F101.230.144.196\u002F) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FHYPJUDY\u002FSparkles.svg?style=social&label=Star) \u003Cbr> [**✨Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2308.16463.pdf) \u003Cbr> | arXiv | 2023-08-31 | [Github](https:\u002F\u002Fgithub.com\u002FHYPJUDY\u002FSparkles) | Local Demo |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fopendatalab\u002FMLLM-DataEngine.svg?style=social&label=Star) \u003Cbr> [**MLLM-DataEngine: An Iterative Refinement Approach for MLLM**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2308.13566.pdf) \u003Cbr> | arXiv | 2023-08-25 | [Github](https:\u002F\u002Fgithub.com\u002Fopendatalab\u002FMLLM-DataEngine) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FPVIT-official\u002FPVIT.svg?style=social&label=Star) \u003Cbr> [**Position-Enhanced Visual Instruction Tuning for 
Multimodal Large Language Models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2308.13437.pdf) \u003Cbr> | arXiv | 2023-08-25 | [Github](https:\u002F\u002Fgithub.com\u002FPVIT-official\u002FPVIT) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FPVIT\u002Fpvit) |  \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FQwenLM\u002FQwen-VL.svg?style=social&label=Star) \u003Cbr> [**Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2308.12966.pdf) \u003Cbr> | arXiv | 2023-08-24 | [Github](https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen-VL) | [Demo](https:\u002F\u002Fmodelscope.cn\u002Fstudios\u002Fqwen\u002FQwen-VL-Chat-Demo\u002Fsummary) | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FOpenBMB\u002FVisCPM.svg?style=social&label=Star) \u003Cbr> [**Large Multilingual Models Pivot Zero-Shot Multimodal Learning across Languages**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2308.12038.pdf) \u003Cbr> | ICLR | 2023-08-23 | [Github](https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FVisCPM) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fopenbmb\u002Fviscpm-chat) | \n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Ficoz69\u002FStableLLAVA.svg?style=social&label=Star) \u003Cbr> [**StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2308.10253.pdf) \u003Cbr> | arXiv | 2023-08-20 | [Github](https:\u002F\u002Fgithub.com\u002Ficoz69\u002FStableLLAVA) | - |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmlpc-ucsd\u002FBLIVA.svg?style=social&label=Star) \u003Cbr> [**BLIVA: A Simple Multimodal LLM for Better Handling of Text-rich Visual Questions**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2308.09936.pdf) \u003Cbr> | arXiv | 2023-08-19 | [Github](https:\u002F\u002Fgithub.com\u002Fmlpc-ucsd\u002FBLIVA) | 
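[Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fmlpc-lab\u002FBLIVA) |\n| ![Star](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FDCDmllm\u002FCheetah.svg?style=social&label=Star) \u003Cbr> [**Fine-tuning Multimodal LLMs to Follow Zero-shot Demonstrative Instructions**","This project tracks the latest advances in multimodal large language models, including related research, benchmarks, and practical applications. Its core coverage spans multimodal understanding and generation, in-context learning, and instruction tuning, and it provides detailed evaluation benchmarks such as the MME and Video-MME series. Technically, the project emphasizes joint processing of visual and language data, as well as performance in real-time interactive scenarios. It suits researchers, developers, and organizations that need cross-modal information processing, especially when building or optimizing AI systems that handle multiple data types such as images, text, and even speech.",2,"2026-06-11 03:43:17","high_star"]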