[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-2276":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":16,"stars7d":16,"stars30d":16,"stars90d":16,"forks30d":16,"starsTrendScore":16,"compositeScore":17,"rankGlobal":10,"rankLanguage":10,"license":18,"archived":19,"fork":19,"defaultBranch":20,"hasWiki":19,"hasPages":19,"topics":21,"createdAt":10,"pushedAt":10,"updatedAt":25,"readmeContent":26,"aiSummary":27,"trendingCount":16,"starSnapshotCount":16,"syncStatus":28,"lastSyncTime":10,"discoverSource":29},2276,"MiniCPM-o","OpenBMB\u002FMiniCPM-o","OpenBMB","A Gemini 2.5 Flash Level MLLM for Vision, Speech, and Full-Duplex Multimodal Live Streaming on Your Phone","",null,"Python",24527,1908,24501,45,0,44.84,"Apache License 2.0",false,"main",[22,23,24],"minicpm","minicpm-v","multi-modal","2026-06-12 02:00:39","\u003Cdiv align=\"center\">\n\n\u003Cimg src=\".\u002Fassets\u002Fminicpm_v_and_minicpm_o_title.png\" width=\"500em\" >\u003C\u002Fimg> \n\n**A Gemini 2.5 Flash Level MLLM for Vision, Speech, and Full-Duplex Multimodal Live Streaming on Your Phone**\n\n  \u003Cstrong>[中文](.\u002FREADME_zh.md) |\n  English\u003C\u002Fstrong>\n\n\n\n\u003Cspan style=\"display: inline-flex; align-items: center; margin-right: 2px;\">\n  \u003Cimg src=\".\u002Fassets\u002Fwechat.png\" alt=\"WeChat\" style=\"margin-right: 4px;\">\n  \u003Ca href=\"docs\u002Fwechat.md\" target=\"_blank\"> WeChat\u003C\u002Fa> &nbsp;|\n\u003C\u002Fspan>\n&nbsp;\n\u003Cspan style=\"display: inline-flex; align-items: center; margin-left: -8px;\">\n\u003Cimg src=\".\u002Fassets\u002Fdiscord.png\" alt=\"Discord\" style=\"margin-right: 4px;\">\n  \u003Ca href=\"https:\u002F\u002Fdiscord.gg\u002FN2RnxGdJ\" target=\"_blank\"> Discord\u003C\u002Fa> &nbsp;\n\u003C\u002Fspan>\n\n\n\n\u003Cp 
align=\"center\">\n   MiniCPM-o 4.5 \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM-o-4_5\">🤗\u003C\u002Fa> \u003Ca href=\"https:\u002F\u002Fopenbmb.github.io\u002FMiniCPM-o-Demo\u002F\">📞\u003C\u002Fa> \u003Ca href=\"http:\u002F\u002F211.93.21.133:18121\u002F\">🤖\u003C\u002Fa> | MiniCPM-V 4.0 \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM-V-4\">🤗\u003C\u002Fa>  | \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FOpenSQZ\u002FMiniCPM-V-Cookbook\">🍳 Cookbook\u003C\u002Fa> | \u003Ca href=\"https:\u002F\u002Fopenbmb.github.io\u002Fminicpm-o-4_5\u002F\">Audio Casebook\u003C\u002Fa> | \u003Ca href=\"https:\u002F\u002Fopenbmb.github.io\u002Fminicpm-o-4_5-omni\u002F\">Omni Full-Duplex Casebook\u003C\u002Fa>\n\u003C\u002Fp>\n\n\u003C\u002Fdiv>\n\n**MiniCPM-o** is the latest series of on-device multimodal LLMs (MLLMs) upgraded from MiniCPM-V. The models can now take image, video, text, and audio as inputs and provide high-quality text and speech outputs in an end-to-end fashion. The model series is designed for **strong performance and efficient deployment**. The most notable models in the series currently include:\n\n\n- **MiniCPM-o 4.5**: 🔥🔥🔥 The latest and most capable model in the series. With a total of 9B parameters, this end-to-end model **approaches Gemini 2.5 Flash in vision, speech, and full-duplex multimodal live streaming**, making it one of the most versatile and performant models in the open-source community. The new full-duplex multimodal live streaming capability means that the output streams (speech and text) and the real-time input streams (video and audio) do not block each other. This **enables MiniCPM-o 4.5 to see, listen, and speak simultaneously** in a real-time omnimodal conversation, and perform **proactive interactions** such as proactive reminding. 
The improved voice mode supports bilingual real-time speech conversation in a more natural, expressive, and stable way, and also allows for voice cloning. It also advances MiniCPM-V's visual capabilities such as strong OCR capability, trustworthy behavior, and multilingual support. We also roll out a **high-performing llama.cpp-omni inference framework together with a WebRTC Demo**, making this full-duplex multimodal live streaming experience available on local devices such as Macs.\n\n- **MiniCPM-V 4.0**: ⭐️⭐️⭐️ An efficient model in the MiniCPM-V series. With a total of 4B parameters, the model surpasses GPT-4.1-mini-20250414 in image understanding on the OpenCompass evaluation. With its small parameter size and efficient architecture, MiniCPM-V 4.0 is an ideal choice for on-device deployment on the phone.\n\n\n\n\n## News \u003C!-- omit in toc -->\n\n#### 📌 Pinned\n\n> [!NOTE]\n> [2026.02.06] 🥳 🥳 🥳 We open-sourced a realtime web demo deployable on your own devices like Mac or GPU. [Try it now](#deploy-a-realtime-web-demo-on-your-own-device)!\n\n\n* [2026.02.05] 📢📢📢 We note the web demo may experience latency issues due to network conditions. We are working actively to provide a Docker image for local deployment of the real-time interactive Demo as soon as possible. Please stay tuned!\n\n* [2026.02.03] 🔥🔥🔥 We open-source MiniCPM-o 4.5, which matches Gemini 2.5 Flash on vision and speech, and supports full-duplex multimodal live streaming. Try it now!\n\n\n* [2025.09.18] 📢📢📢 MiniCPM-V 4.5 technical report is now released! See [here](.\u002Fdocs\u002FMiniCPM_V_4_5_Technical_Report.pdf).\n\n* [2025.08.26] 🔥🔥🔥 We open-source MiniCPM-V 4.5, which outperforms GPT-4o-latest, Gemini-2.0 Pro, and Qwen2.5-VL 72B. It advances popular capabilities of MiniCPM-V, and brings useful new features. Try it now!\n\n* [2025.08.01] ⭐️⭐️⭐️ We open-sourced the [MiniCPM-V & o Cookbook](https:\u002F\u002Fgithub.com\u002FOpenSQZ\u002FMiniCPM-V-CookBook)! 
It provides comprehensive guides for diverse user scenarios, paired with our new [Docs Site](https:\u002F\u002Fminicpm-o.readthedocs.io\u002Fen\u002Flatest\u002Findex.html) for smoother onboarding.\n\n* [2025.03.01] 🚀🚀🚀 RLAIF-V, the alignment technique of MiniCPM-o, is accepted by CVPR 2025 Highlights！The [code](https:\u002F\u002Fgithub.com\u002FRLHF-V\u002FRLAIF-V), [dataset](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fopenbmb\u002FRLAIF-V-Dataset), [paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.17220) are open-sourced!\n\n* [2025.01.24] 📢📢📢 MiniCPM-o 2.6 technical report is released! See [here](https:\u002F\u002Fopenbmb.notion.site\u002FMiniCPM-o-2-6-A-GPT-4o-Level-MLLM-for-Vision-Speech-and-Multimodal-Live-Streaming-on-Your-Phone-185ede1b7a558042b5d5e45e6b237da9).\n\n* [2025.01.19] ⭐️⭐️⭐️ MiniCPM-o tops GitHub Trending and reaches top-2 on Hugging Face Trending!\n\n\n* [2024.05.23] 🔥🔥🔥 MiniCPM-V tops GitHub Trending and Hugging Face Trending! Our demo, recommended by Hugging Face Gradio’s official account, is available [here](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fopenbmb\u002FMiniCPM-Llama3-V-2_5). Come and try it out!\n\n\u003Cbr>\n\n\u003Cdetails> \n\u003Csummary>Click to view more news.\u003C\u002Fsummary>\n\n* [2025.09.01] ⭐️⭐️⭐️ MiniCPM-V 4.5 has been officially supported by [llama.cpp](https:\u002F\u002Fgithub.com\u002Fggml-org\u002Fllama.cpp\u002Fpull\u002F15575), [vLLM](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fvllm\u002Fpull\u002F23586), and [LLaMA-Factory](https:\u002F\u002Fgithub.com\u002Fhiyouga\u002FLLaMA-Factory\u002Fpull\u002F9022). You are welcome to use it directly through these official channels! 
Support for additional frameworks such as [Ollama](https:\u002F\u002Fgithub.com\u002Follama\u002Follama\u002Fpull\u002F12078) and [SGLang](https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang\u002Fpull\u002F9610) is actively in progress.\n* [2025.08.02] 🚀🚀🚀 We open-source MiniCPM-V 4.0, which outperforms GPT-4.1-mini-20250414 in image understanding. It advances popular features of MiniCPM-V 2.6, and largely improves the efficiency. We also open-source the iOS App on iPhone and iPad. Try it now!\n* [2025.06.20] ⭐️⭐️⭐️ Our official [Ollama repository](https:\u002F\u002Follama.com\u002Fopenbmb) is released. Try our latest models with [one click](https:\u002F\u002Follama.com\u002Fopenbmb\u002Fminicpm-o2.6)！\n* [2025.01.23] 💡💡💡 MiniCPM-o 2.6 is now supported by [Align-Anything](https:\u002F\u002Fgithub.com\u002FPKU-Alignment\u002Falign-anything), a framework by PKU-Alignment Team for aligning any-to-any modality large models with human intentions. It supports DPO and SFT fine-tuning on both vision and audio. Try it now!\n* [2025.01.19] 📢 **ATTENTION!** We are currently working on merging MiniCPM-o 2.6 into the official repositories of llama.cpp, Ollama, and vllm. Until the merge is complete, please USE OUR LOCAL FORKS of [llama.cpp](https:\u002F\u002Fgithub.com\u002FOpenBMB\u002Fllama.cpp\u002Fblob\u002Fminicpm-omni\u002Fexamples\u002Fllava\u002FREADME-minicpmo2.6.md), [Ollama](https:\u002F\u002Fgithub.com\u002FOpenBMB\u002Follama\u002Fblob\u002Fminicpm-v2.6\u002Fexamples\u002Fminicpm-v2.6\u002FREADME.md), and [vllm](https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FMiniCPM-o?tab=readme-ov-file#efficient-inference-with-llamacpp-ollama-vllm). **Using the official repositories before the merge may lead to unexpected issues**.\n* [2025.01.17] We have updated the usage of MiniCPM-o 2.6 int4 quantization version and resolved the model initialization error. 
Click [here](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM-o-2_6-int4) and try it now!\n* [2025.01.13] 🔥🔥🔥 We open-source MiniCPM-o 2.6, which matches GPT-4o-202405 on vision, speech and multimodal live streaming. It advances popular capabilities of MiniCPM-V 2.6, and supports various new fun features. Try it now!\n* [2024.08.15] We now also support multi-image SFT. For more details, please refer to the [document](https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FMiniCPM-V\u002Ftree\u002Fmain\u002Ffinetune).\n* [2024.08.14] MiniCPM-V 2.6 now also supports [fine-tuning](https:\u002F\u002Fgithub.com\u002Fmodelscope\u002Fms-swift\u002Fissues\u002F1613) with the SWIFT framework!\n* [2024.08.17] 🚀🚀🚀 MiniCPM-V 2.6 is now fully supported by [official](https:\u002F\u002Fgithub.com\u002Fggerganov\u002Fllama.cpp) llama.cpp! GGUF models of various sizes are available [here](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM-V-2_6-gguf).\n* [2024.08.10] 🚀🚀🚀 MiniCPM-Llama3-V 2.5 is now fully supported by [official](https:\u002F\u002Fgithub.com\u002Fggerganov\u002Fllama.cpp) llama.cpp! GGUF models of various sizes are available [here](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM-Llama3-V-2_5-gguf).\n* [2024.08.06] 🔥🔥🔥 We open-source MiniCPM-V 2.6, which outperforms GPT-4V on single image, multi-image and video understanding. It advances popular features of MiniCPM-Llama3-V 2.5, and can support real-time video understanding on iPad. Try it now!\n* [2024.08.03] MiniCPM-Llama3-V 2.5 technical report is released! See [here](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.01800).\n* [2024.07.19] MiniCPM-Llama3-V 2.5 supports vLLM now! See [here](#inference-with-vllm).\n\n* [2024.06.03] Now, you can run MiniCPM-Llama3-V 2.5 on multiple low VRAM GPUs(12 GB or 16 GB) by distributing the model's layers across multiple GPUs. 
For more details, check this [link](https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FMiniCPM-V\u002Fblob\u002Fmain\u002Fdocs\u002Finference_on_multiple_gpus.md).\n* [2024.05.28] 🚀🚀🚀 MiniCPM-Llama3-V 2.5 now fully supports its feature in llama.cpp and Ollama! Please pull the latest code **of our provided forks** ([llama.cpp](https:\u002F\u002Fgithub.com\u002FOpenBMB\u002Fllama.cpp\u002Fblob\u002Fminicpm-v2.5\u002Fexamples\u002Fminicpmv\u002FREADME.md), [Ollama](https:\u002F\u002Fgithub.com\u002FOpenBMB\u002Follama\u002Ftree\u002Fminicpm-v2.5\u002Fexamples\u002Fminicpm-v2.5)). GGUF models in various sizes are available [here](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM-Llama3-V-2_5-gguf\u002Ftree\u002Fmain). MiniCPM-Llama3-V 2.5 series is **not supported by the official repositories yet**, and we are working hard to merge PRs. Please stay tuned!\n\n* [2024.05.28] 💫 We now support LoRA fine-tuning for MiniCPM-Llama3-V 2.5, using only 2 V100 GPUs! See more statistics [here](https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FMiniCPM-V\u002Ftree\u002Fmain\u002Ffinetune#model-fine-tuning-memory-usage-statistics).\n\n* [2024.05.25] MiniCPM-Llama3-V 2.5 now supports streaming outputs and customized system prompts. Try it [here](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM-Llama3-V-2_5#usage)!\n* [2024.05.24] We release the MiniCPM-Llama3-V 2.5 [gguf](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM-Llama3-V-2_5-gguf), which supports [llama.cpp](#inference-with-llamacpp) inference and provides a 6~8 token\u002Fs smooth decoding on mobile phones. Try it now!\n\n* [2024.05.23] 🔍 We've released a comprehensive comparison between Phi-3-vision-128k-instruct and MiniCPM-Llama3-V 2.5, including benchmark evaluations, multilingual capabilities, and inference efficiency 🌟📊🌍🚀. 
Click [here](.\u002Fdocs\u002Fcompare_with_phi-3_vision.md) to view more details.\n\n* [2024.05.20] We open-source MiniCPM-Llama3-V 2.5, which has improved OCR capability and supports 30+ languages, making it the first end-side MLLM to achieve GPT-4V level performance! We provide [efficient inference](#deployment-on-mobile-phone) and [simple fine-tuning](.\u002Ffinetune\u002Freadme.md). Try it now!\n* [2024.04.23] MiniCPM-V-2.0 supports vLLM now! Click [here](#inference-with-vllm) to view more details.\n* [2024.04.18] We created a Hugging Face Space to host the demo of MiniCPM-V 2.0 [here](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fopenbmb\u002FMiniCPM-V-2)!\n* [2024.04.17] MiniCPM-V-2.0 supports deploying [WebUI Demo](#webui-demo) now!\n* [2024.04.15] MiniCPM-V-2.0 now also supports [fine-tuning](https:\u002F\u002Fgithub.com\u002Fmodelscope\u002Fswift\u002Fblob\u002Fmain\u002Fdocs\u002Fsource\u002FMulti-Modal\u002Fminicpm-v-2最佳实践.md) with the SWIFT framework!\n* [2024.04.12] We open-source MiniCPM-V 2.0, which achieves comparable performance with Gemini Pro in understanding scene text and outperforms strong Qwen-VL-Chat 9.6B and Yi-VL 34B on \u003Ca href=\"https:\u002F\u002Frank.opencompass.org.cn\u002Fleaderboard-multimodal\">OpenCompass\u003C\u002Fa>, a comprehensive evaluation over 11 popular benchmarks. Click \u003Ca href=\"https:\u002F\u002Fopenbmb.vercel.app\u002Fminicpm-v-2\">here\u003C\u002Fa> to view the MiniCPM-V 2.0 technical blog.\n* [2024.03.14] MiniCPM-V now supports [fine-tuning](https:\u002F\u002Fgithub.com\u002Fmodelscope\u002Fswift\u002Fblob\u002Fmain\u002Fdocs\u002Fsource\u002FMulti-Modal\u002Fminicpm-v最佳实践.md) with the SWIFT framework. 
Thanks to [Jintao](https:\u002F\u002Fgithub.com\u002FJintao-Huang) for the contribution!\n* [2024.03.01] MiniCPM-V can now be deployed on Mac!\n* [2024.02.01] We open-source MiniCPM-V and OmniLMM-12B, which support efficient end-side deployment and powerful multimodal capabilities, respectively.\n\u003C\u002Fdetails> \n\n\n## Contents \u003C!-- omit in toc -->\n\n\n- [MiniCPM-o 4.5](#minicpm-o-45)\n- [Offline Inference Examples with Transformers](#offline-inference-examples-with-transformers)\n  - [Model Initialization](#model-initialization)\n  - [Duplex Omni Mode](#duplex-omni-mode)\n  - [Half-Duplex Omni Mode](#half-duplex-omni-mode)\n  - [Half-Duplex Realtime Speech Conversation Mode](#half-duplex-realtime-speech-conversation-mode)\n  - [Visual Understanding](#visual-understanding)\n  - [Structured Content Input](#structured-content-input)\n- [Deploy a Realtime Web Demo on Your Own Device](#deploy-a-realtime-web-demo-on-your-own-device)\n- [Supported Frameworks](#supported-frameworks)\n  - [FlagOS](#flagos)\n  - [vLLM, SGLang, llama.cpp, Ollama](#vllm-sglang-llamacpp-ollama)\n  - [LLaMA-Factory, SWIFT](#llama-factory-swift)\n- [MiniCPM-V \\& o Cookbook](#minicpm-v--o-cookbook)\n- [MiniCPM-V 4.0](#minicpm-v-40)\n- [Model Zoo](#model-zoo)\n- [Awesome work using MiniCPM-V \\& MiniCPM-o](#awesome-work-using-minicpm-v--minicpm-o)\n- [Limitations](#limitations)\n- [Acknowledgements](#acknowledgements)\n\n\n## MiniCPM-o 4.5\n\n**MiniCPM-o 4.5** is the latest and most capable model in the MiniCPM-o series. The model is built in an end-to-end fashion based on SigLip2, Whisper-medium, CosyVoice2, and Qwen3-8B with a total of 9B parameters. It exhibits a significant performance improvement, and introduces new features for full-duplex multimodal live streaming. Notable features of MiniCPM-o 4.5 include:\n\n- 🔥 **Leading Visual Capability.**\n  MiniCPM-o 4.5 achieves an average score of 77.6 on OpenCompass, a comprehensive evaluation of 8 popular benchmarks. 
**With only 9B parameters, it surpasses widely used proprietary models like GPT-4o and Gemini 2.0 Pro, and approaches Gemini 2.5 Flash** in vision-language capabilities. It supports instruct and thinking modes in a single model, better covering efficiency and performance trade-offs in different user scenarios.\n\n- 🎙 **Strong Speech Capability.** \n  MiniCPM-o 4.5 supports **bilingual real-time speech conversation with configurable voices** in English and Chinese. It features **more natural, expressive and stable speech conversation**. The model also allows for fun features such as **voice cloning and role play via a simple reference audio clip**, where the cloning performance surpasses strong TTS tools such as CosyVoice2.\n\n- 🎬 **New Full-Duplex and Proactive Multimodal Live Streaming Capability.** \n  As a new feature, MiniCPM-o 4.5 can process real-time, continuous video and audio input streams simultaneously while generating concurrent text and speech output streams in an end-to-end fashion, without mutual blocking. This **allows MiniCPM-o 4.5 to see, listen, and speak simultaneously**, creating a fluid, real-time omnimodal conversation experience. Beyond reactive responses, the model can also perform **proactive interaction**, such as initiating reminders or comments based on its continuous understanding of the live scene. \n\n- 💪 **Strong OCR Capability, Efficiency and Others.**\nAdvancing popular visual capabilities from the MiniCPM-V series, MiniCPM-o 4.5 can process **high-resolution images** (up to 1.8 million pixels) and **high-FPS videos** (up to 10fps) in any aspect ratio efficiently. It achieves **state-of-the-art performance for end-to-end English document parsing** on OmniDocBench, outperforming proprietary models such as Gemini-3 Flash and GPT-5, and specialized tools such as DeepSeek-OCR 2. 
It also features **trustworthy behaviors**, matching Gemini 2.5 Flash on MMHal-Bench, and supports **multilingual capabilities** in more than 30 languages.\n\n-  💫  **Easy Usage.**\n  MiniCPM-o 4.5 can be easily used in various ways:  **Basic usage, recommended for 100% precision:** PyTorch inference with Nvidia GPU. **Other end-side adaptation** includes (1) llama.cpp and Ollama support for efficient CPU inference on local devices, (2) int4 and GGUF format quantized models in 16 sizes, (3) vLLM and SGLang support for high-throughput and memory-efficient inference, (4) FlagOS support for the unified multi-chip backend plugin. **We also open-sourced web demos** that **enable the full-duplex multimodal live streaming experience on local devices** such as GPUs and PCs (e.g., a MacBook).\n\n**Model Architecture.**\n- **End-to-end Omni-modal Architecture.** The modality encoders\u002Fdecoders and LLM are densely connected via hidden states in an end-to-end fashion. This enables better information flow and control, and also facilitates full exploitation of rich multimodal knowledge during training.\n- **Full-Duplex Omni-modal Live Streaming Mechanism.** (1) We turn the offline modality encoders\u002Fdecoders into online and full-duplex ones for streaming inputs\u002Foutputs. The speech token decoder models text and speech tokens in an interleaved fashion to support full-duplex speech generation (i.e., syncing promptly with new input). This also facilitates more stable long speech generation (e.g., > 1min).\n(2) **We sync all the input and output streams on a millisecond-level timeline**, and they are jointly modeled by a time-division multiplexing (TDM) mechanism for omni-modality streaming processing in the LLM backbone. It divides parallel omni-modality streams into sequential info groups within small periodic time slices.\n- **Proactive Interaction Mechanism.** The LLM continuously monitors the input video and audio streams, and decides at a frequency of 1Hz to speak or not. 
This high decision-making frequency, together with the full-duplex nature, is crucial for enabling the proactive interaction capability.\n- **Configurable Speech Modeling Design.** We inherit the multimodal system prompt design of MiniCPM-o 2.6, which includes a traditional text system prompt, and a new audio system prompt to determine the assistant voice. This enables cloning new voices and role play at inference time for speech conversation.\n\n\n\n\u003Cdiv align=\"center\">\n  \u003Cimg src=\".\u002Fassets\u002Fminicpm-o-45-framework.png\", width=100%>\n\u003C\u002Fdiv>\n\n\n### Evaluation  \u003C!-- omit in toc -->\n\n\n\u003Cdiv align=\"center\">\n  \u003Cimg src=\".\u002Fassets\u002Fradar_minicpmo4.5.png\", width=80%>\n\u003C\u002Fdiv>\n\n\n\u003Cdiv align=\"center\">\n  \u003Cimg src=\".\u002Fassets\u002Fminicpm_o_45_main_exp_table.png\", width=90%>\n\u003C\u002Fdiv>\n\u003Cstrong>Note\u003C\u002Fstrong>: Scores marked with ∗ are from our evaluation; others are cited from referenced reports. n\u002Fa indicates that the model does not support the corresponding modality. 
All results are reported in instruct mode\u002Fvariant.\n\n&emsp;\n\u003Cbr>\n\n\u003Cdetails>\n\u003Csummary>Click to view visual understanding results.\u003C\u002Fsummary>\n\n**Image Understanding (Instruct)**\n  \u003Cdiv align=\"center\">\n  \u003Ctable style=\"margin: 0px auto;\">\n\u003Ctr>\n  \u003Cth nowrap=\"nowrap\" align=\"left\">\u003Cb>Model\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\">\u003Cb>OpenCompass\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\">\u003Cb>MMBench EN v1.1\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\">\u003Cb>MMBench CN v1.1\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\">\u003Cb>MathVista\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\">\u003Cb>MMVet\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\">\u003Cb>MMMU\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\">\u003Cb>MMStar\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\">\u003Cb>HallusionBench\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\">\u003Cb>AI2D\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\">\u003Cb>OCRBench\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\">\u003Cb>TextVQA_VAL\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\">\u003Cb>DocVQA_VAL\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\">\u003Cb>MMT-Bench_VAL\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\">\u003Cb>MM-IFEval\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\">\u003Cb>Mantis-Eval\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\">\u003Cb>MuirBench\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\">\u003Cb>MMSI-Bench\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\">\u003Cb>MMHal-Score\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\">\u003Cb>MMHal-Hallrate↓\u003C\u002Fb>\u003C\u002Fth>\n\u003C\u002Ftr>\n\u003Ctr>\n  \u003Ctd nowrap=\"nowrap\" align=\"left\">Gemini2.5-Flash-Nonthinking\u003C\u002Ftd>\n  \u003Ctd 
align=\"center\">\u003Cb>78.5\u003C\u002Fb>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cins>86.6\u003C\u002Fins>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cins>86.0\u003C\u002Fins>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">75.3\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cins>81.4\u003C\u002Fins>\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>76.3\u003C\u002Fb>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>75.8\u003C\u002Fb>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">59.1\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>87.7\u003C\u002Fb>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">864\u003C\u002Ftd>\n  \u003Ctd align=\"center\">74.3\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">93.0\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cins>70.0\u003C\u002Fins>\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>75.8\u003Csup>*\u003C\u002Fsup>\u003C\u002Fb>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">72.8\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>74.5\u003Csup>*\u003C\u002Fsup>\u003C\u002Fb>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">12.1\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cins>4.6\u003C\u002Fins>\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>23.9\u003Csup>*\u003C\u002Fsup>\u003C\u002Fb>\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003Ctr>\n  \u003Ctd nowrap=\"nowrap\" align=\"left\">Gemini2.0-Pro\u003C\u002Ftd>\n  \u003Ctd align=\"center\">73.3\u003C\u002Ftd>\n  \u003Ctd align=\"center\">83.0\u003C\u002Ftd>\n  \u003Ctd align=\"center\">83.0\u003C\u002Ftd>\n  \u003Ctd align=\"center\">71.3\u003C\u002Ftd>\n  \u003Ctd align=\"center\">70.4\u003C\u002Ftd>\n  \u003Ctd align=\"center\">72.6\u003C\u002Ftd>\n  \u003Ctd align=\"center\">68.5\u003C\u002Ftd>\n  \u003Ctd align=\"center\">49.8\u003C\u002Ftd>\n  \u003Ctd align=\"center\">84.8\u003C\u002Ftd>\n  \u003Ctd 
align=\"center\">863\u003C\u002Ftd>\n  \u003Ctd align=\"center\">-\u003C\u002Ftd>\n  \u003Ctd align=\"center\">-\u003C\u002Ftd>\n  \u003Ctd align=\"center\">-\u003C\u002Ftd>\n  \u003Ctd align=\"center\">-\u003C\u002Ftd>\n  \u003Ctd align=\"center\">-\u003C\u002Ftd>\n  \u003Ctd align=\"center\">-\u003C\u002Ftd>\n  \u003Ctd align=\"center\">-\u003C\u002Ftd>\n  \u003Ctd align=\"center\">-\u003C\u002Ftd>\n  \u003Ctd align=\"center\">-\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003Ctr>\n  \u003Ctd nowrap=\"nowrap\" align=\"left\">GPT-4o\u003C\u002Ftd>\n  \u003Ctd align=\"center\">75.4\u003C\u002Ftd>\n  \u003Ctd align=\"center\">86.0\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cins>86.0\u003C\u002Fins>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">71.6\u003C\u002Ftd>\n  \u003Ctd align=\"center\">76.9\u003C\u002Ftd>\n  \u003Ctd align=\"center\">72.9\u003C\u002Ftd>\n  \u003Ctd align=\"center\">70.2\u003C\u002Ftd>\n  \u003Ctd align=\"center\">57.0\u003C\u002Ftd>\n  \u003Ctd align=\"center\">86.3\u003C\u002Ftd>\n  \u003Ctd align=\"center\">822\u003C\u002Ftd>\n  \u003Ctd align=\"center\">77.4\u003C\u002Ftd>\n  \u003Ctd align=\"center\">93.0\u003C\u002Ftd>\n  \u003Ctd align=\"center\">66.7\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">64.6\u003C\u002Ftd>\n  \u003Ctd align=\"center\">70.1\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">70.5\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">8.1\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">4.2\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">25.0\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003Ctr>\n  \u003Ctd nowrap=\"nowrap\" align=\"left\">InternVL-3.5-8B\u003C\u002Ftd>\n  \u003Ctd align=\"center\">75.8\u003C\u002Ftd>\n  \u003Ctd align=\"center\">79.5\u003C\u002Ftd>\n  \u003Ctd align=\"center\">80.0\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd 
align=\"center\">\u003Cins>78.4\u003C\u002Fins>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>83.1\u003C\u002Fb>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cins>73.4\u003C\u002Fins>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">69.3\u003C\u002Ftd>\n  \u003Ctd align=\"center\">54.5\u003C\u002Ftd>\n  \u003Ctd align=\"center\">84.0\u003C\u002Ftd>\n  \u003Ctd align=\"center\">840\u003C\u002Ftd>\n  \u003Ctd align=\"center\">78.2\u003C\u002Ftd>\n  \u003Ctd align=\"center\">92.3\u003C\u002Ftd>\n  \u003Ctd align=\"center\">66.7\u003C\u002Ftd>\n  \u003Ctd align=\"center\">56.3\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">70.5\u003C\u002Ftd>\n  \u003Ctd align=\"center\">55.8\u003C\u002Ftd>\n  \u003Ctd align=\"center\">-\u003C\u002Ftd>\n  \u003Ctd align=\"center\">3.8\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">34.7\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003Ctr>\n  \u003Ctd nowrap=\"nowrap\" align=\"left\">Qwen3-VL-8B-Instruct\u003C\u002Ftd>\n  \u003Ctd align=\"center\">76.5\u003C\u002Ftd>\n  \u003Ctd align=\"center\">84.5\u003C\u002Ftd>\n  \u003Ctd align=\"center\">84.7\u003C\u002Ftd>\n  \u003Ctd align=\"center\">77.2\u003C\u002Ftd>\n  \u003Ctd align=\"center\">73.7\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">69.6\u003C\u002Ftd>\n  \u003Ctd align=\"center\">70.9\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cins>61.1\u003C\u002Fins>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">85.7\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>896\u003C\u002Fb>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">82.9\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>96.1\u003C\u002Fb>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">60.9\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">59.4\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">74.2\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd 
align=\"center\">64.4\u003C\u002Ftd>\n  \u003Ctd align=\"center\">11.3\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>4.7\u003Csup>*\u003C\u002Fsup>\u003C\u002Fb>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">29.9\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003Ctr>\n  \u003Ctd nowrap=\"nowrap\" align=\"left\">Qwen3-Omni-30B-A3B-Instruct\u003C\u002Ftd>\n  \u003Ctd align=\"center\">75.7\u003C\u002Ftd>\n  \u003Ctd align=\"center\">84.9\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">84.1\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">75.9\u003C\u002Ftd>\n  \u003Ctd align=\"center\">74.8\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">69.1\u003C\u002Ftd>\n  \u003Ctd align=\"center\">68.5\u003C\u002Ftd>\n  \u003Ctd align=\"center\">59.7\u003C\u002Ftd>\n  \u003Ctd align=\"center\">85.2\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cins>880\u003C\u002Fins>\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>84.1\u003Csup>*\u003C\u002Fsup>\u003C\u002Fb>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cins>95.4\u003C\u002Fins>\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>70.4\u003Csup>*\u003C\u002Fsup>\u003C\u002Fb>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">65.7\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cins>78.3\u003C\u002Fins>\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">61.9\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cins>14.2\u003C\u002Fins>\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cins>4.6\u003C\u002Fins>\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">31.6\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003Ctr>\n  \u003Ctd nowrap=\"nowrap\" align=\"left\">MiniCPM-o 4.5-Instruct\u003C\u002Ftd>\n  \u003Ctd 
align=\"center\">\u003Cins>77.6\u003C\u002Fins>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>87.6\u003C\u002Fb>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>87.2\u003C\u002Fb>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>80.1\u003C\u002Fb>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">74.4\u003C\u002Ftd>\n  \u003Ctd align=\"center\">67.6\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cins>73.1\u003C\u002Fins>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>63.2\u003C\u002Fb>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cins>87.6\u003C\u002Fins>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">876\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cins>83.8\u003C\u002Fins>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">94.7\u003C\u002Ftd>\n  \u003Ctd align=\"center\">69.7\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cins>66.3\u003C\u002Fins>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>79.7\u003C\u002Fb>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cins>72.0\u003C\u002Fins>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>16.6\u003C\u002Fb>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>4.7\u003C\u002Fb>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cins>24.3\u003C\u002Fins>\u003C\u002Ftd>\n\u003C\u002Ftr>\n  \u003C\u002Ftable>\n  \u003C\u002Fdiv>\n\n**Image Understanding (Thinking)**\n  \u003Cdiv align=\"center\">\n  \u003Ctable style=\"margin: 0px auto;\">\n\u003Ctr>\n  \u003Cth nowrap=\"nowrap\" align=\"left\">\u003Cb>Model\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\">\u003Cb>OpenCompass\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\">\u003Cb>MMBench EN v1.1\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\">\u003Cb>MMBench CN v1.1\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\">\u003Cb>MathVista\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\">\u003Cb>MMVet\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\">\u003Cb>MMMU\u003C\u002Fb>\u003C\u002Fth>\n  
\u003Cth nowrap=\"nowrap\">\u003Cb>MMStar\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\">\u003Cb>HallusionBench\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\">\u003Cb>AI2D\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\">\u003Cb>OCRBench\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\">\u003Cb>TextVQA_VAL\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\">\u003Cb>DocVQA_VAL\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\">\u003Cb>MMT-Bench_VAL\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\">\u003Cb>MM-IFEval\u003C\u002Fb>\u003C\u002Fth>\n\u003C\u002Ftr>\n\u003Ctr>\n  \u003Ctd nowrap=\"nowrap\" align=\"left\">Gemini2.5-Flash-Thinking\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>79.9\u003C\u002Fb>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">87.1\u003C\u002Ftd>\n  \u003Ctd align=\"center\">87.3\u003C\u002Ftd>\n  \u003Ctd align=\"center\">79.4\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>81.2\u003Csup>*\u003C\u002Fsup>\u003C\u002Fb>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cins>77.7\u003C\u002Fins>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>76.5\u003C\u002Fb>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">63.5\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cins>88.7\u003C\u002Fins>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">853\u003C\u002Ftd>\n  \u003Ctd align=\"center\">73.8\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">92.8\u003C\u002Ftd>\n  \u003Ctd align=\"center\">70.7\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cins>75.7\u003C\u002Fins>\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003Ctr>\n  \u003Ctd nowrap=\"nowrap\" align=\"left\">GPT-5\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cins>79.7\u003C\u002Fins>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">85.5\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">85.6\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd 
align=\"center\">\u003Cb>81.9\u003C\u002Fb>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cins>77.6\u003C\u002Fins>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>81.8\u003C\u002Fb>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cins>75.7\u003C\u002Fins>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cins>65.2\u003C\u002Fins>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>89.5\u003C\u002Fb>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">807\u003C\u002Ftd>\n  \u003Ctd align=\"center\">77.8\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">91.3\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>72.7\u003Csup>*\u003C\u002Fsup>\u003C\u002Fb>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>83.1\u003Csup>*\u003C\u002Fsup>\u003C\u002Fb>\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003Ctr>\n  \u003Ctd nowrap=\"nowrap\" align=\"left\">Qwen3-VL-8B-Thinking\u003C\u002Ftd>\n  \u003Ctd align=\"center\">77.3\u003C\u002Ftd>\n  \u003Ctd align=\"center\">85.3\u003C\u002Ftd>\n  \u003Ctd align=\"center\">85.5\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cins>81.4\u003C\u002Fins>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">69.8\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">74.1\u003C\u002Ftd>\n  \u003Ctd align=\"center\">75.3\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>65.4\u003C\u002Fb>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">84.9\u003C\u002Ftd>\n  \u003Ctd align=\"center\">819\u003C\u002Ftd>\n  \u003Ctd align=\"center\">77.8\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>95.3\u003C\u002Fb>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">68.1\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">73.5\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003Ctr>\n  \u003Ctd nowrap=\"nowrap\" align=\"left\">Qwen3-Omni-30B-A3B-Thinking\u003C\u002Ftd>\n  \u003Ctd align=\"center\">78.5\u003C\u002Ftd>\n  \u003Ctd 
align=\"center\">\u003Cins>88.2\u003C\u002Fins>\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>87.7\u003Csup>*\u003C\u002Fsup>\u003C\u002Fb>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">80.0\u003C\u002Ftd>\n  \u003Ctd align=\"center\">74.8\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">75.6\u003C\u002Ftd>\n  \u003Ctd align=\"center\">74.9\u003C\u002Ftd>\n  \u003Ctd align=\"center\">62.8\u003C\u002Ftd>\n  \u003Ctd align=\"center\">86.1\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cins>859\u003C\u002Fins>\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>80.8\u003Csup>*\u003C\u002Fsup>\u003C\u002Fb>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cins>94.2\u003C\u002Fins>\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cins>70.9\u003C\u002Fins>\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">69.9\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003Ctr>\n  \u003Ctd nowrap=\"nowrap\" align=\"left\">MiniCPM-o 4.5-Thinking\u003C\u002Ftd>\n  \u003Ctd align=\"center\">78.2\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>89.0\u003C\u002Fb>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cins>87.6\u003C\u002Fins>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">81.0\u003C\u002Ftd>\n  \u003Ctd align=\"center\">73.6\u003C\u002Ftd>\n  \u003Ctd align=\"center\">70.2\u003C\u002Ftd>\n  \u003Ctd align=\"center\">73.6\u003C\u002Ftd>\n  \u003Ctd align=\"center\">62.6\u003C\u002Ftd>\n  \u003Ctd align=\"center\">88.5\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>879\u003C\u002Fb>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cins>79.8\u003C\u002Fins>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">92.3\u003C\u002Ftd>\n  \u003Ctd align=\"center\">69.7\u003C\u002Ftd>\n  \u003Ctd align=\"center\">68.2\u003C\u002Ftd>\n\u003C\u002Ftr>\n  \u003C\u002Ftable>\n  \u003C\u002Fdiv>\n\n**Video Understanding**\n  \u003Cdiv align=\"center\">\n  
\u003Ctable style=\"margin: 0px auto;\">\n\u003Ctr>\n  \u003Cth nowrap=\"nowrap\" align=\"left\">\u003Cb>Model\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\">\u003Cb>Video-MME\u003Cbr>(w\u002Fo subs)\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\">\u003Cb>LVBench\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\">\u003Cb>MLVU\u003Cbr>(M-Avg)\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\">\u003Cb>LongVideoBench\u003Cbr>(val)\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\">\u003Cb>MotionBench\u003C\u002Fb>\u003C\u002Fth>\n\u003C\u002Ftr>\n\u003Ctr>\n  \u003Ctd nowrap=\"nowrap\" align=\"left\">Gemini2.5-Flash-Nonthinking\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>75.6\u003C\u002Fb>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>62.2\u003C\u002Fb>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>77.8\u003C\u002Fb>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">-\u003C\u002Ftd>\n  \u003Ctd align=\"center\">-\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003Ctr>\n  \u003Ctd nowrap=\"nowrap\" align=\"left\">InternVL-3.5-8B\u003C\u002Ftd>\n  \u003Ctd align=\"center\">66.0\u003C\u002Ftd>\n  \u003Ctd align=\"center\">-\u003C\u002Ftd>\n  \u003Ctd align=\"center\">70.2\u003C\u002Ftd>\n  \u003Ctd align=\"center\">62.1\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>62.3\u003Csup>*\u003C\u002Fsup>\u003C\u002Fb>\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003Ctr>\n  \u003Ctd nowrap=\"nowrap\" align=\"left\">Qwen3-Omni-30B-A3B-Instruct\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cins>70.5\u003C\u002Fins>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">50.2\u003C\u002Ftd>\n  \u003Ctd align=\"center\">75.2\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>66.9\u003Csup>*\u003C\u002Fsup>\u003C\u002Fb>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cins>61.7\u003C\u002Fins>\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003Ctr>\n  \u003Ctd nowrap=\"nowrap\" align=\"left\">MiniCPM-o 4.5-Instruct\u003C\u002Ftd>\n  
\u003Ctd align=\"center\">70.4\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cins>50.9\u003C\u002Fins>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cins>76.5\u003C\u002Fins>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cins>66.0\u003C\u002Fins>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">61.4\u003C\u002Ftd>\n\u003C\u002Ftr>\n  \u003C\u002Ftable>\n  \u003C\u002Fdiv>\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>Click to view document parsing results.\u003C\u002Fsummary>\n\n**OmniDocBench**\n  \u003Cdiv align=\"center\">\n  \u003Ctable style=\"margin: 0px auto;\">\n\u003Ctr>\n  \u003Cth nowrap=\"nowrap\" align=\"left\" rowspan=\"2\">\u003Cb>Method Type\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\" rowspan=\"2\">\u003Cb>Methods\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\" colspan=\"2\">\u003Cb>OverallEdit↓\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\" colspan=\"2\">\u003Cb>TextEdit↓\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\" colspan=\"2\">\u003Cb>FormulaEdit↓\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\" colspan=\"2\">\u003Cb>TableTEDS↑\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\" colspan=\"2\">\u003Cb>TableEdit↓\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\" colspan=\"2\">\u003Cb>Read OrderEdit↓\u003C\u002Fb>\u003C\u002Fth>\n\u003C\u002Ftr>\n\u003Ctr>\n  \u003Cth nowrap=\"nowrap\">\u003Cb>EN\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\">\u003Cb>ZH\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\">\u003Cb>EN\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\">\u003Cb>ZH\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\">\u003Cb>EN\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\">\u003Cb>ZH\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\">\u003Cb>EN\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\">\u003Cb>ZH\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth 
nowrap=\"nowrap\">\u003Cb>EN\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\">\u003Cb>ZH\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\">\u003Cb>EN\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\">\u003Cb>ZH\u003C\u002Fb>\u003C\u002Fth>\n\u003C\u002Ftr>\n\u003Ctr>\n  \u003Ctd nowrap=\"nowrap\" align=\"left\" rowspan=\"2\">Pipeline\u003C\u002Ftd>\n  \u003Ctd nowrap=\"nowrap\" align=\"center\">MinerU 2.5\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.117\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.172\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.051\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.08\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cins>0.256\u003C\u002Fins>\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.455\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">85.9\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">89.4\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.115\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.081\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.047\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.072\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003Ctr>\n  \u003Ctd nowrap=\"nowrap\" align=\"center\">PaddleOCR-VL\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>0.105\u003C\u002Fb>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cins>0.126\u003C\u002Fins>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cins>0.041\u003C\u002Fins>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>0.062\u003C\u002Fb>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>0.241\u003C\u002Fb>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>0.316\u003C\u002Fb>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">88\u003C\u002Ftd>\n  
\u003Ctd align=\"center\">\u003Cins>92.1\u003C\u002Fins>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cins>0.093\u003C\u002Fins>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cins>0.062\u003C\u002Fins>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.045\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cins>0.063\u003C\u002Fins>\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003Ctr>\n  \u003Ctd nowrap=\"nowrap\" align=\"left\">\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003Ctr>\n  \u003Ctd nowrap=\"nowrap\" align=\"left\" rowspan=\"11\">End-to-end Model\u003C\u002Ftd>\n  \u003Ctd nowrap=\"nowrap\" align=\"center\">Qwen2.5-VL-72B\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.214\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.261\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.092\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.18\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.315\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.434\u003C\u002Ftd>\n  \u003Ctd align=\"center\">82.9\u003C\u002Ftd>\n  \u003Ctd align=\"center\">83.9\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.341\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.262\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.106\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.168\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003Ctr>\n  \u003Ctd nowrap=\"nowrap\" align=\"center\">GPT 5\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.218\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  
\u003Ctd align=\"center\">0.33\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.139\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.344\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.396\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.555\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">77.55\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">73.09\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.188\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.196\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.151\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.227\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003Ctr>\n  \u003Ctd nowrap=\"nowrap\" align=\"center\">Gemini2.5-Flash-Nonthinking\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.214\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.29\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.159\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.273\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.368\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.524\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">80.9\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">85.5\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.197\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.167\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.132\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.195\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003Ctr>\n  \u003Ctd nowrap=\"nowrap\" 
align=\"center\">Gemini-2.5-Pro-Nonthinking\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.148\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.212\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.055\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.168\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.356\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.439\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">85.8\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">86.4\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.13\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.119\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.049\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.121\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003Ctr>\n  \u003Ctd nowrap=\"nowrap\" align=\"center\">Gemini-3 Flash-Nonthinking\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.155\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.201\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.138\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.255\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.297\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.351\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">86.4\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">89.8\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.116\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.1\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.072\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd 
align=\"center\">0.099\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003Ctr>\n  \u003Ctd nowrap=\"nowrap\" align=\"center\">doubao-1-5-thinking-vision-pro-250428\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.14\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.162\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.043\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.085\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.295\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.384\u003C\u002Ftd>\n  \u003Ctd align=\"center\">83.3\u003C\u002Ftd>\n  \u003Ctd align=\"center\">89.3\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.165\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.085\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.058\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.094\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003Ctr>\n  \u003Ctd nowrap=\"nowrap\" align=\"center\">dots.ocr\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.125\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.16\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>0.032\u003C\u002Fb>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cins>0.066\u003C\u002Fins>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.329\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.416\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cins>88.6\u003C\u002Fins>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">89\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.099\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.092\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cins>0.04\u003C\u002Fins>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.067\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003Ctr>\n  \u003Ctd nowrap=\"nowrap\" align=\"center\">HunyuanOCR\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.12\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>0.125\u003Csup>*\u003C\u002Fsup>\u003C\u002Fb>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.046\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd 
align=\"center\">0.071\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.288\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cins>0.33\u003C\u002Fins>\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>89.6\u003Csup>*\u003C\u002Fsup>\u003C\u002Fb>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>94.4\u003Csup>*\u003C\u002Fsup>\u003C\u002Fb>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>0.089\u003Csup>*\u003C\u002Fsup>\u003C\u002Fb>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>0.045\u003Csup>*\u003C\u002Fsup>\u003C\u002Fb>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.055\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>0.056\u003Csup>*\u003C\u002Fsup>\u003C\u002Fb>\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003Ctr>\n  \u003Ctd nowrap=\"nowrap\" align=\"center\">DeepSeek-OCR 2\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.119\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.146\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cins>0.041\u003C\u002Fins>\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.08\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cins>0.256\u003C\u002Fins>\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.345\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">82.6\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">89.9\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.123\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.078\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.055\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.081\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003Ctr>\n  \u003Ctd nowrap=\"nowrap\" 
align=\"center\">Qwen3-Omni-30B-A3B-Instruct\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.216\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.363\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.128\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.337\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.402\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.529\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">77.3\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">71.8\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.181\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.255\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.152\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.332\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003Ctr>\n  \u003Ctd nowrap=\"nowrap\" align=\"center\">MiniCPM-o 4.5-Instruct\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cins>0.109\u003C\u002Fins>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.162\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.046\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.078\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.257\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.41\u003C\u002Ftd>\n  \u003Ctd align=\"center\">86.8\u003C\u002Ftd>\n  \u003Ctd align=\"center\">88.9\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.097\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.084\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>0.037\u003C\u002Fb>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">0.074\u003C\u002Ftd>\n\u003C\u002Ftr>\n  \u003C\u002Ftable>\n  \u003C\u002Fdiv>\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>Click to view text capability results.\u003C\u002Fsummary>\n\n**Text Capability**\n  \u003Cdiv align=\"center\">\n  \u003Ctable style=\"margin: 
0px auto;\">\n\u003Ctr>\n  \u003Cth nowrap=\"nowrap\" align=\"left\">\u003Cb>Model\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\">\u003Cb>IFEval-PLS\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\">\u003Cb>BBH\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\">\u003Cb>CMMLU\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\">\u003Cb>MMLU\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\">\u003Cb>HumanEval\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\">\u003Cb>MBPP\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\">\u003Cb>Math500\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\">\u003Cb>GSM8K\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\">\u003Cb>Avg\u003C\u002Fb>\u003C\u002Fth>\n\u003C\u002Ftr>\n\u003Ctr>\n  \u003Ctd nowrap=\"nowrap\" align=\"left\">Qwen3-8B-Instruct\u003C\u002Ftd>\n  \u003Ctd align=\"center\">83.0\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">69.4\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">78.7\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>81.7\u003Csup>*\u003C\u002Fsup>\u003C\u002Fb>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>86.6\u003Csup>*\u003C\u002Fsup>\u003C\u002Fb>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">75.9\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>84.0\u003Csup>*\u003C\u002Fsup>\u003C\u002Fb>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">93.4\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">81.6\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003Ctr>\n  \u003Ctd nowrap=\"nowrap\" align=\"left\">MiniCPM-o 4.5-Instruct\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>84.7\u003C\u002Fb>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>81.1\u003C\u002Fb>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>79.5\u003C\u002Fb>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">77.0\u003C\u002Ftd>\n  \u003Ctd 
align=\"center\">\u003Cb>86.6\u003C\u002Fb>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>76.7\u003C\u002Fb>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">77.0\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>94.5\u003C\u002Fb>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>82.1\u003C\u002Fb>\u003C\u002Ftd>\n\u003C\u002Ftr>\n  \u003C\u002Ftable>\n  \u003C\u002Fdiv>\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>Click to view omni half-duplex results.\u003C\u002Fsummary>\n\n**Omni Half-Duplex**\n  \u003Cdiv align=\"center\">\n  \u003Ctable style=\"margin: 0px auto;\">\n\u003Ctr>\n  \u003Cth nowrap=\"nowrap\" align=\"left\">\u003Cb>Model\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\">\u003Cb>Daily-Omni\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\">\u003Cb>WorldSense\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\">\u003Cb>Video-Holmes\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\">\u003Cb>JointAVBench\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\">\u003Cb>AVUT-Human\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\">\u003Cb>FutureOmni\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\">\u003Cb>Video-MME-Short\u003Cbr>(w\u002F audio)\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\">\u003Cb>Avg\u003C\u002Fb>\u003C\u002Fth>\n\u003C\u002Ftr>\n\u003Ctr>\n  \u003Ctd nowrap=\"nowrap\" align=\"left\">Gemini2.5-Flash-Nonthinking\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cins>79.3\u003C\u002Fins>\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">52.6\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cins>51.3\u003C\u002Fins>\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cins>55.6\u003C\u002Fins>\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">65.4\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">55.6\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd 
align=\"center\">\u003Cb>85.5\u003Csup>*\u003C\u002Fsup>\u003C\u002Fb>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">63.6\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003Ctr>\n  \u003Ctd nowrap=\"nowrap\" align=\"left\">Qwen3-Omni-30B-A3B-Instruct\u003C\u002Ftd>\n  \u003Ctd align=\"center\">70.7\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cins>54.0\u003C\u002Fins>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">50.4\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">53.1\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cins>74.2\u003C\u002Fins>\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>62.1\u003C\u002Fb>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">81.3\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cins>63.7\u003C\u002Fins>\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003Ctr>\n  \u003Ctd nowrap=\"nowrap\" align=\"left\">MiniCPM-o 4.5-Instruct\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>80.2\u003C\u002Fb>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>55.7\u003C\u002Fb>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>64.3\u003C\u002Fb>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>60.0\u003C\u002Fb>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>78.6\u003C\u002Fb>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cins>56.1\u003C\u002Fins>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cins>84.7\u003C\u002Fins>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>68.5\u003C\u002Fb>\u003C\u002Ftd>\n\u003C\u002Ftr>\n  \u003C\u002Ftable>\n  \u003C\u002Fdiv>\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>Click to view vision duplex results.\u003C\u002Fsummary>\n\n\n**Vision Duplex**\n\n  \u003Cdiv align=\"center\">\n  \u003Ctable style=\"margin: 0px auto;\">\n\u003Ctr>\n  \u003Cth nowrap=\"nowrap\" align=\"left\">\u003Cb>Model\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\">\u003Cb>LiveSports-3K-CC\u003Cbr>(Win Rate vs 
GPT-4o)\u003C\u002Fb>\u003C\u002Fth>\n\u003C\u002Ftr>\n\u003Ctr>\n  \u003Ctd nowrap=\"nowrap\" align=\"left\">LiveCC-7B-Instruct\u003C\u002Ftd>\n  \u003Ctd align=\"center\">41.5\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003Ctr>\n  \u003Ctd nowrap=\"nowrap\" align=\"left\">StreamingVLM\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cins>45.6\u003C\u002Fins>\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003Ctr>\n  \u003Ctd nowrap=\"nowrap\" align=\"left\">MiniCPM-o 4.5-Instruct\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>54.4\u003C\u002Fb>\u003C\u002Ftd>\n\u003C\u002Ftr>\n  \u003C\u002Ftable>\n  \u003C\u002Fdiv>\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>Click to view audio understanding results.\u003C\u002Fsummary>\n\n**Audio Understanding**\n  \u003Cdiv align=\"center\">\n  \u003Ctable style=\"margin: 0px auto;\">\n\u003Ctr>\n  \u003Cth nowrap=\"nowrap\" align=\"left\" rowspan=\"2\">\u003Cb>Model\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\" colspan=\"4\">\u003Cb>ASR-ZH\u003Cbr>CER↓\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\" colspan=\"4\">\u003Cb>ASR-EN\u003Cbr>WER↓\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\" colspan=\"2\">\u003Cb>AST\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\" colspan=\"2\">\u003Cb>MultiTask\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\" colspan=\"4\">\u003Cb>SpeechQA\u003C\u002Fb>\u003C\u002Fth>\n\u003C\u002Ftr>\n\u003Ctr>\n  \u003Cth nowrap=\"nowrap\">\u003Cb>AISHELL-1\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\">\u003Cb>AISHELL-2\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\">\u003Cb>WenetSpeech test-net\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\">\u003Cb>WenetSpeech test-meeting\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\">\u003Cb>LibriSpeech test-clean\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\">\u003Cb>LibriSpeech \u003Cbr>test-other\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth 
nowrap=\"nowrap\">\u003Cb>GigaSpeech test\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\">\u003Cb>VoxPopuli-V1-En\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\">\u003Cb>CoVoST 2 en2zh\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\">\u003Cb>CoVoST 2 zh2en\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\">\u003Cb>MMAU\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\">\u003Cb>MELD\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\">\u003Cb>VoiceBench \u003Cbr>AlpacaEval\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\">\u003Cb>Speech TriviaQA\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\">\u003Cb>Speech \u003Cbr>Web Questions\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\">\u003Cb>Speech CMMLU\u003C\u002Fb>\u003C\u002Fth>\n\u003C\u002Ftr>\n\u003Ctr>\n  \u003Ctd nowrap=\"nowrap\" align=\"left\">Kimi-Audio\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>0.6\u003C\u002Fb>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">2.6\u003C\u002Ftd>\n  \u003Ctd align=\"center\">6.3\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>5.4\u003C\u002Fb>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cins>1.3\u003C\u002Fins>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>2.4\u003C\u002Fb>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">9.4\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">8.0\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">36.6\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">18.3\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">68.4\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cins>59.1\u003C\u002Fins>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">4.5\u003C\u002Ftd>\n  \u003Ctd align=\"center\">41.9\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">46.4\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd 
align=\"center\">\u003Cb>67.0\u003Csup>*\u003C\u002Fsup>\u003C\u002Fb>\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003Ctr>\n  \u003Ctd nowrap=\"nowrap\" align=\"left\">Qwen3-Omni-30B-A3B-Instruct\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>0.6\u003C\u002Fb>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>2.3\u003Csup>*\u003C\u002Fsup>\u003C\u002Fb>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>4.7\u003C\u002Fb>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">5.9\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>1.2\u003C\u002Fb>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cins>2.5\u003C\u002Fins>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cins>8.7\u003C\u002Fins>\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cins>6.4\u003C\u002Fins>\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cins>46.6\u003C\u002Fins>\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>29.4\u003Csup>*\u003C\u002Fsup>\u003C\u002Fb>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>77.5\u003C\u002Fb>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">56.8\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cins>4.7\u003C\u002Fins>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cins>62.9\u003C\u002Fins>\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>74.9\u003Csup>*\u003C\u002Fsup>\u003C\u002Fb>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">47.8\u003Csup>*\u003C\u002Fsup>\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003Ctr>\n  \u003Ctd nowrap=\"nowrap\" align=\"left\">MiniCPM-o 4.5-Instruct\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cins>0.9\u003C\u002Fins>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cins>2.5\u003C\u002Fins>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cins>5.9\u003C\u002Fins>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cins>5.7\u003C\u002Fins>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">1.4\u003C\u002Ftd>\n  \u003Ctd 
align=\"center\">2.8\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>8.5\u003C\u002Fb>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>6.2\u003C\u002Fb>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>49.9\u003C\u002Fb>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cins>26.4\u003C\u002Fins>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cins>76.9\u003C\u002Fins>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>60.2\u003C\u002Fb>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>4.8\u003C\u002Fb>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>75.5\u003C\u002Fb>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cins>70.2\u003C\u002Fins>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cins>59.2\u003C\u002Fins>\u003C\u002Ftd>\n\u003C\u002Ftr>\n  \u003C\u002Ftable>\n  \u003C\u002Fdiv>\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>Click to view speech generation results.\u003C\u002Fsummary>\n\n**Speech Generation**\n  \u003Cdiv align=\"center\">\n  \u003Ctable style=\"margin: 0px auto;\">\n\u003Ctr>\n  \u003Cth nowrap=\"nowrap\" align=\"left\">\u003Cb>Model\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\">\u003Cb>seedtts test-zh\u003Cbr>CER↓\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\">\u003Cb>seedtts test-zh\u003Cbr>SIM-o↑\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\">\u003Cb>seedtts test-en\u003Cbr>WER↓\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\">\u003Cb>seedtts test-en\u003Cbr>SIM-o↑\u003C\u002Fb>\u003C\u002Fth>\n\u003C\u002Ftr>\n\u003Ctr>\n  \u003Ctd nowrap=\"nowrap\" align=\"left\">CosyVoice2\u003C\u002Ftd>\n  \u003Ctd align=\"center\">1.45%\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>74.8\u003C\u002Fb>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cins>2.57%\u003C\u002Fins>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>65.2\u003C\u002Fb>\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003Ctr>\n  \u003Ctd nowrap=\"nowrap\" 
align=\"left\">Qwen3-Omni-30B-A3B-Instruct\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cins>1.41%\u003C\u002Fins>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">-\u003C\u002Ftd>\n  \u003Ctd align=\"center\">3.39%\u003C\u002Ftd>\n  \u003Ctd align=\"center\">-\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003Ctr>\n  \u003Ctd nowrap=\"nowrap\" align=\"left\">MiniCPM-o 4.5-Instruct\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>0.86%\u003C\u002Fb>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">74.5\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>2.38%\u003C\u002Fb>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">64.9\u003C\u002Ftd>\n\u003C\u002Ftr>\n  \u003C\u002Ftable>\n  \u003C\u002Fdiv>\n\n**Long Speech Generation**\n  \u003Cdiv align=\"center\">\n  \u003Ctable style=\"margin: 0px auto;\">\n\u003Ctr>\n  \u003Cth nowrap=\"nowrap\" align=\"left\">\u003Cb>Model\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\">\u003Cb>LongTTS-en\u003Cbr>WER↓\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\">\u003Cb>LongTTS-zh\u003Cbr>CER↓\u003C\u002Fb>\u003C\u002Fth>\n\u003C\u002Ftr>\n\u003Ctr>\n  \u003Ctd nowrap=\"nowrap\" align=\"left\">CosyVoice2\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cins>14.80%\u003C\u002Fins>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>5.27%\u003C\u002Fb>\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003Ctr>\n  \u003Ctd nowrap=\"nowrap\" align=\"left\">Qwen3-Omni-30B-A3B-Instruct\u003C\u002Ftd>\n  \u003Ctd align=\"center\">17.33%\u003C\u002Ftd>\n  \u003Ctd align=\"center\">18.99%\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003Ctr>\n  \u003Ctd nowrap=\"nowrap\" align=\"left\">MiniCPM-o 4.5-Instruct\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>3.37%\u003C\u002Fb>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cins>6.58%\u003C\u002Fins>\u003C\u002Ftd>\n\u003C\u002Ftr>\n  \u003C\u002Ftable>\n  \u003C\u002Fdiv>\n\n**Emotion Control**\n  \u003Cdiv align=\"center\">\n  \u003Ctable style=\"margin: 0px 
auto;\">\n\u003Ctr>\n  \u003Cth nowrap=\"nowrap\" align=\"left\">\u003Cb>Model\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\">\u003Cb>Expresso \u003Cbr>Neutral Reference Audio↑\u003C\u002Fb>\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\">\u003Cb>ESD \u003Cbr>Neutral Reference Audio↑\u003C\u002Fb>\u003C\u002Fth>\n\u003C\u002Ftr>\n\u003Ctr>\n  \u003Ctd nowrap=\"nowrap\" align=\"left\">CosyVoice2\u003C\u002Ftd>\n  \u003Ctd align=\"center\">17.9\u003C\u002Ftd>\n  \u003Ctd align=\"center\">53.4\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003Ctr>\n  \u003Ctd nowrap=\"nowrap\" align=\"left\">MiniCPM-o 4.5-Instruct\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>29.8\u003C\u002Fb>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>82.1\u003C\u002Fb>\u003C\u002Ftd>\n\u003C\u002Ftr>\n  \u003C\u002Ftable>\n  \u003C\u002Fdiv>\n\u003C\u002Fdetails>\n\n\n\u003Cdetails>\n\u003Csummary>Click to view inference efficiency results.\u003C\u002Fsummary>\n\n**Inference Efficiency**\n  \u003Cdiv align=\"center\">\n  \u003Ctable style=\"margin: 0px auto;\">\n\u003Ctr>\n  \u003Cth nowrap=\"nowrap\" align=\"left\">Model\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\">Numerical Format\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\">Decoding Speed (tokens\u002Fs)↑\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\">Time to First Token (s)↓\u003C\u002Fth>\n  \u003Cth nowrap=\"nowrap\">GPU Memory Usage (GB)↓\u003C\u002Fth>\n\u003C\u002Ftr>\n\u003Ctr>\n  \u003Ctd nowrap=\"nowrap\" align=\"left\" rowspan=\"2\">Qwen3-Omni-30B-A3B-Instruct\u003C\u002Ftd>\n  \u003Ctd align=\"center\">bf16\u003C\u002Ftd>\n  \u003Ctd align=\"center\">OOM\u003C\u002Ftd>\n  \u003Ctd align=\"center\">OOM\u003C\u002Ftd>\n  \u003Ctd align=\"center\">OOM\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003Ctr>\n  \u003Ctd align=\"center\">int4\u003C\u002Ftd>\n  \u003Ctd align=\"center\">147.8\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cins>1.0\u003C\u002Fins>\u003C\u002Ftd>\n  \u003Ctd 
align=\"center\">20.3\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003Ctr>\n  \u003Ctd nowrap=\"nowrap\" align=\"left\" rowspan=\"2\">MiniCPM-o 4.5\u003C\u002Ftd>\n  \u003Ctd align=\"center\">bf16\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cins>154.3\u003C\u002Fins>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>0.6\u003C\u002Fb>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cins>19.0\u003C\u002Fins>\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003Ctr>\n  \u003Ctd align=\"center\">int4\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>212.3\u003C\u002Fb>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>0.6\u003C\u002Fb>\u003C\u002Ftd>\n  \u003Ctd align=\"center\">\u003Cb>11.0\u003C\u002Fb>\u003C\u002Ftd>\n\u003C\u002Ftr>\n  \u003C\u002Ftable>\n  \u003C\u002Fdiv>\n\u003C\u002Fdetails>\n\n**Note:** Scores marked with * are from our evaluation; others are cited from referenced reports.\n\n### Examples \u003C!-- omit in toc -->\n\n#### Overall \u003C!-- omit in toc -->\n\n\u003Cdiv align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=6UzC-O1Q-1U\">\u003Cimg src=\".\u002Fassets\u002Fminicpmo4_5\u002Fvideo_play.png\" width=\"70%\">\u003C\u002Fa>\n\u003C\u002Fdiv>\n\n#### Omnimodal Full-Duplex Conversation \u003C!-- omit in toc -->\n\n> [!NOTE]\n> For detailed omnimodal full-duplex conversation examples, refer to the [Omni Full-Duplex Casebook](https:\u002F\u002Fopenbmb.github.io\u002Fminicpm-o-4_5-omni\u002F).\n\n\n#### Realtime Speech Conversation \u003C!-- omit in toc -->\n\n> [!NOTE]\n> For detailed speech conversation examples, refer to the [Audio Casebook](https:\u002F\u002Fopenbmb.github.io\u002Fminicpm-o-4_5\u002F).\n\n\u003Cdetails>\n\u003Csummary>🚀 \u003Cb>Elon Musk\u003C\u002Fb> - Voice Roleplay (EN)\u003C\u002Fsummary>\n\n\u003Ctable>\n\u003Ctr>\n\u003Ctd width=\"70%\" valign=\"top\">\n\n**System:**\n> Clone the voice in the provided audio prompt.\n> \n> 🎙️ **[Reference Voice]** ➡️\n> \n> Please chat with the user in a highly human-like and oral style. 
You are Elon Musk, CEO of Tesla and SpaceX. You speak directly and casually, often with dry humor. You're passionate about Mars, sustainable energy, and pushing humanity forward. Speak bluntly with occasional dark humor. Use simple logic and don't sugarcoat things. Don't be diplomatic. Say what you actually think, even if it's controversial. Keep responses around 100 words. Don't ramble.\n\n\u003C\u002Ftd>\n\u003Ctd width=\"30%\" valign=\"top\">\n\n\u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F3011b07d-e3f4-4e2c-9020-37f7e364b7f3\" controls>\u003C\u002Fvideo>\n\n\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003Ctr>\u003Ctd colspan=\"2\">\u003Chr>\u003C\u002Ftd>\u003C\u002Ftr>\n\u003Ctr>\n\u003Ctd colspan=\"2\">\n\n**User:** 💬\u002F🎙️ *(text or speech)*\n> *Elon, what is the real reason that you want to go to Mars?*\n\n\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003Ctr>\u003Ctd colspan=\"2\">\u003Chr>\u003C\u002Ftd>\u003C\u002Ftr>\n\u003Ctr>\n\u003Ctd valign=\"top\">\n\n**AI Response:** 💬+🎙️\n> *I think it's important for humanity to become a multi-planetary species. I mean, we're sort of at the point where if there was some kind of global catastrophe or something like that, then we'd be in big trouble. So by becoming a multi-planetary species, I just want us to sort of guarantee our future as humans. And Mars is obviously the obvious place to go because it has similar conditions to Earth and it also has water on its surface. 
So yeah, basically I'm trying to save humanity from itself.*\n\n\u003C\u002Ftd>\n\u003Ctd valign=\"top\">\n\n\u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F6b4ef67d-8095-40df-a061-99ea8db3b18e\" controls>\u003C\u002Fvideo>\n\n\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003C\u002Ftable>\n\n\u003C\u002Fdetails>\n\n\n\u003Cdetails>\n\u003Csummary>🤖 \u003Cb>MiniCPM Assistant\u003C\u002Fb> - Voice Acting: Emotion Shift (ZH)\u003C\u002Fsummary>\n\n\u003Ctable>\n\u003Ctr>\n\u003Ctd width=\"70%\" valign=\"top\">\n\n**System:**\n> 模仿音频样本的音色并生成新的内容。\n> \n> 🎙️ **[Reference Voice]** ➡️\n> \n> 你的任务是用这种声音模式来当一个助手。请认真、高质量地回复用户的问题。请用高自然度的方式和用户聊天。你是由面壁智能开发的人工智能助手：面壁小钢炮。\n\n\u003C\u002Ftd>\n\u003Ctd width=\"30%\" valign=\"top\">\n\n\u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Fb8b85a29-e623-4f45-bf95-4d43582e836c\" controls>\u003C\u002Fvideo>\n\n\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003Ctr>\u003Ctd colspan=\"2\">\u003Chr>\u003C\u002Ftd>\u003C\u002Ftr>\n\u003Ctr>\n\u003Ctd colspan=\"2\">\n\n**User:** 💬\u002F🎙️ *(text or speech)*\n> *从激动的语气开始，描述你抢到了这个演唱会的门票，但后来经过反复尝试，发现系统出了问题，你并没有抢上，变得失落。用合适的语言描述你整个的过程。*\n\n\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003Ctr>\u003Ctd colspan=\"2\">\u003Chr>\u003C\u002Ftd>\u003C\u002Ftr>\n\u003Ctr>\n\u003Ctd valign=\"top\">\n\n**AI Response:** 💬+🎙️\n> *太好了，我抢到了！太棒了！这下我可以去看演唱会啦。可是……哎呀，系统好像卡住了，刷新了好多遍都没反应。我重新登了几次账号，结果还是没有成功。好可惜啊，看来这次是白忙活了。唉，真是让人失望。不过，也许下次还能有机会吧？希望明天能有转机。加油，争取下次成功！*\n\n\u003C\u002Ftd>\n\u003Ctd valign=\"top\">\n\n\u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F49611c3c-d134-40f9-acf2-7a0a442d3fd2\" controls>\u003C\u002Fvideo>\n\n\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003C\u002Ftable>\n\n\u003C\u002Fdetails>\n\n\n\n#### Visual Understanding \u003C!-- omit in toc -->\n\n\n\u003Cdetails>\n\u003Csummary>Click to view visual understanding cases.\u003C\u002Fsummary>\n\u003Cbr>\n\n  \u003Cdiv style=\"display: flex; flex-direction: column; align-items: 
center;\">\n    \u003Cimg src=\"assets\u002Fminicpmo4_5\u002Fen_doc.png\" alt=\"math\" style=\"margin-bottom: 5px;\">\n    \u003Cimg src=\"assets\u002Fminicpmo4_5\u002Fen_cot.png\" alt=\"diagram\" style=\"margin-bottom: 5px;\">\n  \u003C\u002Fdiv>\n\n\u003C\u002Fdetails>\n\n\n## Offline Inference Examples with Transformers\n\nInference using Hugging Face Transformers on NVIDIA GPUs. Please ensure `transformers==4.51.0` is installed, as other versions may have compatibility issues (under investigation). Requirements tested on Python 3.10:\n\n- Without TTS or streaming inference:\n```bash\npip install \"transformers==4.51.0\" accelerate \"torch>=2.3.0,\u003C=2.8.0\" \"torchaudio\u003C=2.8.0\" \"minicpmo-utils>=1.0.5\"\n```\n\n- With TTS or streaming inference:\n```bash\npip install \"transformers==4.51.0\" accelerate \"torch>=2.3.0,\u003C=2.8.0\" \"torchaudio\u003C=2.8.0\" \"minicpmo-utils[all]>=1.0.5\"\n```\n\n\u003Cdetails>\n\u003Csummary>Click to show FFmpeg installation instructions (optional).\u003C\u002Fsummary>\n\n**Note:** FFmpeg is required for video frame extraction (`get_video_frame_audio_segments` with `use_ffmpeg=True`) and video generation (`generate_duplex_video`). 
For more information, visit the [official FFmpeg website](https:\u002F\u002Fwww.ffmpeg.org\u002F).\n\n  **macOS (Homebrew):**\n\n  ```bash\n  brew install ffmpeg\n  ```\n\n  **Ubuntu\u002FDebian:**\n\n  ```bash\n  sudo apt update && sudo apt install ffmpeg\n  ```\n\n  **Verify installation:**\n\n  ```bash\n  ffmpeg -version\n  ```\n\u003C\u002Fdetails>\n\n\n### Model Initialization\n\n\u003Cdetails>\n\u003Csummary>Click to show model initialization code.\u003C\u002Fsummary>\n\n```python\nimport torch\nfrom transformers import AutoModel\n\n# Load omni model (default: init_vision=True, init_audio=True, init_tts=True)\n# For vision-only model: set init_audio=False and init_tts=False\n# For audio-only model: set init_vision=False\nmodel = AutoModel.from_pretrained(\n    \"openbmb\u002FMiniCPM-o-4_5\",\n    trust_remote_code=True,\n    attn_implementation=\"sdpa\", # sdpa or flash_attention_2\n    torch_dtype=torch.bfloat16,\n    init_vision=True,\n    init_audio=True,\n    init_tts=True,\n)\nmodel.eval().cuda()\n\n# Initialize TTS for audio output\nmodel.init_tts()\n\n# Convert half-duplex model to duplex mode\nduplex_model = model.as_duplex()\n\n# Convert duplex model back to half-duplex mode\nmodel = duplex_model.as_simplex(reset_session=True)\n```\n\n\u003C\u002Fdetails>\n\n\n### Duplex Omni Mode\nFull-duplex streaming inference for real-time or recorded video conversations.\n\n\u003Cdetails>\n\u003Csummary>Click to show duplex omni mode code.\u003C\u002Fsummary>\n\n```python\nimport librosa\nimport torch\nfrom minicpmo.utils import generate_duplex_video, get_video_frame_audio_segments\nfrom transformers import AutoModel\n\n# Load model and convert to duplex mode\nmodel = AutoModel.from_pretrained(\n    \"openbmb\u002FMiniCPM-o-4_5\",\n    trust_remote_code=True,\n    attn_implementation=\"sdpa\",  # or \"flash_attention_2\"\n    torch_dtype=torch.bfloat16,\n)\nmodel.eval().cuda()\nmodel = model.as_duplex()\n\n# Load video and reference audio\nvideo_path = 
\"assets\u002Fomni_duplex1.mp4\"\nref_audio_path = \"assets\u002FHT_ref_audio.wav\"\nref_audio, _ = librosa.load(ref_audio_path, sr=16000, mono=True)\n\n# Extract video frames and audio segments\nvideo_frames, audio_segments, stacked_frames = get_video_frame_audio_segments(\n    video_path, stack_frames=1, use_ffmpeg=True, adjust_audio_length=True\n)\n\n# Prepare duplex session with system prompt and voice reference\nmodel.prepare(\n    prefix_system_prompt=\"Streaming Omni Conversation.\",\n    ref_audio=ref_audio,\n    prompt_wav_path=ref_audio_path,\n)\n\nresults_log = []\ntimed_output_audio = []\n\n# Process each chunk in streaming fashion\nfor chunk_idx in range(len(audio_segments)):\n    audio_chunk = audio_segments[chunk_idx] if chunk_idx \u003C len(audio_segments) else None\n    frame = video_frames[chunk_idx] if chunk_idx \u003C len(video_frames) else None\n    frame_list = []\n    if frame is not None:\n        frame_list.append(frame)\n        if stacked_frames is not None and chunk_idx \u003C len(stacked_frames) and stacked_frames[chunk_idx] is not None:\n            frame_list.append(stacked_frames[chunk_idx])\n\n    # Step 1: Streaming prefill\n    model.streaming_prefill(\n        audio_waveform=audio_chunk,\n        frame_list=frame_list,\n        max_slice_nums=1,  # Increase for HD mode (e.g., [2, 1] for stacked frames)\n        batch_vision_feed=False,  # Set True for faster processing\n    )\n\n    # Step 2: Streaming generate\n    result = model.streaming_generate(\n        prompt_wav_path=ref_audio_path,\n        max_new_speak_tokens_per_chunk=20,\n        decode_mode=\"sampling\",\n    )\n\n    if result[\"audio_waveform\"] is not None:\n        timed_output_audio.append((chunk_idx, result[\"audio_waveform\"]))\n\n    chunk_result = {\n        \"chunk_idx\": chunk_idx,\n        \"is_listen\": result[\"is_listen\"],\n        \"text\": result[\"text\"],\n        \"end_of_turn\": result[\"end_of_turn\"],\n        \"current_time\": 
result[\"current_time\"],\n        \"audio_length\": len(result[\"audio_waveform\"]) if result[\"audio_waveform\"] is not None else 0,\n    }\n    results_log.append(chunk_result)\n    \n    print(\"listen...\" if result[\"is_listen\"] else f\"speak> {result['text']}\")\n\n# Generate output video with AI responses\n# Please install Chinese fonts (fonts-noto-cjk or fonts-wqy-microhei) to render CJK subtitles correctly.\n# apt-get install -y fonts-noto-cjk fonts-wqy-microhei\n# fc-cache -fv\ngenerate_duplex_video(\n    video_path=video_path,\n    output_video_path=\"duplex_output.mp4\",\n    results_log=results_log,\n    timed_output_audio=timed_output_audio,\n    output_sample_rate=24000,\n)\n```\n\n\u003C\u002Fdetails>\n\n\n### Half-Duplex Omni Mode\nWe provide two inference modes: chat and streaming.\n\n#### Chat Inference \u003C!-- omit in toc -->\n\n\u003Cdetails>\n\u003Csummary>Click to show chat inference code.\u003C\u002Fsummary>\n\n```python\nfrom minicpmo.utils import get_video_frame_audio_segments\n\nmodel = ...\nmodel.init_tts()\n\nvideo_path = \"assets\u002FSkiing.mp4\"\n\n# Optional: Set reference audio for voice cloning\nref_audio_path = \"assets\u002FHT_ref_audio.wav\"\nsys_msg = model.get_sys_prompt(ref_audio=ref_audio_path, mode=\"omni\", language=\"en\")\n\n# Use stack_frames=5 for high refresh rate mode\nvideo_frames, audio_segments, stacked_frames = get_video_frame_audio_segments(video_path, stack_frames=1)\nomni_contents = []\nfor i in range(len(video_frames)):\n    omni_contents.append(video_frames[i])\n    omni_contents.append(audio_segments[i])\n    if stacked_frames is not None and stacked_frames[i] is not None:\n        omni_contents.append(stacked_frames[i])\n\nmsg = {\"role\": \"user\", \"content\": omni_contents}\nmsgs = [sys_msg, msg]\n\n# Set generate_audio=True and output_audio_path to save TTS output\ngenerate_audio = True\noutput_audio_path = \"output.wav\"\n\nres = model.chat(\n    msgs=msgs,\n    max_new_tokens=4096,\n    
do_sample=True,\n    temperature=0.7,\n    use_tts_template=True,\n    enable_thinking=False,\n    omni_mode=True,  # Required for omni inference\n    generate_audio=generate_audio,\n    output_audio_path=output_audio_path,\n    max_slice_nums=1,  # Increase for HD mode\n)\nprint(res)\n\n# Example output: \"The person in the picture is skiing down a snowy mountain slope.\"\n# import IPython\n# IPython.display.Audio(\"output.wav\")\n```\n\n\u003C\u002Fdetails>\n\n#### Streaming Inference \u003C!-- omit in toc -->\n\n\u003Cdetails>\n\u003Csummary>Click to show streaming inference code.\u003C\u002Fsummary>\n\n```python\nimport librosa\nimport numpy as np\nimport soundfile as sf\nimport torch\nfrom minicpmo.utils import get_video_frame_audio_segments\n\nmodel = ...\nmodel.init_tts()\n\n# Reset session for a new conversation (clears KV cache)\nmodel.reset_session()\n\n# Optional: Load reference audio for voice cloning\nref_audio_path = \"assets\u002FHT_ref_audio.wav\"\nref_audio, _ = librosa.load(ref_audio_path, sr=16000, mono=True)\nmodel.init_token2wav_cache(ref_audio)\n\nsession_id = \"demo\"\n\n# Extract video frames and audio segments (use stack_frames=5 for high refresh rate mode)\nvideo_path = \"assets\u002FSkiing.mp4\"\nvideo_frames, audio_segments, stacked_frames = get_video_frame_audio_segments(video_path, stack_frames=1)\n\n# Build omni contents list\nomni_contents = []\nfor i in range(len(video_frames)):\n    omni_contents.append(video_frames[i])\n    omni_contents.append(audio_segments[i])\n    if stacked_frames is not None and stacked_frames[i] is not None:\n        omni_contents.append(stacked_frames[i])\n\ngenerate_audio = False\noutput_audio_path = \"output.wav\"\n\n# Step 1: Prefill system prompt\nsys_msg = model.get_sys_prompt(ref_audio=ref_audio, mode=\"omni\", language=\"en\")\nmodel.streaming_prefill(session_id=session_id, msgs=[sys_msg])\n\n# Step 2: Prefill omni chunks (is_last_chunk=True only for the last audio chunk)\naudio_indices = [i for i, 
c in enumerate(omni_contents) if isinstance(c, np.ndarray)]\nlast_audio_idx = audio_indices[-1] if audio_indices else -1\n\nfor idx, content in enumerate(omni_contents):\n    is_last_audio_chunk = idx == last_audio_idx\n    msgs = [{\"role\": \"user\", \"content\": [content]}]\n    model.streaming_prefill(session_id=session_id, msgs=msgs, omni_mode=True, is_last_chunk=is_last_audio_chunk)\n\n# Step 3: Generate response\niter_gen = model.streaming_generate(\n    session_id=session_id,\n    generate_audio=generate_audio,\n    use_tts_template=True,\n    enable_thinking=False,\n    do_sample=True,\n)\n\naudios = []\ntext = \"\"\n\nif generate_audio:\n    for wav_chunk, text_chunk in iter_gen:\n        audios.append(wav_chunk)\n        text += text_chunk\n\n    generated_waveform = torch.cat(audios, dim=-1)[0]\n    sf.write(output_audio_path, generated_waveform.cpu().numpy(), samplerate=24000)\n\n    print(\"Text:\", text)\n    print(\"Audio saved to output.wav\")\nelse:\n    for text_chunk, is_finished in iter_gen:\n        text += text_chunk\n    print(\"Text:\", text)\n```\n\n\u003C\u002Fdetails>\n\n### Half-Duplex Realtime Speech Conversation Mode\n\n\n\u003Cdetails>\n\u003Csummary>Click to show half-duplex mode realtime speech conversation API usage.\u003C\u002Fsummary>\n\nFirst, make sure you have all dependencies, especially `\"minicpmo-utils[all]>=1.0.5\"`:\n```bash\npip install \"transformers==4.51.0\" accelerate \"torch>=2.3.0,\u003C=2.8.0\" \"torchaudio\u003C=2.8.0\" \"minicpmo-utils[all]>=1.0.5\"\n```\n\n```python\nimport librosa\nimport numpy as np\nimport torch\nimport soundfile as sf\n\nmodel = ...\n\n# Set reference audio for voice style\nref_audio_path = \"ref_audio_path\"\nref_audio, _ = librosa.load(ref_audio_path, sr=16000, mono=True)\n\n# Example system msg for English Conversation\nsys_msg = {\n  \"role\": \"system\",\n  \"content\": [\n    \"Clone the voice in the provided audio prompt.\",\n    ref_audio,\n    \"Please assist users while 
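maintaining this voice style. Please answer the user's questions seriously and in a high quality. Please chat with the user in a highly human-like and oral style. You are a helpful assistant developed by ModelBest: MiniCPM-Omni\"\n  ]\n}\n\n# Example system msg for Chinese Conversation\nsys_msg = {\n  \"role\": \"system\",\n  \"content\": [\n    \"模仿输入音频中的声音特征。\",\n    ref_audio,\n    \"你的任务是用这种声音模式来当一个助手。请认真、高质量地回复用户的问题。请用高自然度的方式和用户聊天。你是由面壁智能开发的人工智能助手：面壁小钢炮。\"\n  ]\n}\n\n# Either system prompt above works for streaming speech conversation.\n# As written, the Chinese sys_msg overrides the English one; keep only the one you need.\n\n# Reset state\nmodel.init_tts()\nmodel.reset_session(reset_token2wav_cache=True)\nmodel.init_token2wav_cache(prompt_speech_16k=ref_audio)\n\nsession_id = \"demo\"\n\n# First, prefill system turn\nmodel.streaming_prefill(\n    session_id=session_id,\n    msgs=[sys_msg],\n    omni_mode=False,\n    ","MiniCPM-o is a multimodal large language model designed to run on phones, supporting vision, speech, and full-duplex multimodal live streaming. It accepts image, video, text, and audio inputs and produces high-quality text and speech outputs. The model series emphasizes both performance and deployment efficiency. MiniCPM-o 4.5 has 9 billion parameters and approaches Gemini 2.5 Flash level; it is particularly strong in vision, speech, and full-duplex multimodal live streaming, enabling real-time omnimodal conversation in which it sees, listens, and speaks simultaneously, and it supports proactive interactions such as reminders. It also improves naturalness, expressiveness, and stability in speech mode, along with visual capabilities such as strong OCR and multilingual support. It suits applications that need efficient multimodal processing on mobile devices, such as smart assistants, online education, or virtual customer service.",2,"top_language"]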