[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-1672":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":16,"stars7d":17,"stars30d":18,"stars90d":16,"forks30d":16,"starsTrendScore":16,"compositeScore":19,"rankGlobal":10,"rankLanguage":10,"license":20,"archived":21,"fork":21,"defaultBranch":22,"hasWiki":21,"hasPages":21,"topics":23,"createdAt":10,"pushedAt":10,"updatedAt":24,"readmeContent":25,"aiSummary":26,"trendingCount":16,"starSnapshotCount":16,"syncStatus":14,"lastSyncTime":27,"discoverSource":28},1672,"MiMo-V2.5-ASR","XiaomiMiMo\u002FMiMo-V2.5-ASR","XiaomiMiMo","Robust Speech Recognition Across Languages, Dialects, and Complex Acoustic Scenarios","",null,"Python",257,24,2,3,0,7,61,4.19,"Apache License 2.0",false,"main",[],"2026-06-12 02:00:31","\u003Cdiv align=\"center\">\n  \u003Cimg src=\"assets\u002FXiaomiMIMO.png\" width=\"60%\" alt=\"Xiaomi-MiMo\" \u002F>\n\u003C\u002Fdiv>\n\n\u003Cdiv align=\"center\">\n  \u003Ch3>\n    \u003Cb>\n      \u003Cspan>━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u003C\u002Fspan>\u003Cbr\u002F>\n      MiMo-V2.5-ASR: Robust Speech Recognition Across\u003Cbr\u002F>\n      Languages, Dialects, and Complex Acoustic Scenarios\u003Cbr\u002F>\n      \u003Cspan>━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u003C\u002Fspan>\n    \u003C\u002Fb>\n  \u003C\u002Fh3>\n\u003C\u002Fdiv>\n\n\u003Cbr\u002F>\n\n\u003Cdiv align=\"center\" style=\"line-height: 1;\">\n  |\n  \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FXiaomiMiMo\u002FMiMo-V2.5-ASR\" target=\"_blank\">🤗 HuggingFace\u003C\u002Fa>\n  &nbsp;|\n  \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FXiaomiMiMo\u002FMiMo-V2.5-ASR\" target=\"_blank\">🚀 Online Demo\u003C\u002Fa>\n  &nbsp;|\n  \u003Ca href=\"https:\u002F\u002Fmimo.xiaomi.com\u002Fmimo-v2-5-asr\" 
target=\"_blank\">📰 Blog\u003C\u002Fa>\n  &nbsp;|\n\n  \u003Cbr\u002F>\n\u003C\u002Fdiv>\n\n\u003Cbr\u002F>\n\n## Introduction\n\n**MiMo-V2.5-ASR** is an end-to-end automatic speech recognition (ASR) model developed by the Xiaomi MiMo team. It is built to deliver accurate and robust transcription across Mandarin Chinese and English, multiple Chinese dialects, code-switched speech, song lyrics, knowledge-intensive content, noisy acoustic environments, and multi-speaker conversations. It achieves state-of-the-art results on a wide range of public benchmarks.\n\n## Abstract\n\nAutomatic speech recognition systems are expected to faithfully transcribe speech that originates from diverse languages, dialects, accents, and domains, and that is captured under a wide variety of acoustic conditions. While conventional end-to-end models perform well on in-domain data, they still fall short of real-world requirements in challenging scenarios such as dialect mixing, code-switching, knowledge-intensive content, noisy environments, and multi-speaker conversations. To address these gaps, we present **MiMo-V2.5-ASR**, an end-to-end speech recognition model developed by the Xiaomi MiMo team. 
Through large-scale mid-training, high-quality supervised fine-tuning, and a novel reinforcement-learning algorithm, MiMo-V2.5-ASR achieves systematic improvements along the following dimensions:\n\n- 🗣️ **Chinese Dialects**: Native support for Wu, Cantonese, Hokkien, Sichuanese, and more.\n- 🔀 **Code-Switch**: Seamless Chinese–English code-switching transcription with no language tags required.\n- 🎵 **Song Recognition**: High-precision lyrics transcription for Chinese and English songs, even with mixed accompaniment and vocals.\n- 🔊 **Noisy Environments**: Robust recognition under heavy noise, far-field capture, and other adverse acoustic conditions.\n- 👥 **Multi-Speaker**: Accurate transcription of overlapping, multi-party conversations such as meetings.\n- 🇬🇧 **Complex English Scenarios**: Leading performance on the Open ASR Leaderboard for challenging English benchmarks such as AMI.\n- 📚 **Knowledge-Intensive Recognition**: Precise recognition of classical poetry, technical terminology, personal names, place names, and other knowledge-dense material.\n- 📝 **Native Punctuation**: Punctuation generated natively from prosody and semantics, delivering ready-to-use transcripts with no post-processing needed.\n\n## Results\n\nMiMo-V2.5-ASR has been evaluated across a broad set of benchmarks spanning standard Mandarin and English, Chinese dialects, lyric recognition, and internal business scenarios. 
The chart below summarizes the average performance of MiMo-V2.5-ASR across these scenarios.\n\n![Results](assets\u002FMiMo_ASR_Results.png)\n\nFor per-benchmark numbers and specific qualitative cases, please refer to our [blog](https:\u002F\u002Fmimo.xiaomi.com\u002Fmimo-v2-5-asr).\n\n## Model Download\n\n| Model   | 🤗 Hugging Face |\n|-------|-------|\n| MiMo-Audio-Tokenizer | [XiaomiMiMo\u002FMiMo-Audio-Tokenizer](https:\u002F\u002Fhuggingface.co\u002FXiaomiMiMo\u002FMiMo-Audio-Tokenizer) |\n| MiMo-V2.5-ASR | [XiaomiMiMo\u002FMiMo-V2.5-ASR](https:\u002F\u002Fhuggingface.co\u002FXiaomiMiMo\u002FMiMo-V2.5-ASR) |\n\n```bash\npip install -U huggingface-hub\n\nhf download XiaomiMiMo\u002FMiMo-Audio-Tokenizer --local-dir .\u002Fmodels\u002FMiMo-Audio-Tokenizer\nhf download XiaomiMiMo\u002FMiMo-V2.5-ASR --local-dir .\u002Fmodels\u002FMiMo-V2.5-ASR\n```\n\n## Getting Started\n\nSpin up the MiMo-V2.5-ASR demo in minutes with the built-in Gradio app.\n\n### Prerequisites (Linux)\n\n* Python 3.12\n* CUDA >= 12.0\n\n### Installation\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FXiaomiMiMo\u002FMiMo-V2.5-ASR.git\ncd MiMo-V2.5-ASR\npip install -r requirements.txt\npip install flash-attn==2.7.4.post1\n```\n\n> [!NOTE]\n> If compiling flash-attn takes too long, download the precompiled wheel and install it manually:\n>\n> * [Download Precompiled Wheel](https:\u002F\u002Fgithub.com\u002FDao-AILab\u002Fflash-attention\u002Freleases\u002Fdownload\u002Fv2.7.4.post1\u002Fflash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp312-cp312-linux_x86_64.whl)\n>\n> ```sh\n> pip install \u002Fpath\u002Fto\u002Fflash_attn-2.7.4.post1+cu12torch2.6cxx11abiFALSE-cp312-cp312-linux_x86_64.whl\n> ```\n\n### Run the Demo\n\n```bash\npython run_mimo_asr.py\n```\n\n![MiMo-V2.5-ASR Demo](assets\u002FMiMo_ASR_Demo.png)\n\nThis launches a local Gradio interface for MiMo-V2.5-ASR. 
You can:\n\n* Upload an audio file **or** record directly from your microphone.\n* Optionally specify a **language tag** (Chinese \u002F English \u002F Auto) to bias the model toward a specific language, or leave it on **Auto** for automatic language detection (recommended for code-switched speech).\n\nThe demo calls the `asr_sft()` interface under the hood.\n\nTo load the model and tokenizer automatically at startup, pass their paths on the command line:\n\n```bash\npython run_mimo_asr.py \\\n    --model-path .\u002Fmodels\u002FMiMo-V2.5-ASR \\\n    --tokenizer-path .\u002Fmodels\u002FMiMo-Audio-Tokenizer\n```\n\nOtherwise, enter the local paths for `MiMo-Audio-Tokenizer` and `MiMo-V2.5-ASR` in the **Model Configuration** tab, then start transcribing.\n\n## Python API\n\nBasic usage with the `asr_sft` interface:\n\n```python\nfrom src.mimo_audio.mimo_audio import MimoAudio\n\nmodel = MimoAudio(\n    model_path=\".\u002Fmodels\u002FMiMo-V2.5-ASR\",\n    tokenizer_path=\".\u002Fmodels\u002FMiMo-Audio-Tokenizer\",\n)\n\n# Automatic language detection (recommended for code-switching)\ntext = model.asr_sft(\"path\u002Fto\u002Faudio.wav\")\nprint(text)\n\n# With an explicit language tag\ntext_zh = model.asr_sft(\"path\u002Fto\u002Faudio.wav\", audio_tag=\"\u003Cchinese>\")\ntext_en = model.asr_sft(\"path\u002Fto\u002Faudio.wav\", audio_tag=\"\u003Cenglish>\")\n```\n\n## Citation\n\n```bibtex\n@misc{coreteam2026mimov25asr,\n      title={MiMo-V2.5-ASR: Robust Speech Recognition Across Languages, Dialects, and Complex Acoustic Scenarios},\n      author={LLM-Core-Team Xiaomi},\n      year={2026},\n      url={https:\u002F\u002Fgithub.com\u002FXiaomiMiMo\u002FMiMo-V2.5-ASR},\n}\n```\n\n## Contact\n\nPlease contact us at [mimo@xiaomi.com](mailto:mimo@xiaomi.com) or open an issue if you have any 
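questions.\n","MiMo-V2.5-ASR is an end-to-end automatic speech recognition model developed by the Xiaomi MiMo team, designed for robust speech recognition across languages, dialects, and complex acoustic scenarios. It supports high-accuracy transcription of Mandarin and English, multiple Chinese dialects (such as Wu, Cantonese, Hokkien, and Sichuanese), Chinese-English code-switching, song lyrics, knowledge-intensive content, noisy environments, and multi-speaker conversations. Through large-scale mid-training, high-quality supervised fine-tuning, and a novel reinforcement-learning algorithm, MiMo-V2.5-ASR achieves leading performance on multiple public benchmarks. It suits applications that must handle complex audio such as multilingual speech, mixed dialects, noisy environments, or multi-party meeting transcription.","2026-06-06 02:46:06","CREATED_QUERY"]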