[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-2650":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":23,"hasPages":23,"topics":25,"createdAt":10,"pushedAt":10,"updatedAt":40,"readmeContent":41,"aiSummary":42,"trendingCount":16,"starSnapshotCount":16,"syncStatus":43,"lastSyncTime":44,"discoverSource":45},2650,"VoxCPM","OpenBMB\u002FVoxCPM","OpenBMB","VoxCPM2: Tokenizer-Free TTS for Multilingual Speech Generation, Creative Voice Design, and True-to-Life Cloning","https:\u002F\u002Fvoxcpm.com",null,"Python",28441,3212,127,109,0,226,2637,10227,1188,45,"Apache License 2.0",false,"main",[26,27,28,29,30,31,32,33,34,35,36,37,38,39],"audio","deeplearning","minicpm","multilingual","python","pytorch","speech","speech-synthesis","text-to-speech","tts","tts-model","voice-cloning","voice-design","voxcpm","2026-06-12 02:00:42","\u003Ch2 align=\"center\">VoxCPM2: Tokenizer-Free TTS for Multilingual Speech Generation, Creative Voice Design, and True-to-Life Cloning\u003C\u002Fh2>\n\n\u003Cp align=\"center\">\n  \u003Cb>English\u003C\u002Fb> | \u003Ca href=\".\u002FREADME_zh.md\">中文\u003C\u002Fa>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FVoxCPM\u002F\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject%20Page-GitHub-blue\" alt=\"Project Page\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FOpenBMB\u002FVoxCPM-Demo\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLive%20Playground-Demo-orange\" alt=\"Live 
Playground\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fvoxcpm.readthedocs.io\u002Fen\u002Flatest\u002F\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDocs-ReadTheDocs-8CA1AF\" alt=\"Documentation\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FVoxCPM2\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%A4%97%20Hugging%20Face-VoxCPM2-yellow\" alt=\"Hugging Face\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FOpenBMB\u002FVoxCPM2\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FModelScope-VoxCPM2-purple\" alt=\"ModelScope\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fopenbmb.github.io\u002Fvoxcpm2-demopage\u002F\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemoPage-Audio Samples-red\">\u003C\u002Fa>\n  \n\u003C\u002Fp>\n\n\u003Cdiv align=\"center\">\n  \u003Cimg src=\"assets\u002Fvoxcpm_logo.png\" alt=\"VoxCPM Logo\" width=\"35%\">\n  \u003Cbr>\u003Cbr>\n  \u003Ca href=\"https:\u002F\u002Ftrendshift.io\u002Frepositories\u002F17704\" target=\"_blank\">\u003Cimg src=\"https:\u002F\u002Ftrendshift.io\u002Fapi\u002Fbadge\u002Frepositories\u002F17704\" alt=\"OpenBMB%2FVoxCPM | Trendshift\" style=\"width: 250px; height: 55px;\" width=\"250\" height=\"55\"\u002F>\u003C\u002Fa>\n\u003C\u002Fdiv>\n\n\u003Cbr>\n\n\u003Cp align=\"center\">\n  👋 Join our community for discussion and support!\n  \u003Cbr>\n  \u003Ca href=\".\u002Fassets\u002Ffeishu-group.png\" style=\"display:inline-block;vertical-align:middle; margin-left: 10px;\">\n    \u003Cimg src=\".\u002Fassets\u002Ffeishu-logo.png\" width=\"16\" height=\"16\" style=\"vertical-align:middle;\"> Feishu\n  \u003C\u002Fa>\n  &nbsp;|&nbsp;\n  \u003Ca href=\"https:\u002F\u002Fdiscord.gg\u002FKZUx7tVNwz\" style=\"display:inline-block;vertical-align:middle;\">\n    \u003Cimg src=\".\u002Fassets\u002Fdiscord-logo.png\" width=\"16\" height=\"16\" 
style=\"vertical-align:middle;\"> Discord\n  \u003C\u002Fa>\n\u003C\u002Fp>\n\nVoxCPM is a **tokenizer-free** Text-to-Speech system that directly generates continuous speech representations via an end-to-end **diffusion autoregressive architecture**, bypassing discrete tokenization to achieve highly natural and expressive synthesis.\n\n**VoxCPM2** is the latest major release — a **2B** parameter model trained on **over 2 million hours** of multilingual speech data, now supporting **30 languages**, **Voice Design**, **Controllable Voice Cloning**, and **48kHz** studio-quality audio output. Built on a [MiniCPM-4](https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FMiniCPM) backbone.\n\n### ✨ Highlights\n\n- 🌍 **30-Language Multilingual** — Input text in any of the 30 supported languages and synthesize directly, no language tag needed\n- 🎨 **Voice Design** — Create a brand-new voice from a natural-language description alone (gender, age, tone, emotion, pace …), no reference audio required\n- 🎛️ **Controllable Cloning** — Clone any voice from a short reference clip, with optional style guidance to steer emotion, pace, and expression while preserving the original timbre\n- 🎙️ **Ultimate Cloning** — Reproduce every vocal nuance: provide both reference audio and its transcript, and the model continues seamlessly from the reference, faithfully preserving every vocal detail — timbre, rhythm, emotion, and style (same as VoxCPM1.5)\n- 🔊 **48kHz High-Quality Audio** — Accepts 16kHz reference audio and directly outputs 48kHz studio-quality audio via AudioVAE V2's asymmetric encode\u002Fdecode design, with built-in super-resolution — no external upsampler needed\n- 🧠 **Context-Aware Synthesis** — Automatically infers appropriate prosody and expressiveness from text content\n- ⚡ **Real-Time Streaming** — RTF as low as ~0.3 on NVIDIA RTX 4090, and ~0.13 accelerated by [Nano-vLLM](https:\u002F\u002Fgithub.com\u002Fa710128\u002Fnanovllm-voxcpm) or 
[vLLM-Omni](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fvllm-omni) — official vLLM omni-modal serving for VoxCPM2 with PagedAttention and an OpenAI-compatible API\n- 📜 **Fully Open-Source & Commercial-Ready** — Weights and code released under the [Apache-2.0](LICENSE) license, free for commercial use\n\n\n\u003Csummary>\u003Cb>🌍 Supported Languages (30)\u003C\u002Fb>\u003C\u002Fsummary>\n\u003Cbr>\nArabic, Burmese, Chinese, Danish, Dutch, English, Finnish, French, German, Greek, Hebrew, Hindi, Indonesian, Italian, Japanese, Khmer, Korean, Lao, Malay, Norwegian, Polish, Portuguese, Russian, Spanish, Swahili, Swedish, Tagalog, Thai, Turkish, Vietnamese\n\nChinese Dialects: Sichuanese (四川话), Cantonese (粤语), Wu (吴语), Northeastern Mandarin (东北话), Henan dialect (河南话), Shaanxi dialect (陕西话), Shandong dialect (山东话), Tianjin dialect (天津话), Southern Min (闽南话)\n\n\n### News\n\n* **[2026.04]** 🔥 We release **VoxCPM2** — 2B, 30 languages, Voice Design & Controllable Voice Cloning, 48kHz audio output! [Weights](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FVoxCPM2) | [Docs](https:\u002F\u002Fvoxcpm.readthedocs.io\u002Fen\u002Flatest\u002F) | [Playground](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FOpenBMB\u002FVoxCPM-Demo)\n* **[2025.12]** 🎉 Open-source **VoxCPM1.5** [weights](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FVoxCPM1.5) with SFT & LoRA fine-tuning. 
(**🏆 #1 GitHub Trending**)\n* **[2025.09]** 🔥 Release VoxCPM [Technical Report](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.24650).\n* **[2025.09]** 🎉 Open-source **VoxCPM-0.5B** [weights](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FVoxCPM-0.5B) (**🏆 #1 HuggingFace Trending**)\n\n---\n\n## Contents\n\n- [Quick Start](#-quick-start)\n  - [Installation](#installation)\n  - [Python API](#python-api)\n  - [CLI Usage](#cli-usage)\n  - [Web Demo](#web-demo)\n  - [Production Deployment](#-production-deployment-nano-vllm)\n- [Models & Versions](#-models--versions)\n- [Performance](#-performance)\n- [Fine-tuning](#%EF%B8%8F-fine-tuning)\n- [Documentation](#-documentation)\n- [Ecosystem & Community](#-ecosystem--community)\n- [Risks and Limitations](#%EF%B8%8F-risks-and-limitations)\n- [Citation](#-citation)\n\n---\n\n## 🚀 Quick Start\n\n### Installation\n\n```sh\npip install voxcpm\n```\n\n> **Requirements:** Python ≥ 3.10 (\u003C3.13), PyTorch ≥ 2.5.0, CUDA ≥ 12.0. See [Quick Start Docs](https:\u002F\u002Fvoxcpm.readthedocs.io\u002Fen\u002Flatest\u002Fquickstart.html) for details.\n\n### Python API\n\n#### 🗣️ Text-to-Speech\n\n```python\nfrom voxcpm import VoxCPM\nimport soundfile as sf\n\nmodel = VoxCPM.from_pretrained(\n  \"openbmb\u002FVoxCPM2\",\n  load_denoiser=False,\n)\n\nwav = model.generate(\n    text=\"VoxCPM2 is the current recommended release for realistic multilingual speech synthesis.\",\n    cfg_value=2.0,\n    inference_timesteps=10,\n)\nsf.write(\"demo.wav\", wav, model.tts_model.sample_rate)\nprint(\"saved: demo.wav\")\n```\n\nIf you prefer downloading from ModelScope first, you can use:\n\n```bash\npip install modelscope\n```\n\n```python\nfrom modelscope import snapshot_download\nsnapshot_download(\"OpenBMB\u002FVoxCPM2\", local_dir='.\u002Fpretrained_models\u002FVoxCPM2') # specify the local directory to save the model\n\nfrom voxcpm import VoxCPM\nimport soundfile as sf\nmodel = 
VoxCPM.from_pretrained(\".\u002Fpretrained_models\u002FVoxCPM2\", load_denoiser=False)\n\nwav = model.generate(\n    text=\"VoxCPM2 is the current recommended release for realistic multilingual speech synthesis.\",\n    cfg_value=2.0,\n    inference_timesteps=10,\n)\nsf.write(\"demo.wav\", wav, model.tts_model.sample_rate)\n```\n\n#### 🎨 Voice Design\n\nCreate a voice from a natural-language description — no reference audio needed. **Format:** put the description in parentheses at the start of `text` (e.g. `\"(your voice description)The text to synthesize.\"`):\n\n```python\nwav = model.generate(\n    text=\"(A young woman, gentle and sweet voice)Hello, welcome to VoxCPM2!\",\n    cfg_value=2.0,\n    inference_timesteps=10,\n)\nsf.write(\"voice_design.wav\", wav, model.tts_model.sample_rate)\n```\n\n#### 🎛️ Controllable Voice Cloning\n\nProvide a reference audio clip via `reference_wav_path`. The model clones the timbre, and you can still prepend a parenthesized control instruction to `text` to adjust speed, emotion, or style.\n\n```python\nwav = model.generate(\n    text=\"This is a cloned voice generated by VoxCPM2.\",\n    reference_wav_path=\"path\u002Fto\u002Fvoice.wav\",\n)\nsf.write(\"clone.wav\", wav, model.tts_model.sample_rate)\n\nwav = model.generate(\n    text=\"(slightly faster, cheerful tone)This is a cloned voice with style control.\",\n    reference_wav_path=\"path\u002Fto\u002Fvoice.wav\",\n    cfg_value=2.0,\n    inference_timesteps=10,\n)\nsf.write(\"controllable_clone.wav\", wav, model.tts_model.sample_rate)\n```\n\n#### 🎙️ Ultimate Cloning\n\nProvide both the reference audio and its exact transcript for audio-continuation-based cloning with every vocal nuance reproduced. 
For maximum cloning similarity, pass the same reference clip to both `reference_wav_path` and `prompt_wav_path` as shown below:\n\n```python\nwav = model.generate(\n    text=\"This is an ultimate cloning demonstration using VoxCPM2.\",\n    prompt_wav_path=\"path\u002Fto\u002Fvoice.wav\",\n    prompt_text=\"The transcript of the reference audio.\",\n    reference_wav_path=\"path\u002Fto\u002Fvoice.wav\",  # optional, for better similarity\n)\nsf.write(\"hifi_clone.wav\", wav, model.tts_model.sample_rate)\n```\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>🔄 Streaming API\u003C\u002Fb>\u003C\u002Fsummary>\n\n```python\nimport numpy as np\n\nchunks = []\nfor chunk in model.generate_streaming(\n    text=\"Streaming text to speech is easy with VoxCPM!\",\n):\n    chunks.append(chunk)\nwav = np.concatenate(chunks)\nsf.write(\"streaming.wav\", wav, model.tts_model.sample_rate)\n```\n\u003C\u002Fdetails>\n\n### CLI Usage\n\n```bash\n# Voice design (no reference audio needed)\nvoxcpm design \\\n  --text \"VoxCPM2 brings studio-quality multilingual speech synthesis.\" \\\n  --output out.wav\n\n# Controllable voice cloning with style control\nvoxcpm design \\\n  --text \"VoxCPM2 brings studio-quality multilingual speech synthesis.\" \\\n  --control \"Young female voice, warm and gentle, slightly smiling\" \\\n  --output out.wav\n\n# Voice cloning (reference audio)\nvoxcpm clone \\\n  --text \"This is a voice cloning demo.\" \\\n  --reference-audio path\u002Fto\u002Fvoice.wav \\\n  --output out.wav\n\n# Ultimate cloning (prompt audio + transcript)\n# --reference-audio is optional, for better similarity\nvoxcpm clone \\\n  --text \"This is a voice cloning demo.\" \\\n  --prompt-audio path\u002Fto\u002Fvoice.wav \\\n  --prompt-text \"reference transcript\" \\\n  --reference-audio path\u002Fto\u002Fvoice.wav \\\n  --output out.wav\n\n# Batch processing\nvoxcpm batch --input examples\u002Finput.txt --output-dir outs\n\n# Help\nvoxcpm --help\n```\n\n### Web Demo\n\n```bash\npython app.py --port 8808  # then 
open in browser: http:\u002F\u002Flocalhost:8808\n```\n\n### 🚢 Production Deployment (Nano-vLLM)\n\nFor high-throughput serving, use [**Nano-vLLM-VoxCPM**](https:\u002F\u002Fgithub.com\u002Fa710128\u002Fnanovllm-voxcpm) — a dedicated inference engine built on Nano-vLLM with concurrent request support and an async API.\n\n```bash\npip install nano-vllm-voxcpm\n```\n\n```python\nfrom nanovllm_voxcpm import VoxCPM\nimport numpy as np, soundfile as sf\n\nserver = VoxCPM.from_pretrained(model=\"\u002Fpath\u002Fto\u002FVoxCPM\", devices=[0])\nchunks = list(server.generate(target_text=\"Hello from VoxCPM!\"))\nsf.write(\"out.wav\", np.concatenate(chunks), 48000)\nserver.stop()\n```\n\n> **RTF as low as ~0.13 on NVIDIA RTX 4090** (vs ~0.3 with the standard PyTorch implementation), with support for batched concurrent requests and a FastAPI HTTP server. See the [Nano-vLLM-VoxCPM repo](https:\u002F\u002Fgithub.com\u002Fa710128\u002Fnanovllm-voxcpm) for deployment details.\n\n### 🏭 Production Serving (vLLM-Omni)\n\nFor production multi-tenant deployments, use [**vLLM-Omni**](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fvllm-omni) — the official vLLM project's omni-modal extension with native **VoxCPM2** support. 
PagedAttention KV cache, continuous batching, and a drop-in **OpenAI-compatible** `\u002Fv1\u002Faudio\u002Fspeech` endpoint.\n\n```bash\n# Install from source (latest main — vllm-omni is rapidly evolving)\nuv pip install vllm==0.19.0 --torch-backend=auto\ngit clone https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fvllm-omni.git && cd vllm-omni\nuv pip install -e .\n```\n\nSee the [vLLM-Omni installation guide](https:\u002F\u002Fvllm-omni.readthedocs.io\u002Fen\u002Flatest\u002Fgetting_started\u002Finstallation\u002F) for other platforms (ROCm, XPU, MUSA, NPU) and Docker images.\n\n```bash\n# Launch an OpenAI-compatible TTS server (--omni enables omni-modal serving)\nvllm serve openbmb\u002FVoxCPM2 --omni --port 8000\n\n# Call it from any OpenAI client\ncurl http:\u002F\u002Flocalhost:8000\u002Fv1\u002Faudio\u002Fspeech \\\n  -H \"Content-Type: application\u002Fjson\" \\\n  -d '{\"model\":\"openbmb\u002FVoxCPM2\",\"input\":\"Hello from VoxCPM2 on vLLM-Omni!\",\"voice\":\"default\"}' \\\n  --output out.wav\n```\n\n> Built on the upstream vLLM scheduler, with batched concurrent requests, streaming chunk delivery, and multi-GPU deployment out of the box. 
See the [VoxCPM2 example](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fvllm-omni\u002Ftree\u002Fmain\u002Fexamples\u002Fonline_serving\u002Fvoxcpm2) for full deployment recipes.\n\n> **Full parameter reference, multi-scenario examples, and voice cloning tips →** [Quick Start Guide](https:\u002F\u002Fvoxcpm.readthedocs.io\u002Fen\u002Flatest\u002Fquickstart.html) | [Usage Guide](https:\u002F\u002Fvoxcpm.readthedocs.io\u002Fen\u002Flatest\u002Fusage_guide.html) | [Cookbook](https:\u002F\u002Fvoxcpm.readthedocs.io\u002Fen\u002Flatest\u002Fcookbook.html)\n\n---\n\n## 📦 Models & Versions\n\n| | **VoxCPM2** | **VoxCPM1.5** | **VoxCPM-0.5B** |\n|---|:---:|:---:|:---:|\n| **Status** | 🟢 Latest | Stable | Legacy |\n| **Backbone Parameters** | 2B | 0.6B | 0.5B |\n| **Audio Sample Rate** | 48kHz | 44.1kHz | 16kHz |\n| **LM Token Rate** | 6.25Hz | 6.25Hz | 12.5Hz |\n| **Languages** | 30 | 2 (zh, en) | 2 (zh, en) |\n| **Cloning Mode** | Isolated Reference & Continuation | Continuation only | Continuation only |\n| **Voice Design** | ✅ | — | — |\n| **Controllable Voice Cloning** | ✅ | — | — |\n| **SFT \u002F LoRA** | ✅ | ✅ | ✅ |\n| **RTF (RTX 4090)** | ~0.30 | ~0.15 | ~0.17 |\n| **RTF in Nano-VLLM (RTX 4090)** | ~0.13 | ~0.08 | ~0.10 |\n| **VRAM** | ~8 GB | ~6 GB | ~5 GB |\n| **Weights** | [🤗 HF](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FVoxCPM2) \u002F [MS](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FOpenBMB\u002FVoxCPM2) | [🤗 HF](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FVoxCPM1.5) \u002F [MS](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FOpenBMB\u002FVoxCPM1.5) | [🤗 HF](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FVoxCPM-0.5B) \u002F [MS](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FOpenBMB\u002FVoxCPM-0.5B) |\n| **Technical Report** | Coming soon | — | [arXiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.24650) [ICLR 2026](https:\u002F\u002Fopenreview.net\u002Fforum?id=h5KLpGoqzC) |\n| **Demo Page** | [Audio 
Samples](https:\u002F\u002Fopenbmb.github.io\u002Fvoxcpm2-demopage) | — | [Audio Samples](https:\u002F\u002Fopenbmb.github.io\u002FVoxCPM-demopage) |\n\nVoxCPM2 is built on a **tokenizer-free, diffusion autoregressive** paradigm. The model operates entirely in the latent space of **AudioVAE V2**, following a four-stage pipeline: **LocEnc → TSLM → RALM → LocDiT**, enabling rich expressiveness and 48kHz native audio output.\n\n\u003Cdiv align=\"center\">\n  \u003Cimg src=\"assets\u002Fvoxcpm_model.png\" alt=\"VoxCPM2 Model Architecture\" width=\"90%\">\n\u003C\u002Fdiv>\n\n> For full architectural details, VoxCPM2-specific upgrades, and a model comparison table, see the [Architecture Design](https:\u002F\u002Fvoxcpm.readthedocs.io\u002Fen\u002Flatest\u002Fmodels\u002Farchitecture.html).\n\n---\n\n## 📊 Performance\n\nVoxCPM2 achieves state-of-the-art or comparable results on public zero-shot and controllable TTS benchmarks. \n\n### Seed-TTS-eval\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>Seed-TTS-eval WER(⬇)&SIM(⬆) Results (click to expand)\u003C\u002Fb>\u003C\u002Fsummary>\n\n| Model | Parameters | Open-Source | test-EN | | test-ZH | | test-Hard | |\n|------|------|------|:------------:|:--:|:------------:|:--:|:-------------:|:--:|\n| | | | WER\u002F%⬇ | SIM\u002F%⬆| CER\u002F%⬇| SIM\u002F%⬆ | CER\u002F%⬇ | SIM\u002F%⬆ |\n| MegaTTS3 | 0.5B | ❌ | 2.79 | 77.1 | 1.52 | 79.0 | - | - |\n| DiTAR | 0.6B | ❌ | 1.69 | 73.5 | 1.02 | 75.3 | - | - |\n| CosyVoice3 | 0.5B | ❌ | 2.02 | 71.8 | 1.16 | 78.0 | 6.08 | 75.8 |\n| CosyVoice3 | 1.5B | ❌ | 2.22 | 72.0 | 1.12 | 78.1 | 5.83 | 75.8 |\n| Seed-TTS | - | ❌ | 2.25 | 76.2 | 1.12 | 79.6 | 7.59 | 77.6 |\n| MiniMax-Speech | - | ❌ | 1.65 | 69.2 | 0.83 | 78.3 | - | - |\n| F5-TTS | 0.3B | ✅ | 2.00 | 67.0 | 1.53 | 76.0 | 8.67 | 71.3 |\n| MaskGCT | 1B | ✅ | 2.62 | 71.7 | 2.27 | 77.4 | - | - |\n| CosyVoice | 0.3B | ✅ | 4.29 | 60.9 | 3.63 | 72.3 | 11.75 | 70.9 |\n| CosyVoice2 | 0.5B | ✅ | 3.09 | 65.9 | 1.38 | 75.7 | 6.83 | 72.4 |\n| SparkTTS | 
0.5B | ✅ | 3.14 | 57.3 | 1.54 | 66.0 | - | - |\n| FireRedTTS | 0.5B | ✅ | 3.82 | 46.0 | 1.51 | 63.5 | 17.45 | 62.1 |\n| FireRedTTS-2 | 1.5B | ✅ | 1.95 | 66.5 | 1.14 | 73.6 | - | - |\n| Qwen2.5-Omni | 7B | ✅ | 2.72 | 63.2 | 1.70 | 75.2 | 7.97 | 74.7 |\n| Qwen3-Omni | 30B-A3B | ✅ | 1.39 | - | 1.07 | - | - | - |\n| OpenAudio-s1-mini | 0.5B | ✅ | 1.94 | 55.0 | 1.18 | 68.5 | 23.37 | 64.3 |\n| IndexTTS2 | 1.5B | ✅ | 2.23 | 70.6 | 1.03 | 76.5 | 7.12 | 75.5 |\n| VibeVoice | 1.5B | ✅ | 3.04 | 68.9 | 1.16 | 74.4 | - | - |\n| HiggsAudio-v2 | 3B | ✅ | 2.44 | 67.7 | 1.50 | 74.0 | 55.07 | 65.6 |\n| VoxCPM-0.5B | 0.6B | ✅ | 1.85 | 72.9 | 0.93 | 77.2 | 8.87 | 73.0 |\n| VoxCPM1.5 | 0.8B | ✅ | 2.12 | 71.4 | 1.18 | 77.0 | 7.74 | 73.1 |\n| MOSS-TTS |  | ✅ | 1.85 | 73.4 | 1.20 | 78.8 | - | - |\n| Qwen3-TTS | 1.7B | ✅ | 1.23 | 71.7 | 1.22 | 77.0 | 6.76 | 74.8 |\n| FishAudio S2 | 4B | ✅ | 0.99 | - | 0.54 | - | 5.99 | - |\n| LongCat-Audio-DiT | 3.5B | ✅ | 1.50 | 78.6 | 1.09 | 81.8 | 6.04 | 79.7 |\n| **VoxCPM2** | 2B | ✅ | 1.84 | 75.3 | 0.97| 79.5| 8.13 | 75.3 |  \n\u003C\u002Fdetails>\n\n\n### CV3-eval \n\u003Cdetails>\n\u003Csummary>\u003Cb>CV3-eval Multilingual WER\u002FCER(⬇) Results (click to expand)\u003C\u002Fb>\u003C\u002Fsummary>\n\n| Model | zh | en | hard-zh | hard-en | ja | ko | de | es | fr | it | ru |\n|-------|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|\n| CosyVoice2 | 4.08 | 6.32 | 12.58| 11.96| 9.13 | 19.7 |- | - | - | - | - |\n| CosyVoice3-1.5B | 3.91 | 4.99 | 9.77 | 10.55 | 7.57 | 5.69 | 6.43 | 4.47 | 11.8 | 10.5 | 6.64 |\n| Fish Audio S2 | 2.65 | 2.43 | 9.10 | 4.40 | 3.96 | 2.76 | 2.22 | 2.00 | 6.26 | 2.04 | 2.78 |\n| **VoxCPM2** | 3.65 | 5.00 | 8.55 | 8.48 | 5.96 | 5.69 | 4.77 | 3.80 | 9.85 | 4.25 | 5.21 |\n\u003C\u002Fdetails>\n\n### MiniMax-Multilingual-Test\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>Minimax-MLS-test WER(⬇) Results (click to expand)\u003C\u002Fb>\u003C\u002Fsummary>\n\n| Language | Minimax | ElevenLabs | Qwen3-TTS | FishAudio S2 | 
**VoxCPM2** |\n|----------|:-------:|:----------:|:--------------------:|:------------:|:-----------:|\n| Arabic | **1.665** | 1.666 | – | 3.500 | 13.046 |\n| Cantonese | 34.111 | 51.513 | – | **30.670** | 38.584 |\n| Chinese | 2.252 | 16.026 | 0.928 | **0.730** | 1.136 |\n| Czech | 3.875 | **2.108** | – | 2.840 | 24.132 |\n| Dutch | 1.143 | **0.803** | – | 0.990 | 0.913 |\n| English | 2.164 | 2.339 | **0.934** | 1.620 | 2.289 |\n| Finnish | 4.666 | 2.964 | – | 3.330 | **2.632** |\n| French | 4.099 | 5.216 | **2.858** | 3.050 | 4.534 |\n| German | 1.906 | 0.572 | 1.235 | **0.550** | 0.679 |\n| Greek | 2.016 | **0.991** | – | 5.740 | 2.844 |\n| Hindi | 6.962 | **5.827** | – | 14.640 | 19.699 |\n| Indonesian | 1.237 | **1.059** | – | 1.460 | 1.084 |\n| Italian | 1.543 | 1.743 | **0.948** | 1.270 | 1.563 |\n| Japanese | 3.519 | 10.646 | 3.823 | **2.760** | 4.628 |\n| Korean | 1.747 | 1.865 | 1.755 | **1.180** | 1.962 |\n| Polish | 1.415 | **0.766** | – | 1.260 | 1.141 |\n| Portuguese | 1.877 | 1.331 | 1.526 | **1.140** | 1.938 |\n| Romanian | 2.878 | **1.347** | – | 10.740 | 21.577 |\n| Russian | 4.281 | 3.878 | 3.212 | **2.400** | 3.634 |\n| Spanish | 1.029 | 1.084 | 1.126 | **0.910** | 1.438 |\n| Thai | 2.701 | 73.936 | – | 4.230 | 2.961 |\n| Turkish | 1.52 | 0.699 | – | 0.870 | 0.817 |\n| Ukrainian | 1.082 | **0.997** | – | 2.300 | 6.316 |\n| Vietnamese | **0.88** | 73.415 | – | 7.410 | 3.307 |\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>Minimax-MLS-test SIM(⬆) Results (click to expand)\u003C\u002Fb>\u003C\u002Fsummary>\n\n| Language | Minimax | ElevenLabs | Qwen3-TTS | FishAudio S2 | **VoxCPM2** |\n|----------|:-------:|:----------:|:--------------------:|:------------:|:-----------:|\n| Arabic | 73.6 | 70.6 | – | 75.0 | **79.1** |\n| Cantonese | 77.8 | 67.0 | – | 80.5 | **83.5** |\n| Chinese | 78.0 | 67.7 | 79.9 | 81.6 | **82.5** |\n| Czech | 79.6 | 68.5 | – | **79.8** | 78.3 |\n| Dutch | 73.8 | 68.0 | – | 73.0 | **80.8** |\n| English | 75.6 
| 61.3 | 77.5 | 79.7 | **85.4** |\n| Finnish | 83.5 | 75.9 | – | 81.9 | **89.0** |\n| French | 62.8 | 53.5 | 62.8 | 69.8 | **73.5** |\n| German | 73.3 | 61.4 | 77.5 | 76.7 | **80.3** |\n| Greek | 82.6 | 73.3 | – | 79.5 | **86.0** |\n| Hindi | 81.8 | 73.0 | – | 82.1 | **85.6** |\n| Indonesian | 72.9 | 66.0 | – | 76.3 | **80.0** |\n| Italian | 69.9 | 57.9 | 81.7 | 74.7 | **78.0** |\n| Japanese | 77.6 | 73.8 | 78.8 | 79.6 | **82.8** |\n| Korean | 77.6 | 70.0 | 79.9 | 81.7 | **83.3** |\n| Polish | 80.2 | 72.9 | – | 81.9 | **88.4** |\n| Portuguese | 80.5 | 71.1 | 81.7 | 78.1 | **83.7** |\n| Romanian | **80.9** | 69.9 | – | 73.3 | 79.7 |\n| Russian | 76.1 | 67.6 | 79.2 | 79.0 | **81.1** |\n| Spanish | 76.2 | 61.5 | 81.4 | 77.6 | **83.1** |\n| Thai | 80.0 | 58.8 | – | 78.6 | **84.0** |\n| Turkish | 77.9 | 59.6 | – | 83.5 | **87.1** |\n| Ukrainian | 73.0 | 64.7 | – | 74.7 | **79.8** |\n| Vietnamese | 74.3 | 36.9 | – | 74.0 | **80.6** |\n\n\u003C\u002Fdetails>\n\n\n### Internal 30-Language ASR Benchmark\n\nWe additionally run an internal multilingual intelligibility benchmark with **30 languages × 500 samples**. 
ASR transcription is evaluated via **Gemini 3.1 Flash Lite API**.\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>Internal 30-Language ASR Benchmark (click to expand)\u003C\u002Fb>\u003C\u002Fsummary>\n\n| Language | Metric | VoxCPM2 | Fish S2-Pro |\n|---|---:|---:|---:|\n| ar (Arabic) | CER | 1.23% | 0.30% |\n| da (Danish) | WER | 2.70% | 3.52% |\n| de (German) | WER | 0.96% | 0.64% |\n| el (Greek) | WER | 3.17% | 4.61% |\n| en (English) | WER | 0.42% | 1.03% |\n| es (Spanish) | WER | 1.33% | 0.64% |\n| fi (Finnish) | WER | 2.24% | 2.80% |\n| fr (French) | WER | 2.16% | 2.34% |\n| he (Hebrew) | CER | 2.98% | 15.27% |\n| hi (Hindi) | CER | 0.79% | 0.91% |\n| id (Indonesian) | WER | 1.36% | 1.68% |\n| it (Italian) | WER | 1.65% | 1.08% |\n| ja (Japanese) | CER | 2.40% | 1.82% |\n| km (Khmer) | CER | 2.05% | 75.15% |\n| ko (Korean) | CER | 0.95% | 0.29% |\n| lo (Lao) | CER | 1.90% | 87.40% |\n| ms (Malay) | WER | 1.75% | 1.41% |\n| my (Burmese) | CER | 1.42% | 85.27% |\n| nl (Dutch) | WER | 1.25% | 1.68% |\n| no (Norwegian) | WER | 2.49% | 3.76% |\n| pl (Polish) | WER | 1.90% | 1.65% |\n| pt (Portuguese) | WER | 1.48% | 1.49% |\n| ru (Russian) | WER | 0.90% | 0.86% |\n| sv (Swedish) | WER | 2.22% | 2.63% |\n| sw (Swahili) | CER | 1.07% | 2.02% |\n| th (Thai) | CER | 0.94% | 1.92% |\n| tl (Tagalog) | WER | 2.63% | 4.00% |\n| tr (Turkish) | WER | 1.65% | 1.65% |\n| vi (Vietnamese) | WER | 1.56% | 5.56% |\n| zh (Chinese) | CER | 0.92% | 1.02% |\n| Average (30 languages) |  | **1.68%** | - |\n\n\u003C\u002Fdetails>\n\n### InstructTTSEval\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>Instruction-Guided Voice Design Results (click to expand)\u003C\u002Fb>\u003C\u002Fsummary>\n\n| Model | InstructTTSEval-ZH | | | InstructTTSEval-EN | | |\n|-------|:---:|:----:|:----:|:----:|:----:|:----:|\n| | APS⬆| DSD⬆ | RP⬆| APS⬆ | DSD⬆ | RP⬆ |\n| Hume | – | – | – | 83.0 | 75.3 | 54.3 |\n| VoxInstruct | 47.5 | 52.3 | 42.6 | 54.9 | 57.0 | 39.3 |\n| Parler-tts-mini | – | – | – | 63.4 | 48.7 | 28.6 
|\n| Parler-tts-large | – | – | – | 60.0 | 45.9 | 31.2 |\n| PromptTTS | – | – | – | 64.3 | 47.2 | 31.4 |\n| PromptStyle | – | – | – | 57.4 | 46.4 | 30.9 |\n| VoiceSculptor | 75.7 | 64.7 | 61.5 | – | – | – |\n| Mimo-Audio-7B-Instruct | 75.7 | 74.3 | 61.5 | 80.6 | 77.6 | 59.5 |\n| Qwen3TTS-12Hz-1.7B-VD | **85.2** | **81.1** | **65.1** | 82.9 | 82.4 | 68.4 |\n| **VoxCPM2** | **85.2** | 71.5 | 60.8 | **84.2** | **83.2** | **71.4** |\n\n\u003C\u002Fdetails>\n\n\n\n\n\n\n\n---\n\n## ⚙️ Fine-tuning\n\nVoxCPM supports both **full fine-tuning (SFT)** and **LoRA fine-tuning**. With as little as **5–10 minutes** of audio, you can adapt to a specific speaker, language, or domain.\n\n```bash\n# LoRA fine-tuning (parameter-efficient, recommended)\npython scripts\u002Ftrain_voxcpm_finetune.py \\\n    --config_path conf\u002Fvoxcpm_v2\u002Fvoxcpm_finetune_lora.yaml\n\n# Full fine-tuning\npython scripts\u002Ftrain_voxcpm_finetune.py \\\n    --config_path conf\u002Fvoxcpm_v2\u002Fvoxcpm_finetune_all.yaml\n\n# WebUI for training & inference\npython lora_ft_webui.py   # then open http:\u002F\u002Flocalhost:7860\n```\n\n> **Full guide →** [Fine-tuning Guide](https:\u002F\u002Fvoxcpm.readthedocs.io\u002Fen\u002Flatest\u002Ffinetuning\u002Ffinetune.html) (data preparation, configuration, training, LoRA hot-swapping, FAQ)\n\n---\n\n## 📚 Documentation\n\nFull documentation: **[voxcpm.readthedocs.io](https:\u002F\u002Fvoxcpm.readthedocs.io\u002Fen\u002Flatest\u002F)**\n\n| Topic | Link |\n|---|---|\n| Quick Start & Installation | [Quick Start](https:\u002F\u002Fvoxcpm.readthedocs.io\u002Fen\u002Flatest\u002Fquickstart.html) |\n| Usage Guide & Cookbook | [User Guide](https:\u002F\u002Fvoxcpm.readthedocs.io\u002Fen\u002Flatest\u002Fusage_guide.html) |\n| VoxCPM Series | [Models](https:\u002F\u002Fvoxcpm.readthedocs.io\u002Fen\u002Flatest\u002Fmodels\u002Fversion_history.html) |\n| Fine-tuning (SFT & LoRA) | [Fine-tuning 
Guide](https:\u002F\u002Fvoxcpm.readthedocs.io\u002Fen\u002Flatest\u002Ffinetuning\u002Ffinetune.html) |\n| FAQ & Troubleshooting | [FAQ](https:\u002F\u002Fvoxcpm.readthedocs.io\u002Fen\u002Flatest\u002Ffaq.html) |\n\n---\n\n## 🌟 Ecosystem & Community\n\n| Project | Description |\n|---|---|\n| [**Nano-vLLM**](https:\u002F\u002Fgithub.com\u002Fa710128\u002Fnanovllm-voxcpm) | High-throughput and Fast GPU serving |\n| [**vLLM-Omni**](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fvllm-omni) | Official vLLM omni-modal serving for VoxCPM2 — PagedAttention, OpenAI-compatible API |\n| [**VoxCPM.cpp**](https:\u002F\u002Fgithub.com\u002Fbluryar\u002FVoxCPM.cpp) | GGML\u002FGGUF: CPU, CUDA, Vulkan inference |\n| [**VoxCPM-ONNX**](https:\u002F\u002Fgithub.com\u002Fbluryar\u002FVoxCPM-ONNX) | ONNX export for CPU inference |\n| [**VoxCPMANE**](https:\u002F\u002Fgithub.com\u002F0seba\u002FVoxCPMANE) | Apple Neural Engine backend |\n| [**voxcpm_rs**](https:\u002F\u002Fgithub.com\u002Fmadushan1000\u002Fvoxcpm_rs) | Rust re-implementation |\n| [**ComfyUI-VoxCPM**](https:\u002F\u002Fgithub.com\u002Fwildminder\u002FComfyUI-VoxCPM) | ComfyUI node-based workflows |\n| [**ComfyUI_RH_VoxCPM**](https:\u002F\u002Fgithub.com\u002FHM-RunningHub\u002FComfyUI_RH_VoxCPM) | Feature-complete ComfyUI workflow for VoxCPM 2 with multi-speaker generation, LoRA, and auto-ASR |\n| [**ComfyUI-VoxCPMTTS**](https:\u002F\u002Fgithub.com\u002F1038lab\u002FComfyUI-VoxCPMTTS) | ComfyUI TTS extension |\n| [**TTS WebUI**](https:\u002F\u002Fgithub.com\u002Frsxdalv\u002Ftts_webui_extension.vox_cpm) | Browser-based TTS extension |\n\n> See the full [Ecosystem](https:\u002F\u002Fvoxcpm.readthedocs.io\u002Fen\u002Flatest\u002F) in the docs. Community projects are not officially maintained by OpenBMB. Built something cool? 
[Open an issue or PR](https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FVoxCPM\u002Fissues) to add it!\n\n---\n\n## ⚠️ Risks and Limitations\n\n- **Potential for Misuse:** VoxCPM's voice cloning can generate highly realistic synthetic speech. It is **strictly forbidden** to use VoxCPM for impersonation, fraud, or disinformation. We strongly recommend clearly marking any AI-generated content.\n- **Controllable Generation Stability:** Voice Design and Controllable Voice Cloning results can vary between runs — you may try to generate 1~3 times to obtain the desired voice or style. We are actively working on improving controllability consistency.\n- **Language Coverage:** VoxCPM2 officially supports 30 languages. For languages not on the list, you are welcome to test directly or try fine-tuning on your own data. We plan to expand language coverage in future releases.\n- **Usage:** This model is released under the Apache-2.0 license. For production deployments, we recommend conducting thorough testing and safety evaluation tailored to your use case.\n\n---\n\n## 📖 Citation\n\nIf you find VoxCPM helpful, please consider citing our work and starring ⭐ the repository!\n\n```bib\n@article{voxcpm2_2026,\n  title   = {VoxCPM2: Tokenizer-Free TTS for Multilingual Speech Generation, Creative Voice Design, and True-to-Life Cloning},\n  author  = {VoxCPM Team},\n  journal = {GitHub},\n  year    = {2026},\n}\n\n@article{voxcpm2025,\n  title   = {VoxCPM: Tokenizer-Free TTS for Context-Aware Speech Generation\n             and True-to-Life Voice Cloning},\n  author  = {Zhou, Yixuan and Zeng, Guoyang and Liu, Xin and Li, Xiang and\n             Yu, Renjie and Wang, Ziyang and Ye, Runchuan and Sun, Weiyue and\n             Gui, Jiancheng and Li, Kehan and Wu, Zhiyong and Liu, Zhiyuan},\n  journal = {arXiv preprint arXiv:2509.24650},\n  year    = {2025},\n}\n```\n\n## 📄 License\n\nVoxCPM model weights and code are open-sourced under the [Apache-2.0](LICENSE) license.\n\n## 🙏 
Acknowledgments\n\n- [DiTAR](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.03930) for the diffusion autoregressive backbone\n- [MiniCPM-4](https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FMiniCPM) for the language model foundation\n- [CosyVoice](https:\u002F\u002Fgithub.com\u002FFunAudioLLM\u002FCosyVoice) for the Flow Matching-based LocDiT implementation\n- [DAC](https:\u002F\u002Fgithub.com\u002Fdescriptinc\u002Fdescript-audio-codec) for the Audio VAE backbone\n- Our community users for trying VoxCPM, reporting issues, sharing ideas, and contributing—your support helps the project keep getting better\n\n## Institutions\n\n\u003Cp>\n  \u003Ca href=\"https:\u002F\u002Fmodelbest.cn\u002F\">\u003Cimg src=\"assets\u002Fmodelbest_logo.png\" width=\"28px\"> ModelBest\u003C\u002Fa>\n  &nbsp;&nbsp;&nbsp;\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fthuhcsi\">\u003Cimg src=\"assets\u002Fthuhcsi_logo.png\" width=\"28px\"> THUHCSI\u003C\u002Fa>\n\u003C\u002Fp>\n\n## ⭐ Star History\n\n[![Star History Chart](https:\u002F\u002Fapi.star-history.com\u002Fsvg?repos=OpenBMB\u002FVoxCPM&type=Date)](https:\u002F\u002Fstar-history.com\u002F#OpenBMB\u002FVoxCPM&Date)\n","VoxCPM is a tokenizer-free text-to-speech system that directly generates continuous speech representations through an end-to-end diffusion autoregressive architecture, achieving highly natural and expressive speech synthesis. Its latest version, VoxCPM2, has 2 billion parameters, is trained on more than 2 million hours of multilingual speech data, and supports 30 languages, voice design, controllable voice cloning, and 48kHz high-fidelity audio output. The system is particularly suited to applications that need high-quality multilingual speech synthesis, such as cross-language content creation, virtual assistants, and game development. In addition, VoxCPM2 lets users create entirely new voices from a natural-language description alone and can clone any voice from a short reference clip, with optional style guidance such as emotion.",2,"2026-06-11 02:50:38","top_language"]