[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-74027":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":23,"hasPages":23,"topics":25,"createdAt":10,"pushedAt":10,"updatedAt":26,"readmeContent":27,"aiSummary":28,"trendingCount":16,"starSnapshotCount":16,"syncStatus":29,"lastSyncTime":30,"discoverSource":31},74027,"OmniVoice","k2-fsa\u002FOmniVoice","k2-fsa","High-Quality Voice Cloning TTS for 600+ Languages","",null,"Python",7298,1139,48,43,0,138,517,850,414,115.17,"Apache License 2.0",false,"master",[],"2026-06-12 04:01:12","# OmniVoice 🌍\n\n\u003Cp align=\"center\">\n  \u003Cimg width=\"200\" height=\"200\" alt=\"OmniVoice\" src=\"https:\u002F\u002Fzhu-han.github.io\u002Fomnivoice\u002Fpics\u002Fomnivoice.jpg\" \u002F>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fk2-fsa\u002FOmniVoice\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%A4%97%20Hugging%20Face-Model-FFD21E\" alt=\"Hugging Face Model\">\u003C\u002Fa>\n  &nbsp;\n  \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fk2-fsa\u002FOmniVoice\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%A4%97%20Hugging%20Face-Space-blue\" alt=\"Hugging Face Space\">\u003C\u002Fa>\n  &nbsp;\n  \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2604.00688\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-Paper-B31B1B.svg\">\u003C\u002Fa>\n  &nbsp;\n  \u003Ca href=\"https:\u002F\u002Fzhu-han.github.io\u002Fomnivoice\">\u003Cimg 
src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FGitHub.io-Demo_Page-blue?logo=GitHub&style=flat-square\">\u003C\u002Fa>\n  &nbsp;\n  \u003Ca href=\"https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fk2-fsa\u002FOmniVoice\u002Fblob\u002Fmaster\u002Fdocs\u002FOmniVoice.ipynb\">\u003Cimg src=\"https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg\" alt=\"Open In Colab\">\u003C\u002Fa>\n\u003C\u002Fp>\n\nOmniVoice is a state-of-the-art massively multilingual zero-shot text-to-speech (TTS) model supporting over 600 languages. Built on a novel diffusion language model-style architecture, it generates high-quality speech with superior inference speed, supporting voice cloning and voice design.\n\n**Contents**: [Key Features](#key-features) | [Installation](#installation) | [Quick Start](#quick-start) | [Python API](#python-api) | [Command-Line Tools](#command-line-tools) | [Training & Evaluation](#training--evaluation) | [Discussion](#discussion--communication) | [Citation](#citation)\n\n## Key Features\n\n- **600+ Languages Supported**: The broadest language coverage among zero-shot TTS models ([full list](docs\u002Flanguages.md)).\n- **Voice Cloning**: State-of-the-art voice cloning quality.\n- **Voice Design**: Control voices via assigned speaker attributes (gender, age, pitch, dialect\u002Faccent, whisper, etc.).\n- **Fine-grained Control**: Non-verbal symbols (e.g., `[laughter]`) and pronunciation correction via pinyin or phonemes.\n- **Fast Inference**: RTF as low as 0.025 (40x faster than real-time).\n- **Diffusion Language Model-style Architecture**: A clean, streamlined, and scalable design that delivers both quality and speed.\n\n---\n\n## Installation\n\nChoose **one** of the following methods: **pip** or **uv**.\n\n### pip\n\n> We recommend using a fresh virtual environment (e.g., `conda`, `venv`, etc.) 
to avoid conflicts.\n\n**Step 1**: Install PyTorch\n\n\u003Cdetails>\n\u003Csummary>NVIDIA GPU\u003C\u002Fsummary>\n\n```bash\n# Install PyTorch for your CUDA version, e.g.\npip install torch==2.8.0+cu128 torchaudio==2.8.0+cu128 --extra-index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu128\n```\n> See the [official PyTorch site](https:\u002F\u002Fpytorch.org\u002Fget-started\u002Flocally\u002F) for other versions.\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>Apple Silicon\u003C\u002Fsummary>\n\n```bash\npip install torch==2.8.0 torchaudio==2.8.0\n```\n\n\u003C\u002Fdetails>\n\n**Step 2**: Install OmniVoice (choose one)\n\n```bash\n# From PyPI (stable release)\npip install omnivoice\n\n# From the latest source on GitHub (no need to clone)\npip install git+https:\u002F\u002Fgithub.com\u002Fk2-fsa\u002FOmniVoice.git\n\n# For development (clone first, editable install)\ngit clone https:\u002F\u002Fgithub.com\u002Fk2-fsa\u002FOmniVoice.git\ncd OmniVoice\npip install -e .\n```\n\n### uv\n\nClone the repository and sync dependencies:\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fk2-fsa\u002FOmniVoice.git\ncd OmniVoice\nuv sync\n```\n\n> **Tip**: You can use a mirror with `uv sync --default-index \"https:\u002F\u002Fmirrors.aliyun.com\u002Fpypi\u002Fsimple\"`\n\n---\n\n## Quick Start\n\nTry OmniVoice without coding:\n\n- Launch the local web UI: `omnivoice-demo --ip 0.0.0.0 --port 8001`\n\n- Or try it directly on [Hugging Face Space](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fk2-fsa\u002FOmniVoice)\n\n- Or run it in Google Colab: [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fk2-fsa\u002FOmniVoice\u002Fblob\u002Fmaster\u002Fdocs\u002FOmniVoice.ipynb)\n\n> If you have trouble connecting to Hugging Face when downloading the pre-trained models, set `export HF_ENDPOINT=\"https:\u002F\u002Fhf-mirror.com\"` before 
running.\n\nFor full usage, see the [Python API](#python-api) and [Command-Line Tools](#command-line-tools) sections below.\n\n---\n\n## Python API\n\nOmniVoice supports three generation modes. All features in this section are also available via [command-line tools](#command-line-tools).\n\n### Voice Cloning\n\nClone a voice from a short reference audio clip. Provide `ref_audio` and `ref_text`:\n\n```python\nfrom omnivoice import OmniVoice\nimport soundfile as sf\nimport torch\n\nmodel = OmniVoice.from_pretrained(\n    \"k2-fsa\u002FOmniVoice\",\n    device_map=\"cuda:0\",\n    dtype=torch.float16\n)\n# Apple Silicon users: use device_map=\"mps\" instead\n\naudio = model.generate(\n    text=\"Hello, this is a test of zero-shot voice cloning.\",\n    ref_audio=\"ref.wav\",\n    ref_text=\"Transcription of the reference audio.\",\n)  # audio is a list of `np.ndarray` with shape (T,) at 24 kHz.\n\n# If you don't want to supply `ref_text` manually, simply omit it.\n# The model will then use Whisper ASR to transcribe the reference audio automatically.\n\nsf.write(\"out.wav\", audio[0], 24000)\n```\n\n> **Tips**\n>\n> - Use a 3–10 second reference audio clip. Longer audio slows down inference and may degrade cloning quality.\n> - For standard pronunciation, use reference audio in the **same language** as the target speech. 
In cross-lingual voice cloning (i.e., the reference audio and target speech are in different languages), the generated speech will carry an accent from the reference audio's language.\n> - For better results with Arabic numerals, normalize them to words first (e.g., \"123\" → \"one hundred twenty-three\") with text normalization tools (e.g., [WeTextProcessing](https:\u002F\u002Fgithub.com\u002Fwenet-e2e\u002FWeTextProcessing)).\n>\n> For more tips, see [docs\u002Ftips.md](docs\u002Ftips.md).\n\n### Voice Design\n\nDescribe the desired voice with speaker attributes; no reference audio is needed.\nSupported attributes: **gender** (male\u002Ffemale), **age** (child to elderly),\n**pitch** (very low to very high), **style** (whisper), **English accent**\n(American, British, etc.), and **Chinese dialect** (四川话, 陕西话, etc.).\nAttributes are comma-separated and freely combinable across categories.\n\n```python\naudio = model.generate(\n    text=\"Hello, this is a test of zero-shot voice design.\",\n    instruct=\"female, low pitch, british accent\",\n)\n```\n\n> **Note**: Voice design was trained on Chinese and English data only. It can generalize to other languages, but results can be unstable for some low-resource languages.\n\nSee [docs\u002Fvoice-design.md](docs\u002Fvoice-design.md) for the full attribute\nreference, Chinese equivalents, and usage tips.\n\n### Auto Voice\n\nLet the model choose a voice automatically:\n\n```python\naudio = model.generate(text=\"This is a sentence without any voice prompt.\")\n```\n\n### Generation Parameters\n\nAll three modes above share the same `model.generate()` API. You can further control the generation behavior via keyword arguments:\n\n```python\naudio = model.generate(\n    text=\"...\",\n    num_step=32,   # diffusion steps (or 16 for faster inference)\n    speed=1.0,     # speed factor (>1.0 faster, \u003C1.0 slower)\n    duration=10.0, # fixed output duration in seconds (overrides speed)\n    # ... 
more options\n)\n```\nSee more detailed control in [docs\u002Fgeneration-parameters.md](docs\u002Fgeneration-parameters.md).\n\n### Non-Verbal & Pronunciation Control\n\nOmniVoice supports inline **non-verbal symbols** and **pronunciation correction** within the input text.\n\n**Non-verbal symbols**: Insert tags like `[laughter]` directly in the text to add expressive non-verbal sounds.\n\n```python\naudio = model.generate(text=\"[laughter] You really got me. I didn't see that coming at all.\")\n```\n\nSupported tags: `[laughter]`, `[sigh]`, `[confirmation-en]`, `[question-en]`, `[question-ah]`, `[question-oh]`, `[question-ei]`, `[question-yi]`, `[surprise-ah]`, `[surprise-oh]`, `[surprise-wa]`, `[surprise-yo]`, `[dissatisfaction-hnn]`.\n\n**Pronunciation control (Chinese)**: Use pinyin with tone numbers to correct specific character pronunciations.\n\n```python\naudio = model.generate(text=\"这批货物打ZHE2出售后他严重SHE2本了，再也经不起ZHE1腾了。\")\n```\n\n**Pronunciation control (English)**: Use [CMU pronunciation dictionary](https:\u002F\u002Fsvn.code.sf.net\u002Fp\u002Fcmusphinx\u002Fcode\u002Ftrunk\u002Fcmudict\u002Fcmudict.0.7a)  (uppercase, in brackets) to override default English pronunciations.\n\n```python\naudio = model.generate(text=\"He plays the [B EY1 S] guitar while catching a [B AE1 S] fish.\")\n```\n\n---\n\n## Command-Line Tools\n\nThree CLI entry points are provided. The CLI tools support all features available in the Python API (voice cloning, voice design, auto voice, generation parameters, etc.) 
— all controlled via command-line arguments.\n\n| Command | Description | Source |\n|---|---|---|\n| `omnivoice-demo` | Interactive Gradio web demo | [omnivoice\u002Fcli\u002Fdemo.py](omnivoice\u002Fcli\u002Fdemo.py) |\n| `omnivoice-infer` | Single-item inference | [omnivoice\u002Fcli\u002Finfer.py](omnivoice\u002Fcli\u002Finfer.py) |\n| `omnivoice-infer-batch` | Batch inference across multiple GPUs | [omnivoice\u002Fcli\u002Finfer_batch.py](omnivoice\u002Fcli\u002Finfer_batch.py) |\n\n### Demo\n\n```bash\nomnivoice-demo --ip 0.0.0.0 --port 8001\n```\n\nProvides a web UI for voice cloning and voice design. See `omnivoice-demo --help` for all options.\n\n### Single Inference\n\n```bash\n# Voice Cloning\n# ref_text can be omitted (Whisper will auto-transcribe ref_audio to get it).\nomnivoice-infer \\\n    --model k2-fsa\u002FOmniVoice \\\n    --text \"This is a test for text to speech.\" \\\n    --ref_audio ref.wav \\\n    --ref_text \"Transcription of the reference audio.\" \\\n    --output hello.wav\n\n# Voice Design\nomnivoice-infer --model k2-fsa\u002FOmniVoice \\\n    --text \"This is a test for text to speech.\" \\\n    --instruct \"male, British accent\" \\\n    --output hello.wav\n\n# Auto Voice\nomnivoice-infer \\\n    --model k2-fsa\u002FOmniVoice \\\n    --text \"This is a test for text to speech.\"\\\n    --output hello.wav\n```\n\n### Batch Inference\n\n`omnivoice-infer-batch` can distribute batch inference across multiple GPUs, designed for large-scale TTS tasks.\n\n```bash\nomnivoice-infer-batch \\\n    --model k2-fsa\u002FOmniVoice \\\n    --test_list test.jsonl \\\n    --res_dir results\u002F\n```\n\nThe test list is a JSONL file where each line is a JSON object:\n```json\n{\"id\": \"sample_001\", \"text\": \"Hello world\", \"ref_audio\": \"\u002Fpath\u002Fto\u002Fref.wav\", \"ref_text\": \"Reference transcript\", \"instruct\": \"female, british accent\", \"language_id\": \"en\", \"duration\": 10.0, \"speed\": 1.0}\n```\nOnly `id` and `text` are 
mandatory fields. `ref_audio` and `ref_text` are used in voice cloning mode. `instruct` is used in voice design mode. If neither `ref_audio` nor `instruct` is provided, the model generates speech in a random voice.\n\n`language_id`, `duration`, and `speed` are optional. `duration` (in seconds) fixes the output length; `speed` controls the speaking rate. If `duration` and `speed` are both provided, `speed` will be ignored.\n\n---\n\n## Training & Evaluation\n\nSee [examples\u002F](examples\u002F) for the complete pipeline, from data preparation to training, evaluation, and finetuning.\n\n---\n\n## Discussion & Communication\n\nYou can start a discussion directly on [GitHub Issues](https:\u002F\u002Fgithub.com\u002Fk2-fsa\u002FOmniVoice\u002Fissues).\n\nYou can also scan the QR code to join our WeChat group or follow our WeChat official account.\n\n| WeChat Group | WeChat Official Account |\n| ------------ | ----------------------- |\n|![wechat](https:\u002F\u002Fk2-fsa.org\u002Fzh-CN\u002Fassets\u002Fpic\u002Fwechat_group.jpg) |![wechat](https:\u002F\u002Fk2-fsa.org\u002Fzh-CN\u002Fassets\u002Fpic\u002Fwechat_account.jpg) |\n\n---\n\n## Community Projects\n\nOmniVoice is supported by a growing ecosystem of community projects.\nExplore them in [Community Projects](docs\u002Fcommunity-projects.md).\n\n---\n\n## Citation\n\n```bibtex\n@article{zhu2026omnivoice,\n      title={OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models},\n      author={Zhu, Han and Ye, Lingxuan and Kang, Wei and Yao, Zengwei and Guo, Liyong and Kuang, Fangjun and Han, Zhifeng and Zhuang, Weiji and Lin, Long and Povey, Daniel},\n      journal={arXiv preprint arXiv:2604.00688},\n      year={2026}\n}\n```\n\n---\n\n## Disclaimer\n\nUsers are strictly prohibited from using this model for unauthorized voice cloning, voice impersonation, fraud, scams, or any other illegal or unethical activities. 
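All users shall ensure full compliance with applicable local laws, regulations, and ethical standards. The developers assume no liability for any misuse of this model and advocate for responsible AI development and use, encouraging the community to uphold safety and ethical principles in AI research and applications.\n","OmniVoice is a high-quality voice-cloning text-to-speech (TTS) system that supports more than 600 languages. Its core technology is based on a novel diffusion language model architecture that generates high-quality speech at high inference speed, and it supports zero-shot multilingual processing, voice cloning, and custom voice design. Users can control the characteristics of the output voice by adjusting attributes such as gender, age, and pitch, and the system also supports non-verbal symbols and pronunciation correction. Thanks to its broad multilingual coverage and efficient runtime performance, OmniVoice is well suited to audio content creation, assistive technology development, and similar work in cross-lingual or multilingual scenarios.",2,"2026-06-11 03:48:27","high_star"]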