[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-83416":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":8,"htmlUrl":8,"language":9,"languages":8,"totalLinesOfCode":8,"stars":10,"forks":11,"watchers":12,"openIssues":13,"contributorsCount":14,"subscribersCount":14,"size":14,"stars1d":15,"stars7d":16,"stars30d":16,"stars90d":14,"forks30d":14,"starsTrendScore":17,"compositeScore":18,"rankGlobal":8,"rankLanguage":8,"license":19,"archived":20,"fork":20,"defaultBranch":21,"hasWiki":20,"hasPages":20,"topics":22,"createdAt":8,"pushedAt":8,"updatedAt":23,"readmeContent":24,"aiSummary":8,"trendingCount":14,"starSnapshotCount":14,"syncStatus":25,"lastSyncTime":26,"discoverSource":27},83416,"dots.tts","rednote-hilab\u002Fdots.tts","rednote-hilab",null,"Python",468,33,6,8,0,39,337,250,91.59,"Apache License 2.0",false,"main",[],"2026-06-12 04:01:41","\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002Flogo.png\" alt=\"dots.tts\" width=\"280\">\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Frednote-hilab\u002Fdots.tts\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FGitHub-rednote--hilab%2Fdots.tts-blue?logo=github\" alt=\"GitHub\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Frednote-hilab\u002Fdotstts\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%A4%97%20Hugging%20Face-dots.tts%20collection-yellow\" alt=\"Hugging Face\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Frednote-hilab\u002Fdots.tts\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPlayground-Live-orange\" alt=\"Playground\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Frednote-hilab.github.io\u002Fdots.tts-demo\u002F\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo%20Page-Live-red\" alt=\"Demo Page\">\u003C\u002Fa>\n  \u003Ca href=\"LICENSE\">\u003Cimg 
src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-Apache%202.0-green\" alt=\"License\">\u003C\u002Fa>\n\u003C\u002Fp>\n\n**dots.tts** is a **2B-parameter fully continuous, end-to-end autoregressive (AR) text-to-speech system**. The backbone pairs a semantic encoder, an LLM, and an autoregressive flow-matching acoustic head over a **48 kHz** AudioVAE, with no discrete tokens anywhere in the pipeline.\n\ndots.tts achieves the best average performance on **Seed-TTS-Eval**, with WERs of **0.94% \u002F 1.30% \u002F 6.60%** and SIM scores of **81.0 \u002F 77.1 \u002F 79.5** on the zh \u002F en \u002F zh-hard test sets, respectively. It further attains the **highest average speaker similarity (83.9)** on the 24-language **MiniMax multilingual** benchmark. Across other benchmarks, dots.tts also consistently demonstrates **open-source state-of-the-art performance**, exhibiting strong generation stability, voice cloning ability, and emotional expressiveness.\n\n### News\n\n* **[2026.06]** 🔥 We have released **dots.tts** — 2B fully continuous AR TTS, with pretrained \u002F self-corrective-aligned \u002F MeanFlow-distilled checkpoints and full inference & fine-tuning code under Apache-2.0.\n\n---\n\n## Contents\n\n- [Quick Start](#-quick-start)\n  - [Installation](#installation)\n  - [CLI](#cli)\n  - [Python API](#python-api)\n  - [Web Demo (Gradio)](#web-demo-gradio)\n  - [Fine-tuning](#fine-tuning)\n- [Architecture](#-architecture)\n- [Performance](#-performance)\n  - [Seed-TTS-Eval](#seed-tts-eval)\n  - [MiniMax Multilingual](#minimax-multilingual-24-languages)\n  - [CV3-Eval](#cv3-eval)\n  - [EmergentTTS-Eval](#emergenttts-eval)\n- [Risks and Limitations](#%EF%B8%8F-risks-and-limitations)\n- [Citation](#-citation)\n- [License](#-license)\n\n---\n\n## 🚀 Quick Start\n\n### Installation\n\nWe recommend creating a fresh conda environment first (Python 3.10–3.12):\n\n```bash\nconda create -n dots_tts python=3.10 -y\nconda activate dots_tts\n```\n\nThen install from 
source:\n\n```bash\npython -m pip install --upgrade pip\npython -m pip install -e . -c constraints\u002Frecommended.txt\n```\n\nFor training \u002F linting extras:\n\n```bash\npython -m pip install -e .[full] -c constraints\u002Frecommended.txt\n```\n\nThe constraints file pins the recommended versions. To use other compatible\nversions, omit `-c constraints\u002Frecommended.txt`; the compatibility ranges are\ndeclared in `pyproject.toml`.\n\n### CLI\n\nThe package installs a `dots.tts` entry point:\n\n```bash\n# Continuation voice cloning (reference audio + transcript) — recommended\ndots.tts \\\n  --model-name-or-path \u002Fpath\u002Fto\u002Fdots_tts_model \\\n  --text \"Hello, this is a zero-shot voice cloning demonstration.\" \\\n  --prompt-audio \u002Fpath\u002Fto\u002Freference.wav \\\n  --prompt-text \"The exact transcript of the reference audio.\" \\\n  --output clone.wav\n\n# X-vector-only voice cloning (reference audio only — timbre from speaker x-vector)\ndots.tts \\\n  --model-name-or-path \u002Fpath\u002Fto\u002Fdots_tts_model \\\n  --text \"Hello, this is a zero-shot voice cloning demonstration.\" \\\n  --prompt-audio \u002Fpath\u002Fto\u002Freference.wav \\\n  --output clone.wav\n\n# Random-voice sampling (no reference) — only meaningful with a fine-tuned\n# single-speaker checkpoint; output is deterministic for a fixed --seed\ndots.tts \\\n  --model-name-or-path \u002Fpath\u002Fto\u002Fdots_tts_model \\\n  --text \"Hello, this is a quick speech synthesis test.\" \\\n  --output output.wav\n```\n\nCommon flags:\n\n| Flag | Description | Default |\n|------|-------------|---------|\n| `--num-steps` | Flow-matching sampling steps | `10` |\n| `--guidance-scale` | CFG scale (flow-matching only; MeanFlow has CFG fused into the student) | `1.0` |\n| `--normalize-text` | Apply text normalization before inference | off |\n| `--language` | Add an explicit language tag to the input text; accepts `none`, `auto_detect`, language codes such as `EN` \u002F `ZH`, or 
names such as `english` \u002F `chinese` | `none` |\n| `--seed` | RNG seed | `42` |\n\n`dots.tts --help` lists the full set.\n\nNotes:\n\n- `--prompt-audio` selects the speaker voice — continuation cloning when paired with `--prompt-text`, x-vector-only cloning when used alone. Omitting `--prompt-audio` falls back to random-voice sampling, which is only meaningful on a fine-tuned single-speaker checkpoint.\n- `--language` is useful for multilingual or code-switched text when you want to force the model-side language tag. For example, pass `--language EN` for English, `--language ZH` for Mandarin, `--language Cantonese` for Cantonese, or `--language auto_detect` to infer the tag from `--text`.\n- Pass either a local model directory or a Hugging Face repo id.\n\n### Python API\n\n```python\nfrom dots_tts.runtime import DotsTtsRuntime\nimport soundfile as sf\n\nruntime = DotsTtsRuntime.from_pretrained(\n    \"\u002Fpath\u002Fto\u002Fdots_tts_model\",\n    precision=\"bfloat16\",\n)\n\nresult = runtime.generate(\n    text=\"Hello, this is a quick speech synthesis test.\",\n    prompt_audio_path=\"\u002Fpath\u002Fto\u002Freference.wav\",\n    prompt_text=\"The exact transcript of the reference audio.\",\n    num_steps=10,\n    guidance_scale=1.0,\n)\n\nsf.write(\"output.wav\", result[\"audio\"].float().cpu().squeeze().numpy(), result[\"sample_rate\"])\n```\n\n### Web Demo (Gradio)\n\n```bash\npython apps\u002Fgradio\u002Fapp.py \\\n  --model-name-or-path \u002Fpath\u002Fto\u002Fdots_tts_model \\\n  --optimize\n```\n\nDefaults to `http:\u002F\u002F0.0.0.0:7860`. 
With `--optimize`, the first launch runs a warmup pass (slower startup, faster steady state).\n\nCommon flags:\n\n- `--host` \u002F `--port` \u002F `--execution-mode` \u002F `--optimize`\n- `--model-name-or-path` \u002F `--output-dir` \u002F `--log-file`\n\nThe model, execution mode, precision, optimize flag, and max generation length are fixed at startup; changing any of them requires restarting the server.\n\n### Fine-tuning\n\nThis repo currently exposes only a fine-tuning entry point. We are open to releasing the full training pipeline (pretraining, self-corrective alignment, and MeanFlow distillation) in the future. Fine-tune from a released checkpoint with:\n\n```bash\naccelerate launch scripts\u002Ftrain_dots_tts.py --config configs\u002Fdots_tts.yaml\n```\n\n`configs\u002Fdots_tts.yaml` is a smoke configuration that verifies the pipeline runs end-to-end on commodity hardware. To use it for your own run, replace `train.pretrained_model_path`, `train_data.sources` \u002F `val_data.sources`, `train.output_dir`, and `train.max_train_steps` with your own values.\n\nA helper script downloads LJSpeech-1.1-48kHz and emits a train\u002Fvalid JSONL manifest for the smoke run:\n\n```bash\npython scripts\u002Fprepare_train_jsonl_manifest.py --output-dir downloaded_data\n```\n\nManifest format (one JSON object per line, at least three fields):\n\n```json\n{\"fid\": \"sample-0001\", \"audio\": \"\u002Fabs\u002Fpath\u002Fto\u002Faudio.wav\", \"text\": \"hello world\"}\n```\n\n---\n\n## 🏛 Architecture\n\nA frozen **AudioVAE** encodes a 48 kHz mono waveform into a continuous latent and decodes it back via a BigVGAN-style causal decoder. 
An **autoregressive backbone** predicts that latent one patch at a time, in three components:\n\n- **Semantic encoder** — re-encodes each newly generated VAE patch into a compact embedding for the LLM, stripping high-variance acoustic detail.\n- **LLM** — initialized from **Qwen2.5-1.5B-Base**, consumes BPE text directly (no phonemes), and emits one hidden state per audio step.\n- **AR flow-matching head** — a DiT that conditions on the LLM hidden state and the AR prefix to denoise the next VAE patch, with a frozen CAM++ speaker x-vector as side input.\n\nTwo sequence layouts: *plain mode* places the full text as a prefix before the audio span (standard TTS); *[1T1A interleaved mode](scripts\u002Fexample_double_streaming.py)* alternates one BPE token with one audio step, enabling low-latency streaming when driven by a duplex dialogue LLM. See the technical report for full architectural and training details.\n\n---\n\n## 📊 Performance\n\nBaselines are taken from original publications or default-configuration open-source releases.\n\n### Seed-TTS-Eval\n\nZero-shot, ~3 s reference prompt, scored by the benchmark's reference ASR and WavLM-SV similarity.\n\n| Model | Params | test-en WER↓ \u002F SIM↑ | test-zh WER↓ \u002F SIM↑ | test-zh-hard WER↓ \u002F SIM↑ | **Avg WER↓ \u002F SIM↑** |\n|---|---:|:---:|:---:|:---:|:---:|\n| CosyVoice 3 | 1.5B | 2.22 \u002F 72.0 | 1.12 \u002F 78.1 | **5.83** \u002F 75.8 | 3.06 \u002F 75.3 |\n| DiTAR | 0.6B | 1.69 \u002F 73.5 | 1.02 \u002F 75.3 | — | — |\n| F5-TTS | 0.3B | 2.00 \u002F 67.0 | 1.53 \u002F 76.0 | 8.67 \u002F 71.3 | 4.10 \u002F 71.4 |\n| FireRedTTS-2 | 1.5B | 1.95 \u002F 66.5 | 1.14 \u002F 73.6 | 8.98 \u002F 70.3 | 4.02 \u002F 70.1 |\n| IndexTTS 2 | 1.5B | 2.23 \u002F 70.6 | 1.03 \u002F 76.5 | 7.12 \u002F 75.5 | 3.46 \u002F 74.2 |\n| MegaTTS 3 | 0.5B | 2.79 \u002F 77.1 | 1.52 \u002F 79.0 | — | — |\n| MiniMax-Speech | — | 1.65 \u002F 69.2 | **0.83** \u002F 78.3 | — | — |\n| Qwen3-TTS | 1.7B | **1.23** \u002F 71.7 | 1.22 
\u002F 77.0 | 6.76 \u002F 74.8 | 3.07 \u002F 74.5 |\n| Seed-TTS | — | 2.25 \u002F 76.2 | 1.12 \u002F 79.6 | 7.59 \u002F 77.6 | 3.65 \u002F 77.8 |\n| VibeVoice | 1.5B | 3.04 \u002F 68.9 | 1.16 \u002F 74.4 | — | — |\n| VoxCPM 2 | 2B | 1.84 \u002F 75.3 | 0.97 \u002F 79.5 | 8.13 \u002F 75.3 | 3.65 \u002F 76.7 |\n| **dots.tts (Pretrain)** | **2B** | 1.34 \u002F 76.8 | 0.96 \u002F 80.5 | 6.46 \u002F 79.2 | **2.92** \u002F 78.8 |\n| **dots.tts (SCA)** | **2B** | 1.30 \u002F **77.1** | 0.94 \u002F **81.0** | 6.60 \u002F **79.5** | 2.95 \u002F **79.2** |\n| **dots.tts (MF, NFE=4)** | **2B** | 1.29 \u002F 76.2 | 0.94 \u002F 80.0 | 6.60 \u002F 78.5 | 2.94 \u002F 78.2 |\n\n### MiniMax Multilingual (24 languages)\n\nPer-language WER \u002F SIM on the MiniMax-Speech multilingual test set (100 utterances × 2 reference speakers per language). **Highest average SIM (83.9, SCA)**, with a dots.tts variant taking the per-language SIM lead outright on 19 of 24 languages and tying on 2 more. Content fidelity is on par with the strongest systems on high-resource \u002F Western European splits, and trails on low-resource long-tail languages where SIM is still preserved.\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>Per-language WER \u002F SIM (click to expand)\u003C\u002Fb>\u003C\u002Fsummary>\n\n| Language | MiniMax | ElevenLabs | Fish-Audio S2 | VoxCPM 2 | **dots.tts (Pre.)** | **dots.tts (SCA)** | **dots.tts (MF$_4$)** |\n|---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|\n| Arabic | **1.67** \u002F 73.6 | **1.67** \u002F 70.6 | 3.50 \u002F 75.0 | 13.05 \u002F **79.1** | 37.91 \u002F 77.5 | 36.19 \u002F **79.1** | 39.65 \u002F 77.6 |\n| Cantonese* | 34.11 \u002F 77.8 | 51.51 \u002F 67.0 | 30.67 \u002F 80.5 | 38.58 \u002F 83.5 | 37.91 \u002F 84.7 | 42.32 \u002F **85.0** | 37.82 \u002F 84.0 |\n| Chinese | 2.25 \u002F 78.0 | 16.03 \u002F 67.7 | **0.73** \u002F 81.6 | 1.14 \u002F **82.5** | 1.08 \u002F 82.3 | 0.77 \u002F **82.5** | 1.01 \u002F 81.8 |\n| Czech | 3.88 \u002F 79.6 | **2.11** \u002F 
68.5 | 2.84 \u002F 79.8 | 24.13 \u002F 78.3 | 5.05 \u002F 83.8 | 4.25 \u002F **84.2** | 5.67 \u002F 83.9 |\n| Dutch | 1.14 \u002F 73.8 | **0.80** \u002F 68.0 | 0.99 \u002F 73.0 | 0.91 \u002F 80.8 | 1.20 \u002F 81.4 | 1.39 \u002F **82.2** | 1.30 \u002F 82.1 |\n| English | 2.16 \u002F 75.6 | 2.34 \u002F 61.3 | 1.62 \u002F 79.7 | 2.29 \u002F 85.4 | 1.06 \u002F 86.9 | **1.03** \u002F **87.5** | 1.09 \u002F 86.9 |\n| Finnish | 4.67 \u002F 83.5 | 2.96 \u002F 75.9 | 3.33 \u002F 81.9 | **2.63** \u002F **89.0** | 3.44 \u002F 88.0 | 4.08 \u002F 88.3 | 3.61 \u002F 88.3 |\n| French | 4.10 \u002F 62.8 | 5.22 \u002F 53.5 | **3.05** \u002F 69.8 | 4.53 \u002F 73.5 | 3.82 \u002F 78.2 | 3.56 \u002F **78.6** | 3.26 \u002F 78.5 |\n| German | 1.91 \u002F 73.3 | 0.57 \u002F 61.4 | **0.55** \u002F 76.7 | 0.68 \u002F 80.3 | 1.03 \u002F 79.5 | 1.70 \u002F **80.6** | 0.91 \u002F 79.5 |\n| Greek | 2.02 \u002F 82.6 | **0.99** \u002F 73.3 | 5.74 \u002F 79.5 | 2.84 \u002F 86.0 | 2.97 \u002F **87.6** | 3.00 \u002F **87.6** | 3.19 \u002F 87.3 |\n| Hindi | 6.96 \u002F 81.8 | **5.83** \u002F 73.0 | 14.64 \u002F 82.1 | 19.70 \u002F **85.6** | 14.32 \u002F 84.5 | 14.24 \u002F 84.7 | 14.75 \u002F 84.8 |\n| Indonesian | 1.24 \u002F 72.9 | **1.06** \u002F 66.0 | 1.46 \u002F 76.3 | 1.08 \u002F 80.0 | 2.71 \u002F 80.8 | 2.96 \u002F 80.8 | 3.91 \u002F **81.2** |\n| Italian | 1.54 \u002F 69.9 | 1.74 \u002F 57.9 | **1.27** \u002F 74.7 | 1.56 \u002F 78.0 | 3.16 \u002F 84.5 | 3.12 \u002F **84.7** | 2.16 \u002F 84.3 |\n| Japanese | 3.52 \u002F 77.6 | 10.65 \u002F 73.8 | **2.76** \u002F 79.6 | 4.63 \u002F 82.8 | 7.16 \u002F 83.1 | 5.28 \u002F **83.7** | 5.17 \u002F 83.1 |\n| Korean | 1.75 \u002F 77.6 | 1.87 \u002F 70.0 | **1.18** \u002F 81.7 | 1.96 \u002F 83.3 | 5.30 \u002F 84.3 | 5.66 \u002F 83.6 | 3.93 \u002F **84.9** |\n| Polish | 1.42 \u002F 80.2 | **0.77** \u002F 72.9 | 1.26 \u002F 81.9 | 1.14 \u002F **88.4** | 2.72 \u002F 87.3 | 3.59 \u002F 87.8 | 3.42 \u002F 87.5 |\n| Portuguese | 1.88 \u002F 80.5 | 1.33 
\u002F 71.1 | **1.14** \u002F 78.1 | 1.94 \u002F 83.7 | 1.64 \u002F 83.1 | 2.00 \u002F **84.3** | 2.40 \u002F 83.1 |\n| Romanian | 2.88 \u002F 80.9 | **1.35** \u002F 69.9 | 10.74 \u002F 73.3 | 21.58 \u002F 79.7 | 3.36 \u002F 86.2 | 3.87 \u002F **87.1** | 3.38 \u002F 86.1 |\n| Russian | 4.28 \u002F 76.1 | 3.88 \u002F 67.6 | **2.40** \u002F 79.0 | 3.63 \u002F 81.1 | 3.64 \u002F 83.0 | 4.28 \u002F **83.2** | 4.42 \u002F **83.2** |\n| Spanish | 1.03 \u002F 76.2 | 1.08 \u002F 61.5 | 0.91 \u002F 77.6 | 1.44 \u002F 83.1 | 0.96 \u002F 83.9 | 1.27 \u002F **84.0** | **0.80** \u002F **84.0** |\n| Thai | **2.70** \u002F 80.0 | 73.94 \u002F 58.8 | 4.23 \u002F 78.6 | 2.96 \u002F 84.0 | 7.45 \u002F 83.8 | 7.86 \u002F 83.9 | 8.03 \u002F **84.2** |\n| Turkish | 1.52 \u002F 77.9 | **0.70** \u002F 59.6 | 0.87 \u002F 83.5 | 0.82 \u002F 87.1 | 5.45 \u002F **87.4** | 4.96 \u002F 87.3 | 6.20 \u002F 86.8 |\n| Ukrainian | 1.08 \u002F 73.0 | **1.00** \u002F 64.7 | 2.30 \u002F 74.7 | 6.32 \u002F 79.8 | 1.61 \u002F 80.5 | 1.27 \u002F **81.2** | 1.66 \u002F 80.0 |\n| Vietnamese | **0.88** \u002F 74.3 | 73.42 \u002F 36.9 | 7.41 \u002F 74.0 | 3.31 \u002F 80.6 | 3.85 \u002F 80.7 | 3.89 \u002F **81.6** | 5.43 \u002F 80.5 |\n| **Average** | **2.8** \u002F 76.6 | 7.5 \u002F 65.5 | 3.7 \u002F 78.0 | 5.7 \u002F 82.3 | 6.6 \u002F 83.5 | 6.8 \u002F **83.9** | 6.8 \u002F 83.5 |\n\n\u003C\u002Fdetails>\n\n\u003Csub>*Cantonese WER reflects an ASR-faithfulness floor common to all systems; SIM remains comparable.\u003C\u002Fsub>\n\n### CV3-Eval\n\nHard-subset Chinese\u002FEnglish plus a cross-lingual voice-cloning split. 
**Takes the table top on hard-en (MF$_4$ at 4.37) and leads both cross-lingual SIM subsets (SCA at 75.0 \u002F 72.8)**, with the post-trained variants bracketing the prior leader on the hardest English subset.\n\n| Model | zh W↓ | en W↓ | hard-zh W↓ | hard-en W↓ | en→zh W↓ \u002F S↑ | zh→en W↓ \u002F S↑ |\n|---|:---:|:---:|:---:|:---:|:---:|:---:|\n| CosyVoice 2 | 4.08 | 6.32 | 12.58 | 11.96 | 13.50 \u002F 63.3 | 6.47 \u002F 64.3 |\n| CosyVoice 3 (1.5B) | 3.91 | 4.99 | 9.77 | 10.55 | **8.01** \u002F 66.9 | **4.32** \u002F 66.4 |\n| Fish-Audio S2 | **2.65** | **2.43** | 9.10 | 4.40 | — | — |\n| VoxCPM 2 | 3.65 | 5.00 | **8.55** | 8.48 | — | — |\n| **dots.tts (Pretrain)** | 3.51 | 5.24 | 9.69 | 5.99 | 10.88 \u002F 74.6 | 4.97 \u002F 71.9 |\n| **dots.tts (SCA)** | 3.71 | 4.50 | 9.22 | 4.49 | 10.75 \u002F **75.0** | 5.66 \u002F **72.8** |\n| **dots.tts (MF, NFE=4)** | 3.95 | 4.05 | 9.10 | **4.37** | 10.73 \u002F 73.8 | 5.24 \u002F 70.9 |\n\n### EmergentTTS-Eval\n\nWin-rate judged head-to-head against `gpt-4o-mini-tts` by Gemini-2.5-Pro-0506 across six expressiveness-oriented scenarios. **SCA takes the top Syntactic Complexity score in the table (65.7%) — above every closed-source system** — and Pretrain posts the **best Emotions score among open-source systems (72.7%)**.\n\n| Model | Voice | WER↓ | Overall↑ | Emotions↑ | Paraling.↑ | Foreign↑ | C. 
Pron.↑ | Quest.↑ | Syntax↑ |\n|---|---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|\n| Gemini-2.5-Flash-TTS\\* | Zephyr | 10.39 | **70.7%** | **95.9%** | **91.3%** | 58.5% | 55.7% | **63.0%** | 57.9% |\n| Gemini-2.5-Pro-TTS\\* | Zephyr | 11.79 | 69.3% | 86.9% | 82.3% | 58.2% | **64.8%** | 61.3% | 61.8% |\n| gpt-4o-audio-preview\\* | Ballad | 11.87 | 65.2% | 88.8% | 82.1% | **60.2%** | 40.4% | 57.0% | 59.5% |\n| gpt-4o-mini-tts\\* | Alloy | 10.76 | 56.3% | 59.2% | 58.8% | 57.3% | 52.4% | 52.7% | 57.1% |\n| *baseline: gpt-4o-mini-tts* | Alloy | 10.61 | 50.0% | — | — | — | — | — | — |\n| **dots.tts (Pretrain)** | basic\\_ref\\_en | 10.86 | 49.2% | 72.7% | 54.7% | 39.5% | 18.0% | 48.4% | 58.4% |\n| **dots.tts (MF4)** | basic\\_ref\\_en | 11.75 | 47.9% | 59.8% | 55.2% | 36.3% | 16.7% | 50.5% | 64.8% |\n| **dots.tts (SCA)** | basic\\_ref\\_en | 10.45 | 47.6% | 63.9% | 52.7% | 39.4% | 16.4% | 47.0% | **65.7%** |\n| Qwen3-TTS | basic\\_ref\\_en | 17.32 | 42.8% | 39.8% | 50.7% | 25.4% | 30.0% | 48.9% | 60.4% |\n| HumeAI\\* | — | 12.85 | 42.7% | 61.6% | 36.9% | 34.6% | 34.3% | 43.2% | 44.6% |\n| Qwen3-TTS | Ryan | 19.65 | 42.3% | 60.5% | 62.7% | 17.1% | 9.8% | 56.4% | 43.0% |\n| VoxCPM 2 | basic\\_ref\\_en | 11.84 | 41.1% | 42.3% | 44.1% | 33.3% | 18.6% | 53.4% | 52.3% |\n| MiniMax\u002Fspeech-02-hd\\* | EN-narr | **10.02** | 36.6% | 40.9% | 34.3% | 34.3% | 16.3% | 47.3% | 43.9% |\n| 11Labs Multilingual v2\\* | Brian | 11.19 | 33.9% | 30.4% | 45.5% | 35.5% | 14.5% | 39.5% | 35.5% |\n| F5-TTS | basic\\_ref\\_en | 16.47 | 15.3% | 26.8% | 21.6% | 1.8% | 1.4% | 14.8% | 23.8% |\n\n\u003Csub>\\* Closed-source \u002F commercial. 
Table shows a selected subset for brevity — for the full leaderboard, see [EmergentTTS-Eval-public](https:\u002F\u002Fgithub.com\u002Fboson-ai\u002FEmergentTTS-Eval-public\u002Fblob\u002Fmain\u002FLEADERBOARD_gemini-2.5-pro-05-06.md).\u003C\u002Fsub>\n\n---\n\n## ⚠️ Risks and Limitations\n\n- **Misuse risk.** High-fidelity zero-shot voice cloning can produce highly realistic synthetic speech. The released checkpoints are intended for research and authorized deployment. Do **not** use dots.tts for impersonation, fraud, or disinformation. Combine downstream use with consent-aware reference-audio policies, robust synthetic-speech detection, and content watermarking. Clearly mark AI-generated audio.\n- **Low-resource WER gap.** A BPE backbone inherits the text LLM's language coverage at the cost of a higher data appetite. On script-divergent and under-represented languages (Arabic, Hindi, Turkish, Vietnamese) the WER gap visible on the MiniMax benchmark reflects this, and the same long tail surfaces on the Foreign Words and Complex Pronunciation scenarios of EmergentTTS-Eval. Speaker similarity is preserved across these languages.\n- **Speech-heavy training.** Although the AudioVAE is trained at 48 kHz and is modality-agnostic in principle, the backbone is trained on a speech-heavy mixture. 
Singing and unified speech + sound generation are not covered in this release.\n\n---\n\n## 📖 Citation\n\nIf you find dots.tts useful, please consider citing the technical report and starring the repository.\n\n```bibtex\n@article{dotstts2026,\n  title   = {dots.tts Technical Report},\n  author  = {dots.tts Team},\n  journal = {arXiv preprint},\n  year    = {2026},\n}\n```\n\n## 📄 License\n\ndots.tts code and released checkpoints are licensed under [Apache-2.0](LICENSE).\n\n## 🙏 Acknowledgments\n\n- [Qwen2.5](https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen2.5) — for LLM backbone initialization.\n- [DiTAR](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.03930) and [ARDiT](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.05551) — for the continuous-AR + per-patch diffusion design.\n- [BigVGAN](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FBigVGAN) — for the vocoder design.\n- [CAM++](https:\u002F\u002Fgithub.com\u002Falibaba-damo-academy\u002F3D-Speaker) — for the speaker x-vector encoder.\n",2,"2026-06-11 04:11:07","CREATED_QUERY"]