[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-82274":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":13,"stars7d":17,"stars30d":17,"stars90d":16,"forks30d":16,"starsTrendScore":18,"compositeScore":19,"rankGlobal":10,"rankLanguage":10,"license":20,"archived":21,"fork":21,"defaultBranch":22,"hasWiki":21,"hasPages":23,"topics":24,"createdAt":10,"pushedAt":10,"updatedAt":25,"readmeContent":26,"aiSummary":27,"trendingCount":16,"starSnapshotCount":16,"syncStatus":28,"lastSyncTime":29,"discoverSource":30},82274,"tts-bench","5uck1ess\u002Ftts-bench","5uck1ess","Speed and samples benchmark: for all types of text to speech (TTS) models on Windows\u002FLinux\u002FMac.","",null,"Python",74,3,29,1,0,44,13,1.81,"Other",false,"master",true,[],"2026-06-12 02:04:24","\u003Cp align=\"center\">\n  \u003Cpicture>\n    \u003Csource media=\"(prefers-color-scheme: dark)\" srcset=\"assets\u002Flogo-flat-dark.svg\">\n    \u003Csource media=\"(prefers-color-scheme: light)\" srcset=\"assets\u002Flogo-flat-light.svg\">\n    \u003Cimg alt=\"tts-bench\" src=\"assets\u002Flogo-flat-dark.svg\" width=\"520\">\n  \u003C\u002Fpicture>\n\u003C\u002Fp>\n\nBench for local TTS models. Two lenses, on whatever hardware you put it on:\n\n- **Speed** — cold + warm **TTFA** (time to first audio), **RTF** (real-time factor; higher = faster than realtime), memory, on CPU \u002F CUDA \u002F Apple Silicon\n- **Listen** — every model on every prompt, default voice + voice cloning, with inline audio players, so you can pick a model by ear\n\nAn objective quality score (NAQ) was prototyped and is currently paused for redesign — the v2 features didn't track subjective ranking closely enough to publish. 
The bench still computes it into the CSV; the HTML report omits it until a refit lands.\n\n---\n\n## ▶ Demos\n\n**[5uck1ess.github.io\u002Ftts-bench](https:\u002F\u002F5uck1ess.github.io\u002Ftts-bench\u002F)** — listen to every model, no install. Two lenses:\n\n- **Listen** — one consolidated gallery with an inline `\u003Caudio>` player for every model on every prompt, in **default voice** and **voice cloning** (each clone sits next to the reference it's imitating). Browse **by prompt** (compare all models on one sentence) or **by model** (audition one model across prompts); only one clip plays at a time. Audio is rig-independent, so each sample is sourced once from the highest-fidelity rig and tagged with where it came from. Quality, prosody, and artifacts are obvious in 5 seconds — benchmark tables can't show that.\n- **Speed** — per-rig leaderboards (Ryzen 9 9950X3D + RTX 5090, Apple M4, Ryzen + RTX 3090) with cold\u002Fwarm TTFA, RTF, and memory, sortable. Pick the box you actually own.\n\nFull per-rig reports (every model × prompt × device, plus by-prompt samples) are linked from the **Archive**.\n\n---\n\n## Quick start\n\nRequires [`uv`](https:\u002F\u002Fgithub.com\u002Fastral-sh\u002Fuv) and Python 3.11. ~10-15 min install. Disk for the full set is large: **~39 GB of per-model venvs** in the repo, plus **~125 GB of model weights** downloaded to your Hugging Face cache (`~\u002F.cache\u002Fhuggingface`, **not** the repo) — **~165 GB all-in**. Individual models are far smaller, so installing a subset costs a fraction of that.\n\n```powershell\n# Windows\n.\\install.ps1\npython bench.py\n```\n\n```bash\n# macOS \u002F Linux\n.\u002Finstall.sh\npython bench.py\n```\n\nInteractive feel-test: `python speak.py kokoro`. One-shot A\u002FB comparison: `python compare.py \"your phrase\"`. 
See [docs\u002Farchitecture.md](docs\u002Farchitecture.md) for the runner protocol and how to add a model.\n\n---\n\n## TLDR (May 2026)\n\n**Fastest:**\n- CPU (Ryzen 9 9950X3D, Windows): **Piper** — 39ms warm TTFA, 47× RTF\n- CUDA (RTX 5090): **Kokoro** — 69ms warm TTFA, 101× RTF\n- CPU + MPS (Apple M4, 16 GB): **Piper** — 202ms warm TTFA, 33× RTF\n\n**Best sounding:** *No objective ranking right now — the NAQ score is paused pending redesign. Open the [Demos site](https:\u002F\u002F5uck1ess.github.io\u002Ftts-bench\u002F) and use the Listen lens.*\n\n**Best cloning (subjective rank):**\n- 1. **OmniVoice** — accent preserved, top of listening test\n- 2. **ChatterBox** — strong second, clean output\n- 3. **IndexTTS-2** — also good, accent preserved\n\n[→ full per-rig results](docs\u002Fresults.md) · [→ full cloning ranking](docs\u002Fcloning.md)\n\n---\n\n## Models tracked (37)\n\n| Model | Params | Predefined | Cloning | Multilingual | SR | License |\n|---|---|---|---|---|---|---|\n| Piper | ~15M | ✓ | — | ✓ | 22.05k | MIT |\n| Kokoro | 82M | ✓ | — | ✓ | 24k | Apache 2.0 |\n| KittenTTS | \u003C100M | ✓ | — | — | 24k | Apache 2.0 |\n| Magpie-TTS | 357M | ✓ | — | ✓ (9) | 22.05k | NVIDIA OML |\n| VibeVoice Realtime 0.5B | 0.5B | ✓ | — | — | 24k | MIT |\n| VibeVoice 1.5B | 1.5B | — | ✓ | — | 24k | MIT |\n| Supertonic | 99M | ✓ | — | ✓ (31) | 24k | MIT + OpenRAIL-M |\n| LuxTTS | ~123M | ✓ | — | — | 22.05k | MIT |\n| Soprano 80M | 80M | ✓ | — | — | 32k | Apache 2.0 |\n| Pocket-TTS | 100M | — | ✓ | — | 24k | Apache 2.0 |\n| ChatterBox | 1.2B | — | ✓ | — | 24k | MIT |\n| ChatterBox Turbo | 744M | — | ✓ | — | 24k | MIT |\n| F5-TTS | 330M | — | ✓ | ✓ | 24k | CC-BY-NC |\n| IndexTTS-2 | 1.5B | — | ✓ | ✓ | 24k | Apache 2.0 |\n| OmniVoice | ~1B | — | ✓ | ✓ (600+) | 24k | Apache 2.0 |\n| ZipVoice | 123M | — | ✓ | ✓ (zh+en) | 24k | Apache 2.0 |\n| VoxCPM2 | 2B | — | ✓ | ✓ (30) | **48k** | Apache 2.0 |\n| Sesame CSM-1B | 1B | — | ✓ | — | 24k | Apache 2.0 |\n| Coqui XTTS-v2 | 750M | 
— | ✓ | ✓ (17) | 24k | CPML (non-commercial) |\n| Qwen3-TTS Base | 1.7B | — | ✓ | ✓ | 24k | Apache 2.0 |\n| Qwen3-TTS 1.7B (CUDA-graph) | 1.7B | — | ✓ | ✓ | 24k | MIT |\n| Mars5-TTS | 1.2B | — | ✓ | — | 24k | AGPL-3.0 |\n| NeuTTS Air | 748M | — | ✓ | — | 24k | Apache 2.0 |\n| NeuTTS Nano | 748M | — | ✓ | — | 24k | Apache 2.0 |\n| Dia 1.6B | 1.6B | — | ✓ | — | 44.1k | Apache 2.0 |\n| MOSS-TTS-Nano | 100M | — | ✓ | ✓ (zh+en) | **48k** | Apache 2.0 |\n| MOSS-TTS | 8B (Qwen3) | — | ✓ | ✓ (20) | 24k | Apache 2.0 |\n| Maya1 | 3B | ✓ (voice desc) | — | — | 24k | Apache 2.0 |\n| Voxtral 4B TTS | 4B | ✓ (20) | ✓ | ✓ | 24k | CC-BY-NC 4.0 |\n| Fish Speech 1.5 | ~500M | — | ✓ | ✓ | **44.1k** | CC-BY-NC-SA 4.0 |\n| Fish Speech S2-Pro | 4B | — | ✓ | — | **44.1k** | Research (non-commercial) |\n| Zonos v0.1 | 1.6B | — | ✓ | ✓ | **44.1k** | Apache 2.0 |\n| OpenVoice v2 | ~100M | — | ✓ | ✓ | 22.05k | MIT |\n| StyleTTS 2 | ~148M | — | ✓ | — | 24k | MIT |\n| VibeVoice 7B | 7B | — | ✓ | — | 24k | MIT |\n| MetaVoice-1B | 1.2B | — | ✓ | — | **48k** | Apache 2.0 |\n| Step-Audio-EditX | 3B | — | ✓ | — | 24k | Apache 2.0 |\n\nFull per-model gotchas + license details: **[docs\u002Fknown-issues.md](docs\u002Fknown-issues.md)**. Models considered but excluded: **[docs\u002Fconsidered.md](docs\u002Fconsidered.md)**.\n\n> **Predefined vs Cloning.** *Predefined* models have fixed\u002Fselectable speaker voices baked into the weights — they speak with no reference needed. *Cloning* (zero-shot) models have **no voice of their own**: they synthesize whatever voice you hand them as a reference clip at inference. Given no reference, a pure zero-shot model falls back to a bundled sample (this bench uses `chris_hemsworth_15s.wav`), so its \"default voice\" is just a clone of that clip. A few models do both (e.g. 
Voxtral has 20 presets *and* cloning).\n\n> Rig availability: Voxtral is Mac (MLX, preset-voice only) + Linux (vLLM, cloning); Fish S2-Pro \u002F MetaVoice \u002F Step-Audio-EditX are Linux-only (CUDA). The rest run on Windows + Linux CUDA, most on CPU\u002FMPS too. Per-rig speed + samples on the [Demos site](https:\u002F\u002F5uck1ess.github.io\u002Ftts-bench\u002F).\n\n---\n\n## Voice cloning\n\n**28 of the 37 tracked models can clone** a voice from a reference clip. Three reference formats supported (wav only \u002F wav + transcript \u002F HF-gated wav). Drop a reference into `reference\u002F`, then `python bench.py --reference reference\u002Fmyvoice.wav`.\n\nReference-format docs + the subjective cloning ranking (10 models ranked by ear so far): **[docs\u002Fcloning.md](docs\u002Fcloning.md)**.\n\n---\n\n## Test hardware\n\n| Machine | Used for |\n|---|---|\n| Windows desktop (Ryzen 9 9950X3D \u002F 128 GB \u002F RTX 5090 32 GB) | Windows CPU + CUDA bench rows |\n| Linux workstation (Ryzen 9 5900XT \u002F 64 GB \u002F RTX 3090 24 GB, Ubuntu Server 24.04) | Linux CPU + CUDA; the only rig that runs Fish-Speech S2 natively |\n| Mac (Apple M4 \u002F 16 GB \u002F M4 GPU) | Mac CPU + MPS bench rows |\n\nIf you reproduce on different hardware, file an issue or PR with your results and we'll add a column.\n\n---\n\n## Docs\n\n- [Full results tables](docs\u002Fresults.md) — per-rig, per-prompt, per-model\n- [Cloning ranking](docs\u002Fcloning.md) — reference formats + subjective ranking (10 of 28 cloning models ranked so far)\n- [Architecture](docs\u002Farchitecture.md) — bench design, runner protocol, adding a model\n- [Known issues](docs\u002Fknown-issues.md) — per-model gotchas + per-license table\n- [Considered but skipped](docs\u002Fconsidered.md) — models evaluated and excluded\n- [Tasks & pending work](docs\u002Ftasks.md) — open issues, planned features\n- [Methodology](docs\u002Fmethodology.md) — what's measured, why cold + warm, why reproducible\n\n---\n\n## 
License\n\nMIT for the bench code in this repo. **Each TTS model has its own license** — see [docs\u002Fknown-issues.md](docs\u002Fknown-issues.md) for the full per-model table.\n\n---\n\n## Support\n\nIf this bench saved you a weekend of writing your own:\n\n\u003Ca href=\"https:\u002F\u002Fko-fi.com\u002F5uck1ess\" target=\"_blank\">\u003Cimg src=\"https:\u002F\u002Fstorage.ko-fi.com\u002Fcdn\u002Fkofi2.png?v=3\" alt=\"Buy me a coffee at ko-fi.com\" height=\"50\" \u002F>\u003C\u002Fa>\n","tts-bench is a tool for evaluating the performance and audio quality of text-to-speech (TTS) models across different hardware platforms. Written in Python, the project benchmarks TTS models on two fronts: speed (cold-start and warm-start time, real-time factor, and so on) and listening quality (auditioning each model's generated samples through inline audio players). It runs on Windows, Linux, and macOS and can use CPU or GPU acceleration (such as CUDA or Apple Silicon). It suits researchers and developers who need to compare the real-world output and performance of different TTS solutions, and serves as a reference when choosing the best TTS model for a given use case.",2,"2026-06-11 04:08:14","CREATED_QUERY"]