[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-79355":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":14,"stars7d":16,"stars30d":17,"stars90d":15,"forks30d":15,"starsTrendScore":18,"compositeScore":19,"rankGlobal":9,"rankLanguage":9,"license":20,"archived":21,"fork":21,"defaultBranch":22,"hasWiki":21,"hasPages":21,"topics":23,"createdAt":9,"pushedAt":9,"updatedAt":24,"readmeContent":25,"aiSummary":26,"trendingCount":15,"starSnapshotCount":15,"syncStatus":13,"lastSyncTime":27,"discoverSource":28},79355,"Confucius4-TTS","netease-youdao\u002FConfucius4-TTS","netease-youdao","Confucius4-TTS: a Multilingual and Cross-Lingual Zero-Shot TTS Engine",null,"Python",145,13,2,1,0,6,32,4,51.64,"Apache License 2.0",false,"main",[],"2026-06-11 04:06:54","\u003Cdiv align=\"center\">\n    \u003Cimg src=\".\u002Fresources\u002FConfucius4-TTS.png\" alt=\"Confucius4-TTS\" width=\"35%\">\n    \u003Ch1>Confucius4-TTS: a Multilingual and Cross-Lingual Zero-Shot TTS Engine\u003C\u002Fh1>\n    \u003Cp>\u003Cb>One voice. 
Any language.\u003C\u002Fb>\u003C\u002Fp>\n\u003C\u002Fdiv>\n\n\u003Cdiv align=\"center\">\n    \u003Ca href=\".\u002FREADME.zh.md\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FREADME-中文版本-red\">\u003C\u002Fa>\n    &nbsp;&nbsp;&nbsp;&nbsp;\n    \u003Ca href=\".\u002FLICENSE\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Flicense-Apache--2.0-yellow\">\u003C\u002Fa>\n    &nbsp;&nbsp;&nbsp;&nbsp;\n    \u003Ca href=\"https:\u002F\u002Fconfucius4-tts.youdao.com\u002Fgradio\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-在线体验-purple\">\u003C\u002Fa>\n    &nbsp;&nbsp;&nbsp;&nbsp;\n    \u003Ca href=\"https:\u002F\u002F2901733926.github.io\u002FConfucius4-TTS\u002F\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FGitHub.io-Demo_Page-blue?logo=GitHub&style=flat-square\">\u003C\u002Fa>\n    &nbsp;&nbsp;&nbsp;&nbsp;\n\u003C\u002Fdiv>\n\u003Cbr>\n\n📢 **Note:** Code and model weights are currently under preparation and will be released soon. In the meantime, you can try our online demo at [https:\u002F\u002Fconfucius4-tts.youdao.com\u002Fgradio](https:\u002F\u002Fconfucius4-tts.youdao.com\u002Fgradio).\n\nConfucius4-TTS is an advanced LLM-based text-to-speech (TTS) system designed for multilingual and cross-lingual speech synthesis. 
Built on a speech encoder + large language model (LLM) architecture, Confucius4-TTS enables high-quality speech generation while preserving speaker identity across languages.\n\n**✨ Key Features**\n\n- **14 Languages Supported**: Chinese, English, Japanese, Korean, German, French, Spanish, Indonesian, Italian, Thai, Portuguese, Russian, Malay and Vietnamese *(more coming soon)*\n- **Unconstrained Voice Cloning**: No reference transcript required\n- **Cross-Lingual Voice Transfer**: Unaccented speech synthesis across 14 languages\n- **Zero-Shot Voice Transfer**: Clone voices without additional training\n- **Seamless Emotion Transfer**: Clone the feeling, not just the voice\n- **Robust Generalization**: Stable performance in real-world multilingual scenarios\n\nWith strong cross-lingual generalization, Confucius4-TTS allows users to seamlessly switch languages while keeping the same voice, delivering fluent, natural, and expressive speech.\n\n## Contents\n\n- [Performance](#-performance)\n- [Citation](#citation)\n\n## 📊 Performance\n\nConfucius4-TTS achieves competitive results on multilingual and cross-lingual zero-shot TTS benchmarks, with strong intelligibility and speaker similarity across multiple languages.\n\n> Lower is better for WER\u002FCER (↓), and higher is better for SIM (↑).\n\n### CV3-eval Cross-lingual\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>CV3-eval Cross-lingual Results (click to expand)\u003C\u002Fb>\u003C\u002Fsummary>\n\n| Direction | Metric | Confucius4-TTS | F5-TTS† | Spark-TTS | CosyVoice2† | CosyVoice3-0.5B† | CosyVoice3-0.5B + DiffRO† | CosyVoice3-1.5B† | CosyVoice3-1.5B + DiffRO† |\n|---|---|---:|---:|---:|---:|---:|---:|---:|---:|\n| en→zh | WER↓ | **6.71** | 11.60 | 12.40 | 13.50 | 8.48 | 5.16 | 8.01 | 5.09 |\n| ja→zh | WER↓ | 4.93 | – | – | 48.10 | 6.86 | 3.22 | 6.78 | **3.05** |\n| ko→zh | WER↓ | 1.46 | – | – | 7.70 | 5.24 | **1.03** | 3.30 | 1.06 |\n| zh→en | WER↓ | **3.19** | 5.57 | 7.36 | 17.10 | 6.83 | 4.41 | 5.39 | 4.20 |\n| ja→en 
| WER↓ | **3.44** | – | – | 11.20 | 5.86 | 4.78 | 5.94 | 4.19 |\n| ko→en | WER↓ | **3.42** | – | – | 13.10 | 18.30 | 7.91 | 13.70 | 7.08 |\n\n† Requires reference text.\n\n\u003C\u002Fdetails>\n\n### X-Voice Benchmark\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>X-Voice Cross-lingual Results (click to expand)\u003C\u002Fb>\u003C\u002Fsummary>\n\n| Direction | Metric | Confucius4-TTS | X-Voice | OmniVoice† | IndexTTS2 |\n|---|---|---:|---:|---:|---:|\n| de→zh | WER↓ | **2.86** | 3.07 | 13.10 | 3.46 |\n|  | SIM↑ | 0.569 | 0.516 | **0.691** | 0.544 |\n| en→zh | WER↓ | 3.27 | **3.06** | 4.03 | 3.78 |\n|  | SIM↑ | 0.504 | 0.443 | **0.544** | 0.485 |\n| fr→zh | WER↓ | **2.74** | 3.01 | 18.10 | 3.53 |\n|  | SIM↑ | 0.550 | 0.518 | **0.686** | 0.543 |\n| ja→zh | WER↓ | 3.50 | **3.39** | 79.10 | 4.11 |\n|  | SIM↑ | 0.637 | 0.629 | **0.709** | 0.650 |\n| ko→zh | WER↓ | **2.86** | 3.13 | 11.88 | 2.90 |\n|  | SIM↑ | 0.649 | 0.655 | **0.718** | 0.650 |\n| th→zh | WER↓ | 2.87 | **2.79** | 3.30 | 3.08 |\n|  | SIM↑ | 0.623 | 0.614 | **0.661** | 0.622 |\n| vi→zh | WER↓ | **2.75** | 2.78 | 10.51 | 2.98 |\n|  | SIM↑ | 0.640 | 0.641 | **0.701** | 0.641 |\n\n† Requires reference text.\n\n\u003C\u002Fdetails>\n\n### Seed-TTS-eval\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>Seed-TTS-eval English & Chinese Zero-shot Results (click to expand)\u003C\u002Fb>\u003C\u002Fsummary>\n\n| Language | Metric | Confucius4-TTS | Qwen3-TTS | FishAudio S2† | OmniVoice† | VoxCPM2† | X-Voice |\n|---|---|---:|---:|---:|---:|---:|---:|\n| English | WER↓ | 1.47 | 1.24 | **0.99** | 1.60 | 1.84 | 1.91 |\n|  | SIM↑ | 0.702 | 0.714 | – | 0.741 | **0.753** | 0.627 |\n| Chinese | CER↓ | 1.09 | 0.77 | **0.54** | 0.84 | 0.97 | 1.47 |\n|  | SIM↑ | 0.749 | 0.770 | – | 0.777 | **0.795** | 0.746 |\n\n† Requires reference text.\n\n\u003C\u002Fdetails>\n\n### MiniMax-Multilingual-Test\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>MiniMax-Multilingual-Test Results (click to expand)\u003C\u002Fb>\u003C\u002Fsummary>\n\n| Language | 
Metric | Confucius4-TTS | ElevenLab | Qwen3-TTS | FishAudio S2† | OmniVoice† | VoxCPM2† | X-Voice |\n|---|---|---:|---:|---:|---:|---:|---:|---:|\n| German | WER↓ | 0.89 | 0.57 | 1.24 | **0.55** | 0.96 | 0.68 | 2.00 |\n|  | SIM↑ | 0.767 | 0.614 | 0.768 | 0.767 | **0.812** | 0.803 | 0.763 |\n| French | WER↓ | 3.81 | 5.22 | **2.86** | 3.05 | 3.35 | 4.53 | 4.73 |\n|  | SIM↑ | 0.697 | 0.535 | 0.716 | 0.698 | **0.801** | 0.735 | 0.746 |\n| Indonesian | WER↓ | 1.79 | 1.06 | – | 1.46 | 1.97 | **1.08** | 1.47 |\n|  | SIM↑ | 0.754 | 0.660 | – | 0.763 | **0.805** | 0.800 | 0.725 |\n| Korean | WER↓ | 2.20 | 1.87 | 1.76 | **1.18** | 2.65 | 1.96 | 2.27 |\n|  | SIM↑ | 0.790 | 0.700 | 0.790 | 0.817 | 0.828 | **0.833** | 0.788 |\n| Thai | WER↓ | **2.42** | 73.94 | – | 4.23 | 3.98 | 2.96 | 4.71 |\n|  | SIM↑ | 0.736 | 0.588 | – | 0.786 | **0.841** | 0.840 | 0.791 |\n| Japanese | WER↓ | 4.26 | 10.65 | 3.82 | **2.76** | 4.03 | 4.63 | 7.13 |\n|  | SIM↑ | 0.775 | 0.738 | 0.771 | 0.796 | **0.828** | **0.828** | 0.765 |\n| Vietnamese | WER↓ | 1.99 | 73.42 | – | 7.41 | **1.37** | 3.31 | 1.40 |\n|  | SIM↑ | 0.764 | 0.369 | – | 0.740 | **0.805** | 0.806 | 0.672 |\n| Italian | WER↓ | 1.58 | 1.74 | 0.95 | 1.27 | 2.07 | 1.56 | 2.27 |\n|  | SIM↑ | 0.764 | 0.579 | 0.752 | 0.747 | **0.812** | 0.780 | 0.780 |\n| Portuguese | WER↓ | 2.04 | 1.33 | 1.53 | **1.14** | 2.51 | 1.94 | 2.61 |\n|  | SIM↑ | 0.794 | 0.711 | 0.805 | 0.781 | **0.859** | 0.837 | 0.794 |\n| Spanish | WER↓ | **0.95** | 1.08 | 1.13 | 0.91 | 1.03 | 1.44 | 2.91 |\n|  | SIM↑ | 0.770 | 0.615 | 0.814 | 0.776 | 0.804 | **0.831** | 0.747 |\n| Russian | WER↓ | 4.38 | 3.88 | 3.21 | 2.40 | **2.23** | 3.63 | 6.49 |\n|  | SIM↑ | 0.790 | 0.675 | 0.784 | 0.790 | 0.783 | **0.811** | 0.799 |\n\n† Requires reference text.\n\n\u003C\u002Fdetails>\n\n---\n\n## Citation\n\nIf you find Confucius4-TTS useful in your research or project, please consider citing:\n\n```bibtex\n@misc{confucius4tts_2026,\n  title        = {Confucius4-TTS: A Multilingual and 
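Cross-Lingual Zero-Shot TTS Engine},\n  author       = {{NetEase Youdao}},\n  year         = {2026},\n  howpublished = {\\url{https:\u002F\u002Fgithub.com\u002Fnetease-youdao\u002FConfucius4-TTS}},\n  note         = {GitHub repository}\n}\n```\n","Confucius4-TTS is a multilingual and cross-lingual zero-shot text-to-speech system based on a large language model. It uses a speech encoder plus large language model architecture and supports 14 languages, including Chinese, English, and Japanese, with unconstrained voice cloning, cross-lingual voice transfer, zero-shot voice transfer, and seamless emotion transfer. The system produces high-quality multilingual speech synthesis while keeping speaker identity consistent. It suits applications that need to switch seamlessly between languages while keeping the same voice, such as global multimedia content production and cross-cultural communication tools.","2026-06-11 03:57:44","CREATED_QUERY"]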