[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-82306":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":8,"htmlUrl":8,"language":9,"languages":8,"totalLinesOfCode":8,"stars":10,"forks":11,"watchers":12,"openIssues":13,"contributorsCount":14,"subscribersCount":14,"size":14,"stars1d":12,"stars7d":15,"stars30d":16,"stars90d":14,"forks30d":14,"starsTrendScore":17,"compositeScore":18,"rankGlobal":8,"rankLanguage":8,"license":19,"archived":20,"fork":20,"defaultBranch":21,"hasWiki":22,"hasPages":20,"topics":23,"createdAt":8,"pushedAt":8,"updatedAt":24,"readmeContent":25,"aiSummary":26,"trendingCount":14,"starSnapshotCount":14,"syncStatus":12,"lastSyncTime":27,"discoverSource":28},82306,"PilotTTS","AMAPVOICE\u002FPilotTTS","AMAPVOICE",null,"Python",167,16,2,8,0,46,129,15,3.69,"Apache License 2.0",false,"master",true,[],"2026-06-12 02:04:25","# PilotTTS: A Disciplined Modular Recipe for Competitive Speech Synthesis\n\n\u003Cdiv align=\"center\">\n\u003Cimg src=\"assert\u002FIntroduction.png\" width=\"600\" \u002F>\n\u003C\u002Fdiv>\n\n\u003Cp align=\"center\">\n    English &nbsp;|&nbsp; \u003Ca href=\"README_zh.md\">中文\u003C\u002Fa>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n    📑 \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.27258\">Paper\u003C\u002Fa> &nbsp;|&nbsp; 🤗 \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FAmapVoice\u002FPilotTTS\">HuggingFace\u003C\u002Fa> &nbsp;|&nbsp; 🤖 \u003Ca href=\"https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002FAmapVoice\u002FPilotTTS\">ModelScope\u003C\u002Fa> &nbsp;|&nbsp; 🎧 \u003Ca href=\"https:\u002F\u002Famapvoice.github.io\u002FPilotTTS\u002F\">Demos\u003C\u002Fa>\n\u003C\u002Fp>\n\n\n## News 📝\n\n- **[Coming Soon]** Expanding support for 14+ dialects, with model weights to be released soon\n- **[2026.05]** Release Pilot-TTS base and instruct model weights\n\n## Highlight 🔥\n\n**PilotTTS** is an LLM-based text-to-speech (TTS) system that builds an intentionally 
simplified architecture with fully open-source components and achieves competitive performance through rigorous data engineering.\n\n### Key Features\n- **A fully open-source data processing pipeline:** We design a multi-stage pipeline that incorporates quality assessment and enhancement, annotation, and quality filtering, where all operators are implemented using publicly available tools. This pipeline converts large-scale Internet audio into clean training data with rich annotation, achieving high-quality data generation while substantially reducing costs.\n- **Content Consistency and Speaker Similarity Control:** On the Seed-TTS test set, our model achieves state-of-the-art speaker similarity (0.862) and highly competitive content accuracy (CER 0.87%).\n- **Emotion and Paralinguistic Control:** Supports controllable synthesis for 11 emotion categories (Happy, Sad, Fear, Angry, Contempt, Serious, Surprise, Blue, Concern, Disgust, Psychology) and 4 paralinguistic categories (LAUGH, BREATH, CRY, COUGH).\n- **Dialect Control:** Supports 14 Chinese dialects and enables cross-dialect synthesis, with particular strength in synthesizing from Mandarin Chinese to the target dialect.\n\n## Installation ⚙️\n\n### Clone and install\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fxxx\u002Fpilot-tts.git\ncd pilot-tts\n```\n\n### Environment setup\n\n```bash\nconda create -n pilot-tts python=3.10 -y\nconda activate pilot-tts\npip install -r requirements.txt\n```\n\n### Model download\n\n#### 1. Pilot-TTS models (our weights)\n\n```python\n# ModelScope\nfrom modelscope import snapshot_download\nsnapshot_download('AmapVoice\u002FPilotTTS', local_dir='pretrained_models\u002F')\n\n# HuggingFace\nfrom huggingface_hub import snapshot_download\nsnapshot_download('AmapVoice\u002FPilotTTS', local_dir='pretrained_models\u002F')\n```\n\nThis includes: `pilot_tts.pt`, `pilot_tts_instruct.pt`, and `tokenizer\u002F`.\n\n#### 2. 
Third-party open-source models\n\nDownload the following dependencies from their respective open-source projects:\n\n```python\nfrom huggingface_hub import snapshot_download\n\n# w2v-bert-2.0 (audio feature extractor)\nsnapshot_download('facebook\u002Fw2v-bert-2.0', local_dir='pretrained_models\u002Fw2v-bert-2.0')\n```\n\n> Note: `wav2vec2bert_stats.pt` (from [MaskGCT](https:\u002F\u002Fgithub.com\u002Fopen-mmlab\u002FAmphion\u002Ftree\u002Fmain\u002Fmodels\u002Ftts\u002Fmaskgct)) is included in the Pilot-TTS model package.\n\n#### Final directory structure\n\n```\npretrained_models\u002F\n├── pilot_tts.pt              # Base model (zero-shot voice cloning)\n├── pilot_tts_instruct.pt     # Instruct model (emotion, paralanguage, dialect)\n├── Qwen3-0.6B\u002F              # LLM backbone (from Qwen)\n├── w2v-bert-2.0\u002F            # Audio feature extractor (from Meta)\n├── wav2vec2bert_stats.pt    # Feature normalization stats (from MaskGCT)\n└── CosyVoice3-0.5B\u002F        # Flow-matching vocoder (from FunAudioLLM)\n```\n\n## Quick Start 📖\n\nRun all inference demos with a single command:\n\n```bash\npython demo.py\n```\n\n## Inference\n\n### Python API\n\n```python\nfrom demo import load_engine, synthesize\n\n# Zero-shot voice cloning (base model)\nengine = load_engine(\n    config_path=\"configs\u002Finfer_pilot_tts.yaml\",\n    checkpoint=\"pretrained_models\u002Fpilot_tts.pt\",\n)\n\nsynthesize(engine, text=\"你好，世界！\",\n           prompt_wav=\"assert\u002Fprompt.wav\",\n           output_path=\"output\u002Fclone.wav\")\n\n# Load instruct model (emotion, paralanguage, dialect)\nengine_instruct = load_engine(\n    config_path=\"configs\u002Finfer_pilot_tts_instruct.yaml\",\n    checkpoint=\"pretrained_models\u002Fpilot_tts_instruct.pt\",\n)\n\n# Emotion synthesis\nsynthesize(engine_instruct, text=\"今天天气真好啊！\",\n           prompt_wav=\"assert\u002Fprompt.wav\",\n           emotion=\"happy\", output_path=\"output\u002Fhappy.wav\")\n\n# 
Paralanguage\nsynthesize(engine_instruct, text=\"这太好笑了\u003C|LAUGH|>停不下来\",\n           prompt_wav=\"assert\u002Fprompt.wav\",\n           output_path=\"output\u002Flaugh.wav\")\n\n# Dialect (Henan)\nsynthesize(engine_instruct, text=\"中不中啊，咱俩一块儿去吃胡辣汤吧\",\n           prompt_wav=\"assert\u002Fprompt.wav\",\n           language=\"zh-henan\", output_path=\"output\u002Fhenan.wav\")\n```\n\n### Command Line\n\n```bash\n# Zero-shot voice cloning (base model)\npython inference.py \\\n    --checkpoint pretrained_models\u002Fpilot_tts.pt \\\n    --prompt-wav assert\u002Fprompt.wav \\\n    --text \"需要合成的目标文本\" \\\n    --output output\u002Fzeroshot.wav\n\n# Emotion synthesis (instruct model)\npython inference.py \\\n    --config configs\u002Finfer_pilot_tts_instruct.yaml \\\n    --checkpoint pretrained_models\u002Fpilot_tts_instruct.pt \\\n    --prompt-wav assert\u002Fprompt.wav \\\n    --text \"今天天气真好啊，我们去公园玩吧！\" \\\n    --emotion happy \\\n    --output output\u002Femotion.wav\n\n# Paralanguage (instruct model)\npython inference.py \\\n    --config configs\u002Finfer_pilot_tts_instruct.yaml \\\n    --checkpoint pretrained_models\u002Fpilot_tts_instruct.pt \\\n    --prompt-wav assert\u002Fprompt.wav \\\n    --text \"这个笑话太好笑了\u003C|LAUGH|>我真的忍不住\" \\\n    --output output\u002Fparalang.wav\n\n# Dialect synthesis (instruct model)\npython inference.py \\\n    --config configs\u002Finfer_pilot_tts_instruct.yaml \\\n    --checkpoint pretrained_models\u002Fpilot_tts_instruct.pt \\\n    --prompt-wav assert\u002Fprompt.wav \\\n    --text \"中不中啊，咱俩一块儿去吃胡辣汤吧\" \\\n    --language zh-henan \\\n    --output output\u002Fdialect.wav\n```\n\n### Supported Controls\n\n| Feature | Usage | Model |\n|---------|-------|-------|\n| Voice Cloning | Provide prompt audio | Both |\n| Emotions | `--emotion \u003Ctag>` | Instruct |\n| Paralanguage | Insert tags in text | Instruct |\n| Dialects | `--language \u003Cdialect>` | Instruct |\n\n**Emotions:**\n\n| Tag | Emotion | Tag | Emotion 
|\n|-----|------|-----|------|\n| `happy` | Happy | `sad` | Sad |\n| `angry` | Angry | `surprise` | Surprise |\n| `fear` | Fear | `disgust` | Disgust |\n| `serious` | Serious | `concern` | Concern |\n| `blue` | Melancholy | `disdain` | Contempt |\n| `neutral` | Neutral \u002F calm | `psychology` | Inner monologue |\n| `unknown` | No emotion specified | | |\n\n**Paralanguage tags:**\n\n| Tag | Description |\n|-----|-------------|\n| `\u003C\\|LAUGH\\|>` | Laughter |\n| `\u003C\\|BREATH\\|>` | Breathing |\n| `\u003C\\|COUGH\\|>` | Cough |\n| `\u003C\\|CRY\\|>` | Crying |\n| `\u003C\\|LAUGH_SPAN\\|>...\u003C\\|\u002FLAUGH_SPAN\\|>` | Wraps text spoken while laughing |\n\n**Dialects:**\n\n| Tag | Dialect | Tag | Dialect |\n|-----|------|-----|------|\n| `zh-dongbei` | Northeastern Mandarin (Dongbei) | `zh-shandong` | Shandong |\n| `zh-henan` | Henan | `zh-shan1xi` | Shanxi |\n| `zh-minnan` | Southern Min (Minnan) | `zh-gansu` | Gansu |\n| `zh-ningxia` | Ningxia | `zh-shanghai` | Shanghainese |\n| `zh-chongqing` | Chongqing | `zh-hubei` | Hubei |\n| `zh-hunan` | Hunan | `zh-jiangxi` | Jiangxi |\n| `zh-guizhou` | Guizhou | `zh-yunnan` | Yunnan |\n\n## WebUI\n\nLaunch a Gradio-based interactive interface:\n\n```bash\npython webui.py --port 9000\n```\n\n## Project Structure\n\n```\npilot-tts\u002F\n├── configs\u002F                     # Inference configurations (per checkpoint)\n├── demo.py                      # Complete demo (all inference modes)\n├── inference.py                 # CLI inference entry\n├── webui.py                     # Gradio WebUI\n├── asset\u002F                       # Example prompt audio\n├── pilot_voice\u002F                 # Core model code\n│   ├── engine.py                # InferenceEngine pipeline\n│   ├── model.py                 # AR model (Qwen3 backbone + audio tokens)\n│   ├── sampling.py              # RAS sampling (from VALL-E 2)\n│   ├── utils.py                 # Utilities\n│   ├── modules\u002F                 # Conformer + Perceiver modules\n│   └── tools\u002F                   # Audio & text processing\n├── third_party\u002F\n│   ├── cosyvoice\u002F               # Flow-matching vocoder\n│   └── Matcha-TTS\u002F              # Flow matching 
dependency\n├── tokenizer\u002F                   # Custom tokenizer with special tokens\n├── pretrained_models\u002F           # Model weights (not in git)\n└── requirements.txt\n```\n\n## Acknowledgements\n\n- [CosyVoice](https:\u002F\u002Fgithub.com\u002FFunAudioLLM\u002FCosyVoice): Flow-matching & Vocoder\n- [Qwen3](https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen3): LLM backbone\n- [Matcha-TTS](https:\u002F\u002Fgithub.com\u002Fshivammehta25\u002FMatcha-TTS): Flow matching framework\n- [MaskGCT](https:\u002F\u002Fgithub.com\u002Fopen-mmlab\u002FAmphion\u002Ftree\u002Fmain\u002Fmodels\u002Ftts\u002Fmaskgct): wav2vec2bert feature statistics\n\n## Citation\n\n```bibtex\n@article{pilottts2026,\n      title={PilotTTS: A Disciplined Modular Recipe for Competitive Speech Synthesis},\n      author={Bowen Li and Shaotong Guo and Zhen Wang and Yang Xiang and Mingli Jin and Yihang Lin and Jiahui Zhao and Weibo Xiong and Dongrui Li and Keming Chen and Yunze Gao and Yuze Zhou and Zeyang Lin and Yue Liu},\n      year={2026},\n      journal={arXiv preprint arXiv:2605.27258}\n}\n```\n\n## License\n\nApache-2.0\n","PilotTTS is an LLM-based text-to-speech system that achieves high-performance speech synthesis with a streamlined architecture and fully open-source components. Its core features include a complete open-source data processing pipeline that converts large-scale Internet audio into high-quality training data; control over content consistency and speaker similarity, with leading results on the Seed-TTS test set; controllable synthesis across 11 emotion categories and 4 paralinguistic categories; and conversion among 14 Chinese dialects. PilotTTS suits applications that need high-quality, varied speech output, such as virtual assistants and audiobook production.","2026-06-11 04:08:18","CREATED_QUERY"]