[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-83130":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":18,"stars90d":16,"forks30d":16,"starsTrendScore":19,"compositeScore":20,"rankGlobal":10,"rankLanguage":10,"license":21,"archived":22,"fork":22,"defaultBranch":23,"hasWiki":24,"hasPages":22,"topics":25,"createdAt":10,"pushedAt":10,"updatedAt":33,"readmeContent":34,"aiSummary":35,"trendingCount":16,"starSnapshotCount":16,"syncStatus":36,"lastSyncTime":37,"discoverSource":38},83130,"WavTTS","cwx-worst-one\u002FWavTTS","cwx-worst-one","WavTTS: Towards High-Quality Zero-Shot TTS via Direct Raw Waveform Modeling","https:\u002F\u002Fwavtts.github.io",null,"Python",171,5,3,1,0,6,102,36,84.33,"MIT License",false,"main",true,[26,27,28,29,30,31,32],"chinese","english","mel-spectrogram","non-autoregressive","torch","waveform","zero-shot-tts","2026-06-12 04:01:40","\u003Cdiv align=\"center\">\n  \u003Ch1>\n  WavTTS: Towards High-Quality Zero-Shot TTS via Direct Raw Waveform Modeling\n  \u003C\u002Fh1> \n\n  \u003Cp align=\"center\">\n    \u003Ca href=\"#installation\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPython-3.10-brightgreen.svg?logo=python&logoColor=white\" alt=\"Python\">\u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2606.03455\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FArxiv-2606.03455-b31b1b.svg?logo=arXiv\" alt=\"arXiv\">\u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fwavtts.github.io\u002F\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🌐%20Demo-Page-orange.svg\" alt=\"Demo\">\u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fworstchan\u002FWavTTS\">\u003Cimg 
src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🤗%20HF-Models-yellow.svg\" alt=\"HF Models\">\u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fchenxie95\u002FWavTTS\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🤗%20HF-Space-blue.svg\" alt=\"HF Space\">\u003C\u002Fa>\n  \u003C\u002Fp>\n\n  \u003Cp align=\"center\">\n    \u003Ci>End-to-end zero-shot TTS directly in the raw waveform space.\u003C\u002Fi>\n  \u003C\u002Fp>\n\n\u003C\u002Fdiv>\n\n## 📖 Introduction\n\nWavTTS is an end-to-end zero-shot TTS framework that generates speech directly in the raw waveform space, without relying on intermediate acoustic representations such as mel-spectrograms, VAE latents, or codec tokens. Built on flow matching with DiT, WavTTS combines waveform patchification, multi-scale mel-spectrogram supervision, and optimized noise scheduling to achieve high-quality waveform generation. For more details, please refer to our paper: [WavTTS: Towards High-Quality Zero-Shot TTS via Direct Raw Waveform Modeling](https:\u002F\u002Farxiv.org\u002Fabs\u002F2606.03455).\n\n\u003Cdiv align=\"center\">\n  \u003Cimg src=\"docs\u002Fstatic\u002Fimages\u002Fwavtts_pipeline.png\" alt=\"WavTTS pipeline\" width=\"85%\">\n\u003C\u002Fdiv>\n\n**Note:** This repository is based on [F5-TTS](https:\u002F\u002Fgithub.com\u002FSWivid\u002FF5-TTS). For general usage, troubleshooting, and basic guidance, please refer to the original F5-TTS repository. The sections below outline workflows specific to WavTTS.\n\n## 🚀 News\n\n- **[2026-06-03]**: We have released the WavTTS codebase along with the official 16 kHz checkpoint. Please note that this project is still under active development, and we will continue to roll out updates and improvements.\n\n\n\n## ⚙️ Installation\n\nWe recommend using Conda to manage the environment and dependencies.\n\n```bash\n# 1. 
Clone the repository\ngit clone https:\u002F\u002Fgithub.com\u002Fcwx-worst-one\u002FWavTTS\ncd WavTTS\n\n# 2. Create and activate a virtual environment\nconda create -n wavtts python=3.10\nconda activate wavtts\n\n# 3. Install PyTorch (>=2.2.0) with CUDA support, e.g.,\npip install torch==2.6.0 torchaudio==2.6.0 --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu124\n\n# 4. Install WavTTS in editable mode\npip install -e .\n```\n\n## 📦 Model Checkpoints\n\nThe official WavTTS checkpoint is available on Hugging Face: [WavTTS 🤗](https:\u002F\u002Fhuggingface.co\u002Fworstchan\u002FWavTTS). The default checkpoint supports 16 kHz zero-shot TTS inference and will be downloaded automatically the first time you run the inference script.\n\n## 🚀 Inference\n\nWavTTS supports both command-line inference and script-based inference. For more details, please refer to the [Inference Guide](src\u002Fwavtts\u002Finfer\u002FREADME.md).\n\n### CLI Inference\n\nGenerate speech using a short reference audio prompt. CLI arguments will automatically override values defined in the TOML config.\n\n```bash\nwavtts_infer-cli \\\n  --model WavTTS \\\n  --ref_audio \"provide_prompt_wav_path_here.wav\" \\\n  --ref_text \"The content, subtitle, or transcription of the reference audio.\" \\\n  --gen_text \"The text you want WavTTS to synthesize.\"\n```\n\nAlternatively, manage your parameters cleanly using a TOML configuration file:\n\n```bash\n# Use the provided default config\nwavtts_infer-cli -c src\u002Fwavtts\u002Finfer\u002Fexamples\u002Fbasic.toml\n\n# Use a custom config (with an inline text override)\nwavtts_infer-cli -c custom.toml --gen_text \"Override text here.\"\n```\n\n### Script-based Inference\n\nFor customized pipelines, you can directly modify the paths and texts in `src\u002Fwavtts\u002Finfer\u002Finfer.sh` and execute:\n\n```bash\nbash src\u002Fwavtts\u002Finfer\u002Finfer.sh\n```\n\n\n## 🏋️ Training\n\nTraining WavTTS requires preprocessed dataset metadata. 
For a complete walkthrough of data preparation, training, and fine-tuning, please refer to the [Training Guide](src\u002Fwavtts\u002Ftrain\u002FREADME.md).\n\n### Data Preparation\n\nWe use [Emilia](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Famphion\u002FEmilia-Dataset) as the training dataset in our main experiments. After downloading Emilia, update the paths in the preparation script and run:\n\n```bash\n# Prepare training metadata for the Emilia dataset\npython src\u002Fwavtts\u002Ftrain\u002Fdatasets\u002Fprepare_emilia.py\n```\n\nPreparation scripts for other datasets like LibriTTS are available under `src\u002Fwavtts\u002Ftrain\u002Fdatasets\u002F`. To use a custom dataset, please adapt the loading logic in `src\u002Fwavtts\u002Fmodel\u002Fdataset.py`.\n\n### Launching Training\n\nWavTTS can be trained directly with `accelerate`:\n\n```bash\n# Step 1: Configure Accelerate (e.g., multi-GPU DDP, mixed precision)\naccelerate config\n\n# Step 2: Launch training using a Hydra config\n# YAML configuration files are located under the src\u002Fwavtts\u002Fconfigs\u002F directory.\naccelerate launch src\u002Fwavtts\u002Ftrain\u002Ftrain.py --config-name WavTTS.yaml\n\n# Example with inline overrides:\naccelerate launch --mixed_precision=bf16 src\u002Fwavtts\u002Ftrain\u002Ftrain.py --config-name WavTTS.yaml ++datasets.batch_size_per_gpu=19200\n```\n\nFor our main experiments, we provide a unified launcher script. 
Remember to edit the default environment variables at the top of the script before running:\n\n```bash\nbash src\u002Fwavtts\u002Ftrain\u002Frun_main_train.sh\n```\n\n\n## 📊 Evaluation\n\nFor evaluation setup, dataset preparation, and objective metric scripts, please refer to the [Evaluation Guide](src\u002Fwavtts\u002Feval\u002FREADME.md).\n\n\n## 🙏 Acknowledgements\n\nWavTTS is built upon the awesome [F5-TTS](https:\u002F\u002Fgithub.com\u002FSWivid\u002FF5-TTS) codebase, with references to the implementations of [DAC](https:\u002F\u002Fgithub.com\u002Fdescriptinc\u002Fdescript-audio-codec) and [JiT](https:\u002F\u002Fgithub.com\u002FLTH14\u002FJiT). We sincerely thank the authors for their invaluable open-source contributions.\n\nIf you encounter general pipeline or environment issues, we recommend first checking the [F5-TTS issue tracker](https:\u002F\u002Fgithub.com\u002FSWivid\u002FF5-TTS\u002Fissues), where many common questions may have already been discussed or resolved.\n\n## 📝 Citation\n\nIf you find this work useful in your research, please consider citing our paper:\n\n```bibtex\n@misc{chen2026wavttshighqualityzeroshottts,\n      title={WavTTS: Towards High-Quality Zero-Shot TTS via Direct Raw Waveform Modeling}, \n      author={Wenxi Chen and Dongya Jia and Yushen Chen and Zhikang Niu and Yuzhe Liang and Xiquan Li and Ruiqi Yan and Ziyang Ma and Guanrou Yang and Sanyuan Chen and Yue Wang and Zhuo Chen and Kai Yu and Xie Chen},\n      year={2026},\n      eprint={2606.03455},\n      archivePrefix={arXiv},\n      primaryClass={eess.AS},\n      url={https:\u002F\u002Farxiv.org\u002Fabs\u002F2606.03455}, \n}\n```\n\n## 📜 License\n\nThe codebase of this repository is released under the MIT License. 
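Due to the license restrictions of the Emilia training dataset, the released pre-trained model weights are licensed under CC BY-NC 4.0.\n","WavTTS is an end-to-end zero-shot text-to-speech (TTS) framework that generates high-quality speech directly in the raw waveform space. Its core features include a model architecture based on flow matching and DiT, combined with waveform patchification, multi-scale mel-spectrogram supervision, and optimized noise scheduling, with no reliance on intermediate acoustic representations such as mel-spectrograms or codec tokens. WavTTS suits scenarios that need high-quality speech synthesis without extensive up-front training data preparation, and is particularly well suited to rapid prototyping and development in multilingual environments.",2,"2026-06-11 04:10:12","CREATED_QUERY"]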