[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-74053":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":15,"stars7d":16,"stars30d":17,"stars90d":15,"forks30d":15,"starsTrendScore":15,"compositeScore":18,"rankGlobal":9,"rankLanguage":9,"license":19,"archived":20,"fork":20,"defaultBranch":21,"hasWiki":22,"hasPages":20,"topics":23,"createdAt":9,"pushedAt":9,"updatedAt":24,"readmeContent":25,"aiSummary":26,"trendingCount":15,"starSnapshotCount":15,"syncStatus":27,"lastSyncTime":28,"discoverSource":29},74053,"VoiceCraft","jasonppy\u002FVoiceCraft","jasonppy","Zero-Shot Speech Editing and Text-to-Speech in the Wild",null,"Jupyter Notebook",8490,795,97,95,0,5,10,68.2,"Other",false,"master",true,[],"2026-06-12 04:01:12","# VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2403.16973-brightgreen.svg?style=flat-square)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2403.16973.pdf)  [![HuggingFace](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fpyp1\u002FVoiceCraft_gradio)  [![Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1IOjpglQyMTO2C3Y94LD9FY0Ocn-RJRg6?usp=sharing)  [![Replicate](https:\u002F\u002Freplicate.com\u002Fcjwbw\u002Fvoicecraft\u002Fbadge)](https:\u002F\u002Freplicate.com\u002Fcjwbw\u002Fvoicecraft)  [![YouTube demo](https:\u002F\u002Fimg.shields.io\u002Fyoutube\u002Fviews\u002FeikybOi8iwU)](https:\u002F\u002Fyoutu.be\u002FeikybOi8iwU)  [![Demo 
page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FAudio_Samples-blue?logo=Github&style=flat-square)](https:\u002F\u002Fjasonppy.github.io\u002FVoiceCraft_web\u002F)\n\n\n### TL;DR\nVoiceCraft is a token infilling neural codec language model that achieves state-of-the-art performance on both **speech editing** and **zero-shot text-to-speech (TTS)** on in-the-wild data including audiobooks, internet videos, and podcasts.\n\nTo clone or edit an unseen voice, VoiceCraft needs only a few seconds of reference.\n\n## How to run inference\nThere are four ways (besides running Gradio in Colab):\n\n1. More flexible inference beyond the Gradio UI in Google Colab. See [quickstart colab](#quickstart-colab)\n2. With docker. See [quickstart docker](#quickstart-docker)\n3. Without docker. See [environment setup](#environment-setup). You can also run gradio locally if you choose this option\n4. As a standalone script that you can easily integrate into other projects.\nSee [quickstart command line](#quickstart-command-line).\n\nWhen you are inside the docker image or you have installed all dependencies, check out [`inference_tts.ipynb`](.\u002Finference_tts.ipynb).\n\nIf you want to do model development such as training\u002Ffinetuning, I recommend following [environment setup](#environment-setup) and [training](#training).\n\n## News\n:star: 03\u002F15\u002F2025: changing inference sampling from topp=1 to topk=40 massively improves editing and TTS performance\n\n:star: 04\u002F22\u002F2024: 330M\u002F830M TTS Enhanced Models are up [here](https:\u002F\u002Fhuggingface.co\u002Fpyp1), load them through [`gradio_app.py`](.\u002Fgradio_app.py) or [`inference_tts.ipynb`](.\u002Finference_tts.ipynb)! Replicate demo is up, major thanks to [@chenxwh](https:\u002F\u002Fgithub.com\u002Fchenxwh)!\n\n:star: 04\u002F11\u002F2024: VoiceCraft Gradio is now available on HuggingFace Spaces [here](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fpyp1\u002FVoiceCraft_gradio)! 
Major thanks to [@zuev-stepan](https:\u002F\u002Fgithub.com\u002Fzuev-stepan), [@Sewlell](https:\u002F\u002Fgithub.com\u002FSewlell), [@pgsoar](https:\u002F\u002Fgithub.com\u002Fpgosar), and [@Ph0rk0z](https:\u002F\u002Fgithub.com\u002FPh0rk0z).\n\n:star: 04\u002F05\u002F2024: I finetuned giga330M with the TTS objective on gigaspeech and 1\u002F5 of librilight. Weights are [here](https:\u002F\u002Fhuggingface.co\u002Fpyp1\u002FVoiceCraft\u002Ftree\u002Fmain). Make sure maximal prompt + generation length \u003C= 16 seconds (due to our limited compute, we had to drop utterances longer than 16s in training data). Even stronger models are forthcoming, stay tuned!\n\n:star: 03\u002F28\u002F2024: Model weights for giga330M and giga830M are up on HuggingFace🤗 [here](https:\u002F\u002Fhuggingface.co\u002Fpyp1\u002FVoiceCraft\u002Ftree\u002Fmain)!\n\n## TODO\n- [x] Codebase upload\n- [x] Environment setup\n- [x] Inference demo for speech editing and TTS\n- [x] Training guidance\n- [x] RealEdit dataset and training manifest\n- [x] Model weights\n- [x] Better guidance on training\u002Ffinetuning\n- [x] Colab notebooks\n- [x] HuggingFace Spaces demo\n- [x] Command line\n- [ ] Improve efficiency\n\n## QuickStart Colab\n\n:star: To try out speech editing or TTS Inference with VoiceCraft, the simplest way is using Google Colab.\nInstructions to run are on the Colab itself.\n\n1. To try [Speech Editing](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1FV7EC36dl8UioePY1xXijXTMl7X47kR_?usp=sharing)\n2. To try [TTS Inference](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1lch_6it5-JpXgAQlUTRRI2z2_rk5K67Z?usp=sharing)\n\n## QuickStart Command Line\n\n:star: To use it as a standalone script, check out tts_demo.py and speech_editing_demo.py.\nBe sure to first [set up your environment](#environment-setup).\nWithout arguments, they will run the standard demo arguments used as an example elsewhere\nin this repository. 
You can use the command line arguments to specify custom input audio,\ntarget transcripts, and inference hyperparameters. Run the help command for more information:\n`python3 tts_demo.py -h`\n\n## QuickStart Docker\n:star: To try out TTS inference with VoiceCraft, you can also use docker. Thanks to [@ubergarm](https:\u002F\u002Fgithub.com\u002Fubergarm) and [@jayc88](https:\u002F\u002Fgithub.com\u002Fjay-c88) for making this happen.\n\nTested on Linux and Windows and should work with any host with docker installed.\n```bash\n# 1. clone the repo in a directory on a drive with plenty of free space\ngit clone git@github.com:jasonppy\u002FVoiceCraft.git\ncd VoiceCraft\n\n# 2. assumes you have docker installed with nvidia container-toolkit (windows has this built into the driver)\n# https:\u002F\u002Fdocs.nvidia.com\u002Fdatacenter\u002Fcloud-native\u002Fcontainer-toolkit\u002F1.13.5\u002Finstall-guide.html\n# sudo apt-get install -y nvidia-container-toolkit-base || yay -Syu nvidia-container-toolkit || echo etc...\n\n# 3. First build the docker image\ndocker build --tag \"voicecraft\" .\n\n# 4. Try to start an existing container otherwise create a new one passing in all GPUs\n.\u002Fstart-jupyter.sh  # linux\nstart-jupyter.bat   # windows\n\n# 5. now open a webpage on the host box to the URL shown at the bottom of:\ndocker logs jupyter\n\n# 6. optionally look inside from another terminal\ndocker exec -it jupyter \u002Fbin\u002Fbash\nexport USER=(your_linux_username_used_above)\nexport HOME=\u002Fhome\u002F$USER\nsudo apt-get update\n\n# 7. confirm video card(s) are visible inside container\nnvidia-smi\n\n# 8. 
Now in browser, open inference_tts.ipynb and work through one cell at a time\necho GOOD LUCK\n```\n\n## Environment setup\n```bash\nconda create -n voicecraft python=3.9.16\nconda activate voicecraft\n\npip install -e git+https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Faudiocraft.git@c5157b5bf14bf83449c17ea1eeb66c19fb4bc7f0#egg=audiocraft\npip install xformers==0.0.22\npip install torchaudio==2.0.2 torch==2.0.1 # this assumes your system is compatible with CUDA 11.7, otherwise check out https:\u002F\u002Fpytorch.org\u002Fget-started\u002Fprevious-versions\u002F#v201\napt-get install ffmpeg # if you don't already have ffmpeg installed\napt-get install espeak-ng # backend for the phonemizer installed below\npip install tensorboard==2.16.2\npip install phonemizer==3.2.1\npip install datasets==2.16.0\npip install torchmetrics==0.11.1\npip install huggingface_hub==0.22.2\n# install MFA for getting forced-alignment, this could take a few minutes\nconda install -c conda-forge montreal-forced-aligner=2.2.17 openfst=1.8.2 kaldi=5.5.1068\n# install MFA english dictionary and model\nmfa model download dictionary english_us_arpa\nmfa model download acoustic english_us_arpa\n# pip install huggingface_hub\n# conda install pocl # above gives a warning for installing pocl, not sure if really need this\n\n# to run ipynb\nconda install -n voicecraft ipykernel --no-deps --force-reinstall\n```\n\nIf you encounter version issues when running things, check out [environment.yml](.\u002Fenvironment.yml) for the exact versions.\n\n## Inference Examples\nCheck out [`inference_speech_editing.ipynb`](.\u002Finference_speech_editing.ipynb) and [`inference_tts.ipynb`](.\u002Finference_tts.ipynb)\n\n## Gradio\n### Run in colab\n\n[![Open in Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1IOjpglQyMTO2C3Y94LD9FY0Ocn-RJRg6?usp=sharing)\n\n### Run locally\nAfter environment setup install 
additional dependencies:\n```bash\napt-get install -y espeak espeak-data libespeak1 libespeak-dev\napt-get install -y festival*\napt-get install -y build-essential\napt-get install -y flac libasound2-dev libsndfile1-dev vorbis-tools\napt-get install -y libxml2-dev libxslt-dev zlib1g-dev\npip install -r gradio_requirements.txt\n```\n\nRun gradio server from terminal or [`gradio_app.ipynb`](.\u002Fgradio_app.ipynb):\n```bash\npython gradio_app.py\n```\nIt is ready to use on [default url](http:\u002F\u002F127.0.0.1:7860).\n\n### How to use it\n1. (optionally) Select models\n2. Load models\n3. Transcribe\n4. (optionally) Tweak some parameters\n5. Run\n6. (optionally) Rerun part-by-part in Long TTS mode\n\n### Some features\nSmart transcript: write only what you want to generate\n\nTTS mode: Zero-shot TTS\n\nEdit mode: Speech editing\n\nLong TTS mode: Easy TTS on long texts\n\n\n## Training\nTo train a VoiceCraft model, you need to prepare the following parts:\n1. utterances and their transcripts\n2. encode the utterances into codes using e.g. Encodec\n3. convert transcripts into phoneme sequences, and a phoneme set (we named it vocab.txt)\n4. manifest (i.e. metadata)\n\nSteps 1, 2, and 3 are handled in [.\u002Fdata\u002Fphonemize_encodec_encode_hf.py](.\u002Fdata\u002Fphonemize_encodec_encode_hf.py), where\n1. Gigaspeech is downloaded through HuggingFace. Note that you need to sign an agreement in order to download the dataset (it needs your auth token)\n2. 
phoneme sequences and encodec codes are also extracted using the script.\n\nAn example run:\n\n```bash\nconda activate voicecraft\nexport CUDA_VISIBLE_DEVICES=0\ncd .\u002Fdata\npython phonemize_encodec_encode_hf.py \\\n--dataset_size xs \\\n--download_to path\u002Fto\u002Fstore_huggingface_downloads \\\n--save_dir path\u002Fto\u002Fstore_extracted_codes_and_phonemes \\\n--encodec_model_path path\u002Fto\u002Fencodec_model \\\n--mega_batch_size 120 \\\n--batch_size 32 \\\n--max_len 30000\n```\nwhere the model for `encodec_model_path` is available [here](https:\u002F\u002Fhuggingface.co\u002Fpyp1\u002FVoiceCraft). This model is trained on Gigaspeech XL; it has 56M parameters and 4 codebooks, each with 2048 codes. Details are described in our [paper](https:\u002F\u002Fjasonppy.github.io\u002Fassets\u002Fpdfs\u002FVoiceCraft.pdf). If you encounter OOM during extraction, try decreasing the batch_size and\u002For max_len.\nThe extracted codes, phonemes, and vocab.txt will be stored at `path\u002Fto\u002Fstore_extracted_codes_and_phonemes\u002F${dataset_size}\u002F{encodec_16khz_4codebooks,phonemes,vocab.txt}`.\n\nAs for the manifest, please download train.txt and validation.txt from [here](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fpyp1\u002FVoiceCraft_RealEdit\u002Ftree\u002Fmain), and put them under `path\u002Fto\u002Fstore_extracted_codes_and_phonemes\u002Fmanifest\u002F`. Please also download vocab.txt from [here](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fpyp1\u002FVoiceCraft_RealEdit\u002Ftree\u002Fmain) if you want to use our pretrained VoiceCraft model (so that the phoneme-to-token matching is the same).\n\nNow, you are good to start training!\n\n```bash\nconda activate voicecraft\ncd .\u002Fz_scripts\nbash e830M.sh\n```\n\nIt's the same procedure to prepare your own custom dataset. Make sure that if\n\n## Finetuning\nAs with Training, you need to do steps 1-4, and I recommend using AdamW for optimization if you finetune a pretrained model, for better stability. 
Check out the script `.\u002Fz_scripts\u002Fe830M_ft.sh`.\n\nIf your dataset introduces new phonemes (which is very likely) that don't exist in the giga checkpoint, make sure you combine the original phonemes with the phonemes from your data when constructing the vocab. You also need to adjust `--text_vocab_size` and `--text_pad_token` so that the former is greater than or equal to your vocab size, and the latter has the same value as `--text_vocab_size` (i.e. `--text_pad_token` is always the last token). Also, since the text embeddings are now of a different size, make sure you modify the weight-loading part so that it won't crash (you could skip loading `text_embedding`, or load only the existing part and randomly initialize the new part).\n\n## License\nThe codebase is under CC BY-NC-SA 4.0 ([LICENSE-CODE](.\u002FLICENSE-CODE)), and the model weights are under Coqui Public Model License 1.0.0 ([LICENSE-MODEL](.\u002FLICENSE-MODEL)). Note that we use some code from other repositories that are under different licenses: `.\u002Fmodels\u002Fcodebooks_patterns.py` is under MIT license; `.\u002Fmodels\u002Fmodules`, `.\u002Fsteps\u002Foptim.py`, `data\u002Ftokenizer.py` are under Apache License, Version 2.0; the phonemizer we used is under GNU 3.0 License.\n\n## Acknowledgement\nWe thank Feiteng for his [VALL-E reproduction](https:\u002F\u002Fgithub.com\u002Flifeiteng\u002Fvall-e), and we thank the audiocraft team for open-sourcing [encodec](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Faudiocraft).\n\n## Citation\n```\n@article{peng2024voicecraft,\n  author    = {Peng, Puyuan and Huang, Po-Yao and Mohamed, Abdelrahman and Harwath, David},\n  title     = {VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild},\n  journal   = {arXiv},\n  year      = {2024},\n}\n```\n\n## Disclaimer\nAny organization or individual is prohibited from using any technology mentioned in this paper to generate or edit someone's speech without his\u002Fher consent, including but not 
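limited to government leaders, political figures, and celebrities. If you do not comply with this item, you could be in violation of copyright laws.\n\n","VoiceCraft is a neural codec language model for zero-shot speech editing and text-to-speech, particularly suited to in-the-wild data including audiobooks, internet videos, and podcasts. Its core capability is cloning or editing an unseen voice from only a few seconds of reference audio, and it performs strongly on both speech editing and zero-shot text-to-speech. Technically, it achieves high-quality audio generation and modification through a token infilling mechanism. It suits applications that need to quickly generate speech in a particular style or imitate a particular person's voice, such as content creation and virtual assistant personalization. The project also offers several deployment options, including Colab, Docker, and local execution, so users can choose the method that fits their needs.",2,"2026-06-11 03:48:36","high_star"]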