[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-539":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":25,"hasPages":23,"topics":26,"createdAt":10,"pushedAt":10,"updatedAt":33,"readmeContent":34,"aiSummary":35,"trendingCount":16,"starSnapshotCount":16,"syncStatus":36,"lastSyncTime":37,"discoverSource":38},539,"Real-Time-Voice-Cloning","CorentinJ\u002FReal-Time-Voice-Cloning","CorentinJ","Clone a voice in 5 seconds to generate arbitrary speech in real-time","",null,"Python",59906,9408,938,162,0,1,21,176,9,45,"Other",false,"master",true,[27,28,29,30,31,32],"deep-learning","python","pytorch","tensorflow","tts","voice-cloning","2026-06-12 02:00:15","# Real-Time Voice Cloning\n\nThis repository is an implementation of [Transfer Learning from Speaker Verification to\nMultispeaker Text-To-Speech Synthesis](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1806.04558.pdf) (SV2TTS) with a vocoder that works in real-time. This was my [master's thesis](https:\u002F\u002Fmatheo.uliege.be\u002Fhandle\u002F2268.2\u002F6801).\n\nSV2TTS is a deep learning framework in three stages. In the first stage, one creates a digital representation of a voice from a few seconds of audio. 
In the second and third stages, this representation is used as reference to generate speech given arbitrary text.\n\n**Video demonstration** (click the picture):\n\n[![Toolbox demo](https:\u002F\u002Fi.imgur.com\u002F8lFUlgz.png)](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=-O_hYhToKoA)\n\n### Papers implemented\n\n| URL                                                    | Designation            | Title                                                                                    | Implementation source                                   |\n| ------------------------------------------------------ | ---------------------- | ---------------------------------------------------------------------------------------- | ------------------------------------------------------- |\n| [**1806.04558**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1806.04558.pdf) | **SV2TTS**             | **Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis** | This repo                                               |\n| [1802.08435](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1802.08435.pdf)     | WaveRNN (vocoder)      | Efficient Neural Audio Synthesis                                                         | [fatchord\u002FWaveRNN](https:\u002F\u002Fgithub.com\u002Ffatchord\u002FWaveRNN) |\n| [1703.10135](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1703.10135.pdf)     | Tacotron (synthesizer) | Tacotron: Towards End-to-End Speech Synthesis                                            | [fatchord\u002FWaveRNN](https:\u002F\u002Fgithub.com\u002Ffatchord\u002FWaveRNN) |\n| [1710.10467](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1710.10467.pdf)     | GE2E (encoder)         | Generalized End-To-End Loss for Speaker Verification                                     | This repo                                               |\n\n## Heads up\n\nLike everything else in Deep Learning, this repo has quickly gotten old. 
Many SaaS apps (often paid) will give you better audio quality than this repository will. If you wish for an open-source solution with high voice quality:\n\n- Check out [paperswithcode](https:\u002F\u002Fpaperswithcode.com\u002Ftask\u002Fspeech-synthesis\u002F) for other repositories and recent research in the field of speech synthesis.\n- Check out [Chatterbox](https:\u002F\u002Fgithub.com\u002Fresemble-ai\u002Fchatterbox) for a similar project up to date with the 2025 SOTA in voice cloning.\n\n## Running the toolbox\n\nBoth Windows and Linux are supported.\n1. Install [ffmpeg](https:\u002F\u002Fffmpeg.org\u002Fdownload.html#get-packages). This is necessary for reading audio files. Check that it is installed by running the following in a command line:\n```\nffmpeg\n```\n2. Install uv for Python package management\n```\n# On Windows:\npowershell -ExecutionPolicy ByPass -c \"irm https:\u002F\u002Fastral.sh\u002Fuv\u002Finstall.ps1 | iex\"\n# On Linux\ncurl -LsSf https:\u002F\u002Fastral.sh\u002Fuv\u002Finstall.sh | sh\n\n# Alternatively, on any platform if you have pip installed you can do\npip install -U uv\n```\n3. Run one of the following commands\n```\n# Run the toolbox if you have an NVIDIA GPU\nuv run --extra cuda demo_toolbox.py\n# Use this if you don't\nuv run --extra cpu demo_toolbox.py\n\n# Run in command line if you don't want the GUI\nuv run --extra cuda demo_cli.py\nuv run --extra cpu demo_cli.py\n```\nuv will automatically create a .venv directory for you with an appropriate Python environment. [Open an issue](https:\u002F\u002Fgithub.com\u002FCorentinJ\u002FReal-Time-Voice-Cloning\u002Fissues) if this fails for you.\n\n### (Optional) Download Pretrained Models\n\nPretrained models are now downloaded automatically. 
If this doesn't work for you, you can manually download them from [Hugging Face](https:\u002F\u002Fhuggingface.co\u002FCorentinJ\u002FSV2TTS\u002Ftree\u002Fmain).\n\n### (Optional) Download Datasets\n\nFor playing with the toolbox alone, I only recommend downloading [`LibriSpeech\u002Ftrain-clean-100`](https:\u002F\u002Fwww.openslr.org\u002Fresources\u002F12\u002Ftrain-clean-100.tar.gz). Extract the contents as `\u003Cdatasets_root>\u002FLibriSpeech\u002Ftrain-clean-100` where `\u003Cdatasets_root>` is a directory of your choosing. Other datasets are supported in the toolbox, see [here](https:\u002F\u002Fgithub.com\u002FCorentinJ\u002FReal-Time-Voice-Cloning\u002Fwiki\u002FTraining#datasets). You're free not to download any dataset, but then you will need your own data as audio files or you will have to record it with the toolbox.\n","This project implements real-time voice cloning based on deep learning and can generate speech for arbitrary text within 5 seconds from a small amount of audio. It uses the three-stage SV2TTS framework: it first creates a digital representation of a voice from a few seconds of audio, then uses that representation to generate speech for the given text. The project is written in Python, uses deep learning libraries such as PyTorch and TensorFlow, and uses WaveRNN as the vocoder to achieve real-time processing. It suits rapid prototyping and research use in areas such as speech synthesis and personalized assistants. Although higher-quality commercial solutions now exist, as an open-source project it still offers a valuable reference for exploration in related fields.",2,"2026-06-11 02:37:14","top_all"]