[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-564":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":25,"hasPages":23,"topics":26,"createdAt":10,"pushedAt":10,"updatedAt":33,"readmeContent":34,"aiSummary":35,"trendingCount":16,"starSnapshotCount":16,"syncStatus":36,"lastSyncTime":37,"discoverSource":38},564,"GPT-SoVITS","RVC-Boss\u002FGPT-SoVITS","RVC-Boss","1 min voice data can also be used to train a good TTS model! (few shot voice cloning)","",null,"Python",58583,6407,271,784,0,28,217,1185,152,120,"MIT License",false,"main",true,[27,28,29,30,31,32],"text-to-speech","tts","vits","voice-clone","voice-cloneai","voice-cloning","2026-06-12 04:00:04","\u003Cdiv align=\"center\">\n\n\u003Ch1>GPT-SoVITS-WebUI\u003C\u002Fh1>\nA Powerful Few-shot Voice Conversion and Text-to-Speech WebUI.\u003Cbr>\u003Cbr>\n\n[![madewithlove](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fmade_with-%E2%9D%A4-red?style=for-the-badge&labelColor=orange)](https:\u002F\u002Fgithub.com\u002FRVC-Boss\u002FGPT-SoVITS)\n\n\u003Ca href=\"https:\u002F\u002Ftrendshift.io\u002Frepositories\u002F7033\" target=\"_blank\">\u003Cimg src=\"https:\u002F\u002Ftrendshift.io\u002Fapi\u002Fbadge\u002Frepositories\u002F7033\" alt=\"RVC-Boss%2FGPT-SoVITS | Trendshift\" style=\"width: 250px; height: 55px;\" width=\"250\" height=\"55\"\u002F>\u003C\u002Fa>\n\n\u003C!-- img src=\"https:\u002F\u002Fcounter.seku.su\u002Fcmoe?name=gptsovits&theme=r34\" \u002F>\u003Cbr> 
-->\n\n[![Python](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpython-3.10--3.12-blue?style=for-the-badge&logo=python)](https:\u002F\u002Fwww.python.org)\n[![GitHub release](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fv\u002Frelease\u002FRVC-Boss\u002Fgpt-sovits?style=for-the-badge&logo=github)](https:\u002F\u002Fgithub.com\u002FRVC-Boss\u002Fgpt-sovits\u002Freleases)\n\n[![Train In Colab](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FColab-Training-F9AB00?style=for-the-badge&logo=googlecolab)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FRVC-Boss\u002FGPT-SoVITS\u002Fblob\u002Fmain\u002FColab-WebUI.ipynb)\n[![Huggingface](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F免费在线体验-free_online_demo-yellow.svg?style=for-the-badge&logo=huggingface)](https:\u002F\u002Flj1995-gpt-sovits-proplus.hf.space\u002F)\n[![Image Size](https:\u002F\u002Fimg.shields.io\u002Fdocker\u002Fimage-size\u002Fxxxxrt666\u002Fgpt-sovits\u002Flatest?style=for-the-badge&logo=docker)](https:\u002F\u002Fhub.docker.com\u002Fr\u002Fxxxxrt666\u002Fgpt-sovits)\n\n[![简体中文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F简体中文-阅读文档-blue?style=for-the-badge&logo=googledocs&logoColor=white)](https:\u002F\u002Fwww.yuque.com\u002Fbaicaigongchang1145haoyuangong\u002Fib3g1e)\n[![English](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FEnglish-Read%20Docs-blue?style=for-the-badge&logo=googledocs&logoColor=white)](https:\u002F\u002Frentry.co\u002FGPT-SoVITS-guide#\u002F)\n[![Change Log](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FChange%20Log-View%20Updates-blue?style=for-the-badge&logo=googledocs&logoColor=white)](https:\u002F\u002Fgithub.com\u002FRVC-Boss\u002FGPT-SoVITS\u002Fblob\u002Fmain\u002Fdocs\u002Fen\u002FChangelog_EN.md)\n[![License](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLICENSE-MIT-green.svg?style=for-the-badge&logo=opensourceinitiative)](https:\u002F\u002Fgithub.com\u002FRVC-Boss\u002FGPT-SoVITS\u002Fblob\u002Fmain\u002FLICENSE)\n\n**English** | 
[**中文简体**](.\u002Fdocs\u002Fcn\u002FREADME.md) | [**日本語**](.\u002Fdocs\u002Fja\u002FREADME.md) | [**한국어**](.\u002Fdocs\u002Fko\u002FREADME.md) | [**Türkçe**](.\u002Fdocs\u002Ftr\u002FREADME.md)\n\n\u003C\u002Fdiv>\n\n---\n\n## Features:\n\n1. **Zero-shot TTS:** Input a 5-second vocal sample and get instant text-to-speech conversion.\n\n2. **Few-shot TTS:** Fine-tune the model with just 1 minute of training data for improved voice similarity and realism.\n\n3. **Cross-lingual Support:** Run inference in languages different from those in the training dataset. English, Japanese, Korean, Cantonese, and Chinese are currently supported.\n\n4. **WebUI Tools:** Integrated tools include vocal\u002Faccompaniment separation, automatic training set segmentation, Chinese ASR, and text labeling, helping beginners create training datasets and GPT\u002FSoVITS models.\n\n**Check out our [demo video](https:\u002F\u002Fwww.bilibili.com\u002Fvideo\u002FBV12g4y1m7Uw) here!**\n\nFew-shot fine-tuning demo on unseen speakers:\n\nhttps:\u002F\u002Fgithub.com\u002FRVC-Boss\u002FGPT-SoVITS\u002Fassets\u002F129054828\u002F05bee1fa-bdd8-4d85-9350-80c060ab47fb\n\n**RTF (real-time factor, lower is faster) of GPT-SoVITS v2 ProPlus**:\n0.028 on an RTX 4060 Ti, 0.014 on an RTX 4090 (1,400 words, about 4 min of audio, synthesized in 3.36 s), and 0.526 on an M4 CPU. 
You can try our [Hugging Face demo](https:\u002F\u002Flj1995-gpt-sovits-proplus.hf.space\u002F) (running on half of an H200) to experience high-speed inference.\n\nPlease don't unfairly bash GPT-SoVITS for slow inference speed. Thank you!\n\nCPU-optimized inference version: https:\u002F\u002Fgithub.com\u002Fbaicai-1145\u002FGPT-SoVITS-CPUFast\n\n**User guide: [简体中文](https:\u002F\u002Fwww.yuque.com\u002Fbaicaigongchang1145haoyuangong\u002Fib3g1e) | [English](https:\u002F\u002Frentry.co\u002FGPT-SoVITS-guide#\u002F)**\n\n## Installation\n\nUsers in China can [click here](https:\u002F\u002Fwww.codewithgpu.com\u002Fi\u002FRVC-Boss\u002FGPT-SoVITS\u002FGPT-SoVITS-Official) to try the full functionality online via AutoDL Cloud Docker.\n\n### Tested Environments\n\n| Python Version | PyTorch Version  | Device        |\n| -------------- | ---------------- | ------------- |\n| Python 3.10    | PyTorch 2.5.1    | CUDA 12.4     |\n| Python 3.11    | PyTorch 2.5.1    | CUDA 12.4     |\n| Python 3.11    | PyTorch 2.7.0    | CUDA 12.8     |\n| Python 3.9     | PyTorch 2.8.0dev | CUDA 12.8     |\n| Python 3.9     | PyTorch 2.5.1    | Apple silicon |\n| Python 3.11    | PyTorch 2.7.0    | Apple silicon |\n| Python 3.9     | PyTorch 2.2.2    | CPU           |\n\n### Windows\n\nIf you are a Windows user (tested on Windows 10 and later), you can [download the integrated package](https:\u002F\u002Fhuggingface.co\u002Flj1995\u002FGPT-SoVITS-windows-package\u002Fresolve\u002Fmain\u002FGPT-SoVITS-v3lora-20250228.7z?download=true) and double-click _go-webui.bat_ to start GPT-SoVITS-WebUI.\n\n**Users in China can [download the package here](https:\u002F\u002Fwww.yuque.com\u002Fbaicaigongchang1145haoyuangong\u002Fib3g1e\u002Fdkxgpiy9zb96hob4#KTvnO).**\n\nInstall the program by running the following commands:\n\n```pwsh\nconda create -n GPTSoVits python=3.10\nconda activate GPTSoVits\npwsh -F install.ps1 --Device \u003CCU126|CU128|CPU> --Source \u003CHF|HF-Mirror|ModelScope> [--DownloadUVR5]\n```\n\n### Linux\n\n```bash\nconda create -n GPTSoVits 
python=3.10\nconda activate GPTSoVits\nbash install.sh --device \u003CCU126|CU128|ROCM|CPU> --source \u003CHF|HF-Mirror|ModelScope> [--download-uvr5]\n```\n\n### macOS\n\n**Note: Models trained with GPUs on Macs are of significantly lower quality than those trained on other devices, so CPUs are used for training for the time being.**\n\nInstall the program by running the following commands:\n\n```bash\nconda create -n GPTSoVits python=3.10\nconda activate GPTSoVits\nbash install.sh --device \u003CMPS|CPU> --source \u003CHF|HF-Mirror|ModelScope> [--download-uvr5]\n```\n\n### Install Manually\n\n#### Install Dependencies\n\n```bash\nconda create -n GPTSoVits python=3.10\nconda activate GPTSoVits\n\npip install -r extra-req.txt --no-deps\npip install -r requirements.txt\n```\n\n#### Install FFmpeg\n\n##### Conda Users\n\n```bash\nconda activate GPTSoVits\nconda install ffmpeg\n```\n\n##### Ubuntu\u002FDebian Users\n\n```bash\nsudo apt install ffmpeg\nsudo apt install libsox-dev\n```\n\n##### Windows Users\n\nDownload [ffmpeg.exe](https:\u002F\u002Fhuggingface.co\u002Flj1995\u002FVoiceConversionWebUI\u002Fblob\u002Fmain\u002Fffmpeg.exe) and [ffprobe.exe](https:\u002F\u002Fhuggingface.co\u002Flj1995\u002FVoiceConversionWebUI\u002Fblob\u002Fmain\u002Fffprobe.exe) and place them in the GPT-SoVITS root directory.\n\nInstall the [Microsoft Visual C++ Redistributable](https:\u002F\u002Faka.ms\u002Fvs\u002F17\u002Frelease\u002Fvc_redist.x86.exe).\n\n##### macOS Users\n\n```bash\nbrew install ffmpeg\n```\n\n### Running GPT-SoVITS with Docker\n\n#### Docker Image Selection\n\nBecause the codebase evolves rapidly and Docker images are released less frequently, please:\n\n- Check [Docker Hub](https:\u002F\u002Fhub.docker.com\u002Fr\u002Fxxxxrt666\u002Fgpt-sovits) for the latest available image tags\n- Choose an appropriate image tag for your environment\n- `Lite` means the Docker image **does not include** the ASR and UVR5 models. 
You can download the UVR5 models manually; the program downloads the ASR models automatically as needed\n- Docker Compose automatically pulls the image for your architecture (amd64\u002Farm64)\n- Docker Compose mounts **all files** in the current directory. Please switch to the project root directory and **pull the latest code** before using the Docker image\n- Optionally, build the image locally using the provided Dockerfile for the most up-to-date changes\n\n#### Environment Variables\n\n- `is_half`: Controls whether half-precision (fp16) is enabled. Set to `true` if your GPU supports it to reduce memory usage.\n\n#### Shared Memory Configuration\n\nOn Windows (Docker Desktop), the default shared memory size is small and may cause unexpected behavior. Increase `shm_size` (e.g., to `16g`) in your Docker Compose file based on your available system memory.\n\n#### Choosing a Service\n\nThe `docker-compose.yaml` defines four services in two variants:\n\n- `GPT-SoVITS-CU126` & `GPT-SoVITS-CU128`: Full version with all features.\n- `GPT-SoVITS-CU126-Lite` & `GPT-SoVITS-CU128-Lite`: Lightweight version with reduced dependencies and functionality.\n\nTo run a specific service with Docker Compose, use:\n\n```bash\ndocker compose run --service-ports \u003CGPT-SoVITS-CU126-Lite|GPT-SoVITS-CU128-Lite|GPT-SoVITS-CU126|GPT-SoVITS-CU128>\n```\n\n#### Building the Docker Image Locally\n\nIf you want to build the image yourself, use:\n\n```bash\nbash docker_build.sh --cuda \u003C12.6|12.8> [--lite]\n```\n\n#### Accessing the Running Container (Bash Shell)\n\nOnce the container is running in the background, you can access it using:\n\n```bash\ndocker exec -it \u003CGPT-SoVITS-CU126-Lite|GPT-SoVITS-CU128-Lite|GPT-SoVITS-CU126|GPT-SoVITS-CU128> bash\n```\n\n## Pretrained Models\n\n**If `install.sh` ran successfully, you may skip steps 1, 2, and 3.**\n\n**Users in China can [download all these models 
here](https:\u002F\u002Fwww.yuque.com\u002Fbaicaigongchang1145haoyuangong\u002Fib3g1e\u002Fdkxgpiy9zb96hob4#nVNhX).**\n\n1. Download pretrained models from [GPT-SoVITS Models](https:\u002F\u002Fhuggingface.co\u002Flj1995\u002FGPT-SoVITS) and place them in `GPT_SoVITS\u002Fpretrained_models`.\n\n2. (Chinese TTS only) Download the G2PW models from [G2PWModel.zip (HF)](https:\u002F\u002Fhuggingface.co\u002FXXXXRT\u002FGPT-SoVITS-Pretrained\u002Fresolve\u002Fmain\u002FG2PWModel.zip) | [G2PWModel.zip (ModelScope)](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002FXXXXRT\u002FGPT-SoVITS-Pretrained\u002Fresolve\u002Fmaster\u002FG2PWModel.zip), unzip the archive, rename the folder to `G2PWModel`, and place it in `GPT_SoVITS\u002Ftext`.\n\n3. (Optional) For UVR5 (vocal\u002Faccompaniment separation and reverberation removal), download models from [UVR5 Weights](https:\u002F\u002Fhuggingface.co\u002Flj1995\u002FVoiceConversionWebUI\u002Ftree\u002Fmain\u002Fuvr5_weights) and place them in `tools\u002Fuvr5\u002Fuvr5_weights`.\n\n   - To use `bs_roformer` or `mel_band_roformer` models for UVR5, manually download the model and its configuration file and put them in `tools\u002Fuvr5\u002Fuvr5_weights`. **Rename the model file and configuration file so that their names are identical except for the suffix.** In addition, both file names **must include `roformer`** to be recognized as roformer-class models.\n\n   - It is recommended to **specify the model type directly** in the model and configuration file names, such as `mel_band_roformer` or `bs_roformer`. If it is not specified, the type is determined by comparing features in the configuration file. 
For example, the model `bs_roformer_ep_368_sdr_12.9628.ckpt` and its configuration file `bs_roformer_ep_368_sdr_12.9628.yaml` form a pair, as do `kim_mel_band_roformer.ckpt` and `kim_mel_band_roformer.yaml`.\n\n4. (Optional) For Chinese ASR, download models from [Damo ASR Model](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002Fdamo\u002Fspeech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch\u002Ffiles), [Damo VAD Model](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002Fdamo\u002Fspeech_fsmn_vad_zh-cn-16k-common-pytorch\u002Ffiles), and [Damo Punc Model](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002Fdamo\u002Fpunc_ct-transformer_zh-cn-common-vocab272727-pytorch\u002Ffiles) and place them in `tools\u002Fasr\u002Fmodels`.\n\n5. (Optional) For English or Japanese ASR, download models from [Faster Whisper Large V3](https:\u002F\u002Fhuggingface.co\u002FSystran\u002Ffaster-whisper-large-v3) and place them in `tools\u002Fasr\u002Fmodels`. [Other models](https:\u002F\u002Fhuggingface.co\u002FSystran) may work similarly with a smaller disk footprint.\n\n## Dataset Format\n\nThe TTS annotation .list file format:\n\n```\n\nvocal_path|speaker_name|language|text\n\n```\n\nLanguage dictionary:\n\n- 'zh': Chinese\n- 'ja': Japanese\n- 'en': English\n- 'ko': Korean\n- 'yue': Cantonese\n\nExample:\n\n```\n\nD:\\GPT-SoVITS\\xxx\u002Fxxx.wav|xxx|en|I like playing Genshin.\n\n```\n\n## Finetune and inference\n\n### Open WebUI\n\n#### Integrated Package Users\n\nDouble-click `go-webui.bat` or use `go-webui.ps1`.\n\nTo switch to V1, double-click `go-webui-v1.bat` or use `go-webui-v1.ps1` instead.\n\n#### Others\n\n```bash\npython webui.py \u003Clanguage(optional)>\n```\n\nTo switch to V1, run:\n\n```bash\npython webui.py v1 \u003Clanguage(optional)>\n```\n\nOr switch versions manually in the WebUI.\n\n### Finetune\n\n#### Path Auto-filling is now supported\n\n1. Fill in the audio path\n2. 
Slice the audio into small chunks\n3. Denoise (optional)\n4. Run ASR\n5. Proofread the ASR transcriptions\n6. Go to the next tab, then fine-tune the model\n\n### Open Inference WebUI\n\n#### Integrated Package Users\n\nDouble-click `go-webui-v2.bat` or use `go-webui-v2.ps1`, then open the inference WebUI at `1-GPT-SoVITS-TTS\u002F1C-inference`\n\n#### Others\n\n```bash\npython GPT_SoVITS\u002Finference_webui.py \u003Clanguage(optional)>\n```\n\nOR\n\n```bash\npython webui.py\n```\n\nthen open the inference WebUI at `1-GPT-SoVITS-TTS\u002F1C-inference`\n\n## V2 Release Notes\n\nNew Features:\n\n1. Support for Korean and Cantonese\n\n2. An optimized text frontend\n\n3. Pretraining data expanded from 2k hours to 5k hours\n\n4. Improved synthesis quality for low-quality reference audio\n\n   [more details](\u003Chttps:\u002F\u002Fgithub.com\u002FRVC-Boss\u002FGPT-SoVITS\u002Fwiki\u002FGPT%E2%80%90SoVITS%E2%80%90v2%E2%80%90features-(%E6%96%B0%E7%89%B9%E6%80%A7)>)\n\nTo use v2 from a v1 environment:\n\n1. Run `pip install -r requirements.txt` to update some packages.\n\n2. Pull the latest code from GitHub.\n\n3. Download the v2 pretrained models from [huggingface](https:\u002F\u002Fhuggingface.co\u002Flj1995\u002FGPT-SoVITS\u002Ftree\u002Fmain\u002Fgsv-v2final-pretrained) and put them into `GPT_SoVITS\u002Fpretrained_models\u002Fgsv-v2final-pretrained`.\n\n   Additionally, for Chinese v2: download the G2PW models from [G2PWModel.zip (HF)](https:\u002F\u002Fhuggingface.co\u002FXXXXRT\u002FGPT-SoVITS-Pretrained\u002Fresolve\u002Fmain\u002FG2PWModel.zip) | [G2PWModel.zip (ModelScope)](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002FXXXXRT\u002FGPT-SoVITS-Pretrained\u002Fresolve\u002Fmaster\u002FG2PWModel.zip), unzip the archive, rename the folder to `G2PWModel`, and place it in `GPT_SoVITS\u002Ftext`.\n\n## V3 Release Notes\n\nNew Features:\n\n1. 
Higher timbre similarity, requiring less training data to approximate the target speaker (timbre similarity is significantly improved even when using the base model directly, without fine-tuning).\n\n2. The GPT model is more stable, with fewer repetitions and omissions, and generates speech with richer emotional expression more easily.\n\n   [more details](\u003Chttps:\u002F\u002Fgithub.com\u002FRVC-Boss\u002FGPT-SoVITS\u002Fwiki\u002FGPT%E2%80%90SoVITS%E2%80%90v3v4%E2%80%90features-(%E6%96%B0%E7%89%B9%E6%80%A7)>)\n\nTo use v3 from a v2 environment:\n\n1. Run `pip install -r requirements.txt` to update some packages.\n\n2. Pull the latest code from GitHub.\n\n3. Download the v3 pretrained models (s1v3.ckpt, s2Gv3.pth, and the models--nvidia--bigvgan_v2_24khz_100band_256x folder) from [huggingface](https:\u002F\u002Fhuggingface.co\u002Flj1995\u002FGPT-SoVITS\u002Ftree\u002Fmain) and put them into `GPT_SoVITS\u002Fpretrained_models`.\n\n   Additionally, for the audio super-resolution model, see [how to download](.\u002Ftools\u002FAP_BWE_main\u002F24kto48k\u002Freadme.txt).\n\n## V4 Release Notes\n\nNew Features:\n\n1. Version 4 fixes the metallic artifacts in Version 3 caused by upsampling by a non-integer factor, and natively outputs 48 kHz audio to prevent muffled sound (Version 3 natively outputs only 24 kHz audio). The author considers Version 4 a direct replacement for Version 3, though further testing is still needed.\n   [more details](\u003Chttps:\u002F\u002Fgithub.com\u002FRVC-Boss\u002FGPT-SoVITS\u002Fwiki\u002FGPT%E2%80%90SoVITS%E2%80%90v3v4%E2%80%90features-(%E6%96%B0%E7%89%B9%E6%80%A7)>)\n\nTo use v4 from a v1\u002Fv2\u002Fv3 environment:\n\n1. Run `pip install -r requirements.txt` to update some packages.\n\n2. Pull the latest code from GitHub.\n\n3. 
Download the v4 pretrained models (gsv-v4-pretrained\u002Fs2v4.pth and gsv-v4-pretrained\u002Fvocoder.pth) from [huggingface](https:\u002F\u002Fhuggingface.co\u002Flj1995\u002FGPT-SoVITS\u002Ftree\u002Fmain) and put them into `GPT_SoVITS\u002Fpretrained_models`.\n\n## V2Pro Release Notes\n\nNew Features:\n\n1. Surpasses v4's performance while keeping v2's hardware cost and speed, with only slightly higher VRAM usage than v2.\n   [more details](\u003Chttps:\u002F\u002Fgithub.com\u002FRVC-Boss\u002FGPT-SoVITS\u002Fwiki\u002FGPT%E2%80%90SoVITS%E2%80%90features-(%E5%90%84%E7%89%88%E6%9C%AC%E7%89%B9%E6%80%A7)>)\n\n2. v1, v2, and the v2Pro series share similar characteristics, as do v3 and v4. For training sets of average audio quality, v1\u002Fv2\u002Fv2Pro can deliver decent results, but v3\u002Fv4 cannot. Additionally, the tone and timbre synthesized by v3\u002Fv4 lean more toward the reference audio than toward the overall training set.\n\nTo use v2Pro from a v1\u002Fv2\u002Fv3\u002Fv4 environment:\n\n1. Run `pip install -r requirements.txt` to update some packages.\n\n2. Pull the latest code from GitHub.\n\n3. 
Download the v2Pro pretrained models (v2Pro\u002Fs2Dv2Pro.pth, v2Pro\u002Fs2Gv2Pro.pth, v2Pro\u002Fs2Dv2ProPlus.pth, v2Pro\u002Fs2Gv2ProPlus.pth, and sv\u002Fpretrained_eres2netv2w24s4ep4.ckpt) from [huggingface](https:\u002F\u002Fhuggingface.co\u002Flj1995\u002FGPT-SoVITS\u002Ftree\u002Fmain) and put them into `GPT_SoVITS\u002Fpretrained_models`.\n\n## Todo List\n\n- [x] **High Priority:**\n\n  - [x] Localization in Japanese and English.\n  - [x] User guide.\n  - [x] Fine-tuning on Japanese and English datasets.\n\n- [ ] **Features:**\n  - [x] Zero-shot voice conversion (5s) \u002F few-shot voice conversion (1min).\n  - [x] TTS speaking speed control.\n  - [ ] ~~Enhanced TTS emotion control.~~ Maybe use pretrained, fine-tuned preset GPT models for better emotion.\n  - [ ] Experiment with changing SoVITS token inputs to the probability distribution over the GPT vocabulary (transformer latent).\n  - [x] Improve the English and Japanese text frontend.\n  - [ ] Develop tiny and larger-sized TTS models.\n  - [x] Colab scripts.\n  - [x] Try expanding the training dataset (2k hours -> 10k hours).\n  - [x] Better SoVITS base model (enhanced audio quality).\n  - [ ] Model mixing.\n\n## (Additional) Method for running from the command line\n\nOpen the UVR5 WebUI from the command line:\n\n```bash\npython tools\u002Fuvr5\u002Fwebui.py \"\u003Cinfer_device>\" \u003Cis_half> \u003Cwebui_port_uvr5>\n```\n\n\u003C!-- If you can't open a browser, follow the format below for UVR processing,This is using mdxnet for audio processing\n```\npython mdxnet.py --model --input_root --output_vocal --output_ins --agg_level --format --device --is_half_precision\n``` -->\n\nThis is how dataset audio segmentation is done from the command line:\n\n```bash\npython audio_slicer.py \\\n    --input_path \"\u003Cpath_to_original_audio_file_or_directory>\" \\\n    --output_root \"\u003Cdirectory_where_subdivided_audio_clips_will_be_saved>\" \\\n    --threshold \u003Cvolume_threshold> \\\n    --min_length 
\u003Cminimum_duration_of_each_subclip> \\\n    --min_interval \u003Cshortest_time_gap_between_adjacent_subclips> \\\n    --hop_size \u003Cstep_size_for_computing_volume_curve>\n```\n\nThis is how dataset ASR processing is done from the command line (Chinese only):\n\n```bash\npython tools\u002Fasr\u002Ffunasr_asr.py -i \u003Cinput> -o \u003Coutput>\n```\n\nFor languages other than Chinese, ASR labeling is performed with Faster Whisper.\n\n(No progress bar is shown, and processing may take a while depending on GPU performance.)\n\n```bash\npython .\u002Ftools\u002Fasr\u002Ffasterwhisper_asr.py -i \u003Cinput> -o \u003Coutput> -l \u003Clanguage> -p \u003Cprecision>\n```\n\nA custom save path for the output .list file is supported.\n\n## Credits\n\nSpecial thanks to the following projects and contributors:\n\n### Theoretical Research\n\n- [ar-vits](https:\u002F\u002Fgithub.com\u002Finnnky\u002Far-vits)\n- [SoundStorm](https:\u002F\u002Fgithub.com\u002Fyangdongchao\u002FSoundStorm\u002Ftree\u002Fmaster\u002Fsoundstorm\u002Fs1\u002FAR)\n- [vits](https:\u002F\u002Fgithub.com\u002Fjaywalnut310\u002Fvits)\n- [TransferTTS](https:\u002F\u002Fgithub.com\u002Fhcy71o\u002FTransferTTS\u002Fblob\u002Fmaster\u002Fmodels.py#L556)\n- [contentvec](https:\u002F\u002Fgithub.com\u002Fauspicious3000\u002Fcontentvec\u002F)\n- [hifi-gan](https:\u002F\u002Fgithub.com\u002Fjik876\u002Fhifi-gan)\n- [fish-speech](https:\u002F\u002Fgithub.com\u002Ffishaudio\u002Ffish-speech\u002Fblob\u002Fmain\u002Ftools\u002Fllama\u002Fgenerate.py#L41)\n- [f5-TTS](https:\u002F\u002Fgithub.com\u002FSWivid\u002FF5-TTS\u002Fblob\u002Fmain\u002Fsrc\u002Ff5_tts\u002Fmodel\u002Fbackbones\u002Fdit.py)\n- [shortcut flow matching](https:\u002F\u002Fgithub.com\u002Fkvfrans\u002Fshortcut-models\u002Fblob\u002Fmain\u002Ftargets_shortcut.py)\n\n### Pretrained Models\n\n- [Chinese Speech Pretrain](https:\u002F\u002Fgithub.com\u002FTencentGameMate\u002Fchinese_speech_pretrain)\n- 
[Chinese-Roberta-WWM-Ext-Large](https:\u002F\u002Fhuggingface.co\u002Fhfl\u002Fchinese-roberta-wwm-ext-large)\n- [BigVGAN](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FBigVGAN)\n- [eres2netv2](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002Fiic\u002Fspeech_eres2netv2w24s4ep4_sv_zh-cn_16k-common)\n\n### Text Frontend for Inference\n\n- [paddlespeech zh_normalization](https:\u002F\u002Fgithub.com\u002FPaddlePaddle\u002FPaddleSpeech\u002Ftree\u002Fdevelop\u002Fpaddlespeech\u002Ft2s\u002Ffrontend\u002Fzh_normalization)\n- [split-lang](https:\u002F\u002Fgithub.com\u002FDoodleBears\u002Fsplit-lang)\n- [g2pW](https:\u002F\u002Fgithub.com\u002FGitYCC\u002Fg2pW)\n- [pypinyin-g2pW](https:\u002F\u002Fgithub.com\u002Fmozillazg\u002Fpypinyin-g2pW)\n- [paddlespeech g2pw](https:\u002F\u002Fgithub.com\u002FPaddlePaddle\u002FPaddleSpeech\u002Ftree\u002Fdevelop\u002Fpaddlespeech\u002Ft2s\u002Ffrontend\u002Fg2pw)\n\n### WebUI Tools\n\n- [ultimatevocalremovergui](https:\u002F\u002Fgithub.com\u002FAnjok07\u002Fultimatevocalremovergui)\n- [audio-slicer](https:\u002F\u002Fgithub.com\u002Fopenvpi\u002Faudio-slicer)\n- [SubFix](https:\u002F\u002Fgithub.com\u002Fcronrpc\u002FSubFix)\n- [FFmpeg](https:\u002F\u002Fgithub.com\u002FFFmpeg\u002FFFmpeg)\n- [gradio](https:\u002F\u002Fgithub.com\u002Fgradio-app\u002Fgradio)\n- [faster-whisper](https:\u002F\u002Fgithub.com\u002FSYSTRAN\u002Ffaster-whisper)\n- [FunASR](https:\u002F\u002Fgithub.com\u002Falibaba-damo-academy\u002FFunASR)\n- [AP-BWE](https:\u002F\u002Fgithub.com\u002Fyxlu-0102\u002FAP-BWE)\n\nThanks to @Naozumi520 for providing the Cantonese training set and for guidance on Cantonese-related knowledge.\n\n## Thanks to all contributors for their efforts\n\n\u003Ca href=\"https:\u002F\u002Fgithub.com\u002FRVC-Boss\u002FGPT-SoVITS\u002Fgraphs\u002Fcontributors\" target=\"_blank\">\n  \u003Cimg src=\"https:\u002F\u002Fcontrib.rocks\u002Fimage?repo=RVC-Boss\u002FGPT-SoVITS\" \u002F>\n\u003C\u002Fa>\n","GPT-SoVITS 
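is a powerful WebUI for few-shot voice conversion and text-to-speech. It lets users train a high-quality TTS model from just one minute of voice data and supports zero-shot and few-shot speech synthesis, enabling voice cloning with high similarity and realism. It also offers cross-lingual support for English, Japanese, Korean, Cantonese, and Chinese. Built-in utilities such as vocal\u002Faccompaniment separation and automatic training set segmentation make it much easier for beginners to build their own training datasets and models. It is suited to quickly creating personalized voice assistants or producing multilingual content.",2,"2026-06-11 02:37:36","top_all"]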