[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-72235":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":24,"defaultBranch":25,"hasWiki":23,"hasPages":24,"topics":26,"createdAt":10,"pushedAt":10,"updatedAt":29,"readmeContent":30,"aiSummary":31,"trendingCount":16,"starSnapshotCount":16,"syncStatus":32,"lastSyncTime":33,"discoverSource":34},72235,"seed-vc","Plachtaa\u002Fseed-vc","Plachtaa","zero-shot voice conversion & singing voice conversion, with real-time support","",null,"Python",3795,492,1,91,0,11,25,82,33,30.08,"GNU General Public License v3.0",true,false,"main",[27,28],"singing-voice-conversion","voice-conversion","2026-06-12 02:03:00","# Seed-VC  \n[![Hugging Face](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🤗%20Hugging%20Face-Demo-blue)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FPlachta\u002FSeed-VC)  [![arXiv](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2411.09943-\u003CCOLOR>.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.09943)\n\n*English | [简体中文](README-ZH.md) | [日本語](README-JA.md)*  \n\n[real-time-demo.webm](https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F86325c5e-f7f6-4a04-8695-97275a5d046c)\n\nCurrently released model supports *zero-shot voice conversion* 🔊 , *zero-shot real-time voice conversion* 🗣️ and *zero-shot singing voice conversion* 🎶. Without any training, it is able to clone a voice given a reference speech of 1~30 seconds.  
\n\nWe support further fine-tuning on custom data to improve performance for one or more specific speakers, with an extremely low data requirement **(minimum 1 utterance per speaker)** and extremely fast training speed **(minimum 100 steps, 2 min on T4)**!\n\n**Real-time voice conversion** is supported, with an algorithm delay of ~300ms and a device-side delay of ~100ms, suitable for online meetings, gaming and live streaming.\n\nFor a list of demos and comparisons with previous voice conversion models, please visit our [demo page](https:\u002F\u002Fplachtaa.github.io\u002Fseed-vc\u002F)🌐  and [Evaluation](EVAL.md)📊.\n\nWe keep improving the model quality and adding more features.\n\n## Evaluation📊\nSee [EVAL.md](EVAL.md) for objective evaluation results and comparisons with other baselines.\n## Installation📥\nPython 3.10 is recommended, on Windows, Mac M Series (Apple Silicon) or Linux.\nWindows and Linux:\n```bash\npip install -r requirements.txt\n```\n\nMac M Series:\n```bash\npip install -r requirements-mac.txt\n```\n\nWindows users may consider installing `triton-windows` to enable the `--compile` option, which speeds up V2 models:\n```bash\npip install triton-windows==3.2.0.post13\n```\n\n## Usage🛠️\nWe have released 4 models for different purposes:\n\n| Version | Name                                                                                                                                                                                                                       | Purpose                        | Sampling Rate | Content Encoder                                                        | Vocoder | Hidden Dim | N Layers | Params             | Remarks                                                
|\n|---------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------|---------------|------------------------------------------------------------------------|---------|------------|----------|--------------------|--------------------------------------------------------|\n| v1.0    | seed-uvit-tat-xlsr-tiny ([🤗](https:\u002F\u002Fhuggingface.co\u002FPlachta\u002FSeed-VC\u002Fblob\u002Fmain\u002FDiT_uvit_tat_xlsr_ema.pth)[📄](configs\u002Fpresets\u002Fconfig_dit_mel_seed_uvit_xlsr_tiny.yml))                                                     | Voice Conversion (VC)          | 22050         | XLSR-large                                                             | HIFT    | 384        | 9        | 25M                | suitable for real-time voice conversion                |\n| v1.0    | seed-uvit-whisper-small-wavenet ([🤗](https:\u002F\u002Fhuggingface.co\u002FPlachta\u002FSeed-VC\u002Fblob\u002Fmain\u002FDiT_seed_v2_uvit_whisper_small_wavenet_bigvgan_pruned.pth)[📄](configs\u002Fpresets\u002Fconfig_dit_mel_seed_uvit_whisper_small_wavenet.yml)) | Voice Conversion (VC)          | 22050         | Whisper-small                                                          | BigVGAN | 512        | 13       | 98M                | suitable for offline voice conversion                  |\n| v1.0    | seed-uvit-whisper-base ([🤗](https:\u002F\u002Fhuggingface.co\u002FPlachta\u002FSeed-VC\u002Fblob\u002Fmain\u002FDiT_seed_v2_uvit_whisper_base_f0_44k_bigvgan_pruned_ft_ema.pth)[📄](configs\u002Fpresets\u002Fconfig_dit_mel_seed_uvit_whisper_base_f0_44k.yml))       | Singing Voice Conversion (SVC) | 44100         | Whisper-small                                                          | BigVGAN | 768        | 17       | 200M               | strong zero-shot performance, singing voice 
conversion |\n| v2.0    | hubert-bsqvae-small ([🤗](https:\u002F\u002Fhuggingface.co\u002FPlachta\u002FSeed-VC\u002Fblob\u002Fmain\u002Fv2)[📄](configs\u002Fv2\u002Fvc_wrapper.yaml))                                                                                                            | Voice & Accent Conversion (VC) | 22050         | [ASTRAL-Quantization](https:\u002F\u002Fgithub.com\u002FPlachtaa\u002FASTRAL-quantization) | BigVGAN | 512        | 13       | 67M(CFM) + 90M(AR) | Best in suppressing source speaker traits              |\n\nCheckpoints of the latest model release are downloaded automatically the first time you run inference.  \nIf you cannot access Hugging Face for network reasons, try a mirror by adding `HF_ENDPOINT=https:\u002F\u002Fhf-mirror.com` before every command.\n\nCommand line inference:\n```bash\npython inference.py --source \u003Csource-wav>\n--target \u003Creference-wav>\n--output \u003Coutput-dir>\n--diffusion-steps 25 # recommended 30~50 for singing voice conversion\n--length-adjust 1.0\n--inference-cfg-rate 0.7\n--f0-condition False # set to True for singing voice conversion\n--auto-f0-adjust False # set to True to auto adjust source pitch to target pitch level, normally not used in singing voice conversion\n--semi-tone-shift 0 # pitch shift in semitones for singing voice conversion\n--checkpoint \u003Cpath-to-checkpoint>\n--config \u003Cpath-to-config>\n--fp16 True\n```\nwhere:\n- `source` is the path to the speech file to convert to the reference voice\n- `target` is the path to the speech file used as the voice reference\n- `output` is the path to the output directory\n- `diffusion-steps` is the number of diffusion steps to use, default is 25, use 30-50 for best quality, use 4-10 for fastest inference\n- `length-adjust` is the length adjustment factor, default is 1.0, set \u003C1.0 to speed up speech, >1.0 to slow it down\n- `inference-cfg-rate` has a subtle effect on the output, default is 0.7 \n- `f0-condition` is the flag to 
condition the pitch of the output on the pitch of the source audio, default is False, set to True for singing voice conversion  \n- `auto-f0-adjust` is the flag to auto adjust source pitch to target pitch level, default is False, normally not used in singing voice conversion\n- `semi-tone-shift` is the pitch shift in semitones for singing voice conversion, default is 0  \n- `checkpoint` is the path to the model checkpoint if you have trained or fine-tuned your own model, leave blank to auto-download the default model from Hugging Face (`seed-uvit-whisper-small-wavenet` if `f0-condition` is `False` else `seed-uvit-whisper-base`)\n- `config` is the path to the model config if you have trained or fine-tuned your own model, leave blank to auto-download the default config from Hugging Face  \n- `fp16` is the flag to use float16 inference, default is True\n\nSimilarly, to use the V2 model, you can run:\n```bash\npython inference_v2.py --source \u003Csource-wav>\n--target \u003Creference-wav>\n--output \u003Coutput-dir>\n--diffusion-steps 25 # recommended 30~50 for singing voice conversion\n--length-adjust 1.0 # same as V1\n--intelligibility-cfg-rate 0.7 # controls how clear the output linguistic content is, recommended 0.0~1.0\n--similarity-cfg-rate 0.7 # controls how similar the output voice is to the reference voice, recommended 0.0~1.0\n--convert-style true # whether to use the AR model for accent & emotion conversion, setting to false conducts only timbre conversion, similar to V1\n--anonymization-only false # setting to true ignores the reference audio and only anonymizes source speech to an \"average voice\"\n--top-p 0.9 # controls the diversity of the AR model output, recommended 0.5~1.0\n--temperature 1.0 # controls the randomness of the AR model output, recommended 0.7~1.2\n--repetition-penalty 1.0 # penalizes repetition in the AR model output, recommended 1.0~1.5\n--cfm-checkpoint-path \u003Cpath-to-cfm-checkpoint> # path to the checkpoint of the CFM model, leave blank to 
auto-download the default model from Hugging Face\n--ar-checkpoint-path \u003Cpath-to-ar-checkpoint> # path to the checkpoint of the AR model, leave blank to auto-download the default model from Hugging Face\n```\n\nVoice Conversion Web UI:\n```bash\npython app_vc.py --checkpoint \u003Cpath-to-checkpoint> --config \u003Cpath-to-config> --fp16 True\n```\n- `checkpoint` is the path to the model checkpoint if you have trained or fine-tuned your own model, leave blank to auto-download the default model from Hugging Face (`seed-uvit-whisper-small-wavenet`)\n- `config` is the path to the model config if you have trained or fine-tuned your own model, leave blank to auto-download the default config from Hugging Face  \n\nThen open the browser and go to `http:\u002F\u002Flocalhost:7860\u002F` to use the web interface.\n\nSinging Voice Conversion Web UI:\n```bash\npython app_svc.py --checkpoint \u003Cpath-to-checkpoint> --config \u003Cpath-to-config> --fp16 True\n```\n- `checkpoint` is the path to the model checkpoint if you have trained or fine-tuned your own model, leave blank to auto-download the default model from Hugging Face (`seed-uvit-whisper-base`)\n- `config` is the path to the model config if you have trained or fine-tuned your own model, leave blank to auto-download the default config from Hugging Face  \n\nV2 model Web UI:\n```bash\npython app_vc_v2.py --cfm-checkpoint-path \u003Cpath-to-cfm-checkpoint> --ar-checkpoint-path \u003Cpath-to-ar-checkpoint>\n```\n- `cfm-checkpoint-path` is the path to the checkpoint of the CFM model, leave blank to auto-download the default model from Hugging Face\n- `ar-checkpoint-path` is the path to the checkpoint of the AR model, leave blank to auto-download the default model from Hugging Face\n- you may consider adding `--compile` to gain a ~6x speed-up on AR model inference  \n\nIntegrated Web UI:\n```bash\npython app.py --enable-v1 --enable-v2\n```\nThis will only load pretrained models for zero-shot inference. 
To use custom checkpoints, please run `app_vc.py` or `app_svc.py` as above.  \nIf you have limited memory, remove `--enable-v2` or `--enable-v1` to load only one of the model sets.\n\nReal-time voice conversion GUI:\n```bash\npython real-time-gui.py --checkpoint-path \u003Cpath-to-checkpoint> --config-path \u003Cpath-to-config>\n```\n- `checkpoint-path` is the path to the model checkpoint if you have trained or fine-tuned your own model, leave blank to auto-download the default model from Hugging Face (`seed-uvit-tat-xlsr-tiny`)\n- `config-path` is the path to the model config if you have trained or fine-tuned your own model, leave blank to auto-download the default config from Hugging Face  \n\n> [!IMPORTANT]\n> It is strongly recommended to use a GPU for real-time voice conversion.\n> Some performance testing has been done on an NVIDIA RTX 3060 Laptop GPU. Results and recommended parameter settings are listed below:\n\n| Model Configuration             | Diffusion Steps | Inference CFG Rate | Max Prompt Length | Block Time (s) | Crossfade Length (s) | Extra context (left) (s) | Extra context (right) (s) | Latency (ms) | Inference Time per Chunk (ms) |\n|---------------------------------|-----------------|--------------------|-------------------|----------------|----------------------|--------------------------|---------------------------|--------------|-------------------------------| \n| seed-uvit-xlsr-tiny             | 10              | 0.7                | 3.0               | 0.18s          | 0.04s                | 2.5s                     | 0.02s                     | 430ms        | 150ms                         |\n\nYou can adjust the parameters in the GUI according to your own device performance. The voice conversion stream should work well as long as Inference Time is less than Block Time.  \nNote that inference speed may drop if you are running other GPU-intensive tasks (e.g. 
gaming, watching videos)  \n\nExplanations for real-time voice conversion GUI parameters:\n- `Diffusion Steps` is the number of diffusion steps to use, in the real-time case usually set to 4~10 for fastest inference;\n- `Inference CFG Rate` has a subtle effect on the output, default is 0.7, setting it to 0.0 gives about a 1.5x speed-up;\n- `Max Prompt Length` is the maximum length of the prompt audio, setting it to a low value can speed up inference, but may reduce similarity to the prompt speech;\n- `Block Time` is the time length of each audio chunk for inference, the higher the value, the higher the latency, note this value must be greater than the inference time per block, set it according to your hardware;\n- `Crossfade Length` is the time length of the crossfade between audio chunks, normally no change is needed;\n- `Extra context (left)` is the time length of extra history context for inference, the higher the value, the higher the inference time, but it can increase stability;\n- `Extra context (right)` is the time length of extra future context for inference, the higher the value, the higher the inference time and latency, but it can increase stability.\n\nThe algorithm delay is approximately `Block Time * 2 + Extra context (right)`, and the device-side delay is usually ~100ms. The overall delay is the sum of the two.\n\nYou may wish to use [VB-CABLE](https:\u002F\u002Fvb-audio.com\u002FCable\u002F) to route audio from the GUI output stream to a virtual microphone.  \n\n*(GUI and audio chunking logic are modified from [RVC](https:\u002F\u002Fgithub.com\u002FRVC-Project\u002FRetrieval-based-Voice-Conversion-WebUI), thanks for their brilliant implementation!)*\n\n## Training🏋️\nFine-tuning on custom data allows the model to clone someone's voice more accurately. It will largely improve speaker similarity on particular speakers, but may slightly increase WER.  
\nA Colab Tutorial is here for you to follow: [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1R1BJTqMsTXZzYAVx3j1BiemFXog9pbQG?usp=sharing)\n1. Prepare your own dataset. It has to satisfy the following:\n    - File structure does not matter\n    - Each audio file should be 1 to 30 seconds long, otherwise it will be ignored\n    - All audio files should be in one of the following formats: `.wav` `.flac` `.mp3` `.m4a` `.opus` `.ogg`\n    - Speaker labels are not required, but make sure that each speaker has at least 1 utterance\n    - Of course, the more data you have, the better the model will perform\n    - Training data should be as clean as possible, background music or noise is undesirable\n2. Choose a model configuration file from `configs\u002Fpresets\u002F` for fine-tuning, or create your own to train from scratch.\n    - For fine-tuning, it should be one of the following:\n        - `.\u002Fconfigs\u002Fpresets\u002Fconfig_dit_mel_seed_uvit_xlsr_tiny.yml` for real-time voice conversion\n        - `.\u002Fconfigs\u002Fpresets\u002Fconfig_dit_mel_seed_uvit_whisper_small_wavenet.yml` for offline voice conversion\n        - `.\u002Fconfigs\u002Fpresets\u002Fconfig_dit_mel_seed_uvit_whisper_base_f0_44k.yml` for singing voice conversion\n3. 
Run the following command to start training:\n```bash\npython train.py \n--config \u003Cpath-to-config> \n--dataset-dir \u003Cpath-to-data>\n--run-name \u003Crun-name>\n--batch-size 2\n--max-steps 1000\n--max-epochs 1000\n--save-every 500\n--num-workers 0\n```\nwhere:\n- `config` is the path to the model config, choose one of the above for fine-tuning or create your own for training from scratch\n- `dataset-dir` is the path to the dataset directory, which should be a folder containing all the audio files\n- `run-name` is the name of the run, which will be used to save the model checkpoints and logs\n- `batch-size` is the batch size for training, choose it based on your GPU memory\n- `max-steps` is the maximum number of steps to train, choose it based on your dataset size and training time\n- `max-epochs` is the maximum number of epochs to train, choose it based on your dataset size and training time\n- `save-every` is the interval, in steps, between model checkpoint saves\n- `num-workers` is the number of workers for data loading, set to 0 for Windows    \n\nSimilarly, to train the V2 model, you can run: (note that the V2 training script supports multi-GPU training)\n```bash\naccelerate launch train_v2.py \n--dataset-dir \u003Cpath-to-data>\n--run-name \u003Crun-name>\n--batch-size 2\n--max-steps 1000\n--max-epochs 1000\n--save-every 500\n--num-workers 0\n--train-cfm\n```\n\n4. If training accidentally stops, you can resume it by running the same command again, and training will continue from the last checkpoint. (Make sure the `run-name` and `config` arguments are the same so that the latest checkpoint can be found)\n\n5. 
After training, you can use the trained model for inference by specifying the path to the checkpoint and config file.\n    - They should be under `.\u002Fruns\u002F\u003Crun-name>\u002F`, with the checkpoint named `ft_model.pth` and config file with the same name as the training config file.\n    - You still have to specify a reference audio file of the speaker you'd like to use during inference, similar to zero-shot usage.\n\n## TODO📝\n- [x] Release code\n- [x] Release pretrained models: [![Hugging Face](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🤗%20Hugging%20Face-SeedVC-blue)](https:\u002F\u002Fhuggingface.co\u002FPlachta\u002FSeed-VC)\n- [x] Huggingface space demo: [![Hugging Face](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🤗%20Hugging%20Face-Space-blue)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FPlachta\u002FSeed-VC)\n- [x] HTML demo page: [Demo](https:\u002F\u002Fplachtaa.github.io\u002Fseed-vc\u002F)\n- [x] Streaming inference\n- [x] Reduce streaming inference latency\n- [x] Demo video for real-time voice conversion\n- [x] Singing voice conversion\n- [x] Noise resiliency for source audio\n- [ ] Potential architecture improvements\n    - [x] U-ViT style skip connections\n    - [x] Changed input to OpenAI Whisper\n    - [x] Time as Token\n- [x] Code for training on custom data\n- [x] Few-shot\u002FOne-shot speaker fine-tuning\n- [x] Changed to BigVGAN from NVIDIA for singing voice decoding\n- [x] Whisper version model for singing voice conversion\n- [x] Objective evaluation and comparison with RVC\u002FSoVITS for singing voice conversion\n- [x] Improve audio quality\n- [ ] NSF vocoder for better singing voice conversion\n- [x] Fix real-time voice conversion artifact while not talking (done by adding a VAD model)\n- [x] Colab Notebook for fine-tuning example\n- [x] Replace whisper with more advanced linguistic content extractor\n- [ ] More to be added\n- [x] Add Apple Silicon support\n- [ ] Release paper, evaluations and demo page for V2 
model\n\n## Known Issues\n- On Mac, running `real-time-gui.py` might raise the error `ModuleNotFoundError: No module named '_tkinter'`. In this case, a new Python version **with Tkinter support** should be installed. Refer to [this guide on Stack Overflow](https:\u002F\u002Fstackoverflow.com\u002Fquestions\u002F76105218\u002Fwhy-does-tkinter-or-turtle-seem-to-be-missing-or-broken-shouldnt-it-be-part) for an explanation of the problem and a detailed fix.\n\n\n## CHANGELOGS🗒️\n- 2025-04-16:\n    - Released V2 model for voice and accent conversion, with better anonymization of source speaker\n- 2025-03-03:\n    - Added Mac M Series (Apple Silicon) support\n- 2024-11-26:\n    - Updated v1.0 tiny version pretrained model, optimized for real-time voice conversion\n    - Support one-shot\u002Ffew-shot single\u002Fmulti speaker fine-tuning\n    - Support using custom checkpoint for webUI & real-time GUI\n- 2024-11-19:\n    - arXiv paper released\n- 2024-10-28:\n    - Updated fine-tuned 44k singing voice conversion model with better audio quality\n- 2024-10-27:\n    - Added real-time voice conversion GUI\n- 2024-10-25:\n    - Added exhaustive evaluation results and comparisons with RVCv2 for singing voice conversion\n- 2024-10-24:\n    - Updated 44kHz singing voice conversion model, with OpenAI Whisper as speech content input\n- 2024-10-07:\n    - Updated v0.3 pretrained model, changed speech content encoder to OpenAI Whisper\n    - Added objective evaluation results for v0.3 pretrained model\n- 2024-09-22:\n    - Updated singing voice conversion model to use BigVGAN from NVIDIA, providing large improvement to high-pitched singing voices\n    - Support chunking and streaming output for long audio files in Web UI\n- 2024-09-18:\n    - Updated f0 conditioned model for singing voice conversion\n- 2024-09-14:\n    - Updated v0.2 pretrained model, with smaller size and fewer diffusion steps to achieve the same quality, and additional ability to control prosody preservation\n    - Added 
command line inference script\n    - Added installation and usage instructions\n\n## Acknowledgements🙏\n- [Amphion](https:\u002F\u002Fgithub.com\u002Fopen-mmlab\u002FAmphion) for providing computational resources and inspiration!\n- [Vevo](https:\u002F\u002Fgithub.com\u002Fopen-mmlab\u002FAmphion\u002Ftree\u002Fmain\u002Fmodels\u002Fvc\u002Fvevo) for the theoretical foundation of the V2 model\n- [MegaTTS3](https:\u002F\u002Fgithub.com\u002Fbytedance\u002FMegaTTS3) for the multi-condition CFG inference implemented in the V2 model\n- [ASTRAL-quantization](https:\u002F\u002Fgithub.com\u002FPlachtaa\u002FASTRAL-quantization) for the amazing speaker-disentangled speech tokenizer used by the V2 model\n- [RVC](https:\u002F\u002Fgithub.com\u002FRVC-Project\u002FRetrieval-based-Voice-Conversion-WebUI) for laying the foundation of real-time voice conversion\n- [SEED-TTS](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.02430) for the initial idea\n","Seed-VC is a project for zero-shot voice conversion and singing voice conversion, with real-time conversion support. Its core capability is cloning a target voice from 1 to 30 seconds of reference speech without any training. It also supports fine-tuning on specific speakers to improve performance, with very low data requirements (as little as one utterance per speaker) and very fast training (as little as 2 minutes on a T4 GPU). In addition, it offers real-time voice conversion with about 300 ms of algorithm delay and 100 ms of device-side delay, suitable for online meetings, gaming and live streaming. The tool is written in Python, and Python 3.10 is recommended, on Windows, Mac M Series or Linux.",2,"2026-06-11 03:40:59","high_star"]