[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-71051":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":25,"hasPages":23,"topics":26,"createdAt":10,"pushedAt":10,"updatedAt":32,"readmeContent":33,"aiSummary":34,"trendingCount":16,"starSnapshotCount":16,"syncStatus":35,"lastSyncTime":36,"discoverSource":37},71051,"vits","jaywalnut310\u002Fvits","jaywalnut310","VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech","https:\u002F\u002Fjaywalnut310.github.io\u002Fvits-demo\u002Findex.html",null,"Python",7866,1391,54,159,0,3,12,19,9,40.43,"MIT License",false,"main",true,[27,28,29,30,31],"deep-learning","pytorch","speech-synthesis","text-to-speech","tts","2026-06-12 02:02:47","# VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech\n\n### Jaehyeon Kim, Jungil Kong, and Juhee Son\n\nIn our recent [paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2106.06103), we propose VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech.\n\nSeveral recent end-to-end text-to-speech (TTS) models enabling single-stage training and parallel sampling have been proposed, but their sample quality does not match that of two-stage TTS systems. In this work, we present a parallel end-to-end TTS method that generates more natural sounding audio than current two-stage models. Our method adopts variational inference augmented with normalizing flows and an adversarial training process, which improves the expressive power of generative modeling. 
We also propose a stochastic duration predictor to synthesize speech with diverse rhythms from input text. With the uncertainty modeling over latent variables and the stochastic duration predictor, our method expresses the natural one-to-many relationship in which a text input can be spoken in multiple ways with different pitches and rhythms. A subjective human evaluation (mean opinion score, or MOS) on the LJ Speech, a single speaker dataset, shows that our method outperforms the best publicly available TTS systems and achieves a MOS comparable to ground truth.\n\nVisit our [demo](https:\u002F\u002Fjaywalnut310.github.io\u002Fvits-demo\u002Findex.html) for audio samples.\n\nWe also provide the [pretrained models](https:\u002F\u002Fdrive.google.com\u002Fdrive\u002Ffolders\u002F1ksarh-cJf3F5eKJjLVWY0X1j1qsQqiS2?usp=sharing).\n\n** Update note: Thanks to [Rishikesh (ऋषिकेश)](https:\u002F\u002Fgithub.com\u002Fjaywalnut310\u002Fvits\u002Fissues\u002F1), our interactive TTS demo is now available on [Colab Notebook](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1CO61pZizDj7en71NQG_aqqKdGaA_SaBf?usp=sharing).\n\n\u003Ctable style=\"width:100%\">\n  \u003Ctr>\n    \u003Cth>VITS at training\u003C\u002Fth>\n    \u003Cth>VITS at inference\u003C\u002Fth>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd>\u003Cimg src=\"resources\u002Ffig_1a.png\" alt=\"VITS at training\" height=\"400\">\u003C\u002Ftd>\n    \u003Ctd>\u003Cimg src=\"resources\u002Ffig_1b.png\" alt=\"VITS at inference\" height=\"400\">\u003C\u002Ftd>\n  \u003C\u002Ftr>\n\u003C\u002Ftable>\n\n\n## Pre-requisites\n0. Python >= 3.6\n0. Clone this repository\n0. Install Python requirements. Please refer to [requirements.txt](requirements.txt)\n    1. You may need to install espeak first: `apt-get install espeak`\n0. Download datasets\n    1. Download and extract the LJ Speech dataset, then rename or create a link to the dataset folder: `ln -s \u002Fpath\u002Fto\u002FLJSpeech-1.1\u002Fwavs DUMMY1`\n    1. 
For the multi-speaker setting, download and extract the VCTK dataset, and downsample wav files to 22050 Hz. Then rename or create a link to the dataset folder: `ln -s \u002Fpath\u002Fto\u002FVCTK-Corpus\u002Fdownsampled_wavs DUMMY2`\n0. Build Monotonic Alignment Search and run preprocessing if you use your own datasets.\n```sh\n# Cython-version Monotonic Alignment Search\ncd monotonic_align\npython setup.py build_ext --inplace\n\n# Preprocessing (g2p) for your own datasets. Preprocessed phonemes for LJ Speech and VCTK have already been provided.\n# python preprocess.py --text_index 1 --filelists filelists\u002Fljs_audio_text_train_filelist.txt filelists\u002Fljs_audio_text_val_filelist.txt filelists\u002Fljs_audio_text_test_filelist.txt\n# python preprocess.py --text_index 2 --filelists filelists\u002Fvctk_audio_sid_text_train_filelist.txt filelists\u002Fvctk_audio_sid_text_val_filelist.txt filelists\u002Fvctk_audio_sid_text_test_filelist.txt\n```\n\n\n## Training Example\n```sh\n# LJ Speech\npython train.py -c configs\u002Fljs_base.json -m ljs_base\n\n# VCTK\npython train_ms.py -c configs\u002Fvctk_base.json -m vctk_base\n```\n\n\n## Inference Example\nSee [inference.ipynb](inference.ipynb)\n","VITS is an end-to-end text-to-speech system based on a conditional variational autoencoder and adversarial learning. The project uses variational inference, normalizing flows, and an adversarial training process to improve the naturalness of generated audio, and introduces a stochastic duration predictor to generate speech with diverse rhythms. At its core, modeling uncertainty over latent variables together with stochastic duration prediction lets the same text input be spoken with different pitches and rhythms. It suits applications that need high-quality, diverse speech synthesis, such as virtual assistants and audiobook production. The project is written in Python and built on the PyTorch framework, and its open-source code is easy to extend and customize.",2,"2026-06-11 03:35:40","high_star"]