[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-2439":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":23,"hasPages":23,"topics":25,"createdAt":10,"pushedAt":10,"updatedAt":45,"readmeContent":46,"aiSummary":47,"trendingCount":16,"starSnapshotCount":16,"syncStatus":48,"lastSyncTime":49,"discoverSource":50},2439,"CosyVoice","FunAudioLLM\u002FCosyVoice","FunAudioLLM","Multi-lingual large voice generation model, providing inference, training and deployment full-stack ability.","https:\u002F\u002Ffunaudiollm.github.io\u002Fcosyvoice3",null,"Python",21592,2485,132,763,0,19,150,606,107,45,"Apache License 2.0",false,"main",[26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44],"audio-generation","cantonese","chatbot","chatgpt","chinese","cosyvoice","cross-lingual","english","fine-grained","fine-tuning","gpt-4o","japanese","korean","multi-lingual","natural-language-generation","python","text-to-speech","tts","voice-cloning","2026-06-12 02:00:41","![SVG Banners](https:\u002F\u002Fsvg-banners.vercel.app\u002Fapi?type=origin&text1=CosyVoice🤠&text2=Text-to-Speech%20💖%20Large%20Language%20Model&width=800&height=210)\n\n## 👉🏻 CosyVoice 👈🏻\n\n**Fun-CosyVoice 3.0**: [Demos](https:\u002F\u002Ffunaudiollm.github.io\u002Fcosyvoice3\u002F); [Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.17589); [Modelscope](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002FFunAudioLLM\u002FFun-CosyVoice3-0.5B-2512); [Huggingface](https:\u002F\u002Fhuggingface.co\u002FFunAudioLLM\u002FFun-CosyVoice3-0.5B-2512); 
[CV3-Eval](https:\u002F\u002Fgithub.com\u002FFunAudioLLM\u002FCV3-Eval)\n\n**CosyVoice 2.0**: [Demos](https:\u002F\u002Ffunaudiollm.github.io\u002Fcosyvoice2\u002F); [Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.10117); [Modelscope](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002Fiic\u002FCosyVoice2-0.5B); [HuggingFace](https:\u002F\u002Fhuggingface.co\u002FFunAudioLLM\u002FCosyVoice2-0.5B)\n\n**CosyVoice 1.0**: [Demos](https:\u002F\u002Ffun-audio-llm.github.io); [Paper](https:\u002F\u002Ffunaudiollm.github.io\u002Fpdf\u002FCosyVoice_v1.pdf); [Modelscope](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002Fiic\u002FCosyVoice-300M); [HuggingFace](https:\u002F\u002Fhuggingface.co\u002FFunAudioLLM\u002FCosyVoice-300M)\n\n## Highlight🔥\n\n**Fun-CosyVoice 3.0** is an advanced text-to-speech (TTS) system based on large language models (LLM), surpassing its predecessor (CosyVoice 2.0) in content consistency, speaker similarity, and prosody naturalness. It is designed for zero-shot multilingual speech synthesis in the wild.\n### Key Features\n- **Language Coverage**: Covers 9 common languages (Chinese, English, Japanese, Korean, German, Spanish, French, Italian, Russian), 18+ Chinese dialects\u002Faccents (Guangdong, Minnan, Sichuan, Dongbei, Shan3xi, Shan1xi, Shanghai, Tianjin, Shandong, Ningxia, Gansu, etc.) 
and supports both multilingual and cross-lingual zero-shot voice cloning.\n- **Content Consistency & Naturalness**: Achieves state-of-the-art performance in content consistency, speaker similarity, and prosody naturalness.\n- **Pronunciation Inpainting**: Supports pronunciation inpainting of Chinese Pinyin and English CMU phonemes, which gives finer control and makes it suitable for production use.\n- **Text Normalization**: Supports reading of numbers, special symbols and various text formats without a traditional frontend module.\n- **Bi-Streaming**: Supports both text-in streaming and audio-out streaming, and achieves latency as low as 150 ms while maintaining high-quality audio output.\n- **Instruct Support**: Supports various instructions such as languages, dialects, emotions, speed, volume, etc.\n\n\n## Roadmap\n\n- [x] 2025\u002F12\n\n    - [x] release Fun-CosyVoice3-0.5B-2512 base model, rl model and its training\u002Finference script\n    - [x] release Fun-CosyVoice3-0.5B modelscope gradio space\n\n- [x] 2025\u002F08\n\n    - [x] Thanks to the contribution from NVIDIA Yuekai Zhang, add triton trtllm runtime support and cosyvoice2 grpo training support\n\n- [x] 2025\u002F07\n\n    - [x] release Fun-CosyVoice 3.0 eval set\n\n- [x] 2025\u002F05\n\n    - [x] add CosyVoice2-0.5B vllm support\n\n- [x] 2024\u002F12\n\n    - [x] 25hz CosyVoice2-0.5B released\n\n- [x] 2024\u002F09\n\n    - [x] 25hz CosyVoice-300M base model\n    - [x] 25hz CosyVoice-300M voice conversion function\n\n- [x] 2024\u002F08\n\n    - [x] Repetition Aware Sampling (RAS) inference for llm stability\n    - [x] Streaming inference mode support, including kv cache and sdpa for rtf optimization\n\n- [x] 2024\u002F07\n\n    - [x] Flow matching training support\n    - [x] WeTextProcessing support when ttsfrd is not available\n    - [x] Fastapi server and client\n\n## Evaluation\n\n| Model | Open-Source | Model Size | test-zh\u003Cbr>CER (%) ↓ | test-zh\u003Cbr>SS (%) ↑ | 
test-en\u003Cbr>WER (%) ↓ | test-en\u003Cbr>SS (%) ↑ | test-hard\u003Cbr>CER (%) ↓ | test-hard\u003Cbr>SS (%) ↑ |\n| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |\n| Human | - | - | 1.26 | 75.5 | 2.14 | 73.4 | - | - |\n| Seed-TTS | ❌ | - | 1.12 | 79.6 | 2.25 | 76.2 | 7.59 | 77.6 |\n| MiniMax-Speech | ❌ | - | 0.83 | 78.3 | 1.65 | 69.2 | - | - |\n| F5-TTS | ✅ | 0.3B | 1.52 | 74.1 | 2.00 | 64.7 | 8.67 | 71.3 |\n| Spark TTS | ✅ | 0.5B | 1.2 | 66.0 | 1.98 | 57.3 | - | - |\n| CosyVoice2 | ✅ | 0.5B | 1.45 | 75.7 | 2.57 | 65.9 | 6.83 | 72.4 |\n| FireRedTTS2 | ✅ | 1.5B | 1.14 | 73.2 | 1.95 | 66.5 | - | - |\n| Index-TTS2 | ✅ | 1.5B | 1.03 | 76.5 | 2.23 | 70.6 | 7.12 | 75.5 |\n| VibeVoice-1.5B | ✅ | 1.5B | 1.16 | 74.4 | 3.04 | 68.9 | - | - |\n| VibeVoice-Realtime | ✅ | 0.5B | - | - | 2.05 | 63.3 | - | - |\n| HiggsAudio-v2 | ✅ | 3B | 1.50 | 74.0 | 2.44 | 67.7 | - | - |\n| VoxCPM | ✅ | 0.5B | 0.93 | 77.2 | 1.85 | 72.9 | 8.87 | 73.0 |\n| GLM-TTS | ✅ | 1.5B | 1.03 | 76.1 | - | - | - | - |\n| GLM-TTS RL | ✅ | 1.5B | 0.89 | 76.4 | - | - | - | - |\n| Fun-CosyVoice3-0.5B-2512 | ✅ | 0.5B | 1.21 | 78.0 | 2.24 | 71.8 | 6.71 | 75.8 |\n| Fun-CosyVoice3-0.5B-2512_RL | ✅ | 0.5B | 0.81 | 77.4 | 1.68 | 69.5 | 5.44 | 75.0 |\n\n\n## Install\n\n### Clone and install\n\n- Clone the repo\n    ``` sh\n    git clone --recursive https:\u002F\u002Fgithub.com\u002FFunAudioLLM\u002FCosyVoice.git\n    # If you failed to clone the submodule due to network failures, please run the following command until success\n    cd CosyVoice\n    git submodule update --init --recursive\n    ```\n\n- Install Conda: please see https:\u002F\u002Fdocs.conda.io\u002Fen\u002Flatest\u002Fminiconda.html\n- Create Conda env:\n\n    ``` sh\n    conda create -n cosyvoice -y python=3.10\n    conda activate cosyvoice\n    pip install -r requirements.txt -i https:\u002F\u002Fmirrors.aliyun.com\u002Fpypi\u002Fsimple\u002F --trusted-host=mirrors.aliyun.com\n\n    # If you encounter sox compatibility issues\n  
  # ubuntu\n    sudo apt-get install sox libsox-dev\n    # centos\n    sudo yum install sox sox-devel\n    ```\n\n### Model download\n\nWe strongly recommend downloading our pretrained `Fun-CosyVoice3-0.5B`, `CosyVoice2-0.5B`, `CosyVoice-300M`, `CosyVoice-300M-SFT`, and `CosyVoice-300M-Instruct` models and the `CosyVoice-ttsfrd` resource.\n\n``` python\n# modelscope SDK model download\nfrom modelscope import snapshot_download\nsnapshot_download('FunAudioLLM\u002FFun-CosyVoice3-0.5B-2512', local_dir='pretrained_models\u002FFun-CosyVoice3-0.5B')\nsnapshot_download('iic\u002FCosyVoice2-0.5B', local_dir='pretrained_models\u002FCosyVoice2-0.5B')\nsnapshot_download('iic\u002FCosyVoice-300M', local_dir='pretrained_models\u002FCosyVoice-300M')\nsnapshot_download('iic\u002FCosyVoice-300M-SFT', local_dir='pretrained_models\u002FCosyVoice-300M-SFT')\nsnapshot_download('iic\u002FCosyVoice-300M-Instruct', local_dir='pretrained_models\u002FCosyVoice-300M-Instruct')\nsnapshot_download('iic\u002FCosyVoice-ttsfrd', local_dir='pretrained_models\u002FCosyVoice-ttsfrd')\n\n# for overseas users, huggingface SDK model download\nfrom huggingface_hub import snapshot_download\nsnapshot_download('FunAudioLLM\u002FFun-CosyVoice3-0.5B-2512', local_dir='pretrained_models\u002FFun-CosyVoice3-0.5B')\nsnapshot_download('FunAudioLLM\u002FCosyVoice2-0.5B', local_dir='pretrained_models\u002FCosyVoice2-0.5B')\nsnapshot_download('FunAudioLLM\u002FCosyVoice-300M', local_dir='pretrained_models\u002FCosyVoice-300M')\nsnapshot_download('FunAudioLLM\u002FCosyVoice-300M-SFT', local_dir='pretrained_models\u002FCosyVoice-300M-SFT')\nsnapshot_download('FunAudioLLM\u002FCosyVoice-300M-Instruct', local_dir='pretrained_models\u002FCosyVoice-300M-Instruct')\nsnapshot_download('FunAudioLLM\u002FCosyVoice-ttsfrd', local_dir='pretrained_models\u002FCosyVoice-ttsfrd')\n```\n\nOptionally, you can unzip the `ttsfrd` resource and install the `ttsfrd` package for better text normalization performance.\n\nNote that this step is not 
necessary. If you do not install the `ttsfrd` package, wetext is used by default.\n\n``` sh\ncd pretrained_models\u002FCosyVoice-ttsfrd\u002F\nunzip resource.zip -d .\npip install ttsfrd_dependency-0.1-py3-none-any.whl\npip install ttsfrd-0.4.2-cp310-cp310-linux_x86_64.whl\n```\n\n### Basic Usage\n\nWe strongly recommend using `Fun-CosyVoice3-0.5B` for better performance.\nFollow the code in `example.py` for detailed usage of each model.\n```sh\npython example.py\n```\n\n#### vLLM Usage\nCosyVoice2\u002F3 now support **vLLM 0.11.x+ (V1 engine)** and **vLLM 0.9.0 (legacy)**.\nOlder vLLM versions (\u003C0.9.0) do not support CosyVoice inference, and versions in between (e.g., 0.10.x) are not tested.\n\nNote that `vllm` has many specific requirements. We suggest creating a new env so that your existing env is not corrupted if your hardware does not support vllm.\n\n``` sh\nconda create -n cosyvoice_vllm --clone cosyvoice\nconda activate cosyvoice_vllm\n# for vllm==0.9.0\npip install vllm==v0.9.0 transformers==4.51.3 numpy==1.26.4 -i https:\u002F\u002Fmirrors.aliyun.com\u002Fpypi\u002Fsimple\u002F --trusted-host=mirrors.aliyun.com\n# for vllm>=0.11.0\npip install vllm==v0.11.0 transformers==4.57.1 numpy==1.26.4 -i https:\u002F\u002Fmirrors.aliyun.com\u002Fpypi\u002Fsimple\u002F --trusted-host=mirrors.aliyun.com\npython vllm_example.py\n```\n\n#### Start web demo\n\nYou can use our web demo page to get familiar with CosyVoice quickly.\n\nPlease see the demo website for details.\n\n``` sh\n# change to iic\u002FCosyVoice-300M-SFT for sft inference, or iic\u002FCosyVoice-300M-Instruct for instruct inference\npython3 webui.py --port 50000 --model_dir pretrained_models\u002FCosyVoice-300M\n```\n\n#### Advanced Usage\n\nFor advanced users, we have provided training and inference scripts in `examples\u002Flibritts`.\n\n#### Build for deployment\n\nOptionally, if you want service deployment,\nyou can run the following steps.\n\n``` sh\ncd runtime\u002Fpython\ndocker build -t 
cosyvoice:v1.0 .\n# change iic\u002FCosyVoice-300M to iic\u002FCosyVoice-300M-Instruct if you want to use instruct inference\n# for grpc usage\ndocker run -d --runtime=nvidia -p 50000:50000 cosyvoice:v1.0 \u002Fbin\u002Fbash -c \"cd \u002Fopt\u002FCosyVoice\u002FCosyVoice\u002Fruntime\u002Fpython\u002Fgrpc && python3 server.py --port 50000 --max_conc 4 --model_dir iic\u002FCosyVoice-300M && sleep infinity\"\ncd grpc && python3 client.py --port 50000 --mode \u003Csft|zero_shot|cross_lingual|instruct>\n# for fastapi usage\ndocker run -d --runtime=nvidia -p 50000:50000 cosyvoice:v1.0 \u002Fbin\u002Fbash -c \"cd \u002Fopt\u002FCosyVoice\u002FCosyVoice\u002Fruntime\u002Fpython\u002Ffastapi && python3 server.py --port 50000 --model_dir iic\u002FCosyVoice-300M && sleep infinity\"\ncd fastapi && python3 client.py --port 50000 --mode \u003Csft|zero_shot|cross_lingual|instruct>\n```\n\n#### Using Nvidia TensorRT-LLM for deployment\n\nUsing TensorRT-LLM to accelerate the cosyvoice2 llm can give a 4x speedup compared with the huggingface transformers implementation.\nTo get started quickly:\n\n``` sh\ncd runtime\u002Ftriton_trtllm\ndocker compose up -d\n```\nFor more details, see [here](https:\u002F\u002Fgithub.com\u002FFunAudioLLM\u002FCosyVoice\u002Ftree\u002Fmain\u002Fruntime\u002Ftriton_trtllm).\n\n## Discussion & Communication\n\nYou can discuss directly on [GitHub Issues](https:\u002F\u002Fgithub.com\u002FFunAudioLLM\u002FCosyVoice\u002Fissues).\n\nYou can also scan the QR code to join our official Dingding chat group.\n\n\u003Cimg src=\".\u002Fasset\u002Fdingding.png\" width=\"250px\">\n\n## Acknowledge\n\n1. We borrowed a lot of code from [FunASR](https:\u002F\u002Fgithub.com\u002Fmodelscope\u002FFunASR).\n2. We borrowed a lot of code from [FunCodec](https:\u002F\u002Fgithub.com\u002Fmodelscope\u002FFunCodec).\n3. We borrowed a lot of code from [Matcha-TTS](https:\u002F\u002Fgithub.com\u002Fshivammehta25\u002FMatcha-TTS).\n4. 
We borrowed a lot of code from [AcademiCodec](https:\u002F\u002Fgithub.com\u002Fyangdongchao\u002FAcademiCodec).\n5. We borrowed a lot of code from [WeNet](https:\u002F\u002Fgithub.com\u002Fwenet-e2e\u002Fwenet).\n\n## Citations\n\n``` bibtex\n@article{du2024cosyvoice,\n  title={Cosyvoice: A scalable multilingual zero-shot text-to-speech synthesizer based on supervised semantic tokens},\n  author={Du, Zhihao and Chen, Qian and Zhang, Shiliang and Hu, Kai and Lu, Heng and Yang, Yexin and Hu, Hangrui and Zheng, Siqi and Gu, Yue and Ma, Ziyang and others},\n  journal={arXiv preprint arXiv:2407.05407},\n  year={2024}\n}\n\n@article{du2024cosyvoice2,\n  title={Cosyvoice 2: Scalable streaming speech synthesis with large language models},\n  author={Du, Zhihao and Wang, Yuxuan and Chen, Qian and Shi, Xian and Lv, Xiang and Zhao, Tianyu and Gao, Zhifu and Yang, Yexin and Gao, Changfeng and Wang, Hui and others},\n  journal={arXiv preprint arXiv:2412.10117},\n  year={2024}\n}\n\n@article{du2025cosyvoice,\n  title={CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training},\n  author={Du, Zhihao and Gao, Changfeng and Wang, Yuxuan and Yu, Fan and Zhao, Tianyu and Wang, Hao and Lv, Xiang and Wang, Hui and Shi, Xian and An, Keyu and others},\n  journal={arXiv preprint arXiv:2505.17589},\n  year={2025}\n}\n\n@inproceedings{lyu2025build,\n  title={Build LLM-Based Zero-Shot Streaming TTS System with Cosyvoice},\n  author={Lyu, Xiang and Wang, Yuxuan and Zhao, Tianyu and Wang, Hao and Liu, Huadai and Du, Zhihao},\n  booktitle={ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},\n  pages={1--2},\n  year={2025},\n  organization={IEEE}\n}\n```\n\n## Disclaimer\nThe content provided above is for academic purposes only and is intended to demonstrate technical capabilities. Some examples are sourced from the internet. 
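If any content infringes on your rights, please contact us to request its removal.\n","CosyVoice is a multilingual large voice generation model that provides full-stack capabilities spanning inference, training, and deployment. It supports nine common languages, including Chinese, English, and Japanese, as well as more than 18 Chinese dialects\u002Faccents, and offers zero-shot cross-lingual voice cloning. The project achieves leading performance in content consistency, speaker similarity, and prosody naturalness, and supports pronunciation inpainting, text normalization, and bi-streaming, allowing it to maintain high-quality audio output at low latency. In addition, instruction support enables custom settings such as speed and volume, making it suitable for applications that need highly customized, natural-sounding speech synthesis, such as virtual assistants and audiobook production.",2,"2026-06-11 02:49:55","top_language"]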