[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-72181":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":16,"stars7d":17,"stars30d":18,"stars90d":16,"forks30d":16,"starsTrendScore":16,"compositeScore":19,"rankGlobal":10,"rankLanguage":10,"license":20,"archived":21,"fork":21,"defaultBranch":22,"hasWiki":23,"hasPages":21,"topics":24,"createdAt":10,"pushedAt":10,"updatedAt":34,"readmeContent":35,"aiSummary":36,"trendingCount":16,"starSnapshotCount":16,"syncStatus":17,"lastSyncTime":37,"discoverSource":38},72181,"metavoice-src","metavoiceio\u002Fmetavoice-src","metavoiceio","Foundational model for human-like, expressive TTS","https:\u002F\u002Fthemetavoice.xyz\u002F",null,"Python",4199,692,80,57,0,2,4,61.92,"Apache License 2.0",false,"main",true,[25,26,27,28,29,30,31,32,33],"ai","deep-learning","pytorch","speech","speech-synthesis","text-to-speech","tts","voice-clone","zero-shot-tts","2026-06-12 04:01:04","# MetaVoice-1B\n\n\n\n[![Playground](https:\u002F\u002Fimg.shields.io\u002Fstatic\u002Fv1?label=Try&message=Playground&color=fc4982&url=https:\u002F\u002Fttsdemo.themetavoice.xyz\u002F)](https:\u002F\u002Fttsdemo.themetavoice.xyz\u002F)\n\u003Ca target=\"_blank\" style=\"display: inline-block; vertical-align: middle\" href=\"https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fmetavoiceio\u002Fmetavoice-src\u002Fblob\u002Fmain\u002Fcolab_demo.ipynb\">\n  \u003Cimg src=\"https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg\" alt=\"Open In 
Colab\"\u002F>\n\u003C\u002Fa>\n[![](https:\u002F\u002Fdcbadge.vercel.app\u002Fapi\u002Fserver\u002FCpy6U3na8Z?style=flat&compact=True)](https:\u002F\u002Fdiscord.gg\u002FtbTbkGEgJM)\n[![Twitter](https:\u002F\u002Fimg.shields.io\u002Ftwitter\u002Furl\u002Fhttps\u002Ftwitter.com\u002FOnusFM.svg?style=social&label=@metavoiceio)](https:\u002F\u002Ftwitter.com\u002Fmetavoiceio)\n\n\n\nMetaVoice-1B is a 1.2B parameter base model trained on 100K hours of speech for TTS (text-to-speech). It has been built with the following priorities:\n* **Emotional speech rhythm and tone** in English.\n* **Zero-shot cloning for American & British voices**, with 30s reference audio.\n* Support for (cross-lingual) **voice cloning with finetuning**.\n  * We have had success with as little as 1 minute training data for Indian speakers.\n* Synthesis of **arbitrary length text**\n\nWe’re releasing MetaVoice-1B under the Apache 2.0 license, *it can be used without restrictions*.\n\n\n## Quickstart - tl;dr\n\nWeb UI\n```bash\ndocker-compose up -d ui && docker-compose ps && docker-compose logs -f\n```\n\nServer\n```bash\n# navigate to \u003CURL>\u002Fdocs for API definitions\ndocker-compose up -d server && docker-compose ps && docker-compose logs -f\n```\n\n## Installation\n\n**Pre-requisites:**\n- GPU VRAM >=12GB\n- Python >=3.10,\u003C3.12\n- pipx ([installation instructions](https:\u002F\u002Fpipx.pypa.io\u002Fstable\u002Finstallation\u002F))\n\n**Environment setup**\n```bash\n# install ffmpeg\nwget https:\u002F\u002Fjohnvansickle.com\u002Fffmpeg\u002Fbuilds\u002Fffmpeg-git-amd64-static.tar.xz\nwget https:\u002F\u002Fjohnvansickle.com\u002Fffmpeg\u002Fbuilds\u002Fffmpeg-git-amd64-static.tar.xz.md5\nmd5sum -c ffmpeg-git-amd64-static.tar.xz.md5\ntar xvf ffmpeg-git-amd64-static.tar.xz\nsudo mv ffmpeg-git-*-static\u002Fffprobe ffmpeg-git-*-static\u002Fffmpeg \u002Fusr\u002Flocal\u002Fbin\u002F\nrm -rf ffmpeg-git-*\n\n# install rust if not installed (ensure you've restarted your terminal after 
installation)\ncurl --proto '=https' --tlsv1.2 -sSf https:\u002F\u002Fsh.rustup.rs | sh\n```\n\n### Project dependencies installation\n1. [Using poetry](#using-poetry-recommended)\n2. [Using pip\u002Fconda](#using-pipconda)\n\n#### Using poetry (recommended)\n```bash\n# install poetry if not installed (ensure you've restarted your terminal after installation)\npipx install poetry\n\n# disable any conda envs that might interfere with poetry's venv\nconda deactivate\n\n# if running from Linux, keyring backend can hang on `poetry install`. This prevents that.\nexport PYTHON_KEYRING_BACKEND=keyring.backends.fail.Keyring\n\n# pip's dependency resolver will complain, this is temporary expected behaviour\n# full inference & finetuning functionality will still be available\npoetry install && poetry run pip install torch==2.2.1 torchaudio==2.2.1\n```\n\n#### Using pip\u002Fconda\nNOTE 1: When raising issues, we'll ask you to try with poetry first.\nNOTE 2: All commands in this README use `poetry` by default, so you can just remove any `poetry run`.\n\n```bash\npip install -r requirements.txt\npip install torch==2.2.1 torchaudio==2.2.1\npip install -e .\n```\n\n## Usage\n1. Download it and use it anywhere (including locally) with our [reference implementation](\u002Ffam\u002Fllm\u002Ffast_inference.py)\n```bash\n# You can use `--quantisation_mode int4` or `--quantisation_mode int8` for experimental faster inference.  This will degrade the quality of the audio.\n# Note: int8 is slower than bf16\u002Ffp16 for undebugged reasons. If you want fast, try int4 which is roughly 2x faster than bf16\u002Ffp16.\npoetry run python -i fam\u002Fllm\u002Ffast_inference.py\n\n# Run e.g. of API usage within the interactive python session\ntts.synthesise(text=\"This is a demo of text to speech by MetaVoice-1B, an open-source foundational audio model.\", spk_ref_path=\"assets\u002Fbria.mp3\")\n```\n> Note: The script takes 30-90s to startup (depending on hardware). 
This is because we torch.compile the model for fast inference.\n\n> On Ampere, Ada-Lovelace, and Hopper architecture GPUs, once compiled, the synthesise() API runs faster than real-time, with a Real-Time Factor (RTF) \u003C 1.0.\n\n2. Deploy it on any cloud (AWS\u002FGCP\u002FAzure), using our [inference server](serving.py) or [web UI](app.py)\n```bash\n# You can use `--quantisation_mode int4` or `--quantisation_mode int8` for experimental faster inference. This will degrade the quality of the audio.\n# Note: int8 is slower than bf16\u002Ffp16 for undebugged reasons. If you want fast, try int4 which is roughly 2x faster than bf16\u002Ffp16.\n\n# navigate to \u003CURL>\u002Fdocs for API definitions\npoetry run python serving.py\n\npoetry run python app.py\n```\n\n3. Use it via [Hugging Face](https:\u002F\u002Fhuggingface.co\u002Fmetavoiceio)\n4. [Google Colab Demo](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fmetavoiceio\u002Fmetavoice-src\u002Fblob\u002Fmain\u002Fcolab_demo.ipynb)\n\n## Finetuning\nWe support finetuning the first stage LLM (see [Architecture section](#Architecture)).\n\nIn order to finetune, we expect a \"|\"-delimited CSV dataset of the following format:\n\n```csv\naudio_files|captions\n.\u002Fdata\u002Faudio.wav|.\u002Fdata\u002Fcaption.txt\n```\n\nNote that we don't perform any dataset overlap checks, so ensure that your train and val datasets are disjoint.\n\nTry it out using our sample datasets via:\n```bash\npoetry run finetune --train .\u002Fdatasets\u002Fsample_dataset.csv --val .\u002Fdatasets\u002Fsample_val_dataset.csv\n```\n\nOnce you've trained your model, you can use it for inference via:\n```bash\npoetry run python -i fam\u002Fllm\u002Ffast_inference.py --first_stage_path .\u002Fmy-finetuned_model.pt\n```\n\n### Configuration\n\nIn order to set hyperparameters such as learning rate, what to freeze, etc, you\ncan edit the [finetune_params.py](.\u002Ffam\u002Fllm\u002Fconfig\u002Ffinetune_params.py) file.\n\nWe've got a 
light & optional integration with W&B that can be enabled via setting\n`wandb_log = True` & by installing the appropriate dependencies.\n\n```bash\npoetry install -E observable\n```\n\n## Upcoming\n- [x] Faster inference ⚡\n- [x] Fine-tuning code 📐\n- [ ] Synthesis of arbitrary length text\n\n\n## Architecture\nWe predict EnCodec tokens from text and speaker information. This is then diffused up to the waveform level, with post-processing applied to clean up the audio.\n\n* We use a causal GPT to predict the first two hierarchies of EnCodec tokens. Text and audio are part of the LLM context. Speaker information is passed via conditioning at the token embedding layer. This speaker conditioning is obtained from a separately trained speaker verification network.\n  - The two hierarchies are predicted in a \"flattened interleaved\" manner: we predict the first token of the first hierarchy, then the first token of the second hierarchy, then the second token of the first hierarchy, and so on.\n  - We use condition-free sampling to boost the cloning capability of the model.\n  - The text is tokenised using a custom trained BPE tokeniser with 512 tokens.\n  - Note that we've skipped predicting semantic tokens as done in other works, as we found that this isn't strictly necessary.\n* We use a non-causal (encoder-style) transformer to predict the remaining 6 hierarchies from the first two hierarchies. This is a very small model (~10M parameters), and has extensive zero-shot generalisation to most speakers we've tried. Since it's non-causal, we're also able to predict all the timesteps in parallel.\n* We use multi-band diffusion to generate waveforms from the EnCodec tokens. We noticed that the speech is clearer than with the original RVQ decoder or VOCOS. However, the diffusion at waveform level leaves some background artifacts which are quite unpleasant to the ear. 
We clean this up in the next step.\n* We use DeepFilterNet to clean up the artifacts introduced by the multi-band diffusion.\n\n## Optimizations\nThe model supports:\n1. KV-caching via Flash Decoding\n2. Batching (including texts of different lengths)\n\n## Contribute\n- See all [active issues](https:\u002F\u002Fgithub.com\u002Fmetavoiceio\u002Fmetavoice-src\u002Fissues)!\n\n## Acknowledgements\nWe are grateful to Together.ai for their 24\u002F7 help in marshalling our cluster. We thank the teams of AWS, GCP & Hugging Face for support with their cloud platforms.\n\n- [A Défossez et al.](https:\u002F\u002Farxiv.org\u002Fabs\u002F2210.13438) for EnCodec.\n- [RS Roman et al.](https:\u002F\u002Farxiv.org\u002Fabs\u002F2308.02560) for Multiband Diffusion.\n- [@liusongxiang](https:\u002F\u002Fgithub.com\u002Fliusongxiang\u002Fppg-vc\u002Fblob\u002Fmain\u002Fspeaker_encoder\u002Finference.py) for speaker encoder implementation.\n- [@karpathy](https:\u002F\u002Fgithub.com\u002Fkarpathy\u002FnanoGPT) for NanoGPT, which our inference implementation is based on.\n- [@Rikorose](https:\u002F\u002Fgithub.com\u002FRikorose) for DeepFilterNet.\n\nApologies in advance if we've missed anyone out. Please let us know if we have.\n","MetaVoice-1B is a foundational model for human-like, expressive text-to-speech (TTS). Built on PyTorch, it has 1.2 billion parameters and was trained on 100K hours of speech data. It focuses on emotional speech rhythm and tone in English, supports zero-shot cloning of American and British voices (with only 30 seconds of reference audio), and offers cross-lingual voice cloning, where finetuning has worked with as little as 1 minute of training data for Indian speakers. It can also synthesise text of arbitrary length. It suits applications that need high-quality speech synthesis, such as virtual assistants, audiobook production, or game voice-over. The project is released under the Apache 2.0 license and can be used without restrictions.","2026-06-11 03:40:43","high_star"]