[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-72060":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":25,"hasPages":23,"topics":26,"createdAt":10,"pushedAt":10,"updatedAt":30,"readmeContent":31,"aiSummary":32,"trendingCount":16,"starSnapshotCount":16,"syncStatus":33,"lastSyncTime":34,"discoverSource":35},72060,"Orpheus-TTS","canopyai\u002FOrpheus-TTS","canopyai","Towards Human-Sounding Speech","https:\u002F\u002Fcanopylabs.ai",null,"Python",6181,527,73,116,0,14,24,47,42,100.87,"Apache License 2.0",false,"main",true,[27,28,29],"llm","realtime","tts","2026-06-12 04:01:03","# Orpheus TTS\n\n#### Updates 🔥\n- [5\u002F2025] We've partnered with [Baseten](https:\u002F\u002Fwww.baseten.co\u002Fblog\u002Fcanopy-labs-selects-baseten-as-preferred-inference-provider-for-orpheus-tts-model) to bring highly optimized inference to Orpheus at fp8 (more performant) and fp16 (full fidelity) inference. See code and docs [here](\u002Fadditional_inference_options\u002Fbaseten_inference_example\u002FREADME.md).\n\n- [4\u002F2025] We release a [family of multilingual models](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Fcanopylabs\u002Forpheus-multilingual-research-release-67f5894cd16794db163786ba) in a research preview. We release a [training guide](https:\u002F\u002Fcanopylabs.ai\u002Freleases\u002Forpheus_can_speak_any_language#training) that explains how we created these models in the hopes that even better versions in both the languages released and new languages are created. 
We welcome feedback, criticism, and questions in this [discussion](https:\u002F\u002Fgithub.com\u002Fcanopyai\u002FOrpheus-TTS\u002Fdiscussions\u002F123).\n\n## Overview\nOrpheus TTS is a SOTA open-source text-to-speech system built on the Llama-3b backbone. Orpheus demonstrates the emergent capabilities of using LLMs for speech synthesis.\n\n[Check out our original blog post](https:\u002F\u002Fcanopylabs.ai\u002Fmodel-releases)\n\n\nhttps:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Fce17dd3a-f866-4e67-86e4-0025e6e87b8a\n\n## Abilities\n\n- **Human-Like Speech**: Natural intonation, emotion, and rhythm that are superior to SOTA closed source models\n- **Zero-Shot Voice Cloning**: Clone voices without prior fine-tuning\n- **Guided Emotion and Intonation**: Control speech and emotion characteristics with simple tags\n- **Low Latency**: ~200ms streaming latency for realtime applications, reducible to ~100ms with input streaming\n\n## Models\n\nWe provide 2 English models, along with data processing scripts and sample datasets that make it straightforward to create your own finetune.\n\n1. [**Finetuned Prod**](https:\u002F\u002Fhuggingface.co\u002Fcanopylabs\u002Forpheus-tts-0.1-finetune-prod) – A finetuned model for everyday TTS applications\n\n2. [**Pretrained**](https:\u002F\u002Fhuggingface.co\u002Fcanopylabs\u002Forpheus-tts-0.1-pretrained) – Our base model trained on 100k+ hours of English speech data\n\nWe also offer a family of multilingual models in a research release.\n\n1. [**Multilingual Family**](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Fcanopylabs\u002Forpheus-multilingual-research-release-67f5894cd16794db163786ba) - 7 pairs of pretrained and finetuned models.\n\n### Inference\n\n#### Simple setup on Colab\n\nWe offer a standardised prompt format across languages, and these notebooks illustrate how to use our models in English.\n\n1. 
[Colab For Tuned Model](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1KhXT56UePPUHhqitJNUxq63k-pQomz3N?usp=sharing) (not streaming, see below for realtime streaming) – A finetuned model for everyday TTS applications.\n2. [Colab For Pretrained Model](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F10v9MIEbZOr_3V8ZcPAIh8MN7q2LjcstS?usp=sharing) – This notebook is set up for conditioned generation but can be extended to a range of tasks.\n\n#### One-click deployment on Baseten\n\nBaseten is our [preferred inference partner](https:\u002F\u002Fwww.baseten.co\u002Fblog\u002Fcanopy-labs-selects-baseten-as-preferred-inference-provider-for-orpheus-tts-model) for Orpheus. Get a dedicated deployment with real-time streaming on production-grade infrastructure [in one click on Baseten](https:\u002F\u002Fwww.baseten.co\u002Flibrary\u002Forpheus-tts\u002F).\n\n#### Streaming Inference Example\n\n1. Clone this repo\n   ```bash\n   git clone https:\u002F\u002Fgithub.com\u002Fcanopyai\u002FOrpheus-TTS.git\n   ```\n2. Navigate and install packages\n   ```bash\n   cd Orpheus-TTS && pip install orpheus-speech # uses vllm under the hood for fast inference\n   ```\n   vllm released a slightly buggy version on March 18th, so some bugs can be resolved by running `pip install vllm==0.7.3` after `pip install orpheus-speech`.\n3. Run the example below:\n   ```python\n   from orpheus_tts import OrpheusModel\n   import wave\n   import time\n   \n   model = OrpheusModel(model_name=\"canopylabs\u002Forpheus-tts-0.1-finetune-prod\", max_model_len=2048)\n   prompt = '''Man, the way social media has, um, completely changed how we interact is just wild, right? Like, we're all connected 24\u002F7 but somehow people feel more alone than ever. 
And don't even get me started on how it's messing with kids' self-esteem and mental health and whatnot.'''\n\n   start_time = time.monotonic()\n   syn_tokens = model.generate_speech(\n      prompt=prompt,\n      voice=\"tara\",\n      )\n\n   with wave.open(\"output.wav\", \"wb\") as wf:\n      wf.setnchannels(1)  # mono\n      wf.setsampwidth(2)  # 2 bytes per sample (16-bit)\n      wf.setframerate(24000)  # 24 kHz sample rate\n\n      total_frames = 0\n      chunk_counter = 0\n      for audio_chunk in syn_tokens: # output streaming\n         chunk_counter += 1\n         frame_count = len(audio_chunk) \u002F\u002F (wf.getsampwidth() * wf.getnchannels())\n         total_frames += frame_count\n         wf.writeframes(audio_chunk)\n      duration = total_frames \u002F wf.getframerate()\n\n      end_time = time.monotonic()\n      print(f\"It took {end_time - start_time} seconds to generate {duration:.2f} seconds of audio\")\n   ```\n\n#### Setup Issues\n\nIf you've cloned this repository and encounter a KV cache error or an error saying the `max_model_len` property does not exist, use the local package instead of the installed PyPI version:\n\n```python\nimport sys\nsys.path.insert(0, 'orpheus_tts_pypi')\nfrom orpheus_tts import OrpheusModel\n```\n\nThis ensures you're using the repository code, which may have fixes not yet published to PyPI. See [#290](https:\u002F\u002Fgithub.com\u002Fcanopyai\u002FOrpheus-TTS\u002Fissues\u002F290) for more details.\n\n#### Additional Functionality\n\n1. Watermark your audio: Use Silent Cipher to watermark your generated audio; see [Watermark Audio Implementation](additional_inference_options\u002Fwatermark_audio) for the implementation.\n\n2. For inference without a GPU using llama.cpp, see the [documentation](additional_inference_options\u002Fno_gpu\u002FREADME.md) for an example implementation.\n\n\n#### Prompting\n\n1. The `finetune-prod` models: for the primary model, your text prompt is formatted as `{name}: I went to the ...`. 
The options for name in order of conversational realism (subjective benchmarks) are \"tara\", \"leah\", \"jess\", \"leo\", \"dan\", \"mia\", \"zac\", \"zoe\" for English; each language has different voices ([see voices here](https:\u002F\u002Fcanopylabs.ai\u002Freleases\u002Forpheus_can_speak_any_language#info)). Our Python package does this formatting for you, and the notebook also prepends the appropriate string. You can additionally add the following emotive tags: `\u003Claugh>`, `\u003Cchuckle>`, `\u003Csigh>`, `\u003Ccough>`, `\u003Csniffle>`, `\u003Cgroan>`, `\u003Cyawn>`, `\u003Cgasp>`. For multilingual models, see this [post](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Fcanopylabs\u002Forpheus-multilingual-research-release-67f5894cd16794db163786ba) for supported tags.\n\n2. The pretrained model: you can either generate speech conditioned on text alone, or generate speech conditioned on one or more existing text-speech pairs in the prompt. Since this model hasn't been explicitly trained on the zero-shot voice cloning objective, the more text-speech pairs you pass in the prompt, the more reliably it will generate in the correct voice.\n\n\nAdditionally, use regular LLM generation args like `temperature`, `top_p`, etc. as you would for a regular LLM. `repetition_penalty>=1.1` is required for stable generations. Increasing `repetition_penalty` and `temperature` makes the model speak faster.\n\n\n## Finetune Model\n\nHere is an overview of how to finetune your model on any text and speech.\nThis is a very simple process analogous to tuning an LLM using Trainer and Transformers.\n\nYou should start to see high quality results after ~50 examples, but for best results, aim for 300 examples\u002Fspeaker.\n\n1. Your dataset should be a Hugging Face dataset in [this format](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fcanopylabs\u002Fzac-sample-dataset)\n2. 
We prepare the data using [this notebook](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1wg_CPCA-MzsWtsujwy-1Ovhv-tn8Q1nD?usp=sharing). This pushes an intermediate dataset to your Hugging Face account which you can feed to the training script in `finetune\u002Ftrain.py`. Preprocessing should take less than 1 minute\u002Fthousand rows.\n3. Modify the `finetune\u002Fconfig.yaml` file to include your dataset and training properties, and run the training script. You can additionally run any kind of Hugging Face compatible process like LoRA to tune the model.\n   ```bash\n   pip install transformers datasets wandb trl flash_attn torch\n   huggingface-cli login \u003Center your HF token>\n   wandb login \u003Cwandb token>\n   accelerate launch train.py\n   ```\n\n### Additional Resources\n1. [Finetuning with unsloth](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Funslothai\u002Fnotebooks\u002Fblob\u002Fmain\u002Fnb\u002FOrpheus_(3B)-TTS.ipynb)\n\n## Pretrain Model\n\nThis is a very simple process analogous to training an LLM using Trainer and Transformers.\n\nThe base model provided is trained on over 100k hours of speech data. We recommend not using synthetic data for training, as it produces worse results when you try to finetune specific voices, probably because synthetic voices lack diversity and map to the same set of tokens when tokenised (i.e. lead to poor codebook utilisation).\n\nWe train the 3b model on sequences of length 8192. The \u003CTTS-dataset> used for pretraining follows the same dataset format as TTS finetuning. We chain input_ids sequences together for more efficient training. The text dataset required is in the form described in issue [#37](https:\u002F\u002Fgithub.com\u002Fcanopyai\u002FOrpheus-TTS\u002Fissues\u002F37).\n\nIf you are doing extended training of this model, e.g. for another language or style, we recommend starting with finetuning only (no text dataset). The main idea behind the text dataset is discussed in the blog post. 
(tl;dr: the text dataset keeps the model from forgetting too much semantic\u002Freasoning ability, so it is better able to understand how to intone\u002Fexpress phrases when spoken. However, most of the forgetting happens very early in training, i.e. \u003C100000 rows, so unless you are doing very extended finetuning it may not make much of a difference.)\n\n## Also Check out\n\nWhile we can't verify these implementations are completely accurate\u002Fbug free, they have been recommended on a couple of forums, so we include them here:\n\n1. [A lightweight client for running Orpheus TTS locally using LM Studio API](https:\u002F\u002Fgithub.com\u002Fisaiahbjork\u002Forpheus-tts-local)\n2. [OpenAI-compatible FastAPI implementation](https:\u002F\u002Fgithub.com\u002FLex-au\u002FOrpheus-FastAPI)\n3. [HuggingFace Space kindly set up by MohamedRashad](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FMohamedRashad\u002FOrpheus-TTS)\n4. [Gradio WebUI that runs smoothly on WSL and CUDA](https:\u002F\u002Fgithub.com\u002FSaganaki22\u002FOrpheusTTS-WebUI)\n\n\n## Checklist\n\n- [x] Release 3b pretrained model and finetuned models\n- [ ] Release pretrained and finetuned models in sizes: 1b, 400m, 150m parameters\n- [ ] Fix glitch in realtime streaming package that occasionally skips frames.\n- [ ] Fix voice cloning Colab notebook implementation\n","Orpheus TTS is an open-source text-to-speech system built on the Llama-3b architecture that aims to generate speech close to natural human pronunciation. Its core features include zero-shot voice cloning, control of emotion and intonation through simple tags, and support for low-latency real-time applications, with streaming latency as low as about 100 ms. Technically, Orpheus uses large language models for speech synthesis, provides two English models (pretrained and finetuned), and has released a family of multilingual models to encourage exploration of broader use cases. The project suits applications that need high-quality, natural, and customizable speech output, such as customer service automation, virtual assistant development, and content creation.",2,"2026-06-11 03:40:10","high_star"]