[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-83083":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":16,"stars7d":17,"stars30d":17,"stars90d":15,"forks30d":15,"starsTrendScore":18,"compositeScore":19,"rankGlobal":9,"rankLanguage":9,"license":20,"archived":21,"fork":21,"defaultBranch":22,"hasWiki":23,"hasPages":21,"topics":24,"createdAt":9,"pushedAt":9,"updatedAt":25,"readmeContent":26,"aiSummary":27,"trendingCount":15,"starSnapshotCount":15,"syncStatus":28,"lastSyncTime":29,"discoverSource":30},83083,"MisoTTS","MisoLabsAI\u002FMisoTTS","MisoLabsAI","Miso TTS is an 8 billion, highly emotive text-to-speech model",null,"Python",2688,242,10,9,0,77,889,523,29.16,"Other",false,"main",true,[],"2026-06-12 02:04:30","\u003Cdiv align=\"center\">\n\n\u003Cimg src=\"images\u002Frepo_banner.png\" alt=\"Miso TTS 8B\" width=\"100%\">\n\n# Miso TTS 8B\n\n### State-of-the-Art Text-to-Speech Model\n\n\u003Cp>\n  \u003Ca href=\"https:\u002F\u002Fmisolabs.ai\">\u003Cimg alt=\"Website\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FWebsite-misolabs.ai-black?style=for-the-badge\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FMisoLabs\u002FMisoTTS\">\u003Cimg alt=\"Hugging Face\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FHugging%20Face-MisoTTS-yellow?style=for-the-badge\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FMisoLabsAI\">\u003Cimg alt=\"GitHub\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FGitHub-MisoLabsAI-181717?style=for-the-badge&logo=github&labelColor=555555\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fx.com\u002FMisoLabsAI\">\u003Cimg alt=\"X\" 
src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F-MisoLabsAI-181717?style=for-the-badge&logo=x&labelColor=555555\">\u003C\u002Fa>\n\u003C\u002Fp>\n\n\u003Cp>\n  \u003Ca href=\"#quickstart\">Quickstart\u003C\u002Fa> |\n  \u003Ca href=\"#model-introduction\">Model Introduction\u003C\u002Fa> |\n  \u003Ca href=\"#model-summary\">Model Summary\u003C\u002Fa> |\n  \u003Ca href=\"#usage\">Usage\u003C\u002Fa> |\n  \u003Ca href=\"#safety\">Safety\u003C\u002Fa>\n\u003C\u002Fp>\n\n\u003C\u002Fdiv>\n\n---\n\n## Quickstart\n\nTo quickly try the model, you can use the demo hosted on our [landing page](https:\u002F\u002Fmisolabs.ai)\nat misolabs.ai. To try it locally, follow the instructions below.\n\nIf you do not have `uv` installed yet:\n\n```bash\ncurl -LsSf https:\u002F\u002Fastral.sh\u002Fuv\u002Finstall.sh | sh\n```\n\nThen clone the repository and create the environment:\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FMisoLabsAI\u002FMisoTTS.git\ncd MisoTTS\nuv sync --python 3.10\nsource .venv\u002Fbin\u002Factivate\n```\n\nThen run the example conversation. By default, `run_misotts.py` loads the public\nmodel from [MisoLabs\u002FMisoTTS](https:\u002F\u002Fhuggingface.co\u002FMisoLabs\u002FMisoTTS) and\ndownloads it into the Hugging Face cache if it is not already present on your\nmachine:\n\n```bash\nuv run python run_misotts.py\n```\n\nThe script writes `full_conversation.wav` in the repository root.\n\nWith `pip` instead of `uv`:\n\n```bash\npython3.10 -m venv .venv\nsource .venv\u002Fbin\u002Factivate\npip install -e .\npython run_misotts.py\n```\n\n---\n\n## Model Introduction\n\nMiso TTS 8B is a text-to-dialogue RVQ Transformer inspired by the Sesame CSM architecture. It\ngenerates Mimi audio codes from text and optional audio context, using a large\nLlama 3.2-style backbone and a smaller autoregressive audio decoder. 
To find out more\nabout the architecture, read [our blog post](https:\u002F\u002Fmisolabs.ai\u002Fblog\u002Fmiso-tts-8b).\n\nThe model is designed for high-quality conversational speech generation.\nThis repository contains the inference\ncode, model definition, and setup instructions for running Miso TTS locally.\n\n> **Language support:** Miso TTS 8B currently supports **English only**.\n\n---\n\n## Model Summary\n\n| Item                | Value           |\n| ------------------- | --------------- |\n| Model               | Miso TTS 8B     |\n| Organization        | Miso Labs       |\n| Task                | Text-to-speech  |\n| Architecture        | RVQ Transformer |\n| Backbone            | `llama-8B`      |\n| Audio decoder       | `llama-300M`    |\n| Text vocabulary     | `128,256`       |\n| Audio vocabulary    | `2,051`         |\n| Audio codebooks     | `32`            |\n| Audio tokenizer     | Mimi            |\n| Max sequence length | `2,048`         |\n| Languages           | English only    |\n\n### Architecture\n\nMiso TTS 8B uses two transformer components:\n\n- A large backbone transformer that consumes text\u002Faudio-frame embeddings.\n- A smaller decoder transformer that autoregressively predicts higher-order\n  audio codebooks within each frame.\n\nThe backbone accepts interleaved text and audio tokens, allowing it to condition its generations on\nthe conversation history.\n\n---\n\n## Usage\n\n### Python\n\n```python\nimport torch\nimport torchaudio\n\nfrom generator import load_miso_8b\n\ndevice = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n\ngenerator = load_miso_8b(\n    device=device,\n    model_path_or_repo_id=\"MisoLabs\u002FMisoTTS\",\n)\n\naudio = generator.generate(\n    text=\"Hello from Miso.\",\n    speaker=0,\n    context=[],\n    max_audio_length_ms=10_000,\n)\n\ntorchaudio.save(\"miso.wav\", audio.unsqueeze(0).cpu(), generator.sample_rate)\n```\n\n### Prompted generation\n\nMiso TTS can condition on prior audio for 
voice cloning.\nThis is optional; the quickstart example above runs without\nprompt audio.\n\n```python\nimport torchaudio\n\nfrom generator import Segment, load_miso_8b\n\ngenerator = load_miso_8b(device=\"cuda\")\n\nprompt_audio, sample_rate = torchaudio.load(\"prompt.wav\")\nprompt_audio = torchaudio.functional.resample(\n    prompt_audio.squeeze(0),\n    orig_freq=sample_rate,\n    new_freq=generator.sample_rate,\n)\n\ncontext = [\n    Segment(\n        speaker=0,\n        text=\"This is the transcript for the prompt audio.\",\n        audio=prompt_audio,\n    )\n]\n\naudio = generator.generate(\n    text=\"This is the next sentence to synthesize.\",\n    speaker=0,\n    context=context,\n    max_audio_length_ms=10_000,\n)\n```\n\n---\n\n## Weights\n\nThe model weights are hosted publicly on Hugging Face:\n\n```bash\nuv run python run_misotts.py\n```\n\nThe default model repository is\n[MisoLabs\u002FMisoTTS](https:\u002F\u002Fhuggingface.co\u002FMisoLabs\u002FMisoTTS). The first run\ndownloads the model automatically through Hugging Face Hub; later runs reuse the\ncached copy.\n\nThe first run also downloads the SilentCipher watermarking model from\n`sony\u002Fsilentcipher`. If that separate download times out, rerun the command; the\nHugging Face cache resumes from files that already completed.\n\n---\n\n## Deployment Notes\n\nMiso TTS 8B is a large model. For best results, use a CUDA GPU with sufficient\nVRAM for the checkpoint precision you are loading. The default inference path\nuses `torch.bfloat16`.\n\n---\n\n## Safety\n\nMiso TTS is a speech generation model. Do not use it to impersonate people,\ncreate deceptive audio, commit fraud, or generate harmful content.\n\nGenerated audio is watermarked by default. 
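If you deploy this model in another\napplication, use your own private watermark key and keep it secret.\n\n---\n\n## Links\n\n- Website: [misolabs.ai](https:\u002F\u002Fmisolabs.ai)\n- Hugging Face: [MisoLabs\u002FMisoTTS](https:\u002F\u002Fhuggingface.co\u002FMisoLabs\u002FMisoTTS)\n- GitHub: [MisoLabsAI](https:\u002F\u002Fgithub.com\u002FMisoLabsAI)\n- X: [@MisoLabsAI](https:\u002F\u002Fx.com\u002FMisoLabsAI)\n","Miso TTS is a highly emotive text-to-speech model with 8 billion parameters. It is based on an RVQ Transformer architecture and uses a large Llama 3.2-style backbone and a smaller autoregressive audio decoder to generate high-quality conversational audio. The model generates emotionally expressive speech from text and can accept optional audio context to improve output quality. Miso TTS is particularly well suited to applications that need natural, fluent, emotionally rich speech synthesis, such as virtual assistants, audiobook production, and game character voice-over. It currently supports English only.",2,"2026-06-11 04:10:04","CREATED_QUERY"]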