[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-72379":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":23,"hasPages":23,"topics":25,"createdAt":10,"pushedAt":10,"updatedAt":26,"readmeContent":27,"aiSummary":28,"trendingCount":16,"starSnapshotCount":16,"syncStatus":29,"lastSyncTime":30,"discoverSource":31},72379,"delayed-streams-modeling","kyutai-labs\u002Fdelayed-streams-modeling","kyutai-labs","Kyutai's Speech-To-Text and Text-To-Speech models based on the Delayed Streams Modeling framework.","",null,"Python",2944,306,30,35,0,11,16,24,33,29.46,"Apache License 2.0",false,"main",[],"2026-06-12 02:03:02","# Delayed Streams Modeling: Kyutai STT & TTS\n\nThis repo contains instructions and examples of how to run\n[Kyutai Speech-To-Text](#kyutai-speech-to-text)\nand [Kyutai Text-To-Speech](#kyutai-text-to-speech) models.\nSee also [Unmute](https:\u002F\u002Fgithub.com\u002Fkyutai-labs\u002Funmute), a voice AI system built using Kyutai STT and Kyutai TTS.\n\nBut wait, what is \"Delayed Streams Modeling\"? It is a technique for solving many streaming X-to-Y tasks (with X, Y in `{speech, text}`)\nthat formalizes the approach we took with Moshi and Hibiki. 
See our [pre-print about DSM](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.08753).\n\n## Kyutai Speech-To-Text\n\n\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Fkyutai\u002Fspeech-to-text-685403682cf8a23ab9466886\" target=\"_blank\" style=\"margin: 2px;\">\n    \u003Cimg alt=\"Hugging Face\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%A4%97%20Hugging%20Face-KyutaiSTT-blue\" style=\"display: inline-block; vertical-align: middle;\"\u002F>\n\u003C\u002Fa>\n\u003Ca target=\"_blank\" href=\"https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fkyutai-labs\u002Fdelayed-streams-modeling\u002Fblob\u002Fmain\u002Fstt_pytorch.ipynb\">\n  \u003Cimg src=\"https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg\" alt=\"Open In Colab\"\u002F>\n\u003C\u002Fa>\n\n**More details can be found on the [project page](https:\u002F\u002Fkyutai.org\u002Fnext\u002Fstt).**\n\nKyutai STT models are optimized for real-time usage, can be batched for efficiency, and return word level timestamps.\nWe provide two models:\n- `kyutai\u002Fstt-1b-en_fr`, an English and French model with ~1B parameters, a 0.5 second delay, and a [semantic VAD](https:\u002F\u002Fkyutai.org\u002Fnext\u002Fstt#semantic-vad).\n- `kyutai\u002Fstt-2.6b-en`, an English-only model with ~2.6B parameters and a 2.5 second delay.\n\nThese speech-to-text models have several advantages:\n- Streaming inference: the models can process audio in chunks, which allows\n  for real-time transcription, and is great for interactive applications.\n- Easy batching for maximum efficiency: a H100 can process 400 streams in\n  real-time.\n- They return word-level timestamps.\n- The 1B model has a semantic Voice Activity Detection (VAD) component that\n  can be used to detect when the user is speaking. 
This is especially useful\n  for building voice agents.\n\n### Implementations overview\n\nWe provide different implementations of Kyutai STT for different use cases.\nHere is how to choose which one to use:\n\n- **PyTorch: for research and tinkering.**\n  If you want to call the model from Python for research or experimentation, use our PyTorch implementation.\n- **Rust: for production.**\n  If you want to serve Kyutai STT in a production setting, use our Rust server.\n  Our robust Rust server provides streaming access to the model over websockets.\n  We use this server to run [Unmute](https:\u002F\u002Funmute.sh\u002F); on a L40S GPU, we can serve 64 simultaneous connections at a real-time factor of 3x.\n- **MLX: for on-device inference on iPhone and Mac.**\n  MLX is Apple's ML framework that allows you to use hardware acceleration on Apple silicon.\n  If you want to run the model on a Mac or an iPhone, choose the MLX implementation.\n\n\u003Cdetails>\n\u003Csummary>PyTorch implementation\u003C\u002Fsummary>\n\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fkyutai\u002Fstt-2.6b-en\" target=\"_blank\" style=\"margin: 2px;\">\n    \u003Cimg alt=\"Hugging Face\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%A4%97%20Hugging%20Face-Model-blue\" style=\"display: inline-block; vertical-align: middle;\"\u002F>\n\u003C\u002Fa>\n\u003Ca target=\"_blank\" href=\"https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fkyutai-labs\u002Fdelayed-streams-modeling\u002Fblob\u002Fmain\u002Fstt_pytorch.ipynb\">\n  \u003Cimg src=\"https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg\" alt=\"Open In Colab\"\u002F>\n\u003C\u002Fa>\n\nFor an example of how to use the model in a way where you can directly stream in PyTorch tensors,\n[see our Colab notebook](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fkyutai-labs\u002Fdelayed-streams-modeling\u002Fblob\u002Fmain\u002Fstt_pytorch.ipynb).\n\nThis requires the [moshi 
package](https:\u002F\u002Fpypi.org\u002Fproject\u002Fmoshi\u002F)\nwith version 0.2.6 or later, which can be installed via pip.\n\nIf you just want to run the model on a file, you can use `moshi.run_inference`.\n\n```bash\npython -m moshi.run_inference --hf-repo kyutai\u002Fstt-2.6b-en audio\u002Fbria.mp3\n```\n\nIf you have [uv](https:\u002F\u002Fdocs.astral.sh\u002Fuv\u002F) installed, you can skip the installation step\nand just prefix the command above with `uvx --with moshi`.\n\nAdditionally, we provide two scripts that highlight different usage scenarios. The first script illustrates how to extract word-level timestamps from the model's outputs:\n\n```bash\nuv run \\\n  scripts\u002Fstt_from_file_pytorch.py \\\n  --hf-repo kyutai\u002Fstt-2.6b-en \\\n  audio\u002Fbria.mp3\n```\n\nThe second script can be used to run a model on an existing Hugging Face dataset and calculate its performance metrics:\n```bash\nuv run scripts\u002Fevaluate_on_dataset.py  \\\n  --dataset meanwhile  \\\n  --hf-repo kyutai\u002Fstt-2.6b-en\n```\n\nAnother example shows how to provide a text, audio, or combined text-and-audio prompt to our STT model:\n```bash\nuv run scripts\u002Fstt_from_file_pytorch_with_prompt.py \\\n  --hf-repo kyutai\u002Fstt-2.6b-en \\\n  --file bria.mp3 \\\n  --prompt_file .\u002Faudio\u002Floonah.mp3 \\\n  --prompt_text \"Loonah\" \\\n  --cut-prompt-transcript\n```\nThis produces the transcript of `bria.mp3` using the `Loonah` spelling for the name, instead of the `Luna` spelling used without any prompt:\n```\nIn the heart of an ancient forest, where the trees whispered secrets of the past, there lived a peculiar rabbit named Loonah (...)\n```\n\nApart from nudging the model towards a specific spelling of a word, other potential use cases include speaker adaptation and steering the model towards a specific formatting style or even a language.\nHowever, please bear in mind that this is an experimental feature and its behavior is very sensitive to the prompt 
provided.\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>Rust server\u003C\u002Fsummary>\n\n\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fkyutai\u002Fstt-2.6b-en-candle\" target=\"_blank\" style=\"margin: 2px;\">\n    \u003Cimg alt=\"Hugging Face\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%A4%97%20Hugging%20Face-Model-blue\" style=\"display: inline-block; vertical-align: middle;\"\u002F>\n\u003C\u002Fa>\n\nThe Rust implementation provides a server that can process multiple streaming\nqueries in parallel. Depending on the amount of memory on your GPU, you may\nhave to adjust the batch size in the config file. For a L40S GPU, a batch size\nof 64 works well and requests can be processed at 3x real-time speed.\n\nTo run the server, install the [moshi-server\ncrate](https:\u002F\u002Fcrates.io\u002Fcrates\u002Fmoshi-server) via the following command. The\nserver code can be found in the\n[kyutai-labs\u002Fmoshi](https:\u002F\u002Fgithub.com\u002Fkyutai-labs\u002Fmoshi\u002Ftree\u002Fmain\u002Frust\u002Fmoshi-server)\nrepository.\n```bash\ncargo install --features cuda moshi-server\n```\n\nThen the server can be started via the following command using the config file\nfrom this repository.\nFor `kyutai\u002Fstt-1b-en_fr`, use `configs\u002Fconfig-stt-en_fr-hf.toml`,\nand for `kyutai\u002Fstt-2.6b-en`, use `configs\u002Fconfig-stt-en-hf.toml`.\n\n```bash\nmoshi-server worker --config configs\u002Fconfig-stt-en_fr-hf.toml\n```\n\nOnce the server has started you can transcribe audio from your microphone with the following script.\n```bash\nuv run scripts\u002Fstt_from_mic_rust_server.py\n```\n\nWe also provide a script for transcribing from an audio file.\n```bash\nuv run scripts\u002Fstt_from_file_rust_server.py audio\u002Fbria.mp3\n```\n\nThe script limits the decoding speed to simulate real-time processing of the audio.\nFaster processing can be triggered by setting\nthe real-time factor, e.g. 
`--rtf 1000` will process\nthe data as fast as possible.\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>Rust standalone\u003C\u002Fsummary>\n\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fkyutai\u002Fstt-2.6b-en-candle\" target=\"_blank\" style=\"margin: 2px;\">\n    \u003Cimg alt=\"Hugging Face\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%A4%97%20Hugging%20Face-Model-blue\" style=\"display: inline-block; vertical-align: middle;\"\u002F>\n\u003C\u002Fa>\n\nA standalone Rust example script is provided in the `stt-rs` directory in this repo.\nThis can be used as follows:\n```bash\ncd stt-rs\ncargo run --features cuda -r -- ..\u002Faudio\u002Fbria.mp3\n```\nYou can get the timestamps by adding the `--timestamps` flag, and see the output\nof the semantic VAD by adding the `--vad` flag.\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>MLX implementation\u003C\u002Fsummary>\n\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fkyutai\u002Fstt-2.6b-en-mlx\" target=\"_blank\" style=\"margin: 2px;\">\n    \u003Cimg alt=\"Hugging Face\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%A4%97%20Hugging%20Face-Model-blue\" style=\"display: inline-block; vertical-align: middle;\"\u002F>\n\u003C\u002Fa>\n\n[MLX](https:\u002F\u002Fml-explore.github.io\u002Fmlx\u002Fbuild\u002Fhtml\u002Findex.html) is Apple's ML framework that allows you to use\nhardware acceleration on Apple silicon.\n\nThis requires the [moshi-mlx package](https:\u002F\u002Fpypi.org\u002Fproject\u002Fmoshi-mlx\u002F)\nwith version 0.2.6 or later, which can be installed via pip.\n\nIf you just want to run the model on a file, you can use `moshi_mlx.run_inference`:\n\n```bash\npython -m moshi_mlx.run_inference --hf-repo kyutai\u002Fstt-2.6b-en-mlx audio\u002Fbria.mp3 --temp 0\n```\n\nIf you have [uv](https:\u002F\u002Fdocs.astral.sh\u002Fuv\u002F) installed, you can skip the installation step\nand just prefix the command above with `uvx --with moshi-mlx`.\n\nIf 
you want to transcribe audio from your microphone, use:\n\n```bash\npython scripts\u002Fstt_from_mic_mlx.py\n```\n\nThe MLX models can also be used in Swift using the [moshi-swift\ncodebase](https:\u002F\u002Fgithub.com\u002Fkyutai-labs\u002Fmoshi-swift); the 1B model has been\ntested to work on an iPhone 16 Pro.\n\u003C\u002Fdetails>\n\n## Kyutai Text-to-Speech\n\n\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Fkyutai\u002Ftext-to-speech-6866192e7e004ed04fd39e29\" target=\"_blank\" style=\"margin: 2px;\">\n    \u003Cimg alt=\"Hugging Face\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%A4%97%20Hugging%20Face-KyutaiTTS-blue\" style=\"display: inline-block; vertical-align: middle;\"\u002F>\n\u003C\u002Fa>\n\u003Ca target=\"_blank\" href=\"https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fkyutai-labs\u002Fdelayed-streams-modeling\u002Fblob\u002Fmain\u002Ftts_pytorch.ipynb\">\n  \u003Cimg src=\"https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg\" alt=\"Open In Colab\"\u002F>\n\u003C\u002Fa>\n\n**More details can be found on the [project page](https:\u002F\u002Fkyutai.org\u002Fnext\u002Ftts).**\n\nWe provide different implementations of Kyutai TTS for different use cases. Here is how to choose which one to use:\n\n- PyTorch: for research and tinkering. If you want to call the model from Python for research or experimentation, use our PyTorch implementation.\n- Rust: for production. If you want to serve Kyutai TTS in a production setting, use our Rust server. Our robust Rust server provides streaming access to the model over websockets. We use this server to run Unmute.\n- MLX: for on-device inference on iPhone and Mac. MLX is Apple's ML framework that allows you to use hardware acceleration on Apple silicon. 
If you want to run the model on a Mac or an iPhone, choose the MLX implementation.\n\n\u003Cdetails>\n\u003Csummary>PyTorch implementation\u003C\u002Fsummary>\n\n\u003Ca target=\"_blank\" href=\"https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fkyutai-labs\u002Fdelayed-streams-modeling\u002Fblob\u002Fmain\u002Ftts_pytorch.ipynb\">\n  \u003Cimg src=\"https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg\" alt=\"Open In Colab\"\u002F>\n\u003C\u002Fa>\n\nCheck out our [Colab notebook](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fkyutai-labs\u002Fdelayed-streams-modeling\u002Fblob\u002Fmain\u002Ftts_pytorch.ipynb) or use the script:\n\n```bash\n# From stdin, plays audio immediately\necho \"Hey, how are you?\" | python scripts\u002Ftts_pytorch.py - -\n\n# From text file to audio file\npython scripts\u002Ftts_pytorch.py text_to_say.txt audio_output.wav\n```\n\nThe `tts_pytorch.py` script waits for all the text to be available before\nstarting the audio generation. 
A fully streaming implementation is available in\nthe `tts_pytorch_streaming.py` script, which can be used as follows:\n\n```bash\necho \"Hey, how are you?\" | python scripts\u002Ftts_pytorch_streaming.py audio_output.wav\n```\n\nThis requires the [moshi package](https:\u002F\u002Fpypi.org\u002Fproject\u002Fmoshi\u002F), which can be installed via pip.\nIf you have [uv](https:\u002F\u002Fdocs.astral.sh\u002Fuv\u002F) installed, you can skip the installation step\nand just prefix the command above with `uvx --with moshi`.\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>Rust server\u003C\u002Fsummary>\n\n\nThe Rust implementation provides a server that can process multiple streaming\nqueries in parallel.\n\nInstalling the Rust server is a bit tricky because it uses our Python implementation under the hood,\nwhich also requires installing the Python dependencies.\nUse the [start_tts.sh](https:\u002F\u002Fgithub.com\u002Fkyutai-labs\u002Funmute\u002Fblob\u002Fmain\u002Fdockerless\u002Fstart_tts.sh) script to properly install the Rust server.\nIf you already installed the `moshi-server` crate before and it's not working, you might need to force a reinstall by running `cargo uninstall moshi-server` first.\nFeel free to open an issue if the installation is still broken.\n\nOnce installed, the server can be started via the following command using the config file\nfrom this repository.\n\n```bash\nmoshi-server worker --config configs\u002Fconfig-tts.toml\n```\n\nOnce the server has started you can connect to it using our script as follows:\n```bash\n# From stdin, plays audio immediately\necho \"Hey, how are you?\" | python scripts\u002Ftts_rust_server.py - -\n\n# From text file to audio file\npython scripts\u002Ftts_rust_server.py text_to_say.txt audio_output.wav\n```\n\nYou can configure the server by modifying `configs\u002Fconfig-tts.toml`. 
See the comments in that file for the available options.\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>MLX implementation\u003C\u002Fsummary>\n\n[MLX](https:\u002F\u002Fml-explore.github.io\u002Fmlx\u002Fbuild\u002Fhtml\u002Findex.html) is Apple's ML framework that allows you to use\nhardware acceleration on Apple silicon.\n\nUse our example script to run Kyutai TTS on MLX.\nThe script takes text from stdin or a file and can output to a file or stream the resulting audio.\nWhen streaming the output, if the model is not fast enough to keep up with\nreal time, you can use the `--quantize 8` or `--quantize 4` flags to quantize\nthe model, resulting in faster inference.\n\n```bash\n# From stdin, plays audio immediately\necho \"Hey, how are you?\" | python scripts\u002Ftts_mlx.py - - --quantize 8\n\n# From text file to audio file\npython scripts\u002Ftts_mlx.py text_to_say.txt audio_output.wav\n```\n\nThis requires the [moshi-mlx package](https:\u002F\u002Fpypi.org\u002Fproject\u002Fmoshi-mlx\u002F), which can be installed via pip.\nIf you have [uv](https:\u002F\u002Fdocs.astral.sh\u002Fuv\u002F) installed, you can skip the installation step\nand just prefix the command above with `uvx --with moshi-mlx`.\n\u003C\u002Fdetails>\n\n## FAQ\n\nCheck out the [Frequently Asked Questions](FAQ.md) section before opening an issue.\n\n## License\n\nThe present code is provided under the MIT license for the Python parts, and the Apache license for the Rust backend.\nThe web client code is provided under the MIT license.\nNote that parts of this code are based on [AudioCraft](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Faudiocraft), released under\nthe MIT license.\n\nThe weights for the speech-to-text models are released under the CC-BY 4.0 license.\n\n## Developing\n\nInstall the [pre-commit hooks](https:\u002F\u002Fpre-commit.com\u002F) by running:\n\n```bash\npip install pre-commit\npre-commit install\n```\n\nIf you're using `uv`, you can replace the two commands 
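with `uvx pre-commit install`.\n\n## Citation\n\nPlease cite the following paper.\n```\n@techreport{kyutai2025streaming,\n      title={Streaming Sequence-to-Sequence Learning with Delayed Streams Modeling}, \n      author={Neil Zeghidour and Eugene Kharitonov and Manu Orsini and Václav Volhejn and Gabriel de Marmiesse and Edouard Grave and Patrick Pérez and Laurent Mazaré and Alexandre Défossez},\n      year={2025},\n      eprint={2509.08753},\n      archivePrefix={arXiv},\n      primaryClass={cs.CL},\n      url={https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.08753}, \n}\n```\n","This project provides Kyutai's speech-to-text (STT) and text-to-speech (TTS) models, built on the Delayed Streams Modeling framework. Its core features include real-time audio processing, batched processing for efficiency, and word-level timestamps. Technically, the project uses streaming inference, so transcription can begin as soon as audio chunks arrive, which suits applications that need immediate feedback, such as voice assistants. It also offers several implementations for different use cases: the PyTorch version is intended for research and experimentation; the Rust server is better suited to production deployment and provides stable streaming access over WebSockets; the MLX version is designed for on-device inference on Apple devices, using Apple silicon for hardware acceleration. The project fits any setting that needs a high-quality speech recognition or synthesis solution, especially latency-sensitive applications.",2,"2026-06-11 03:41:35","high_star"]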