[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-71958":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":23,"hasPages":23,"topics":25,"createdAt":10,"pushedAt":10,"updatedAt":26,"readmeContent":27,"aiSummary":28,"trendingCount":16,"starSnapshotCount":16,"syncStatus":29,"lastSyncTime":30,"discoverSource":31},71958,"moshi","kyutai-labs\u002Fmoshi","kyutai-labs","Moshi is a speech-text foundation model and full-duplex spoken dialogue framework. It uses Mimi, a state-of-the-art streaming neural audio codec.","",null,"Python",10368,968,98,68,0,29,64,181,87,118.96,"Apache License 2.0",false,"main",[],"2026-06-12 04:01:03","# Moshi: a speech-text foundation model for real time dialogue\n\n![precommit badge](https:\u002F\u002Fgithub.com\u002Fkyutai-labs\u002Fmoshi\u002Fworkflows\u002Fprecommit\u002Fbadge.svg)\n![rust ci badge](https:\u002F\u002Fgithub.com\u002Fkyutai-labs\u002Fmoshi\u002Fworkflows\u002FRust%20CI\u002Fbadge.svg)\n\n[[Read the paper]][moshi] [[Demo]](https:\u002F\u002Fmoshi.chat) [[Hugging Face]](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Fkyutai\u002Fmoshi-v01-release-66eaeaf3302bef6bd9ad7acd)\n\n[Moshi][moshi] is a speech-text foundation model and **full-duplex** spoken dialogue framework.\nIt uses [Mimi][moshi], a state-of-the-art streaming neural audio codec.\n[Talk to Moshi](https:\u002F\u002Fmoshi.chat) now in our live demo.\n\n## Organisation of the repository\n\nThere are three separate versions of the Moshi inference stack in this repo.\n\n\n- **[PyTorch](#pytorch-implementation): for research and tinkering.** The code 
is in the [`moshi\u002F`](moshi\u002F) directory.\n- **[MLX](#mlx-implementation-for-local-inference-on-macos): for on-device inference on iPhone and Mac.** The code is in the [`moshi_mlx\u002F`](moshi_mlx\u002F) directory.\n- **[Rust](#rust-implementation): for production.** The code is in the [`rust\u002F`](rust\u002F) directory.\n    This contains in particular a Mimi implementation in Rust, with Python bindings available\n    as `rustymimi`.\n\nFinally, the code for the web UI client used in the [Moshi demo](https:\u002F\u002Fmoshi.chat) is provided in the [`client\u002F`](client\u002F) directory.\n\nIf you want to fine-tune Moshi, head over to [kyutai-labs\u002Fmoshi-finetune](https:\u002F\u002Fgithub.com\u002Fkyutai-labs\u002Fmoshi-finetune).\n\n### Other Kyutai models\n\nThe Moshi codebase is also used to run related models from Kyutai that use a multi-stream architecture similar to Moshi:\n- **Hibiki: simultaneous speech translation.** Check out the [Hibiki repo](https:\u002F\u002Fgithub.com\u002Fkyutai-labs\u002Fhibiki) for more info.\n- **Kyutai Text-To-Speech and Speech-To-Text.** Check out the [Delayed Streams Modeling repo](https:\u002F\u002Fgithub.com\u002Fkyutai-labs\u002Fdelayed-streams-modeling) for more info.\n\n## Model architecture\n\nMoshi models **two streams of audio**: one corresponds to Moshi speaking, and the other one to the user speaking.\nAlong with these two audio streams, Moshi predicts text tokens corresponding to its own speech, its **inner monologue**,\nwhich greatly improves the quality of its generation.\nA small Depth Transformer models inter-codebook dependencies for a given time step,\nwhile a large, 7B-parameter Temporal Transformer models the temporal dependencies. 
Moshi achieves a theoretical latency\nof 160ms (80ms for the frame size of Mimi + 80ms of acoustic delay), with a practical overall latency as low as 200ms on an L4 GPU.\n\n\u003Cp align=\"center\">\n\u003Cimg src=\".\u002Fmoshi.png\" alt=\"Schema representing the structure of Moshi. Moshi models two streams of audio:\n    one corresponds to Moshi, and the other one to the user. At inference, the audio stream of the user is taken from the audio input, and the audio stream for Moshi is sampled from the model's output. Along that, Moshi predicts text tokens corresponding to its own speech for improved accuracy. A small Depth Transformer models inter codebook dependencies for a given step.\"\nwidth=\"650px\">\u003C\u002Fp>\n\n### Mimi\n\nMimi is a neural audio codec that processes 24 kHz audio, down to a 12.5 Hz representation\nwith a bandwidth of 1.1 kbps, in a fully streaming manner (latency of 80ms, the frame size),\nyet performs better than existing, non-streaming, codecs like\n[SpeechTokenizer](https:\u002F\u002Fgithub.com\u002FZhangXInFD\u002FSpeechTokenizer) (50 Hz, 4kbps), or [SemantiCodec](https:\u002F\u002Fgithub.com\u002Fhaoheliu\u002FSemantiCodec-inference) (50 Hz, 1.3kbps).\n\nMimi builds on previous neural audio codecs such as [SoundStream](https:\u002F\u002Farxiv.org\u002Fabs\u002F2107.03312)\nand [EnCodec](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fencodec), adding a Transformer both in the encoder and decoder,\nand adapting the strides to match an overall frame rate of 12.5 Hz. This allows Mimi to get closer to the\naverage frame rate of text tokens (~3-4 Hz), and limit the number of autoregressive steps in Moshi.\nSimilarly to SpeechTokenizer, Mimi uses a distillation loss so that the first codebook tokens match\na self-supervised representation from [WavLM](https:\u002F\u002Farxiv.org\u002Fabs\u002F2110.13900), which allows modeling semantic and acoustic information with a single model. 
Finally, and similarly to [EBEN](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2210.14090),\nMimi uses **only an adversarial training loss**, along with feature matching, showing strong improvements in terms of\nsubjective quality despite its low bitrate.\n\n\u003Cp align=\"center\">\n\u003Cimg src=\".\u002Fmimi.png\" alt=\"Schema representing the structure of Mimi, our proposed neural codec. Mimi contains a Transformer\nin both its encoder and decoder, and achieves a frame rate closer to that of text tokens. This allows us to reduce\nthe number of auto-regressive steps taken by Moshi, thus reducing the latency of the model.\"\nwidth=\"800px\">\u003C\u002Fp>\n\n## Models\n\nWe release three models:\n- Moshi fine-tuned on a male synthetic voice (Moshiko),\n- Moshi fine-tuned on a female synthetic voice (Moshika),\n- Mimi, our speech codec.\n\nThe available file formats and quantizations vary by backend. Here is the list\nof Hugging Face repos for each model. Mimi is bundled in each of them and always uses the same checkpoint format.\n\n- Moshika for PyTorch (bf16, int8): [kyutai\u002Fmoshika-pytorch-bf16](https:\u002F\u002Fhuggingface.co\u002Fkyutai\u002Fmoshika-pytorch-bf16), [kyutai\u002Fmoshika-pytorch-q8](https:\u002F\u002Fhuggingface.co\u002Fkyutai\u002Fmoshika-pytorch-q8) (experimental).\n- Moshiko for PyTorch (bf16, int8): [kyutai\u002Fmoshiko-pytorch-bf16](https:\u002F\u002Fhuggingface.co\u002Fkyutai\u002Fmoshiko-pytorch-bf16), [kyutai\u002Fmoshiko-pytorch-q8](https:\u002F\u002Fhuggingface.co\u002Fkyutai\u002Fmoshiko-pytorch-q8) (experimental).\n- Moshika for MLX (int4, int8, bf16): [kyutai\u002Fmoshika-mlx-q4](https:\u002F\u002Fhuggingface.co\u002Fkyutai\u002Fmoshika-mlx-q4), [kyutai\u002Fmoshika-mlx-q8](https:\u002F\u002Fhuggingface.co\u002Fkyutai\u002Fmoshika-mlx-q8),  [kyutai\u002Fmoshika-mlx-bf16](https:\u002F\u002Fhuggingface.co\u002Fkyutai\u002Fmoshika-mlx-bf16).\n- Moshiko for MLX (int4, int8, bf16): 
[kyutai\u002Fmoshiko-mlx-q4](https:\u002F\u002Fhuggingface.co\u002Fkyutai\u002Fmoshiko-mlx-q4), [kyutai\u002Fmoshiko-mlx-q8](https:\u002F\u002Fhuggingface.co\u002Fkyutai\u002Fmoshiko-mlx-q8),  [kyutai\u002Fmoshiko-mlx-bf16](https:\u002F\u002Fhuggingface.co\u002Fkyutai\u002Fmoshiko-mlx-bf16).\n- Moshika for Rust\u002FCandle (int8, bf16): [kyutai\u002Fmoshika-candle-q8](https:\u002F\u002Fhuggingface.co\u002Fkyutai\u002Fmoshika-candle-q8),  [kyutai\u002Fmoshika-candle-bf16](https:\u002F\u002Fhuggingface.co\u002Fkyutai\u002Fmoshika-candle-bf16).\n- Moshiko for Rust\u002FCandle (int8, bf16): [kyutai\u002Fmoshiko-candle-q8](https:\u002F\u002Fhuggingface.co\u002Fkyutai\u002Fmoshiko-candle-q8),  [kyutai\u002Fmoshiko-candle-bf16](https:\u002F\u002Fhuggingface.co\u002Fkyutai\u002Fmoshiko-candle-bf16).\n\nAll models are released under the CC-BY 4.0 license.\n\n## Requirements\n\nYou will need at least Python 3.10, with 3.12 recommended. For specific requirements, please check the individual backend\ndirectories. You can install the PyTorch and MLX clients with the following:\n\n```bash\npip install -U moshi      # moshi PyTorch, from PyPI\npip install -U moshi_mlx  # moshi MLX, from PyPI, best with Python 3.12.\n# Or the bleeding edge versions for Moshi and Moshi-MLX.\npip install -U -e \"git+https:\u002F\u002Fgit@github.com\u002Fkyutai-labs\u002Fmoshi.git#egg=moshi&subdirectory=moshi\"\npip install -U -e \"git+https:\u002F\u002Fgit@github.com\u002Fkyutai-labs\u002Fmoshi.git#egg=moshi_mlx&subdirectory=moshi_mlx\"\n\npip install rustymimi  # mimi, rust implementation with Python bindings from PyPI\n```\n\nIf you are not using Python 3.12, you might get an error when installing\n`moshi_mlx` or `rustymimi` (which `moshi_mlx` depends on). 
In that case, you will need to install the [Rust toolchain](https:\u002F\u002Frustup.rs\u002F), or switch to Python 3.12.\n\nWhile we hope that the present codebase will work on Windows, we do not provide official support for it.\nWe have tested the MLX version on a MacBook Pro M3. At the moment, we do not support quantization\nfor the PyTorch version, so you will need a GPU with a significant amount of memory (24GB).\n\nTo use the Rust backend, you will need a recent version of the [Rust toolchain](https:\u002F\u002Frustup.rs\u002F).\nTo compile GPU support, you will also need [CUDA](https:\u002F\u002Fdeveloper.nvidia.com\u002Fcuda-toolkit) properly installed for your GPU, in particular with `nvcc`.\n\n## PyTorch implementation\n\nThe PyTorch based API can be found in the `moshi` directory. It provides a streaming\nversion of the audio tokenizer (mimi) and the language model (moshi).\n\nTo run in interactive mode, you need to start a server that will\nrun the model; you can then use either the web UI or a command line client.\n\nStart the server with:\n```bash\npython -m moshi.server [--gradio-tunnel] [--hf-repo kyutai\u002Fmoshika-pytorch-bf16]\n```\n\nThen access the web UI on [localhost:8998](http:\u002F\u002Flocalhost:8998).\nIf your GPU is on a remote machine, this will not work because, for security reasons, websites using HTTP\nare not allowed to use the microphone. There are two ways to get around this:\n- Forward the remote 8998 port to your localhost using the ssh `-L` flag. Then\n  connect to [localhost:8998](http:\u002F\u002Flocalhost:8998) as mentioned previously.\n- Use the `--gradio-tunnel` argument, setting up a tunnel with a URL accessible from anywhere.\n  Keep in mind that this tunnel goes through the US and can add significant\n  latency (up to 500ms from Europe). 
You can use `--gradio-tunnel-token` to set a\n  fixed secret token and reuse the same address over time.\n\nYou can use `--hf-repo` to select a different pretrained model, by setting the proper Hugging Face repository.\n\nAccessing a server that is not localhost via http may cause issues with using\nthe microphone in the web UI (in some browsers this is only allowed using\nhttps).\n\nA command-line client is also available:\n```bash\npython -m moshi.client [--url URL_TO_GRADIO]\n```\nNote, however, that unlike the web UI, this client is barebones: it does not perform any echo cancellation,\nnor does it try to compensate for a growing lag by skipping frames.\n\nFor more information, in particular on how to use the API directly, please\ncheck out [moshi\u002FREADME.md](moshi\u002FREADME.md).\n\n## MLX implementation for local inference on macOS\n\nOnce you have installed `moshi_mlx`, you can run\n```bash\npython -m moshi_mlx.local -q 4   # weights quantized to 4 bits\npython -m moshi_mlx.local -q 8   # weights quantized to 8 bits\n# And using a different pretrained model:\npython -m moshi_mlx.local -q 4 --hf-repo kyutai\u002Fmoshika-mlx-q4\npython -m moshi_mlx.local -q 8 --hf-repo kyutai\u002Fmoshika-mlx-q8\n# be careful to always match the `-q` and `--hf-repo` flags.\n```\n\nThis command line interface is also barebones. 
It does not perform any echo cancellation,\nnor does it try to compensate for a growing lag by skipping frames.\n\nAlternatively, you can run `python -m moshi_mlx.local_web` to use\nthe web UI; the connection is via http and will be at [localhost:8998](http:\u002F\u002Flocalhost:8998).\n\n\n## Rust implementation\n\nIn order to run the Rust inference server, use the following command from within\nthe `rust` directory:\n\n```bash\ncargo run --features cuda --bin moshi-backend -r -- --config moshi-backend\u002Fconfig.json standalone\n```\n\nWhen using macOS, you can replace `--features cuda` with `--features metal`.\n\nAlternatively you can use `config-q8.json` rather than `config.json` to use the\nquantized q8 model. You can select a different pretrained model, e.g. Moshika,\nby changing the `\"hf_repo\"` key in either file.\n\nOnce the server has printed 'standalone worker listening', you can use the web\nUI. By default, the Rust server uses https, so it will be at\n[localhost:8998](https:\u002F\u002Flocalhost:8998).\n\nYou will get warnings about the site being unsafe. When using Chrome, you\ncan bypass these by selecting \"Details\" or \"Advanced\", then \"Visit this unsafe\nsite\" or \"Proceed to localhost (unsafe)\".\n\n## Clients\n\nWe recommend using the web UI as it provides additional echo cancellation that helps\nthe overall model quality. 
Note that most commands will directly serve this UI\nat the provided URL, and there is in general nothing more to do.\n\nAlternatively, we provide command line interfaces\nfor the Rust and Python versions; the protocol is the same as with the web UI, so\nthere is nothing to change on the server side.\n\nFor reference, here is the list of clients for Moshi.\n\n### Web UI\n\nThe web UI can be built from this repo via the\nfollowing steps (these require `npm` to be installed).\n```bash\ncd client\nnpm install\nnpm run build\n```\n\nThe web UI can then be found in the `client\u002Fdist` directory.\n\n### Rust Command Line\n\nFrom within the `rust` directory, run the following:\n```bash\ncargo run --bin moshi-cli -r -- tui --host localhost\n```\n\n### Python with PyTorch\n\n```bash\npython -m moshi.client\n```\n\n### Gradio Demo\n\nYou can launch a Gradio demo locally with the following command:\n\n```bash\npython -m moshi.client_gradio --url \u003Cmoshi-server-url>\n```\n\nPrior to running the Gradio demo, please install `gradio-webrtc>=0.0.18`.\n\n### Docker Compose (CUDA only)\n\n```bash\ndocker compose up\n```\n\n* Requires [NVIDIA Container Toolkit](https:\u002F\u002Fdocs.nvidia.com\u002Fdatacenter\u002Fcloud-native\u002Fcontainer-toolkit\u002Flatest\u002Finstall-guide.html)\n\n## Development\n\nIf you wish to install from a clone of this repository, maybe to further develop Moshi, you can do the following:\n```bash\n# From the root of the clone of the repo\npip install -e 'moshi[dev]'\npip install -e 'moshi_mlx[dev]'\npre-commit install\n```\n\nIf you wish to build `rustymimi` locally (assuming you have Rust properly installed):\n```bash\npip install maturin\nmaturin dev -r -m rust\u002Fmimi-pyo3\u002FCargo.toml\n```\n\n## FAQ\n\nCheck out the [Frequently Asked Questions](FAQ.md) section before opening an issue.\n\n\n## License\n\nThe present code is provided under the MIT license for the Python parts, and Apache license for the Rust backend.\nThe web client 
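code is provided under the MIT license.\nNote that parts of this code are based on [AudioCraft](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Faudiocraft), released under\nthe MIT license.\n\nThe weights for the models are released under the CC-BY 4.0 license.\n\n## Citation\n\nIf you use either Mimi or Moshi, please cite the following paper,\n\n```\n@techreport{kyutai2024moshi,\n      title={Moshi: a speech-text foundation model for real-time dialogue},\n      author={Alexandre D\\'efossez and Laurent Mazar\\'e and Manu Orsini and\n      Am\\'elie Royer and Patrick P\\'erez and Herv\\'e J\\'egou and Edouard Grave and Neil Zeghidour},\n      year={2024},\n      eprint={2410.00037},\n      archivePrefix={arXiv},\n      primaryClass={eess.AS},\n      url={https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.00037},\n}\n```\n\n[moshi]: https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.00037\n","Moshi is a speech-text foundation model and full-duplex spoken dialogue framework that uses Mimi, a state-of-the-art streaming neural audio codec. Its core functionality includes processing two audio streams (the user's speech and Moshi's) and predicting text tokens corresponding to Moshi's own speech, which improves generation quality. The architecture combines a Depth Transformer and a Temporal Transformer: the former models inter-codebook dependencies within a given time step, while the latter handles temporal dependencies, achieving a practical overall latency as low as 200 ms. Moshi is suited to applications that require real-time conversational interaction, such as customer-service chatbots and intelligent assistants. The project provides a PyTorch version for research, an MLX version for iOS and Mac devices, and a Rust implementation suited to production.",2,"2026-06-11 03:39:41","high_star"]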