[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-72162":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":25,"hasPages":23,"topics":26,"createdAt":10,"pushedAt":10,"updatedAt":31,"readmeContent":32,"aiSummary":33,"trendingCount":16,"starSnapshotCount":16,"syncStatus":34,"lastSyncTime":35,"discoverSource":36},72162,"ultravox","fixie-ai\u002Fultravox","fixie-ai","A fast multimodal LLM for real-time voice","https:\u002F\u002Fultravox.ai",null,"Python",4446,378,55,57,0,3,11,31,9,29.74,"MIT License",false,"main",true,[27,28,29,30],"ai","llm","slm","speech","2026-06-12 02:02:59","\u003Cp align=\"center\">\n  \u003Cpicture>\n    \u003Cimg alt=\"Ultravox\" src=\"https:\u002F\u002Fgo.ultravox.ai\u002Frs\u002F577-CTA-193\u002Fimages\u002Fuv-logo-600x140.png\">\n  \u003C\u002Fpicture>\n\u003C\u002Fp>\n\n\u003Ch3 align=\"center\">\nA fast multimodal LLM designed for real-time voice interactions\n\u003C\u002Fh3>\n\n_Latest News_\n* 2025\u002F12 - [Ultravox 0.7](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Ffixie-ai\u002Fultravox-v07) available\n* 2025\u002F06 — [Ultravox 0.6](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Ffixie-ai\u002Fultravox-v06-6865e7885cf904e21c31da03) available\n* 2025\u002F02 — [Ultravox 0.5](https:\u002F\u002Fgithub.com\u002Ffixie-ai\u002Fultravox\u002Freleases\u002Ftag\u002Fv0.5) available\n* 2024\u002F11 — [Ultravox 0.4.1](https:\u002F\u002Fgithub.com\u002Ffixie-ai\u002Fultravox\u002Freleases\u002Ftag\u002Fv0.4.1) available\n* 2024\u002F08 — [Ultravox 
0.4](https:\u002F\u002Fgithub.com\u002Ffixie-ai\u002Fultravox\u002Freleases\u002Ftag\u002Fv0.4) available\n* 2024\u002F08 — [Ultravox 0.3](https:\u002F\u002Fgithub.com\u002Ffixie-ai\u002Fultravox\u002Freleases\u002Ftag\u002Fv0.3) available\n* 2024\u002F08 — Preview of Ultravox APIs available, more information [here](https:\u002F\u002Ffixie-ai.github.io\u002Fultradox\u002F)\n\n_Key Links_\n* [Ultravox Realtime](https:\u002F\u002Fultravox.ai) — Build real-time Voice AI agents on top of the Ultravox model\n* [Hugging Face](https:\u002F\u002Fhuggingface.co\u002Ffixie-ai) — Our Hugging Face page\n\n---\n\n# About\n\nUltravox is a new kind of multimodal LLM that can understand text as well as human speech, without the need for a separate Automatic Speech Recognition (ASR) stage. Building on research like [AudioLM](https:\u002F\u002Farxiv.org\u002Fabs\u002F2209.03143), [SeamlessM4T](https:\u002F\u002Fai.meta.com\u002Fblog\u002Fseamless-m4t\u002F), [Gazelle](https:\u002F\u002Ftincans.ai\u002Fslm), [SpeechGPT](https:\u002F\u002Fgithub.com\u002F0nutation\u002FSpeechGPT\u002Ftree\u002Fmain\u002Fspeechgpt), and others, Ultravox can extend any open-weight LLM with a multimodal projector that converts audio directly into the high-dimensional space used by the LLM. We've trained versions on Llama 3, Mistral, and Gemma. This direct coupling allows Ultravox to respond much more quickly than systems that combine separate ASR and LLM components. In the future this will also allow Ultravox to natively understand the paralinguistic cues of timing and emotion that are omnipresent in human speech.\n\nUltravox currently takes in audio and emits streaming text. As we evolve the model, we'll train it to emit a stream of speech tokens that can then be converted directly into raw audio by an appropriate unit vocoder.\n\nOur default model is built on top of Llama 3.3 70B. We also have an 8B variant available on Hugging Face.\n\nUltravox can be trained against any open-weight model. 
See below for more details on training.\n\n### Demo\n\nSee Ultravox in action on our [demo page](https:\u002F\u002Fdemo.ultravox.ai). You can build your own voice-to-voice agents on our Realtime platform at ultravox.ai.\n\n\n### Discord\n\nJoin us on our Discord server [here](https:\u002F\u002Fdiscord.gg\u002FQw6KHxv8YB).\n\n### Jobs\n\nIf you're interested in working on Ultravox full-time, we're hiring! Check out our jobs page [here](https:\u002F\u002Fcareers.fixie.ai).\n\n### Inference Server\n\nYou can try out Ultravox using your own audio content (as a WAV file) by spinning up an Ultravox instance on our partner, BaseTen: [https:\u002F\u002Fwww.baseten.co\u002Flibrary\u002Fultravox\u002F](https:\u002F\u002Fwww.baseten.co\u002Flibrary\u002Fultravox\u002F). They offer free credits to get started.\n\nIf you're interested in running Ultravox in a real-time capacity, we offer a set of managed APIs as well. You can learn more about getting access to those [here](https:\u002F\u002Fdocs.ultravox.ai).\n\n### Model\n\nYou can download the latest weights from the [Ultravox Hugging Face page](https:\u002F\u002Fhuggingface.co\u002Ffixie-ai\u002F).\n\n### Architecture\n\n[![architecture diagram](https:\u002F\u002Fraw.githubusercontent.com\u002Ffixie-ai\u002Fultravox\u002Fmain\u002Fdocs\u002Fassets\u002FUltravox%20Model%20Architecture.svg)](https:\u002F\u002Fdocs.google.com\u002Fpresentation\u002Fd\u002F1ey81xuuMzrJaBwztb_Rq24Cit37GQokD2aAes_KkGVI\u002Fedit)\n\n# Contributing\n\nRead on if you're interested in training your own version of Ultravox.\n\n## Environment Setup (Mac)\n\nInstall the basic tools:\n\n- [`Homebrew`](https:\u002F\u002Fbrew.sh) is a package manager for macOS that also mostly works on Linux. If you're running Debian or Ubuntu Linux, you can alternatively get by with apt.\n- [`Just`](https:\u002F\u002Fjust.systems\u002Fman\u002Fen\u002F) simplifies our shell workflows. 
It frequently functions as our interface to all the other tools.\n\n```bash\n\u002Fbin\u002Fbash -c \"$(curl -fsSL https:\u002F\u002Fraw.githubusercontent.com\u002FHomebrew\u002Finstall\u002FHEAD\u002Finstall.sh)\"\nbrew update\nbrew install just\n```\n\nIt's recommended to use pyenv for managing environments due to the use of Poetry:\n\n```bash\nbrew install xz\nbrew install pyenv\npyenv init\npyenv install 3.11\npyenv global 3.11\n\n# Optional\npyenv shell 3.11\n```\n\n>**Note**: Use of conda is NOT recommended with Poetry\n\nAfter creating a virtual environment, install required packages using `just` and `poetry`:\n\n```bash\njust install\n```\n\nIf you plan to use augmentations (optional), you may also want to install the system packages they require. You can do that with `just install-augs-system`. Read more about augmentations [here](ultravox\u002Fdata\u002Faug\u002FAUGMENTATIONS.md).\n\nWe're using Poetry to manage the Python virtual environment. You can inspect your environment with `poetry env info`.\n\n## Training\n\nCurrently, we keep both the LLM and the audio encoder frozen and only train the adapter\u002Fprojector. Training Ultravox v0.4 took 2-3 hours on 8xH100 GPUs for 14K training steps.\n\n### Use-Cases for Training Ultravox\n\nWhy would you want to (re-)train Ultravox? Here are a few scenarios:\n\n1. You want to use a different LLM or audio encoder backbone.\n\n   a. In this case you need to re-train the adapter. You can use `example_config.yaml`, which contains our config for our latest release, and you should be able to simply change the base LLM or encoder by specifying `--text-model \u003Chf-model-id-for-llm>` and\u002For `--audio-model \u003Chf-model-id-for-encoder>`.\n\n2. You want to improve the knowledge of the model.\n\n    a. We suggest either using RAG on the fly (no training needed) or fine-tuning the LLM backbone instead. 
Fine-tuning the LLM backbone does not require re-training Ultravox (i.e., the existing adapter will work).\n\n3. You want to use your own audio data, for example to add support for a new language.\n\n   a. First, prepare your dataset: at a bare minimum, each sample should have an `audio` field and a text `continuation` field.\n\n   b. Take a look at [`ds_tool.py`](ultravox\u002Ftools\u002Fds_tool\u002Fds_tool.py) and [`continuation.jinja`](ultravox\u002Ftools\u002Fds_tool\u002Fcontinuation.jinja) as well as [our variant of Common Voice](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Ffixie-ai\u002Fcommon_voice_17_0\u002Fviewer\u002Ffr) that was created using `ds_tool` to add the `continuation` field.\n\n   c. Add your dataset to the dataset mix in `example_config.yaml` and train.\n\nThere's no one-size-fits-all approach. If you need help you can find us on our Discord server [here](https:\u002F\u002Fdiscord.gg\u002FQw6KHxv8YB).\n\n### How to Train\n\nWe do most of our training on the [MosaicML platform](https:\u002F\u002Fdocs.mosaicml.com) and therefore most of our tooling and docs are Mosaic-related. The MosaicML platform is being shut down at the end of July 2025, but we have left the configs in here as a reference for the commands. You can do the same training on your own GPU without much difficulty. Here we assume you have the environment set up (run `just install`). You can also take a look at [`setup.sh`](setup.sh).\n\nTo kick off a training run you can do:\n\n```bash\npoetry run python -m ultravox.training.train --config_path ultravox\u002Ftraining\u002Fconfigs\u002Fexample_config.yaml\n```\n\nFor DDP training make sure to add `torchrun`. 
We also recommend prefetching weights in advance:\n\n```bash\nTRAIN_ARGS=\"--config_path ultravox\u002Ftraining\u002Fconfigs\u002Fexample_config.yaml\"\npoetry run python -m ultravox.training.helpers.prefetch_weights $TRAIN_ARGS\npoetry run torchrun --nproc_per_node=8 -m ultravox.training.train $TRAIN_ARGS\n```\n\nFor a debug run, you can use smaller models, datasets, or batch sizes. Here's a config that uses TinyLlama as the LLM backbone:\n\n```bash\npoetry run python -m ultravox.training.train --config_path ultravox\u002Ftraining\u002Fconfigs\u002Fasr_tinyllama_100s.yaml --batch_size 1 --report_logs_to tensorboard\n```\n\nWe use [SimpleParsing](https:\u002F\u002Fgithub.com\u002Flebrice\u002Fsimpleparsing\u002F) for configs. Configs are composable (i.e. you can specify zero or many configs) and `meta_config.yaml` is always used as the default.\nSee [`config_base.py`](ultravox\u002Ftraining\u002Fconfig_base.py) to find the parameters you can modify, such as `--text-model`, `--device`, `--exp-name`, etc.\n\n#### Multi-node training\n\n- For multi-node training, all you need to do is update the `compute.gpus` line in `mcli_train.yaml` to get more GPUs for training\n  - All factors of 8 are supported\n- For more than 4 nodes, you might need to increase `val_dataset_args.max_samples`\n\n**NOTE:** W&B doesn't currently support multiple nodes. You'll only get info from the main node. It's possible to support it with grouped runs, so let us know if you think this is important for you.\n\n### Running evaluations\n\nFor inference or evaluations, you can use:\n\n```bash\njust eval --config_path ultravox\u002Fevaluation\u002Fconfigs\u002Feval_config.yaml\n```\n\nwhere `eval_config.yaml` is a config file that specifies the model, datasets, and configurations to use for inference or evaluation. 
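If your dataset is not already defined in ultravox, you need to create a config file for your dataset in `ultravox\u002Fdata\u002Fconfigs\u002F` (with the appropriate `eval_config` field to specify evaluation metrics and arguments), and register it in `ultravox\u002Fdata\u002Fregistry.py`. Please refer to examples in `ultravox\u002Fdata\u002Fconfigs\u002F`.\n\n## Misc\n\nThe [Justfile](Justfile) is a good resource for finding popular commands. Here are a few:\n\n```bash\njust update    # update dependencies\njust format    # run formatting (black, isort, autoflake)\njust test      # run tests\njust python    # activate venv and run python\n```\n","Ultravox is a multimodal large language model designed for real-time voice interaction. Its core capability is processing audio input directly and generating text output without a separate speech recognition step, which yields faster responses. Trained on top of models such as Llama 3, Mistral, and Gemma, Ultravox uses a multimodal projector to convert audio directly into the high-dimensional representation space used by the LLM. This lets it understand human speech as well as text, and in the future it is expected to directly understand and generate speech that carries timing and emotional cues. The technology is particularly well suited to applications that need fast speech-to-text conversion, such as intelligent customer service and voice assistants. The project is written in Python and released under the MIT License, ensuring broad open-source community participation and flexibility of use.",2,"2026-06-11 03:40:37","high_star"]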