[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-72342":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":16,"stars7d":16,"stars30d":17,"stars90d":16,"forks30d":16,"starsTrendScore":16,"compositeScore":18,"rankGlobal":10,"rankLanguage":10,"license":19,"archived":20,"fork":20,"defaultBranch":21,"hasWiki":20,"hasPages":20,"topics":22,"createdAt":10,"pushedAt":10,"updatedAt":29,"readmeContent":30,"aiSummary":31,"trendingCount":16,"starSnapshotCount":16,"syncStatus":17,"lastSyncTime":32,"discoverSource":33},72342,"LLaMA-Omni","ictnlp\u002FLLaMA-Omni","ictnlp","LLaMA-Omni is a low-latency and high-quality end-to-end speech interaction model built upon Llama-3.1-8B-Instruct, aiming to achieve speech capabilities at the GPT-4o level.","https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.06666",null,"Python",3142,223,34,51,0,2,59.25,"Apache License 2.0",false,"main",[23,24,25,26,27,28],"large-language-models","multimodal-large-language-models","speech-interaction","speech-language-model","speech-to-speech","speech-to-text","2026-06-12 04:01:04","# 🦙🎧 LLaMA-Omni: Seamless Speech Interaction with Large Language Models\n\n> **Authors: [Qingkai Fang](https:\u002F\u002Ffangqingkai.github.io\u002F), [Shoutao Guo](https:\u002F\u002Fscholar.google.com\u002Fcitations?hl=en&user=XwHtPyAAAAAJ), [Yan Zhou](https:\u002F\u002Fzhouyan19.github.io\u002Fzhouyan\u002F), [Zhengrui Ma](https:\u002F\u002Fscholar.google.com.hk\u002Fcitations?user=dUgq6tEAAAAJ), [Shaolei Zhang](https:\u002F\u002Fzhangshaolei1998.github.io\u002F), [Yang 
Feng*](https:\u002F\u002Fpeople.ucas.edu.cn\u002F~yangfeng?language=en)**\n\n[![arXiv](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2409.06666-b31b1b.svg?logo=arXiv)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.06666)\n[![code](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FGithub-Code-keygen.svg?logo=github)](https:\u002F\u002Fgithub.com\u002Fictnlp\u002FLLaMA-Omni)\n[![model](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%A4%97%20Hugging_Face-Model-blue.svg)](https:\u002F\u002Fhuggingface.co\u002FICTNLP\u002FLlama-3.1-8B-Omni)\n[![dataset](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%A4%97%20Hugging_Face-Dataset-blue.svg)](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FICTNLP\u002FMultiturn-Speech-Conversations)\n[![ModelScope](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FModelScope-Model-blue.svg)](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FICTNLP\u002FLlama-3.1-8B-Omni)\n[![Wisemodel](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FWisemodel-Model-blue.svg)](https:\u002F\u002Fwww.wisemodel.cn\u002Fmodels\u002FICT_NLP\u002FLlama-3.1-8B-Omni\u002F)\n[![Replicate](https:\u002F\u002Freplicate.com\u002Fictnlp\u002Fllama-omni\u002Fbadge)](https:\u002F\u002Freplicate.com\u002Fictnlp\u002Fllama-omni)\n\n\nLLaMA-Omni is a speech-language model built upon Llama-3.1-8B-Instruct. It supports low-latency and high-quality speech interactions, simultaneously generating both text and speech responses based on speech instructions.\n\n\u003Cdiv align=\"center\">\u003Cimg src=\"images\u002Fmodel.png\" width=\"75%\"\u002F>\u003C\u002Fdiv>\n\n\n## 🔥 News\n\n- [25\u002F05] LLaMA-Omni 2 is accepted at ACL 2025 main conference!\n- [25\u002F05] An improved version of InstructS2S-200K is publicly available at [this link](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FICTNLP\u002FMultiturn-Speech-Conversations). We have extended it to multi-turn conversations and diversified the input speech timbres. 
Sorry for the long wait!\n- [25\u002F04] We release [LLaMA-Omni2](https:\u002F\u002Fgithub.com\u002Fictnlp\u002FLLaMA-Omni2), a series of speech language models ranging from 0.5B to 32B parameters, offering improved response quality and speech generation quality.\n- [25\u002F01] LLaMA-Omni is accepted at ICLR 2025! See you in Singapore!\n\n  \n## 💡 Highlights\n\n- 💪 **Built on Llama-3.1-8B-Instruct, ensuring high-quality responses.**\n\n- 🚀 **Low-latency speech interaction with a latency as low as 226ms.**\n\n- 🎧 **Simultaneous generation of both text and speech responses.**\n\n- ♻️ **Trained in less than 3 days using just 4 GPUs.**\n\nhttps:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F2b097af8-47d7-494f-b3b3-6be17ca0247a\n\n## Install\n\n1. Clone this repository.\n\n```shell\ngit clone https:\u002F\u002Fgithub.com\u002Fictnlp\u002FLLaMA-Omni\ncd LLaMA-Omni\n```\n\n2. Install packages.\n\n```shell\nconda create -n llama-omni python=3.10\nconda activate llama-omni\npip install pip==24.0\npip install -e .\n```\n\n3. Install `fairseq`.\n\n```shell\ngit clone https:\u002F\u002Fgithub.com\u002Fpytorch\u002Ffairseq\ncd fairseq\npip install -e . --no-build-isolation\n```\n\n4. Install `flash-attention`.\n\n```shell\npip install flash-attn --no-build-isolation\n```\n\n## Quick Start\n\n1. Download the `Llama-3.1-8B-Omni` model from 🤗[Hugging Face](https:\u002F\u002Fhuggingface.co\u002FICTNLP\u002FLlama-3.1-8B-Omni). \n\n2. Download the `Whisper-large-v3` model.\n\n```python\nimport whisper\nmodel = whisper.load_model(\"large-v3\", download_root=\"models\u002Fspeech_encoder\u002F\")\n```\n\n3. 
Download the unit-based HiFi-GAN vocoder.\n\n```shell\nwget https:\u002F\u002Fdl.fbaipublicfiles.com\u002Ffairseq\u002Fspeech_to_speech\u002Fvocoder\u002Fcode_hifigan\u002Fmhubert_vp_en_es_fr_it3_400k_layer11_km1000_lj\u002Fg_00500000 -P vocoder\u002F\nwget https:\u002F\u002Fdl.fbaipublicfiles.com\u002Ffairseq\u002Fspeech_to_speech\u002Fvocoder\u002Fcode_hifigan\u002Fmhubert_vp_en_es_fr_it3_400k_layer11_km1000_lj\u002Fconfig.json -P vocoder\u002F\n```\n\n## Gradio Demo\n\n1. Launch a controller.\n```shell\npython -m omni_speech.serve.controller --host 0.0.0.0 --port 10000\n```\n\n2. Launch a gradio web server.\n```shell\npython -m omni_speech.serve.gradio_web_server --controller http:\u002F\u002Flocalhost:10000 --port 8000 --model-list-mode reload --vocoder vocoder\u002Fg_00500000 --vocoder-cfg vocoder\u002Fconfig.json\n```\n\n3. Launch a model worker.\n```shell\npython -m omni_speech.serve.model_worker --host 0.0.0.0 --controller http:\u002F\u002Flocalhost:10000 --port 40000 --worker http:\u002F\u002Flocalhost:40000 --model-path Llama-3.1-8B-Omni --model-name Llama-3.1-8B-Omni --s2s\n```\n\n4. Visit [http:\u002F\u002Flocalhost:8000\u002F](http:\u002F\u002Flocalhost:8000\u002F) and interact with LLaMA-3.1-8B-Omni!\n\n**Note: Due to the instability of streaming audio playback in Gradio, we have only implemented streaming audio synthesis without enabling autoplay. If you have a good solution, feel free to submit a PR. Thanks!**\n\n## Local Inference\n\nTo run inference locally, please organize the speech instruction files according to the format in the `omni_speech\u002Finfer\u002Fexamples` directory, then refer to the following script.\n```shell\nbash omni_speech\u002Finfer\u002Frun.sh omni_speech\u002Finfer\u002Fexamples\n```\n\n## LICENSE\n\nOur code is released under the Apache-2.0 License. 
Our model is intended for academic research purposes only and may **NOT** be used for commercial purposes.\n\nYou are free to use, modify, and distribute this model in academic settings, provided that the following conditions are met:\n\n- **Non-commercial use**: The model may not be used for any commercial purposes.\n- **Citation**: If you use this model in your research, please cite the original work.\n\n### Commercial Use Restriction\n\nFor any commercial use inquiries or to obtain a commercial license, please contact `fengyang@ict.ac.cn`.\n\n## Acknowledgements\n\n- [LLaVA](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA): The codebase we built upon.\n- [SLAM-LLM](https:\u002F\u002Fgithub.com\u002FX-LANCE\u002FSLAM-LLM): We borrowed some code for the speech encoder and speech adaptor.\n\n## Citation\n\nIf you have any questions, please feel free to submit an issue or contact `fangqingkai21b@ict.ac.cn`.\n\nIf our work is useful for you, please cite as:\n\n```\n@article{fang-etal-2024-llama-omni,\n  title={LLaMA-Omni: Seamless Speech Interaction with Large Language Models},\n  author={Fang, Qingkai and Guo, Shoutao and Zhou, Yan and Ma, Zhengrui and Zhang, Shaolei and Feng, Yang},\n  journal={arXiv preprint arXiv:2409.06666},\n  year={2024}\n}\n```\n\n## Star History\n\n[![Star History Chart](https:\u002F\u002Fapi.star-history.com\u002Fsvg?repos=ictnlp\u002Fllama-omni&type=Date)](https:\u002F\u002Fstar-history.com\u002F#ictnlp\u002Fllama-omni&Date)\n","LLaMA-Omni is an end-to-end speech interaction model built on Llama-3.1-8B-Instruct that aims to achieve GPT-4o-level speech capabilities. Its core features are low-latency (as low as 226 ms), high-quality speech interaction and the simultaneous generation of text and speech responses. The model was trained in less than 3 days using only 4 GPUs, which demonstrates efficient training. LLaMA-Omni suits applications that require real-time spoken communication, such as intelligent assistants, customer service systems, and multi-turn dialogue applications, especially in settings that demand fast responses and high interaction quality.","2026-06-11 03:41:25","high_star"]