[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-71998":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":16,"stars7d":17,"stars30d":18,"stars90d":15,"forks30d":15,"starsTrendScore":19,"compositeScore":20,"rankGlobal":9,"rankLanguage":9,"license":21,"archived":22,"fork":22,"defaultBranch":23,"hasWiki":24,"hasPages":22,"topics":25,"createdAt":9,"pushedAt":9,"updatedAt":26,"readmeContent":27,"aiSummary":28,"trendingCount":15,"starSnapshotCount":15,"syncStatus":29,"lastSyncTime":30,"discoverSource":31},71998,"higgs-audio","boson-ai\u002Fhiggs-audio","boson-ai","Text-audio foundation model from Boson AI",null,"Python",8173,627,57,89,0,44,82,133,132,39.39,"Apache License 2.0",false,"main",true,[],"2026-06-12 02:02:57","\u003Ch1 align=\"center\">Higgs Audio: Redefining Expressiveness in Audio Generation\u003C\u002Fh1>\n\n\u003Cdiv align=\"center\" style=\"display: flex; justify-content: center; margin-top: 10px;\">\n  \u003Ca href=\"https:\u002F\u002Fboson.ai\u002Fblog\u002Fhiggs-audio-v2\">\u003Cimg src='https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🚀-V2 Blogpost-228B22' style=\"margin-right: 5px;\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fwww.boson.ai\u002Fblog\u002Fhiggs-audio-v2.5\">\u003Cimg src='https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🚀-V2.5 Blogpost-228B22' style=\"margin-right: 5px;\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fboson.ai\u002Fdemo\u002Ftts\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🕹️-Boson%20AI%20Playground-9C276A\" style=\"margin-right: 5px;\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fsmola\u002Fhiggs_audio_v2\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🎮-HF%20Space%20Playground-8A2BE2\" style=\"margin-right: 
5px;\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fbosonai\u002Fhiggs-audio-v2-generation-3B-base\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🤗-Checkpoints (3.6B LLM + 2.2B audio adapter)-ED5A22.svg\" style=\"margin-right: 5px;\">\u003C\u002Fa>\n\u003C\u002Fdiv>\n\n## NEWS!\n\nWe are proud to launch **Higgs-Audio V2.5**, the latest iteration of Boson AI’s audio model, designed to bring high-fidelity generation into production environments. Building on Higgs-Audio V2, this release combines improved efficiency with the stability required for real-world deployment.\n\nWith V2.5, we condensed the model architecture to 1B parameters while surpassing the speed and accuracy of the prior 3B model. This result is achieved through a new alignment strategy using Group Relative Policy Optimization (GRPO) on our curated Voice Bank dataset, combined with improved voice cloning and finer-grained style control.\n\nFor detailed model performance, key improvements, and usage, please check our [blog](https:\u002F\u002Fwww.boson.ai\u002Fblog\u002Fhiggs-audio-v2.5).\n\n\n## Higgs Audio V2\n\n\nWe are open-sourcing Higgs Audio v2, a powerful audio foundation model pretrained on over 10 million hours of audio data and a diverse set of text data. Despite having no post-training or fine-tuning, Higgs Audio v2 excels in expressive audio generation, thanks to its deep language and acoustic understanding.\n\nOn [EmergentTTS-Eval](https:\u002F\u002Fgithub.com\u002Fboson-ai\u002Femergenttts-eval-public), it achieves win rates of **75.7%** and **55.7%** over \"gpt-4o-mini-tts\" on the \"Emotions\" and \"Questions\" categories, respectively. It also obtains state-of-the-art performance on traditional TTS benchmarks like Seed-TTS Eval and Emotional Speech Dataset (ESD). 
Moreover, the model demonstrates capabilities rarely seen in previous systems, including generating natural multi-speaker dialogues in multiple languages, automatic prosody adaptation during narration, melodic humming with the cloned voice, and simultaneous generation of speech and background music.\n\n\u003Cp align=\"center\">\n    \u003Cimg src=\"figures\u002Femergent-tts-emotions-win-rate.png\" width=900>\n\u003C\u002Fp>\n\nHere's a demo video that shows some of its emergent capabilities (remember to unmute):\n\n\u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F0fd73fad-097f-48a9-9f3f-bc2a63b3818d\" type=\"video\u002Fmp4\" width=\"80%\" controls>\n\u003C\u002Fvideo>\n\nHere's another demo video that showcases the model's multilingual capability and how it enables live translation (remember to unmute):\n\n\u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F2b9b01ff-67fc-4bd9-9714-7c7df09e38d6\" type=\"video\u002Fmp4\" width=\"80%\" controls>\n\u003C\u002Fvideo>\n\n## Installation\n\nWe recommend using the NVIDIA Deep Learning Container to manage the CUDA environment. We have verified the following two Docker images:\n- nvcr.io\u002Fnvidia\u002Fpytorch:25.02-py3\n- nvcr.io\u002Fnvidia\u002Fpytorch:25.01-py3\n\nHere's an example command for launching a Docker container environment. 
Please also check the [official NVIDIA documentation](https:\u002F\u002Fcatalog.ngc.nvidia.com\u002Forgs\u002Fnvidia\u002Fcontainers\u002Fpytorch).\n\n```bash\ndocker run --gpus all --ipc=host --net=host --ulimit memlock=-1 --ulimit stack=67108864 -it --rm nvcr.io\u002Fnvidia\u002Fpytorch:25.02-py3 bash\n```\n\n### Option 1: Direct installation\n\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fboson-ai\u002Fhiggs-audio.git\ncd higgs-audio\n\npip install -r requirements.txt\npip install -e .\n```\n\n### Option 2: Using venv\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fboson-ai\u002Fhiggs-audio.git\ncd higgs-audio\n\npython3 -m venv higgs_audio_env\nsource higgs_audio_env\u002Fbin\u002Factivate\npip install -r requirements.txt\npip install -e .\n```\n\n\n### Option 3: Using conda\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fboson-ai\u002Fhiggs-audio.git\ncd higgs-audio\n\nconda create -y --prefix .\u002Fconda_env --override-channels --strict-channel-priority --channel \"conda-forge\" \"python==3.10.*\"\nconda activate .\u002Fconda_env\npip install -r requirements.txt\npip install -e .\n\n# To remove the environment:\nconda deactivate\nconda remove -y --prefix .\u002Fconda_env --all\n```\n\n### Option 4: Using uv\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fboson-ai\u002Fhiggs-audio.git\ncd higgs-audio\n\nuv venv --python 3.10\nsource .venv\u002Fbin\u002Factivate\nuv pip install -r requirements.txt\nuv pip install -e .\n```\n\n### Option 5: Using vLLM\n\nFor advanced usage with higher throughput, we also built an OpenAI-compatible API server backed by the vLLM engine.\nPlease refer to [examples\u002Fvllm](.\u002Fexamples\u002Fvllm) for more details.\n\n\n## Usage\n\n> [!TIP]\n> For optimal performance, run the generation examples on a machine equipped with a GPU with at least 24 GB of memory!\n\n### Get Started\n\nHere's a basic Python snippet to help you get started.\n\n```python\nfrom boson_multimodal.serve.serve_engine import 
HiggsAudioServeEngine, HiggsAudioResponse\nfrom boson_multimodal.data_types import ChatMLSample, Message, AudioContent\n\nimport torch\nimport torchaudio\n\nMODEL_PATH = \"bosonai\u002Fhiggs-audio-v2-generation-3B-base\"\nAUDIO_TOKENIZER_PATH = \"bosonai\u002Fhiggs-audio-v2-tokenizer\"\n\nsystem_prompt = (\n    \"Generate audio following instruction.\\n\\n\u003C|scene_desc_start|>\\nAudio is recorded from a quiet room.\\n\u003C|scene_desc_end|>\"\n)\n\nmessages = [\n    Message(\n        role=\"system\",\n        content=system_prompt,\n    ),\n    Message(\n        role=\"user\",\n        content=\"The sun rises in the east and sets in the west. This simple fact has been observed by humans for thousands of years.\",\n    ),\n]\ndevice = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n\nserve_engine = HiggsAudioServeEngine(MODEL_PATH, AUDIO_TOKENIZER_PATH, device=device)\n\noutput: HiggsAudioResponse = serve_engine.generate(\n    chat_ml_sample=ChatMLSample(messages=messages),\n    max_new_tokens=1024,\n    temperature=0.3,\n    top_p=0.95,\n    top_k=50,\n    stop_strings=[\"\u003C|end_of_text|>\", \"\u003C|eot_id|>\"],\n)\ntorchaudio.save(\"output.wav\", torch.from_numpy(output.audio)[None, :], output.sampling_rate)\n```\n\nWe also provide a list of examples under [examples](.\u002Fexamples). Below, we highlight a few examples to help you use Higgs Audio v2.\n\n### Zero-Shot Voice Cloning\nGenerate audio that sounds similar to the provided [reference audio](.\u002Fexamples\u002Fvoice_prompts\u002Fbelinda.wav).\n\n```bash\npython3 examples\u002Fgeneration.py \\\n--transcript \"The sun rises in the east and sets in the west. This simple fact has been observed by humans for thousands of years.\" \\\n--ref_audio belinda \\\n--temperature 0.3 \\\n--out_path generation.wav\n```\n\nThe generation script will automatically use `cuda:0` if it finds that CUDA is available. 
To change the device ID, specify `--device_id`:\n\n```bash\npython3 examples\u002Fgeneration.py \\\n--transcript \"The sun rises in the east and sets in the west. This simple fact has been observed by humans for thousands of years.\" \\\n--ref_audio belinda \\\n--temperature 0.3 \\\n--device_id 0 \\\n--out_path generation.wav\n```\n\nYou can also try other voices. More example voices are available in [examples\u002Fvoice_prompts](.\u002Fexamples\u002Fvoice_prompts). You can also add your own voice to the folder.\n\n```bash\npython3 examples\u002Fgeneration.py \\\n--transcript \"The sun rises in the east and sets in the west. This simple fact has been observed by humans for thousands of years.\" \\\n--ref_audio broom_salesman \\\n--temperature 0.3 \\\n--out_path generation.wav\n```\n\n### Single-speaker Generation with Smart Voice\nIf you do not specify a reference voice, the model will decide the voice based on the transcript it sees.\n\n```bash\npython3 examples\u002Fgeneration.py \\\n--transcript \"The sun rises in the east and sets in the west. This simple fact has been observed by humans for thousands of years.\" \\\n--temperature 0.3 \\\n--out_path generation.wav\n```\n\n\n### Multi-speaker Dialog with Smart Voice\nGenerate a multi-speaker dialog. 
The model will decide the voices based on the transcript it sees.\n\n```bash\npython3 examples\u002Fgeneration.py \\\n--transcript examples\u002Ftranscript\u002Fmulti_speaker\u002Fen_argument.txt \\\n--seed 12345 \\\n--out_path generation.wav\n```\n\n### Multi-speaker Dialog with Voice Clone\n\nGenerate a multi-speaker dialog with the voices you pick.\n\n```bash\npython3 examples\u002Fgeneration.py \\\n--transcript examples\u002Ftranscript\u002Fmulti_speaker\u002Fen_argument.txt \\\n--ref_audio belinda,broom_salesman \\\n--ref_audio_in_system_message \\\n--chunk_method speaker \\\n--seed 12345 \\\n--out_path generation.wav\n```\n\n\n## Technical Details\n\u003Cimg src=\"figures\u002Fhiggs_audio_v2_architecture_combined.png\" width=900>\n\n\nHiggs Audio v2 adopts the \"generation variant\" depicted in the architecture figure above. Its strong performance is driven by three key technical innovations:\n- We developed an automated annotation pipeline that leverages multiple ASR models, sound event classification models, and our in-house audio understanding model. Using this pipeline, we cleaned and annotated 10 million hours of audio data, which we refer to as **AudioVerse**. The in-house understanding model is fine-tuned on top of [Higgs Audio v1 Understanding](https:\u002F\u002Fwww.boson.ai\u002Fblog\u002Fhiggs-audio), which adopts the \"understanding variant\" shown in the architecture figure.\n- We trained a unified audio tokenizer from scratch that captures both semantic and acoustic features. We also open-sourced our evaluation set on [HuggingFace](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fbosonai\u002FAudioTokenBench). Learn more in the [tokenizer blog](.\u002Ftech_blogs\u002FTOKENIZER_BLOG.md).\n- We proposed the DualFFN architecture, which enhances the LLM’s ability to model acoustic tokens with minimal computational overhead. 
See the [architecture blog](.\u002Ftech_blogs\u002FARCHITECTURE_BLOG.md).\n\n## Evaluation\n\nHere's the performance of Higgs Audio v2 on four benchmarks,  [Seed-TTS Eval](https:\u002F\u002Fgithub.com\u002FBytedanceSpeech\u002Fseed-tts-eval), [Emotional Speech Dataset (ESD)](https:\u002F\u002Fpaperswithcode.com\u002Fdataset\u002Fesd), [EmergentTTS-Eval](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.23009), and Multi-speaker Eval:\n\n#### Seed-TTS Eval & ESD\n\nWe prompt Higgs Audio v2 with the reference text, reference audio, and target text for zero-shot TTS. We use the standard evaluation metrics from Seed-TTS Eval and ESD.\n\n|                              | SeedTTS-Eval| | ESD   |                 |\n|------------------------------|--------|--------|---------|-------------------|\n|                              | WER ↓ | SIM ↑ | WER ↓ | SIM (emo2vec) ↑ |\n| Cosyvoice2                   | 2.28   | 65.49  | 2.71    | 80.48             |\n| Qwen2.5-omni†                | 2.33   | 64.10  | -       | -                 |\n| ElevenLabs Multilingual V2   | **1.43**   | 50.00  | 1.66    | 65.87             |\n| Higgs Audio v1                | 2.18   | 66.27  | **1.49**    | 82.84             |\n| Higgs Audio v2 (base)         | 2.44   | **67.70**  | 1.78    | **86.13**         |\n\n\n#### EmergentTTS-Eval (\"Emotions\" and \"Questions\")\n\nFollowing the [EmergentTTS-Eval Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.23009), we report the win-rate over \"gpt-4o-mini-tts\" with the \"alloy\" voice. 
The judge model is Gemini 2.5 Pro.\n\n| Model                              | Emotions (%) ↑ | Questions (%) ↑ |\n|------------------------------------|--------------|----------------|\n| Higgs Audio v2 (base)               | **75.71%**   | **55.71%**         |\n| [gpt-4o-audio-preview†](https:\u002F\u002Fplatform.openai.com\u002Fdocs\u002Fmodels\u002Fgpt-4o-audio-preview)       | 61.64%       | 47.85%         |\n| [Hume.AI](https:\u002F\u002Fwww.hume.ai\u002Fresearch)                            | 61.60%       | 43.21%         |\n| **BASELINE:** [gpt-4o-mini-tts](https:\u002F\u002Fplatform.openai.com\u002Fdocs\u002Fmodels\u002Fgpt-4o-mini-tts)  | 50.00%       | 50.00%         |\n| [Qwen 2.5 Omni†](https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen2.5-Omni)      | 41.60%       | 51.78%         |\n| [minimax\u002Fspeech-02-hd](https:\u002F\u002Freplicate.com\u002Fminimax\u002Fspeech-02-hd)               | 40.86%        | 47.32%         |\n| [ElevenLabs Multilingual v2](https:\u002F\u002Felevenlabs.io\u002Fblog\u002Feleven-multilingual-v2)         | 30.35%       | 39.46%         |\n| [DeepGram Aura-2](https:\u002F\u002Fdeepgram.com\u002Flearn\u002Fintroducing-aura-2-enterprise-text-to-speech)                    | 29.28%       | 48.21%         |\n| [Sesame csm-1B](https:\u002F\u002Fgithub.com\u002FSesameAILabs\u002Fcsm)                      | 15.96%       | 31.78%         |\n\n\u003Csup>\u003Csub>'†' means using the strong-prompting method described in the paper.\u003C\u002Fsub>\u003C\u002Fsup>\n\n\n#### Multi-speaker Eval\n\nWe also designed a multi-speaker evaluation benchmark to evaluate the capability of Higgs Audio v2 for multi-speaker dialog generation. The benchmark contains three subsets\n\n- `two-speaker-conversation`: 1000 synthetic dialogues involving two speakers. 
We fix two reference audio clips to evaluate the model's ability to clone two voices at once, across dialogues of 4 to 10 turns between two randomly chosen personas.\n- `small talk (no ref)`: 250 synthetic dialogues curated in the same way as above, but characterized by short utterances and a limited number of turns (4–6). We do not fix reference audio clips in this case; this set is designed to evaluate the model's ability to automatically assign appropriate voices to speakers.\n- `small talk (ref)`: 250 synthetic dialogues similar to the above, but with even shorter utterances, since this set includes reference clips in its context, similar to `two-speaker-conversation`.\n\n\nWe report the word-error-rate (WER) and the geometric mean of intra-speaker similarity and inter-speaker dis-similarity on these three subsets. In addition to Higgs Audio v2, we also evaluated [MoonCast](https:\u002F\u002Fgithub.com\u002Fjzq2000\u002FMoonCast) and [nari-labs\u002FDia-1.6B-0626](https:\u002F\u002Fhuggingface.co\u002Fnari-labs\u002FDia-1.6B-0626), two of the most popular open-source models capable of multi-speaker dialog generation. Results are summarized in the following table. 
We were not able to run [nari-labs\u002FDia-1.6B-0626](https:\u002F\u002Fhuggingface.co\u002Fnari-labs\u002FDia-1.6B-0626) on our \"two-speaker-conversation\" subset due to its strict limitation on the length of the utterances and output audio.\n\n|                                                | two-speaker-conversation |                | small talk (ref) |                | small talk (no ref) |                |\n| ---------------------------------------------- | -------------- | ------------------ | ---------- | -------------- | ------------------- | -------------- |\n|                                                | WER ↓                      | Mean Sim & Dis-sim ↑ | WER ↓       |  Mean Sim & Dis-sim ↑ | WER ↓               | Mean Sim & Dis-sim ↑ |\n| [MoonCast](https:\u002F\u002Fgithub.com\u002Fjzq2000\u002FMoonCast) | 38.77                    | 46.02         | **8.33**       | 63.68          | 24.65               | 53.94 |\n| [nari-labs\u002FDia-1.6B-0626](https:\u002F\u002Fhuggingface.co\u002Fnari-labs\u002FDia-1.6B-0626)         | \\-                       | \\-             | 17.62      | 63.15          | 19.46               | **61.14**          |\n| Higgs Audio v2 (base)     | **18.88**                    | **51.95**          | 11.89      | **67.92**              | **14.65**               | 55.28              |\n\n## Contribution and Support\n\nFor contribution and support guidelines, please see [SUPPORT_GUIDELINES.md](SUPPORT_GUIDELINES.md).\n\n## Citation\n\nIf you find this repository helpful, please cite it as:\n\n```\n@misc{higgsaudio2025,\n  author       = {{Boson AI}},\n  title        = {{Higgs Audio V2: Redefining Expressiveness in Audio Generation}},\n  year         = {2025},\n  howpublished = {\\url{https:\u002F\u002Fgithub.com\u002Fboson-ai\u002Fhiggs-audio}},\n  note         = {GitHub repository. 
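Release blog available at \\url{https:\u002F\u002Fwww.boson.ai\u002Fblog\u002Fhiggs-audio-v2}},\n}\n```\n\n## Third-Party Licenses\n\nThe `boson_multimodal\u002Faudio_processing\u002F` directory contains code derived from third-party repositories, primarily from [xcodec](https:\u002F\u002Fgithub.com\u002Fzhenye234\u002Fxcodec). Please see the [`LICENSE`](boson_multimodal\u002Faudio_processing\u002FLICENSE) in that directory for complete attribution and licensing information.\n\n## We Are Hiring!\n\nIf you are passionate about multimodal AI, speech\u002Faudio models, or large-scale systems, \ncheck out our open positions at [Boson AI Careers](https:\u002F\u002Fjobs.lever.co\u002Fbosonai).\n","Higgs Audio is a text-audio foundation model developed by Boson AI that generates expressive audio through deep language and acoustic understanding. Its core capabilities include natural multi-speaker dialogue generation, automatic prosody adaptation, humming in a cloned voice, and simultaneous generation of speech and background music, features that are rare in traditional TTS systems. Technically, Higgs Audio V2.5 condenses the model architecture to 1 billion parameters, matching or even surpassing the speed and accuracy of the previous 3-billion-parameter model while significantly improving efficiency and stability. It suits scenarios that require high-quality audio synthesis, such as multimedia content creation, virtual assistant development, and educational software.",2,"2026-06-11 03:39:53","high_star"]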