[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-856":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":25,"hasPages":25,"topics":26,"createdAt":10,"pushedAt":10,"updatedAt":27,"readmeContent":28,"aiSummary":29,"trendingCount":16,"starSnapshotCount":16,"syncStatus":30,"lastSyncTime":31,"discoverSource":32},856,"VibeVoice","microsoft\u002FVibeVoice","microsoft","Open-Source Frontier Voice AI","https:\u002F\u002Fmicrosoft.github.io\u002FVibeVoice\u002F",null,"Python",49255,5471,242,117,0,37,1478,2253,474,45,"MIT License",false,"main",true,[],"2026-06-12 02:00:19","\u003Cdiv align=\"center\">\n\n## 🎙️ VibeVoice: Open-Source Frontier Voice AI\n[![Project Page](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject-Page-blue?logo=githubpages)](https:\u002F\u002Fmicrosoft.github.io\u002FVibeVoice)\n[![Hugging Face](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FHuggingFace-Collection-orange?logo=huggingface)](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Fmicrosoft\u002Fvibevoice-68a2ef24a875c44be47b034f)\n[![TTS Report](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FTTS-Report-red?logo=arxiv)](https:\u002F\u002Fopenreview.net\u002Fpdf?id=FihSkzyxdv)\n[![ASR 
Report](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FASR-Report-yellow?logo=arxiv)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2601.18184)\n[![Colab](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FStreamingTTS-Colab-green?logo=googlecolab)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fmicrosoft\u002FVibeVoice\u002Fblob\u002Fmain\u002Fdemo\u002FVibeVoice_colab.ipynb)\n[![ASR Playground](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FASR-Playground-6F42C1?logo=gradio)](https:\u002F\u002Faka.ms\u002Fvibevoice-asr)\n\n[![microsoft%2FVibeVoice | Trendshift](https:\u002F\u002Ftrendshift.io\u002Fapi\u002Fbadge\u002Frepositories\u002F15465)](https:\u002F\u002Ftrendshift.io\u002Frepositories\u002F15465)\n\n\u003C\u002Fdiv>\n\n\n\u003Cdiv align=\"center\">\n\u003Cpicture>\n  \u003Csource media=\"(prefers-color-scheme: dark)\" srcset=\"Figures\u002FVibeVoice_logo_white.png\">\n  \u003Cimg src=\"Figures\u002FVibeVoice_logo.png\" alt=\"VibeVoice Logo\" width=\"300\">\n\u003C\u002Fpicture>\n\u003C\u002Fdiv>\n\n\u003Cdiv align=\"left\">\n\n\u003Ch3>📰 News\u003C\u002Fh3>\n\n\n\n\n\u003Cstrong>2026-03-06: 🚀 VibeVoice ASR is now part of a \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fmicrosoft\u002FVibeVoice-ASR-HF\">Transformers release\u003C\u002Fa>! You can now use our speech recognition model directly through the Hugging Face Transformers library for seamless integration into your projects.\u003C\u002Fstrong>\n\n\u003Cstrong>2026-01-21:\u003C\u002Fstrong> 📣 We open-sourced \u003Ca href=\"docs\u002Fvibevoice-asr.md\">\u003Cstrong>VibeVoice-ASR\u003C\u002Fstrong>\u003C\u002Fa>, a unified speech-to-text model designed to handle 60-minute long-form audio in a single pass, generating structured transcriptions containing Who (Speaker), When (Timestamps), and What (Content), with support for User-Customized Context. 
Try it in [Playground](https:\u002F\u002Faka.ms\u002Fvibevoice-asr).\n- ⭐️ VibeVoice-ASR is natively multilingual, supporting over 50 languages; check the [supported languages](docs\u002Fvibevoice-asr.md#language-distribution) for details.\n- 🔥 The VibeVoice-ASR [finetuning code](finetuning-asr\u002FREADME.md) is now available!\n- ⚡️ **vLLM inference** is now supported for faster inference; see [vllm-asr](docs\u002Fvibevoice-vllm-asr.md) for more details.\n- 📑 [VibeVoice-ASR Technical Report](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2601.18184) is available.\n\n2025-12-16: 📣 We added experimental speakers to \u003Ca href=\"docs\u002Fvibevoice-realtime-0.5b.md\">\u003Cstrong>VibeVoice‑Realtime‑0.5B\u003C\u002Fstrong>\u003C\u002Fa> for exploration, including multilingual voices in nine languages (DE, FR, IT, JP, KR, NL, PL, PT, ES) and 11 distinct English style voices. [Try it](docs\u002Fvibevoice-realtime-0.5b.md#optional-more-experimental-voices). More speaker types will be added over time.\n\n2025-12-03: 📣 We open-sourced \u003Ca href=\"docs\u002Fvibevoice-realtime-0.5b.md\">\u003Cstrong>VibeVoice‑Realtime‑0.5B\u003C\u002Fstrong>\u003C\u002Fa>, a real‑time text‑to‑speech model that supports streaming text input and robust long-form speech generation. Try it on [Colab](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fmicrosoft\u002FVibeVoice\u002Fblob\u002Fmain\u002Fdemo\u002Fvibevoice_realtime_colab.ipynb).\n\n\n2025-09-05: VibeVoice is an open-source research framework intended to advance collaboration in the speech synthesis community. After release, we discovered instances where the tool was used in ways inconsistent with the stated intent. 
Since responsible use of AI is one of Microsoft’s guiding principles, we have removed the VibeVoice-TTS code from this repository.\n\n\n2025-08-25: 📣 We open-sourced \u003Ca href=\"docs\u002Fvibevoice-tts.md\">\u003Cstrong>VibeVoice-TTS\u003C\u002Fstrong>\u003C\u002Fa>, a long-form multi-speaker text-to-speech model that can synthesize speech up to 90 minutes long with up to 4 distinct speakers. It was accepted as an [Oral](https:\u002F\u002Fopenreview.net\u002Fforum?id=FihSkzyxdv) at ICLR 2026! 🔥\n\n\u003C\u002Fdiv>\n\n## Overview\n\nVibeVoice is a **family of open-source frontier voice AI models** that includes both Text-to-Speech (TTS) and Automatic Speech Recognition (ASR) models. \n\nA core innovation of VibeVoice is its use of continuous speech tokenizers (Acoustic and Semantic) operating at an ultra-low frame rate of **7.5 Hz**. These tokenizers efficiently preserve audio fidelity while significantly boosting computational efficiency for processing long sequences. VibeVoice employs a [next-token diffusion](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.08635) framework, leveraging a Large Language Model (LLM) to understand textual context and dialogue flow, and a diffusion head to generate high-fidelity acoustic details.\n\nFor more information, demos, and examples, please visit our [Project Page](https:\u002F\u002Fmicrosoft.github.io\u002FVibeVoice).\n\n\n\u003Cdiv align=\"center\">\n\n| Model |   Weight | Quick Try |\n|-------|--------------|---------|\n| VibeVoice-ASR-7B | [HF Link](https:\u002F\u002Fhuggingface.co\u002Fmicrosoft\u002FVibeVoice-ASR) |  [Playground](https:\u002F\u002Faka.ms\u002Fvibevoice-asr) |\n| VibeVoice-TTS-1.5B | [HF Link](https:\u002F\u002Fhuggingface.co\u002Fmicrosoft\u002FVibeVoice-1.5B) | Disabled |\n| VibeVoice-Realtime-0.5B | [HF Link](https:\u002F\u002Fhuggingface.co\u002Fmicrosoft\u002FVibeVoice-Realtime-0.5B) | 
[Colab](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fmicrosoft\u002FVibeVoice\u002Fblob\u002Fmain\u002Fdemo\u002Fvibevoice_realtime_colab.ipynb) |\n\n\u003C\u002Fdiv>\n\n## Models\n\n\n### 1. 📖 [VibeVoice-ASR](docs\u002Fvibevoice-asr.md) - Long-form Speech Recognition\n\n**VibeVoice-ASR** is a unified speech-to-text model designed to handle **60-minute long-form audio** in a single pass, generating structured transcriptions containing **Who (Speaker), When (Timestamps), and What (Content)**, with support for **Customized Hotwords**.\n\n- **🕒 60-minute Single-Pass Processing**:\n  Unlike conventional ASR models that slice audio into short chunks (often losing global context), VibeVoice ASR accepts up to **60 minutes** of continuous audio input within 64K token length. This ensures consistent speaker tracking and semantic coherence across the entire hour.\n\n- **👤 Customized Hotwords**:\n  Users can provide customized hotwords (e.g., specific names, technical terms, or background info) to guide the recognition process, significantly improving accuracy on domain-specific content.\n\n- **📝 Rich Transcription (Who, When, What)**:\n  The model jointly performs ASR, diarization, and timestamping, producing a structured output that indicates *who* said *what* and *when*.\n\n[📖 Documentation](docs\u002Fvibevoice-asr.md) | [🤗 Hugging Face](https:\u002F\u002Fhuggingface.co\u002Fmicrosoft\u002FVibeVoice-ASR) | [🎮 Playground](https:\u002F\u002Faka.ms\u002Fvibevoice-asr) | [🛠️ Finetuning](finetuning-asr\u002FREADME.md) |  [📊 Paper](docs\u002FVibeVoice-ASR-Report.pdf)\n\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"Figures\u002FDER.jpg\" alt=\"DER\" width=\"50%\">\u003Cbr>\n  \u003Cimg src=\"Figures\u002FcpWER.jpg\" alt=\"cpWER\" width=\"50%\">\u003Cbr>\n  \u003Cimg src=\"Figures\u002FtcpWER.jpg\" alt=\"tcpWER\" width=\"50%\">\n\u003C\u002Fp>\n\n\n\u003Cdiv align=\"center\" 
id=\"vibevoice-asr\">\n\nhttps:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Facde5602-dc17-4314-9e3b-c630bc84aefa\n\n\u003C\u002Fdiv>\n\u003Cbr>\n\n### 2. 🎙️ [VibeVoice-TTS](docs\u002Fvibevoice-tts.md) - Long-form Multi-speaker TTS\n\n**Best for**: Long-form conversational audio, podcasts, multi-speaker dialogues\n\n- **⏱️ 90-minute Long-form Generation**:\n  Synthesizes conversational\u002Fsingle-speaker speech up to **90 minutes** in a single pass, maintaining speaker consistency and semantic coherence throughout.\n\n- **👥 Multi-speaker Support**:\n  Supports up to **4 distinct speakers** in a single conversation, with natural turn-taking and speaker consistency across long dialogues.\n\n- **🎭 Expressive Speech**:\n  Generates expressive, natural-sounding speech that captures conversational dynamics and emotional nuances.\n\n- **🌐 Multi-lingual Support**:\n  Supports English, Chinese and other languages.\n\n\n[📖 Documentation](docs\u002Fvibevoice-tts.md) | [🤗 Hugging Face](https:\u002F\u002Fhuggingface.co\u002Fmicrosoft\u002FVibeVoice-1.5B)  |  [📊 Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2508.19205)\n\n\n\u003Cdiv align=\"center\">\n  \u003Cimg src=\"Figures\u002FVibeVoice-TTS-results.jpg\" alt=\"VibeVoice Results\" width=\"80%\">\n\u003C\u002Fdiv>\n\n\n**English**\n\u003Cdiv align=\"center\">\n\nhttps:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F0967027c-141e-4909-bec8-091558b1b784\n\n\u003C\u002Fdiv>\n\n\n**Chinese**\n\u003Cdiv align=\"center\">\n\nhttps:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F322280b7-3093-4c67-86e3-10be4746c88f\n\n\u003C\u002Fdiv>\n\n**Cross-Lingual**\n\u003Cdiv align=\"center\">\n\nhttps:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F838d8ad9-a201-4dde-bb45-8cd3f59ce722\n\n\u003C\u002Fdiv>\n\n**Spontaneous Singing**\n\u003Cdiv 
align=\"center\">\n\nhttps:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F6f27a8a5-0c60-4f57-87f3-7dea2e11c730\n\n\u003C\u002Fdiv>\n\n\n**Long Conversation with 4 people**\n\u003Cdiv align=\"center\">\n\nhttps:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Fa357c4b6-9768-495c-a576-1618f6275727\n\n\u003C\u002Fdiv>\n\n\n\n\n\n\u003Cbr>\n\n### 3. ⚡ [VibeVoice-Streaming](docs\u002Fvibevoice-realtime-0.5b.md) - Real-time Streaming TTS\n\nVibeVoice-Realtime is a **lightweight real‑time** text-to-speech model supporting **streaming text input** and **robust long-form speech generation**.\n\n- Parameter size: 0.5B (deployment-friendly)\n- Real-time TTS (~300 milliseconds first audible latency)\n- Streaming text input\n- Robust long-form speech generation (~10 minutes)\n\n[📖 Documentation](docs\u002Fvibevoice-realtime-0.5b.md) | [🤗 Hugging Face](https:\u002F\u002Fhuggingface.co\u002Fmicrosoft\u002FVibeVoice-Realtime-0.5B) | [🚀 Colab](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fmicrosoft\u002FVibeVoice\u002Fblob\u002Fmain\u002Fdemo\u002Fvibevoice_realtime_colab.ipynb)\n\n\n\u003Cdiv align=\"center\" id=\"generated-example-audio-vibevoice-realtime\">\n\nhttps:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F0901d274-f6ae-46ef-a0fd-3c4fba4f76dc\n\n\u003C\u002Fdiv>\n\n\u003Cbr>\n\n## Contributing\n\nPlease see [CONTRIBUTING.md](CONTRIBUTING.md) for detailed contribution guidelines.\n\n\n\n## ⚠️ Risks and Limitations\n\n\nWhile efforts have been made to optimize it through various techniques, it may still produce outputs that are unexpected, biased, or inaccurate. VibeVoice inherits any biases, errors, or omissions produced by its base model (specifically, Qwen2.5 1.5b in this release).\nPotential for Deepfakes and Disinformation: High-quality synthetic speech can be misused to create convincing fake audio content for impersonation, fraud, or spreading disinformation. 
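Users must ensure transcripts are reliable, check content accuracy, and avoid using generated content in misleading ways. Users are expected to use the generated content and to deploy the models in a lawful manner, in full compliance with all applicable laws and regulations in the relevant jurisdictions. It is best practice to disclose the use of AI when sharing AI-generated content.\n\n\nWe do not recommend using VibeVoice in commercial or real-world applications without further testing and development. This model is intended for research and development purposes only. Please use responsibly.\n\n## Star History\n\n![Star History Chart](https:\u002F\u002Fapi.star-history.com\u002Fsvg?repos=Microsoft\u002Fvibevoice&type=date&legend=top-left)\n","VibeVoice is an open-source frontier voice AI project that aims to provide high-quality speech recognition (ASR) and text-to-speech (TTS). Its core features include a unified ASR model that can process long-form audio of up to 60 minutes, supports more than 50 languages, and generates structured transcripts containing speaker, timestamp, and content information; it also provides real-time TTS with voices in multiple languages and styles. Technically, VibeVoice uses an advanced deep learning architecture to deliver high performance and low latency, and it supports vLLM-accelerated inference. The project is well suited to applications that need efficient, accurate speech processing, such as automated meeting transcription and multilingual customer service systems.",2,"2026-06-11 02:39:50","top_all"]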