[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-3656":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":25,"hasPages":23,"topics":26,"createdAt":10,"pushedAt":10,"updatedAt":35,"readmeContent":36,"aiSummary":37,"trendingCount":16,"starSnapshotCount":16,"syncStatus":38,"lastSyncTime":39,"discoverSource":40},3656,"voicebox","jamiepine\u002Fvoicebox","jamiepine","The open-source AI voice studio. Clone, dictate, create.","https:\u002F\u002Fvoicebox.sh",null,"TypeScript",29768,3663,165,375,0,80,482,4505,358,45,"MIT License",false,"main",true,[27,28,29,30,31,32,33,34],"ai","cuda","mlx","qwen3-tts","qwen3-tts-ui","voice-ai","voice-clone","whisper","2026-06-12 02:00:52","\u003Cp align=\"center\">\n  \u003Cimg src=\".github\u002Fassets\u002Ficon-dark.webp\" alt=\"Voicebox\" width=\"120\" height=\"120\" \u002F>\n\u003C\u002Fp>\n\n\u003Ch1 align=\"center\">Voicebox\u003C\u002Fh1>\n\n\u003Cp align=\"center\">\n  \u003Cstrong>The open-source AI voice studio.\u003C\u002Fstrong>\u003Cbr\u002F>\n  Clone any voice. Generate speech. Dictate into any app. 
Talk to agents in voices you own.\u003Cbr\u002F>\n  The full voice I\u002FO stack, running locally on your machine.\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fjamiepine\u002Fvoicebox\u002Freleases\">\n    \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fdownloads\u002Fjamiepine\u002Fvoicebox\u002Ftotal?style=flat&color=blue\" alt=\"Downloads\" \u002F>\n  \u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fjamiepine\u002Fvoicebox\u002Freleases\u002Flatest\">\n    \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fv\u002Frelease\u002Fjamiepine\u002Fvoicebox?style=flat\" alt=\"Release\" \u002F>\n  \u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fjamiepine\u002Fvoicebox\u002Fstargazers\">\n    \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fjamiepine\u002Fvoicebox?style=flat\" alt=\"Stars\" \u002F>\n  \u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fjamiepine\u002Fvoicebox\u002Fblob\u002Fmain\u002FLICENSE\">\n    \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Flicense\u002Fjamiepine\u002Fvoicebox?style=flat\" alt=\"License\" \u002F>\n  \u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fdeepwiki.com\u002Fjamiepine\u002Fvoicebox\">\n    \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fstatic\u002Fv1?label=Ask&message=DeepWiki&color=5B6EF7\" alt=\"Ask DeepWiki\" \u002F>\n  \u003C\u002Fa>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n    \u003Ca href=\"https:\u002F\u002Ftrendshift.io\u002Frepositories\u002F21213\" target=\"_blank\">\u003Cimg src=\"https:\u002F\u002Ftrendshift.io\u002Fapi\u002Fbadge\u002Frepositories\u002F21213\" alt=\"jamiepine%2Fvoicebox | Trendshift\" style=\"width: 250px; height: 55px;\" width=\"250\" height=\"55\"\u002F>\u003C\u002Fa>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Fvoicebox.sh\">voicebox.sh\u003C\u002Fa> •\n  \u003Ca 
href=\"https:\u002F\u002Fdocs.voicebox.sh\">Docs\u003C\u002Fa> •\n  \u003Ca href=\"#download\">Download\u003C\u002Fa> •\n  \u003Ca href=\"#features\">Features\u003C\u002Fa> •\n  \u003Ca href=\"#api\">API\u003C\u002Fa> •\n  \u003Ca href=\"docs\u002Fcontent\u002Fdocs\u002Foverview\u002Ftroubleshooting.mdx\">Troubleshooting\u003C\u002Fa>\n\u003C\u002Fp>\n\n\u003Cbr\u002F>\n\n\u003Cp align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Fvoicebox.sh\">\n    \u003Cimg src=\"landing\u002Fpublic\u002Fassets\u002Fapp-screenshot-1.webp\" alt=\"Voicebox App Screenshot\" width=\"800\" \u002F>\n  \u003C\u002Fa>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  \u003Cem>Click the image above to watch the demo video on \u003Ca href=\"https:\u002F\u002Fvoicebox.sh\">voicebox.sh\u003C\u002Fa>\u003C\u002Fem>\n\u003C\u002Fp>\n\n\u003Cbr\u002F>\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"landing\u002Fpublic\u002Fassets\u002Fapp-screenshot-2.webp\" alt=\"Voicebox Screenshot 2\" width=\"800\" \u002F>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"landing\u002Fpublic\u002Fassets\u002Fapp-screenshot-3.webp\" alt=\"Voicebox Screenshot 3\" width=\"800\" \u002F>\n\u003C\u002Fp>\n\n\u003Cbr\u002F>\n\n## What is Voicebox?\n\nVoicebox is a **local-first AI voice studio** — a free and open-source alternative to **ElevenLabs** and **WisprFlow** in one app. Clone voices from a few seconds of audio, generate speech in 23 languages across 7 TTS engines, dictate into any text field with a global hotkey, and give any MCP-aware AI agent a voice of your choosing.\n\nThe two cloud incumbents sit on opposite halves of the voice I\u002FO loop — ElevenLabs on output, WisprFlow on input. 
Voicebox does both, bridges them with a bundled local LLM for refinement and per-profile personas, and runs the whole thing on your machine.\n\n- **Complete privacy** — models, voice data, and captures never leave your machine\n- **7 TTS engines** — Qwen3-TTS, Qwen CustomVoice, LuxTTS, Chatterbox Multilingual, Chatterbox Turbo, HumeAI TADA, and Kokoro\n- **Voice cloning and preset voices** — zero-shot cloning from a reference sample, or 50+ curated preset voices via Kokoro and Qwen CustomVoice\n- **23 languages** — from English to Arabic, Japanese, Hindi, Swahili, and more\n- **Post-processing effects** — pitch shift, reverb, delay, chorus, compression, and filters\n- **Expressive speech** — paralinguistic tags like `[laugh]`, `[sigh]`, `[gasp]` via Chatterbox Turbo; natural-language delivery control via Qwen CustomVoice\n- **Unlimited length** — auto-chunking with crossfade for scripts, articles, and chapters\n- **Stories editor** — multi-track timeline for conversations, podcasts, and narratives\n- **Voice input** — global dictation hotkey with push-to-talk and toggle modes, accessibility-verified auto-paste on macOS, in-app mic on every text field, Whisper-based STT\n- **Agent voice output** — one tool call (`voicebox.speak`) and any MCP-aware agent (Claude Code, Cursor, Cline) speaks to you in a voice you've cloned\n- **Voice personalities** — attach a free-form persona to any voice profile, then Compose, Rewrite, or Respond via a bundled local LLM — agents can invoke the same modes over MCP\n- **API-first** — REST API plus a built-in MCP server for integrating voice I\u002FO into your own apps and agents\n- **Native performance** — built with Tauri (Rust), not Electron\n- **Runs everywhere** — macOS (MLX\u002FMetal), Windows (CUDA), Linux, AMD ROCm, Intel Arc, Docker\n\n---\n\n## Download\n\n| Platform              | Download                                               |\n| --------------------- | ------------------------------------------------------ |\n| 
macOS (Apple Silicon) | [Download DMG](https:\u002F\u002Fvoicebox.sh\u002Fdownload\u002Fmac-arm)   |\n| macOS (Intel)         | [Download DMG](https:\u002F\u002Fvoicebox.sh\u002Fdownload\u002Fmac-intel) |\n| Windows               | [Download MSI](https:\u002F\u002Fvoicebox.sh\u002Fdownload\u002Fwindows)   |\n| Docker                | `docker compose up`                                    |\n\n> **[View all binaries →](https:\u002F\u002Fgithub.com\u002Fjamiepine\u002Fvoicebox\u002Freleases\u002Flatest)**\n\n> **Linux** — Pre-built binaries are not yet available. See [voicebox.sh\u002Flinux-install](https:\u002F\u002Fvoicebox.sh\u002Flinux-install) for build-from-source instructions.\n\n> **Having trouble?** See the [Troubleshooting Guide](docs\u002Fcontent\u002Fdocs\u002Foverview\u002Ftroubleshooting.mdx) for common install, generation, model-download, and GPU issues.\n\n---\n\n## Features\n\n### Multi-Engine Voice Cloning\n\nSeven TTS engines with different strengths, switchable per-generation:\n\n| Engine                      | Languages | Strengths                                                                                                                                |\n| --------------------------- | --------- | ---------------------------------------------------------------------------------------------------------------------------------------- |\n| **Qwen3-TTS** (0.6B \u002F 1.7B) | 10        | High-quality multilingual cloning, delivery instructions (\"speak slowly\", \"whisper\")                                                     |\n| **Qwen CustomVoice**        | 10        | 9 curated preset voices with natural-language delivery control — no reference audio required                                             |\n| **LuxTTS**                  | English   | Lightweight (~1GB VRAM), 48kHz output, 150x realtime on CPU                                                                              |\n| **Chatterbox Multilingual** | 23        | Broadest 
language coverage — Arabic, Danish, Finnish, Greek, Hebrew, Hindi, Malay, Norwegian, Polish, Swahili, Swedish, Turkish and more |\n| **Chatterbox Turbo**        | English   | Fast 350M model with paralinguistic emotion\u002Fsound tags                                                                                   |\n| **TADA** (1B \u002F 3B)          | 10        | HumeAI speech-language model — 700s+ coherent audio, text-acoustic dual alignment                                                        |\n| **Kokoro**                  | 8         | 50 curated preset voices, tiny 82M model, fast CPU inference                                                                             |\n\n### Emotions & Paralinguistic Tags\n\nOnly **Chatterbox Turbo** interprets paralinguistic tags like `[laugh]` and\n`[sigh]`. Qwen3-TTS, LuxTTS, Chatterbox Multilingual, and HumeAI TADA read them\nliterally as text.\n\nWith **Chatterbox Turbo** selected, type `\u002F` in the text input to open the tag\ninserter and add expressive tags inline with speech:\n\n`[laugh]` `[chuckle]` `[gasp]` `[cough]` `[sigh]` `[groan]` `[sniff]` `[shush]` `[clear throat]`\n\n### Post-Processing Effects\n\n8 audio effects powered by Spotify's `pedalboard` library. 
Apply after generation, preview in real time, build reusable presets.\n\n| Effect           | Description                                   |\n| ---------------- | --------------------------------------------- |\n| Pitch Shift      | Up or down by up to 12 semitones              |\n| Reverb           | Configurable room size, damping, wet\u002Fdry mix  |\n| Delay            | Echo with adjustable time, feedback, and mix  |\n| Chorus \u002F Flanger | Modulated delay for metallic or lush textures |\n| Compressor       | Dynamic range compression                     |\n| Gain             | Volume adjustment (-40 to +40 dB)             |\n| High-Pass Filter | Remove low frequencies                        |\n| Low-Pass Filter  | Remove high frequencies                       |\n\nShips with 4 built-in presets (Robotic, Radio, Echo Chamber, Deep Voice) and supports custom presets. Effects can be assigned per-profile as defaults.\n\n### Unlimited Generation Length\n\nText is automatically split at sentence boundaries and each chunk is generated independently, then crossfaded together. Works with all engines.\n\n- Configurable auto-chunking limit (100–5,000 chars)\n- Crossfade slider (0–200ms) for smooth transitions\n- Max text length: 50,000 characters\n- Smart splitting respects abbreviations, CJK punctuation, and `[tags]`\n\n### Generation Versions\n\nEvery generation supports multiple versions with provenance tracking:\n\n- **Original** — clean TTS output, always preserved\n- **Effects versions** — apply different effects chains from any source version\n- **Takes** — regenerate with a new seed for variation\n- **Source tracking** — each version records its lineage\n- **Favorites** — star generations for quick access\n\n### Async Generation Queue\n\nGeneration is non-blocking. 
Submit and immediately start typing the next one.\n\n- Serial execution queue prevents GPU contention\n- Real-time SSE status streaming\n- Failed generations can be retried\n- Stale generations from crashes auto-recover on startup\n\n### Voice Profile Management\n\n- Create profiles from audio files or record directly in-app\n- Import\u002Fexport profiles to share or back up\n- Multi-sample support for higher quality cloning\n- Per-profile default effects chains\n- Organize with descriptions and language tags\n\n### Stories Editor\n\nMulti-voice timeline editor for conversations, podcasts, and narratives.\n\n- Multi-track composition with drag-and-drop\n- Inline audio trimming and splitting\n- Auto-playback with synchronized playhead\n- Version pinning per track clip\n\n### Global Dictation & Voice Input\n\nThe other half of the voice I\u002FO loop. Hold a hotkey anywhere on your system, speak, release — on macOS the transcript pastes straight into the focused text field. Or hit the mic on any Voicebox text input and dictate directly into the app.\n\n- **Configurable chord bindings** — hold-to-speak and tap-to-toggle chords, each rebindable in the in-app chord picker. Holding push-to-talk and tapping `Space` mid-hold upgrades into a toggle session without a gap in audio\n- **Target-aware paste (macOS)** — accessibility-verified injection into the focused text field, with atomic clipboard save\u002Frestore so your clipboard isn't clobbered\n- **First-run permissions UX** — in-app gates walk you through the macOS Accessibility and Input Monitoring grants with deep-links to System Settings\n- **In-app mic button** on every Voicebox text field — generation form, profile descriptions, story titles, anywhere you'd type\n- **LLM refinement** — optional cleanup of ums, stutters, and false starts before paste\n- **On-screen pill** — floating overlay surfacing `recording`, `transcribing`, `refining`, and `speaking` states. 
Same pill agents use when they speak to you, so there's one mental model for both directions of the loop\n\n### Speech-to-Text\n\nVoicebox runs OpenAI Whisper for transcription — the same model that backs dictation, the Captures tab, and the `\u002Ftranscribe` API. Running on MLX (Apple Silicon) or PyTorch (CUDA \u002F ROCm \u002F DirectML \u002F CPU) depending on your platform.\n\n| Size                          | Notes                                              |\n| ----------------------------- | -------------------------------------------------- |\n| Base \u002F Small \u002F Medium \u002F Large | Standard Whisper quality ladder                    |\n| Turbo                         | ~8x faster than Whisper Large, minimal quality loss |\n\nMore engines (Parakeet v3, Qwen3-ASR) are planned — see [Roadmap](#roadmap).\n\n### Captures\n\nEvery dictation, in-app recording, and uploaded audio file lands in the Captures tab — original audio paired with transcript, always preserved.\n\n- **Replay, re-transcribe, refine** — rerun STT with any Whisper size, or re-run the raw transcript through the local LLM with different flags (filler cleanup, self-correction removal, technical-term preservation)\n- **Edit inline** — tweak the transcript and save on blur\n- **Play as voice profile** — turn any capture into speech with a cloned voice, one click\n- **Promote to voice sample** — use a capture's audio + transcript as a reference sample on any voice profile\n- **Local capture storage** — original audio and transcript stay in your Voicebox data directory, with a folder shortcut in Settings\n\n### Agent Voice Output\n\nEvery agent gets a voice. One tool call and any MCP-aware agent can speak to you in a voice you've cloned — task completions, questions, notifications. 
The same pill that surfaces during dictation surfaces during agent speech, so you always see what's coming out of your machine.\n\n```ts\n\u002F\u002F In any MCP-aware agent:\nawait voicebox.speak({\n  text: \"Deploy complete.\",\n  profile: \"Morgan\",\n});\n```\n\nAlso exposed as `POST \u002Fspeak` for anything that doesn't speak MCP — ACP, A2A, shell scripts, custom harnesses.\n\n- **Bidirectional pill** — `recording`, `transcribing`, `refining`, and `speaking` are all states of the same OS-level overlay, so dictation and agent speech share one surface\n- **Per-agent voice binding** — in **Settings → MCP**, pin Claude Code to Morgan and Cursor to Scarlett so you can tell which agent is talking without looking. Each client's `last_seen_at` timestamp confirms the install actually took\n- **Always visible** — no silent background TTS; every agent-initiated speak surfaces the pill with the voice profile name for the full duration\n- **HTTP + stdio transports** — install as a URL in Claude Code \u002F Cursor \u002F Windsurf \u002F VS Code MCP, or point stdio-only clients at the bundled `voicebox-mcp` binary\n\n### Voice Personalities\n\nAttach a free-form personality to any voice profile — who this voice is, how they speak, what they care about. Two actions appear on the generate box when a personality is set, powered by a bundled Qwen3 LLM running entirely locally.\n\n- **Compose** — a shuffle button that drops a fresh in-character line into the textarea; edit and speak, or click again for a different take\n- **Speak in character** — a toggle that routes your input text through the personality LLM to be rewritten in their voice before TTS\n\nAgents can reach the same rewrite path over MCP by passing `personality: true` to `voicebox.speak`, turning the tool into a text-in → personality-LLM → TTS pipeline. 
The same LLM backs dictation's refinement step — one LLM in the app, one model cache, one GPU-memory footprint.\n\n**Local LLM options:** Qwen3 0.6B \u002F 1.7B \u002F 4B, sharing the TTS runtime (MLX on Apple Silicon, PyTorch elsewhere).\n\nUse cases: agent dev loops (dictate a question, hear the answer in a cloned voice), interactive characters for games and narrative tools, speech assistance for people who can't speak in their original voice.\n\n### Model Management\n\n- Per-model unload to free GPU memory without deleting downloads\n- Custom models directory via `VOICEBOX_MODELS_DIR`\n- Model folder migration with progress tracking\n- Download cancel\u002Fclear UI\n\n### GPU Support\n\n| Platform                 | Backend        | Notes                                          |\n| ------------------------ | -------------- | ---------------------------------------------- |\n| macOS (Apple Silicon)    | MLX (Metal)    | 4-5x faster via Neural Engine                  |\n| Windows \u002F Linux (NVIDIA) | PyTorch (CUDA) | Auto-downloads CUDA binary from within the app |\n| Linux (AMD)              | PyTorch (ROCm) | Auto-configures HSA_OVERRIDE_GFX_VERSION       |\n| Windows (any GPU)        | DirectML       | Universal Windows GPU support                  |\n| Intel Arc                | IPEX\u002FXPU       | Intel discrete GPU acceleration                |\n| Any                      | CPU            | Works everywhere, just slower                  |\n\n---\n\n## API\n\nVoicebox exposes a REST API for integrating voice I\u002FO into your own apps and agents.\n\n```bash\n# Generate speech\ncurl -X POST http:\u002F\u002F127.0.0.1:17493\u002Fgenerate \\\n  -H \"Content-Type: application\u002Fjson\" \\\n  -d '{\"text\": \"Hello world\", \"profile_id\": \"abc123\", \"language\": \"en\"}'\n\n# Agent voice output — any app or script can speak in a cloned voice\ncurl -X POST http:\u002F\u002F127.0.0.1:17493\u002Fspeak \\\n  -H \"Content-Type: application\u002Fjson\" \\\n  
-H \"X-Voicebox-Client-Id: my-script\" \\\n  -d '{\"text\": \"Deploy complete.\", \"profile\": \"Morgan\"}'\n\n# Transcribe an audio file\ncurl -X POST http:\u002F\u002F127.0.0.1:17493\u002Ftranscribe \\\n  -F \"audio=@recording.wav\" \\\n  -F \"model=whisper-turbo\"\n\n# List voice profiles\ncurl http:\u002F\u002F127.0.0.1:17493\u002Fprofiles\n```\n\n`POST \u002Fspeak` accepts `profile` as a name (case-insensitive) or id, and resolves via the same precedence as the MCP tool: explicit arg → per-client binding → `capture_settings.default_playback_voice_id`.\n\n### MCP server\n\nVoicebox ships a built-in **Model Context Protocol** server so any MCP-aware agent (Claude Code, Cursor, Windsurf, Cline, VS Code MCP extensions) can speak, transcribe, and browse captures and profiles.\n\n**Claude Code one-liner:**\n\n```\nclaude mcp add voicebox \\\n  --transport http \\\n  --url http:\u002F\u002F127.0.0.1:17493\u002Fmcp \\\n  --header \"X-Voicebox-Client-Id: claude-code\"\n```\n\n**Any HTTP MCP client** (Cursor, Windsurf, VS Code, etc.):\n\n```json\n{\n  \"mcpServers\": {\n    \"voicebox\": {\n      \"url\": \"http:\u002F\u002F127.0.0.1:17493\u002Fmcp\",\n      \"headers\": { \"X-Voicebox-Client-Id\": \"cursor\" }\n    }\n  }\n}\n```\n\n**Stdio fallback** for clients that don't speak HTTP MCP — point at the bundled `voicebox-mcp` binary inside the app:\n\n```json\n{\n  \"mcpServers\": {\n    \"voicebox\": {\n      \"command\": \"\u002FApplications\u002FVoicebox.app\u002FContents\u002FMacOS\u002Fvoicebox-mcp\",\n      \"env\": { \"VOICEBOX_CLIENT_ID\": \"claude-desktop\" }\n    }\n  }\n}\n```\n\nFour tools ship: `voicebox.speak`, `voicebox.transcribe`, `voicebox.list_captures`, `voicebox.list_profiles`. Per-client voice bindings are managed in **Voicebox → Settings → MCP**. 
See the [full MCP guide](docs\u002Fcontent\u002Fdocs\u002Foverview\u002Fmcp-server.mdx) for tool signatures, resolution precedence, the speaking-pill contract, and security notes.\n\n```ts\n\u002F\u002F In any MCP-aware agent:\nawait voicebox.speak({\n  text: \"Tests passing. Ready to merge.\",\n  profile: \"Morgan\",      \u002F\u002F optional — falls back to the per-client binding\n  personality: true,      \u002F\u002F optional — rewrites text through the profile's personality LLM first\n});\n```\n\n**Use cases:** agent dev loops (voice in, voice out), game dialogue, podcast production, accessibility tools, voice assistants, content automation.\n\nFull API documentation available at `http:\u002F\u002F127.0.0.1:17493\u002Fdocs`.\n\n---\n\n## Tech Stack\n\n| Layer         | Technology                                                                      |\n| ------------- | ------------------------------------------------------------------------------- |\n| Desktop App   | Tauri (Rust)                                                                    |\n| Frontend      | React, TypeScript, Tailwind CSS                                                 |\n| State         | Zustand, React Query                                                            |\n| Backend       | FastAPI (Python)                                                                |\n| TTS Engines   | Qwen3-TTS, Qwen CustomVoice, LuxTTS, Chatterbox, Chatterbox Turbo, TADA, Kokoro |\n| STT           | Whisper \u002F Whisper Turbo (PyTorch or MLX)                                        |\n| Local LLM     | Qwen3 (0.6B \u002F 1.7B \u002F 4B), shared runtime with TTS \u002F STT                         |\n| MCP Server    | FastMCP mounted at `\u002Fmcp` (Streamable HTTP) + bundled stdio shim binary         |\n| Native Shim   | Rust (inside Tauri) for global hotkey, paste injection, focus introspection     |\n| Effects       | Pedalboard (Spotify)                                                          
  |\n| Inference     | MLX (Apple Silicon) \u002F PyTorch (CUDA\u002FROCm\u002FXPU\u002FCPU)                               |\n| Database      | SQLite                                                                          |\n| Audio         | WaveSurfer.js, librosa                                                          |\n\n---\n\n## Roadmap\n\n| Feature                            | Description                                                              |\n| ---------------------------------- | ------------------------------------------------------------------------ |\n| **Windows \u002F Linux auto-paste**     | Dictation paste parity — `SendInput` on Windows, `uinput` \u002F AT-SPI on Linux |\n| **STT engine expansion**           | Parakeet v3 and Qwen3-ASR joining Whisper — 50+ languages, better non-English quality |\n| **Pipeline routing**               | Configurable source → transform → sink chains with webhook + MCP sinks and a preset editor |\n| **Streaming transcription**        | WebSocket `\u002Ftranscribe\u002Fstream` for partial transcripts as you speak      |\n| **End-to-end speech LLMs**         | Moshi, GLM-4-Voice, Qwen2.5 Omni — real voice-to-voice, no text between  |\n| **Voice Design**                   | Create new voices from text descriptions                                 |\n| **Long-form capture**              | Dual-stream recorder (mic + system audio) with summary LLM transform     |\n| **Platform sinks**                 | Apple Notes, Obsidian, and other opt-in integrations                     |\n| **Plugin architecture**            | Extend with custom models, transforms, and sinks                         |\n| **Mobile companion**               | Control Voicebox from your phone                                         |\n\nFor the **full engineering status, open-issue triage, and prioritized work queue**, see [`docs\u002FPROJECT_STATUS.md`](docs\u002FPROJECT_STATUS.md) — a living document that tracks what's shipped, what's 
in-flight, candidate TTS engines under evaluation, and why we've accepted or backlogged specific integrations.\n\n---\n\n## Development\n\nSee [CONTRIBUTING.md](CONTRIBUTING.md) for detailed setup and contribution guidelines.\n\n### Quick Start\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fjamiepine\u002Fvoicebox.git\ncd voicebox\n\njust setup   # creates Python venv, installs all deps\njust dev     # starts backend + desktop app\n```\n\nInstall [just](https:\u002F\u002Fgithub.com\u002Fcasey\u002Fjust): `brew install just` or `cargo install just`. Run `just --list` to see all commands.\n\n**Prerequisites:** [Bun](https:\u002F\u002Fbun.sh), [Rust](https:\u002F\u002Frustup.rs), [Python 3.11+](https:\u002F\u002Fpython.org), [Tauri Prerequisites](https:\u002F\u002Fv2.tauri.app\u002Fstart\u002Fprerequisites\u002F), and [Xcode](https:\u002F\u002Fdeveloper.apple.com\u002Fxcode\u002F) on macOS.\n\nThe repo ships a pre-wired `.mcp.json` at the root — running Claude Code inside this checkout picks up the Voicebox MCP tools automatically once the dev app is running.\n\n### Building Locally\n\n```bash\njust build          # Build CPU server binary + Tauri app\njust build-local    # (Windows) Build CPU + CUDA server binaries + Tauri app\n```\n\n### Adding New Voice Models\n\nThe multi-engine architecture makes adding new TTS engines straightforward. A [step-by-step guide](docs\u002Fcontent\u002Fdocs\u002Fdeveloper\u002Ftts-engines.mdx) covers the full process: dependency research, backend protocol implementation, frontend wiring, and PyInstaller bundling.\n\nThe guide is optimized for AI coding agents. 
An [agent skill](.agents\u002Fskills\u002Fadd-tts-engine\u002FSKILL.md) can pick up a model name and handle the entire integration autonomously; you just test the build locally.\n\n### Project Structure\n\n```\nvoicebox\u002F\n├── app\u002F              # Shared React frontend\n├── tauri\u002F            # Desktop app (Tauri + Rust)\n├── web\u002F              # Web deployment\n├── backend\u002F          # Python FastAPI server\n├── landing\u002F          # Marketing website\n└── scripts\u002F          # Build & release scripts\n```\n\n---\n\n## Contributing\n\nContributions welcome! See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.\n\n1. Fork the repo\n2. Create a feature branch\n3. Make your changes\n4. Submit a PR\n\n## Security\n\nFound a security vulnerability? Please report it responsibly. See [SECURITY.md](SECURITY.md) for details.\n\n---\n\n## License\n\nMIT License. See [LICENSE](LICENSE) for details.\n\n---\n\n\u003Cp align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Fvoicebox.sh\">voicebox.sh\u003C\u002Fa>\n\u003C\u002Fp>\n","Voicebox is an open-source AI voice studio that runs locally. Its core features include cloning a voice from a few seconds of audio, generating speech in 23 languages across 7 TTS engines, dictating into any text field with a global hotkey, and giving any MCP-aware AI agent a voice of your choosing. The project is built with TypeScript and uses technologies such as CUDA and MLX for acceleration. It suits applications that need high-quality speech synthesis and transcription, such as content creation, writing assistance, game voice-over, and personal or enterprise voice automation.",2,"2026-06-11 02:55:20","top_language"]