[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-72128":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":25,"hasPages":23,"topics":26,"createdAt":10,"pushedAt":10,"updatedAt":40,"readmeContent":41,"aiSummary":42,"trendingCount":16,"starSnapshotCount":16,"syncStatus":43,"lastSyncTime":44,"discoverSource":45},72128,"Kokoro-FastAPI","remsky\u002FKokoro-FastAPI","remsky","Dockerized FastAPI wrapper for Kokoro-82M text-to-speech model w\u002Fmultiplatform CPU, AMD, NVIDIA GPU PyTorch support, handling, and auto-stitching","",null,"Python",4981,822,38,98,0,14,51,155,42,30.75,"Apache License 2.0",false,"master",true,[27,28,29,30,31,32,33,34,35,36,37,38,39],"fastapi","huggingface-spaces","kokoro","kokoro-tts","onnx","onnxruntime","openai-compatible-api","openwebui","pytorch","sillytavern","tts","tts-api","uv","2026-06-12 02:02:58","\u003Cp align=\"center\">\n  \u003Cimg src=\"githubbanner.png\" alt=\"Kokoro TTS Banner\">\n\u003C\u002Fp>\n\n# \u003Csub>\u003Csub>_`FastKoko`_ \u003C\u002Fsub>\u003C\u002Fsub>\n[![Tests](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Ftests-81-darkgreen)]()\n[![Coverage](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcoverage-52%25-tan)]()\n[![Try on 
Spaces](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%A4%97%20Try%20on-Spaces-blue)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FRemsky\u002FKokoro-TTS-Zero)\n\n[![Kokoro](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fkokoro-0.9.4-BB5420)](https:\u002F\u002Fgithub.com\u002Fhexgrad\u002Fkokoro)\n[![Misaki](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fmisaki-0.9.4-B8860B)](https:\u002F\u002Fgithub.com\u002Fhexgrad\u002Fmisaki)\n\n[![Tested at Model Commit](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Flast--tested--model--commit-1.0::9901c2b-blue)](https:\u002F\u002Fhuggingface.co\u002Fhexgrad\u002FKokoro-82M\u002Fcommit\u002F9901c2b79161b6e898b7ea857ae5298f47b8b0d6)\n\nDockerized FastAPI wrapper for [Kokoro-82M](https:\u002F\u002Fhuggingface.co\u002Fhexgrad\u002FKokoro-82M) text-to-speech model\n- Multi-language support (English, Japanese, Chinese, _Vietnamese soon_)\n- OpenAI-compatible Speech endpoint with NVIDIA GPU, AMD GPU (ROCm, experimental), or CPU inference via PyTorch. 
Apple Silicon (MPS) supported when running directly via UV.\n- ONNX support coming soon, see v0.1.5 and earlier for legacy ONNX support in the interim\n- Debug endpoints for monitoring system stats, integrated web UI on localhost:8880\u002Fweb\n- Phoneme-based audio generation, phoneme generation\n- Per-word timestamped caption generation\n- Voice mixing with weighted combinations\n\n### Integration Guides\n [![Helm Chart](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FHelm%20Chart-black?style=flat&logo=helm&logoColor=white)](https:\u002F\u002Fgithub.com\u002Fremsky\u002FKokoro-FastAPI\u002Fwiki\u002FSetup-Kubernetes) [![DigitalOcean](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDigitalOcean-black?style=flat&logo=digitalocean&logoColor=white)](https:\u002F\u002Fgithub.com\u002Fremsky\u002FKokoro-FastAPI\u002Fwiki\u002FIntegrations-DigitalOcean) [![SillyTavern](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FSillyTavern-black?style=flat&color=red)](https:\u002F\u002Fgithub.com\u002Fremsky\u002FKokoro-FastAPI\u002Fwiki\u002FIntegrations-SillyTavern)\n[![OpenWebUI](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FOpenWebUI-black?style=flat&color=white)](https:\u002F\u002Fgithub.com\u002Fremsky\u002FKokoro-FastAPI\u002Fwiki\u002FIntegrations-OpenWebUi)\n## Get Started\n\n\u003Cdetails>\n\u003Csummary>Quickest Start (docker run)\u003C\u002Fsummary>\n\n\nPre built images are available to run, with arm\u002Fmulti-arch support, and baked in models\nRefer to the core\u002Fconfig.py file for a full list of variables which can be managed via the environment\n\n```bash\n# the `latest` tag can be used, though it may have some unexpected bonus features which impact stability.\n### Named versions should be pinned for your regular usage.\n### Feedback\u002Ftesting is always welcome\n\ndocker run -p 8880:8880 ghcr.io\u002Fremsky\u002Fkokoro-fastapi-cpu:latest # CPU, or:\ndocker run --gpus all -p 8880:8880 ghcr.io\u002Fremsky\u002Fkokoro-fastapi-gpu:latest  # NVIDIA GPU, 
or:\ndocker run --device=\u002Fdev\u002Fkfd --device=\u002Fdev\u002Fdri -p 8880:8880 ghcr.io\u002Fremsky\u002Fkokoro-fastapi-rocm:latest  # AMD GPU (ROCm, experimental, amd64 only)\n```\n\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\n\u003Csummary>Quick Start (docker compose) \u003C\u002Fsummary>\n\n1. Install the prerequisites and start the service using Docker Compose (full setup including UI):\n   - Install [Docker](https:\u002F\u002Fwww.docker.com\u002Fproducts\u002Fdocker-desktop\u002F)\n   - Clone the repository:\n        ```bash\n        git clone https:\u002F\u002Fgithub.com\u002Fremsky\u002FKokoro-FastAPI.git\n        cd Kokoro-FastAPI\n\n        cd docker\u002Fgpu   # For NVIDIA GPU support\n        # or cd docker\u002Fcpu   # For CPU support\n        # or cd docker\u002Frocm  # For AMD GPU (ROCm, experimental, amd64 only)\n        docker compose up --build\n\n        # *Note for Apple Silicon (M1\u002FM2\u002FM3) users:\n        # The Docker GPU image is CUDA-only and won't run on Apple Silicon. With Docker, use `docker\u002Fcpu`.\n        # For native MPS (Apple GPU) acceleration, run directly via UV with `.\u002Fstart-gpu_mac.sh`.\n\n        # Models will auto-download, but if needed you can manually download:\n        python docker\u002Fscripts\u002Fdownload_model.py --output api\u002Fsrc\u002Fmodels\u002Fv1_0\n\n        # Or run directly via UV:\n        .\u002Fstart-gpu.sh  # For GPU support\n        .\u002Fstart-cpu.sh  # For CPU support\n        ```\n\u003C\u002Fdetails>\n\u003Cdetails>\n\u003Csummary>Direct Run (via uv) \u003C\u002Fsummary>\n\n1. Install prerequisites:\n   - Install [astral-uv](https:\u002F\u002Fdocs.astral.sh\u002Fuv\u002F)\n   - Install [espeak-ng](https:\u002F\u002Fgithub.com\u002Fespeak-ng\u002Fespeak-ng) on your system if you want it available as a fallback for unknown words\u002Fsounds. 
The upstream libraries may attempt to handle this, but results have varied.\n   - Clone the repository:\n        ```bash\n        git clone https:\u002F\u002Fgithub.com\u002Fremsky\u002FKokoro-FastAPI.git\n        cd Kokoro-FastAPI\n        ```\n        \n        Run the [model download script](https:\u002F\u002Fgithub.com\u002Fremsky\u002FKokoro-FastAPI\u002Fblob\u002Fmaster\u002Fdocker\u002Fscripts\u002Fdownload_model.py) if you haven't already.\n     \n        Start directly via UV (with hot-reload):\n        \n        Linux and macOS\n        ```bash\n        .\u002Fstart-cpu.sh   # CPU, or:\n        .\u002Fstart-gpu.sh   # GPU\n        ```\n\n        Windows\n        ```powershell\n        .\\start-cpu.ps1   # CPU, or:\n        .\\start-gpu.ps1   # GPU\n        ```\n\n\u003C\u002Fdetails>\n\n\u003Cdetails open>\n\u003Csummary> Up and Running? \u003C\u002Fsummary>\n\n\nRun locally as an OpenAI-compatible speech endpoint:\n    \n```python\nfrom openai import OpenAI\n\nclient = OpenAI(\n    base_url=\"http:\u002F\u002Flocalhost:8880\u002Fv1\", api_key=\"not-needed\"\n)\n\nwith client.audio.speech.with_streaming_response.create(\n    model=\"kokoro\",\n    voice=\"af_sky+af_bella\",  # single or multiple voicepack combo\n    input=\"Hello world!\"\n) as response:\n    response.stream_to_file(\"output.mp3\")\n```\n  \n- The API will be available at http:\u002F\u002Flocalhost:8880\n- API Documentation: http:\u002F\u002Flocalhost:8880\u002Fdocs\n\n- Web Interface: http:\u002F\u002Flocalhost:8880\u002Fweb\n\n\u003Cdiv align=\"center\" style=\"display: flex; justify-content: center; gap: 10px;\">\n  \u003Cimg src=\"assets\u002Fdocs-screenshot.png\" width=\"42%\" alt=\"API Documentation\" style=\"border: 2px solid #333; padding: 10px;\">\n  \u003Cimg src=\"assets\u002Fwebui-screenshot.png\" width=\"42%\" alt=\"Web UI Screenshot\" style=\"border: 2px solid #333; padding: 10px;\">\n\u003C\u002Fdiv>\n\n\u003C\u002Fdetails>\n\n## Features \n\n\u003Cdetails>\n\u003Csummary>OpenAI-Compatible Speech 
Endpoint\u003C\u002Fsummary>\n\n```python\n# Using OpenAI's Python library\nfrom openai import OpenAI\nclient = OpenAI(base_url=\"http:\u002F\u002Flocalhost:8880\u002Fv1\", api_key=\"not-needed\")\nresponse = client.audio.speech.create(\n    model=\"kokoro\",  \n    voice=\"af_bella+af_sky\", # see \u002Fapi\u002Fsrc\u002Fcore\u002Fopenai_mappings.json to customize\n    input=\"Hello world!\",\n    response_format=\"mp3\"\n)\n\nresponse.stream_to_file(\"output.mp3\")\n```\nOr Via Requests:\n```python\nimport requests\n\n\nresponse = requests.get(\"http:\u002F\u002Flocalhost:8880\u002Fv1\u002Faudio\u002Fvoices\")\nvoices = response.json()[\"voices\"]\n\n# Generate audio\nresponse = requests.post(\n    \"http:\u002F\u002Flocalhost:8880\u002Fv1\u002Faudio\u002Fspeech\",\n    json={\n        \"model\": \"kokoro\",  \n        \"input\": \"Hello world!\",\n        \"voice\": \"af_bella\",\n        \"response_format\": \"mp3\",  # Supported: mp3, wav, opus, flac\n        \"speed\": 1.0\n    }\n)\n\n# Save audio\nwith open(\"output.mp3\", \"wb\") as f:\n    f.write(response.content)\n```\n\nQuick tests (run from another terminal):\n```bash\npython examples\u002Fassorted_checks\u002Ftest_openai\u002Ftest_openai_tts.py # Test OpenAI Compatibility\npython examples\u002Fassorted_checks\u002Ftest_voices\u002Ftest_all_voices.py # Test all available voices\n```\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>Voice Combination\u003C\u002Fsummary>\n\n- Weighted voice combinations using ratios (e.g., \"af_bella(2)+af_heart(1)\" for 67%\u002F33% mix)\n- Ratios are automatically normalized to sum to 100%\n- Available through any endpoint by adding weights in parentheses\n- Saves generated voicepacks for future use\n\nCombine voices and generate audio:\n```python\nimport requests\nresponse = requests.get(\"http:\u002F\u002Flocalhost:8880\u002Fv1\u002Faudio\u002Fvoices\")\nvoices = response.json()[\"voices\"]\n\n# Example 1: Simple voice combination (50%\u002F50% mix)\nresponse = 
requests.post(\n    \"http:\u002F\u002Flocalhost:8880\u002Fv1\u002Faudio\u002Fspeech\",\n    json={\n        \"input\": \"Hello world!\",\n        \"voice\": \"af_bella+af_sky\",  # Equal weights\n        \"response_format\": \"mp3\"\n    }\n)\n\n# Example 2: Weighted voice combination (67%\u002F33% mix)\nresponse = requests.post(\n    \"http:\u002F\u002Flocalhost:8880\u002Fv1\u002Faudio\u002Fspeech\",\n    json={\n        \"input\": \"Hello world!\",\n        \"voice\": \"af_bella(2)+af_sky(1)\",  # 2:1 ratio = 67%\u002F33%\n        \"response_format\": \"mp3\"\n    }\n)\n\n# Example 3: Download combined voice as .pt file\nresponse = requests.post(\n    \"http:\u002F\u002Flocalhost:8880\u002Fv1\u002Faudio\u002Fvoices\u002Fcombine\",\n    json=\"af_bella(2)+af_sky(1)\"  # 2:1 ratio = 67%\u002F33%\n)\n\n# Save the .pt file\nwith open(\"combined_voice.pt\", \"wb\") as f:\n    f.write(response.content)\n\n# Use the downloaded voice file\nresponse = requests.post(\n    \"http:\u002F\u002Flocalhost:8880\u002Fv1\u002Faudio\u002Fspeech\",\n    json={\n        \"input\": \"Hello world!\",\n        \"voice\": \"combined_voice\",  # Use the saved voice file\n        \"response_format\": \"mp3\"\n    }\n)\n\n```\n\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002Fvoice_analysis.png\" width=\"80%\" alt=\"Voice Analysis Comparison\" style=\"border: 2px solid #333; padding: 10px;\">\n\u003C\u002Fp>\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>Multiple Output Audio Formats\u003C\u002Fsummary>\n\n- mp3\n- wav\n- opus \n- flac\n- m4a\n- pcm\n\n\u003Cp align=\"center\">\n\u003Cimg src=\"assets\u002Fformat_comparison.png\" width=\"80%\" alt=\"Audio Format Comparison\" style=\"border: 2px solid #333; padding: 10px;\">\n\u003C\u002Fp>\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>Streaming Support\u003C\u002Fsummary>\n\n```python\n# OpenAI-compatible streaming\nfrom openai import OpenAI\nclient = OpenAI(\n    
base_url=\"http:\u002F\u002Flocalhost:8880\u002Fv1\", api_key=\"not-needed\")\n\n# Stream to file\nwith client.audio.speech.with_streaming_response.create(\n    model=\"kokoro\",\n    voice=\"af_bella\",\n    input=\"Hello world!\"\n) as response:\n    response.stream_to_file(\"output.mp3\")\n\n# Stream to speakers (requires PyAudio)\nimport pyaudio\nplayer = pyaudio.PyAudio().open(\n    format=pyaudio.paInt16, \n    channels=1, \n    rate=24000, \n    output=True\n)\n\nwith client.audio.speech.with_streaming_response.create(\n    model=\"kokoro\",\n    voice=\"af_bella\",\n    response_format=\"pcm\",\n    input=\"Hello world!\"\n) as response:\n    for chunk in response.iter_bytes(chunk_size=1024):\n        player.write(chunk)\n```\n\nOr via requests:\n```python\nimport requests\n\nresponse = requests.post(\n    \"http:\u002F\u002Flocalhost:8880\u002Fv1\u002Faudio\u002Fspeech\",\n    json={\n        \"input\": \"Hello world!\",\n        \"voice\": \"af_bella\",\n        \"response_format\": \"pcm\"\n    },\n    stream=True\n)\n\nfor chunk in response.iter_content(chunk_size=1024):\n    if chunk:\n        # Process streaming chunks\n        pass\n```\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002Fgpu_first_token_timeline_openai.png\" width=\"45%\" alt=\"GPU First Token Timeline\" style=\"border: 2px solid #333; padding: 10px; margin-right: 1%;\">\n  \u003Cimg src=\"assets\u002Fcpu_first_token_timeline_stream_openai.png\" width=\"45%\" alt=\"CPU First Token Timeline\" style=\"border: 2px solid #333; padding: 10px;\">\n\u003C\u002Fp>\n\nKey Streaming Metrics:\n- First token latency @ chunksize\n    - ~300ms  (GPU) @ 400 \n    - ~3500ms (CPU) @ 200 (older i7)\n    - ~\u003C1s    (CPU) @ 200 (M3 Pro)\n- Adjustable chunking settings for real-time playback \n\n*Note: Artifacts in intonation can increase with smaller chunks*\n\u003C\u002Fdetails>\n\n## Processing Details\n\u003Cdetails>\n\u003Csummary>Performance 
Benchmarks\u003C\u002Fsummary>\n\nBenchmarking was performed on generation via the local API using text lengths up to feature-length books (~1.5 hours output), measuring processing time and realtime factor. Tests were run on: \n- Windows 11 Home w\u002F WSL2 \n- NVIDIA 4060Ti 16gb GPU @ CUDA 12.1\n- 11th Gen i7-11700 @ 2.5GHz\n- 64gb RAM\n- WAV native output\n- H.G. Wells - The Time Machine (full text)\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002Fgpu_processing_time.png\" width=\"45%\" alt=\"Processing Time\" style=\"border: 2px solid #333; padding: 10px; margin-right: 1%;\">\n  \u003Cimg src=\"assets\u002Fgpu_realtime_factor.png\" width=\"45%\" alt=\"Realtime Factor\" style=\"border: 2px solid #333; padding: 10px;\">\n\u003C\u002Fp>\n\nKey Performance Metrics:\n- Realtime Speed: Ranges between 35x-100x (generation time to output audio length)\n- Average Processing Rate: 137.67 tokens\u002Fsecond (cl100k_base)\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>Transcription Roundtrip (WER\u002FCER)\u003C\u002Fsummary>\n\nEnd-to-end roundtrip: synthesize with Kokoro, transcribe the result back with [`faster-whisper`](https:\u002F\u002Fgithub.com\u002FSYSTRAN\u002Ffaster-whisper), compare to the source text. Scripts and data live under `examples\u002Fassorted_checks\u002Ftest_transcription\u002F`.\n\n**Long-form English** (full book, *A Journey to the Centre of the Earth*, Project Gutenberg, voice `af_heart`, `base.en` Whisper on CUDA float16, baseline captured on cu126 GPU build):\n\n| Run | Input chars | Audio length | Synth speedup | Transcribe speedup | WER |\n| --- | --- | --- | --- | --- | --- |\n| Short (~ch.7) | 64,996 | 66m 06s | 36.4x rt | 62.4x rt | **0.047** |\n| Full book | 502,766 | 507m 52s | 45.7x rt | 65.1x rt | **0.033** |\n\nSee `examples\u002Fassorted_checks\u002Ftest_transcription\u002FBASELINE.md` for the full regression bands.\n\n**Per-language check** (single-sentence per voice, multilingual Whisper `small`. 
WER for Latin scripts, CER for ja\u002Fzh\u002Fhi):\n\n| Language | Voice | Metric | Score |\n| --- | --- | --- | --- |\n| English | `af_heart` | WER | 0.000 |\n| English (UK) | `bf_emma` | WER | 0.111 |\n| Spanish | `ef_dora` | WER | 0.000 |\n| French | `ff_siwis` | WER | 0.000 |\n| Italian | `if_sara` | WER | 0.000 |\n| Portuguese | `pf_dora` | WER | 0.000 |\n| Hindi | `hf_alpha` | CER | 0.059 |\n| Japanese | `jf_alpha` | CER | 0.000 |\n| Chinese | `zf_xiaobei` | CER | 0.143 |\n\n*Caveat: these are single short sentences, not a comprehensive per-language quality benchmark. They confirm each voice produces transcribable audio in its target language; deeper quality evaluation per language is open work.*\n\nTo reproduce, see `examples\u002Fassorted_checks\u002Ftest_transcription\u002FREADME.md`.\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>GPU Vs. CPU\u003C\u002Fsummary>\n\n```bash\n# GPU: Requires NVIDIA driver with CUDA 12.6+ support (~35x-100x realtime speed)\ncd docker\u002Fgpu\ndocker compose up --build\n\n# CPU: PyTorch CPU inference\ncd docker\u002Fcpu\ndocker compose up --build\n\n# AMD GPU: ROCm 6.4 (experimental, amd64 only)\ncd docker\u002Frocm\ndocker compose up --build\n\n```\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>Natural Boundary Detection\u003C\u002Fsummary>\n\n- Automatically splits and stitches at sentence boundaries \n- Helps to reduce artifacts and allow long form processing as the base model is only currently configured for approximately 30s output\n\nThe model is capable of processing up to a 510 phonemized token chunk at a time, however, this can often lead to 'rushed' speech or other artifacts. 
An additional layer of chunking is applied in the server, that creates flexible chunks with a `TARGET_MIN_TOKENS` , `TARGET_MAX_TOKENS`, and `ABSOLUTE_MAX_TOKENS` which are configurable via environment variables, and set to 175, 250, 450 by default\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>Timestamped Captions & Phonemes\u003C\u002Fsummary>\n\nGenerate audio with word-level timestamps without streaming:\n```python\nimport requests\nimport base64\nimport json\n\nresponse = requests.post(\n    \"http:\u002F\u002Flocalhost:8880\u002Fdev\u002Fcaptioned_speech\",\n    json={\n        \"model\": \"kokoro\",\n        \"input\": \"Hello world!\",\n        \"voice\": \"af_bella\",\n        \"speed\": 1.0,\n        \"response_format\": \"mp3\",\n        \"stream\": False,\n    },\n    stream=False\n)\n\nwith open(\"output.mp3\",\"wb\") as f:\n\n    audio_json=json.loads(response.content)\n    \n    # Decode base 64 stream to bytes\n    chunk_audio=base64.b64decode(audio_json[\"audio\"].encode(\"utf-8\"))\n    \n    # Process streaming chunks\n    f.write(chunk_audio)\n    \n    # Print word level timestamps\n    print(audio_json[\"timestamps\"])\n```\n\nGenerate audio with word-level timestamps with streaming:\n```python\nimport requests\nimport base64\nimport json\n\nresponse = requests.post(\n    \"http:\u002F\u002Flocalhost:8880\u002Fdev\u002Fcaptioned_speech\",\n    json={\n        \"model\": \"kokoro\",\n        \"input\": \"Hello world!\",\n        \"voice\": \"af_bella\",\n        \"speed\": 1.0,\n        \"response_format\": \"mp3\",\n        \"stream\": True,\n    },\n    stream=True\n)\n\nf=open(\"output.mp3\",\"wb\")\nfor chunk in response.iter_lines(decode_unicode=True):\n    if chunk:\n        chunk_json=json.loads(chunk)\n        \n        # Decode base 64 stream to bytes\n        chunk_audio=base64.b64decode(chunk_json[\"audio\"].encode(\"utf-8\"))\n        \n        # Process streaming chunks\n        f.write(chunk_audio)\n        \n        # 
Print word level timestamps\n        print(chunk_json[\"timestamps\"])\n```\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>Phoneme & Token Routes\u003C\u002Fsummary>\n\nConvert text to phonemes and\u002For generate audio directly from phonemes:\n```python\nimport requests\n\ndef get_phonemes(text: str, language: str = \"a\"):\n    \"\"\"Get phonemes and tokens for input text\"\"\"\n    response = requests.post(\n        \"http:\u002F\u002Flocalhost:8880\u002Fdev\u002Fphonemize\",\n        json={\"text\": text, \"language\": language}  # \"a\" for American English\n    )\n    response.raise_for_status()\n    result = response.json()\n    return result[\"phonemes\"], result[\"tokens\"]\n\ndef generate_audio_from_phonemes(phonemes: str, voice: str = \"af_bella\"):\n    \"\"\"Generate audio from phonemes\"\"\"\n    response = requests.post(\n        \"http:\u002F\u002Flocalhost:8880\u002Fdev\u002Fgenerate_from_phonemes\",\n        json={\"phonemes\": phonemes, \"voice\": voice},\n        headers={\"Accept\": \"audio\u002Fwav\"}\n    )\n    if response.status_code != 200:\n        print(f\"Error: {response.text}\")\n        return None\n    return response.content\n\n# Example usage\ntext = \"Hello world!\"\ntry:\n    # Convert text to phonemes\n    phonemes, tokens = get_phonemes(text)\n    print(f\"Phonemes: {phonemes}\")  # e.g. 
ðɪs ɪz ˈoʊnli ɐ tˈɛst\n    print(f\"Tokens: {tokens}\")      # Token IDs including start\u002Fend tokens\n\n    # Generate and save audio\n    if audio_bytes := generate_audio_from_phonemes(phonemes):\n        with open(\"speech.wav\", \"wb\") as f:\n            f.write(audio_bytes)\n        print(f\"Generated {len(audio_bytes)} bytes of audio\")\nexcept Exception as e:\n    print(f\"Error: {e}\")\n```\n\nSee `examples\u002Fphoneme_examples\u002Fgenerate_phonemes.py` for a sample script.\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>Debug Endpoints\u003C\u002Fsummary>\n\nMonitor system state and resource usage with these endpoints:\n\n- `\u002Fdebug\u002Fthreads` - Get thread information and stack traces\n- `\u002Fdebug\u002Fstorage` - Monitor temp file and output directory usage\n- `\u002Fdebug\u002Fsystem` - Get system information (CPU, memory, GPU)\n- `\u002Fdebug\u002Fsession_pools` - View ONNX session and CUDA stream status\n\nUseful for debugging resource exhaustion or performance issues.\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>Logging\u003C\u002Fsummary>\n\nThe global API [loguru logging level](https:\u002F\u002Floguru.readthedocs.io\u002Fen\u002Fstable\u002Fapi\u002Flogger.html#levels) can be set using the `API_LOG_LEVEL` environment variable. It defaults to `DEBUG`.\n\n**Docker**\n\nModify the appropriate compose `yml`, or pass the variable on the command line.\n```bash\ndocker run --env 'API_LOG_LEVEL=WARNING' ...\n```\n\n**Direct via UV**\n\nLinux and macOS\n```bash\nexport API_LOG_LEVEL=WARNING\n.\u002Fstart-cpu.sh   # CPU, or:\n.\u002Fstart-gpu.sh   # GPU\n```\n\nWindows\n```powershell\n$env:API_LOG_LEVEL = 'WARNING'\n.\\start-cpu.ps1   # CPU, or:\n.\\start-gpu.ps1   # GPU\n```\n\u003C\u002Fdetails>\n\n## Known Issues & Troubleshooting\n\n\u003Cdetails>\n\u003Csummary>Missing words & Missing some timestamps\u003C\u002Fsummary>\n\nThe API automatically normalizes input text, which may incorrectly remove or change some phrases. 
This can be disabled by adding `\"normalization_options\":{\"normalize\": false}` to your request json:\n```python\nimport requests\n\nresponse = requests.post(\n    \"http:\u002F\u002Flocalhost:8880\u002Fv1\u002Faudio\u002Fspeech\",\n    json={\n        \"input\": \"Hello world!\",\n        \"voice\": \"af_heart\",\n        \"response_format\": \"pcm\",\n        \"normalization_options\":\n        {\n            \"normalize\": False\n        }\n    },\n    stream=True\n)\n\nfor chunk in response.iter_content(chunk_size=1024):\n    if chunk:\n        # Process streaming chunks\n        pass\n```\n  \n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>Versioning & Development\u003C\u002Fsummary>\n\n**Branching Strategy:**\n*   **`release` branch:** Contains the latest stable build, recommended for production use. Docker images tagged with specific versions (e.g., `v0.3.0`) are built from this branch.\n*   **`master` branch:** Used for active development. It may contain experimental features, ongoing changes, or fixes not yet in a stable release. Use this branch if you want the absolute latest code, but be aware it might be less stable. The `latest` Docker tag often points to builds from this branch.\n\nNote: This is a *development* focused project at its core. \n\nIf you run into trouble, you may have to roll back a version on the release tags if something comes up, or build up from source and\u002For troubleshoot + submit a PR.\n\nFree and open source is a community effort, and there's only really so many hours in a day. 
If you'd like to support the work, feel free to open a PR, buy me a coffee, or report any bugs, feature ideas, etc. you come across during use.\n\n  \u003Ca href=\"https:\u002F\u002Fwww.buymeacoffee.com\u002Fremsky\" target=\"_blank\">\n    \u003Cimg \n      src=\"https:\u002F\u002Fcdn.buymeacoffee.com\u002Fbuttons\u002Fv2\u002Fdefault-violet.png\" \n      alt=\"Buy Me A Coffee\" \n      style=\"height: 30px !important;width: 110px !important;\"\n    >\n  \u003C\u002Fa>\n\n  \n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>Linux GPU Permissions\u003C\u002Fsummary>\n\nSome Linux users may encounter GPU permission issues when running as non-root. \nNothing here is guaranteed, but the following are some common solutions. Consider your security requirements carefully.\n\n### Option 1: Container Groups (Likely the best option)\n```yaml\nservices:\n  kokoro-tts:\n    # ... existing config ...\n    group_add:\n      - \"video\"\n      - \"render\"\n```\n\n### Option 2: Host System Groups\n```yaml\nservices:\n  kokoro-tts:\n    # ... existing config ...\n    user: \"${UID}:${GID}\"\n    group_add:\n      - \"video\"\n```\nNote: This may require adding the host user to the groups (`sudo usermod -aG docker,video $USER`) and restarting the system.\n\n### Option 3: Device Permissions (Use with caution)\n```yaml\nservices:\n  kokoro-tts:\n    # ... existing config ...\n    devices:\n      - \u002Fdev\u002Fnvidia0:\u002Fdev\u002Fnvidia0\n      - \u002Fdev\u002Fnvidiactl:\u002Fdev\u002Fnvidiactl\n      - \u002Fdev\u002Fnvidia-uvm:\u002Fdev\u002Fnvidia-uvm\n```\n⚠️ Warning: Reduces system security. 
Use only in development environments.\n\nPrerequisites: NVIDIA GPU, drivers, and container toolkit must be properly configured.\n\nVisit [NVIDIA Container Toolkit installation](https:\u002F\u002Fdocs.nvidia.com\u002Fdatacenter\u002Fcloud-native\u002Fcontainer-toolkit\u002Flatest\u002Finstall-guide.html) for more detailed information.\n\n\u003C\u002Fdetails>\n\n## Model and License\n\n\u003Cdetails open>\n\u003Csummary>Model\u003C\u002Fsummary>\n\nThis API uses the [Kokoro-82M](https:\u002F\u002Fhuggingface.co\u002Fhexgrad\u002FKokoro-82M) model from HuggingFace. \n\nVisit the model page for more details about training, architecture, and capabilities. I have no affiliation with any of their work, and produced this wrapper for ease of use and personal projects.\n\u003C\u002Fdetails>\n\u003Cdetails>\n\u003Csummary>License\u003C\u002Fsummary>\nThis project is licensed under the Apache License 2.0 - see below for details:\n\n- The Kokoro model weights are licensed under Apache 2.0 (see [model page](https:\u002F\u002Fhuggingface.co\u002Fhexgrad\u002FKokoro-82M))\n- The FastAPI wrapper code in this repository is licensed under Apache 2.0 to match\n- The inference code adapted from StyleTTS2 is MIT licensed\n\nThe full Apache 2.0 license text can be found at: https:\u002F\u002Fwww.apache.org\u002Flicenses\u002FLICENSE-2.0\n\u003C\u002Fdetails>\n\n\u003C\u002Fdetails open>\n\n## Contributor Stats\n![Alt](https:\u002F\u002Frepobeats.axiom.co\u002Fapi\u002Fembed\u002Ff9694366bf96febc749d592316ff0a275fe77219.svg \"Repobeats analytics image\")\n\u003C\u002Fdetails>\n\n\u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fremsky\u002FKokoro-FastAPI\u002Fgraphs\u002Fcontributors\">\n  \u003Cimg src=\"https:\u002F\u002Fcontrib.rocks\u002Fimage?repo=remsky\u002FKokoro-FastAPI\" \u002F>\n\u003C\u002Fa>\n\nMade with [contrib.rocks](https:\u002F\u002Fcontrib.rocks).","Kokoro-FastAPI is a Docker-based FastAPI wrapper for the Kokoro-82M text-to-speech model, supporting CPU and NVIDIA GPU inference. 
Its core features include multi-language support (English, Japanese, Chinese, and others), an OpenAI-compatible speech endpoint, phoneme-based audio generation, and per-word timestamped caption generation. On the technical side, it provides debug endpoints for monitoring system state and an integrated web UI on localhost. It also supports inference via PyTorch on multiple hardware platforms, including Apple Silicon (MPS). It suits applications that need high-quality text-to-speech, such as virtual assistants, automated speech synthesis, and multimedia content production.",2,"2026-06-11 03:40:29","high_star"]