[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-81030":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":14,"stars7d":16,"stars30d":17,"stars90d":15,"forks30d":15,"starsTrendScore":13,"compositeScore":18,"rankGlobal":10,"rankLanguage":10,"license":19,"archived":20,"fork":20,"defaultBranch":21,"hasWiki":20,"hasPages":20,"topics":22,"createdAt":10,"pushedAt":10,"updatedAt":43,"readmeContent":44,"aiSummary":45,"trendingCount":15,"starSnapshotCount":15,"syncStatus":46,"lastSyncTime":47,"discoverSource":48},81030,"voice-agents-from-scratch","pguso\u002Fvoice-agents-from-scratch","pguso","From-scratch voice agents in Python: end-to-end speech pipelines, runnable chapters, and a small shared library. Local models, explicit streaming behavior.","",null,"Python",36,3,1,0,4,7,46.51,"MIT License",false,"main",[23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42],"agents","edge-ai","faster-whisper","kokoro-82m","llm","local-ai","onnx","python","speech-to-text","streaming","text-to-speech","tool-calling","tool-calling-agent","tutorial","uv","vad","voice-activity-detection","voice-agent","voice-agents","whisper","2026-06-12 04:01:31","![banner.jpeg](diagrams\u002Fbanner.jpeg)\n\n# Voice agents from scratch\n\n[![Python 3.11+](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpython-3.11%2B-3776AB?style=flat&logo=python&logoColor=white)](https:\u002F\u002Fwww.python.org\u002Fdownloads\u002F)\n[![License: MIT](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-MIT-yellow.svg)](LICENSE)\n[![uv](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fuv-package%20manager-5A45FF?style=flat&logo=uv&logoColor=white)](https:\u002F\u002Fdocs.astral.sh\u002Fuv\u002F)\n[![GitHub 
stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fpguso\u002Fvoice-agents-from-scratch?style=flat&logo=github&label=stars)](https:\u002F\u002Fgithub.com\u002Fpguso\u002Fvoice-agents-from-scratch)\n\nBuild **real-time voice agents** from the ground up - microphone → **STT** → **LLM** → **TTS** → speaker - with tools, personality, and streaming latency called out explicitly. This repo is a **hands-on tutorial**: numbered chapters, runnable scripts, and a small shared library under `src\u002Fvoice_agents\u002F`.\n\n---\n\n## What you need before you start\n\n| Requirement | Notes |\n|-------------|--------|\n| **Python 3.11+** | Declared in `pyproject.toml` (`requires-python >= 3.11`). |\n| **Microphone and speakers\u002Fheadphones** | Examples record from the default input device and play to the default output. |\n| **Disk space** | Roughly **800 MB** for bundled tutorial models (Whisper tiny.en, Qwen 2.5 0.5B Q4 GGUF, Kokoro v1.0). `models\u002F` is gitignored. |\n| **Network** | Needed for `uv sync` (packages) and the first **`download_models.py`** run (Hugging Face + release URLs). |\n| **PortAudio** | Used by `sounddevice`. On macOS it is often already satisfied; on Linux you may need your distro’s PortAudio dev package so `sounddevice` can load. 
|\n\nSome lessons use extra dependency groups:\n\n- **`vad`**  -  PyTorch + torchaudio for voice-activity and related examples (`uv sync --extra vad`).\n- **`serve`**  -  FastAPI + Uvicorn + websockets for the local WebSocket server \u002F browser client (`uv sync --extra serve`).\n- **`deploy`**  -  Modal client for hosted deployment in chapter 10 (`uv sync --extra deploy`); combine with `serve` to run the ASGI app locally too.\n\n---\n\n## Install the toolchain\n\nThis project standardizes on **[uv](https:\u002F\u002Fdocs.astral.sh\u002Fuv\u002F)** for environments and running scripts.\n\n```bash\ncurl -LsSf https:\u002F\u002Fastral.sh\u002Fuv\u002Finstall.sh | sh   # once per machine\n```\n\n---\n\n## Get the code and dependencies\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fpguso\u002Fvoice-agents-from-scratch.git\ncd voice-agents-from-scratch\n\nuv sync\n```\n\nInstall optional groups only when you need them (you can combine extras, e.g. `uv sync --extra vad --extra serve --extra deploy`).\n\n```bash\nuv sync --extra vad\nuv sync --extra serve\nuv sync --extra deploy\n```\n\n---\n\n## Verify audio and Python imports\n\n```bash\nuv run python 00_start_here\u002Fcheck_deps.py\n```\n\nYou should see `OK` for each required module. A **WARN** on PortAudio usually means the OS cannot see an audio backend - fix that before relying on mic\u002Fspeaker examples.\n\n---\n\n## Download models (first run)\n\n```bash\nuv run python 00_start_here\u002Fdownload_models.py\n```\n\nThis populates **`models\u002F`** (Whisper weights via `faster-whisper`, LLM GGUF via Hugging Face, Kokoro ONNX + voice bundle via HTTP). 
Safe to re-run; existing files are skipped.\n\n---\n\n## Run your first end-to-end voice agent\n\nFrom the repository root:\n\n```bash\nuv run python 00_start_here\u002Frun_first_voice_agent.py\n```\n\nYou will be prompted to record a few seconds of speech; the script transcribes with Whisper, replies with a local LLM, and speaks with Kokoro using a **streaming** pipeline so playback can start before the full reply is generated. Chapter zero explains the design in depth: **[00_start_here\u002FREADME.md](00_start_here\u002FREADME.md)**.\n\n---\n\n## Optional: interactive script launcher\n\n```bash\nuv run voice-agent\n```\n\n(or `uv run start`) opens a menu of common tutorial scripts. You can always run any script directly with `uv run python \u003Cpath>`.\n\n---\n\n## If `llama-cpp-python` fails to install\n\nPrebuilt wheels are often available from the [llama-cpp-python wheel index](https:\u002F\u002Fabetlen.github.io\u002Fllama-cpp-python\u002F). Platform-specific hints also live in **[00_start_here\u002FREADME.md](00_start_here\u002FREADME.md)**.\n\n---\n\n## How the course is organized\n\nEach numbered folder is a chapter. Inside lesson folders you typically find a **`.py`** entry script and a **`CODE.md`** walkthrough. 
Shared building blocks live in **`src\u002Fvoice_agents\u002F`** (`audio`, `stt`, `tts`, `agent`, `tools`, …).\n\n| Chapter | Topic |\n|---------|--------|\n| **[00_start_here](00_start_here\u002F)** | First runnable agent, model download, architecture narrative |\n| **[01_audio_io](01_audio_io\u002F)** | Mic, speaker, WAV, streaming blocks, simple VAD-style debug |\n| **[02_speech_to_text](02_speech_to_text\u002F)** | Whisper: one-shot, streaming, partial results |\n| **[03_text_to_speech](03_text_to_speech\u002F)** | Kokoro: basics, streaming, profiles, latency |\n| **[04_agent_core](04_agent_core\u002F)** | Prompting, loops, memory, debugging |\n| **[05_full_voice_loop](05_full_voice_loop\u002F)** | Blocking vs streaming agents, latency tooling |\n| **[06_real_time_systems](06_real_time_systems\u002F)** | Turn-taking, interruption, duplex patterns |\n| **[07_tools](07_tools\u002F)** | Tool routing, calculator, time, weather, web search, LLM tool loops |\n| **[08_personality](08_personality\u002F)** | Style and emotional variation |\n| **[09_projects](09_projects\u002F)** | Larger compositions (CLI assistant, tutor, interviewer, …) |\n\nFurther reading:\n\n- **[00_start_here\u002Farchitecture_overview.md](00_start_here\u002Farchitecture_overview.md)**  -  big-picture diagram and flow\n- **[GLOSSARY.md](GLOSSARY.md)**  -  PCM, STT, TTS, VAD, …\n- **[OPTIMIZE.md](OPTIMIZE.md)**  -  optional Apple Silicon notes (Metal \u002F CoreML)\n\n---\n\n## License\n\nSee **[LICENSE](LICENSE)**.\n","This project builds real-time voice agents from scratch, covering the full flow from microphone input through speech recognition and language-model processing to text-to-speech output. Core features include an end-to-end speech pipeline, runnable tutorial chapters, and a small shared library, with support for local models and explicit streaming behavior. The project is written in Python and uses tools such as Faster-Whisper for speech recognition, alongside voice-activity detection. It suits scenarios that need low-latency voice interaction on edge devices, such as smart-home control and virtual assistants.",2,"2026-06-11 04:03:15","CREATED_QUERY"]