[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-75063":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":25,"hasPages":23,"topics":26,"createdAt":10,"pushedAt":10,"updatedAt":40,"readmeContent":41,"aiSummary":42,"trendingCount":16,"starSnapshotCount":16,"syncStatus":43,"lastSyncTime":44,"discoverSource":45},75063,"parlor","fikrikarim\u002Fparlor","fikrikarim","On-device, real-time multimodal AI. Have natural voice and vision conversations with an AI that runs entirely on your machine. Powered by Gemma 4 E2B and Kokoro.","",null,"HTML",1826,224,21,5,0,9,26,74,27,93.46,"Apache License 2.0",false,"main",true,[27,28,29,30,31,32,33,34,35,36,37,38,39],"apple-silicon","gemma","kokoro","litert-lm","local-llm","mlx","multimodal","on-device-ai","python","real-time","speech-recognition","text-to-speech","voice-assistant","2026-06-12 04:01:17","# Parlor\n\nOn-device, real-time multimodal AI. Have natural voice and vision conversations with an AI that runs entirely on your machine.\n\nParlor uses [Gemma 4 E2B](https:\u002F\u002Fhuggingface.co\u002Fgoogle\u002Fgemma-4-E2B-it) for understanding speech and vision, and [Kokoro](https:\u002F\u002Fhuggingface.co\u002Fhexgrad\u002FKokoro-82M) for text-to-speech. You talk, show your camera, and it talks back, all locally.\n\nhttps:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Fcb0ffb2e-f84f-48e7-872c-c5f7b5c6d51f\n\n> **Research preview.** This is an early experiment. 
Expect rough edges and bugs.\n\n## Why?\n\nI'm [self-hosting a totally free voice AI](https:\u002F\u002Fwww.fikrikarim.com\u002Fbule-ai-initial-release\u002F) on my home server to help people learn to speak English. It has hundreds of monthly active users, and I've been thinking about how to keep it free while making it sustainable.\n\nThe obvious answer: run everything on-device, eliminating any server cost. Six months ago I needed an RTX 5090 to run just the voice models in real time.\n\nGoogle just released a super capable small model that I can run on my M3 Pro in real time, with vision too! Sure, you can't do agentic coding with this, but it is a game-changer for people learning a new language. Imagine that a few years from now people can run this locally on their phones. They can point their camera at objects and talk about them. And this model is multilingual, so people can always fall back to their native language if they want. This is essentially what OpenAI demoed a few years ago.\n\n## How it works\n\n```\nBrowser (mic + camera)\n    │\n    │  WebSocket (audio PCM + JPEG frames)\n    ▼\nFastAPI server\n    ├── Gemma 4 E2B via LiteRT-LM (GPU)  →  understands speech + vision\n    └── Kokoro TTS (MLX on Mac, ONNX on Linux)  →  speaks back\n    │\n    │  WebSocket (streamed audio chunks)\n    ▼\nBrowser (playback + transcript)\n```\n\n- **Voice Activity Detection** in the browser ([Silero VAD](https:\u002F\u002Fgithub.com\u002Fricky0123\u002Fvad)). 
Hands-free, no push-to-talk.\n- **Barge-in.** Interrupt the AI mid-sentence by speaking.\n- **Sentence-level TTS streaming.** Audio starts playing before the full response is generated.\n\n## Requirements\n\n- Python 3.12+\n- macOS with Apple Silicon, or Linux with a supported GPU\n- ~3 GB free RAM for the model\n\n## Quick start\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Ffikrikarim\u002Fparlor.git\ncd parlor\n\n# Install uv if you don't have it\ncurl -LsSf https:\u002F\u002Fastral.sh\u002Fuv\u002Finstall.sh | sh\n\ncd src\nuv sync\nuv run server.py\n```\n\nOpen [http:\u002F\u002Flocalhost:8000](http:\u002F\u002Flocalhost:8000), grant camera and microphone access, and start talking.\n\nModels are downloaded automatically on first run (~2.6 GB for Gemma 4 E2B, plus TTS models).\n\n## Configuration\n\n| Variable     | Default                        | Description                                    |\n| ------------ | ------------------------------ | ---------------------------------------------- |\n| `MODEL_PATH` | auto-download from HuggingFace | Path to a local `gemma-4-E2B-it.litertlm` file |\n| `PORT`       | `8000`                         | Server port                                    |\n\n## Performance (Apple M3 Pro)\n\n| Stage                            | Time          |\n| -------------------------------- | ------------- |\n| Speech + vision understanding    | ~1.8-2.2s     |\n| Response generation (~25 tokens) | ~0.3s         |\n| Text-to-speech (1-3 sentences)   | ~0.3-0.7s     |\n| **Total end-to-end**             | **~2.5-3.0s** |\n\nDecode speed: ~83 tokens\u002Fsec on GPU (Apple M3 Pro).\n\n## Project structure\n\n```\nsrc\u002F\n├── server.py              # FastAPI WebSocket server + Gemma 4 inference\n├── tts.py                 # Platform-aware TTS (MLX on Mac, ONNX on Linux)\n├── index.html             # Frontend UI (VAD, camera, audio playback)\n├── pyproject.toml         # Dependencies\n└── benchmarks\u002F\n    ├── bench.py          
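 # End-to-end WebSocket benchmark\n    └── benchmark_tts.py   # TTS backend comparison\n```\n\n## Acknowledgments\n\n- [Gemma 4](https:\u002F\u002Fai.google.dev\u002Fgemma) by Google DeepMind\n- [LiteRT-LM](https:\u002F\u002Fgithub.com\u002Fgoogle-ai-edge\u002FLiteRT-LM) by Google AI Edge\n- [Kokoro](https:\u002F\u002Fhuggingface.co\u002Fhexgrad\u002FKokoro-82M) TTS by Hexgrad\n- [Silero VAD](https:\u002F\u002Fgithub.com\u002Fsnakers4\u002Fsilero-vad) for browser voice activity detection\n\n## License\n\n[Apache 2.0](LICENSE)\n","Parlor is an on-device, real-time multimodal AI project that holds natural voice and vision conversations with the user. It uses the Gemma 4 E2B model to process speech and vision input and the Kokoro model for text-to-speech, with the whole pipeline running locally and no dependence on cloud servers. Its core technical features include browser-based voice activity detection, barge-in, and sentence-level TTS streaming, which make interaction smoother and more natural. It suits scenarios where privacy matters and offline operation is needed; for language learners in particular, it can provide effective speaking practice without a network connection.",2,"2026-06-11 03:52:09","high_star"]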