[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-5759":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":25,"hasPages":23,"topics":26,"createdAt":10,"pushedAt":10,"updatedAt":46,"readmeContent":47,"aiSummary":48,"trendingCount":16,"starSnapshotCount":16,"syncStatus":49,"lastSyncTime":50,"discoverSource":51},5759,"shimmy","Michael-A-Kuykendall\u002Fshimmy","Michael-A-Kuykendall","⚡ Pure-Rust WebGPU inference engine — OpenAI-API compatible, GGUF native, runs on any GPU. No Python. No llama.cpp. Single binary.","",null,"Rust",5399,513,39,11,0,16,78,601,91,39.13,"Apache License 2.0",false,"main",true,[27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45],"api-server","command-line-tool","developer-tools","gguf","huggingface","huggingface-models","huggingface-transformers","inference-server","llama","llamacpp","llm-inference","local-ai","lora","machine-learning","ollama-api","openai-compatible","rust","rust-crate","transformers","2026-06-12 02:01:14","\u003Cdiv align=\"center\">\r\n  \u003Cimg src=\"assets\u002Fshimmy-logo.png\" alt=\"Shimmy Logo\" width=\"300\" height=\"auto\" \u002F>\r\n\r\n  # The Lightweight OpenAI API Server\r\n\r\n  ### 🔒 Local Inference Without Dependencies 🚀\r\n\r\n  [![License: MIT](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-MIT-yellow.svg)](https:\u002F\u002Fopensource.org\u002Flicenses\u002FMIT)\r\n  [![Security](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FSecurity-Audited-green)](https:\u002F\u002Fgithub.com\u002FMichael-A-Kuykendall\u002Fshimmy\u002Fsecurity)\r\n  
[![Crates.io](https:\u002F\u002Fimg.shields.io\u002Fcrates\u002Fv\u002Fshimmy.svg)](https:\u002F\u002Fcrates.io\u002Fcrates\u002Fshimmy)\r\n  [![Downloads](https:\u002F\u002Fimg.shields.io\u002Fcrates\u002Fd\u002Fshimmy.svg)](https:\u002F\u002Fcrates.io\u002Fcrates\u002Fshimmy)\r\n  [![Rust](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Frust-stable-brightgreen.svg)](https:\u002F\u002Frustup.rs\u002F)\r\n  [![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FMichael-A-Kuykendall\u002Fshimmy?style=social)](https:\u002F\u002Fgithub.com\u002FMichael-A-Kuykendall\u002Fshimmy\u002Fstargazers)\r\n\r\n  [![💝 Sponsor this project](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F💝_Sponsor_this_project-ea4aaa?style=for-the-badge&logo=github&logoColor=white)](https:\u002F\u002Fgithub.com\u002Fsponsors\u002FMichael-A-Kuykendall)\r\n\u003C\u002Fdiv>\r\n\r\n**Shimmy will be free forever.** No asterisks. No \"free for now.\" No pivot to paid.\r\n\r\n### 💝 Support Shimmy's Growth\r\n\r\n🚀 **If Shimmy helps you, consider [sponsoring](https:\u002F\u002Fgithub.com\u002Fsponsors\u002FMichael-A-Kuykendall) — 100% of support goes to keeping it free forever.**\r\n\r\n- **$5\u002Fmonth**: Coffee tier ☕ - Eternal gratitude + sponsor badge\r\n- **$25\u002Fmonth**: Bug prioritizer 🐛 - Priority support + name in [SPONSORS.md](SPONSORS.md)\r\n- **$100\u002Fmonth**: Corporate backer 🏢 - Logo placement + monthly office hours\r\n- **$500\u002Fmonth**: Infrastructure partner 🚀 - Direct support + roadmap input\r\n\r\n[**🎯 Become a Sponsor**](https:\u002F\u002Fgithub.com\u002Fsponsors\u002FMichael-A-Kuykendall) | See our amazing [sponsors](SPONSORS.md) 🙏\r\n\r\n---\r\n\r\n## Drop-in OpenAI API Replacement for Local LLMs\r\n\r\nShimmy is a **single-binary** that provides **100% OpenAI-compatible endpoints** for GGUF models. 
Point your existing AI tools to Shimmy and they just work — locally, privately, and free.\r\n\r\n**🎉 NEW in v1.9.0**: One download, all GPU backends included! No compilation, no backend confusion - just download and run.\r\n\r\n## Developer Tools\r\n\r\nWhether you're forking Shimmy or integrating it as a service, we provide complete documentation and integration templates.\r\n\r\n### Try it in 30 seconds\r\n\r\n```bash\r\n# 1) Download pre-built binary (includes all GPU backends)\r\n# Windows:\r\ncurl -L https:\u002F\u002Fgithub.com\u002FMichael-A-Kuykendall\u002Fshimmy\u002Freleases\u002Flatest\u002Fdownload\u002Fshimmy-windows-x86_64.exe -o shimmy.exe\r\n.\u002Fshimmy.exe serve &\r\n\r\n# Linux:\r\ncurl -L https:\u002F\u002Fgithub.com\u002FMichael-A-Kuykendall\u002Fshimmy\u002Freleases\u002Flatest\u002Fdownload\u002Fshimmy-linux-x86_64 -o shimmy && chmod +x shimmy\r\n.\u002Fshimmy serve &\r\n\r\n# macOS (Apple Silicon):\r\ncurl -L https:\u002F\u002Fgithub.com\u002FMichael-A-Kuykendall\u002Fshimmy\u002Freleases\u002Flatest\u002Fdownload\u002Fshimmy-macos-arm64 -o shimmy && chmod +x shimmy\r\n.\u002Fshimmy serve &\r\n\r\n# 2) See models and pick one\r\n.\u002Fshimmy list\r\n\r\n# 3) Smoke test the OpenAI API\r\ncurl -s http:\u002F\u002F127.0.0.1:11435\u002Fv1\u002Fchat\u002Fcompletions \\\r\n  -H 'Content-Type: application\u002Fjson' \\\r\n  -d '{\r\n        \"model\":\"REPLACE_WITH_MODEL_FROM_list\",\r\n        \"messages\":[{\"role\":\"user\",\"content\":\"Say hi in 5 words.\"}],\r\n        \"max_tokens\":32\r\n      }' | jq -r '.choices[0].message.content'\r\n```\r\n\r\n## 🚀 Compatible with OpenAI SDKs and Tools\r\n\r\n**No code changes needed** - just change the API endpoint:\r\n\r\n- **Any OpenAI client**: Python, Node.js, curl, etc.\r\n- **Development applications**: Compatible with standard SDKs\r\n- **VSCode Extensions**: Point to `http:\u002F\u002Flocalhost:11435`\r\n- **Cursor Editor**: Built-in OpenAI compatibility\r\n- **Continue.dev**: Drop-in model 
provider\r\n\r\n### Use with OpenAI SDKs\r\n\r\n- Node.js (openai v4)\r\n\r\n```ts\r\nimport OpenAI from \"openai\";\r\n\r\nconst openai = new OpenAI({\r\n  baseURL: \"http:\u002F\u002F127.0.0.1:11435\u002Fv1\",\r\n  apiKey: \"sk-local\", \u002F\u002F placeholder, Shimmy ignores it\r\n});\r\n\r\nconst resp = await openai.chat.completions.create({\r\n  model: \"REPLACE_WITH_MODEL\",\r\n  messages: [{ role: \"user\", content: \"Say hi in 5 words.\" }],\r\n  max_tokens: 32,\r\n});\r\n\r\nconsole.log(resp.choices[0].message?.content);\r\n```\r\n\r\n- Python (openai>=1.0.0)\r\n\r\n```python\r\nfrom openai import OpenAI\r\n\r\nclient = OpenAI(base_url=\"http:\u002F\u002F127.0.0.1:11435\u002Fv1\", api_key=\"sk-local\")\r\n\r\nresp = client.chat.completions.create(\r\n    model=\"REPLACE_WITH_MODEL\",\r\n    messages=[{\"role\": \"user\", \"content\": \"Say hi in 5 words.\"}],\r\n    max_tokens=32,\r\n)\r\n\r\nprint(resp.choices[0].message.content)\r\n```\r\n\r\n## ⚡ Zero Configuration Required\r\n\r\n- **Automatically finds models** from Hugging Face cache, Ollama, local dirs\r\n- **Auto-allocates ports** to avoid conflicts\r\n- **Auto-detects LoRA adapters** for specialized models\r\n- **Just works** - no config files, no setup wizards\r\n\r\n## 🧠 Advanced MOE (Mixture of Experts) Support\r\n\r\n**Run 70B+ models on consumer hardware** with intelligent CPU\u002FGPU hybrid processing:\r\n\r\n- **🔄 CPU MOE Offloading**: Automatically distribute model layers across CPU and GPU\r\n- **🧮 Intelligent Layer Placement**: Optimizes which layers run where for maximum performance\r\n- **💾 Memory Efficiency**: Fit larger models in limited VRAM by using system RAM strategically\r\n- **⚡ Hybrid Acceleration**: Get GPU speed where it matters most, CPU reliability everywhere else\r\n- **🎛️ Configurable**: `--cpu-moe` and `--n-cpu-moe` flags for fine control\r\n\r\n```bash\r\n# Enable MOE CPU offloading during installation\r\ncargo install shimmy --features moe\r\n\r\n# Run with MOE 
hybrid processing\r\nshimmy serve --cpu-moe --n-cpu-moe 8\r\n\r\n# Automatically balances: GPU layers (fast) + CPU layers (memory-efficient)\r\n```\r\n\r\n**Perfect for**: Large models (70B+), limited VRAM systems, cost-effective inference\r\n\r\n## 🎯 Perfect for Local Development\r\n\r\n- **Privacy**: Your code never leaves your machine\r\n- **Cost**: No API keys, no per-token billing\r\n- **Speed**: Local inference, sub-second responses\r\n- **Reliability**: No rate limits, no downtime\r\n\r\n## Quick Start (30 seconds)\r\n\r\n### Installation\r\n\r\n**✨ v1.9.0 NEW**: Download pre-built binaries with ALL GPU backends included!\r\n\r\n#### **📥 Pre-Built Binaries (Recommended - Zero Dependencies)**\r\n\r\nPick your platform and download - no compilation needed:\r\n\r\n```bash\r\n# Windows x64 (includes CUDA + Vulkan + OpenCL)\r\ncurl -L https:\u002F\u002Fgithub.com\u002FMichael-A-Kuykendall\u002Fshimmy\u002Freleases\u002Flatest\u002Fdownload\u002Fshimmy-windows-x86_64.exe -o shimmy.exe\r\n\r\n# Linux x86_64 (includes CUDA + Vulkan + OpenCL)\r\ncurl -L https:\u002F\u002Fgithub.com\u002FMichael-A-Kuykendall\u002Fshimmy\u002Freleases\u002Flatest\u002Fdownload\u002Fshimmy-linux-x86_64 -o shimmy && chmod +x shimmy\r\n\r\n# macOS ARM64 (includes MLX for Apple Silicon)\r\ncurl -L https:\u002F\u002Fgithub.com\u002FMichael-A-Kuykendall\u002Fshimmy\u002Freleases\u002Flatest\u002Fdownload\u002Fshimmy-macos-arm64 -o shimmy && chmod +x shimmy\r\n\r\n# macOS Intel (CPU-only)\r\ncurl -L https:\u002F\u002Fgithub.com\u002FMichael-A-Kuykendall\u002Fshimmy\u002Freleases\u002Flatest\u002Fdownload\u002Fshimmy-macos-intel -o shimmy && chmod +x shimmy\r\n\r\n# Linux ARM64 (CPU-only)\r\ncurl -L https:\u002F\u002Fgithub.com\u002FMichael-A-Kuykendall\u002Fshimmy\u002Freleases\u002Flatest\u002Fdownload\u002Fshimmy-linux-aarch64 -o shimmy && chmod +x shimmy\r\n```\r\n\r\n**That's it!** Your GPU will be detected automatically at runtime.\r\n\r\n#### **🛠️ Build from Source 
(Advanced)**\r\n\r\nWant to customize or contribute?\r\n\r\n```bash\r\n# Basic installation (CPU only)\r\ncargo install shimmy --features huggingface\r\n\r\n# Kitchen Sink builds (what pre-built binaries use):\r\n# Windows\u002FLinux x64:\r\ncargo install shimmy --features huggingface,llama,llama-cuda,llama-vulkan,llama-opencl,vision\r\n\r\n# macOS ARM64:\r\ncargo install shimmy --features huggingface,llama,mlx,vision\r\n\r\n# CPU-only (any platform):\r\ncargo install shimmy --features huggingface,llama,vision\r\n```\r\n\r\n> **⚠️ Build Notes**:\r\n> - **Windows**: Install [LLVM](https:\u002F\u002Freleases.llvm.org\u002Fdownload.html) first for libclang.dll\r\n> - **Recommended**: Use pre-built binaries to avoid dependency issues\r\n> - **Advanced users only**: Building from source requires C++ compiler + CUDA\u002FVulkan SDKs\r\n\r\n### GPU Acceleration\r\n\r\n**✨ NEW in v1.9.0**: One binary per platform with automatic GPU detection!\r\n\r\n> **⚠️ IMPORTANT - Vision Feature Performance**:  \r\n> CPU-based vision inference (MiniCPM-V) is **5-10x slower** than GPU acceleration.  \r\n> **CPU**: 15-45 seconds per image | **GPU (CUDA\u002FVulkan)**: 2-8 seconds per image  \r\n> **For production vision workloads, GPU acceleration is strongly recommended.**\r\n\r\n#### **📥 Download Pre-Built Binaries (Recommended)**\r\n\r\nNo compilation needed! 
Each binary includes ALL GPU backends for your platform:\r\n\r\n| Platform | Download | GPU Support | Auto-Detects |\r\n|----------|----------|-------------|--------------|\r\n| **Windows x64** | [shimmy-windows-x86_64.exe](https:\u002F\u002Fgithub.com\u002FMichael-A-Kuykendall\u002Fshimmy\u002Freleases\u002Flatest\u002Fdownload\u002Fshimmy-windows-x86_64.exe) | CUDA + Vulkan + OpenCL | ✅ |\r\n| **Linux x86_64** | [shimmy-linux-x86_64](https:\u002F\u002Fgithub.com\u002FMichael-A-Kuykendall\u002Fshimmy\u002Freleases\u002Flatest\u002Fdownload\u002Fshimmy-linux-x86_64) | CUDA + Vulkan + OpenCL | ✅ |\r\n| **macOS ARM64** | [shimmy-macos-arm64](https:\u002F\u002Fgithub.com\u002FMichael-A-Kuykendall\u002Fshimmy\u002Freleases\u002Flatest\u002Fdownload\u002Fshimmy-macos-arm64) | MLX (Apple Silicon) | ✅ |\r\n| **macOS Intel** | [shimmy-macos-intel](https:\u002F\u002Fgithub.com\u002FMichael-A-Kuykendall\u002Fshimmy\u002Freleases\u002Flatest\u002Fdownload\u002Fshimmy-macos-intel) | CPU only | N\u002FA |\r\n| **Linux ARM64** | [shimmy-linux-aarch64](https:\u002F\u002Fgithub.com\u002FMichael-A-Kuykendall\u002Fshimmy\u002Freleases\u002Flatest\u002Fdownload\u002Fshimmy-linux-aarch64) | CPU only | N\u002FA |\r\n\r\n**How it works**: Download one file, run it. 
Shimmy automatically detects and uses your GPU!\r\n\r\n```bash\r\n# Windows example\r\ncurl -L https:\u002F\u002Fgithub.com\u002FMichael-A-Kuykendall\u002Fshimmy\u002Freleases\u002Flatest\u002Fdownload\u002Fshimmy-windows-x86_64.exe -o shimmy.exe\r\n.\u002Fshimmy.exe serve --gpu-backend auto  # Auto-detects CUDA\u002FVulkan\u002FOpenCL\r\n\r\n# Linux example  \r\ncurl -L https:\u002F\u002Fgithub.com\u002FMichael-A-Kuykendall\u002Fshimmy\u002Freleases\u002Flatest\u002Fdownload\u002Fshimmy-linux-x86_64 -o shimmy\r\nchmod +x shimmy\r\n.\u002Fshimmy serve --gpu-backend auto  # Auto-detects CUDA\u002FVulkan\u002FOpenCL\r\n\r\n# macOS ARM64 example\r\ncurl -L https:\u002F\u002Fgithub.com\u002FMichael-A-Kuykendall\u002Fshimmy\u002Freleases\u002Flatest\u002Fdownload\u002Fshimmy-macos-arm64 -o shimmy\r\nchmod +x shimmy  \r\n.\u002Fshimmy serve  # Auto-detects MLX on Apple Silicon\r\n```\r\n\r\n#### **🎯 GPU Auto-Detection**\r\n\r\nShimmy uses intelligent GPU detection with this priority order:\r\n\r\n1. **CUDA** (NVIDIA GPUs via nvidia-smi)\r\n2. **Vulkan** (Cross-platform GPUs via vulkaninfo)\r\n3. **OpenCL** (AMD\u002FIntel GPUs via clinfo)\r\n4. **MLX** (Apple Silicon via system detection)\r\n5. **CPU** (Fallback if no GPU detected)\r\n\r\n**No manual configuration needed!** Just run with `--gpu-backend auto` (default).\r\n\r\n#### **🔧 Manual Backend Override**\r\n\r\nWant to force a specific backend? 
Use the `--gpu-backend` flag:\r\n\r\n```bash\r\n# Auto-detect (default - recommended)\r\nshimmy serve --gpu-backend auto\r\n\r\n# Force CPU (for testing or compatibility)\r\nshimmy serve --gpu-backend cpu\r\n\r\n# Force CUDA (NVIDIA GPUs only)\r\nshimmy serve --gpu-backend cuda\r\n\r\n# Force Vulkan (AMD\u002FIntel\u002FCross-platform)\r\nshimmy serve --gpu-backend vulkan\r\n\r\n# Force OpenCL (AMD\u002FIntel alternative)\r\nshimmy serve --gpu-backend opencl\r\n```\r\n\r\n**🛡️ Error Handling & Robustness**: If you force an unavailable backend (e.g., `--gpu-backend cuda` on an AMD GPU), Shimmy will:\r\n1. ✅ Display a clear error message explaining the issue\r\n2. ✅ Automatically fall back to the next available backend in priority order\r\n3. ✅ Log which backend was actually used (check with `--verbose`)\r\n4. ✅ Continue serving requests (graceful degradation, no crashes)\r\n5. ✅ Support environment variable override: `SHIMMY_GPU_BACKEND=cuda`\r\n\r\n**Common scenarios**:\r\n- `--gpu-backend cuda` on non-NVIDIA → Falls back to Vulkan or OpenCL\r\n- `--gpu-backend vulkan` without drivers → Falls back to OpenCL or CPU\r\n- `--gpu-backend invalid` → Clear error + fallback to auto-detection\r\n- No GPU detected → Runs on CPU with performance warning\r\n\r\n**Environment Variable**: Set `SHIMMY_GPU_BACKEND=cuda` to override the default without CLI flags.\r\n\r\n#### **🔍 Check GPU Support**\r\n```bash\r\n# Show detected GPU backends\r\nshimmy gpu-info\r\n\r\n# Check which backend is being used\r\nshimmy serve --gpu-backend auto --verbose\r\n```\r\n\r\n#### **⚡ Binary Sizes**\r\n\r\n- **GPU-enabled binaries** (Windows\u002FLinux x64, macOS ARM64): ~40-50MB\r\n- **CPU-only binaries** (macOS Intel, Linux ARM64): ~20-30MB\r\n\r\nTrade-off: Slightly larger binaries for zero compilation and automatic GPU detection.\r\n\r\n#### **🛠️ Build from Source (Advanced)**\r\n\r\nWant to customize or contribute? 
Build from source:\r\n- Multiple backends can be compiled in, best one selected automatically\r\n- Use `--gpu-backend \u003Cbackend>` to force specific backend\r\n\r\n### Get Models\r\n\r\nShimmy auto-discovers models from:\r\n- **Hugging Face cache**: `~\u002F.cache\u002Fhuggingface\u002Fhub\u002F`\r\n- **Ollama models**: `~\u002F.ollama\u002Fmodels\u002F`\r\n- **Local directory**: `.\u002Fmodels\u002F`\r\n- **Environment**: `SHIMMY_BASE_GGUF=path\u002Fto\u002Fmodel.gguf`\r\n\r\n```bash\r\n# Download models that work out of the box\r\nhuggingface-cli download microsoft\u002FPhi-3-mini-4k-instruct-gguf --local-dir .\u002Fmodels\u002F\r\nhuggingface-cli download bartowski\u002FLlama-3.2-1B-Instruct-GGUF --local-dir .\u002Fmodels\u002F\r\n```\r\n\r\n### Start Server\r\n\r\n```bash\r\n# Auto-allocates port to avoid conflicts\r\nshimmy serve\r\n\r\n# Or use manual port\r\nshimmy serve --bind 127.0.0.1:11435\r\n```\r\n\r\nPoint your development tools to the displayed port — VSCode Copilot, Cursor, Continue.dev all work instantly.\r\n\r\n## 📦 Download & Install\r\n\r\n### Package Managers\r\n- **Rust**: [`cargo install shimmy --features moe`](https:\u002F\u002Fcrates.io\u002Fcrates\u002Fshimmy) *(recommended)*\r\n- **Rust (basic)**: [`cargo install shimmy`](https:\u002F\u002Fcrates.io\u002Fcrates\u002Fshimmy)\r\n- **VS Code**: [Shimmy Extension](https:\u002F\u002Fmarketplace.visualstudio.com\u002Fitems?itemName=targetedwebresults.shimmy-vscode)\r\n- **Windows MSVC**: Uses `shimmy-llama-cpp-2` packages for better compatibility\r\n- **npm**: `npm install -g shimmy-js` *(planned)*\r\n- **Python**: `pip install shimmy` *(planned)*\r\n\r\n### Direct Downloads\r\n- **GitHub Releases**: [Latest binaries](https:\u002F\u002Fgithub.com\u002FMichael-A-Kuykendall\u002Fshimmy\u002Freleases\u002Flatest)\r\n- **Docker**: `docker pull shimmy\u002Fshimmy:latest` *(coming soon)*\r\n\r\n### 🍎 macOS Support\r\n\r\n**Full compatibility confirmed!** Shimmy works flawlessly on macOS with Metal 
GPU acceleration.\r\n\r\n```bash\r\n# Install dependencies\r\nbrew install cmake rust\r\n\r\n# Install shimmy\r\ncargo install shimmy\r\n```\r\n\r\n**✅ Verified working:**\r\n- Intel and Apple Silicon Macs\r\n- Metal GPU acceleration (automatic)\r\n- MLX native acceleration for Apple Silicon\r\n- Xcode 17+ compatibility\r\n- All LoRA adapter features\r\n\r\n## Integration Examples\r\n\r\n### VSCode Copilot\r\n```json\r\n{\r\n  \"github.copilot.advanced\": {\r\n    \"serverUrl\": \"http:\u002F\u002Flocalhost:11435\"\r\n  }\r\n}\r\n```\r\n\r\n### Continue.dev\r\n```json\r\n{\r\n  \"models\": [{\r\n    \"title\": \"Local Shimmy\",\r\n    \"provider\": \"openai\",\r\n    \"model\": \"your-model-name\",\r\n    \"apiBase\": \"http:\u002F\u002Flocalhost:11435\u002Fv1\"\r\n  }]\r\n}\r\n```\r\n\r\n### Cursor IDE\r\nWorks out of the box - just point it to `http:\u002F\u002Flocalhost:11435\u002Fv1`\r\n\r\n## Why Shimmy Will Always Be Free\r\n\r\nI built Shimmy to retain privacy-first control over my AI development and keep things local and lean.\r\n\r\n**This is my commitment**: Shimmy stays MIT licensed, forever. If you want to support development, [sponsor it](https:\u002F\u002Fgithub.com\u002Fsponsors\u002FMichael-A-Kuykendall). If you don't, just build something cool with it.\r\n\r\n> 💡 **Shimmy saves you time and money. 
If it's useful, consider [sponsoring for $5\u002Fmonth](https:\u002F\u002Fgithub.com\u002Fsponsors\u002FMichael-A-Kuykendall) — less than your Netflix subscription, infinitely more useful for developers.**\r\n\r\n## API Reference\r\n\r\n### Endpoints\r\n- `GET \u002Fhealth` - Health check\r\n- `POST \u002Fv1\u002Fchat\u002Fcompletions` - OpenAI-compatible chat\r\n- `GET \u002Fv1\u002Fmodels` - List available models\r\n- `POST \u002Fapi\u002Fgenerate` - Shimmy native API\r\n- `GET \u002Fws\u002Fgenerate` - WebSocket streaming\r\n\r\n### CLI Commands\r\n```bash\r\nshimmy serve                    # Start server (auto port allocation)\r\nshimmy serve --bind 127.0.0.1:8080  # Manual port binding\r\nshimmy serve --cpu-moe --n-cpu-moe 8  # Enable MOE CPU offloading\r\nshimmy list                     # Show available models (LLM-filtered)\r\nshimmy discover                 # Refresh model discovery\r\nshimmy generate --name X --prompt \"Hi\"  # Test generation\r\nshimmy probe model-name         # Verify model loads\r\nshimmy gpu-info                 # Show GPU backend status\r\n```\r\n\r\n## Technical Architecture\r\n\r\n- **Rust + Tokio**: Memory-safe, async performance\r\n- **llama.cpp backend**: Industry-standard GGUF inference\r\n- **OpenAI API compatibility**: Drop-in replacement\r\n- **Dynamic port management**: Zero conflicts, auto-allocation\r\n- **Zero-config auto-discovery**: Just works™\r\n\r\n### 🚀 Advanced Features\r\n\r\n- **🧠 MOE CPU Offloading**: Hybrid GPU\u002FCPU processing for large models (70B+)\r\n- **🎯 Smart Model Filtering**: Automatically excludes non-language models (Stable Diffusion, Whisper, CLIP)\r\n- **🛡️ 6-Gate Release Validation**: Constitutional quality limits ensure reliability\r\n- **⚡ Smart Model Preloading**: Background loading with usage tracking for instant model switching\r\n- **💾 Response Caching**: LRU + TTL cache delivering 20-40% performance gains on repeat queries\r\n- **🚀 Integration Templates**: One-command deployment for 
Docker, Kubernetes, Railway, Fly.io, FastAPI, Express\r\n- **🔄 Request Routing**: Multi-instance support with health checking and load balancing\r\n- **📊 Advanced Observability**: Real-time metrics with self-optimization and Prometheus integration\r\n- **🔗 RustChain Integration**: Universal workflow transpilation with workflow orchestration\r\n\r\n## Community & Support\r\n\r\n- **🐛 Bug Reports**: [GitHub Issues](https:\u002F\u002Fgithub.com\u002FMichael-A-Kuykendall\u002Fshimmy\u002Fissues)\r\n- **💬 Discussions**: [GitHub Discussions](https:\u002F\u002Fgithub.com\u002FMichael-A-Kuykendall\u002Fshimmy\u002Fdiscussions)\r\n- **📖 Documentation**: [docs\u002F](docs\u002F) • [Engineering Methodology](docs\u002FMETHODOLOGY.md) • [OpenAI Compatibility Matrix](docs\u002FOPENAI_COMPAT.md) • [Benchmarks (Reproducible)](docs\u002FBENCHMARKS.md)\r\n- **💝 Sponsorship**: [GitHub Sponsors](https:\u002F\u002Fgithub.com\u002Fsponsors\u002FMichael-A-Kuykendall)\r\n\r\n### Star History\r\n\r\n[![Star History Chart](https:\u002F\u002Fapi.star-history.com\u002Fsvg?repos=Michael-A-Kuykendall\u002Fshimmy&type=Timeline)](https:\u002F\u002Fwww.star-history.com\u002F#Michael-A-Kuykendall\u002Fshimmy&Timeline)\r\n\r\n### 🚀 Momentum Snapshot\r\n\r\n📦 **Sub-5MB single binary** (142x smaller than Ollama)\r\n🌟 **![GitHub stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FMichael-A-Kuykendall\u002Fshimmy?style=flat&color=yellow) stars and climbing fast**\r\n⏱ **\u003C1s startup**\r\n🦀 **100% Rust, no Python**\r\n\r\n### 📰 As Featured On\r\n\r\n🔥 [**Hacker News**](https:\u002F\u002Fnews.ycombinator.com\u002Fitem?id=45130322) • [**Front Page Again**](https:\u002F\u002Fnews.ycombinator.com\u002Fitem?id=45199898) • [**IPE Newsletter**](https:\u002F\u002Fipenewsletter.substack.com\u002Fp\u002Fthe-strange-new-side-hustles-of-openai)\r\n\r\n**Companies**: Need invoicing? 
Email [michaelallenkuykendall@gmail.com](mailto:michaelallenkuykendall@gmail.com)\r\n\r\n## ⚡ Performance Comparison\r\n\r\n| Tool | Binary Size | Startup Time | Memory Usage | OpenAI API |\r\n|------|-------------|--------------|--------------|------------|\r\n| **Shimmy** | **4.8MB** | **\u003C100ms** | **50MB** | **100%** |\r\n| Ollama | 680MB | 5-10s | 200MB+ | Partial |\r\n| llama.cpp | 89MB | 1-2s | 100MB | Via llama-server |\r\n\r\n## Quality & Reliability\r\n\r\nShimmy maintains high code quality through comprehensive testing:\r\n\r\n- **Comprehensive test suite** with property-based testing\r\n- **Automated CI\u002FCD pipeline** with quality gates\r\n- **Runtime invariant checking** for critical operations\r\n- **Cross-platform compatibility testing**\r\n\r\n### Development Testing\r\n\r\nRun the complete test suite:\r\n\r\n```bash\r\n# Using cargo aliases\r\ncargo test-quick           # Quick development tests\r\n\r\n# Using Makefile  \r\nmake test                  # Full test suite\r\nmake test-quick            # Quick development tests\r\n```\r\n\r\nSee our [testing approach](docs\u002Fppt-invariant-testing.md) for technical details.\r\n\r\n---\r\n\r\n## License & Philosophy\r\n\r\nMIT License - forever and always.\r\n\r\n**Philosophy**: Infrastructure should be invisible. Shimmy is infrastructure.\r\n\r\n**Testing Philosophy**: Reliability through comprehensive validation and property-based testing.\r\n\r\n---\r\n\r\n**Forever maintainer**: Michael A. Kuykendall\r\n**Promise**: This will never become a paid product\r\n**Mission**: Making local model inference simple and reliable\r\n","Shimmy is a lightweight local inference server that is fully compatible with the OpenAI API and supports GGUF models. Its core features include one-click deployment, hot model switching, automatic discovery, and single-binary operation with no additional dependencies. It is written in Rust, which ensures high performance and safety. It is especially suited to scenarios that need to quickly set up and run machine learning models locally, such as individual developers, small teams, or enterprises with strict data privacy requirements. In addition, Shimmy promises to remain free forever, giving users a stable and reliable local AI service solution.",2,"2026-06-11 03:05:00","top_language"]