[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-73229":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":25,"hasPages":23,"topics":26,"createdAt":10,"pushedAt":10,"updatedAt":35,"readmeContent":36,"aiSummary":37,"trendingCount":16,"starSnapshotCount":16,"syncStatus":38,"lastSyncTime":39,"discoverSource":40},73229,"llama-swap","mostlygeek\u002Fllama-swap","mostlygeek","Reliable model swapping for any local OpenAI\u002FAnthropic compatible server - llama.cpp, vllm, etc","",null,"Go",4470,342,23,49,0,56,124,514,168,29.61,"MIT License",false,"main",true,[27,28,29,30,31,32,33,34],"golang","llama","llamacpp","localllama","localllm","openai","openai-api","vllm","2026-06-12 02:03:10","![llama-swap header image](docs\u002Fassets\u002Fhero3.webp)\n![GitHub Downloads (all assets, all releases)](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fdownloads\u002Fmostlygeek\u002Fllama-swap\u002Ftotal)\n![GitHub Actions Workflow Status](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Factions\u002Fworkflow\u002Fstatus\u002Fmostlygeek\u002Fllama-swap\u002Fgo-ci.yml)\n![GitHub Repo stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fmostlygeek\u002Fllama-swap)\n\n# llama-swap\n\nRun multiple generative AI models on your machine and hot-swap between them on demand. llama-swap works with any OpenAI and Anthropic API compatible server and is used by thousands of people to power their local AI workflows.\n\nBuilt in Go for performance and simplicity, llama-swap has zero dependencies and is incredibly easy to set up. 
Get started in minutes - just one binary and one configuration file.\n\n## Features:\n\n- ✅ Easy to deploy and configure: one binary, one configuration file. no external dependencies\n- ✅ On-demand model switching\n- ✅ Use any local OpenAI compatible server (llama.cpp, vllm, tabbyAPI, stable-diffusion.cpp, etc.)\n  - future proof, upgrade your inference servers at any time.\n- ✅ OpenAI API supported endpoints:\n  - `v1\u002Fcompletions`\n  - `v1\u002Fchat\u002Fcompletions`\n  - `v1\u002Fresponses`\n  - `v1\u002Fembeddings`\n  - `v1\u002Fmodels` - list available models\n  - `v1\u002Faudio\u002Fspeech` ([#36](https:\u002F\u002Fgithub.com\u002Fmostlygeek\u002Fllama-swap\u002Fissues\u002F36))\n  - `v1\u002Faudio\u002Ftranscriptions` ([docs](https:\u002F\u002Fgithub.com\u002Fmostlygeek\u002Fllama-swap\u002Fissues\u002F41#issuecomment-2722637867))\n  - `v1\u002Faudio\u002Fvoices`\n  - `v1\u002Fimages\u002Fgenerations`\n  - `v1\u002Fimages\u002Fedits`\n- ✅ Anthropic API supported endpoints:\n  - `v1\u002Fmessages`\n  - `v1\u002Fmessages\u002Fcount_tokens`\n- ✅ llama-server (llama.cpp) supported endpoints\n  - `v1\u002Frerank`, `v1\u002Freranking`, `\u002Frerank`\n  - `\u002Finfill` - for code infilling\n  - `\u002Fcompletion` - for completion endpoint\n- ✅ SDAPI via [stable-diffusion.cpp's server](https:\u002F\u002Fgithub.com\u002Fleejet\u002Fstable-diffusion.cpp\u002Ftree\u002Fmaster\u002Fexamples\u002Fserver)\n  - `\u002Fsdapi\u002Fv1\u002Ftxt2img`\n  - `\u002Fsdapi\u002Fv1\u002Fimg2img`\n  - `\u002Fsdapi\u002Fv1\u002Floras` - requires `model` in request body to fetch the correct loras\n- ✅ llama-swap API\n  - `\u002Fui` - web UI\n  - `\u002Fupstream\u002F:model_id` - direct access to upstream server ([demo](https:\u002F\u002Fgithub.com\u002Fmostlygeek\u002Fllama-swap\u002Fpull\u002F31))\n  - `\u002Frunning` - list currently running models ([#61](https:\u002F\u002Fgithub.com\u002Fmostlygeek\u002Fllama-swap\u002Fissues\u002F61))\n  - `POST 
\u002Fapi\u002Fmodels\u002Funload` - manually unload all running models ([#58](https:\u002F\u002Fgithub.com\u002Fmostlygeek\u002Fllama-swap\u002Fissues\u002F58))\n  - `POST \u002Fapi\u002Fmodels\u002Funload\u002F:model_id` - unload a specific model\n  - `\u002Flogs` - remote log monitoring\n    - `GET \u002Flogs` returns buffered plain text logs.\n      - If `Accept: text\u002Fhtml` is sent, `\u002Flogs` redirects to `\u002Fui\u002F`.\n    - `GET \u002Flogs\u002Fstream` keeps the connection open for live log streaming.\n      - Stream endpoints send buffered history first by default; add `?no-history` to stream only new lines.\n    - `GET \u002Flogs\u002Fstream\u002Fproxy` streams proxy logs only.\n    - `GET \u002Flogs\u002Fstream\u002Fupstream` streams upstream process logs only.\n    - `GET \u002Flogs\u002Fstream\u002F{model_id}` streams logs for one model (including IDs with slashes, like `author\u002Fmodel`).\n  - `\u002Fhealth` - just returns \"OK\"\n  - `\u002Fmetrics` - system and GPU metrics for prometheus\n- ✅ API Key support - define keys to restrict access to API endpoints\n- ✅ Customizable\n  - Run concurrent models with a custom DSL swap matrix ([#643](https:\u002F\u002Fgithub.com\u002Fmostlygeek\u002Fllama-swap\u002Fissues\u002F643))\n  - Automatic unloading of models after timeout by setting a `ttl`\n  - Docker and Podman support using `cmd` and `cmdStop` together\n  - Preload models on startup with `hooks` ([#235](https:\u002F\u002Fgithub.com\u002Fmostlygeek\u002Fllama-swap\u002Fpull\u002F235))\n  - Apply filters to requests to control inference with `stripParams`, `setParams` and `setParamsByID`\n\n### Web UI\n\nllama-swap includes a real time web interface with a playground for testing out all sorts of local models:\n\n\u003Cimg width=\"1125\" height=\"876\" alt=\"image\" src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F8ee41947-97af-463d-b0f0-8e9c478fac07\" \u002F>\n\nView detailed token metrics:\n\n\u003Cimg 
width=\"1111\" height=\"515\" alt=\"image\" src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F64bfb280-d7a3-4126-971a-a128fd40410c\" \u002F>\n\nInspect request and responses:\n\n\u003Cimg width=\"1111\" height=\"720\" alt=\"image\" src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F24fe4aca-1448-4d7c-b9e8-a967589bda6c\" \u002F>\n\nManually load and unload models:\n\n\u003Cimg width=\"1109\" height=\"719\" alt=\"image\" src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F02b1e1f2-abd0-4050-84ae-facd66ff01c4\" \u002F>\n\nReal time log streaming:\n\n\u003Cimg width=\"1107\" height=\"559\" alt=\"image\" src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F39669a10-cff2-409e-836a-5bad8bd0140c\" \u002F>\n\n## Installation\n\nllama-swap can be installed in multiple ways\n\n1. Docker\n2. Homebrew (OSX and Linux)\n3. WinGet\n4. From release binaries\n5. From source\n\n### Docker Install ([download images](https:\u002F\u002Fgithub.com\u002Fmostlygeek\u002Fllama-swap\u002Fpkgs\u002Fcontainer\u002Fllama-swap))\n\nTwo types of container images are built nightly for llama-swap:\n\n1. A unified container with llama-server, ik-llama-server, stable-diffusion.cpp, whisper.cpp and llama-swap built from source. This is only available for cuda and vulkan but has more capabilities. This one is recommended for use.\n2. A legacy image that is based on llama.cpp's images and llama-swap copied into the container. 
Use this one if you prefer to stay close to llama.cpp's container images.\n\n#### Unified container (Recommended)\n\n```shell\n$ docker pull ghcr.io\u002Fmostlygeek\u002Fllama-swap:unified-cuda\n\n# run with a custom configuration and models directory\n$ docker run -it --rm --runtime nvidia -p 9292:8080 \\\n -v \u002Fpath\u002Fto\u002Fmodels:\u002Fmodels \\\n -v \u002Fpath\u002Fto\u002Fcustom\u002Fconfig.yaml:\u002Fetc\u002Fllama-swap\u002Fconfig\u002Fconfig.yaml \\\n ghcr.io\u002Fmostlygeek\u002Fllama-swap:unified-cuda\n```\n\n#### Legacy container\n\n```shell\n$ docker pull ghcr.io\u002Fmostlygeek\u002Fllama-swap:cuda\n\n# run with a custom configuration and models directory\n$ docker run -it --rm --runtime nvidia -p 9292:8080 \\\n -v \u002Fpath\u002Fto\u002Fmodels:\u002Fmodels \\\n -v \u002Fpath\u002Fto\u002Fcustom\u002Fconfig.yaml:\u002Fapp\u002Fconfig.yaml \\\n ghcr.io\u002Fmostlygeek\u002Fllama-swap:cuda\n```\n\n\u003Cdetails>\n\u003Csummary>\nmore examples\n\u003C\u002Fsummary>\n\n```shell\n# pull latest images per platform\ndocker pull ghcr.io\u002Fmostlygeek\u002Fllama-swap:cpu\ndocker pull ghcr.io\u002Fmostlygeek\u002Fllama-swap:cuda\ndocker pull ghcr.io\u002Fmostlygeek\u002Fllama-swap:vulkan\ndocker pull ghcr.io\u002Fmostlygeek\u002Fllama-swap:intel\ndocker pull ghcr.io\u002Fmostlygeek\u002Fllama-swap:musa\n\n# tagged llama-swap, platform and llama-server version images\ndocker pull ghcr.io\u002Fmostlygeek\u002Fllama-swap:v166-cuda-b6795\n\n# non-root cuda\ndocker pull ghcr.io\u002Fmostlygeek\u002Fllama-swap:cuda-non-root\n\n```\n\n\u003C\u002Fdetails>\n\n### Homebrew Install (macOS\u002FLinux)\n\n```shell\nbrew tap mostlygeek\u002Fllama-swap\nbrew install llama-swap\nllama-swap --config path\u002Fto\u002Fconfig.yaml --listen localhost:8080\n```\n\n### WinGet Install (Windows)\n\n> [!NOTE]\n> WinGet is maintained by community contributor [Dvd-Znf](https:\u002F\u002Fgithub.com\u002FDvd-Znf) 
([#327](https:\u002F\u002Fgithub.com\u002Fmostlygeek\u002Fllama-swap\u002Fissues\u002F327)). It is not an official part of llama-swap.\n\n```shell\n# install\nC:\\> winget install llama-swap\n\n# upgrade\nC:\\> winget upgrade llama-swap\n```\n\n### Pre-built Binaries\n\nBinaries are available on the [release](https:\u002F\u002Fgithub.com\u002Fmostlygeek\u002Fllama-swap\u002Freleases) page for Linux, Mac, Windows and FreeBSD.\n\n### Building from source\n\n1. Building requires Go and Node.js (for UI).\n1. `git clone https:\u002F\u002Fgithub.com\u002Fmostlygeek\u002Fllama-swap.git`\n1. `make clean all`\n1. look in the `build\u002F` subdirectory for the llama-swap binary\n\n## Configuration\n\n```yaml\n# minimum viable config.yaml\n\nmodels:\n  model1:\n    cmd: llama-server --port ${PORT} --model \u002Fpath\u002Fto\u002Fmodel.gguf\n```\n\nThat's all you need to get started:\n\n1. `models` - holds all model configurations\n2. `model1` - the ID used in API calls\n3. `cmd` - the command to run to start the server.\n4. 
`${PORT}` - an automatically assigned port number\n\nAlmost all configuration settings are optional and can be added one step at a time:\n\n- Advanced features\n  - `matrix` to run concurrent models with a custom swap logic DSL\n  - `hooks` to run things on startup\n  - `macros` reusable snippets\n- Model customization\n  - `ttl` to automatically unload models\n  - `aliases` to use familiar model names (e.g., \"gpt-4o-mini\")\n  - `env` to pass custom environment variables to inference servers\n  - `cmdStop` gracefully stop Docker\u002FPodman containers\n  - `useModelName` to override model names sent to upstream servers\n  - `${PORT}` automatic port variables for dynamic port assignment\n  - `filters` rewrite parts of requests before sending to the upstream server\n\nSee the [configuration documentation](docs\u002Fconfiguration.md) for all options.\n\n## How does llama-swap work?\n\nWhen a request is made to an OpenAI compatible endpoint, llama-swap will extract the `model` value and load the appropriate server configuration to serve it. If the wrong upstream server is running, it will be replaced with the correct one. This is where the \"swap\" part comes in. The upstream server is automatically swapped to handle the request correctly.\n\nIn the most basic configuration llama-swap handles one model at a time. For more advanced use cases, using a `matrix` allows multiple models to be loaded at the same time. You have complete control over how your system resources are used.\n\n## Reverse Proxy Configuration (nginx)\n\nIf you deploy llama-swap behind nginx, disable response buffering for streaming endpoints. By default, nginx buffers responses which breaks Server‑Sent Events (SSE) and streaming chat completion. 
([#236](https:\u002F\u002Fgithub.com\u002Fmostlygeek\u002Fllama-swap\u002Fissues\u002F236))\n\nRecommended nginx configuration snippets:\n\n```nginx\n# SSE for UI events\u002Flogs\nlocation \u002Fapi\u002Fevents {\n    proxy_pass http:\u002F\u002Fyour-llama-swap-backend;\n    proxy_buffering off;\n    proxy_cache off;\n}\n\n# Streaming chat completions (stream=true)\nlocation \u002Fv1\u002Fchat\u002Fcompletions {\n    proxy_pass http:\u002F\u002Fyour-llama-swap-backend;\n    proxy_buffering off;\n    proxy_cache off;\n}\n```\n\nAs a safeguard, llama-swap also sets `X-Accel-Buffering: no` on SSE responses. However, explicitly disabling `proxy_buffering` at your reverse proxy is still recommended for reliable streaming behavior.\n\n## Monitoring Logs on the CLI\n\n```sh\n# sends up to the last 10KB of logs\n$ curl http:\u002F\u002Fhost\u002Flogs\n\n# streams combined logs\ncurl -Ns http:\u002F\u002Fhost\u002Flogs\u002Fstream\n\n# stream llama-swap's proxy status logs\ncurl -Ns http:\u002F\u002Fhost\u002Flogs\u002Fstream\u002Fproxy\n\n# stream logs from upstream processes that llama-swap loads\ncurl -Ns http:\u002F\u002Fhost\u002Flogs\u002Fstream\u002Fupstream\n\n# stream logs only from a specific model\ncurl -Ns http:\u002F\u002Fhost\u002Flogs\u002Fstream\u002F{model_id}\n\n# stream and filter logs with linux pipes\ncurl -Ns http:\u002F\u002Fhost\u002Flogs\u002Fstream | grep 'eval time'\n\n# appending ?no-history will disable sending buffered history first\ncurl -Ns 'http:\u002F\u002Fhost\u002Flogs\u002Fstream?no-history'\n```\n\n## Do I need to use llama.cpp's server (llama-server)?\n\nAny OpenAI compatible server would work. llama-swap was originally designed for llama-server and it is the best supported.\n\nFor Python based inference servers like vllm or tabbyAPI it is recommended to run them via podman or docker. 
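This provides clean environment isolation as well as responding correctly to `SIGTERM` signals for proper shutdown.\n\n## Star History\n\n> [!NOTE]\n> Thank you to everyone who has given this project a ⭐️!\n\n[![Star History Chart](https:\u002F\u002Fapi.star-history.com\u002Fsvg?repos=mostlygeek\u002Fllama-swap&type=Date)](https:\u002F\u002Fwww.star-history.com\u002F#mostlygeek\u002Fllama-swap&Date)\n","llama-swap is a tool for hot-swapping models on local OpenAI\u002FAnthropic-compatible servers. It supports a variety of locally run generative AI backends, such as llama.cpp and vllm, and lets users switch between models on demand. Written in Go for performance and simplicity, it needs no additional dependencies and can be deployed quickly with just one binary and one configuration file. llama-swap also supports a broad set of API endpoints, including but not limited to text completion, chat completion, and audio transcription, which makes it suitable for scenarios that need flexible management and switching of local AI models, such as testing by individual developers or internal services at small companies.",2,"2026-06-11 03:44:36","high_star"]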