[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-84098":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":14,"subscribersCount":14,"size":14,"stars1d":15,"stars7d":16,"stars30d":16,"stars90d":14,"forks30d":14,"starsTrendScore":17,"compositeScore":18,"rankGlobal":9,"rankLanguage":9,"license":9,"archived":19,"fork":19,"defaultBranch":20,"hasWiki":21,"hasPages":19,"topics":22,"createdAt":9,"pushedAt":9,"updatedAt":23,"readmeContent":24,"aiSummary":9,"trendingCount":14,"starSnapshotCount":14,"syncStatus":12,"lastSyncTime":25,"discoverSource":26},84098,"LocalAIHotSwap","aa2448208027-code\u002FLocalAIHotSwap","aa2448208027-code","Local llama.cpp model hot-swap controller for preserving chat context with low VRAM overhead",null,"Python",80,2,1,0,9,19,37,67.83,false,"main",true,[],"2026-06-12 04:01:42","# HotModelReplacement\n\nHotModelReplacement is a small control plane for `llama.cpp` deployments. It keeps\nconversation state and preset prompts outside `llama-server`, then switches the\nactive local GGUF model through the `llama-server` router.\n\nThe default switch policy is `zero_overlap`: run one `llama-server` router,\nunload the current model, optionally wait for GPU memory to settle, then load\nthe target model. That keeps switch-time VRAM peaks low at the cost of reloading\nweights and replaying the saved prompt\u002Fmessages into the new model.\n\n## What this project guarantees\n\n- Preset prompt and session messages are preserved across model switches.\n- Only one model is loaded in the managed `llama-server` router during a default\n  switch.\n- Active generations are drained before a switch unloads the current model.\n- Long sessions can be bounded by message count and prompt character budget.\n- Clients can keep using an OpenAI-compatible `\u002Fv1\u002Fchat\u002Fcompletions` endpoint.\n- Streaming chat completions are forwarded as server-sent events when clients set\n  `stream = true`.\n- Switches are serialized with a lock so chat requests cannot race the process\n  lifecycle.\n\n## What it does not claim\n\n- Cross-model KV cache reuse is not treated as a valid optimization. KV entries\n  depend on model weights, tokenizer behavior, attention layout, and runtime\n  state.\n- Switching models still has weight load latency and prompt prefill latency.\n- The proxy can minimize orchestration overhead, but it cannot remove the\n  compute cost of evaluating a long preserved context on the target model.\n\n## Quick start\n\n1. Build or install a recent `llama.cpp` and make sure `llama-server` is on\n   `PATH`. Router mode must support `\u002Fmodels\u002Fload` and `\u002Fmodels\u002Funload`.\n2. Copy the example config and point the model paths at local GGUF files:\n\n   ```powershell\n   Copy-Item configs\\models.example.toml configs\\models.toml\n   ```\n\n3. Start the proxy:\n\n   ```powershell\n   hotmodel serve --config configs\\models.toml\n   ```\n\n4. Switch models:\n\n   ```powershell\n   hotmodel switch qwen3-small --config configs\\models.toml\n   ```\n\n5. Send chat requests to the proxy:\n\n   ```powershell\n   curl http:\u002F\u002F127.0.0.1:18080\u002Fv1\u002Fchat\u002Fcompletions `\n     -H \"Content-Type: application\u002Fjson\" `\n     -d \"{\\\"model\\\":\\\"active\\\",\\\"messages\\\":[{\\\"role\\\":\\\"user\\\",\\\"content\\\":\\\"hello\\\"}]}\"\n   ```\n\n## Configuration\n\nSee [configs\u002Fmodels.example.toml](configs\u002Fmodels.example.toml). The TOML file\ndefines the proxy and router settings. The INI file defines the model presets\nconsumed by `llama-server` router mode.\n\nSmall Qwen-family GGUF models are a practical starting point because they keep\nload latency and VRAM pressure low during iteration. Use `models_max = 1` for\nthe lowest peak VRAM. Raising it reduces switch latency by allowing multiple\nloaded models, with a direct VRAM cost.\n\nThe most important performance knobs are:\n\n- `router.models_max = 1`: prevents two model weights from being resident during\n  a switch.\n- `router.parallel = 1`: minimizes KV cache allocation.\n- `router.ctx_size`: caps unified KV cache size and max prompt length.\n- `router.cache_type_k` \u002F `router.cache_type_v`: `q8_0` cuts KV memory compared\n  with `f16`, with quality and speed tradeoffs to validate on your workload.\n- `session.max_session_messages` and `session.max_prompt_chars`: bound replayed\n  context so model switches do not turn into long prefill stalls.\n- `session.max_prompt_tokens`: asks the active `llama-server` to apply the\n  model chat template and tokenize the prompt, then drops older history until the\n  token budget fits. In `auto` mode the proxy starts with a fast estimate and\n  calls tokenizer endpoints only when the estimate indicates trimming is needed.\n- `server.switch_drain_timeout_seconds`: waits for active generations to finish\n  before unloading the current model.\n- `\u002Fadmin\u002Fswitch` returns GPU memory snapshots for `before_unload`,\n  `after_unload`, `after_settle`, and `after_load` when `nvidia-smi` is\n  available.\n\n## Architecture\n\nSee [docs\u002Farchitecture.md](docs\u002Farchitecture.md) for the lifecycle model,\nlatency tradeoffs, and operational notes.\n\nSee [docs\u002Fperformance.md](docs\u002Fperformance.md) for the current performance\nreview, compatibility notes, and future branch plan.\n","2026-06-11 04:12:18","CREATED_QUERY"]