[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-1345":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":15,"stars7d":16,"stars30d":17,"stars90d":15,"forks30d":15,"starsTrendScore":15,"compositeScore":18,"rankGlobal":9,"rankLanguage":9,"license":19,"archived":20,"fork":20,"defaultBranch":21,"hasWiki":22,"hasPages":20,"topics":23,"createdAt":9,"pushedAt":9,"updatedAt":24,"readmeContent":25,"aiSummary":26,"trendingCount":15,"starSnapshotCount":15,"syncStatus":27,"lastSyncTime":28,"discoverSource":29},1345,"kaiwu","val1813\u002Fkaiwu","val1813","本地开源模型部署器，一键部署，支持各类系统，主流模型。",null,"Go",254,17,241,19,0,5,10,47.27,"MIT License",false,"main",true,[],"2026-06-12 04:00:08","\u003Cdiv align=\"center\">\n\n# Kaiwu · 开物\n\n**Auto-tuned local LLM serving: Kaiwu probes your hardware, model, KV cache, and context window so you get the fastest OpenAI-compatible endpoint your machine can actually sustain.**\n\n**自动调优本地大模型：Kaiwu 探测你的硬件、模型、KV cache 和上下文窗口，给你一个机器能稳定跑出的最快 OpenAI 兼容端点。**\n\n[English](#english) · [中文](#中文)\n\n\u003C\u002Fdiv>\n\n---\n\n\u003Ca name=\"english\">\u003C\u002Fa>\n# Kaiwu\n\nLM Studio and Ollama make models run. Kaiwu makes them run *well* — by measuring, not guessing.\n\nIt probes your GPU, reads the model architecture, benchmarks KV cache options, and walks the context window down from the model's native maximum until it finds the largest window your hardware can sustain at a useful speed. That config is cached. 
Second launch takes 2 seconds.\n\n## Proof\n\n### 30B MoE on 8GB GPU — the hard case\n\nModel: Qwen3-30B-A3B Q3_K_XL · RTX 5060 Laptop 8GB · Windows 11\n\n| | LM Studio | Kaiwu |\n|---|---|---|\n| Speed | 3 tok\u002Fs | **8.7 tok\u002Fs** |\n| Context window | 4K (default) | **32K (auto)** |\n| VRAM used | 7,549 MB (93%) | 4,800 MB (59%) |\n| Config required | Manual | **None** |\n\nLM Studio fills VRAM trying to load the full model. Kaiwu detects the MoE architecture, keeps attention layers on GPU, routes 128 expert layers through CPU — usable speed at 32K context on hardware that can't fit the model at all.\n\n### 8B dense — everyday use\n\nModel: Llama 3.1 8B Q5_K_M · RTX 5060 8GB\n\n| | LM Studio | Kaiwu |\n|---|---|---|\n| Speed (8K ctx) | 46.5 tok\u002Fs | **51.7 tok\u002Fs** |\n| Context window | 4–8K (default) | **64K (auto)** |\n\nSame speed, 8× more context. Kaiwu calculates whether f16 KV cache fits in VRAM and uses it when it does — matching LM Studio's speed while running a much larger context window.\n\n### Dual 4090 — high-end\n\nModel: Qwen3.6-35B-A3B · 2× RTX 4090 24GB\n\n- **115 tok\u002Fs** · **256K context** · fully automatic tensor split\n\n## How It Works\n\n```\nkaiwu run Qwen3-30B-A3B\n```\n\nThat's it. Kaiwu:\n\n1. **Probes your hardware** — GPU model, VRAM, memory bandwidth, SM version, CPU cores, RAM\n2. **Reads the model** — architecture, layer count, KV heads, native context limit, MoE structure\n3. **Selects KV cache** — calculates f16 footprint; uses f16 if it fits, q8_0+q4_0 if not, iso3 for tight VRAM\n4. **Runs warmup benchmark** — walks ctx from native max downward, stops where speed ≥ 20 tok\u002Fs\n5. **Tunes parameters** — ubatch size, thread count, mlock — all measured, not guessed\n6. 
**Caches the result** — next launch skips warmup entirely (2s startup)\n\nOn subsequent runs:\n\n```\n✓ Using last config  (64K ctx · 26.2 tok\u002Fs · 3 days ago)\n```\n\n## Installation\n\n**Windows** (PowerShell):\n```powershell\nirm https:\u002F\u002Fraw.githubusercontent.com\u002Fval1813\u002Fkaiwu\u002Fmain\u002Finstall.ps1 | iex\n```\n\n**Linux \u002F macOS**:\n```bash\ncurl -fsSL https:\u002F\u002Fraw.githubusercontent.com\u002Fval1813\u002Fkaiwu\u002Fmain\u002Finstall.sh | sh\n```\n\nOr download manually from [Releases](https:\u002F\u002Fgithub.com\u002Fval1813\u002Fkaiwu\u002Freleases).\n\n## Quick Start\n\n```bash\n# Run a model (auto-downloads if needed)\nkaiwu run Qwen3-30B-A3B\n\n# Run a local GGUF file\nkaiwu run \u002Fpath\u002Fto\u002Fmodel.gguf\n\n# Connect your IDE (Continue, Cursor, Claude Code)\n# Point it to: http:\u002F\u002Flocalhost:11435\u002Fv1\n\n# Check what's running\nkaiwu status\n\n# Stop\nkaiwu stop\n```\n\nThe API is OpenAI-compatible. Any tool that works with the OpenAI API works with Kaiwu.\n\n## Advanced Usage\n\n```bash\n# Override context size\nkaiwu run Qwen3-8B --ctx-size 12000\n\n# Force re-tune (after hardware change)\nkaiwu run Qwen3-8B --reset\n\n# Fast start — skip warmup, use cached config only\nkaiwu run Qwen3-8B --fast\n\n# List available models\nkaiwu list\n\n# Inject IDE config automatically\nkaiwu inject\n```\n\n## What Gets Auto-Tuned\n\n| Parameter | How Kaiwu decides |\n|---|---|\n| Context length | Walks from model's native max down; stops where speed ≥ 20 tok\u002Fs |\n| KV cache type | Calculates f16 footprint; uses f16 → q8_0+q4_0 → iso3 by VRAM fit |\n| MoE expert placement | Detects `.ffn_.*_exps.` tensors; routes to CPU automatically |\n| ubatch size | Benchmarks 128 vs 512; picks the faster one |\n| Thread count | 2 for full-GPU, physical_cores\u002F2 for MoE offload |\n| mlock | Enabled when RAM headroom > 30% |\n| GPU tensor split | Weighted by VRAM × bandwidth when multiple GPUs detected |\n\n## 
Requirements\n\n- **GPU**: NVIDIA (CUDA) — 4GB+ VRAM recommended\n- **Driver**: ≥ 550.54 (Windows) \u002F ≥ 550.54 (Linux) — required for CUDA 12.4 runtime bundled with Kaiwu\n  - Check: `nvidia-smi` → look for \"Driver Version\"\n  - Update at: [nvidia.com\u002Fdrivers](https:\u002F\u002Fwww.nvidia.com\u002Fdrivers)\n- **OS**: Windows 10\u002F11, Linux (Ubuntu 20.04+)\n- **RAM**: 8GB+ (16GB+ for 30B MoE models)\n- **Model format**: GGUF\n\nCPU-only inference is supported but not the focus.\n\n## Commands\n\n| Command | What it does |\n|---|---|\n| `run \u003Cmodel>` | Start a model. Downloads if needed. |\n| `stop` | Stop the running model. |\n| `status` | Show running model, speed, VRAM usage. |\n| `list` | List available and downloaded models. |\n| `probe` | Show detected hardware. |\n| `inject` | Configure Continue\u002FCursor to use Kaiwu. |\n| `version` | Show version. |\n\n## Changelog\n\n### v0.3.2 — VRAM detection fix + MoE partial guard + RTX PRO support\n- Fixed VRAM over-reporting on Windows: Resizable BAR \u002F shared GPU memory caused nvidia-smi to report inflated VRAM (e.g. 4070 showing 31GB instead of 12GB). Now cross-checks XML vs CSV values and caps to known GPU VRAM limits\n- Fixed MoE partial mode OOM on small-VRAM cards: when model size > 1.2× total VRAM, forces `moe_offload` (all experts on CPU) instead of attempting `moe_partial` which would OOM\n- Added RTX PRO series (Blackwell professional) to bandwidth fallback table: PRO 6000\u002F5000\u002F4500\u002F4000\u002F2000 — fixes bandwidth=0 causing suboptimal tuning\n- Added `knownMaxVRAM()` lookup table covering all consumer\u002Fprofessional\u002Fdatacenter NVIDIA GPUs\n\n### v0.3.1 — MoE OOM root cause fix (--fit conflicts with --cpu-moe)\n- `--fit on` cannot be combined with `--cpu-moe`\u002F`--n-cpu-moe`（ik_llama.cpp docs）. Previous versions passed both, causing --fit to override MoE layer placement → OOM. 
Now only `full_gpu` uses `--fit on`; MoE modes use `-ngl 999` + explicit offload flags\n- `calcMoEMode` overhead: 1GB → 2.5GB (reserves KV cache + compute buffer space)\n- MoE + multi-GPU: skip `-sm graph`, use layer split + `GGML_CUDA_DISABLE_GRAPHS=1`\n- `isLikelyOOM` excludes missing .so errors\n\n### v0.2.9 — moe_partial + -sm graph detection + LD_LIBRARY_PATH\n- New `moe_partial` mode: calculates `--n-cpu-moe N` based on VRAM, keeping as many expert layers on GPU as possible. Enables running 120B MoE models on 8GB VRAM\n- `-sm graph` now runtime-detected: falls back to `--tensor-split` if binary doesn't support it. Prevents process exit from being misidentified as OOM\n- `isLikelyOOM` excludes parameter errors and timeouts from OOM detection\n- Linux: sets `LD_LIBRARY_PATH` to binary directory, fixing `libmtmd.so.0 not found`\n- `buildArgs`\u002F`BuildArgs` signature includes `binaryPath` for correct graph split detection\n\n### v0.2.8 — Three-mode selection + relative threshold (no more hardcoded 20 tok\u002Fs)\n- Warmup no longer filters by a fixed 18 tok\u002Fs threshold. Instead, collects all successful probe data points and derives three modes:\n  - Speed: fastest ctx (smallest ctx, highest tok\u002Fs)\n  - Balanced: largest ctx where TPS >= peak × 0.7\n  - Context: largest ctx where TPS >= 15 tok\u002Fs\n- Interactive selection after first warmup (10s timeout defaults to balanced), saved to config\n- Subsequent launches use cache; `--mode speed\u002Fbalanced\u002Fcontext` switches without re-warmup\n- MoE mode or identical ctx across modes skips selection menu\n\n### v0.2.7 — Blackwell JIT warmup (permanent fix for startup timeout)\n- RTX 50-series first run now executes `llama-server --version` to trigger PTX JIT compilation and populate CUDA cache. 
All subsequent launches (warmup probes + final start) read from cache and start in ~2s\n- No longer relies on extending timeouts to \"hope\" JIT finishes in time — JIT warmup runs once, result persists on disk across reboots\n- Multi-GPU `--kv-unified` skip also included (v0.2.6 fix)\n\n### v0.2.6 — Fix multi-GPU OOM (--kv-unified lands on GPU 0 only)\n- `--kv-unified` allocates entire KV cache on a single device (GPU 0). On dual 3090, model splits across both cards but KV cache all goes to GPU 0 → OOM. Now skipped when GPUCount > 1\n\n### v0.2.5 — (same as v0.2.6, tag issue)\n\n### v0.2.4 — Fix panic on older GPUs (P6000, Tesla, etc.)\n- `Fingerprint()` used hardcoded slice indices to remove dot from `ComputeCap` — panics on empty string. Replaced with `strings.ReplaceAll`\n\n### v0.2.3 — Fix Blackwell startup timeout misidentified as OOM\n- RTX 50-series startup timeout (90s) was too short for PTX JIT compilation (~60s) — timeout error was caught by `isLikelyOOM()` → false ctx-halving loop → 3 failures. Now 180s for Blackwell with distinct error message\n\n### v0.2.2 — Blackwell OOM fix + MoE VRAM optimization + --host flag\n- Fixed RTX 50-series (SM120) OOM on all context sizes: `--kv-unified` causes massive VRAM over-allocation with CUDA 12.4 binary on CUDA 13.x driver. Now skipped on Blackwell — llama.cpp uses paged KV allocation instead (grows on demand)\n- Fixed RTX 50-series startup timeout being misidentified as OOM: CUDA 12.4 binary on SM120 needs PTX JIT compilation (~60s), old 90s timeout caused false OOM → ctx halving loop. Now 180s for Blackwell, with distinct error message so `isLikelyOOM` won't trigger ctx retry\n- Warmup start point on Blackwell changed from `ideal×2` to `ideal` — the aggressive headroom caused all 8 probes to OOM before finding a working config\n- MoE VRAM reserve changed from hardcoded 1536MB to dynamic calculation (`model_size × 0.30`). After warmup, measured VRAM is written back for even more accurate KV cache type selection. 
Fixes users seeing 4GB+ unused VRAM while stuck on small ctx\n- iso3 detection no longer depends on `.kaiwu` marker file — `EnsureBinary` returns `isTurboQuant` directly (bundled = turboquant, downloaded = not)\n- New `--host` flag: `kaiwu run model --host 0.0.0.0` to listen on all interfaces (LAN access). Default remains `127.0.0.1`\n\n### v0.2.0 — iso3 detection rewrite (static, no more timeouts)\n- Replaced runtime iso3 detection (`--help` + timeout) with static check: marker file + SM >= 80. Eliminates all JIT timeout failures on RTX 50-series (SM120) and CUDA 13.x\n- CI now ships a `.kaiwu` marker file alongside the turboquant binary\n- Removed `DetectIso3Support`, `DetectIso3SupportForSM`, and all iso3 cache logic\n- New `ClusterCapabilities` architecture: multi-GPU capability decisions now take the intersection (min SM, all-support-iso3, all-support-FA) instead of relying on a single \"primary\" GPU. Resources (VRAM, bandwidth) are summed. Fixes heterogeneous multi-GPU misdetection (e.g. 4070+3060 where both have 12GB VRAM)\n- `PrimaryGPU()` now selects by bandwidth (not VRAM), used only for display — all capability checks go through `ClusterCaps()`\n- VRAM detection: added CSV fallback when XML `fb_memory_usage` returns 0 (newer driver schema changes). `parseMemValue` now handles \"MiB\", \"MB\", and comma-separated numbers\n- Warns when GPU VRAM=0 detected, with link to report the issue\n\n### v0.1.9 — Multi-GPU tensor split optimization\n- Multi-GPU tensor split now weighted by VRAM × bandwidth instead of VRAM alone. Heterogeneous setups (e.g. 
3090+4090+5060) get smarter layer distribution — weak cards receive fewer layers so they don't bottleneck the system\n- Multi-GPU display now shows each card individually with VRAM, bandwidth, and computed split ratio\n- `--fit on` now applied unconditionally for both full_gpu and moe_offload modes (was missing for moe_offload in fallback path)\n- Accel display shows tensor split ratio for multi-GPU without NVLink\n\n### v0.1.8 — MoE warmup no speed threshold\n- MoE offload warmup no longer uses a speed threshold. Speed is PCIe-bandwidth-limited, not context-limited — dropping ctx from 128K to 4K only improves speed ~20-30%, never enough to cross any threshold. Warmup now finds the largest ctx that fits in VRAM and reports whatever speed the hardware delivers\n- Warmup output now shows: `ℹ MoE offload · speed limited by PCIe bandwidth, not context size`\n\n### v0.1.7 — MoE warmup threshold + \u002Fresponses endpoint\n- MoE offload warmup threshold lowered 18 → 8 tok\u002Fs: laptop MoE is PCIe-limited to 13-15 tok\u002Fs max; the old threshold caused warmup to always fall back to the smallest ctx even when the model runs fine\n- Proxy now handles `\u002Fresponses` (without `\u002Fv1\u002F` prefix) in addition to `\u002Fv1\u002Fresponses` — fixes 404 errors from newer Cursor and Claude Code clients\n\n### v0.1.6 — MoE offload fix + direct path fix\n- Fixed MoE models (Qwen3-30B, DeepSeek, etc.) 
always failing warmup with OOM: the previous `-ot` regex for routing expert layers to CPU wasn't working; replaced with `--cpu-moe` which is natively supported\n- KV cache selection for MoE now trusts llama.cpp's `--fit` to handle layer placement, instead of guessing the GPU footprint with a hardcoded ratio\n- Warmup timeout extended 60s → 180s to handle large MoE models loading ~13GB from RAM\n- Fixed `kaiwu run \u002Fpath\u002Fto\u002Fmodel.gguf` silently downloading the model instead of using the local file (regression from v0.1.3)\n\n### v0.1.5 — iso3 cache + bandwidth-aware tuning\n- iso3 detection result cached to disk — same binary only detects once\n- SM-aware timeout: SM\u003C75 skipped, SM75-119 uses 15s, SM120+ uses 60s\n- OOM suggestion copy is now dynamic — small models no longer wrongly told to switch to MoE\n- Small models (\u003C2GB) ubatch reduced 512→128, fixing `--kv-unified` pre-allocation OOM\n- Memory bandwidth calculated from nvidia-smi XML (`bus_width × max_mem_clock × 2`)\n- Low-bandwidth GPUs (\u003C200 GB\u002Fs) only benchmark ubatch=128, saving 1-2 min warmup\n- Full GPU bandwidth table (GTX 10\u002F16\u002F20\u002F30\u002F40\u002F50 + datacenter V100\u002FP100\u002FH200)\n\n### v0.1.4 — Blackwell architecture support\n- Fixed iso3 detection timeout on RTX 50-series (SM120): 10s → 60s\n- Root cause: CUDA 12.4 has no SM120 precompiled kernels; PTX JIT takes ~30s on first run\n- Prints warning when SM120 detected: `⚠ RTX 50-series first launch requires JIT compilation (~30s)`\n\n### v0.1.3 — hybrid architecture + APEX quantization\n- APEX quantization presets: Quality (q8_0) \u002F Balanced (q5_k_m) \u002F Compact (q4_k_m)\n- Hybrid architecture detection: auto-disables iso3 + enables `--swa-full` for DeltaNet\u002FSSM models\n- Direct GGUF path support introduced (fixed properly in v0.1.6)\n- Flash Attention auto-enabled on SM75+\n- NVLink auto-detection\n- nvidia-smi XML parsing (replaces fragile CSV)\n- Fixed multi-GPU VRAM 
calculation\n\n### v0.1.2 — iso3 detection timing fix\n- Moved iso3 detection to Preflight (before warmup)\n- Root cause: warmup launched llama-server with iso3 flags before confirming support → all ctx probes failed, false OOM\n\n### v0.1.1 — initial release\n- Hardware probe: GPU (nvidia-smi), CPU, RAM\n- Model matcher: VRAM-based quantization selection, full_gpu \u002F moe_offload modes\n- Warmup benchmark: binary search for max ctx at ≥20 tok\u002Fs, ubatch measurement\n- Config cache: results saved to `~\u002F.kaiwu\u002Fprofiles\u002F`, 2s second launch\n- Bundled turboquant iso3 llama-server binary\n- OpenAI-compatible API at `http:\u002F\u002Flocalhost:11435\u002Fv1`\n\n---\n\n\u003Ca name=\"中文\">\u003C\u002Fa>\n# 开物 (Kaiwu)\n\n> *\"开物成务，利用厚生\"* — 明·宋应星《天工开物》\n\nLM Studio 和 Ollama 让模型能跑。Kaiwu 让模型跑好——靠实测，不靠猜。\n\n它探测你的 GPU、读取模型架构、测试 KV cache 选项，然后从模型的原生最大上下文往下走，找到你的硬件能以实用速度稳定跑出的最大窗口。结果缓存起来，第二次启动只需 2 秒。\n\n## 数据说话\n\n### 8GB 显卡跑 30B 模型——最难的场景\n\n模型：Qwen3-30B-A3B Q3_K_XL · RTX 5060 笔记本 8GB · Windows 11\n\n| | LM Studio | Kaiwu |\n|---|---|---|\n| 速度 | 3 tok\u002Fs | **8.7 tok\u002Fs** |\n| 上下文窗口 | 4K（默认） | **32K（自动）** |\n| 显存占用 | 7,549 MB（93%） | 4,800 MB（59%） |\n| 需要手动配置 | 是 | **不需要** |\n\nLM Studio 试图把整个模型塞进显存，直接 OOM。Kaiwu 识别出 MoE 架构，只把 attention 层放 GPU，128 个 expert 层走 CPU——在装不下整个模型的硬件上，跑出 32K 上下文的可用速度。\n\n### 8B 模型——日常使用\n\n模型：Llama 3.1 8B Q5_K_M · RTX 5060 8GB\n\n| | LM Studio | Kaiwu |\n|---|---|---|\n| 速度（8K 上下文） | 46.5 tok\u002Fs | **51.7 tok\u002Fs** |\n| 上下文窗口 | 4–8K（默认） | **64K（自动）** |\n\n速度持平甚至更快，上下文多 8 倍。Kaiwu 计算 f16 KV cache 能不能装进显存，能装就用——速度匹配 LM Studio，同时跑更大的上下文。\n\n### 双 4090——高端配置\n\n模型：Qwen3.6-35B-A3B · 2× RTX 4090 24GB\n\n- **115 tok\u002Fs** · **256K 上下文** · 自动多卡分配\n\n## 工作原理\n\n```\nkaiwu run Qwen3-30B-A3B\n```\n\n就这一句。Kaiwu 会：\n\n1. **探测硬件** — GPU 型号、显存、内存带宽、SM 版本、CPU 核数、内存\n2. **读模型信息** — 架构、层数、KV heads、原生上下文限制、MoE 结构\n3. **选 KV cache** — 计算 f16 占用；能装用 f16，不够降 q8_0+q4_0，显存极紧用 iso3\n4. 
**跑 warmup 基准测试** — 从最大上下文往下探，找速度 ≥ 20 tok\u002Fs 的最大值\n5. **调整参数** — ubatch 大小、线程数、mlock——全部实测，不靠猜\n6. **缓存结果** — 下次启动跳过 warmup，2 秒就绪\n\n第二次启动你会看到：\n\n```\n✓ 使用上次配置  (64K ctx · 26.2 tok\u002Fs · 3 天前)\n```\n\n## 安装\n\n**Windows** (PowerShell):\n```powershell\nirm https:\u002F\u002Fraw.githubusercontent.com\u002Fval1813\u002Fkaiwu\u002Fmain\u002Finstall.ps1 | iex\n```\n\n**Linux \u002F macOS**:\n```bash\ncurl -fsSL https:\u002F\u002Fraw.githubusercontent.com\u002Fval1813\u002Fkaiwu\u002Fmain\u002Finstall.sh | sh\n```\n\n也可以从 [Releases](https:\u002F\u002Fgithub.com\u002Fval1813\u002Fkaiwu\u002Freleases) 手动下载。\n\n## 快速开始\n\n```bash\n# 运行模型（没有会自动下载）\nkaiwu run Qwen3-30B-A3B\n\n# 运行本地 GGUF 文件\nkaiwu run \u002Fpath\u002Fto\u002Fmodel.gguf\n\n# 接入 IDE（Continue、Cursor、Claude Code）\n# API 地址：http:\u002F\u002Flocalhost:11435\u002Fv1\n\n# 查看运行状态\nkaiwu status\n\n# 停止\nkaiwu stop\n```\n\nAPI 兼容 OpenAI 格式，任何支持 OpenAI API 的工具都可以直接用。\n\n## 进阶用法\n\n```bash\n# 指定上下文大小\nkaiwu run Qwen3-8B --ctx-size 12000\n\n# 强制重新调参（换了硬件后）\nkaiwu run Qwen3-8B --reset\n\n# 快速启动——跳过 warmup，直接用缓存\nkaiwu run Qwen3-8B --fast\n\n# 列出可用模型\nkaiwu list\n\n# 自动配置 IDE\nkaiwu inject\n```\n\n## 自动调整的参数\n\n| 参数 | Kaiwu 怎么决定 |\n|---|---|\n| 上下文长度 | 从模型最大值往下探，找速度 ≥ 20 tok\u002Fs 的最大值 |\n| KV cache 类型 | 计算 f16 占用；按显存依次选 f16 → q8_0+q4_0 → iso3 |\n| MoE expert 位置 | 自动识别 `.ffn_.*_exps.` 张量，路由到 CPU |\n| ubatch 大小 | 实测 128 vs 512，取快的 |\n| 线程数 | 全 GPU 用 2，MoE offload 用物理核 \u002F2 |\n| mlock | 内存余量 > 30% 时自动开，防止模型被换出到磁盘 |\n| 多卡分配 | 按显存×带宽加权自动切分，弱卡少分活 |\n\n## 硬件要求\n\n- **显卡**：NVIDIA（CUDA）——建议 4GB+ 显存\n- **驱动**：≥ 550.54——Kaiwu 内置 CUDA 12.4 runtime，需要此版本驱动支持\n  - 查看：`nvidia-smi` → 看 \"Driver Version\"\n  - 更新：[nvidia.com\u002Fdrivers](https:\u002F\u002Fwww.nvidia.com\u002Fdrivers)\n- **系统**：Windows 10\u002F11，Linux（Ubuntu 20.04+）\n- **内存**：8GB+（30B MoE 模型建议 16GB+）\n- **模型格式**：GGUF\n\n支持纯 CPU 推理，但不是主要使用场景。\n\n## 命令列表\n\n| 命令 | 说明 |\n|---|---|\n| `run \u003C模型>` | 启动模型，没有会自动下载 |\n| `stop` | 停止运行中的模型 |\n| `status` | 
显示当前模型、速度、显存占用 |\n| `list` | 列出可用和已下载的模型 |\n| `probe` | 显示检测到的硬件信息 |\n| `inject` | 自动配置 Continue\u002FCursor 接入 Kaiwu |\n| `version` | 显示版本号 |\n\n## 版本历史\n\n### v0.2.3 — 修复 Blackwell 启动超时被误判为 OOM\n- RTX 50 系启动超时（90s）不够 PTX JIT 编译（~60s），超时错误被 `isLikelyOOM()` 捕获 → ctx 减半重试循环 → 三次全失败。Blackwell 现在 180s 超时，错误信息与 OOM 区分开\n\n### v0.3.2 — VRAM 检测修复 + MoE partial 保护 + RTX PRO 支持\n- 修复 Windows 下 VRAM 虚高：Resizable BAR \u002F 共享 GPU 内存导致 nvidia-smi 报告虚假 VRAM（如 4070 显示 31GB 而非 12GB）。现在 XML 与 CSV 交叉校验，并用已知 GPU VRAM 上限表兜底\n- 修复小显存卡 MoE partial 模式 OOM：当模型大小 > 1.2× 总 VRAM 时，强制走 `moe_offload`（全部 expert 放 CPU），不再尝试 `moe_partial`\n- 新增 RTX PRO 系列（Blackwell 专业卡）带宽枚举：PRO 6000\u002F5000\u002F4500\u002F4000\u002F2000——修复带宽=0 导致调参不准\n- 新增 `knownMaxVRAM()` 查找表，覆盖所有消费级\u002F专业\u002F数据中心 NVIDIA GPU\n\n### v0.3.1 — MoE OOM 根因修复（--fit 与 --cpu-moe 冲突）\n- `--fit on` 不能和 `--cpu-moe`\u002F`--n-cpu-moe` 同时使用（ik_llama.cpp 文档明确说明）。之前所有版本都同时传了两个，`--fit` 覆盖了 MoE 层分配 → OOM。现在只有 `full_gpu` 用 `--fit on`，MoE 模式用 `-ngl 999` + 显式 offload 参数\n- `calcMoEMode` overhead 从 1GB 增加到 2.5GB（预留 KV cache + compute buffer 空间）\n- MoE + 多卡：跳过 `-sm graph`，用 layer split + `GGML_CUDA_DISABLE_GRAPHS=1`\n- `isLikelyOOM` 排除 .so 缺失错误\n\n### v0.2.9 — moe_partial + -sm graph 检测 + LD_LIBRARY_PATH\n- 新增 `moe_partial` 模式：根据 VRAM 计算 `--n-cpu-moe N`，只把超出显存的 expert 层放 CPU，其余留 GPU。8GB 卡可跑 120B MoE 模型\n- `-sm graph` 改为运行时检测：binary 不支持时自动降级到 `--tensor-split`，不再因参数错误导致进程退出被误判 OOM\n- `isLikelyOOM` 排除参数错误和超时，不再触发错误的 ctx 减半重试\n- Linux 启动时设 `LD_LIBRARY_PATH` 到 binary 目录，修复 `libmtmd.so.0 not found`\n- `buildArgs`\u002F`BuildArgs` 签名加 `binaryPath` 参数，graph split 检测用正确的 binary\n\n### v0.2.8 — 三档模式选择 + 相对阈值（不再写死 20 tok\u002Fs）\n- Warmup 不再用固定 18 tok\u002Fs 阈值过滤。改为收集所有成功的 probe 数据点，探测结束后推导三档：\n  - 速度优先：最快的 ctx（最小 ctx，最高 tok\u002Fs）\n  - 均衡：峰值速度 × 0.7 阈值下的最大 ctx\n  - 上下文优先：速度 ≥ 15 tok\u002Fs 的最大 ctx\n- 首次 warmup 后交互选择（10s 超时默认均衡），选择结果保存到 config\n- 下次启动直接用缓存，`--mode speed\u002Fbalanced\u002Fcontext` 切换不需要重新 warmup\n- MoE 模式或三档 ctx 相同时跳过选择菜单\n\n### 
v0.2.7 — Blackwell JIT 预热（彻底解决启动超时）\n- RTX 50 系首次运行时，先执行 `llama-server --version` 触发 PTX JIT 编译并写入 CUDA 缓存。后续所有启动秒开\n- 不再依赖延长超时——JIT 预热只需一次，结果持久化到磁盘，重启也不丢\n- 多卡场景跳过 `--kv-unified`（v0.2.6 修复也包含在内）\n\n### v0.2.2 — Blackwell OOM 修复 + --host 参数\n- 修复 RTX 50 系（SM120）所有上下文大小都 OOM 的问题：`--kv-unified` 在 CUDA 12.4 binary + CUDA 13.x 驱动下会过度分配显存。Blackwell 现在跳过此参数，llama.cpp 改用分页式 KV 分配（按需增长）\n- Blackwell warmup 起点从 `ideal×2` 改为 `ideal`——激进的探顶策略导致 8 次探测全部 OOM\n- iso3 检测不再依赖 `.kaiwu` 标记文件——`EnsureBinary` 直接返回 `isTurboQuant`（bundled = turboquant，下载的 = 不是）\n- 新增 `--host` 参数：`kaiwu run model --host 0.0.0.0` 监听所有网卡（局域网访问）。默认仍为 `127.0.0.1`\n\n### v0.2.0 — iso3 检测重写（静态判断，不再超时）\n- iso3 检测从运行时（`--help` + 超时）改为静态判断：标记文件 + SM >= 80。彻底消除 RTX 50 系（SM120）和 CUDA 13.x 下的 JIT 超时误判\n- CI 打包时在 turboquant binary 旁放 `.kaiwu` 标记文件\n- 删除 `DetectIso3Support`、`DetectIso3SupportForSM` 及所有 iso3 缓存逻辑\n- 新增 `ClusterCapabilities` 架构：多卡能力判断改为取交集（最低 SM、全部支持 iso3、全部支持 FA），资源取总和。修复异构多卡误识别（如 4070+3060 同为 12GB 时主卡选错）\n- `PrimaryGPU()` 改为按带宽选主卡（不再按 VRAM），仅用于显示——所有能力判断走 `ClusterCaps()`\n- VRAM 检测：XML `fb_memory_usage` 返回 0 时自动用 CSV fallback（兼容新版驱动 schema 变化）。`parseMemValue` 支持 \"MiB\"\u002F\"MB\"\u002F逗号分隔数字\n- GPU VRAM=0 时打印警告和 issue 链接，方便用户反馈\n\n### v0.1.9 — 多卡 tensor split 优化\n- 多卡 tensor split 从纯按显存比例改为按 显存×带宽 加权。异构多卡（如 3090+4090+5060）分配更合理——弱卡少分层，不拖慢整体\n- 多卡显示改为逐卡列出（型号、显存、带宽、分配比例）\n- `--fit on` 现在对 full_gpu 和 moe_offload 两种模式都无条件启用（之前 fallback 路径的 moe_offload 漏了）\n- 加速特性显示新增 tensor split 比例（多卡无 NVLink 时）\n\n### v0.1.8 — MoE warmup 不再限速\n- MoE offload warmup 不再使用速度阈值。MoE 的速度瓶颈是 PCIe 带宽，不是 ctx 大小——ctx 从 128K 降到 4K 速度只提升 20-30%，永远到不了任何阈值。现在直接找显存能装下的最大 ctx，速度是多少就是多少\n- warmup 结束后新增提示：`ℹ MoE offload · speed limited by PCIe bandwidth, not context size`\n\n### v0.1.7 — MoE warmup 阈值 + \u002Fresponses 端点\n- MoE offload warmup 阈值从 18 降到 8 tok\u002Fs：笔记本 MoE 受 PCIe 带宽限制，上限约 13-15 tok\u002Fs，旧阈值导致 warmup 总是 fallback 到最小 ctx，即使模型跑得好好的\n- proxy 新增 `\u002Fresponses` 路由（不带 `\u002Fv1\u002F` 前缀），修复新版 Cursor 和 Claude Code 调用时的 
404\n\n### v0.1.6 — MoE offload 修复 + 直接路径修复\n- 修复 MoE 模型（Qwen3-30B、DeepSeek 等）warmup 全 OOM 的问题：之前用 `-ot` 正则把 expert 层路由到 CPU 实际没生效，改用 llama.cpp 原生支持的 `--cpu-moe`\n- MoE 模式的 KV cache 选择不再用硬编码比例猜 GPU 占用，改为信任 llama.cpp 的 `--fit` 自动处理层分配\n- warmup 超时从 60s 延长到 180s，适配大 MoE 模型从内存加载 ~13GB 的时间\n- 修复 `kaiwu run \u002Fpath\u002Fto\u002Fmodel.gguf` 实际走下载而非使用本地文件的 bug（v0.1.3 引入的回归）\n\n### v0.1.5 — iso3 缓存 + 带宽感知调参\n- iso3 检测结果缓存到磁盘，同一 binary 只检测一次\n- SM 版本感知超时：SM\u003C75 直接跳过，SM75-119 用 15s，SM120+ 用 60s\n- OOM 建议文案动态化：小模型 OOM 不再错误建议换 MoE\n- 小模型（\u003C2GB）ubatch 从 512 降到 128，修复 `--kv-unified` 预分配 OOM\n- 带宽从 nvidia-smi XML 精确计算（`bus_width × max_mem_clock × 2`）\n- 低带宽卡（\u003C200 GB\u002Fs）warmup 只测 ubatch=128，减少 1-2 分钟等待\n- 完整 GPU 带宽枚举表（GTX 10\u002F16\u002F20\u002F30\u002F40\u002F50 + 数据中心）\n\n### v0.1.4 — Blackwell 架构支持\n- 修复 RTX 50 系（SM120）iso3 检测超时：10s → 60s\n- 根因：CUDA 12.4 无 SM120 预编译 kernel，PTX JIT 编译需 ~30s\n- 检测到 SM120 时打印提示：`⚠ RTX 50 系首次启动需要 JIT 编译 (~30s)`\n\n### v0.1.3 — 混合架构支持 + APEX 量化\n- APEX 量化三档预设：Quality (q8_0) \u002F Balanced (q5_k_m) \u002F Compact (q4_k_m)\n- 混合架构动态检测：iso3 自动禁用 + `--swa-full` 补偿（DeltaNet\u002FSSM 架构）\n- 直接 GGUF 路径支持（v0.1.6 修复了实际生效的 bug）\n- Flash Attention 自动启用（SM75+）\n- NVLink 自动检测\n- nvidia-smi XML 解析（替代脆弱的 CSV 解析）\n- 修复多卡 VRAM 计算错误\n\n### v0.1.2 — iso3 检测时机修复\n- iso3 检测移到 warmup 之前（Preflight 阶段）\n- 根因：warmup 用 iso3 参数启动 llama-server，但 binary 不支持 → 所有 ctx 探测失败，误报 OOM\n\n### v0.1.1 — 初始版本\n- 硬件探测：GPU（nvidia-smi）、CPU、内存\n- 模型匹配：基于 VRAM 选量化，full_gpu \u002F moe_offload 两种模式\n- Warmup 基准测试：二分探测最大 ctx，ubatch 实测\n- 配置缓存：结果保存到 `~\u002F.kaiwu\u002Fprofiles\u002F`，第二次启动 2 秒\n- 内置 turboquant iso3 llama-server binary\n- OpenAI 兼容 API：`http:\u002F\u002Flocalhost:11435\u002Fv1`\n\n---\n\n## For Developers \u002F 贡献者\n\nBuild from source (requires Go 1.22+):\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fval1813\u002Fkaiwu.git\ncd kaiwu\nmake build-windows   # or build-linux\n```\n\n---\n\n\u003Cdiv align=\"center\">\n\nBuilt on 
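[llama.cpp](https:\u002F\u002Fgithub.com\u002Fggerganov\u002Fllama.cpp) · by [llmbbs.ai](https:\u002F\u002Fllmbbs.ai)\n\n\u003C\u002Fdiv>\n","Kaiwu is an open-source tool for deploying local models, with one-click deployment of mainstream models. It automatically probes the hardware, the model architecture, and the KV cache options, then sizes the context window from measured performance to get the best speed and resource use while staying compatible with the OpenAI API. It suits cases where large language models must run on limited GPU resources, such as individual developers or small teams developing and testing models on ordinary consumer hardware. With this tuning, even a laptop with 8GB of VRAM can run a 30B-parameter model effectively, which makes such models far more usable in practice.",2,"2026-06-11 02:43:11","CREATED_QUERY"]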