[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-2390":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":16,"stars7d":14,"stars30d":17,"stars90d":15,"forks30d":15,"starsTrendScore":13,"compositeScore":18,"rankGlobal":9,"rankLanguage":9,"license":9,"archived":19,"fork":19,"defaultBranch":20,"hasWiki":19,"hasPages":19,"topics":21,"createdAt":9,"pushedAt":9,"updatedAt":32,"readmeContent":33,"aiSummary":34,"trendingCount":15,"starSnapshotCount":15,"syncStatus":35,"lastSyncTime":36,"discoverSource":37},2390,"qwen3.6-windows-server","devnen\u002Fqwen3.6-windows-server","devnen","One-click Qwen3.6-27B inference on Windows. 158 tok\u002Fs on RTX 5090, 72 tok\u002Fs on RTX 3090. Native, no WSL, no Docker, no telemetry.",null,"Python",201,22,3,7,0,1,64,52.99,false,"main",[22,23,24,25,26,27,28,29,30,31],"llm-inference","local-llm","offline-ai","privacy","qwen","qwen3","rtx-3090","textual-tui","vllm","windows","2026-06-12 04:00:14","# qwen3.6-windows-server\n\n> **One-click [Qwen3.6-27B](https:\u002F\u002Fhuggingface.co\u002FQwen) inference on Windows.**\n> Unzip, double-click, you're serving on `http:\u002F\u002F127.0.0.1:5001\u002Fv1`.\n> No WSL, no Docker, no conda, no pip, no admin. **Everything runs on\n> your machine. No telemetry. No analytics. 
No phone-home.**\n\n[![License: Apache 2.0](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-Apache_2.0-blue.svg)](https:\u002F\u002Fopensource.org\u002Flicenses\u002FApache-2.0)\n[![Made for Windows](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FOS-Windows%2010%2F11-0078d6.svg)](https:\u002F\u002Fwww.microsoft.com\u002Fwindows)\n[![GPU](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Ftested-RTX%203090-76b900.svg)](https:\u002F\u002Fwww.nvidia.com\u002F)\n[![Local AI](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F100%25-local%20%2F%20offline-success.svg)](#the-local-ai-ethos)\n\n---\n\n## What this is\n\nA small portable Windows app that gives you an OpenAI-compatible API\nserving Qwen3.6-27B locally, with config presets that I actually\nmeasured myself. The launcher is a Textual TUI: arrow keys, Enter\nto start a snapshot, Esc to stop. Press `e` to add, edit, duplicate, or\ndelete your own snapshot configs from inside the TUI, no hand-editing\nfiles. That's the whole UX.\n\nIt is the matching launcher for the [`devnen\u002Fvllm-windows`](https:\u002F\u002Fgithub.com\u002Fdevnen\u002Fvllm-windows)\npatched wheel, but you don't need to know or care about that. The wheel\nships inside the launcher zip.\n\n## What you get\n\nOn a single RTX 3090 (24 GB), running [Lorbus AutoRound INT4](https:\u002F\u002Fhuggingface.co\u002FLorbus\u002FQwen3.6-27B-int4-AutoRound):\n\nEvery snapshot below has the tool-calling fix baked in (PR #35687 + #40861 + `qwen3.5-enhanced.jinja` + `preserve_thinking=false`), so any one of them works with any OpenAI-compatible client, Claude Code, Cline, Cursor, Codex, OpenCode, KiloCode, LM Studio, etc. Just point it at the listed port.\n\n| Snapshot              | Decode tok\u002Fs | Prompt class      | Context | Use it when |\n|-----------------------|--------------|-------------------|---------|-------------|\n| `start_72tps`         | **~72**      | short (~200 tok)  | 32 k    | Short-prompt \u002F chat baseline. MTP n=3. 
|\n| `start_speed`         | **64.5**     | long (100 KB)     | 90 k    | Default for long prompts. MTP n=6, see note below. |\n| `start_127k`          | 53.4         | long (100 KB)     | 127 k   | Maximum context on a single 3090. |\n| `start_mtp4`          | 58.3         | long (100 KB)     | 120 k   | Mid-balance speed vs context. |\n| `start_pp2_160k` (2 GPU) | 43.5      | long (100 KB)     | 160 k   | Pipeline-parallel for the largest contexts. |\n| `start_gpu0_50k`      | 56.9         | mixed             | 9–50 k  | Single-GPU + display, fallback when you can't boot-quiet. |\n\n> **GPU index note.** `start_72tps`, `start_speed`, `start_127k`, and\n> `start_mtp4` pin to **GPU 1** so GPU 0 stays free for the desktop\n> compositor and other apps on a 2× 3090 box. On a single-GPU host the\n> snapshot detects that via `nvidia-smi` and falls back to GPU 0 with a\n> warning. `start_pp2_160k` requires two GPUs.\n\n> **Single 3090 with display attached.** You can run the full\n> `start_speed` snapshot at 90 k context if you close heavy GPU apps\n> (Chrome, Discord, Slack, video playback) **during boot**. Once vLLM\n> has reserved its KV pool, the driver schedules everything else around\n> what vLLM already owns, so you can reopen those apps and they'll\n> behave normally. The danger is reopening them before boot finishes,\n> mid-allocation OOM is what kills runs. If you can't or won't\n> boot-quiet, `start_gpu0_50k` is the conservative fallback (`mem_util\n> 0.92`, ~50 k ctx, same decode tok\u002Fs).\n\nLong-prompt rows were measured on a ~100 KB \u002F ~24 k-token Python\nsource-summary prompt fed to `windows_tools\\bench_summarize.py`. From\nv1.3.3 the shipped fixture is CPython 3.12's `Lib\u002Finspect.py`\n(~130 KB, ~25 k tokens, PSF-licensed) so anyone can reproduce these\nnumbers from a clean install, `windows_tools\u002Fbench_prompt_sample.py`\nis the file. The short-prompt row was measured on a ~200-token chat\nturn via `windows_tools\\bench.py`. 
All numbers\n[coherence-validated](docs\u002FCOHERENCE.md), TPS without coherence is a\nlie.\n\n> **Why MTP n=6 on `start_speed`?** n=3 is the universal *short-prompt*\n> sweet spot and ships as `start_72tps`. On long, dense Python source\n> the acceptance curve shifts later, n=6 won my coherence sweep\n> (n=3 \u002F 4 \u002F 5 \u002F 6 \u002F 7 \u002F 8 → 53.4 \u002F 58.3 \u002F 62.8 \u002F 64.5 \u002F 61.5 \u002F 58.0\n> tok\u002Fs; full sweep in [`docs\u002FTUNING.md`](docs\u002FTUNING.md)). Always\n> re-sweep on a representative prompt for your workload.\n\n> **Honest framing:** these are not r\u002FLocalLLaMA records. Community has\n> hit 80–82 tok\u002Fs on a 3090 with TurboQuant 3-bit KV, and 160 tok\u002Fs on a\n> 5090. The unique angle here is **native Windows, no WSL**. Same\n> recipe, no virtualization tax, one community member measured the\n> same hardware going from **85 tok\u002Fs in WSL to 160 tok\u002Fs in native\n> Ubuntu** ([reported here](https:\u002F\u002Fwww.reddit.com\u002Fr\u002FLocalLLaMA\u002Fcomments\u002F1sw21op\u002Fcomment\u002Foid8d9n\u002F)).\n> This launcher closes that gap on Windows.\n\n## Community Linux \u002F Blackwell validation\n\nAn independent Linux run on an RTX PRO 5000 Blackwell 48 GB card\nvalidated the same Qwen3.6-27B NVFP4 direction at 256K context with\nvLLM 0.20.2, FlashInfer 0.6.8.post1, ModelOpt NVFP4 weights, fp8_e4m3\nKV cache, chunked prefill, and MTP.\n\nHighlights from that reproduction:\n\n| Test | Result |\n|------|--------|\n| 47K health check | 46,855 prompt tokens, 4\u002F4 needles, roughly 5,800 tok\u002Fs estimated prefill |\n| 200K target | 197,391 prompt tokens, 4\u002F4 needles |\n| 256K stretch | 252,510 prompt tokens, 4\u002F4 needles after raising output budget |\n| NVFP4 path | `FlashInferCutlassNvFp4LinearKernel` + Triton\u002FFLA GDN prefill + FlashInfer attention |\n| MTP n=3 | 87.8% acceptance, 97.8 tok\u002Fs engine decode |\n| MTP n=6 | 78.2% acceptance, 120.9 tok\u002Fs engine decode 
|\n\nThis is not a Windows launcher snapshot and it does not replace the\n3090 numbers above. It is a Linux\u002FBlackwell validation showing that the\nNVFP4 + fp8 KV path can run 200K-256K practical needle tests on a single\n48 GB Blackwell workstation GPU without OOM. Full report and raw data:\n[`docs\u002Fpro5000-linux-nvfp4\u002F`](docs\u002Fpro5000-linux-nvfp4\u002F).\n\n## Why this exists\n\nMost fast Qwen3.6-27B recipes on r\u002FLocalLLaMA assume Linux + Docker, or\nLinux-in-WSL. Windows users either pay the WSL tax, dual-boot, or skip\ninference entirely. None of those is great if your daily driver is\nWindows.\n\nThis launcher is the third option:\n\n- **Native Windows.** Runs as a normal Windows process. No virtualization layer.\n- **Portable.** Unzip the launcher, drop your model into a folder, double-click. That's it.\n- **Validated.** Every config in here was measured against a coherence battery before being checked in. No copy-pasted Reddit recipes that look fast but emit `* * * *`.\n- **Local-only.** No outbound calls except when you explicitly ask the launcher to download a model from HuggingFace. No telemetry of any kind, ever.\n\n## Install\n\n**TL;DR for CI \u002F agents \u002F scripted installs**, one line, no TUI:\n\n```powershell\nstart.bat --auto-download --snapshot start_72tps\n```\n\nThat installs the runtime, downloads the model if missing, and starts\nserving on `http:\u002F\u002F127.0.0.1:5001\u002Fv1`. See\n[Headless \u002F scripted install](#headless--scripted-install) below for\nall the flags.\n\n**Hand the install to a coding agent**, copy\u002Fpaste prompt at\n[`docs\u002FAGENT_INSTALL_PROMPT.md`](docs\u002FAGENT_INSTALL_PROMPT.md). Edit the\none `INSTALL_DIR` line, paste into Claude Code \u002F Cursor \u002F Codex CLI \u002F\nany agent with shell access, and it does the download + extract +\nruntime install + model fetch + smoke test end-to-end while you do\nsomething else.\n\n**Interactive path:**\n\n1. 
Download the right zip for your GPU from the\n   [latest Release](..\u002F..\u002Freleases\u002Flatest):\n   - `qwen3.6-windows-server-portable-x64-ampere.zip` for 30-series \u002F 40-series (Ampere, Ada).\n   - `qwen3.6-windows-server-portable-x64-blackwell.zip` for 50-series (Blackwell).\n\n   Extract anywhere (no admin needed).\n2. Double-click `start.bat`. The first run does two one-time steps,\n   then drops you in the TUI:\n   - **Runtime install** (~5–15 min, several GB). The bundled vLLM\n     wheel + ~150 transitive deps (torch, CUDA wheels, transformers,\n     etc.) install into the embedded Python's `site-packages`. A\n     marker file is written so subsequent launches skip this entirely.\n   - **Model setup.** Looks for `Qwen3.6-27B-int4-AutoRound` weights\n     on your fixed drives (scans `\u003Cdrive>:\\`, `_models\\`, `models\\`,\n     `AI\\`, `AI\\models\\`, `huggingface\\`, `huggingface\\hub\\`,\n     `models\\Lorbus\\`). If it doesn't find them, offers to\n     **auto-download from Hugging Face** (~16 GB, public, no token)\n     or accepts a path to weights you already have. If your weights\n     live somewhere else, pass `--model-dir \u003Cpath>` to skip the scan.\n3. Pick a snapshot, press Enter, you're serving on\n   `http:\u002F\u002F127.0.0.1:5001\u002Fv1`.\n\nThe portable zip ships with an embedded Python 3.12 runtime, the\npatched vLLM wheel, the launcher TUI, a portable Windows Terminal,\nand a vendored `get-pip.py`. No conda, no system-Python install, no\nregistry changes, no admin prompts. The runtime install on first run\nis the only network-dependent step besides the model download.\n\nDon't have the model yet? 
See [`docs\u002FMTP_HEAD.md`](docs\u002FMTP_HEAD.md),\n**use the Lorbus AutoRound quant**, the others won't draft.\n\nDetailed install (including the wheel-only path for users who already\nhave their own venv): [`docs\u002FINSTALL.md`](docs\u002FINSTALL.md).\n\n## Optional: install MSVC 2022 for the small decode boost\n\nThe launcher works on a vanilla Windows install, no MSVC required.\nBut if you install **Visual Studio 2022 Build Tools** (free, no full\nIDE) with the **\"Desktop development with C++\"** workload, the\nsnapshots auto-detect it and turn on vLLM's flashinfer sampler path,\nwhich JIT-compiles a faster top-k \u002F top-p kernel on first launch.\n\nWhat it costs:\n- ~7 GB download, one-time install.\n- Extra 30 to 60 s on the first `profile_run` of each new snapshot\n  while the kernel compiles. Subsequent boots reuse the compiled\n  cache.\n\nWhat you get:\n- A small but measurable decode boost on the sampler path.\n\nWithout MSVC, the snapshots transparently fall back to the PyTorch\nsampler, which never JIT-compiles anything. Boot is faster and the\nserver is reliable; you just leave a few percent of decode tok\u002Fs on\nthe table. 
The launcher prints a one-line `[info]` at startup telling\nyou which path it picked.\n\nGet the Build Tools installer here (official Microsoft `aka.ms`\nshortlink, pinned to VS 2022 \u002F 17.x so it stays on the right product\neven after VS 2026 ships):\nhttps:\u002F\u002Faka.ms\u002Fvs\u002F17\u002Frelease\u002Fvs_buildtools.exe\n\nninja (the other half of the JIT toolchain) ships inside the\nlauncher zip, you don't need to install it separately.\n\n## Test it\n\nOnce the server is up:\n\n```powershell\ncurl http:\u002F\u002F127.0.0.1:5001\u002Fv1\u002Fchat\u002Fcompletions ^\n  -H \"Content-Type: application\u002Fjson\" ^\n  -d \"{\\\"model\\\":\\\"any\\\",\\\"messages\\\":[{\\\"role\\\":\\\"user\\\",\\\"content\\\":\\\"Capital of France?\\\"}],\\\"max_tokens\\\":2000}\"\n```\n\nNote the `\"model\": \"any\"`, the patched wheel accepts any value. You\ndon't have to know what the model is called.\n\n> **Why `max_tokens: 2000`?** Qwen3.6 is a thinking model: it spends the\n> first chunk of its budget reasoning inside `\u003Cthink>...\u003C\u002Fthink>` and\n> only then writes the answer to `content`. With `max_tokens: 50` the\n> entire budget gets eaten by the thinking phase and you'll see\n> `content: null` plus `finish_reason: \"length\"`, the server is fine,\n> the budget was just too small. 1500–2000 is a safe floor for short Q&A.\n\n> **Where's the answer in the response?** The final answer lands in\n> `choices[0].message.content`. The chain-of-thought lands in a separate\n> `choices[0].message.reasoning` field, that's the `--reasoning-parser=qwen3`\n> wheel patch doing its job, not a bug. 
Most chat clients show\n> `content` and ignore `reasoning`; if yours doesn't, point it at\n> `content`.\n\n> **If the request hangs**, tail `logs\\vllm_server.\u003Cport>.log` for vLLM's\n> own stdout, the parent launcher logs only the boot banner; the\n> serving process tees its progress to that file.\n\n## Headless \u002F scripted install\n\nEnd-to-end automated install (no TUI, no prompts), useful for CI,\nremote machines, agent installers, or just keeping a repeatable\nrecipe:\n\n```powershell\nstart.bat --auto-download --snapshot start_72tps\n```\n\nThe launcher runs the first-run setup (vLLM wheel + ~150 deps),\nauto-downloads the Lorbus quant from Hugging Face if it's missing,\napplies the tokenizer patch automatically, and execs the chosen\nsnapshot, all without opening the TUI. Other useful flags:\n\n```powershell\nstart.bat --model-dir D:\\models\\Qwen3.6-27B-int4-AutoRound --snapshot start_speed\nstart.bat --headless     :: skip TUI, run the default snapshot (start_72tps)\nstart.bat --setup-only   :: install runtime + model, then exit (no serving)\n```\n\n`--headless` without `--snapshot` now runs the default snapshot\n(`start_72tps`) instead of exiting after setup checks. To run only\nthe setup checks (the old `--headless` behavior), pass `--setup-only`.\n\nThe launcher also stays in the parent terminal, instead of detaching\ninto a new Windows Terminal window, when it sees any of `WT_SESSION`,\n`VLLM_NO_WT`, `CI`, `GITHUB_ACTIONS`, `MSYSTEM`, or `TERM` in the\nenvironment. That covers GitHub Actions, git-bash, MSYS, agent\nrunners, and anything that exports `TERM`. 
So your captured stdout\nwon't go missing.\n\nFor benchmark numbers like the table above, use the bundled tools:\n\n```powershell\nwindows_tools\\bench.bat              :: short prompt, decode-only TPS\nwindows_tools\\bench_summarize.bat    :: ~100 KB \u002F ~24 k-token prompt, prefill + decode + KV\nwindows_tools\\check_coherence.bat    :: 3-tier coherence validator\n```\n\n## Hardware reality\n\nTuned and measured on:\n\n- Windows 10 Enterprise 22H2\n- 2× NVIDIA RTX 3090 (Ampere `sm_86`), no NVLink, PCIe Gen 4\n- 350 W power cap (250 W also benchmarked, see [`docs\u002FTUNING.md`](docs\u002FTUNING.md))\n\nShould also work on any Ampere or Ada NVIDIA GPU running Windows 10\u002F11,\n3090, 4090, A6000, etc. **Will not work** on Pascal, Turing, Intel Arc,\nor any AMD card. **Single GPU with the display attached** loses 1–3 GiB\nof VRAM to the desktop compositor and another 2–5 GiB to running apps,\nbut you can still run the full `start_speed` snapshot at 90 k context\nby closing heavy GPU apps (Chrome, Discord, Slack, video playback)\nduring boot, then reopening them after vLLM finishes booting. If you\ncan't boot-quiet, fall back to `start_gpu0_50k`. Either path is\ncovered in [`docs\u002FWINDOWS_VRAM_HEADLESS.md`](docs\u002FWINDOWS_VRAM_HEADLESS.md).\n\n> **RTX 50-series (Blackwell, 5060 \u002F 5070 \u002F 5080 \u002F 5090): supported via\n> the Blackwell zip.** Download\n> `qwen3.6-windows-server-portable-x64-blackwell.zip` instead of the\n> default zip. It bundles `vllm-0.20.0+cu132.devnen.2` against CUDA 13.2\n> \u002F PyTorch cu130 with `sm_120` kernels. **v1.3.0 ships NVFP4 as the new\n> default** (`rtx5090_nvfp4`, port 5001) using the\n> [`Peutlefaire\u002FQwen3.6-27B-NVFP4`](https:\u002F\u002Fhuggingface.co\u002FPeutlefaire\u002FQwen3.6-27B-NVFP4)\n> weights. These route FFN GEMMs through FlashInfer's sm_120 native\n> FP4 tensor cores, escaping the 170W prefill ceiling that AutoRound INT4\n> hits on consumer Blackwell. 
Measured on a single RTX 5090 at 575W:\n> **~5,300 tok\u002Fs prefill @ 47k prompt (5x AutoRound), ~92 tok\u002Fs decode**\n> at 200k context. A second snapshot `rtx5090_nvfp4_vision` (180k ctx)\n> ships as experimental for image and video input. As of v1.3.7,\n> NVFP4 is the only supported 5090 path; the AutoRound INT4 5090\n> snapshots have been removed since they cannot escape the 170W\n> ceiling on consumer Blackwell. NVIDIA driver 596+ required. See [`docs\u002FBLACKWELL.md`](docs\u002FBLACKWELL.md) for\n> the full story and [`docs\u002FSM120_GDN_CEILING.md`](docs\u002FSM120_GDN_CEILING.md)\n> for the prefill-ceiling investigation.\n\nIf you're on a 4090, expect slightly higher numbers than mine. If\nyou're on something more exotic, nothing here is going to work without\nyour own tuning, that's fine, please share what you find.\n\n> **Scope.** This launcher serves Qwen3.6-27B specifically through a\n> fixed set of validated snapshots. It is not a general vLLM server you\n> can point at any model. Adding configs for smaller Qwen variants is\n> straightforward (see [`docs\u002FSNAPSHOTS.md`](docs\u002FSNAPSHOTS.md));\n> running unrelated models like ACE-Step, Stable Diffusion, or other\n> diffusion \u002F multimodal stacks is out of scope.\n\n## The local-AI ethos\n\nEverything runs on your machine. No telemetry. No analytics. No\nphone-home. No cloud inference. No model weights downloaded behind your\nback. The launcher never opens an outbound connection except when you\nexplicitly ask it to (downloading a model from HuggingFace via your own\nbrowser\u002F`huggingface-cli`). This is in the spirit of\n[r\u002FLocalLLaMA](https:\u002F\u002Fwww.reddit.com\u002Fr\u002FLocalLLaMA\u002F): your hardware,\nyour weights, your prompts, your business.\n\nThe launcher and every script are Apache-2.0. The bundled wheel inherits\nupstream vLLM's Apache-2.0 license. 
SHA256 of every release asset is\npublished next to the release, verify before extracting.\n\n## What's under the hood\n\nThe wheel that powers this launcher is\n[`devnen\u002Fvllm-windows`](https:\u002F\u002Fgithub.com\u002Fdevnen\u002Fvllm-windows): a\npatched native-Windows build of [vLLM](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fvllm),\nwith three Windows-specific fixes (CPU-relay for Gloo collectives, Qwen3\nreasoning-parser fix mirrored from PR #35687, hardwired wildcard model\nname). The full diff is at\n[`CHANGES_VS_SYSTEMPANIC.md`](https:\u002F\u002Fgithub.com\u002Fdevnen\u002Fvllm-windows\u002Fblob\u002Fmain\u002FCHANGES_VS_SYSTEMPANIC.md)\nin that repo. You don't have to download it separately, it's bundled\ninside this launcher's portable zip.\n\n## Documentation\n\n- [`docs\u002FINSTALL.md`](docs\u002FINSTALL.md), full install + the bring-your-own-venv path.\n- [`docs\u002FUPGRADING.md`](docs\u002FUPGRADING.md), in-place updater (`update.bat`), preserve list, variant switching.\n- [`docs\u002FBLACKWELL.md`](docs\u002FBLACKWELL.md), single landing page for RTX 50-series users.\n- [`docs\u002FMODELS.md`](docs\u002FMODELS.md), swapping in other quants \u002F model sizes \u002F Qwen variants.\n- [`docs\u002FCLAUDE_CODE.md`](docs\u002FCLAUDE_CODE.md), point Claude Code at the local server (native `\u002Fv1\u002Fmessages`, no proxy).\n- [`docs\u002FCODEX.md`](docs\u002FCODEX.md), use OpenAI Codex CLI with this server (Responses API, `developer`-role template fix).\n- [`docs\u002FOPENCODE.md`](docs\u002FOPENCODE.md), point OpenCode at the local server (custom OpenAI-compatible provider, AGENTS.md path-handling rule).\n- [`docs\u002FQWEN_CLI.md`](docs\u002FQWEN_CLI.md), use Alibaba's official Qwen Code agent against this server.\n- [`docs\u002FPI.md`](docs\u002FPI.md), use the Pi coding agent with this server (custom provider extension, auto-compaction notes).\n- [`docs\u002FUNINSTALL.md`](docs\u002FUNINSTALL.md), clean removal (it's portable, so 
just delete folders).\n- [`docs\u002FHARDWARE.md`](docs\u002FHARDWARE.md), what works, what doesn't, and why.\n- [`docs\u002FCOMPARISON.md`](docs\u002FCOMPARISON.md), how this stacks up against Ollama, LM Studio, llama.cpp, Docker, and WSL2.\n- [`docs\u002FCOHERENCE.md`](docs\u002FCOHERENCE.md), degenerate-output guide and the 3-tier validator.\n- [`docs\u002FTROUBLESHOOTING.md`](docs\u002FTROUBLESHOOTING.md), every failure mode I've hit.\n- [`docs\u002FTUNING.md`](docs\u002FTUNING.md), the lever set, anti-levers, how to sweep your own configs.\n- [`docs\u002FMTP_HEAD.md`](docs\u002FMTP_HEAD.md), why Lorbus AutoRound is the only INT4 quant that works (the NVFP4 weights ship a separate MTP head and are documented in [`docs\u002FBLACKWELL.md`](docs\u002FBLACKWELL.md) and [`docs\u002FSM120_GDN_CEILING.md`](docs\u002FSM120_GDN_CEILING.md)).\n- [`docs\u002FSPEC_DECODE_MATRIX.md`](docs\u002FSPEC_DECODE_MATRIX.md), what spec-decode + parallelism combos work.\n- [`docs\u002FSNAPSHOTS.md`](docs\u002FSNAPSHOTS.md), managing snapshots from inside the TUI: keyboard shortcuts, the CRUD editor, flag invariants, hand-edit fallback.\n- [`docs\u002FWINDOWS_VRAM_HEADLESS.md`](docs\u002FWINDOWS_VRAM_HEADLESS.md), free VRAM on Windows for single-GPU.\n- [`docs\u002FHALLUCINATED_FLAGS.md`](docs\u002FHALLUCINATED_FLAGS.md), flags from web search results that don't exist on this wheel.\n- [`docs\u002FCREDITS.md`](docs\u002FCREDITS.md), vLLM team, SystemPanic, Lorbus, the community.\n\n## Contributing\n\nBug reports welcome, please include GPU model, driver version, Windows\nbuild, and the relevant slice of `logs\\vllm_server.\u003Cport>.log`. The\n[issue template](.github\u002FISSUE_TEMPLATE\u002Fbug_report.md) walks you\nthrough it.\n\n**Share your configs.** Each snapshot in `snapshots\u002F` is just a\nvalidated set of vLLM flags for one hardware\u002Fmodel combo, plus a card\nin `launcher\u002Fconfigs.yaml` so the launcher can list it. 
If you've got\na config that runs coherent and faster (or with more context) than\nwhat's in here, please send a PR. The bar is the\n[3-tier coherence check](docs\u002FCOHERENCE.md), TPS without coherence\nwon't be merged.\n\nConfigs I'd love to see:\n\n- Other Qwen3.6-27B quants (FP8, additional NVFP4 variants, smaller AutoRound variants)\n- Smaller Qwen models (14B, 8B, 4B) for 16 GB cards\n- 4090 \u002F 5090 \u002F 5060 Ti \u002F A6000 tunings\n- New parallelism or KV-cache combos as vLLM adds them\n\nHow to add a snapshot: [`docs\u002FSNAPSHOTS.md`](docs\u002FSNAPSHOTS.md) (in-TUI editor and hand-edit fallback).\n\nThis project is intentionally narrow scope: **Windows + Ampere\u002FAda\u002FBlackwell\nNVIDIA**. PRs for other operating systems or GPU vendors are politely\nout of scope, please go upstream.\n\n## Credits\n\n- [vLLM](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fvllm), the engine.\n- [SystemPanic\u002Fvllm-windows](https:\u002F\u002Fgithub.com\u002FSystemPanic\u002Fvllm-windows), the upstream Windows wheel build infrastructure.\n- [Lorbus](https:\u002F\u002Fhuggingface.co\u002FLorbus), the AutoRound INT4 quant of Qwen3.6-27B that makes the Ampere\u002FAda path fast.\n- [Peutlefaire](https:\u002F\u002Fhuggingface.co\u002FPeutlefaire), the NVFP4 quant that unlocks consumer Blackwell's full prefill throughput on the 5090.\n- [r\u002FLocalLLaMA](https:\u002F\u002Fwww.reddit.com\u002Fr\u002FLocalLLaMA\u002F), the configs in here started from recipes posted on the subreddit, and got refined by the honest feedback in the comments.\n","This project provides one-click inference serving for the Qwen3.6-27B model on Windows. Its core features include local deployment, running without extra software (such as WSL or Docker), support for RTX 5090 and RTX 3090 GPUs, and fully offline operation with no data sent back, which protects user privacy. With a simple unzip and double-click, users can start an OpenAI API-compatible server locally. It suits scenarios that need high-performance text generation but also value data security, such as individual developers, small businesses, or research institutions with strict privacy requirements.",2,"2026-06-11 02:49:44","CREATED_QUERY"]