[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-905":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":16,"stars7d":17,"stars30d":18,"stars90d":15,"forks30d":15,"starsTrendScore":19,"compositeScore":20,"rankGlobal":9,"rankLanguage":9,"license":21,"archived":22,"fork":22,"defaultBranch":23,"hasWiki":24,"hasPages":22,"topics":25,"createdAt":9,"pushedAt":9,"updatedAt":26,"readmeContent":27,"aiSummary":28,"trendingCount":15,"starSnapshotCount":15,"syncStatus":29,"lastSyncTime":30,"discoverSource":31},905,"club-3090","noonghunna\u002Fclub-3090","noonghunna","Community recipes for serving LLMs on RTX 3090\u002F4090\u002F5090 CUDA gpus. Multi-engine (vLLM, llama.cpp, ik_llama) and model-agnostic. Currently shipping Qwen3.6-27B Qwen3.6 35B Gemma 4 26B Gemma 4 31B configs for 1× and 2× cards.",null,"Python",1303,67,22,15,0,25,91,557,75,18.5,"Apache License 2.0",false,"master",true,[],"2026-06-12 02:00:20","# club-3090\n\n**Recipes for serving LLMs locally on RTX 3090s.** Multi-engine (vLLM, llama.cpp, SGLang), multi-model, model-agnostic by design.\n\nIf you have one or two RTX 3090s and want to run modern LLMs at home, in a homelab, or as a dev backend — this repo collects the working configs, patches, and benchmarks.\n\n---\n\n## TL;DR — what this is\n\n- **Two complementary routes** — pick by what your workload breaks on:\n  - 🏎 **vLLM dual** = max throughput. Up to **127 TPS code** (DFlash) or **4 concurrent streams @ 262K** (turbo). Full feature stack (vision · tools · MTP · streaming).\n  - 🛡 **llama.cpp single** = max robustness. Full **262K context** on one 3090. Stress-tested clean: no prefill cliffs, 25K-token tool returns work, 90K needle ladder passes. 
Slower (~21 TPS) but doesn't crash on real-world tool-using agents.\n- **Validated docker compose configs** for both routes — drop-in OpenAI-compatible API on `localhost:8020`\n- **Multi-engine**: vLLM (full features), llama.cpp (max ctx + robustness), SGLang (currently blocked, watch list)\n- **Model-agnostic**: today ships configs for Qwen3.6-27B; structure scales as we add models\n\n**First time here?** → [Models](#supported-models) — pick yours.\n**Already running, want to compare engines?** → [docs\u002Fengines\u002F](docs\u002Fengines\u002F)\n**Hardware questions** (does this work on a 4090, do I need NVLink)? → [docs\u002FHARDWARE.md](docs\u002FHARDWARE.md)\n**Don't know what TPS \u002F KV \u002F MTP mean?** → [docs\u002FGLOSSARY.md](docs\u002FGLOSSARY.md)\n\n> ⚠️ **Known issue (2026-05-05)**: Single-card 24 GB long-context (>~50K tokens) on `long-text.yml` \u002F `long-text-no-mtp.yml` \u002F `long-vision.yml` can OOM despite Genesis v7.72.2's PN59 fix. PN59's runtime eligibility check rejects the chunked-prefill path that 24 GB single-card configs are forced to take. Filed at [Sandermage\u002Fgenesis-vllm-patches#22](https:\u002F\u002Fgithub.com\u002FSandermage\u002Fgenesis-vllm-patches\u002Fissues\u002F22), pending Sander review. **If you hit it**: switch to `dual.yml` \u002F `dual-turbo.yml` (TP=2 escapes the cliff) or `llamacpp\u002Fdefault` (different engine, no Cliff 2). 
See [docs\u002FCLIFFS.md](docs\u002FCLIFFS.md) for the full diagnosis.\n\n---\n\n## Pick your path\n\n| You have | Start here |\n|---|---|\n| **1× RTX 3090** | [`docs\u002FSINGLE_CARD.md`](docs\u002FSINGLE_CARD.md) — workload → config → quick start |\n| **2× RTX 3090** (PCIe \u002F no NVLink) | [`docs\u002FDUAL_CARD.md`](docs\u002FDUAL_CARD.md) — workload → config → quick start |\n| **3+ GPUs** (any class — 4× 3090, 8× A6000, mixed) | [`docs\u002FMULTI_CARD.md`](docs\u002FMULTI_CARD.md) — TP scaling math, derivation from `dual.yml`, valid TP values |\n| Considering self-host vs cloud APIs | [`docs\u002FCOMPARISONS.md`](docs\u002FCOMPARISONS.md) — cost crossover + when each wins |\n\nEach hardware page lists every supported model with the working composes for that card count, plus measured TPS and per-workload pitfalls. Model-specific deep dives (quants, Genesis patches, engine internals) live under [`models\u002F\u003Cname>\u002F`](models\u002F).\n\n---\n\n## Supported models\n\n| Model | Status | Card counts | Engines | Highlights |\n|---|---|---|---|---|\n| **[Qwen3.6-27B](models\u002Fqwen3.6-27b\u002F)** | Production-ready ⭐ | 1× \u002F 2× 3090 | vLLM ✅ · llama.cpp ✅ · SGLang ❌ blocked | Vision · tools · MTP n=3 · up to 262K ctx · vLLM dual = 89\u002F127 TPS · llama.cpp single = full 262K, no prefill cliffs |\n\nMore models coming. The repo structure scales — when we add Qwen3.5-27B \u002F GLM-4.6 \u002F etc., they go under `models\u002F\u003Cname>\u002F` with the same internal pattern.\n\n---\n\n## Measured TPS at a glance\n\n![Qwen3.6-27B TPS by config](docs\u002Fimg\u002Fperformance.png)\n\nBench protocol: 3 warm + 5 measured runs of the canonical narrative + code prompts. Substrate: vLLM nightly `0.20.1rc1.dev16+g7a1eb8ac2` + Genesis v7.69 dev tip (commit `2db18df`), with local backports `patch_inputs_embeds_optional.py` (vllm#35975) and `patch_tolist_cudagraph.py`. llama.cpp mainline `0d0764dfd`, RTX 3090 sm_86 PCIe-only at 230 W. 
Per-config details + run-by-run numbers + VRAM + AL\u002Faccept rates: [models\u002Fqwen3.6-27b\u002FCHANGELOG.md](models\u002Fqwen3.6-27b\u002FCHANGELOG.md) (per-model history) and [scripts\u002Fbench.sh](scripts\u002Fbench.sh) (canonical bench).\n\n---\n\n## Quick start (for the current model — Qwen3.6-27B on vLLM)\n\n```bash\n# 1. Clone the repo\ngit clone https:\u002F\u002Fgithub.com\u002Fnoonghunna\u002Fclub-3090.git\ncd club-3090\n\n# 2. Download + SHA-verify the model (~20 GB; clones Genesis patches too)\nbash scripts\u002Fsetup.sh qwen3.6-27b\n\n# 3. Pick a config + boot it (interactive wizard — asks engine \u002F cards \u002F workload)\nbash scripts\u002Flaunch.sh\n#    Or skip the wizard:\n#      bash scripts\u002Flaunch.sh --variant vllm\u002Fdefault      # single-card chat (recommended)\n#      bash scripts\u002Flaunch.sh --variant vllm\u002Fdual         # dual-card 262K + vision\n#      bash scripts\u002Flaunch.sh --variant llamacpp\u002Fdefault  # single-card 262K, no cliffs\n#    See all variants:\n#      bash scripts\u002Fswitch.sh --list\n\n# 4. Sanity test (launcher already printed this curl)\ncurl -sf http:\u002F\u002Flocalhost:8020\u002Fv1\u002Fchat\u002Fcompletions \\\n  -H \"Content-Type: application\u002Fjson\" \\\n  -d '{\"model\":\"qwen3.6-27b-autoround\",\"messages\":[{\"role\":\"user\",\"content\":\"Capital of France?\"}],\"max_tokens\":200}'\n\n# 5. Run the canonical benchmark\nbash scripts\u002Fbench.sh\n\n# 6. Switch later without re-clicking through the wizard:\nbash scripts\u002Fswitch.sh vllm\u002Flong-vision   # for example\n\n# 7. 
Keep your install up-to-date as the stack moves (Genesis pin bumps,\n#    new compose variants, vendored patch updates):\nbash scripts\u002Fupdate.sh\n#   - bails if your tree has uncommitted edits (commit or stash first)\n#   - git pull --ff-only origin master, then re-runs setup.sh\n#   - tells you to restart your container via switch.sh after — so you can\n#     A\u002FB old-vs-new before bringing the new variant up\n#   launch.sh + switch.sh also soft-warn at boot when your checkout is\n#   behind origin\u002Fmaster, so you'll usually find out before you ask.\n```\n\n`launch.sh` calls `switch.sh` (down old, up new) and then `verify-full.sh` so you know it's serving cleanly before you point a client at it. See [`scripts\u002F`](scripts\u002F) for all helpers.\n\nFor client snippets — Python (`openai` SDK + raw `requests`), TypeScript \u002F Node, plus connection settings for Open WebUI, Cline, Cursor, and other OpenAI-compat clients — see [`docs\u002FEXAMPLES.md`](docs\u002FEXAMPLES.md). Common questions (\"can I use a 4090?\", \"why MTP not EAGLE?\", \"why not Ollama?\", \"what's a prefill cliff?\") have answers in [`docs\u002FFAQ.md`](docs\u002FFAQ.md). Trying to decide self-host vs cloud APIs vs other local options? [`docs\u002FCOMPARISONS.md`](docs\u002FCOMPARISONS.md). Want to contribute numbers, bug repros, or new variants? [`CONTRIBUTING.md`](CONTRIBUTING.md). Tracking the upstream issues and PRs we depend on or have filed? 
[`docs\u002FUPSTREAM.md`](docs\u002FUPSTREAM.md).\n\n**Hit an issue or want to share bench numbers?** Run `bash scripts\u002Freport.sh > my-rig.md` (add `--full` for the canonical \"everything\" pass: rig + verify-full + verify-stress 7\u002F7 + SOAK_MODE=continuous + bench, ~35 min) and paste into the [bug](https:\u002F\u002Fgithub.com\u002Fnoonghunna\u002Fclub-3090\u002Fissues\u002Fnew?template=bug-report.yml) or [bench](https:\u002F\u002Fgithub.com\u002Fnoonghunna\u002Fclub-3090\u002Fissues\u002Fnew?template=numbers-from-your-rig.yml) issue template — single command captures everything we'd otherwise ask for individually. **Not on our shipped Docker composes?** Scripts now work on non-Docker host builds (llama.cpp host server, SGLang, etc.) via `URL=... CONTAINER=none MODEL=... bash scripts\u002F...` — see [discussion #88](https:\u002F\u002Fgithub.com\u002Fnoonghunna\u002Fclub-3090\u002Fdiscussions\u002F88) for the full contributor flow.\n\nFor llama.cpp (different engine, different recipe — useful for max context on single-card):\n```bash\ncd models\u002Fqwen3.6-27b\u002Fllama-cpp && cat README.md\n```\n\n---\n\n## Repo layout\n\n```\nclub-3090\u002F\n├── README.md                              this file — start here\n├── CHANGELOG.md                           cross-cutting changes (engine pin bumps, script updates)\n├── LICENSE                                Apache-2.0\n├── docs\u002F\n│   ├── ARCHITECTURE.md                    how this stack thinks about LLM serving on 24 GB\n│   ├── HARDWARE.md                        Ampere SM 8.6+, NVLink note, 24 GB ceilings\n│   ├── GLOSSARY.md                        plain-language definitions (TPS \u002F KV \u002F MTP \u002F TP \u002F etc.)\n│   ├── UPSTREAM.md                        every upstream issue \u002F PR we depend on or have filed\n│   ├── CLIFFS.md                          full synopsis of the prefill cliffs (root causes + fix landscape)\n│   ├── img\u002F                               chart sources 
(performance.svg, vram-budget-{single,dual,combined}.svg) + PNG exports\n│   └── engines\u002F                           cross-model engine comparison + per-engine deep dives\n│       ├── README.md                      decision tree, pros\u002Fcons matrix\n│       ├── VLLM.md                        vLLM general docs + tuning\n│       ├── LLAMA_CPP.md                   llama.cpp general docs + 262K recipe\n│       └── SGLANG.md                      blocked status + watch list\n├── models\u002F\n│   └── qwen3.6-27b\u002F                       all Qwen3.6-27B-specific stuff\n│       ├── README.md                      model overview + variants + recommendations\n│       ├── INTERNALS.md                   model-specific bugs + engineering rationale (DeltaNet cliffs, Genesis patches, MTP head, Marlin pad, DFlash)\n│       ├── CHANGELOG.md                   model-specific dated history\n│       ├── vllm\u002F\n│       │   ├── README.md                  \"vLLM recipes for Qwen3.6-27B\"\n│       │   ├── compose\u002F                   docker-compose files (single-card + dual-card variants)\n│       │   └── patches\u002F                   tolist_cudagraph + Marlin pad README + Genesis pointer\n│       ├── llama-cpp\u002F\n│       │   ├── README.md                  \"llama.cpp recipes for Qwen3.6-27B\"\n│       │   └── recipes\u002F                   single-card 65K + 262K-max-ctx + dual-card recipes\n│       └── sglang\u002F\n│           └── README.md                  blocked status — what would unblock it on this model\n├── scripts\u002F                               shared, model-aware\n│   ├── setup.sh                           bash setup.sh \u003Cmodel> → preflight + downloads + verifies + Genesis\n│   ├── launch.sh                          interactive wizard: cards → workload → boots compose + verifies\n│   ├── switch.sh                          stateless variant switcher (bring down old, up new)\n│   ├── 
update.sh                          one-shot upgrade: git pull + re-pin Genesis + re-vendor patches\n│   ├── health.sh                          runtime health probe (KV %, MTP AL, recent TPS, errors)\n│   ├── preflight.sh                       sourceable lib: docker \u002F GPU \u002F disk \u002F repo-drift \u002F Genesis-pin checks\n│   ├── verify.sh                          quick smoke test (engine-aware via env)\n│   ├── verify-full.sh                     fast functional test (8 checks, ~1-2 min)\n│   ├── verify-stress.sh                   boundary-case stress test (longctx ladder + tool prefill OOM, ~5-10 min)\n│   ├── soak-test.sh                       runtime VRAM accretion \u002F multi-turn agent traffic (~10-30 min, opt-in)\n│   ├── bench.sh                           canonical TPS bench\n│   └── report.sh                          paste-ready triage report (run before filing a bug or sharing bench numbers)\n└── tools\u002F\n    └── charts\u002F                            re-generate docs\u002Fimg\u002F* SVGs and PNG exports (matplotlib)\n        ├── gen-perf.py                    perf bar charts (combined + single + dual)\n        └── gen-vram.py                    VRAM stacked bars (combined + single + dual)\n```\n\n---\n\n## What you'll need\n\n| For any model on this stack | Notes |\n|---|---|\n| 1× or 2× NVIDIA RTX 3090 (24 GB each) | Larger Ampere\u002FAda cards (4090, A6000) work; smaller cards (12 GB) don't fit 27B-class models. |\n| Linux (Ubuntu 22.04+ tested) | macOS\u002FWindows: vLLM is Linux + CUDA only. Llama.cpp works on macOS\u002FWindows but recipes assume Linux paths. |\n| Docker + NVIDIA Container Toolkit | For vLLM. llama.cpp works without Docker. |\n| NVIDIA driver 580.x+ | For CUDA 13 runtime in vLLM nightly. |\n| ~30 GB free disk | Per model. More for multiple models. 
|\n\nSee [docs\u002FHARDWARE.md](docs\u002FHARDWARE.md) for hardware-specific notes (PCIe vs NVLink, power draw, etc.).\n\n---\n\n## How this is structured\n\n**Engines and hardware are general** — the docs in `docs\u002F` apply across models. vLLM works the same way regardless of whether you're serving Qwen, GLM, or Llama; the engine docs cover that once.\n\n**Models are specific** — under `models\u002F\u003Cname>\u002F`, you find that model's quants, quirks, recommended configs, and engine-specific recipes. Adding a new model means adding a new subdir with the same internal pattern.\n\n**Scripts are shared but model-aware** — `bash scripts\u002Fsetup.sh qwen3.6-27b` downloads the right model + clones the right patches. When we add another model, you'd run `bash scripts\u002Fsetup.sh glm-4.6` and the same script handles it.\n\nThis separation keeps the stack maintainable as it grows. We don't want a model-specific README at the top; we want the top to be \"stack docs\" and the model details under their dedicated subdirs.\n\n---\n\n## Migration history\n\n- **2026-04-28** — Repo created. Consolidates and supersedes:\n  - [`noonghunna\u002Fqwen36-27b-single-3090`](https:\u002F\u002Fgithub.com\u002Fnoonghunna\u002Fqwen36-27b-single-3090) (single-card recipe; archived for issue history)\n  - [`noonghunna\u002Fqwen36-dual-3090`](https:\u002F\u002Fgithub.com\u002Fnoonghunna\u002Fqwen36-dual-3090) (dual-card recipe; archived for issue history)\n\n  Old repos remain readable for existing issue threads, external links (Medium articles, Reddit posts), and historical context. 
New issues should be filed here.\n\nSee [CHANGELOG.md](CHANGELOG.md) for the merged dated history.\n\n---\n\n## Credits\n\nThe stack stands on a lot of shoulders:\n\n- **Qwen team** ([@Alibaba_Qwen](https:\u002F\u002Fhuggingface.co\u002FQwen)) — for the base models and the MTP head architecture\n- **[Lorbus](https:\u002F\u002Fhuggingface.co\u002FLorbus\u002FQwen3.6-27B-int4-AutoRound)** — for the AutoRound INT4 quant with preserved BF16 `mtp.fc` (the model this whole stack runs on)\n- **[Sandermage](https:\u002F\u002Fgithub.com\u002FSandermage\u002Fgenesis-vllm-patches)** — Genesis patch tree for TurboQuant + hybrid models on consumer Ampere; root-causing #40880 and shipping the v7.14 fix\n- **[vibhavagarwal5](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fvllm\u002Fpull\u002F38479)** — TurboQuant landing PR + tracking issue #40069\n- **[vLLM project](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fvllm)** — the engine + active maintenance\n- **[llama.cpp](https:\u002F\u002Fgithub.com\u002Fggerganov\u002Fllama.cpp)** — the alternative engine path\n- **[Luce z-lab](https:\u002F\u002Fgithub.com\u002Fluce-spec)** — DFlash N=5 draft model for Qwen3.6-27B\n- **Intel AutoRound** — quantization framework\n- **All cross-rig contributors** — [@ampersandru](https:\u002F\u002Fgithub.com\u002Fampersandru), [@walmis](https:\u002F\u002Fgithub.com\u002Fwalmis), [@3dluvr](https:\u002F\u002Fgithub.com\u002F3dluvr), and the Reddit \u002F X local-LLM community for benchmark data and bug reports.\n\n---\n\n## License\n\nApache 2.0. Do what you want with it. If you get better numbers on your rig — open an issue. 
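If you add a new model with working configs — open a PR.\n","The club-3090 project provides configurations and recipes for running large language models (LLMs) locally on RTX 3090 GPUs. It supports multiple engines (vLLM, llama.cpp, SGLang) and multiple models, is model-agnostic by design, and currently ships Qwen3.6-27B configs for one and two cards. Its core features are maximum throughput via vLLM (up to 127 TPS) and maximum robustness via llama.cpp (262K context). The project also provides validated Docker Compose files so users can quickly deploy an OpenAI-compatible API. It suits individual developers or small labs with RTX 3090s who want to run modern LLMs at home or as a dev backend.",2,"2026-06-11 02:40:11","CREATED_QUERY"]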