[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-83241":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":16,"stars7d":17,"stars30d":17,"stars90d":15,"forks30d":15,"starsTrendScore":18,"compositeScore":19,"rankGlobal":9,"rankLanguage":9,"license":20,"archived":21,"fork":21,"defaultBranch":22,"hasWiki":23,"hasPages":21,"topics":24,"createdAt":9,"pushedAt":9,"updatedAt":25,"readmeContent":26,"aiSummary":9,"trendingCount":15,"starSnapshotCount":15,"syncStatus":27,"lastSyncTime":28,"discoverSource":29},83241,"vLLM-2080Ti-Definitive","weicj\u002FvLLM-2080Ti-Definitive","weicj","The definitive vLLM runtime for dual RTX 2080 Ti 22GB + NVLink, delivering 27B\u002F31B local inference with 100+ tok\u002Fs single-request decode with support of FP8 weight ",null,"Python",149,25,51,5,0,27,91,119,4.24,"Apache License 2.0",false,"sm75-tp2-cu128-stable",true,[],"2026-06-12 02:04:32","\u003C!-- markdownlint-disable MD001 MD041 -->\n# ⚡ vLLM 2080 Ti Definitive Edition\n\n![vLLM 2080 Ti Definitive Edition cover](docs\u002Fassets\u002Fvllm-2080ti-cover.jpg)\n\nThe definitive vLLM runtime for dual RTX 2080 Ti \u002F SM75 serving.\n\nThis is a hardware-focused fork that preserves the patched source, launch\nprofiles, and runtime notes needed to reproduce the working 2080 Ti vLLM stack.\n\nFork release: `v0.1.4`\nBase vLLM: `0.21.0`\n\nHeadline evidence: Qwen3.6 27B reaches `100+ tok\u002Fs` single-request decode, and\nGemma4 31B reaches `~100 tok\u002Fs` single-request decode on the same dual 2080 Ti\nTP=2 runtime.\n\nLanguage: English | [简体中文](README.zh-CN.md)\n\n![Live single-request speed demo](docs\u002Fassets\u002Fvllmspeed.gif)\n\n## 💡 Why RTX 2080 Ti for LLM Inference?\n\nIn August 2018, NVIDIA launched the RTX 2080 Ti and moved the enthusiast GPU\nline from GTX into the RTX era. Years later, the card is still remembered as a\nlandmark Turing design. With 22GB memory mods, NVLink, high memory bandwidth,\nand enough raw compute to remain relevant, dual 2080 Ti cards turn out to be a\nsurprisingly strong local AI inference platform.\n\n| Metric | 2x 2080 Ti 22GB + NVLink | 3090 Ti 24GB baseline | Ratio |\n|---|---:|---:|---:|\n| Physical CUDA core count | 8,704 | 5,376 | 1.62x |\n| SM count | 136 | 84 | 1.62x |\n| Physical Tensor Core count | 1,088 | 336 | 3.24x |\n| Dense Tensor FP16 matrix throughput | 228 TFLOPS | 160 TFLOPS | 1.43x |\n| Total physical memory bandwidth | 1,232 GB\u002Fs | 1,008 GB\u002Fs | 1.22x |\n| Total VRAM capacity | 44GB | 24GB | 1.83x |\n| Secondary-market price anchor | about $550 with NVLink | about $1,100 | about 0.5x |\n\nThe project is built around a simple cost\u002Fperformance bet: use roughly half the\nsecondary-market price of an RTX 3090 Ti to get a dual 22GB RTX 2080 Ti setup\nthat can match or exceed it on the physical resources that matter for LLM\nserving, then use vLLM runtime work to turn those resources into real tokens.\n\nThat is the first value of this fork: take old but strong Turing silicon and\nmake it behave like a serious 27B\u002F31B-class inference platform through Marlin,\nFlashQLA\u002FFlashInfer\u002FFA2, TurboQuant\u002FINT8 KV, MTP, and CUDAGraph integration.\n\n## 🧩 Core Routes\n\nServing shape:\n\n- This project optimizes for extreme single-concurrency performance on dual\n  2080 Ti: one personal-agent style workload, one serious 27B\u002F31B model, and\n  the largest practical context window this hardware can sustain.\n- It is not a multi-tenant serving stack. Multi-agent use is supported best as\n  queued workspace isolation, not as parallel long-prefill throughput. Long\n  prefill work is capacity-safe when tuned, but it is effectively serialized by\n  the runtime scheduler on this TP=2 profile.\n\nStatus: 🟢 full support; 🟡 partial support; 🔴 performance regression; ⚪ not\nsupported.\n\n### Qwen3.6 27B Mature Route\n\nQwen-family 27B is the primary production route for this fork. It has the most\ncomplete coverage across Marlin weights, native MTP, FP16\u002FINT8\u002FTurboQuant KV,\nnative 256K context, YaRN capacity experiments, and image-serving compatibility.\nFast path: Qwen uses FlashQLA-SM70-SM75 for Gated DeltaNet \u002F linear-attention\nprefill, FlashInfer \u002F FA2 for full-attention prefill, head_dim=256 fast-path\ncontrols, and native MTP with CUDAGraph for decode.\n\n| Feature | FP16 KV | INT8 KV | TurboQuant KV |\n|---|---|---|---|\n| Marlin weight route | 🟢 FP8\u002FINT4 | 🟢 FP8\u002FINT4 | 🟢 FP8\u002FINT4 |\n| Native MTP3 decoding | 🟢 short-context speed route | 🟢 capacity + speed route | 🟢 compressed-capacity route |\n| Native 256K context | 🟢 noMTP real prompt supported | 🟡 capacity\u002Fspeed candidate | 🟢 real prompt\u002Fservice supported |\n| YaRN 512K extension | ⚪ not the target route | 🟢 supported capacity route | 🟡 capacity candidate |\n| No-eager \u002F CUDAGraph | 🟢 supported | 🟢 supported | 🟢 graph-safety fixed |\n| Fast prefill path | 🟢 FlashInfer \u002F FA2 | 🟢 FlashInfer \u002F INT8 path | 🟢 TurboQuant FlashInfer path |\n| Multimodal image serving | 🟢 default-KV route | 🔴 output corruption observed | 🟢 recommended image route |\n| Peak MTP3 PP4096\u002FTG128 | 🟢 1747.52 \u002F 100.98 tok\u002Fs | 🟢 1744.06 \u002F 81.12 tok\u002Fs | 🟢 1746.32 \u002F 85.94 tok\u002Fs |\n\nMode note: FP16 KV is the stable service path. INT8 KV and TurboQuant KV are\nspeed-mode paths for capacity, YaRN, workspace isolation, and experiments.\n\n### Gemma4 31B Experimental Route\n\nGemma4 31B is kept as a secondary experimental route. The FP16\u002Fdefault-KV path\nis fast and useful, but Gemma's head_dim=512 and heterogeneous\u002FGQA attention\nmake compressed-KV and long-context routes much less mature than Qwen.\nFast path: Gemma uses the default-KV fast route for short-context FP16 service;\ncompressed long-context paths still fall back to SDPA\u002FGQA compatibility, and\nassistant MTP is compatible but more workload-sensitive.\n\n| Feature | FP16 KV | INT8 KV | TurboQuant KV |\n|---|---|---|---|\n| Marlin weight route | 🟢 GPTQ target | 🟢 GPTQ target | 🟢 GPTQ target |\n| Assistant MTP decoding | 🟢 MTP5 peak route | 🔴 experimental | 🔴 MTP regression |\n| Native 256K context | ⚪ capacity-negative | 🟡 slow offline route | ⚪ not enough practical capacity |\n| YaRN 512K extension | ⚪ not supported | ⚪ not supported | ⚪ not supported |\n| No-eager \u002F CUDAGraph | 🟢 supported | 🔴 fallback-heavy | 🟢 repaired for compatibility |\n| Fast prefill path | 🟢 default-KV fast path | 🔴 SDPA\u002FGQA fallback cost | 🟡 short-context fast, long-context limited |\n| Multimodal image serving | 🟢 default-KV route | ⚪ not supported | ⚪ not supported |\n| Peak PP4096\u002FTG128 | 🟢 MTP5 1655.65 \u002F 99.64 tok\u002Fs | 🔴 not a serving profile | 🟡 noMTP 1596.15 \u002F 31.70 tok\u002Fs |\n\n## 🧪 Tested Model Checkpoints\n\nThis section records checkpoint-level validation. It is intentionally stricter\nthan \"vLLM can load it\": a supported checkpoint can start and generate, while a\nrecommended checkpoint also has a useful speed\u002Fcontext tradeoff on dual 2080 Ti.\n\n| Model route | Weight route | Model cards | Status |\n|---|---|---|---|\n| Qwen3.6 27B FP8 | FP8 | [Jackrong\u002FQwopus3.6-27B-v2-FP8](https:\u002F\u002Fhuggingface.co\u002FJackrong\u002FQwopus3.6-27B-v2-FP8) | 🟢 Recommended |\n| Qwen3.6 27B INT4 | INT4 | [mconcat\u002FQwopus3.6-27B-v2-AWQ-4bit](https:\u002F\u002Fhuggingface.co\u002Fmconcat\u002FQwopus3.6-27B-v2-AWQ-4bit)\u003Cbr>[QuantTrio\u002FQwen3.6-27B-AWQ](https:\u002F\u002Fhuggingface.co\u002FQuantTrio\u002FQwen3.6-27B-AWQ)\u003Cbr>[llmfan46\u002FQwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved-GPTQ-Int4](https:\u002F\u002Fhuggingface.co\u002Fllmfan46\u002FQwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved-GPTQ-Int4) | 🟢 Recommended |\n| Qwen3.6 27B AutoRound | AutoRound INT8 | [Minachist\u002FQwen3.6-27B-INT8-AutoRound W8A16-GS128](https:\u002F\u002Fhuggingface.co\u002FMinachist\u002FQwen3.6-27B-INT8-AutoRound\u002Ftree\u002FW8A16-GS128) | 🟡 Supported |\n| Gemma4 31B GPTQ | GPTQ-INT4 + assistant draft | [ebircak\u002Fgemma-4-31B-it-4bit-W4A16-GPTQ](https:\u002F\u002Fhuggingface.co\u002Febircak\u002Fgemma-4-31B-it-4bit-W4A16-GPTQ) | 🟡 Supported |\n\nFP8 is the recommended high-quality Qwen 8-bit route; INT4 remains the default\nperformance\u002Fcapacity route. AutoRound INT8 is experimental.\n\n## 🛠️ Target Hardware & Runtime\n\n- Validated GPU profile: dual RTX 2080 Ti 22GB, SM75, NVLink, tensor parallel\n  size 2\n- CUDA\u002FPyTorch: CUDA 12.8, `torch 2.11.0+cu128`\n- Fork release: `v0.1.4`\n- Base vLLM: `0.21.0`\n- Repository identity: `vllm-2080ti-definitive`\n- Main stable runtime identity: `vllm-sm75-tp2-cu128`\n- Compatibility target: NVIDIA Turing \u002F SM75 GPUs. Other Turing cards still\n  need profile validation for VRAM capacity, P2P\u002FNVLink behavior, model\n  head_dim, KV dtype, and CUDAGraph\u002FMTP settings.\n\n## 🚀 How To Use\n\nFor a source checkout:\n\n```bash\n.\u002Fbuild.sh\n.\u002Fstart.sh\n```\n\nThen choose three things in the launcher:\n\n1. Checkpoint directory\n2. Profile, usually `stable-*` first\n3. Port and local\u002FLAN access\n\nA successful launch prints an OpenAI-compatible API URL. For scripted use:\n\n```bash\nMODEL_DIR=\u002Fpath\u002Fto\u002Fqwen-or-gemma-checkpoint \\\nPROFILE=stable-qwen27-int4-fp16kv-mtp3-256k.env \\\nPORT=8000 \\\nSERVICE_SCOPE=lan \\\nCUDA_VISIBLE_DEVICES=0,1 \\\n.\u002Fstart.sh --non-interactive\n```\n\nWhen `PROFILE` is set, the profile supplies the launch mode. Use custom `MODE`\nonly for no-profile experiments.\n\n## 🧭 Profiles\n\nStart from [Model Profile Routes](docs\u002Fmodel-profile-routes.md). Pick the model\nfamily and weight precision first, then choose one of two profile classes:\n\n- `stable-*`: recommended daily service profiles. These use stable mode and\n  FP16\u002Fdefault KV only.\n- `speed-*`: high-performance profiles for INT8 KV, TurboQuant KV, YaRN, and\n  other context-capacity or throughput-first routes.\n\nThe launcher applies the mode from the selected profile and refuses stable mode\nwith quantized KV, so profile names and runtime behavior stay aligned.\n\nExperimental profiles live under `profiles\u002Fexperimental\u002F`. They are kept in the\nsame `profiles` tree for discoverability, but the launcher only scans\n`profiles\u002F*.env` by default. To try an experimental route, set the launcher\nProfile directory to `profiles\u002Fexperimental`.\n\n## 🚀 MTP And KV Precision\n\nUse the bundled profiles instead of hand-tuning MTP and KV settings first.\nMTP is already set to the best practical value for each route. FP16\u002Fdefault KV\nis the safest choice for stable service; quantized KV is provided through\nspeed-mode profiles when the priority is context capacity or throughput.\n\nDetailed benchmark notes are kept in\n[MTP Task Sensitivity](docs\u002Fmtp-task-sensitivity.md) and\n[Qwen3.6 KV Throughput Sweep](docs\u002Fqwen36-kv-throughput-sweep.md).\n\n## ❓ Hardware Q&A\n\n**Q: What GPU interconnect is required?**\n\nA: NVLink is recommended, but PCIe P2P is the real baseline requirement. The\nvalidated system uses NVLink and an intentionally non-ideal PCIe topology, with\none card at PCIe 3.0 x1 and the other at PCIe 3.0 x4. With NVLink carrying\nGPU-to-GPU traffic, PCIe slot bandwidth is not the main bottleneck. Without\nNVLink, do not treat narrow PCIe links as proven sufficient; confirm P2P\nbehavior and benchmark the actual topology.\n\n**Q: Does the host need a strong CPU or a lot of RAM?**\n\nA: No. The validated path has run on a low-end desktop CPU with 16GB RAM. More\nCPU\u002FRAM mainly helps compile cache generation, downloads, and local build work,\nnot steady-state token generation.\n\n**Q: Which Turing GPUs make sense? Can I mix 11GB and 22GB cards?**\n\nA: The fully validated target is dual RTX 2080 Ti 22GB. Other good candidates\nare high-VRAM TU102-class cards: TITAN RTX 24GB, Quadro RTX 6000 24GB, and\nQuadro RTX 8000 48GB, preferably in pairs with NVLink or confirmed PCIe P2P.\nMixed 11GB + 22GB RTX 2080 Ti setups are not recommended for these 27B\u002F31B\nprofiles because vLLM TP=2 is effectively constrained by the smaller rank.\nSmaller Turing cards can run smaller models, but they are not the main target\nfor this stack.\n\n**Q: Which CUDA, PyTorch, and driver versions are validated?**\n\nA: The stable runtime is CUDA 12.8 + `torch 2.11.0+cu128`. Use a recent NVIDIA\ndriver that supports your host GPUs and is compatible with the CUDA runtime. Do\nnot mix build\u002Fruntime assumptions casually: keep the PyTorch CUDA lane, local\nCUDA toolkit, FlashInfer\u002FFlashQLA builds, and launch profile aligned.\n\n**Q: What other hardware risks matter?**\n\nA: Cooling, power stability, and enough SSD space for model files and compile\ncaches. Thermal throttling can hide as a software regression, especially during\nlong prefill or repeated CUDAGraph\u002FAOT compilation runs.\n\n## 🔗 Related Project\n\n- [2080Ti-LLM-Toolbox](https:\u002F\u002Fgithub.com\u002Fweicj\u002F2080Ti-LLM-Toolbox): companion\n  toolbox for dual 2080 Ti model routes, benchmark summaries, model notes, and\n  operational guidance. This repository focuses on the patched vLLM runtime\n  itself.\n\n## 🙏 Credits \u002F Upstream Projects\n\nThis repository is a hardware-focused fork of upstream\n[vLLM](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fvllm), licensed under Apache-2.0. The\nfork keeps the upstream project structure and adds local SM75 runtime patches,\nlaunch profiles, and validation notes for the dual 2080 Ti route.\n\nAcceleration components used or integrated by the stable runtime include:\n\n- [vLLM](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fvllm): base inference engine and\n  serving stack.\n- [FlashInfer](https:\u002F\u002Fgithub.com\u002Fflashinfer-ai\u002Fflashinfer): attention,\n  sampling, and quantized kernel paths used by vLLM.\n- [QwenLM\u002FFlashQLA](https:\u002F\u002Fgithub.com\u002FQwenLM\u002FFlashQLA): upstream FlashQLA\n  Gated DeltaNet \u002F Qwen3.5 linear-attention implementation.\n- [weicj\u002FFlashQLA-SM70-SM75](https:\u002F\u002Fgithub.com\u002Fweicj\u002FFlashQLA-SM70-SM75):\n  SM70\u002FSM75 adaptation used by the stable Qwen3.6 prefill profile.\n- FlashAttention \u002F FA2, TurboQuant, Marlin, CUTLASS, Triton, and related vLLM\n  acceleration kernels: existing open-source acceleration work integrated and\n  profiled for this hardware target.\n",2,"2026-06-11 04:10:31","CREATED_QUERY"]