[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-81035":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":16,"stars7d":17,"stars30d":18,"stars90d":15,"forks30d":15,"starsTrendScore":19,"compositeScore":20,"rankGlobal":10,"rankLanguage":10,"license":21,"archived":22,"fork":22,"defaultBranch":23,"hasWiki":22,"hasPages":22,"topics":24,"createdAt":10,"pushedAt":10,"updatedAt":25,"readmeContent":26,"aiSummary":27,"trendingCount":15,"starSnapshotCount":15,"syncStatus":16,"lastSyncTime":28,"discoverSource":29},81035,"Gemma-4-31B-MTP-vLLM-Server","alicankiraz1\u002FGemma-4-31B-MTP-vLLM-Server","alicankiraz1","A production-minded FastAPI sidecar for serving Gemma 4 31B on vLLM with Gemma 4 Multi-Token Prediction (MTP) speculative decoding.","",null,"Python",42,4,29,0,2,12,13,6,53.4,"Apache License 2.0",false,"main",[],"2026-06-12 04:01:31","# Gemma 4 31B MTP vLLM Server\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002Fgemma4-mtp-benchmark-card.png\" alt=\"Gemma 4 31B MTP vLLM benchmark snapshot\" width=\"100%\">\n\u003C\u002Fp>\n\nA production-minded FastAPI sidecar for serving Gemma 4 31B on vLLM with\nGemma 4 Multi-Token Prediction (MTP) speculative decoding. It keeps the raw\n`vllm serve` process private and adds OpenAI-compatible and Anthropic-compatible\nHTTP APIs, API-key auth, CORS controls, rate limiting, bounded admission,\nhealth\u002Freadiness diagnostics, release hygiene checks, and Prometheus-style\ngateway metrics.\n\nThe current release is an alpha focused on local\u002Fprivate GPU serving. 
It has\nbeen validated on a 2x NVIDIA GeForce RTX 5090 host with vLLM `0.21.0`.\n\n## Performance Snapshot\n\n\u003Cdiv align=\"center\">\n\n![MTP speedup](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FMTP_speedup-2.12x_average-00c2ff?style=for-the-badge)\n![1000 token throughput](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F1000_token_run-132.56_tok%2Fs-6ee7b7?style=for-the-badge)\n![Hardware](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fhardware-2x_RTX_5090-8b5cf6?style=for-the-badge)\n![vLLM](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FvLLM-0.21.0-f97316?style=for-the-badge)\n\n\u003C\u002Fdiv>\n\n| Scenario | Baseline Gemma 4 31B | Gemma 4 31B + MTP | Improvement |\n| --- | ---: | ---: | ---: |\n| 250 completion tokens | 62.74 tok\u002Fs | 136.27 tok\u002Fs | 2.17x |\n| 500 completion tokens | 62.96 tok\u002Fs | 130.71 tok\u002Fs | 2.08x |\n| 1000 completion tokens | 62.70 tok\u002Fs | 132.56 tok\u002Fs | 2.11x |\n\n| Validation Target | Result |\n| --- | --- |\n| Real hardware smoke | Passed on 2x RTX 5090 |\n| Gateway health | `ready`, `version_ok: true` |\n| OpenAI + Anthropic routes | Chat, stream, messages, count_tokens passed |\n| Backend errors after smoke | `gemma4_mtp_backend_errors 0` |\n\n## Verified Results\n\n### Real Hardware Smoke\n\nValidated on `2026-05-17` with:\n\n- Hardware: `2x NVIDIA GeForce RTX 5090`\n- Backend: `vllm 0.21.0`, tensor parallel size `2`\n- Served model alias: `gemma-4-31b-mtp`\n- Gateway: `127.0.0.1:18080`, upstream vLLM on `127.0.0.1:8010`\n\nSmoke results:\n\n- `\u002Fhealth`: `ready`, `version_ok: true`, backend version `0.21.0`\n- `\u002Fv1\u002Fmodels`: returned OpenAI model objects with `display_name`\n- `\u002Fv1\u002Fchat\u002Fcompletions`: `200 OK`\n- `\u002Fv1\u002Fchat\u002Fcompletions` streaming: `200 OK` and `[DONE]`\n- `\u002Fv1\u002Fmessages`: `200 OK`\n- `\u002Fv1\u002Fmessages\u002Fcount_tokens`: `200 OK`\n- `\u002Fmetrics`: `gemma4_mtp_backend_errors 0`\n\n### MTP Throughput\n\nMeasured 
directly against the vLLM OpenAI endpoint with\n`min_tokens=max_tokens`, `ignore_eos=true`, one warmup request, and the same\nprompt for both runs.\n\n| Completion target | MTP tok\u002Fs | Baseline tok\u002Fs | Speedup |\n| --- | ---: | ---: | ---: |\n| 250 tokens | 136.27 | 62.74 | 2.17x |\n| 500 tokens | 130.71 | 62.96 | 2.08x |\n| 1000 tokens | 132.56 | 62.70 | 2.11x |\n\nThe MTP service was restored after the benchmark and left healthy.\n\n## Architecture\n\n```mermaid\nflowchart LR\n    clients[\"OpenAI \u002F Anthropic-compatible clients\"]\n    gateway[\"FastAPI sidecar gateway\"]\n    guardrails[\"Auth, limits, rate limit, CORS, bind policy\"]\n    selector[\"Protocol router\"]\n    openai_path[\"OpenAI passthrough\"]\n    anthropic_path[\"Anthropic adapter\"]\n    vllm_client[\"vLLM HTTP client\"]\n    vllm[\"vLLM OpenAI server (process)\"]\n    target[\"Gemma 4 31B target\"]\n    drafter[\"Gemma 4 MTP assistant drafter\"]\n    diagnostics[\"\u002Flivez, \u002Freadyz, \u002Fhealth, \u002Fversion, \u002Fmetrics\"]\n    bench[\"MTP benchmark harness\"]\n    doctor[\"Doctor self-check\"]\n\n    clients --> gateway\n    gateway --> guardrails\n    guardrails --> selector\n    selector --> openai_path\n    selector --> anthropic_path\n    openai_path --> vllm_client\n    anthropic_path --> vllm_client\n    vllm_client --> vllm\n    vllm --> target\n    vllm --> drafter\n    gateway --> diagnostics\n    bench --> vllm_client\n    doctor --> vllm_client\n```\n\n## Profiles\n\nThe default `safe80` profile targets a single 80 GB-class GPU:\n\n- Target: `google\u002Fgemma-4-31B-it`\n- Drafter: `google\u002Fgemma-4-31B-it-assistant`\n- `num_speculative_tokens`: `4`\n- `tensor_parallel_size`: `1`\n- `gpu_memory_utilization`: `0.90`\n- `max_model_len`: `32768`\n\nA `tp2` profile is available for two 40+ GB GPUs (`tensor_parallel_size: 2`).\nThe live validation above used the same served alias with tensor parallelism\nacross two RTX 5090 GPUs.\n\n## vLLM 
Requirement\n\nThe gateway requires `vllm >= 0.21.0,\u003C0.22.0` for Gemma 4 MTP. vLLM 0.21.0\nships official Gemma 4 MTP speculative decoding support via PR #41745.\nOlder vLLM releases can fail during initialization or treat the Gemma 4\nassistant checkpoint incorrectly.\n\nvLLM is an **optional extra** because it pulls heavy CUDA \u002F ROCm wheels.\nInstall it separately on the GPU host with:\n\n```bash\npip install \"gemma4-mtp-vllm[vllm]\"\n```\n\nThe gateway process itself does not import `vllm`; it only talks to a\nrunning `vllm serve` over HTTP.\n\n## Quick Start\n\n### Prerequisites\n\n- Python `3.10+`. Python `3.12` is recommended.\n- NVIDIA CUDA driver `12.x` (CUDA 12.9 wheels available) or AMD ROCm\n  `7.2.1+`. The gateway itself does not require a GPU, but `vllm serve`\n  does.\n- Enough VRAM for the chosen profile (`safe80` needs 80 GB, `tp2` needs\n  2× 40+ GB).\n\n### 1. Clone the repository\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Falicankiraz1\u002FGemma-4-31B-MTP-vLLM-Server.git\ncd Gemma-4-31B-MTP-vLLM-Server\n```\n\n### 2. Create and activate a virtual environment\n\n```bash\npython3.12 -m venv .venv\nsource .venv\u002Fbin\u002Factivate\npython -m pip install --upgrade pip\n```\n\n### 3. Install the gateway\n\nFor local development (gateway + tests, without vLLM):\n\n```bash\npython -m pip install -e \".[dev]\"\n```\n\nFor a GPU host that will also run `vllm serve`:\n\n```bash\npython -m pip install -e \".[dev,vllm]\"\n```\n\nThe `[vllm]` extra installs `vllm >= 0.21.0,\u003C0.22.0`. On NVIDIA hosts that\nneed the latest pre-release CUDA wheels:\n\n```bash\nuv pip install -U vllm --pre \\\n    --extra-index-url https:\u002F\u002Fwheels.vllm.ai\u002Fnightly\u002Fcu129 \\\n    --extra-index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu129 \\\n    --index-strategy unsafe-best-match\n```\n\n### 4. 
Start vLLM\n\n```bash\nvllm-mtp launch --profile safe80 --host 127.0.0.1 --port 8000\n```\n\nThis prints and executes the canonical `vllm serve` command for the chosen\nprofile, including `--speculative-config` with the Gemma 4 MTP drafter. Use\n`--print-only` first to inspect the exact command:\n\n```bash\nvllm-mtp launch --profile safe80 --print-only\n```\n\nFor raw vLLM exposure, keep `--host 127.0.0.1`. Passing a non-loopback host to\n`vllm-mtp launch` requires `--allow-public-vllm` because raw vLLM has no gateway\nauth, rate limiting, or CORS protection.\n\n### 5. Start the gateway\n\n```bash\nvllm-mtp serve \\\n    --profile safe80 \\\n    --host 127.0.0.1 \\\n    --port 8080 \\\n    --api-key local-dev-key \\\n    --vllm-base-url http:\u002F\u002F127.0.0.1:8000\n```\n\nThe gateway binds to `127.0.0.1` by default. Binding the gateway to `0.0.0.0`\nrequires an `--api-key`.\n\n## Doctor\n\nVerify the vLLM process is reachable, new enough for Gemma 4 MTP, and serving\nthe configured target model:\n\n```bash\nvllm-mtp doctor --profile safe80 --vllm-base-url http:\u002F\u002F127.0.0.1:8000\n```\n\nExpected output shape (single-line JSON):\n\n```json\n{\"ok\": true, \"profile\": \"safe80\", \"target_model\": \"google\u002Fgemma-4-31B-it\", \"drafter\": \"google\u002Fgemma-4-31B-it-assistant\", \"drafter_configured\": \"google\u002Fgemma-4-31B-it-assistant\", \"drafter_loaded\": \"unknown\", \"num_speculative_tokens\": 4, \"tensor_parallel_size\": 1, \"gateway_version\": \"0.1.0\", \"required_vllm_min_version\": \"0.21.0\", \"vllm\": {\"status\": \"ok\", \"version\": \"0.21.0\"}, \"version_ok\": true, \"target_served\": true}\n```\n\n`ok: false` indicates vLLM is unreachable, older than the required version, or\nthe target model is not listed in vLLM's `\u002Fv1\u002Fmodels`. 
Real vLLM reports the\nserved target model there; the drafter is reported as configured by this\ngateway and `drafter_loaded` remains `unknown`.\n\n## Benchmarks\n\nThe bench harness compares one vLLM process running with MTP enabled\nagainst a second vLLM process running without `--speculative-config`. The\nuser is responsible for launching both processes. Example:\n\nTerminal 1 (MTP-enabled vLLM on 8001):\n\n```bash\nvllm-mtp launch --profile safe80 --port 8001\n```\n\nTerminal 2 (baseline vLLM without MTP on 8002):\n\n```bash\nvllm-mtp launch --profile safe80 --port 8002 --no-mtp\n```\n\nTerminal 3 (paired bench):\n\n```bash\nvllm-mtp bench \\\n    --prompt \"Summarize the key trade-offs of running Gemma 4 locally.\" \\\n    --profile safe80 \\\n    --max-tokens 128 \\\n    --mtp-url http:\u002F\u002F127.0.0.1:8001 \\\n    --baseline-url http:\u002F\u002F127.0.0.1:8002 \\\n    --runs 3 \\\n    --warmup-runs 1 \\\n    --json-output bench-results\u002Fsafe80.json\n```\n\nFor a matrix sweep over multiple prompts and `num_speculative_tokens`\nvalues:\n\n```bash\nvllm-mtp bench-matrix \\\n    --profile safe80 \\\n    --mtp-url http:\u002F\u002F127.0.0.1:8001 \\\n    --baseline-url http:\u002F\u002F127.0.0.1:8002 \\\n    --prompt \"Short technical answer.\" \\\n    --prompt \"Long multi-step reasoning.\" \\\n    --num-speculative-tokens 2 \\\n    --num-speculative-tokens 4 \\\n    --runs 3 \\\n    --warmup-runs 1 \\\n    --json-output bench-results\u002Fsafe80-matrix.json\n```\n\n### Upstream caveat\n\nvLLM has reported very low draft acceptance rates (~0.2%) for Gemma 4 31B\nMTP in some setups. The bench harness measures this directly through\n`generation_tps` comparisons. 
If your `median_speedup` is close to `1.0`\neven with MTP enabled, you are likely hitting that upstream regression.\nSee https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fvllm\u002Fissues\u002F41789 for the active\ndiscussion.\n\n## API Examples\n\n### OpenAI-compatible chat\n\n```bash\ncurl -sS http:\u002F\u002F127.0.0.1:8080\u002Fv1\u002Fchat\u002Fcompletions \\\n    -H \"Authorization: Bearer local-dev-key\" \\\n    -H \"Content-Type: application\u002Fjson\" \\\n    -d '{\n        \"model\": \"gemma-4-31b-mtp\",\n        \"messages\": [\n            {\"role\": \"system\", \"content\": \"Kisa ve net cevap ver.\"},\n            {\"role\": \"user\", \"content\": \"Merhaba, calisiyor musun?\"}\n        ],\n        \"max_tokens\": 32,\n        \"temperature\": 0\n    }' | python3 -m json.tool\n```\n\n### Anthropic-compatible messages\n\n```bash\ncurl -sS http:\u002F\u002F127.0.0.1:8080\u002Fv1\u002Fmessages \\\n    -H \"Authorization: Bearer local-dev-key\" \\\n    -H \"Content-Type: application\u002Fjson\" \\\n    -d '{\n        \"model\": \"claude-gemma-4-31b-mtp\",\n        \"max_tokens\": 32,\n        \"system\": \"Kisa ve net cevap ver.\",\n        \"messages\": [\n            {\"role\": \"user\", \"content\": \"Merhaba, calisiyor musun?\"}\n        ]\n    }' | python3 -m json.tool\n```\n\n## Alpha Policy\n\nThe gateway is intentionally narrow in v0.1. 
The following request fields\nfail fast with `400 unsupported_feature` instead of being silently ignored\nor forwarded to vLLM:\n\n- OpenAI: `tools`, `tool_choice`, `function_call`, `functions`, `stop`,\n  and structured `response_format` while MTP is enabled.\n- Anthropic: `tools`, `tool_choice`, `thinking`, `mcp`, `files`,\n  `stop_sequences`.\n\nNo-op client defaults are accepted for compatibility:\n`tools: []`, `tool_choice: \"none\"`, `function_call: \"none\"`,\n`functions: []`, `stop: null`, `response_format: {\"type\": \"text\"}`,\nAnthropic `tools: []`, `tool_choice: {\"type\": \"none\"}`,\n`thinking: {\"type\": \"disabled\"}`, `stop_sequences: []`.\n\n`\u002Fv1\u002Fmessages\u002Fcount_tokens` returns a word-count estimate with the\n`X-Gemma4-MTP-Token-Counting: estimated_word_count` header. Tokenizer-exact\ncounting is planned but not in v0.1.\n\nStreaming SSE works through the gateway; vLLM streams natively. Anthropic\nstreaming buffers the upstream chunks before translation in v0.1.\n\n## Guardrails\n\n- `GET \u002Flivez` is public and returns `{\"status\":\"ok\"}`.\n- `\u002Fhealth`, `\u002Freadyz`, `\u002Fversion`, `\u002Fmetrics` are protected when\n  `--api-key` is configured.\n- Request bodies are capped by `--max-body-mb` (default `2`).\n- Output is capped by `--max-output-tokens` (default `4096`).\n- In-memory rate limiting defaults to `--rate-limit-rpm 30` per credential\n  (or per client host when no API key is configured).\n- Gateway slot admits `--max-queue-size + 1` concurrent requests before\n  rejecting; real concurrency is handled by vLLM's continuous batcher.\n- CORS is default-deny; add `--cors-origin` for explicit browser clients.\n- Non-loopback bind hosts require an `--api-key`.\n- Keep raw `vllm serve` bound to `127.0.0.1` unless you explicitly accept that\n  it has no gateway auth, rate limit, or CORS protection. 
Expose only the gateway\n  for normal use.\n\n## Release Hygiene\n\n### Source Archives\n\nRelease archives must come from `scripts\u002Fmake_source_archive.sh` or from CI. Do not publish or share a manually\nzipped working directory, including Finder or desktop zip files.\nRelease artifact scripts refuse a dirty worktree by default. Use `--allow-dirty`\nonly for local wheel smoke checks; never publish artifacts created from a dirty\nworkspace.\n\n```bash\nscripts\u002Fmake_source_archive.sh\n```\n\nUse an explicit output path when needed:\n\n```bash\nscripts\u002Fmake_source_archive.sh dist\u002FGemma-4-31B-MTP-vllm-src.zip\n```\n\nVerify that an archive does not contain local workspace, cache, build, or\nmacOS metadata entries:\n\n```bash\nscripts\u002Fverify_source_archive.sh Gemma-4-31B-MTP-vllm-src.zip\n```\n\nThe verifier rejects `.git`, `.venv`, `dist`, `__MACOSX`, `__pycache__`, and\nbuild\u002Fcache entries.\n\n### Wheel Freshness\n\nBefore publishing or sharing a wheel, rebuild it from the current checkout\nand smoke-test the installed artifact:\n\n```bash\nscripts\u002Fverify_wheel_freshness.sh\n```\n\nThe verifier removes stale wheels, builds a fresh one, installs it into a\ntemporary virtual environment, and exercises `\u002Flivez`, `\u002Fhealth` (with an\nAPI key), and basic endpoint shape using a fake vLLM transport.\n\n## Verification\n\n```bash\npython -m pytest -q\npython -m pip check\npython -m compileall -q src\npython -m build --wheel\n```\n\n### Local Verification (2026-05-17)\n\n- `python -m pytest -q` → `159 passed`\n- `python -m pip check` → `No broken requirements found.`\n- `python -m compileall -q src` → no errors\n- `python -m build --wheel` → built `gemma4_mtp_vllm-0.1.0-py3-none-any.whl`\n- `scripts\u002Fverify_wheel_freshness.sh` → `wheel smoke ok`\n- `scripts\u002Fmake_source_archive.sh` + `scripts\u002Fverify_source_archive.sh` → archive clean\n\n159 tests cover profiles, server limits, bind policy, errors, runtime state,\nmiddleware, policy 
validation, request validation, vLLM HTTP client, Anthropic\nadapter, server app foundation, health, metrics, OpenAI endpoints, Anthropic\nendpoints, doctor, benchmarking, launch helper, CLI, bench CLI, versioning, and\nrelease scripts.\n\n## Operational Notes\n\n- The external model aliases are `gemma-4-31b-mtp`, `claude-gemma-4-31b-mtp`,\n  and `default`.\n- Gateway requests are routed upstream using the configured served vLLM model\n  name, so vLLM can be launched with `--served-model-name gemma-4-31b-mtp`.\n- `\u002Fhealth` and `\u002Freadyz` are version-aware and report degraded status when the\n  upstream vLLM version is older than `0.21.0`.\n- The gateway intentionally rejects unsupported tool, multimodal, and advanced\n  stop\u002Fformat fields instead of silently dropping them.\n\n## Author\n\n**Alican Kiraz**\n\n[![LinkedIn](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLinkedIn-0077B5?style=flat&logo=linkedin&logoColor=white)](https:\u002F\u002Flinkedin.com\u002Fin\u002Falican-kiraz)\n[![X](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FX-000000?style=flat&logo=x&logoColor=white)](https:\u002F\u002Fx.com\u002FAlicanKiraz0)\n[![Medium](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FMedium-12100E?style=flat&logo=medium&logoColor=white)](https:\u002F\u002Falican-kiraz1.medium.com)\n[![HuggingFace](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FHuggingFace-FFD21E?style=flat&logo=huggingface&logoColor=black)](https:\u002F\u002Fhuggingface.co\u002FAlicanKiraz0)\n[![GitHub](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FGitHub-181717?style=flat&logo=github&logoColor=white)](https:\u002F\u002Fgithub.com\u002Falicankiraz1)\n","This project is a production-oriented FastAPI sidecar for serving the Gemma 4 31B model through vLLM with Gemma 4 Multi-Token Prediction (MTP) speculative decoding. Its core features include OpenAI- and Anthropic-compatible HTTP APIs, API-key authentication, CORS controls, rate limiting, bounded admission control, health\u002Freadiness diagnostics, and Prometheus-style gateway metrics. The service is also significantly faster than the baseline Gemma 4 31B model, with an average speedup of about 2.12x across scenarios. It is particularly suited to deploying efficient text generation workloads, such as chatbots and content creation, in local or private GPU environments.","2026-06-11 04:03:17","CREATED_QUERY"]