[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-11202":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":9,"totalLinesOfCode":9,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":9,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":9,"rankLanguage":9,"license":9,"archived":22,"fork":22,"defaultBranch":23,"hasWiki":22,"hasPages":22,"topics":24,"createdAt":9,"pushedAt":9,"updatedAt":38,"readmeContent":39,"aiSummary":40,"trendingCount":16,"starSnapshotCount":16,"syncStatus":41,"lastSyncTime":42,"discoverSource":43},11202,"mlx-vlm","Blaizzy\u002Fmlx-vlm","Blaizzy","MLX-VLM is a package for inference and fine-tuning of Vision Language Models (VLMs) on your Mac using MLX.",null,"https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm","Python",4999,577,33,116,0,31,200,304,93,30.29,false,"main",[25,26,27,28,29,30,31,32,33,34,35,36,37],"llava","llm","mlx","vision-transformer","apple-silicon","idefics","local-ai","paligemma","vision-framework","vision-language-model","florence2","molmo","pixtral","2026-06-12 02:02:30","[![Upload Python Package](https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Factions\u002Fworkflows\u002Fpython-publish.yml\u002Fbadge.svg)](https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Factions\u002Fworkflows\u002Fpython-publish.yml)\n# MLX-VLM\n\nMLX-VLM is a package for inference and fine-tuning of Vision Language Models (VLMs) and Omni Models (VLMs with audio and video support) on your Mac using MLX.\n\n## Table of Contents\n- [Installation](#installation)\n- [Usage](#usage)\n  - [Command Line Interface (CLI)](#command-line-interface-cli)\n    - [Thinking Budget](#thinking-budget)\n  - [Speculative Decoding](#speculative-decoding)\n    - [DFlash (Qwen3.5)](#dflash-qwen35)\n    - [Gemma 4 MTP](#gemma-4-mtp)\n  - [Chat UI with 
Gradio](#chat-ui-with-gradio)\n  - [Python Script](#python-script)\n  - [Server (FastAPI)](#server-fastapi)\n    - [Continuous Batching](#continuous-batching)\n    - [Automatic Prefix Caching (APC)](#automatic-prefix-caching-apc)\n    - [KV Cache Quantization](#kv-cache-quantization)\n- [Activation Quantization (CUDA)](#activation-quantization-cuda)\n- [Multi-Image Chat Support](#multi-image-chat-support)\n  - [Supported Models](#supported-models)\n  - [Usage Examples](#usage-examples)\n- [Model-Specific Documentation](#model-specific-documentation)\n- [Vision Feature Caching](#vision-feature-caching)\n- [TurboQuant KV Cache](#turboquant-kv-cache)\n- [Distributed Inference](#distributed-inference)\n- [Fine-tuning](#fine-tuning)\n\n## Model-Specific Documentation\n\nSome models have detailed documentation with prompt formats, examples, and best practices:\n\n| Model | Documentation |\n|-------|---------------|\n| DeepSeek-OCR | [Docs](https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fblob\u002Fmain\u002Fmlx_vlm\u002Fmodels\u002Fdeepseekocr\u002FREADME.md) |\n| DeepSeek-OCR-2 | [Docs](https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fblob\u002Fmain\u002Fmlx_vlm\u002Fmodels\u002Fdeepseekocr_2\u002FREADME.md) |\n| DOTS-OCR | [Docs](https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fblob\u002Fmain\u002Fmlx_vlm\u002Fmodels\u002Fdots_ocr\u002FREADME.md) |\n| DOTS-MOCR | [Docs](https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fblob\u002Fmain\u002Fmlx_vlm\u002Fmodels\u002Fdots_ocr\u002FREADME.md) |\n| GLM-OCR | [Docs](https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fblob\u002Fmain\u002Fmlx_vlm\u002Fmodels\u002Fglm_ocr\u002FREADME.md) |\n| Phi-4 Reasoning Vision | [Docs](https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fblob\u002Fmain\u002Fmlx_vlm\u002Fmodels\u002Fphi4_siglip\u002FREADME.md) |\n| MiniCPM-o | 
[Docs](https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fblob\u002Fmain\u002Fmlx_vlm\u002Fmodels\u002Fminicpmo\u002FREADME.md) |\n| Phi-4 Multimodal | [Docs](https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fblob\u002Fmain\u002Fmlx_vlm\u002Fmodels\u002Fphi4mm\u002FREADME.md) |\n| MolmoPoint | [Docs](https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fblob\u002Fmain\u002Fmlx_vlm\u002Fmodels\u002Fmolmo_point\u002FREADME.md) |\n| Moondream3 | [Docs](https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fblob\u002Fmain\u002Fmlx_vlm\u002Fmodels\u002Fmoondream3\u002FREADME.md) |\n| Gemma 4 | [Docs](https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fblob\u002Fmain\u002Fmlx_vlm\u002Fmodels\u002Fgemma4\u002FREADME.md) |\n| Falcon-OCR | [Docs](https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fblob\u002Fmain\u002Fmlx_vlm\u002Fmodels\u002Ffalcon_ocr\u002FREADME.md) |\n| Granite Vision 3.2 | [Docs](https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fblob\u002Fmain\u002Fmlx_vlm\u002Fmodels\u002Fgranite_vision\u002FREADME.md) |\n| Granite 4.0 Vision | [Docs](https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fblob\u002Fmain\u002Fmlx_vlm\u002Fmodels\u002Fgranite4_vision\u002FREADME.md) |\n\n## Installation\n\nThe easiest way to get started is to install the `mlx-vlm` package using pip:\n\n```sh\npip install -U mlx-vlm\n```\n\n## Usage\n\n### Command Line Interface (CLI)\n\nGenerate output from a model using the CLI:\n\n```sh\n# Text generation\nmlx_vlm.generate --model mlx-community\u002FQwen2-VL-2B-Instruct-4bit --max-tokens 100 --prompt \"Hello, how are you?\"\n\n# Image generation\nmlx_vlm.generate --model mlx-community\u002FQwen2-VL-2B-Instruct-4bit --max-tokens 100 --temperature 0.0 --image http:\u002F\u002Fimages.cocodataset.org\u002Fval2017\u002F000000039769.jpg\n\n# Audio generation (New)\nmlx_vlm.generate --model mlx-community\u002Fgemma-3n-E2B-it-4bit --max-tokens 100 --prompt \"Describe what you hear\" 
--audio \u002Fpath\u002Fto\u002Faudio.wav\n\n# Multi-modal generation (Image + Audio)\nmlx_vlm.generate --model mlx-community\u002Fgemma-3n-E2B-it-4bit --max-tokens 100 --prompt \"Describe what you see and hear\" --image \u002Fpath\u002Fto\u002Fimage.jpg --audio \u002Fpath\u002Fto\u002Faudio.wav\n```\n\n#### Thinking Budget\n\nFor thinking models (e.g., Qwen3.5), you can limit the number of tokens spent in the thinking block:\n\n```sh\nmlx_vlm.generate --model mlx-community\u002FQwen3.5-2B-4bit \\\n  --thinking-budget 50 \\\n  --thinking-start-token \"\u003Cthink>\" \\\n  --thinking-end-token \"\u003C\u002Fthink>\" \\\n  --enable-thinking \\\n  --prompt \"Solve 2+2\"\n```\n\n| Flag | Description |\n|------|-------------|\n| `--enable-thinking` | Activate thinking mode in the chat template |\n| `--thinking-budget` | Max tokens allowed inside the thinking block |\n| `--thinking-start-token` | Token that opens a thinking block (default: `\u003Cthink>`) |\n| `--thinking-end-token` | Token that closes a thinking block (default: `\u003C\u002Fthink>`) |\n\nWhen the budget is exceeded, the model is forced to emit `\\n\u003C\u002Fthink>` and transition to the answer. If `--enable-thinking` is passed but the model's chat template does not support it, the budget is applied only if the model generates the start token on its own.\n\nOn the server, thinking mode is disabled by default. Start the server with `--enable-thinking` to make thinking mode the default for requests that do not specify it:\n\n```sh\nmlx_vlm.server --model Qwen\u002FQwen3.5-4B --enable-thinking\n```\n\nRequests can override the server default with `enable_thinking: true` or `enable_thinking: false`.\n\n### Speculative Decoding\n\nSpeed up generation by drafting several candidate tokens with a small \"drafter\" model and verifying them in a single target forward pass. 
Two drafter families are supported.\n\n| Flag | Description |\n|------|-------------|\n| `--draft-model` | Hugging Face repo or local path for the drafter |\n| `--draft-kind` | Drafter family: `dflash` (default) or `mtp` (Gemma 4) |\n| `--draft-block-size` | Override the drafter's configured block size |\n\nSee [docs\u002Fusage.md](docs\u002Fusage.md) for Python API examples including batch generation.\n\n#### DFlash (Qwen3.5)\n\nA lightweight block-diffusion drafter that predicts multiple tokens per round, typically making generation 2–3× faster.\n\n```sh\n# Text generation with speculative decoding\nmlx_vlm.generate --model Qwen\u002FQwen3.5-4B \\\n  --draft-model z-lab\u002FQwen3.5-4B-DFlash \\\n  --prompt \"Write a quicksort in Python.\" \\\n  --max-tokens 512 --temperature 0 --enable-thinking\n\n# Also works with images\nmlx_vlm.generate --model Qwen\u002FQwen3.5-4B \\\n  --draft-model z-lab\u002FQwen3.5-4B-DFlash \\\n  --image examples\u002Fimages\u002Fcats.jpg \\\n  --prompt \"Describe this image.\" \\\n  --max-tokens 256 --temperature 0 --enable-thinking\n\n# Server with speculative decoding\nmlx_vlm.server --model Qwen\u002FQwen3.5-4B \\\n  --draft-model z-lab\u002FQwen3.5-4B-DFlash\n```\n\n#### Gemma 4 MTP\n\n[Multi-Token Prediction](https:\u002F\u002Fai.google.dev\u002Fgemma\u002Fdocs\u002Fmtp\u002Fmtp): Google's 4-layer \"assistant\" drafter that shares K\u002FV with the target and drafts multiple tokens autoregressively from a constant position. 
Pass `--draft-kind mtp` to dispatch the MTP round-loop.\n\n```sh\nmlx_vlm.generate --model mlx-community\u002Fgemma-4-31B-it-bf16 \\\n  --draft-model mlx-community\u002Fgemma-4-31B-it-assistant-bf16 \\\n  --draft-kind mtp --draft-block-size 4 \\\n  --prompt \"Explain speculative decoding in 3 sentences.\" \\\n  --max-tokens 256 --temperature 0\n\n# Server\nmlx_vlm.server --model mlx-community\u002Fgemma-4-31B-it-bf16 \\\n  --draft-model mlx-community\u002Fgemma-4-31B-it-assistant-bf16 \\\n  --draft-kind mtp --draft-block-size 4\n```\n\nSupported pairings (target ↔ drafter):\n\n| Target                          | Drafter                                  |\n|---------------------------------|------------------------------------------|\n| `mlx-community\u002Fgemma-4-E2B-it-bf16`         | `mlx-community\u002Fgemma-4-E2B-it-assistant-bf16`        |\n| `mlx-community\u002Fgemma-4-E4B-it-bf16`         | `mlx-community\u002Fgemma-4-E4B-it-assistant-bf16`        |\n| `mlx-community\u002Fgemma-4-26B-A4B-it-bf16`     | `mlx-community\u002Fgemma-4-26B-A4B-it-assistant-bf16`    |\n| `mlx-community\u002Fgemma-4-31B-it-bf16`         | `mlx-community\u002Fgemma-4-31B-it-assistant-bf16`        |\n\nMeasured speedups (greedy, byte-identical output): up to **3.94×** on 26B-A4B and **2.29×** on 31B at B=4. 
See [`mlx_vlm\u002Fspeculative\u002Fdrafters\u002Fgemma4_assistant\u002FREADME.md`](mlx_vlm\u002Fspeculative\u002Fdrafters\u002Fgemma4_assistant\u002FREADME.md) for full sweeps and architecture notes.\n\n### Chat UI with Gradio\n\nLaunch a chat interface using Gradio:\n\n```sh\nmlx_vlm.chat_ui --model mlx-community\u002FQwen2-VL-2B-Instruct-4bit\n```\n\n### Python Script\n\nHere's an example of how to use MLX-VLM in a Python script:\n\n```python\nimport mlx.core as mx\nfrom mlx_vlm import load, generate\nfrom mlx_vlm.prompt_utils import apply_chat_template\nfrom mlx_vlm.utils import load_config\n\n# Load the model\nmodel_path = \"mlx-community\u002FQwen2-VL-2B-Instruct-4bit\"\nmodel, processor = load(model_path)\nconfig = load_config(model_path)\n\n# Prepare input\nimage = [\"http:\u002F\u002Fimages.cocodataset.org\u002Fval2017\u002F000000039769.jpg\"]\n# image = [Image.open(\"...\")] can also be used with PIL.Image.Image objects\nprompt = \"Describe this image.\"\n\n# Apply chat template\nformatted_prompt = apply_chat_template(\n    processor, config, prompt, num_images=len(image)\n)\n\n# Generate output\noutput = generate(model, processor, formatted_prompt, image, verbose=False)\nprint(output)\n```\n\n#### Audio Example\n\n```python\nfrom mlx_vlm import load, generate\nfrom mlx_vlm.prompt_utils import apply_chat_template\nfrom mlx_vlm.utils import load_config\n\n# Load model with audio support\nmodel_path = \"mlx-community\u002Fgemma-3n-E2B-it-4bit\"\nmodel, processor = load(model_path)\nconfig = model.config\n\n# Prepare audio input\naudio = [\"\u002Fpath\u002Fto\u002Faudio1.wav\", \"\u002Fpath\u002Fto\u002Faudio2.mp3\"]\nprompt = \"Describe what you hear in these audio files.\"\n\n# Apply chat template with audio\nformatted_prompt = apply_chat_template(\n    processor, config, prompt, num_audios=len(audio)\n)\n\n# Generate output with audio\noutput = generate(model, processor, formatted_prompt, audio=audio, verbose=False)\nprint(output)\n```\n\n#### Multi-Modal 
Example (Image + Audio)\n\n```python\nfrom mlx_vlm import load, generate\nfrom mlx_vlm.prompt_utils import apply_chat_template\n\n# Load multi-modal model\nmodel_path = \"mlx-community\u002Fgemma-3n-E2B-it-4bit\"\nmodel, processor = load(model_path)\nconfig = model.config\n\n# Prepare inputs\nimage = [\"\u002Fpath\u002Fto\u002Fimage.jpg\"]\naudio = [\"\u002Fpath\u002Fto\u002Faudio.wav\"]\nprompt = \"Describe what you see and hear.\"\n\n# Apply chat template\nformatted_prompt = apply_chat_template(\n    processor, config, prompt,\n    num_images=len(image),\n    num_audios=len(audio)\n)\n\n# Generate output\noutput = generate(model, processor, formatted_prompt, image, audio=audio, verbose=False)\nprint(output)\n```\n\n### Server (FastAPI)\n\nStart the server:\n```sh\nmlx_vlm.server --port 8080\n\n# Preload a model at startup (Hugging Face repo or local path)\nmlx_vlm.server --model \u003Chf_repo_or_local_path>\n\n# Preload a model with adapter\nmlx_vlm.server --model \u003Chf_repo_or_local_path> --adapter-path \u003Cadapter_path>\n\n# With trust remote code enabled (required for some models)\nmlx_vlm.server --trust-remote-code\n\n# Enable thinking mode by default for requests that do not override it\nmlx_vlm.server --model Qwen\u002FQwen3.5-4B --enable-thinking\n```\n\n#### Server Options\n\n- `--model`: Preload a model at server startup; accepts a Hugging Face repo ID or local path (optional; the model loads lazily on first request if omitted)\n- `--adapter-path`: Path for adapter weights to use with the preloaded model\n- `--draft-model`: Speculative drafter path or Hugging Face ID (e.g. 
`z-lab\u002FQwen3.5-4B-DFlash`, `google\u002Fgemma-4-31B-it-assistant`) — enables speculative decoding for ~2× or higher throughput\n- `--draft-kind`: Drafter family — `dflash` (default) or `mtp` (Gemma 4)\n- `--draft-block-size`: Override the drafter's configured block size\n- `--host`: Host address (default: `0.0.0.0`)\n- `--port`: Port number (default: `8080`)\n- `--trust-remote-code`: Trust remote code when loading models from Hugging Face Hub\n- `--enable-thinking`: Enable thinking mode by default for requests that do not set `enable_thinking`\n- `--kv-bits`: Number of bits for KV cache quantization (e.g. `8` for uniform, `3.5` for TurboQuant)\n- `--kv-quant-scheme`: KV cache quantization backend (`uniform` or `turboquant`)\n- `--kv-group-size`: Group size for uniform KV cache quantization (default: `64`)\n- `--max-kv-size`: Maximum KV cache size in tokens\n- `--vision-cache-size`: Max number of cached vision features (default: `20`)\n- `--log-level`: Logging level — `DEBUG`, `INFO`, `WARNING`, `ERROR`, `CRITICAL` (default: `INFO`)\n\nYou can also set trust remote code via environment variable:\n```sh\nMLX_TRUST_REMOTE_CODE=true mlx_vlm.server\n```\n\nThe server provides multiple endpoints for different use cases and supports dynamic model loading\u002Funloading with caching (one model at a time).\n\n### Continuous Batching\n\nThe server supports continuous batching for higher throughput when handling multiple concurrent requests. New requests join the active batch immediately without waiting for existing requests to finish, and mixed batches of image and text-only requests are supported.\n\nContinuous batching is enabled automatically when the server loads a model. 
You can pre-load a model at startup so it's ready to serve immediately:\n\n```sh\nmlx_vlm.server --port 8080 --model mlx-community\u002FQwen2.5-VL-3B-Instruct-4bit\n```\n\nVerify via the health endpoint:\n\n```sh\ncurl http:\u002F\u002Flocalhost:8080\u002Fhealth\n# {\"status\":\"healthy\",\"loaded_model\":\"...\",\"apc_enabled\":false}\n```\n\nIf `--model` is omitted, the model is loaded on the first request.\n\n### Automatic Prefix Caching (APC)\n\nAutomatic Prefix Caching reuses block-level K\u002FV cache state across requests that share the same prefix. It is useful for repeated long documents, long chat histories, or retrieval contexts where each request appends a short new suffix.\n\nAPC has two tiers:\n\n- **Warm memory**: keeps reusable `APCBlock` tensors in process memory. This is the fastest path, but it keeps both the reusable block pool and the runtime `KVCache`.\n- **Warm disk**: persists cached prefixes as safetensors shards so they survive process restarts. Warm-disk reads build the layer-major prompt cache directly without promoting restored blocks into the `APCBlock` pool; writes can still populate both memory and disk tiers.\n\n#### Python Script\n\nUse `APCManager` directly when calling `stream_generate`:\n\n```python\nfrom pathlib import Path\n\nfrom mlx_vlm import load, stream_generate\nfrom mlx_vlm.apc import APCManager, DiskBlockStore\nfrom mlx_vlm.prompt_utils import apply_chat_template\n\nmodel_id = \"Qwen\u002FQwen3-VL-4B-Instruct\"\nmodel, processor = load(model_id)\n\ndisk = DiskBlockStore(\n    Path(\"~\u002F.cache\u002Fmlx-vlm\u002Fcaching\").expanduser(),\n    namespace=model_id,\n    max_bytes=3 * (1 \u003C\u003C 30),  # 3 GB disk cap; use None for uncapped\n)\napc = APCManager(num_blocks=4096, block_size=16, disk=disk)\n\ndocument = Path(\"long_document.txt\").read_text()\n\ntry:\n    # First request computes the full prefix and stores reusable K\u002FV blocks.\n    prompt1 = apply_chat_template(\n        processor,\n        
model.config,\n        prompt=f\"{document}\\n\\nSummarize the key decisions.\",\n        num_images=0,\n    )\n    for _ in stream_generate(\n        model, processor, prompt1, max_tokens=128, temperature=0.0, apc_manager=apc\n    ):\n        pass\n\n    # Second request shares the same document prefix and only prefills the suffix.\n    prompt2 = apply_chat_template(\n        processor,\n        model.config,\n        prompt=f\"{document}\\n\\nList the open engineering risks.\",\n        num_images=0,\n    )\n    for chunk in stream_generate(\n        model, processor, prompt2, max_tokens=128, temperature=0.0, apc_manager=apc\n    ):\n        print(chunk.text, end=\"\", flush=True)\n\n    print(apc.stats_snapshot())\nfinally:\n    apc.close()\n```\n\nTo compare cold, warm-memory, warm-disk, and disk-eviction behavior with a\nmodel, use the same direct API path:\n\n```python\nimport os\nimport tempfile\nimport time\nfrom pathlib import Path\n\nfrom mlx_vlm import load, stream_generate\nfrom mlx_vlm.apc import APCManager, DiskBlockStore\nfrom mlx_vlm.prompt_utils import apply_chat_template\n\nmodel_id = \"Qwen\u002FQwen3-VL-4B-Instruct\"\ncontexts = [8000, 20000, 50000, 100000]\ndisk_cap_gb = 0  # 0 means uncapped\nshard_max_blocks = 256\ncontext_sweep_max_tokens = 1  # one token is enough to measure prefill reuse\n\ntest_prompt_tokens = 8000\nfill_prompts = 80\neviction_disk_cap_gb = 3.0\n\nos.environ[\"APC_DISK_SHARD_MAX_BLOCKS\"] = str(shard_max_blocks)\n\nmodel, processor = load(model_id)\ntokenizer = processor.tokenizer if hasattr(processor, \"tokenizer\") else processor\n\n\ndef disk_cap_bytes(gb: float):\n    return None if gb \u003C= 0 else int(gb * (1 \u003C\u003C 30))\n\n\ndef make_context(target_tokens: int, seed: int = 0) -> str:\n    line = (\n        f\"Document {seed}: APC benchmark content with deterministic facts, \"\n        \"dates, identifiers, and repeated technical notes.\\n\"\n    )\n    line_tokens = max(1, len(tokenizer.encode(line, 
add_special_tokens=False)))\n    text = line * max(1, target_tokens \u002F\u002F line_tokens)\n    while len(tokenizer.encode(text, add_special_tokens=False)) \u003C target_tokens:\n        text += line\n    return text\n\n\ndef make_prompt(context: str, question: str) -> str:\n    return apply_chat_template(\n        processor,\n        model.config,\n        prompt=f\"{context}\\n\\n{question}\",\n        num_images=0,\n    )\n\n\ndef run_once(apc: APCManager, context: str, question: str, max_tokens: int = 32):\n    prompt = make_prompt(context, question)\n    apc.reset_stats()\n\n    last = None\n    output = []\n    start = time.perf_counter()\n    for chunk in stream_generate(\n        model,\n        processor,\n        prompt,\n        max_tokens=max_tokens,\n        temperature=0.0,\n        apc_manager=apc,\n    ):\n        output.append(chunk.text)\n        last = chunk\n\n    if last is None:\n        raise RuntimeError(\"generation returned no chunks\")\n\n    return {\n        \"wall_s\": time.perf_counter() - start,\n        \"prompt_tokens\": last.prompt_tokens,\n        \"prompt_tps\": last.prompt_tps,\n        \"generation_tps\": last.generation_tps,\n        \"apc\": apc.stats_snapshot(),\n        \"text\": \"\".join(output).strip(),\n    }\n\n\ndef print_result(label: str, result: dict) -> None:\n    stats = result[\"apc\"]\n    print(\n        f\"{label:\u003C12} \"\n        f\"prompt_tokens={result['prompt_tokens']:>7} \"\n        f\"prompt_tps={result['prompt_tps']:>8.1f} \"\n        f\"gen_tps={result['generation_tps']:>7.1f} \"\n        f\"matched={stats.get('matched_tokens', 0):>7} \"\n        f\"disk_hits={stats.get('disk_hits', 0):>5} \"\n        f\"disk_evictions={stats.get('disk_evictions', 0):>5}\"\n    )\n\n\ndef open_apc(cache_root: Path, namespace: str, disk_gb: float) -> APCManager:\n    disk = DiskBlockStore(\n        cache_root,\n        namespace=namespace,\n        max_bytes=disk_cap_bytes(disk_gb),\n    )\n    return 
APCManager(num_blocks=4096, block_size=16, disk=disk)\n\n\ndef run_context_sweep() -> None:\n    print(\"cold \u002F warm-memory \u002F warm-disk\")\n    with tempfile.TemporaryDirectory() as tmp:\n        cache_root = Path(tmp)\n        for target_tokens in contexts:\n            context = make_context(target_tokens)\n            namespace = f\"{model_id}-context-{target_tokens}\"\n            apc = open_apc(cache_root, namespace, disk_cap_gb)\n            try:\n                print(f\"\\ncontext ~= {target_tokens} text tokens\")\n                print_result(\n                    \"cold\",\n                    run_once(\n                        apc,\n                        context,\n                        \"Summarize the key decisions.\",\n                        max_tokens=context_sweep_max_tokens,\n                    ),\n                )\n                print_result(\n                    \"warm-memory\",\n                    run_once(\n                        apc,\n                        context,\n                        \"List the open engineering risks.\",\n                        max_tokens=context_sweep_max_tokens,\n                    ),\n                )\n            finally:\n                # Closing waits for queued disk writes before reopening the disk tier.\n                apc.close()\n\n            apc = open_apc(cache_root, namespace, disk_cap_gb)\n            try:\n                print_result(\n                    \"warm-disk\",\n                    run_once(\n                        apc,\n                        context,\n                        \"Extract the implementation timeline.\",\n                        max_tokens=context_sweep_max_tokens,\n                    ),\n                )\n            finally:\n                apc.close()\n\n\ndef run_disk_eviction_workload() -> None:\n    print(\"\\ndisk eviction workload\")\n    with tempfile.TemporaryDirectory() as tmp:\n        cache_root = Path(tmp)\n        namespace = 
f\"{model_id}-eviction\"\n        test_context = make_context(test_prompt_tokens, seed=0)\n\n        apc = open_apc(cache_root, namespace, eviction_disk_cap_gb)\n        try:\n            print_result(\n                \"seed\",\n                run_once(apc, test_context, \"Summarize the retained test prefix.\"),\n            )\n        finally:\n            apc.close()\n\n        apc = open_apc(cache_root, namespace, eviction_disk_cap_gb)\n        try:\n            for i in range(fill_prompts):\n                fill_context = make_context(test_prompt_tokens, seed=i + 1)\n                run_once(\n                    apc,\n                    fill_context,\n                    f\"Summarize filler document {i + 1}.\",\n                    max_tokens=1,\n                )\n                if (i + 1) % 10 == 0:\n                    stats = apc.stats_snapshot()\n                    print(\n                        f\"filled={i + 1:>3} \"\n                        f\"disk_gb={stats.get('disk_bytes', 0) \u002F (1 \u003C\u003C 30):.2f} \"\n                        f\"disk_evictions={stats.get('disk_evictions', 0)}\"\n                    )\n        finally:\n            apc.close()\n\n        apc = open_apc(cache_root, namespace, eviction_disk_cap_gb)\n        try:\n            print_result(\n                \"post-fill\",\n                run_once(\n                    apc,\n                    test_context,\n                    \"Check whether the retained test prefix still restores.\",\n                ),\n            )\n        finally:\n            apc.close()\n\n\nrun_context_sweep()\nrun_disk_eviction_workload()\n```\n\n#### Server\n\nEnable in-memory APC for the server with environment variables:\n\n```sh\nAPC_ENABLED=1 \\\nAPC_NUM_BLOCKS=4096 \\\nmlx_vlm.server --model Qwen\u002FQwen3-VL-4B-Instruct --port 8080\n```\n\nEnable the persistent disk tier:\n\n```sh\nAPC_ENABLED=1 \\\nAPC_NUM_BLOCKS=4096 \\\nAPC_DISK_PATH=~\u002F.cache\u002Fmlx-vlm\u002Fcaching 
\\\nAPC_DISK_MAX_GB=3 \\\nAPC_DISK_SHARD_MAX_BLOCKS=256 \\\nmlx_vlm.server --model Qwen\u002FQwen3-VL-4B-Instruct --port 8080\n```\n\nRepeated requests with the same long prefix will hit APC automatically:\n\n```sh\ncurl -X POST \"http:\u002F\u002Flocalhost:8080\u002Fv1\u002Fchat\u002Fcompletions\" \\\n  -H \"Content-Type: application\u002Fjson\" \\\n  -H \"X-APC-Tenant: demo\" \\\n  -d '{\n    \"model\": \"Qwen\u002FQwen3-VL-4B-Instruct\",\n    \"messages\": [{\n      \"role\": \"user\",\n      \"content\": \"Paste a long shared document here.\\n\\nNow answer question A.\"\n    }],\n    \"max_tokens\": 128\n  }'\n```\n\nUse the same `X-APC-Tenant` value for requests that may share cached prefixes. Use different tenant values to isolate cache entries between users or workspaces.\n\nInspect and reset APC state:\n\n```sh\ncurl http:\u002F\u002Flocalhost:8080\u002Fv1\u002Fcache\u002Fstats\ncurl -X POST http:\u002F\u002Flocalhost:8080\u002Fv1\u002Fcache\u002Freset\n```\n\nCommon APC environment variables:\n\n| Variable | Default | Description |\n|----------|---------|-------------|\n| `APC_ENABLED` | `0` | Set to `1` to enable APC |\n| `APC_NUM_BLOCKS` | `2048` | Number of in-memory APC blocks |\n| `APC_BLOCK_SIZE` | `16` | Tokens per APC block |\n| `APC_DISK_PATH` | unset | Directory for persistent disk shards |\n| `APC_DISK_MAX_GB` | `0` | Disk cap in GB; `0` means uncapped |\n| `APC_DISK_SHARD_MAX_BLOCKS` | `256` | Max blocks per disk segment shard |\n| `APC_MAX_POOL_TENSORS` | `450000` | Stops adding memory blocks before the Metal resource limit; disk writes continue |\n| `APC_LAYER_MAJOR_MEMORY_MIN_TOKENS` | `50000` | Store long warm-memory prefixes as compact layer-major snapshots instead of per-block tensors |\n| `APC_HASH` | `fast` | Set to `sha256` for a stable cryptographic hash |\n\nAPC is disabled automatically for models that use a custom cache layout. 
On the server, APC is also skipped when KV-cache quantization is enabled.\n\n#### KV Cache Quantization\n\nReduce KV cache memory during continuous batching with `--kv-bits`. Both uniform quantization and TurboQuant are supported:\n\n```sh\n# Uniform 8-bit KV cache quantization\nmlx_vlm.server --model google\u002Fgemma-4-26b-a4b-it --kv-bits 8\n\n# TurboQuant 3.5-bit (3-bit keys + 4-bit values)\nmlx_vlm.server --model google\u002Fgemma-4-26b-a4b-it --kv-bits 3.5 --kv-quant-scheme turboquant\n```\n\nFull-attention layers use quantized batch caches while sliding-window layers keep their fixed-size rotating caches. The last full-attention layer stays unquantized (sensitive in deep models).\n\nTested with gemma-4-26b-a4b-it at 20K context:\n\n| Config | Gen tok\u002Fs | KV Cache | KV Reduction |\n|--------|-----------|----------|--------------|\n| No quant | 50.3 | 0.624 GB | 1x |\n| Uniform 8-bit | 52.6 | 0.469 GB | **1.33x** |\n| TurboQuant 3.5-bit | 25.6 | 0.365 GB | **1.71x** |\n\n> Models with all full-attention layers (e.g. Qwen, LLaMA) see larger reductions — up to 3.6x at 8-bit and 6.4x at 4-bit.\n\n#### Log Probabilities\n\nThe `\u002Fchat\u002Fcompletions` endpoint supports OpenAI-compatible per-token log probabilities. Pass `logprobs: true` (and optionally `top_logprobs: N`, up to 20) in the request:\n\n```sh\ncurl -X POST \"http:\u002F\u002Flocalhost:8080\u002Fv1\u002Fchat\u002Fcompletions\" \\\n  -H \"Content-Type: application\u002Fjson\" \\\n  -d '{\n    \"model\": \"mlx-community\u002FQwen2-VL-2B-Instruct-4bit\",\n    \"messages\": [{\"role\":\"user\",\"content\":\"Say hi in 3 words.\"}],\n    \"max_tokens\": 8,\n    \"logprobs\": true,\n    \"top_logprobs\": 3\n  }'\n```\n\nEach choice gets a `logprobs.content[]` list with one entry per generated token: `{token, logprob, bytes, top_logprobs: [{token, logprob, bytes}, ...]}`. 
Works for both streaming and non-streaming.\n\n`top_logprobs` requires the server to be started with a non-zero cap on how many alternatives it will compute per token (default `0` = disabled, max `20`). Set it via the `--top-logprobs-k` flag or the `TOP_LOGPROBS_K` env var:\n\n```sh\nmlx_vlm.server --model mlx-community\u002FQwen2-VL-2B-Instruct-4bit --top-logprobs-k 5\n# or\nTOP_LOGPROBS_K=5 mlx_vlm.server --model mlx-community\u002FQwen2-VL-2B-Instruct-4bit\n```\n\nPer-request `top_logprobs` is clamped to `TOP_LOGPROBS_K`. When `TOP_LOGPROBS_K=0`, requests with `logprobs: true` still return chosen-token logprobs; only the `top_logprobs` list stays empty. Leaving the cap at `0` keeps the vocab-wide sort out of the decode graph, so deployments that don't need logprobs pay zero overhead.\n\n#### Structured Outputs\n\nThe `\u002Fv1\u002Fchat\u002Fcompletions` and `\u002Fv1\u002Fresponses` endpoints support OpenAI-compatible `json_schema` structured outputs. The server constrains generation to the supplied JSON schema and supports both streaming and non-streaming responses.\n\nYou can define the schema with Pydantic:\n\n```python\nfrom typing import Literal\n\nfrom pydantic import BaseModel, ConfigDict, Field\n\n\nclass AnimalResult(BaseModel):\n    model_config = ConfigDict(extra=\"forbid\")\n\n    animal: Literal[\"dog\", \"cat\", \"bird\", \"unknown\"]\n    species: str = Field(max_length=60)\n    description: str = Field(max_length=200)\n\n\nschema = AnimalResult.model_json_schema()\n```\n\nCall the local server with the OpenAI Python client:\n\n```python\nfrom openai import OpenAI\n\nclient = OpenAI(base_url=\"http:\u002F\u002Flocalhost:8080\u002Fv1\", api_key=\"not-needed\")\n\nresponse = client.chat.completions.create(\n    model=\"mlx-community\u002FQwen3.5-4B-MLX-4bit\",\n    messages=[\n        {\"role\": \"user\", \"content\": \"Return a dog object.\"},\n    ],\n    response_format={\n        \"type\": \"json_schema\",\n        \"json_schema\": {\n          
  \"name\": \"AnimalResult\",\n            \"strict\": True,\n            \"schema\": schema,\n        },\n    },\n)\n\nresult = AnimalResult.model_validate_json(response.choices[0].message.content)\nprint(result)\n```\n\nExample output:\n\n```text\nanimal='dog' species='Canis lupus familiaris' description='A domesticated canine known for companionship and loyalty.'\n```\n\nChat completions use top-level `response_format`. The same format works for text-only and multimodal requests:\n\n```sh\ncurl -X POST \"http:\u002F\u002Flocalhost:8080\u002Fv1\u002Fchat\u002Fcompletions\" \\\n  -H \"Content-Type: application\u002Fjson\" \\\n  -d '{\n    \"model\": \"mlx-community\u002FQwen3.5-4B-MLX-4bit\",\n    \"messages\": [{\n      \"role\": \"user\",\n      \"content\": [\n        {\"type\": \"text\", \"text\": \"Identify the main animal in this image.\"},\n        {\"type\": \"image_url\", \"image_url\": {\"url\": \"\u002Fpath\u002Fto\u002Fimage.jpg\"}}\n      ]\n    }],\n    \"response_format\": {\n      \"type\": \"json_schema\",\n      \"json_schema\": {\n        \"name\": \"AnimalResult\",\n        \"strict\": true,\n        \"schema\": {\n          \"type\": \"object\",\n          \"properties\": {\n            \"animal\": {\"type\": \"string\", \"enum\": [\"dog\", \"cat\", \"bird\", \"unknown\"]},\n            \"species\": {\"type\": \"string\", \"maxLength\": 60},\n            \"description\": {\"type\": \"string\", \"maxLength\": 200}\n          },\n          \"required\": [\"animal\", \"species\", \"description\"],\n          \"additionalProperties\": false\n        }\n      }\n    },\n    \"max_tokens\": 256\n  }'\n```\n\nStructured outputs are also supported with:\n\n- Streaming chat completions by setting `\"stream\": true`\n- The responses API via `text.format` on `\u002Fv1\u002Fresponses`\n- Text-only requests using the same `response_format` shape\n\nStructured outputs are not currently supported with speculative decoding.\n\n#### How It Works\n\n- A 
dedicated generation thread runs a `BatchGenerator` that processes multiple requests in parallel\n- Image requests are prefilled individually with their own vision embeddings, then join the shared decoding batch\n- Text-only requests are batched together for efficient prefill\n- After prefill, all requests decode together in a single batch, sharing GPU compute\n\n#### Available Endpoints\n\n- `\u002Fmodels` and `\u002Fv1\u002Fmodels` - List models available locally\n- `\u002Fchat\u002Fcompletions` and `\u002Fv1\u002Fchat\u002Fcompletions` - OpenAI-compatible chat-style interaction endpoint with support for images, audio, and text\n- `\u002Fresponses` and `\u002Fv1\u002Fresponses` - OpenAI-compatible responses endpoint\n- `\u002Fhealth` - Check server status\n- `\u002Fmetrics` and `\u002Fv1\u002Fmetrics` - Inspect rolling request metrics, throughput, and runtime counters\n- `\u002Funload` - Unload current model from memory\n\n#### Usage Examples\n\n##### List available models\n\n```sh\ncurl \"http:\u002F\u002Flocalhost:8080\u002Fmodels\"\n```\n\n##### Text Input\n\n```sh\ncurl -X POST \"http:\u002F\u002Flocalhost:8080\u002Fchat\u002Fcompletions\" \\\n  -H \"Content-Type: application\u002Fjson\" \\\n  -d '{\n    \"model\": \"mlx-community\u002FQwen2-VL-2B-Instruct-4bit\",\n    \"messages\": [\n      {\n        \"role\": \"user\",\n        \"content\": \"Hello, how are you\"\n      }\n    ],\n    \"stream\": true,\n    \"max_tokens\": 100\n  }'\n```\n\n##### Image Input\n\n```sh\ncurl -X POST \"http:\u002F\u002Flocalhost:8080\u002Fchat\u002Fcompletions\" \\\n  -H \"Content-Type: application\u002Fjson\" \\\n  -d '{\n    \"model\": \"mlx-community\u002FQwen2.5-VL-32B-Instruct-8bit\",\n    \"messages\":\n    [\n      {\n        \"role\": \"system\",\n        \"content\": \"You are a helpful assistant.\"\n      },\n      {\n        \"role\": \"user\",\n        \"content\": [\n          {\n            \"type\": \"text\",\n            \"text\": \"This is today's chart for 
energy demand in California. Can you provide an analysis of the chart and comment on the implications for renewable energy in California?\"\n          },\n          {\n            \"type\": \"input_image\",\n            \"image_url\": \"\u002Fpath\u002Fto\u002Frepo\u002Fexamples\u002Fimages\u002Frenewables_california.png\"\n          }\n        ]\n      }\n    ],\n    \"stream\": true,\n    \"max_tokens\": 1000\n  }'\n```\n\n##### Audio Support (New)\n```sh\ncurl -X POST \"http:\u002F\u002Flocalhost:8080\u002Fgenerate\" \\\n  -H \"Content-Type: application\u002Fjson\" \\\n  -d '{\n    \"model\": \"mlx-community\u002Fgemma-3n-E2B-it-4bit\",\n    \"messages\": [\n      {\n        \"role\": \"user\",\n        \"content\": [\n          { \"type\": \"text\", \"text\": \"Describe what you hear in these audio files\" },\n          { \"type\": \"input_audio\", \"input_audio\": \"\u002Fpath\u002Fto\u002Faudio1.wav\" },\n          { \"type\": \"input_audio\", \"input_audio\": \"https:\u002F\u002Fexample.com\u002Faudio2.mp3\" }\n        ]\n      }\n    ],\n    \"stream\": true,\n    \"max_tokens\": 500\n  }'\n```\n\n##### Multi-Modal (Image + Audio)\n```sh\ncurl -X POST \"http:\u002F\u002Flocalhost:8080\u002Fgenerate\" \\\n  -H \"Content-Type: application\u002Fjson\" \\\n  -d '{\n    \"model\": \"mlx-community\u002Fgemma-3n-E2B-it-4bit\",\n    \"messages\": [\n      {\n        \"role\": \"user\",\n        \"content\": [\n          {\"type\": \"input_image\", \"image_url\": \"\u002Fpath\u002Fto\u002Fimage.jpg\"},\n          {\"type\": \"input_audio\", \"input_audio\": \"\u002Fpath\u002Fto\u002Faudio.wav\"}\n        ]\n      }\n    ],\n    \"max_tokens\": 100\n  }'\n```\n\n##### Responses Endpoint\n```sh\ncurl -X POST \"http:\u002F\u002Flocalhost:8080\u002Fresponses\" \\\n  -H \"Content-Type: application\u002Fjson\" \\\n  -d '{\n    \"model\": \"mlx-community\u002FQwen2-VL-2B-Instruct-4bit\",\n    \"messages\": [\n      {\n        \"role\": \"user\",\n        \"content\": [\n   
       {\"type\": \"input_text\", \"text\": \"What is in this image?\"},\n          {\"type\": \"input_image\", \"image_url\": \"\u002Fpath\u002Fto\u002Fimage.jpg\"}\n        ]\n      }\n    ],\n    \"max_tokens\": 100\n  }'\n```\n\n#### Request Parameters\n\n- `model`: Model identifier (required)\n- `messages`: Chat messages for chat\u002FOpenAI endpoints\n- `max_tokens`: Maximum tokens to generate\n- `temperature`: Sampling temperature\n- `top_p`: Top-p sampling parameter\n- `top_k`: Top-k sampling cutoff\n- `min_p`: Min-p sampling threshold\n- `repetition_penalty`: Penalty applied to repeated tokens\n- `enable_thinking`: Override the server thinking-mode default for a request (`true` or `false`)\n- `thinking_budget`: Maximum tokens allowed inside the thinking block\n- `thinking_start_token`: Token that opens a thinking block\n- `stream`: Enable streaming responses\n\n\n## Activation Quantization (CUDA)\n\nWhen running on NVIDIA GPUs with MLX CUDA, models quantized with `mxfp8` or `nvfp4` modes require activation quantization to work properly. This converts `QuantizedLinear` layers to `QQLinear` layers which quantize both weights and activations.\n\n### Command Line\n\nUse the `-qa` or `--quantize-activations` flag:\n\n```sh\nmlx_vlm.generate --model \u002Fpath\u002Fto\u002Fmxfp8-model --prompt \"Describe this image\" --image \u002Fpath\u002Fto\u002Fimage.jpg -qa\n```\n\n### Python API\n\nPass `quantize_activations=True` to the `load` function:\n\n```python\nfrom mlx_vlm import load, generate\n\n# Load with activation quantization enabled\nmodel, processor = load(\n    \"path\u002Fto\u002Fmxfp8-quantized-model\",\n    quantize_activations=True\n)\n\n# Generate as usual\noutput = generate(model, processor, \"Describe this image\", image=[\"image.jpg\"])\n```\n\n### Supported Quantization Modes\n\n- `mxfp8` - 8-bit MX floating point\n- `nvfp4` - 4-bit NVIDIA floating point\n\n> **Note**: This feature is required for mxfp\u002Fnvfp quantized models on CUDA. 
On Apple Silicon (Metal), these models work without the flag.\n\n## Multi-Image Chat Support\n\nMLX-VLM supports analyzing multiple images simultaneously with select models. This feature enables more complex visual reasoning tasks and comprehensive analysis across multiple images in a single conversation.\n\n\n### Usage Examples\n\n#### Python Script\n\n```python\nfrom mlx_vlm import load, generate\nfrom mlx_vlm.prompt_utils import apply_chat_template\n\nmodel_path = \"mlx-community\u002FQwen2-VL-2B-Instruct-4bit\"\nmodel, processor = load(model_path)\nconfig = model.config\n\nimages = [\"path\u002Fto\u002Fimage1.jpg\", \"path\u002Fto\u002Fimage2.jpg\"]\nprompt = \"Compare these two images.\"\n\nformatted_prompt = apply_chat_template(\n    processor, config, prompt, num_images=len(images)\n)\n\noutput = generate(model, processor, formatted_prompt, images, verbose=False)\nprint(output)\n```\n\n#### Command Line\n\n```sh\nmlx_vlm.generate --model mlx-community\u002FQwen2-VL-2B-Instruct-4bit --max-tokens 100 --prompt \"Compare these images\" --image path\u002Fto\u002Fimage1.jpg path\u002Fto\u002Fimage2.jpg\n```\n\nThese examples demonstrate how to use multiple images with MLX-VLM for more complex visual reasoning tasks.\n\n## Video Understanding\n\nMLX-VLM also supports video analysis, such as captioning and summarization, with select models.\n\n### Supported Models\n\nThe following models support video chat:\n\n1. Qwen2-VL\n2. Qwen2.5-VL\n3. Idefics3\n4. LLaVA\n\nMore models are coming soon.\n\n### Usage Examples\n\n#### Command Line\n```sh\nmlx_vlm.video_generate --model mlx-community\u002FQwen2-VL-2B-Instruct-4bit --max-tokens 100 --prompt \"Describe this video\" --video path\u002Fto\u002Fvideo.mp4 --max-pixels 224 224 --fps 1.0\n```\n\n## Vision Feature Caching\n\nIn multi-turn conversations about an image, the vision encoder runs on every turn even though the image hasn't changed. 
`VisionFeatureCache` stores projected vision features in an LRU cache keyed by image path, so the expensive vision encoder is only called once per unique image.\n\n### How It Works\n\n1. **First turn (cache miss)** -- `encode_image()` runs the full vision pipeline (vision tower + projector), stores the result in the cache, and passes it to the language model.\n2. **Subsequent turns (cache hit)** -- the cached features are passed directly via `cached_image_features`, skipping the vision encoder entirely.\n3. **Image switch** -- when the image changes, it's a new cache key so features are computed and cached. Switching back to a previous image is a cache hit.\n\nThe cache holds up to 8 entries (configurable) and uses LRU eviction.\n\n### CLI\n\nAll chat interfaces use `VisionFeatureCache` automatically:\n\n```sh\n# Gradio chat UI\npython -m mlx_vlm.chat_ui --model google\u002Fgemma-4-26b-a4b-it\n\n# Interactive chat with Rich UI (load images with \u002Fimage command)\npython -m mlx_vlm.chat --model google\u002Fgemma-4-26b-a4b-it\n\n# Inline chat mode\npython -m mlx_vlm.generate \\\n  --model google\u002Fgemma-4-26b-a4b-it \\\n  --image path\u002Fto\u002Fimage.jpg \\\n  --chat \\\n  --max-tokens 200\n```\n\n### Python\n\n```python\nfrom mlx_vlm import load, stream_generate, VisionFeatureCache\nfrom mlx_vlm.prompt_utils import apply_chat_template\n\nmodel, processor = load(\"google\u002Fgemma-4-26b-a4b-it\")\ncache = VisionFeatureCache()\n\nimage = \"path\u002Fto\u002Fimage.jpg\"\n\n# Turn 1 -- cache miss, encodes image\nprompt1 = apply_chat_template(processor, model.config, \"Describe this image.\", num_images=1)\nfor chunk in stream_generate(model, processor, prompt1, image=[image],\n                              max_tokens=200, vision_cache=cache):\n    print(chunk.text, end=\"\")\n\n# Turn 2 -- cache hit, skips vision encoder\nprompt2 = apply_chat_template(processor, model.config, \"What colors do you see?\", num_images=1)\nfor chunk in stream_generate(model, 
processor, prompt2, image=[image],\n                              max_tokens=200, vision_cache=cache):\n    print(chunk.text, end=\"\")\n```\n\n### Server\n\nThe server caches vision features automatically across requests for the same image. No configuration needed -- the cache is created when a model loads and cleared on unload.\n\n```sh\nmlx_vlm.server --model google\u002Fgemma-4-26b-a4b-it\n```\n\nMulti-turn conversations via `\u002Fv1\u002Fchat\u002Fcompletions` (streaming and non-streaming) and `\u002Fresponses` all benefit. The same image sent across multiple requests will only be encoded once.\n\n### Performance\n\nTested on `google\u002Fgemma-4-26b-a4b-it` over 10 multi-turn conversation turns:\n\n| Metric | Without Cache | With Cache |\n|--------|--------------|------------|\n| Prompt TPS | ~48 | ~550-825 |\n| Speedup | -- | **11x+** |\n| Peak Memory | 52.66 GB | 52.66 GB (flat) |\n\nGeneration speed (~31 tok\u002Fs) and memory are unaffected -- only prompt processing gets faster.\n\n## TurboQuant KV Cache\n\nTurboQuant compresses the KV cache during generation, enabling longer context lengths with less memory while maintaining quality.\n\n### Quick Start\n\n```sh\n# 3.5-bit KV cache quantization (3-bit keys + 4-bit values)\nmlx_vlm generate \\\n  --model mlx-community\u002FQwen3.5-4B-4bit \\\n  --kv-bits 3.5 \\\n  --kv-quant-scheme turboquant \\\n  --prompt \"Your long prompt here...\"\n```\n\n```python\nfrom mlx_vlm import generate\n\nresult = generate(\n    model, processor, prompt,\n    kv_bits=3.5,\n    kv_quant_scheme=\"turboquant\",\n    max_tokens=256,\n)\n```\n\n```sh\n# Server with TurboQuant\nmlx_vlm server \\\n  --model google\u002Fgemma-4-26b-a4b-it \\\n  --kv-bits 3.5 \\\n  --kv-quant-scheme turboquant\n```\n\n### How It Works\n\nTurboQuant uses random rotation + codebook quantization ([arXiv:2504.19874](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.19874)) to compress KV cache entries from 16-bit to 2-4 bits per dimension:\n\n- **Keys & 
Values**: MSE codebook quantization with Hadamard rotation\n- **Fractional bits** (e.g. 3.5): uses lower bits for keys, higher for values (3-bit K + 4-bit V)\n\nCustom Metal kernels fuse score computation and value aggregation directly on packed quantized data, avoiding full dequantization during decode.\n\n### Performance\n\nTested on Qwen3.5-4B-4bit at 128k context:\n\n| Metric | Baseline | TurboQuant 3.5-bit |\n|--------|----------|-------------------|\n| KV Memory | 4.1 GB | 0.97 GB (**76% reduction**) |\n| Peak Memory | 18.3 GB | 17.3 GB (**-1.0 GB**) |\n\nAt 512k+ contexts, TurboQuant's per-layer attention is **faster than FP16 SDPA** due to reduced memory bandwidth requirements.\n\nTested on gemma-4-31b-it at 128k context:\n\n| Metric | Baseline | TurboQuant 3.5-bit |\n|--------|----------|-------------------|\n| KV Memory | 13.3 GB | 4.9 GB (**63% reduction**) |\n| Peak Memory | 75.2 GB | 65.8 GB (**-9.4 GB**) |\n\n### Supported Bit Widths\n\n| Bits | Compression | Best For |\n|------|------------|----------|\n| 2 | ~8x | Maximum compression, some quality loss |\n| 3 | ~5x | Good balance of quality and compression |\n| 3.5 | ~4.5x | Recommended default (3-bit keys + 4-bit values) |\n| 4 | ~4x | Best quality, moderate compression |\n\n### Compatibility\n\nTurboQuant automatically quantizes `KVCache` layers (global attention). Models with `RotatingKVCache` (sliding window) or `ArraysCache` (MLA\u002Fabsorbed keys) keep their native cache format for those layers since they are already memory-efficient.\n\nTurboQuant is supported in both single-request generation and continuous batching on the server. In continuous batching mode, KV states are stored in TurboQuant's compressed format and dequantized at attention time (custom Metal kernels are not yet batch-aware).\n\n## Distributed Inference\n\nmlx-vlm supports distributed inference across multiple computers. 
It works by sharding the language model (not the vision tower), because the LLM is much larger and vision embeddings only need to be computed once.\n\nThe parallel implementation is compatible with [mlx-lm](https:\u002F\u002Fgithub.com\u002Fml-explore\u002Fmlx-lm) sharding primitives.\n\nSee [docs\u002Fusage.md](https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-vlm\u002Fblob\u002Fmain\u002Fdocs\u002Fusage.md#distributed-inference) for command-line examples.\n\n## Fine-tuning\n\nMLX-VLM supports fine-tuning models with LoRA and QLoRA.\n\n### LoRA & QLoRA\n\nTo learn more about LoRA, please refer to the [LoRA.md](.\u002Fmlx_vlm\u002FLORA.MD) file.\n","MLX-VLM is a package for inference and fine-tuning of Vision Language Models (VLMs) and Omni Models (VLMs with audio and video support) on a Mac using MLX. Written in Python, the project supports a range of features, including but not limited to a command-line interface (CLI), speculative decoding, a Gradio-based chat UI, Python script integration, and FastAPI server deployment. It also provides detailed documentation for specific models, covering best practices for use cases ranging from OCR to multimodal understanding. The tool is particularly well suited to researchers and developers who need to analyze and generate multimedia content efficiently in a local environment.",2,"2026-06-11 03:31:21","trending"]