[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-2356":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":16,"stars7d":17,"stars30d":18,"stars90d":15,"forks30d":15,"starsTrendScore":19,"compositeScore":20,"rankGlobal":10,"rankLanguage":10,"license":21,"archived":22,"fork":22,"defaultBranch":23,"hasWiki":24,"hasPages":22,"topics":25,"createdAt":10,"pushedAt":10,"updatedAt":32,"readmeContent":33,"aiSummary":34,"trendingCount":15,"starSnapshotCount":15,"syncStatus":35,"lastSyncTime":36,"discoverSource":37},2356,"cider","Mininglamp-AI\u002Fcider","Mininglamp-AI","W8A8\u002FW4A8 inference + optimized SDPA on Apple Silicon — unlocking unused INT8 TensorOps in M5 for 1.2–1.9× faster LLM prefill, plus FlashInfer-inspired GQA decode attention for up to 1.6× SDPA speedup, built as MLX custom primitives.","",null,"Python",428,26,14,0,104,111,143,312,94.29,"MIT License",false,"main",true,[26,27,28,29,30,31],"apple-silicon","metal","mlx","quantization","w4a8","w8a8","2026-06-12 04:00:14","# cider\n\nCider is developed on top of MLX for macOS. It provides online activation quantization operators absent in MLX, with custom int-matmul kernels built as MLX custom primitives supporting full lazy evaluation. It also includes service-side extensions and non-intrusive compatibility patches for mlx_vlm (validated on mlx_vlm 0.4.3), including fixes for Qwen3-VL multi-image inference issues related to RoPE position handling and chunked prefill. 
\n\n## Conditional Compilation (M4 \u002F M5)\n\nCider uses **conditional compilation**: the INT8 TensorOps C++ extension is only built on Apple M5+.\n\n| Chip | `pip install -e .` behavior | `import cider` behavior |\n|------|----------------------------|------------------------|\n| **M5+** | Full build (CMake + Metal kernels) | All features available |\n| **M4 and below** | Skips C++ build, installs pure-Python package | `is_available()` → False, `convert_model()` is a warning no-op |\n\n**Override via environment variable:**\n```bash\nCIDER_FORCE_BUILD=1 pip install -e .   # Force build (e.g., CI)\nCIDER_FORCE_BUILD=0 pip install -e .   # Force skip\n```\n\n## Modes\n\n| Mode | Weights | Activations | Compute Path | Status |\n|------|---------|-------------|--------------|--------|\n| **W8A8** | INT8 symmetric | INT8 per-token | TensorOps matmul2d | ✅ Implemented |\n| **W4A8** | INT4 packed (uint8) | INT8 per-token | Unpack→TensorOps | ✅ Implemented |\n| W4A16 | — | — | MLX built-in | Baseline |\n| W8A16 | — | — | MLX built-in | Baseline |\n\n**W4A16 and W8A16 are already supported by MLX natively** — this SDK provides the missing W8A8 and W4A8 modes that MLX does not implement.\n\nMLX's quantization is **weight-only**: QuantizedLinear dequantizes weights to FP16 and uses FP16 GEMM. While MLX's Steel NAX templates are generic enough to be instantiated with INT8 types (and would achieve identical raw matmul throughput — [see our transparent benchmark](benchmarks\u002Fmlx_native\u002Fcider_vs_mlx_int8.md)), MLX does not provide the quantization\u002Fdequantization pipeline needed for actual W8A8 inference. Cider fills this gap with fused quantize-matmul-dequant primitives.\n\nThis SDK implements online INT8 activation quantization and INT8 TensorOps-based compute for the supported inference paths. 
\n\n### W8A8 Quantization Granularity\n\n| Granularity | Description | Speed | Precision |\n|-------------|-------------|-------|-----------|\n| **Per-channel** | One scale per output channel | Fastest (1.8x prefill) | Slightly lower |\n| **Per-group (gs=128)** | One scale per 128 elements | Fast (1.5x prefill) | Moderate precision retention |\n| **Per-group (gs=64)** | One scale per 64 elements | Moderate (1.3x prefill) | Higher precision |\n\n## Performance (Apple M5 Pro)\n\n### Individual Operator Latency \n\nShape [N=10240, K=2560]\n| M |   PC(ms) |  PG(ms)  |  w8a16  |  w4a16 |   PC\u002Fw8  | PC\u002Fw4  | PG\u002Fw8  | PG\u002Fw4|\n|-----|------|------|-----|-----|-----|------|-----|----|\n| 1 |   0.27ms |   0.26ms |  0.26ms |  0.18ms |  0.96x | 0.67x | 0.99x | 0.69x |\n|128 |   0.34ms  | 0.39ms |  0.49ms |  0.44ms |  1.43x | 1.28x  | 1.26x |  1.13x |\n|1024 |   1.23ms |  1.52ms  | 2.24ms  | 2.04ms |  1.82x  | 1.66x | 1.47x | 1.34x|\n|4096 |   4.41ms |  5.65ms |  8.12ms |  7.72ms |  1.84x |  1.75x | 1.44x  | 1.37x |\n|8192 |   8.71ms |  11.40ms |  16.23ms | 15.09ms |  1.86x | 1.73x | 1.42x | 1.32x|\n\n\nShape [N=2560, K=10240]\n| M |   PC(ms)  | PG(ms)   | w8a16  |  w4a16 |   PC\u002Fw8  | PC\u002Fw4  | PG\u002Fw8  | PG\u002Fw4 |\n|--------|------|--------|-------| ---|--------|------|-------------|------------------|\n| 1 |   0.25ms |  0.26ms |  0.26ms  | 0.20ms |  1.03x | 0.78x | 0.98x | 0.75x |\n|128 |   0.39ms |  0.41ms |  0.55ms |  0.46ms |  1.43x | 1.19x | 1.35x | 1.12x |\n| 1024 |   1.31ms |  1.65ms |  2.35ms  | 2.14ms |  1.80x  | 1.64x | 1.43x | 1.30x |\n| 4096 |   5.37ms  | 6.79ms  | 8.54ms |  8.04ms |  1.59x | 1.50x | 1.26x | 1.18x |\n| 8192 |  10.97ms | 12.94ms | 17.28ms | 16.23ms |  1.58x | 1.48x | 1.34x | 1.25x | \n\n### End-to-End VLM \n\n**Qwen3-VL-2B**\n\n| Prompt Tokens | FP16 Prefill (tok\u002Fs) | W8A16 Prefill (tok\u002Fs) | **W8A8 PC Prefill (tok\u002Fs)** | FP16 Decode (tok\u002Fs) | W8A16 Decode (tok\u002Fs) | **W8A8 PC Decode 
(tok\u002Fs)** |\n|:---:|:---:|:---:|:---:|:---:|:---:|:---:|\n| 1334 | 3010 | 2065 | **3242** | 70 | 107 | **104** |\n| 2393 | 2868 | 1847 | **2983** | 69 | 97 | **100** |\n| 3455 | 2777 | 1741 | **2796** | 66 | 90 | **95** |\n\n**Qwen3-VL-4B**\n\n| Prompt Tokens | FP16 Prefill (tok\u002Fs) | W8A16 Prefill (tok\u002Fs) | **W8A8 PC Prefill (tok\u002Fs)** | FP16 Decode (tok\u002Fs) | W8A16 Decode (tok\u002Fs) | **W8A8 PC Decode (tok\u002Fs)** |\n|:---:|:---:|:---:|:---:|:---:|:---:|:---:|\n| 1334 | 1884 | 1786 | **2186** | 32 | **56** | 54 |\n| 2393 | 1815 | 1700 | **2028** | 31 | **55** | 52 |\n| 3455 | 1755 | 1603 | **1881** | 30 | **52** | 49 |\n\n\n### LLM Quantization: Precision vs. Speed Comparison\n\n\n\u003Ctable>\n  \u003Cthead>\n    \u003Ctr>\n      \u003Cth>Models\u003C\u002Fth>\n      \u003Cth>Quantization Configuration\u003C\u002Fth>\n      \u003Cth>wikitext2 PPL (↓)\u003C\u002Fth>\n      \u003Cth>Prefill Time (s) (↓)\u003C\u002Fth>\n      \u003Cth>Peak Memory (GB) (↓)\u003C\u002Fth>\n    \u003C\u002Ftr>\n  \u003C\u002Fthead>\n  \u003Ctbody>\n    \u003Ctr>\n      \u003Ctd rowspan=\"5\">\u003Cb>Qwen3-8B\u003C\u002Fb>\u003C\u002Ftd>\n      \u003Ctd>FP16\u003C\u002Ftd>\n      \u003Ctd>9.726\u003C\u002Ftd>\n      \u003Ctd>179.9\u003C\u002Ftd>\n      \u003Ctd>18.93\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd>W8A16 (mlx RTN)\u003C\u002Ftd>\n      \u003Ctd>9.707\u003C\u002Ftd>\n      \u003Ctd>221.3\u003C\u002Ftd>\n      \u003Ctd>12.07\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd>W8A8 (per-channel)\u003C\u002Ftd>\n      \u003Ctd>9.756\u003C\u002Ftd>\n      \u003Ctd>\u003Cb>123.5\u003C\u002Fb>\u003C\u002Ftd>\n      \u003Ctd>\u003Cb>11.32\u003C\u002Fb>\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd>W8A8 (per-group gs=64)\u003C\u002Ftd>\n      \u003Ctd>9.744\u003C\u002Ftd>\n      \u003Ctd>179.1\u003C\u002Ftd>\n      \u003Ctd>11.83\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd>W8A8 
(per-group gs=128)\u003C\u002Ftd>\n      \u003Ctd>9.727\u003C\u002Ftd>\n      \u003Ctd>165.8\u003C\u002Ftd>\n      \u003Ctd>11.61\u003C\u002Ftd>\n    \u003C\u002Ftr>\n  \u003C\u002Ftbody>\n  \u003Ctr style=\"border-top: 1px solid #333;\">\n      \u003Ctd colspan=\"5\" style=\"padding: 0; height: 3px;\">\u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctbody>\n    \u003Ctr>\n      \u003Ctd rowspan=\"5\">\u003Cb>Llama3-8B\u003C\u002Fb>\u003C\u002Ftd>\n      \u003Ctd>FP16\u003C\u002Ftd>\n      \u003Ctd>6.138\u003C\u002Ftd>\n      \u003Ctd>175.8\u003C\u002Ftd>\n      \u003Ctd>18.32\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd>W8A16 (mlx RTN)\u003C\u002Ftd>\n      \u003Ctd>6.147\u003C\u002Ftd>\n      \u003Ctd>236.9\u003C\u002Ftd>\n      \u003Ctd>11.46\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd>W8A8 (per-channel)\u003C\u002Ftd>\n      \u003Ctd>6.271\u003C\u002Ftd>\n      \u003Ctd>\u003Cb>123.3\u003C\u002Fb>\u003C\u002Ftd>\n      \u003Ctd>\u003Cb>10.69\u003C\u002Fb>\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd>W8A8 (per-group gs=64)\u003C\u002Ftd>\n      \u003Ctd>6.269\u003C\u002Ftd>\n      \u003Ctd>178.7\u003C\u002Ftd>\n      \u003Ctd>11.19\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd>W8A8 (per-group gs=128)\u003C\u002Ftd>\n      \u003Ctd>6.270\u003C\u002Ftd>\n      \u003Ctd>155.7\u003C\u002Ftd>\n      \u003Ctd>10.98\u003C\u002Ftd>\n    \u003C\u002Ftr>\n  \u003C\u002Ftbody>\n\u003C\u002Ftable>\n\n## NOTE\nCider assumes the weights are \u003Cspan style=\"color: red;\">**quantization-friendly**\u003C\u002Fspan>. This means your model should already have been calibrated with a post-training quantization method such as GPTQ, SmoothQuant, or QuaRot to handle outliers, or have been trained with QAT. 
When Cider converts the model (e.g., to W8A8) and computes the quantization scales, it defaults to the simple **min-max** method; it also supports percentile clipping (e.g., `clip_percentile=99.9`, the 99.9th percentile). If you encounter garbled output, your model is very likely affected by \u003Cspan style=\"color: red;\">**outliers**\u003C\u002Fspan>, so some preprocessing is **necessary**.\n\n\n## Requirements\n\n\n- **Apple M5+** for INT8 TensorOps (M4 and below: installs as pure-Python, `is_available()` returns False)\n- Python 3.12+\n- MLX >= 0.31\n- nanobind >= 2.12 (only needed on M5+ for C++ build)\n- CMake >= 3.27 (only needed on M5+ for C++ build)\n\n## Install\n\n```bash\npip install -e .\n```\n\nOn M5+, this runs CMake to compile the C++ extension, then installs the Python package.\nOn M4 and below, only the Python package is installed (no compilation errors).\n\n## Quick Start\n\n### One-line Model Conversion (Recommended)\n\n```python\n# Assumption: load comes from mlx_vlm; the original snippet does not show its import\nfrom mlx_vlm import load\nfrom cider import convert_model, is_available\n\nmodel, proc = load(\"path\u002Fto\u002Fmodel\")\n\nif is_available():\n    convert_model(model)\n    # CiderLinear auto-detects:\n    #   seq_len > 1  → W8A8 INT8 TensorOps (faster prefill)\n    #   seq_len == 1 → INT8 MV kernel (near-native decode speed)\nelse:\n    pass  # Falls back to standard MLX inference on M4 and below\n```\n\n**Important**\nWhen quantizing Vision-Language Models (VLMs), the vision transformer (ViT) is generally left unquantized. Calling `convert_model(model)` directly also converts the vision model's linear layers, which typically causes an accuracy drop. For VLMs, we recommend calling `convert_model(model.language_model)` so that only the language model, prepared with an existing quantization method such as GPTQ, SmoothQuant, or AWQ, is converted.\n\nTested on selected MLX transformer models, including the Qwen3, Qwen3-VL, and Llama3 families. 
Other architectures may require adaptation.\n\n\n### Layer-level API\n\n```python\nimport numpy as np\nimport mlx.core as mx\nfrom cider import W8A8Linear, W4A8Linear, is_available\n\nassert is_available(), \"Requires Apple M5+\"\n\n# Prepare weight\nW = np.random.randn(4096, 4096).astype(np.float16)\n\n# W8A8 linear (per-channel)\nfrom cider.ops import quantize_weight_int8\nw_int8, scale = quantize_weight_int8(W)\nlayer = W8A8Linear(\n    w_int8=mx.array(w_int8), scale_w=mx.array(scale),\n    group_size=0, in_features=4096, out_features=4096\n)\nx = mx.random.normal((32, 4096)).astype(mx.float16)\ny = layer(x)    # lazy — builds MLX graph\nmx.eval(y)       # GPU executes\n\n# W4A8 linear (half the weight memory)\nlayer4 = W4A8Linear.from_weights(W)\ny4 = layer4(x)\nmx.eval(y4)\n```\n\n## Low-Level API\n\n```python\nfrom cider import perchannel_linear, w4a8_linear, quantize_weight_int8, pack_weight_int4\n\n# Quantize weights (numpy, offline)\nw_int8, scale = quantize_weight_int8(W_np)\npacked_w4, scale4 = pack_weight_int4(W_np)\n\n# Primitive calls (return lazy mx.array)\ny = perchannel_linear(x, mx.array(w_int8), mx.array(scale))\ny4 = w4a8_linear(x, mx.array(packed_w4), mx.array(scale4))\n```\n\n## Project Structure\n\n```\ncider\u002F\n├── cider\u002F              # Python package\n│   ├── __init__.py        # Public API (conditional on is_available)\n│   ├── ops.py             # Primitive wrappers + quantize helpers\n│   ├── nn.py              # CiderLinear, W4A8Linear (nn.Module)\n│   ├── convert.py         # convert_model() high-level API\n│   └── kernels\u002F           # Metal shaders (bundled)\n│       ├── w8a8_matmul.metal       # W8A8 GEMM (prefill, M>1)\n│       ├── w8a8_int8_mv.metal      # W8A8 per-channel MV (decode, M=1)\n│       ├── w8a8_quantize.metal     # Per-token activation quantization\n│       ├── w4a8_matmul.metal       # W4A8 GEMM (prefill)\n│       ├── pergroup_int8_gemm.metal # Per-group GEMM (prefill)\n│       └── pergroup_int8_mv.metal 
  # Per-group MV (decode)\n├── csrc\u002F                  # C++ MLX primitives (nanobind, M5+ only)\n│   ├── include\u002F\n│   │   ├── w8a8_primitive.h\n│   │   ├── w4a8_primitive.h\n│   │   └── pergroup_primitive.h\n│   └── src\u002F\n│       ├── w8a8_primitive.mm\n│       ├── w4a8_primitive.mm\n│       ├── pergroup_primitive.mm\n│       └── prim_bindings.cpp\n├── benchmarks\u002F\n│   ├── bench_e2e_wxa16.py    # End-to-end VLM benchmark (Qwen3-VL-2B)\n│   ├── bench_full.py         # Isolated kernel latency (per-channel\u002Fper-group vs MLX)\n│   ├── test_bitexact.py      # Numerical correctness verification\n│   └── mlx_native\u002F           # MLX native INT8 comparison\n├── tutorial\u002F\n│   ├── how_to_write_efficient_int_gemm_m5_en.md\n│   └── how_to_write_efficient_int_gemm_m5_zh.md\n├── tools\u002F\n│   ├── eval_ppl_all.py               # Unified PPL eval (FP16\u002FW8A16\u002Fper-channel\u002Fper-group)\n│   ├── convert_compressed_tensors_to_mlx.py\n│   └── smoothquant.py                # SmoothQuant calibration\n├── examples\u002F\n│   └── basic_usage.py\n├── vlm_service\u002F           # OpenAI-style VLM inference server\n│   ├── server.py             # FastAPI server (streaming + non-streaming)\n│   ├── core_infer.py         # HMInference engine (singleton)\n│   ├── custom_qwen3vl.py     # Custom Qwen3-VL generation loop\n│   ├── config.py             # Config loader\n│   ├── bench_client.py       # Server benchmark client\n│   └── client.py             # API client example\n├── config\u002F\n│   └── config.yaml           # Server & model configuration\n├── experimental\u002F             # ANE+GPU hybrid tensor parallelism (M4)\n│   ├── split_linear.py       # SplitLinear + ANEBridge + patch_model()\n│   ├── bench.py              # End-to-end benchmark\n│   ├── libane_bridge_v6.m    # ANE private API bridge (Obj-C source)\n│   └── README.md\n├── CMakeLists.txt\n├── pyproject.toml\n├── setup.py               # Conditional build (M5+: full, M4: 
pure-Python)\n└── README.md\n```\n\n## VLM Inference Service\n\n`vlm_service\u002F` provides a ready-to-use **OpenAI-style** VLM inference server with W8A8 acceleration.\n\n### Quick Start\n\n1. **Configure** `config\u002Fconfig.yaml`:\n\n```yaml\nmodel_name_or_path: \u002Fpath\u002Fto\u002Fyour\u002Fmodel   # MLX VLM model (e.g., Qwen3-VL-2B W8A16)\nsampling:\n  max_new_tokens: 1024\n  temperature: 1.0\n  top_p: 1.0\nserver:\n  host: 0.0.0.0\n  port: 8341\n  ttl: 1800\nw8a8:\n  mode: 'off'   # 'auto' | 'on' | 'off'\n```\n\n- `auto`: Enable W8A8 if the hardware supports it; fall back to default inference otherwise\n- `on`: Force W8A8 (error if unsupported). With `on`, the model runs with online activation quantization. Cider itself does **not** guarantee quantization accuracy: you need to prepare the model yourself with a quantization algorithm such as SmoothQuant, QuaRot, GPTQ, or QAT so that accuracy does not degrade significantly after activation quantization. This option lets a W8A8 model use the hardware's INT8 compute instead of merely simulating quantization.\n- `off`: Disable W8A8, use standard MLX inference\n\n2. **Start the server**:\n\n```bash\ncd vlm_service\npython server.py --config ..\u002Fconfig\u002Fconfig.yaml\n```\n\n3. 
**Send requests** (OpenAI-style API):\n\n```bash\n# Text-only\ncurl http:\u002F\u002Flocalhost:8341\u002Fv1\u002Fchat\u002Fcompletions \\\n  -H \"Content-Type: application\u002Fjson\" \\\n  -d '{\n    \"model\": \"vlm\",\n    \"messages\": [{\"role\": \"user\", \"content\": \"Hello\"}],\n    \"stream\": false\n  }'\n\n# With image (base64)\ncurl http:\u002F\u002Flocalhost:8341\u002Fv1\u002Fchat\u002Fcompletions \\\n  -H \"Content-Type: application\u002Fjson\" \\\n  -d '{\n    \"model\": \"vlm\",\n    \"messages\": [{\"role\": \"user\", \"content\": [\n      {\"type\": \"image_url\", \"image_url\": {\"url\": \"data:image\u002Fpng;base64,...\"} },\n      {\"type\": \"text\", \"text\": \"What is in this image?\"}\n    ]}],\n    \"stream\": true\n  }'\n```\n\n### API Endpoints\n\n| Endpoint | Method | Description |\n|----------|--------|-------------|\n| `\u002Fv1\u002Fchat\u002Fcompletions` | POST | Chat completion (stream \u002F non-stream) |\n| `\u002Fv1\u002Fmodels` | GET | List available models |\n| `\u002Fhealth` | GET | Health check |\n| `\u002Fv1\u002Fqueue` | GET | Request queue status |\n\n### How W8A8 Works in the Service\n\nWhen `w8a8.mode` is `auto` or `on`, the server calls `cider.convert_model()` at startup to replace all Linear layers with `CiderLinear`. During inference:\n\n- **Prefill** (processing input tokens, seq_len > 1): Uses W8A8 INT8 GEMM for faster computation\n- **Decode** (generating tokens one by one, seq_len == 1): Uses INT8 MV kernel (near-native speed)\n\nNo code changes needed — the switching is automatic based on input sequence length.\n\n## Architecture\n\n### MLX Custom Primitives\n\nBoth W8A8Linear and W4A8Linear are implemented as `mlx::core::Primitive` subclasses. This means:\n\n1. **Lazy evaluation**: `y = layer(x)` builds a graph node, not immediate computation\n2. **Graph composition**: Multiple primitive calls compose into a single MLX graph\n3. 
**Stream scheduling**: MLX's scheduler handles GPU dispatch order\n\n### Metal Kernel Pipeline\n\nEach primitive dispatches Metal compute kernels:\n\n**Prefill (M > 1):**\n1. **quantize_per_token**: FP16 activations → INT8 + per-token scales\n2. **matmul_fused_dequant**: INT8 × INT8 → INT32 → FP16 (with fused scale dequantization)\n\n**Decode (M = 1):**\n- **int8_mv**: Direct INT8 matrix-vector product with on-the-fly weight dequantization (no activation quantization needed)\n\nFor W4A8, the GEMM step includes inline INT4→INT8 unpacking in the fragment load.\n\n### TensorOps matmul2d\n\nThe INT8 GEMM uses Apple's `mpp::tensor_ops::matmul2d(16, 32, 16)` — hardware-accelerated INT8×INT8→INT32 matrix multiply available on M5+ via Metal 4's `cooperative_tensor` API. This is the same hardware instruction available to MLX's NAX templates. Cider's kernel adds fused dequantization (INT32 × scales → FP16) in the store phase, avoiding an extra device memory round-trip. See [kernel comparison](benchmarks\u002Fmlx_native\u002Fcider_vs_mlx_int8.md) for details.\n\n### Tile Configurations\n\n| Config | BM | BN | BK | SK | Threads | Use When |\n|--------|----|----|----|----|---------|----------|\n| Large  | 128 | 128 | 512 | 32 | 512 | M > 64 |\n| Small  | 32  | 128 | 512 | 32 | 128 | M ≤ 64 |\n\nAuto-selected based on M. L2 cache swizzle dispatch included.\n\n## ANE+GPU Heterogeneous Tensor Parallelism (experimental)\n\nWe found that during inference on Mac, only two hardware computing units—GPU and CPU—were utilized, while the ANE (Apple Neural Engine) computing unit on Mac remained idle. We identified this as a potential optimization opportunity. Inspired by [maderix\u002FANE](https:\u002F\u002Fgithub.com\u002Fmaderix\u002FANE), we conducted experimental work on a hybrid ANE+GPU inference mode. Currently, we apply this approach to tensor parallel computing. 
On the M4 chip, in synchronous-only forward inference, we observed approximately **3%~16%** performance improvement over pure GPU inference under the same synchronous pipeline. MLX natively uses lazy evaluation, which reduces synchronization overhead; in end-to-end testing the hybrid mode currently shows no advantage, mainly because we have not yet implemented it on top of MLX's lazy evaluation. This remains exploratory work, and closing that gap is future work.\n\n\nDuring LLM prefill, the GPU's matrix units are fully occupied, but the **Apple Neural Engine sits completely idle**. ANE Split exploits this by splitting each linear layer's GEMM along output channels:\n\n- **ANE** computes ~65% of output channels (FP32, via reverse-engineered private `_ANEClient` API)\n- **GPU** computes the remaining ~35% (FP16, standard MLX matmul)\n- Both run **concurrently**, and results are concatenated\n\nThis is a form of **heterogeneous tensor parallelism** (not data parallelism, not pipeline parallelism) that exploits two distinct compute units on the same SoC.\n\n### Performance (Apple M4, Qwen3-VL-2B Prefill)\n\n| seq | W8A16 GPU | SplitLinear | Speedup vs W8A16 |\n|-----|----------|-----------|-------------|\n| 512 | 639.9 ms | **615.9 ms** | **1.039×** |\n| 1024 | 1348.6 ms | **1156.9 ms** | **1.17×** |\n\nIn the tested benchmark cases, cosine similarity was close to 1.0 and top-1 token agreement was 100%.\n\n\n### Key Design Choices\n\n- **Prefill only**: Decode falls back to original GPU linear (zero overhead)\n- **Shared input preparation**: Q\u002FK\u002FV and Gate\u002FUp projections share a single input transpose+numpy copy via `_InputGroup`\n- **Auto-routing**: Down projections (IC > 2×OC) stay GPU-only where ANE is inefficient\n- **Short-seq bypass**: Sequences \u003C 192 tokens skip splitting (overhead > benefit)\n\nSee 
[`experimental\u002FREADME.md`](experimental\u002FREADME.md) for full documentation, usage, build instructions, and limitations.\n\n> **Note:** ANE Split is tested on M4. M5 introduced ANE architecture changes that may break the private API bridge; it has not yet been validated on M5.\n\n## Quantization\n\n| Component | Scheme | Granularity |\n|-----------|--------|-------------|\n| W8A8 weights | Symmetric INT8 | Per-channel or per-group (gs=64\u002F128) |\n| W4A8 weights | Symmetric INT4 (zp=8) | Per-column |\n| Activations | Symmetric INT8 | Per-token |\n| Accumulation | INT32 | — |\n| Output dequant | `C_fp16 = C_int32 * s_act * s_weight` | Per-element |\n\n## Limitations\n\n- **M=1 individual operator**: The per-channel MV kernel is slower than MLX W4A16 for isolated decode calls. The per-group MV kernel is within 5% of MLX W8A16 decode speed in end-to-end benchmarks.\n- **Apple M5+ only** for INT8 TensorOps: the package installs on M4 and below, but `is_available()` returns False.\n- **W4A8 slower than W8A8**: the INT4→INT8 unpack adds ALU overhead (Metal 4 `matmul2d` has no native INT4 operand).\n\n## Tools\n\n### Unified PPL Evaluation\n\n```bash\n# Run all 5 configurations in one script\npython tools\u002Feval_ppl_all.py --num-samples 50\n\n# Evaluates: FP16, W8A16 (MLX native), W8A8 per-channel, per-group(gs=64), per-group(gs=128)\n# Outputs comparison table at the end\n```\n\n## Roadmap\n\n- [x] One-line model conversion API (`convert_model`, auto prefill\u002Fdecode)\n- [x] Automatic dtype handling (float16 \u002F bfloat16)\n- [x] Per-channel and per-group W8A8 quantization\n- [x] Dedicated decode MV kernel (matches native MLX speed)\n- [x] Conditional compilation (M4 graceful fallback)\n- [x] mlx_vlm and mlx_lm integration examples\n- [ ] ANE primitives lazy evaluation\n- [ ] Integrated pruning feature\n- [ ] KVCache quantization\n\n## Authors\n\nMultimodal Team, Mininglamp Technology\n\nFor bug reports, feature requests, and usage questions, please open an issue in this 
repository.\n\n\n\n## Citation\n\nIf you find this work useful, please cite:\n\n```bibtex\n@software{wang2026cider,\n  author = {Multimodal Team, Mininglamp Technology},\n  title = {Cider: Exploiting Unused INT8 TensorOps for Faster LLM Prefill on Apple Silicon},\n  year = {2026},\n  howpublished = {https:\u002F\u002Fgithub.com\u002FMininglamp-AI\u002Fcider}\n}\n```\n\n## License\n\nMIT\n\n## Acknowledgments\n\n- [MLX](https:\u002F\u002Fgithub.com\u002Fml-explore\u002Fmlx) by Apple: primitive API, NAXFrag kernel architecture\n- Metal 4 MetalPerformancePrimitives for INT8 TensorOps\n- [maderix\u002FANE](https:\u002F\u002Fgithub.com\u002Fmaderix\u002FANE): inspired and informed our ANE+GPU tensor-parallel implementation\n","Cider is a project built on MLX for macOS that speeds up LLM prefill by unlocking the unused INT8 TensorOps in the M5 chip. Its core features include online activation quantization operators and custom int-matmul kernels, supporting inference acceleration in W8A8 and W4A8 modes. Cider uses conditional compilation to build the full C++ extension and Metal kernels on M5 and later chips, while providing a pure-Python package on M4 and earlier chips. The project is particularly suited to applications that need to run large language models efficiently on Apple Silicon devices, such as natural language processing tasks. In addition, Cider provides service-side extensions and non-intrusive compatibility patches to ensure good integration with mlx_vlm.",2,"2026-06-11 02:49:36","CREATED_QUERY"]