[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-74802":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":16,"stars7d":17,"stars30d":18,"stars90d":15,"forks30d":15,"starsTrendScore":19,"compositeScore":20,"rankGlobal":9,"rankLanguage":9,"license":9,"archived":21,"fork":21,"defaultBranch":22,"hasWiki":23,"hasPages":21,"topics":24,"createdAt":9,"pushedAt":9,"updatedAt":25,"readmeContent":26,"aiSummary":27,"trendingCount":15,"starSnapshotCount":15,"syncStatus":28,"lastSyncTime":29,"discoverSource":30},74802,"flash-moe","danveloper\u002Fflash-moe","danveloper","Running a big model on a small laptop",null,"Objective-C",3912,494,40,10,0,9,23,73,27,30.08,false,"main",true,[],"2026-06-12 02:03:28","# Flash-MoE: Running a 397B Parameter Model on a Laptop\n\n> **[Read the paper](paper\u002Fflash_moe.pdf)** — Full technical details, 90+ experiments, and the story of how an AI and a human built this in 24 hours.\n\nPure C\u002FMetal inference engine that runs **Qwen3.5-397B-A17B** (a 397 billion parameter Mixture-of-Experts model) on a MacBook Pro with 48GB RAM at **4.4+ tokens\u002Fsecond** with production-quality output including tool calling.\n\nThe entire 209GB model streams from SSD through a custom Metal compute pipeline. No Python. No frameworks. Just C, Objective-C, and hand-tuned Metal shaders.\n\n## Results\n\n![Progress](progress.png)\n\n| Configuration | tok\u002Fs | Quality | Notes |\n|--------------|-------|---------|-------|\n| 4-bit experts, FMA kernel | **4.36** | Excellent | Current best. Full tool calling. 209GB on disk. |\n| 4-bit experts, baseline | 3.90 | Excellent | Before FMA kernel optimization. |\n| 2-bit experts, trust OS | 5.74 | Good* | 120GB on disk. *Breaks JSON\u002Ftool calling. 
|\n| 2-bit peak single token | 7.05 | Good* | Warm cache burst. *Not suitable for tool use. |\n\n*2-bit quantization produces `\\name\\` instead of `\"name\"` in JSON output, making tool calling unreliable. 4-bit is the production configuration.\n\n## Hardware\n\n- **Machine**: MacBook Pro, Apple M3 Max\n- **Chip**: 16-core CPU (12P + 4E), 40-core GPU, 16-core ANE\n- **Memory**: 48 GB unified (~400 GB\u002Fs bandwidth)\n- **SSD**: 1TB Apple Fabric, **17.5 GB\u002Fs sequential read** (measured)\n- **macOS**: 26.2 (Darwin 25.2.0)\n\n## Architecture\n\nThe model has 60 transformer layers: 45 GatedDeltaNet (linear attention) + 15 standard full attention. Each layer has 512 experts, of which K=4 are activated per token (plus one shared expert). Hidden dimension is 4096.\n\n### Key Techniques\n\n1. **SSD Expert Streaming** — Expert weights (209GB at 4-bit) are read from NVMe SSD on demand via parallel `pread()` with GCD dispatch groups. Only the K=4 active experts per layer are loaded (~6.75MB each). The OS page cache manages caching — no custom cache needed (\"Trust the OS\" principle). Inspired by Apple's \"LLM in a Flash\" paper.\n\n2. **FMA-Optimized Dequant Kernel** — The inner loop of the 4-bit dequantized matrix-vector multiply rearranges the math from `(nibble * scale + bias) * x` to `fma(nibble, scale*x, bias*x)`. Pre-computing `scale*x` and `bias*x` lets the GPU fused multiply-add unit do dequant+multiply in one instruction. 12% faster than the naive formulation.\n\n3. **Metal Compute Shaders** — Hand-written Metal kernels for:\n   - 4-bit and 2-bit dequantized matrix-vector multiply (tiled, SIMD-reduced, shared input cache, FMA-optimized)\n   - Fused SwiGLU activation\n   - RMS normalization (two-pass: sum-of-squares reduction + apply)\n   - Batched GPU attention (Q@K^T, softmax, scores@V) for full attention layers\n   - GPU RoPE (fused with Q deinterleave and K normalization)\n   - MoE combine + residual + sigmoid gate (fused kernel)\n\n4. 
**Deferred GPU Expert Compute** — CMD3 (expert forward pass) is submitted without waiting. The GPU executes it while the CPU prepares the next layer. The combine + residual + norm are also on GPU, feeding directly into the next layer's attention projections.\n\n5. **Accelerate BLAS for Linear Attention** — The GatedDeltaNet recurrence uses `cblas_sscal`, `cblas_sgemv`, and `cblas_sger` for the 64-head × 128×128 state matrix update. 64% faster than scalar code.\n\n6. **Trust the OS** — No custom expert cache. The OS page cache (~35GB) manages expert data caching via standard LRU. Every custom caching approach we tested (Metal LRU, malloc cache, LZ4 compressed cache) was slower due to GPU memory pressure or overhead. The page cache achieves ~71% hit rate naturally.\n\n### Pipeline Per Layer (4.28ms average at 4-bit)\n\n```\nCMD3(prev) → CMD1: attention projections + delta-net  [1.22ms GPU]\n           → CPU: flush results                       [0.01ms CPU]\n           → CMD2: o_proj + norm + routing + shared    [0.55ms GPU]\n           → CPU: softmax + topK routing               [0.003ms]\n           → I\u002FO: parallel pread K=4 experts           [2.41ms SSD]\n           → CMD3: expert forward + combine + norm     [0.04ms encode, DEFERRED]\n```\n\n### Unified Memory Constraint\n\nOn Apple Silicon, SSD DMA and GPU compute share the same memory controller and cannot be profitably overlapped. The GPU's dequant kernels are bandwidth-saturated at ~418 GiB\u002Fs. Even small background SSD DMA causes disproportionate GPU latency spikes through memory controller arbitration. 
The serial pipeline (GPU → SSD → GPU) is hardware-optimal.\n\n## Quick Start\n\n```bash\ncd metal_infer\nmake\n# 4-bit inference (needs packed_experts\u002F directory)\n.\u002Finfer --prompt \"Explain quantum computing\" --tokens 100\n\n# 2-bit inference (faster but breaks tool calling)\n.\u002Finfer --prompt \"Explain quantum computing\" --tokens 100 --2bit\n\n# Interactive chat with tool calling\n.\u002Fchat\n\n# Per-layer timing breakdown\n.\u002Finfer --prompt \"Hello\" --tokens 20 --timing\n```\n\n## Project Structure\n\n```\nmetal_infer\u002F\n  infer.m              # Complete inference engine (~7000 lines)\n  shaders.metal        # Metal compute kernels (~1200 lines)\n  chat.m               # Interactive chat TUI with tool calling\n  tokenizer.h          # C BPE tokenizer (single-header, 449 lines)\n  main.m               # MoE-only benchmark\n  Makefile             # Build system\n  extract_weights.py   # Creates model_weights.bin from safetensors\n  repack_experts_2bit.py  # 4-bit → 2-bit expert requantization\n  train_predictor.py   # Expert routing prediction analysis\n  model_weights.bin    # Non-expert weights (5.5GB, mmap'd)\n  model_weights.json   # Tensor manifest\n  vocab.bin            # Vocabulary for token decoding\n  tokenizer.bin        # Pre-exported BPE tokenizer data\n\nrepack_experts.py      # 4-bit expert packing from safetensors\nprogress.py            # Results visualization (Q2\u002FQ4 tracks)\nresults.tsv            # Experiment log (58 experiments)\n```\n\n## What We Tried (and What Worked)\n\n### Kept\n| Approach | Result | Impact |\n|----------|--------|--------|\n| FMA dequant kernel | GPU compute -12% | **+12% tok\u002Fs** |\n| Trust OS page cache | Deleted Metal LRU → +38% | **Foundational** |\n| GPU combine+norm in CMD3 | Eliminates CPU round-trip | **Pipeline** |\n| BLAS delta-net (Accelerate) | cpu_attn 0.78→0.28ms | **+64% attn** |\n| F_NOCACHE for 2-bit | +3% from avoiding page thrash | **2-bit only** |\n| GPU fused 
attention (RoPE) | +2% for full-attn layers | **Small** |\n| C BPE tokenizer | 180ms vs 3500ms startup | **20x startup** |\n| Deferred CMD3 execution | GPU\u002FCPU overlap | **Pipeline** |\n\n### Discarded (58 experiments, highlights)\n| Approach | Result | Why |\n|----------|--------|-----|\n| LZ4 expert compression | -13% | Decompress overhead > warm cache savings |\n| F_RDADVISE prefetch | net 0% | Unified memory: SSD DMA slows GPU -73% |\n| Temporal expert prediction | -18% | 25% hit rate, SSD bandwidth waste |\n| MLP routing predictor | 31% accuracy | Worse than temporal baseline |\n| GPU LUT dequant kernel | -2% | Indirect register access serializes |\n| GPU private buffer compression | -20% pipeline | Blit cost 4×7MB > matvec savings |\n| Spin-poll GPU wait | -23% | CPU thermal competes with GPU |\n| Expert file clustering | 0% | NVMe ignores scatter at 7MB granularity |\n| dispatch_io | -70% | dispatch_data management overhead |\n| mmap expert files | -5x | Per-page fault overhead on cold data |\n| Speculative early routing | -38% | Cache pollution + overhead |\n| MTP speculative decoding | break-even | MoE I\u002FO scales per-token (unlike dense) |\n\n## Safety\n\nThis is a primary development machine. The engine explicitly controls memory:\n- Non-expert weights: 5.5GB (mmap'd, read-only)\n- Metal scratch buffers: ~200MB\n- Total: ~6GB, leaving 42GB for OS + page cache\n- No OOM risk. Expert data streams from SSD on demand.\n- No custom caches. Trust the OS.\n","Flash-MoE is a project that runs large-scale models on an ordinary laptop. It uses a pure C\u002FMetal inference engine to run the 397-billion-parameter Qwen3.5-397B-A17B Mixture-of-Experts model on a MacBook Pro with 48GB RAM at more than 4.4 tokens per second, with production-grade features such as tool calling. The project streams the entire 209GB model directly from SSD through a custom Metal compute pipeline and uses 4-bit or 2-bit quantization and an FMA-optimized kernel to improve efficiency and performance. It suits scenarios that need to run large language models efficiently in resource-constrained environments, such as personal development, research experiments, or small-business deployments.",2,"2026-06-11 03:50:53","high_star"]