[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-74800":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":16,"stars7d":17,"stars30d":18,"stars90d":15,"forks30d":15,"starsTrendScore":19,"compositeScore":20,"rankGlobal":9,"rankLanguage":9,"license":21,"archived":22,"fork":22,"defaultBranch":23,"hasWiki":24,"hasPages":22,"topics":25,"createdAt":9,"pushedAt":9,"updatedAt":26,"readmeContent":27,"aiSummary":28,"trendingCount":15,"starSnapshotCount":15,"syncStatus":29,"lastSyncTime":30,"discoverSource":31},74800,"ANE","maderix\u002FANE","maderix","Training neural networks on Apple Neural Engine via reverse-engineered private APIs",null,"Objective-C",6705,926,65,9,0,17,19,52,51,99.6,"MIT License",false,"main",true,[],"2026-06-12 04:01:15","# ANE Training — Backpropagation on Apple Neural Engine\r\n\r\nTraining neural networks directly on Apple's Neural Engine (ANE) via reverse-engineered private APIs. No CoreML training APIs, no Metal, no GPU — pure ANE compute.\r\n\r\n## Project Scope & Intent\r\n\r\nI'm genuinely grateful for all the attention this project has received — I never expected a weekend research hack to blow up like this. Thank you to everyone who starred, forked, ran benchmarks on their own hardware, and shared the work. It means a lot.\r\n\r\nThat said, I want to set clear expectations about what this project is and isn't.\r\n\r\nThis is a **research project**, not a production framework.\r\n\r\nThe goal was to demonstrate that **training on the Apple Neural Engine — and potentially other NPUs — is possible**, and that the barrier has always been software support, not hardware capability. The ANE is a remarkably capable piece of silicon that Apple restricts to inference-only use through CoreML. 
This project bypasses that restriction using reverse-engineered private APIs to show what's possible when you give the hardware a chance.\r\n\r\n### What This Project Is\r\n\r\n- A proof of concept for ANE training via `_ANEClient` and `_ANECompiler` private APIs\r\n- A set of benchmarks documenting real ANE performance characteristics (throughput, power, SRAM behavior)\r\n- A reference for anyone exploring direct ANE access outside CoreML\r\n- Research code that I update when I find something interesting\r\n\r\n### What This Project Is Not\r\n\r\n- A maintained framework or library\r\n- A replacement for CoreML, MLX, llama.cpp, or any production inference stack\r\n- A path to training large models on consumer hardware (yet)\r\n\r\n### On The Hype\r\n\r\nSome coverage of this project has overstated its implications. To be clear:\r\n\r\n- Training works, but utilization is low (~5-9% of peak) with significant engineering challenges remaining\r\n- Many element-wise operations still fall back to CPU\r\n- This does **not** replace GPU training for anything beyond small research models today\r\n\r\nThe honest results — including all limitations — are documented in the accompanying articles:\r\n- [Part 1: Reverse Engineering](https:\u002F\u002Fmaderix.substack.com\u002Fp\u002Finside-the-m4-apple-neural-engine)\r\n- [Part 2: Benchmarks](https:\u002F\u002Fmaderix.substack.com\u002Fp\u002Finside-the-m4-apple-neural-engine-615)\r\n- [Part 3: Training](https:\u002F\u002Fmaderix.substack.com\u002Fp\u002Finside-the-m4-apple-neural-engine-c8b)\r\n\r\n### On Maintenance\r\n\r\nI don't intend to grow this into a large community project. 
My focus is on original research (compiler infrastructure for edge AI optimization), and maintaining an open-source framework takes time away from that.\r\n\r\nThat said:\r\n- I'll keep pushing updates when I discover something interesting\r\n- Bug fixes and benchmark contributions (especially on hardware I don't own) are welcome\r\n- Feature requests will likely go unaddressed — but feel free to fork\r\n- PRs will be merged at a relatively slow pace, otherwise I become the bottleneck for community growth around this tech\r\n\r\n### Fork it, build on it\r\n\r\nThis is MIT licensed for a reason. Everyone now has access to AI-assisted development tools that can adapt and extend code in hours. If this project is useful to you — take it, modify it, build something better. If you do something cool with it, I'd love to hear about it. If the community later decides to maintain a single source-of-truth repo, I fully support that.\r\n\r\n---\r\n\r\n## What This Is\r\n\r\nA from-scratch implementation of transformer training (forward + backward pass) running on the ANE in Apple Silicon. The ANE is a 15.8 TFLOPS FP16 (M4) inference accelerator that Apple does not expose for training. 
This project reverse-engineers the `_ANEClient` \u002F `_ANECompiler` private APIs and the MIL (Model Intermediate Language) format to run custom compute graphs — including backpropagation — directly on ANE hardware.\r\n\r\n**Current results:**\r\n\r\n| Model | Params | ms\u002Fstep | Pipeline |\r\n|-------|--------|---------|----------|\r\n| Stories110M (12L, dim=768, MHA 12\u002F12) | 109M | **91 ms** | Dynamic (no recompile) |\r\n| Qwen3-0.6B (28L, dim=1024, GQA 16\u002F8) | 596M | **412 ms** | Dynamic (no recompile) |\r\n\r\n- All forward and backward dx passes on ANE, dW gradients on CPU (Accelerate cblas)\r\n- Adam optimizer, gradient accumulation, checkpoint\u002Fresume via exec() restart\r\n- GQA (Grouped-Query Attention) support with per-head tiling\u002Freduction\r\n- GPU↔ANE zero-copy pipeline via shared IOSurface (GPU prefill → ANE decode)\r\n\r\n**INT8 W8A8 quantization — 1.88x throughput (M4, H16G):**\r\n\r\n| Config | FP16 | INT8 W8A8 | Speedup |\r\n|--------|------|-----------|---------|\r\n| 128x conv 512ch 64x64 | 18.6 TOPS, 14.8ms | 35.1 TOPS, 7.8ms | **1.88x** |\r\n| 64x conv 512ch 64x64 | 18.4 TOPS, 7.5ms | 34.1 TOPS, 4.0ms | **1.85x** |\r\n\r\nINT8 activations halve L2 SRAM bandwidth between tiles via MIL `quantize`\u002F`dequantize` ops. 
Weights use `constexpr_affine_dequantize` (int8 stored, fp16 at compile time).\r\n\r\n## Architecture\r\n\r\nThe dynamic pipeline uses shared ANE kernels with weights packed into spatial dimensions (no recompilation when weights change):\r\n\r\n**MHA models (Stories110M) — 6 kernels per layer:**\r\n\r\n| Kernel | Function |\r\n|--------|----------|\r\n| `sdpaFwd` | QKV projection + SDPA + output projection |\r\n| `ffnFused` | SwiGLU FFN (W1, W3, SiLU, W2) |\r\n| `ffnBwdW2t` \u002F `ffnBwdW13t` | FFN backward (split for memory) |\r\n| `sdpaBwd1` \u002F `sdpaBwd2` | SDPA backward |\r\n\r\n**GQA models (Qwen3-0.6B) — 10 kernels per layer:**\r\nAdds separate `woFwd`, `qBwd`, `kvBwd` kernels for grouped-query attention (Q_DIM ≠ DIM).\r\n\r\nCPU handles: RMSNorm forward\u002Fbackward, residual connections (DeepNet α scaling), loss computation, dW gradient accumulation (cblas_sgemm), Adam optimizer updates.\r\n\r\nKey optimizations:\r\n- **Channel-first CPU layout** — matches ANE IOSurface `[1,C,1,S]` format, eliminates all transpose overhead\r\n- **vDSP vectorized RMSNorm** — 10x faster than naive (6.7ms → 0.7ms)\r\n- **GCD async cblas overlap** — dW gradient sgemms run in parallel with ANE evals on a serial dispatch queue\r\n- **Deferred cblas wait** — wait pushed into next step's forward pass for maximum overlap\r\n- **ANE RMSNorm fusion** — RMSNorm folded into forward kernels as MIL ops (reduce_sum + pow + mul)\r\n- **Wo^T fusion** — output projection backward merged into SDPA backward kernel\r\n- **Forward taps** — Q, K, V, attention scores, hidden states exposed via concat outputs, avoiding CPU recompute\r\n- **exec() restart** — bypasses ~119 ANE compile limit per process\r\n\r\n## File Structure\r\n\r\n```\r\n├── api_exploration.m           # Initial ANE API discovery\r\n├── inmem_basic.m               # In-memory MIL compilation proof-of-concept\r\n├── inmem_bench.m               # ANE dispatch latency benchmarks\r\n├── inmem_peak.m                # Peak TFLOPS 
measurement (2048x2048 matmul)\r\n├── ane_int8_bench.m            # INT8 W8A8 vs FP16 throughput benchmark\r\n├── sram_bench.m                # ANE SRAM bandwidth probing\r\n├── sram_probe.m                # SRAM size\u002Flayout exploration\r\n├── gpu_ane_share.m             # GPU↔ANE zero-copy IOSurface demo\r\n├── gpu_prefill_ane_decode.m    # GPU prefill → ANE decode pipeline\r\n├── bridge\u002F\r\n│   ├── ane_bridge.h            # C-callable ANE API (compile, eval, I\u002FO)\r\n│   ├── ane_bridge.m            # Bridge implementation (int8 + fp16 weight blobs)\r\n│   └── Makefile\r\n└── training\u002F\r\n    ├── ane_runtime.h           # ANE private API wrapper (compile, eval, IOSurface)\r\n    ├── ane_classifier.h        # Classifier fwd (32K conv), softmax, rmsnorm on ANE\r\n    ├── train_large.m           # Static pipeline (weights as constants, recompiles)\r\n    ├── training_dynamic\u002F\r\n    │   ├── train.m             # Dynamic training loop (model-agnostic)\r\n    │   ├── config.h            # Derived sizes, structs, alloc helpers\r\n    │   ├── mil_dynamic.h       # MIL generators for dynamic weight kernels (GQA-aware)\r\n    │   ├── io.h                # IOSurface I\u002FO, weight staging, GQA tile\u002Freduce\r\n    │   ├── models\u002F\r\n    │   │   ├── stories110m.h   # Stories110M config (12L, MHA)\r\n    │   │   └── qwen3_06b.h    # Qwen3-0.6B config (28L, GQA)\r\n    │   └── Makefile\r\n    ├── dashboard.py            # Live training dashboard (blessed TUI)\r\n    └── Makefile\r\n```\r\n\r\n## Training Data\r\n\r\nTraining requires pretokenized TinyStories data. 
To download:\r\n```bash\r\ncd training && bash download_data.sh\r\n```\r\nSee [training\u002FREADME.md](training\u002FREADME.md) for detailed training instructions.\r\n\r\n## Building\r\n\r\nRequires macOS 15+ on Apple Silicon (tested on M4).\r\n\r\n```bash\r\n# Dynamic pipeline (recommended) — model selected at build time\r\ncd training\u002Ftraining_dynamic\r\nmake MODEL=stories110m    # Stories110M (12L, MHA, 109M params)\r\nmake MODEL=qwen3_06b      # Qwen3-0.6B (28L, GQA, 596M params)\r\n.\u002Ftrain --scratch          # train from random init\r\n.\u002Ftrain --resume           # resume from checkpoint\r\n\r\n# Static pipeline (legacy — recompiles weights each step)\r\ncd training && make train_large\r\n.\u002Ftrain_large ane_stories110M_ckpt.bin 256 100 1e-4\r\n\r\n# INT8 benchmark\r\nxcrun clang -O2 -fobjc-arc -framework Foundation -framework IOSurface -ldl \\\r\n  -o ane_int8_bench ane_int8_bench.m\r\n.\u002Fane_int8_bench\r\n\r\n# Bridge library (C-callable ANE API)\r\ncd bridge && make\r\n```\r\n\r\nNo external dependencies. Uses only system frameworks + private ANE APIs resolved at runtime via `objc_msgSend`.\r\n\r\n## How It Works\r\n\r\n1. **MIL generation** — Objective-C code constructs MIL program text at runtime, specifying convolutions (for linear layers), matmul (for attention), softmax, element-wise ops\r\n2. **In-memory compilation** — `_ANEInMemoryModelDescriptor` compiles MIL text + weight blobs directly to ANE programs, no disk mlmodelc needed\r\n3. **IOSurface I\u002FO** — Input\u002Foutput tensors passed via IOSurface shared memory in `[1, channels, 1, spatial]` format (fp16 or fp32; fp16 direct I\u002FO is ~37% faster)\r\n4. **Dynamic weights** — Activations and weights packed into a single spatial input dimension, sliced apart inside the MIL kernel. Weights change without recompilation.\r\n5. 
**Gradient flow** — Forward taps expose intermediates needed for backward; backward kernels compute dx (input gradients) on ANE; dW (weight gradients) computed on CPU via cblas\r\n6. **INT8 quantization** — `constexpr_affine_dequantize` for int8 weights, `quantize`\u002F`dequantize` between layers for int8 activation caching in L2 SRAM (1.88x throughput)\r\n\r\n## Limitations\r\n\r\n- **SDPA causal masking** — ANE hardware ignores `attn_mask` in SDPA ops; causal attention is decomposed into separate Q@K^T (ANE) → mask+softmax (CPU) → scores@V (ANE)\r\n- **~119 compile limit** — ANE compiler leaks resources; worked around via `exec()` restart with checkpoint\r\n- **FP16 gradient underflow** — backward matmuls underflow in fp16; fixed with global loss scaling (`256 * NLAYERS`)\r\n- **Single-input constraint** — multi-input ANE requests cause 0x1d error; inputs packed into spatial dimension instead\r\n\r\n## Performance\r\n\r\n**Training throughput (M4):**\r\n\r\n| Model | Params | ms\u002Fstep | Layers | Kernels\u002Flayer |\r\n|-------|--------|---------|--------|---------------|\r\n| Stories110M | 109M | 91 ms | 12 | 6 (MHA) |\r\n| Qwen3-0.6B | 596M | 412 ms | 28 | 10 (GQA) |\r\n\r\n**ANE peak throughput (M4, H16G):**\r\n\r\n| Precision | Peak TOPS | Config |\r\n|-----------|-----------|--------|\r\n| FP16 | 18.6 | 128x conv 512ch 64x64 |\r\n| INT8 W8A8 | 35.1 | 128x conv 512ch 64x64 |\r\n\r\n**GPU↔ANE inference pipeline (M4, seq=256):**\r\n\r\n| Model | GPU Prefill | ANE Decode | Total |\r\n|-------|------------|------------|-------|\r\n| Stories110M | 6.7ms | 1.9ms | 8.8ms |\r\n| Qwen3-0.6B | 9.7ms | 2.3ms | 12.0ms |\r\n\r\n## Disclaimer\r\n\r\nThis project uses Apple's private, undocumented APIs (`_ANEClient`, `_ANECompiler`, `_ANEInMemoryModelDescriptor`). These APIs are not covered by any public stability guarantee and may change or break with any macOS update. 
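This is independent research into Apple Neural Engine architecture, using APIs discovered through runtime introspection for research and educational purposes under fair use and interoperability provisions (see *Sega v. Accolade*, 1992; DMCA §1201(f)). No Apple proprietary code or binaries are included in this repository. This project is not affiliated with or endorsed by Apple Inc. Use at your own risk.\r\n\r\n## License\r\n\r\nMIT — see [LICENSE](LICENSE)\r\n\r\n---\r\n\r\n*Built by a human + Claude, one weekend at a time.*\r\n\r\n\r\n","This project trains neural networks directly on the Apple Neural Engine (ANE) through reverse-engineered private APIs. Its core function is to use Objective-C to access the ANE directly, bypassing the software barrier by which Apple restricts the ANE to inference only, and demonstrating the ANE's potential for training tasks. The project includes a proof of concept based on the _ANEClient and _ANECompiler private APIs, a set of benchmarks documenting the ANE's real performance characteristics, and a reference for researchers who want to explore direct ANE access. Although hardware utilization is currently low and many technical challenges remain, such as some element-wise operations still falling back to the CPU, the project is well suited to researchers exploring the training capabilities of the ANE and similar neural processing units (NPUs) in non-production settings.",2,"2026-06-11 03:50:53","high_star"]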