[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-124":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":14,"stars7d":14,"stars30d":16,"stars90d":15,"forks30d":15,"starsTrendScore":17,"compositeScore":18,"rankGlobal":9,"rankLanguage":9,"license":19,"archived":20,"fork":20,"defaultBranch":21,"hasWiki":22,"hasPages":20,"topics":23,"createdAt":9,"pushedAt":9,"updatedAt":24,"readmeContent":25,"aiSummary":26,"trendingCount":15,"starSnapshotCount":15,"syncStatus":27,"lastSyncTime":28,"discoverSource":29},124,"talos-vs-macbook","AlexCheema\u002Ftalos-vs-macbook","AlexCheema","microGPT benchmarks: a single M4 Max MacBook Pro P-core in C runs Karpathy's 4192-parameter transformer at ~71x the throughput of TALOS-V2's FPGA implementation.",null,"Python",161,13,152,1,0,5,3,3.44,"MIT License",false,"main",true,[],"2026-06-12 02:00:08","# talos-vs-macbook\n\nHave you ever wanted to know whether 50,000 tokens\u002Fsec on a custom FPGA is impressive? It is and it isn't. This repo runs Karpathy's [microGPT](https:\u002F\u002Fgist.github.com\u002Fkarpathy\u002F8627fe009c40f57531cb18360106ce95) — a 4,192-parameter character-level transformer — in five different ways on an M4 Max MacBook Pro and compares them to [TALOS-V2](https:\u002F\u002Fgithub.com\u002FLuthiraa\u002FTALOS-V2)'s 53,000 tok\u002Fsec hardware implementation on a Cyclone V FPGA.\n\nThe model is so small (~17 KB at fp32) that it fits in L1 cache and the whole forward pass is ~4,000 multiply-accumulates per token. That makes the benchmark less about arithmetic and more about *overhead*. 
The interesting question turns out to be: which implementations even *beat* the FPGA?\n\n```\nimplementation                tok\u002Fsec      vs FPGA\n----------------------  --------------  -----------\npure-python                      7,430        0.14x\nnumpy fp32                      40,244        0.76x   \u003C- slower than the FPGA!\nmlx fp32 (cpu)                   9,350        0.18x\nmlx fp32 (gpu)                   3,337        0.06x   \u003C- much slower\nc fp32+NEON                  3,756,165       70.87x\nc Q4.12 fixed-point          3,143,586       59.31x\nTALOS-V2 (FPGA, 56MHz)          53,000        1.00x\n```\n\nA single M4 Max MacBook Pro P-core in well-tuned C does **~71×** the FPGA's throughput. NumPy and MLX both come in *under* the FPGA: their per-call dispatch overhead is bigger than the actual work. MLX-on-GPU is the worst — kernel launch overhead annihilates a 4K-MAC forward pass. lol.\n\n![throughput](charts\u002Fthroughput.png)\n\nAnd on perf-per-watt — assuming ~5 W for one M4 Max P-core under load and ~2 W for the Cyclone V fabric — the MacBook still wins by a wide margin. TALOS sits comfortably above the Python and MLX bars (Python overhead is just wasted power) but the C versions clear it by ~25–30×.\n\n![perf-per-watt](charts\u002Fperf_per_watt.png)\n\n## try it yourself\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FAlexCheema\u002Ftalos-vs-macbook && cd talos-vs-macbook && .\u002Frun.sh\n```\n\nThat's it. The script fetches microGPT's trained weights from upstream, builds the C versions with `clang -O3 -march=native -ffast-math`, and runs all five implementations back-to-back. Takes about 90 seconds total. You only need `python3`, `numpy`, `make`, and `clang` (all already on a stock Mac); MLX is optional (`pip install mlx` if you want those rows).\n\n## what's in here\n\nEach implementation is a single self-contained file. 
No frameworks pulled in past what's strictly needed.\n\n| file | what | lines |\n| --- | --- | --- |\n| `pure_python.py` | Karpathy's reference forward pass, dependency-free Python. The slow baseline. | 130 |\n| `bench_numpy.py` | NumPy fp32, BLAS pinned to 1 thread, KV cache. | 138 |\n| `bench_mlx.py` | Same forward pass in [MLX](https:\u002F\u002Fgithub.com\u002Fml-explore\u002Fmlx), Apple's M-series-tuned framework. CPU and GPU. | 122 |\n| `bench_c.c` | Hand-written C with NEON intrinsics. fp32. The ceiling. | 268 |\n| `bench_c_q412.c` | Same, but with Q4.12 fixed-point matmuls — the exact arithmetic TALOS uses. | 270 |\n| `model.py` | Shared loader + sampler. | 66 |\n\nAbout 1,000 lines total across all five implementations. Same model, same weights, same multinomial sampling, same temperature 0.5, same single-thread batch=1 char-by-char autoregressive setup.\n\n## sample output\n\nEach implementation generates the same kinds of name-like strings. The Python ones (sharing Python's `random.choices`) produce identical output:\n\n```\nsample  1: kana\nsample  2: keelan\nsample  3: alilan\nsample  4: ariel\nsample  5: cairi\nsample  6: mayan\nsample  7: kenia\nsample  8: akalen\nsample  9: danyli\nsample 10: man\n```\n\nRun `python3 pure_python.py --names` (or `bench_numpy.py --names`, `bench_mlx.py --names`, `.\u002Fbench_c --names`, `.\u002Fbench_c_q412 --names`) to see your own.\n\n## why is NumPy slower than the FPGA?\n\nThe model is genuinely tiny. One forward pass is roughly:\n\n- 3 RMSNorms: ~100 FLOPs\n- 4 matmuls of shape (16,16)·(16,): 4 × 256 = 1,024 FMAs\n- attention with up to 16 keys: ~256 FMAs\n- 1 matmul (64,16)·(16,) + 1 matmul (16,64)·(64,): 2,048 FMAs\n- 1 lm_head matmul (27,16)·(16,): 432 FMAs\n\nRound it to ~4,000 multiply-accumulates per token. At single-thread M4 Max MacBook Pro NEON throughput (~16 GFLOPS in scalar fp32, much more with FMA pipelines), the *arithmetic* takes well under a microsecond. 
So if you can dispatch the work in \u003C1 µs you'll fly; otherwise you won't.\n\nNumPy's per-call overhead (Python ↔ C boundary, dtype dispatch, broadcast checks) is on the order of a microsecond per call. With ~25 ops per token × ~1 µs each, you're already at 25 µs\u002Ftoken = 40k tok\u002Fsec, which matches what we measure. The numbers aren't a NumPy weakness; they're a model-too-small situation.\n\nMLX-on-GPU is even worse because Metal kernel launches are tens of microseconds each. Apple silicon is brilliant; it's just not the right tool for a 4,000-MAC workload. This is why people batch.\n\nThe FPGA wins on *absolute* power draw — a Cyclone V on the DE1-SoC pulls maybe 2 W; one M4 Max MacBook Pro P-core under this load is more like 5 W — but with ~71× the throughput at ~2.5× the power, the MacBook wins on perf-per-watt by more than an order of magnitude (~28×) too. The FPGA's real advantages are form factor and deterministic latency: you can run TALOS off a battery on something credit-card sized; you can't run a MacBook there. To match TALOS in C we use about 1.4% of one core's time.\n\n## how the C version works\n\n`bench_c.c` is the interesting one. The trick is that the model is small enough that everything — weights (16 KB), KV cache (2 KB), all activations — fits in L1 D-cache. So the bottleneck is purely instruction throughput.\n\nEach matmul is hand-unrolled. The (R,16)·(16,) shape is perfect for NEON: load the 16-element input vector once into four `float32x4_t` registers, then for each output row compute 4 fused multiply-adds and a horizontal reduce. The (16,64)·(64,) MLP-out matmul fully unrolls the inner 64-element dot product. RMSNorm reduces with `vaddvq_f32`. Sampling is xorshift32 + cumulative scan.\n\nThe Q4.12 version is the same structure but with `int16_t` weights and `vmlal_s16` widening MACs into `int32_t` accumulators, shifted right by 12 between layers. 
RMSNorm and softmax stay in float (TALOS uses LUTs and Newton iterations for these in hardware). Quantization error vs fp32 is ~0.0001 per weight, and several generated names match between the fp32 and Q4.12 versions byte-for-byte.\n\n## extras: M3 Ultra, M1 Max, DGX Spark\n\nSame code, more hardware. The C+NEON path is portable Armv8\u002F9 and builds cleanly on M3 Ultra, M1 Max, and the Grace ARM cores in NVIDIA's [DGX Spark](https:\u002F\u002Fwww.nvidia.com\u002Fen-us\u002Fproducts\u002Fworkstations\u002Fdgx-spark\u002F). The DGX Spark also has a Blackwell GPU on the same package, so we threw in two CUDA implementations as well — a naïve launch-per-op baseline and a fused persistent kernel (`bench_cuda.cu`, `bench_cuda_persistent.cu`, ~560 lines together). The M4 Max story above is unchanged; this section just lays out where the same forward pass lands on other hardware.\n\n```\nimplementation                       tok\u002Fsec      vs FPGA\n-----------------------------  --------------  -----------\nGrace · c fp32+NEON                4,364,405       82.35x\nM4 Max · c fp32+NEON               3,756,165       70.87x\nM3 Ultra · c fp32+NEON             3,632,988       68.55x\nM4 Max · c Q4.12                   3,143,586       59.31x\nGrace · c Q4.12                    3,007,686       56.75x\nM3 Ultra · c Q4.12                 2,935,620       55.39x\nM1 Max · c fp32+NEON               2,910,293       54.91x\nM1 Max · c Q4.12                   2,345,483       44.25x\nBlackwell · cuda persistent          413,603        7.80x\nTALOS-V2 (FPGA, 56MHz)                53,000        1.00x\nGrace · numpy fp32                    41,032        0.77x\nM4 Max · numpy fp32                   40,244        0.76x\nM3 Ultra · numpy fp32                 38,175        0.72x\nM1 Max · numpy fp32                   28,866        0.54x\nBlackwell · cuda fp32 (naive)         19,127        0.36x\nM4 Max · mlx fp32 (cpu)                9,350        0.18x\nM1 Max · mlx fp32 (cpu)              
  9,122        0.17x\nM3 Ultra · pure-python                 8,039        0.15x\nM4 Max · pure-python                   7,430        0.14x\nGrace · pure-python                    6,455        0.12x\nM3 Ultra · mlx fp32 (cpu)              5,407        0.10x\nM1 Max · pure-python                   4,600        0.09x\nM4 Max · mlx fp32 (gpu)                3,337        0.06x\nM1 Max · mlx fp32 (gpu)                2,196        0.04x\nM3 Ultra · mlx fp32 (gpu)              1,785        0.03x\n```\n\n![throughput across all platforms](charts\u002Fthroughput_all.png)\n\nA few observations:\n\n**Apple silicon scales with clock and microarch.** M4 Max 3.76M (~4.4 GHz) > M3 Ultra 3.63M (~4.05 GHz) > M1 Max 2.91M (~3.2 GHz Firestorm). M1 Max's Q4.12 path is ~19% behind its own fp32, vs ~16% on M4 Max, so Apple's `int16` widening MAC path closed the gap only modestly between 2021 and 2024.\n\n**Grace edges out M4 Max by 16% on the same C.** A single Cortex-X925 (3.9 GHz boost) is the wider ARMv9 core; the NEON path is portable Armv8\u002F9, no Apple-specific intrinsics. On Linux glibc the link line needs `-lm -lmvec` because gcc's auto-vectorised `expf` in the attention softmax pulls in libmvec.\n\n**NumPy under the FPGA replicates on Grace.** 41,032 tok\u002Fsec there, despite OpenBLAS on aarch64 Linux being completely different from Apple's Accelerate. The bottleneck is the Python ↔ C boundary, not BLAS, so the number is roughly platform-flat.\n\n**Naïve CUDA loses to the FPGA, persistent CUDA beats it.** ~19k tok\u002Fsec for `bench_cuda.cu` (one launch per matmul \u002F RMSNorm \u002F softmax \u002F sample, with the token id round-tripping host↔device every step) — same launch-overhead pattern as MLX-on-GPU, just on a different launch surface. `bench_cuda_persistent.cu` pins all 4,192 fp32 weights in shared memory and runs the entire timed window in a single launch — 413K tok\u002Fsec, 7.8× the FPGA. 
Still ~6–10× slower than every C row in the table, though.\n\n![perf-per-watt across all platforms](charts\u002Fperf_per_watt_all.png)\n\nOn perf-per-watt — assuming ~3 W per Grace core (estimated, not measured), ~5 W per Apple P-core, 19.96 W for Blackwell during the persistent kernel run (`nvidia-smi --query-gpu=power.draw -lms 100`, n=257), and ~2 W for the Cyclone V — Grace c+NEON wins outright at ~1.45M tok\u002Fsec\u002FW. Apple's C+NEON fp32 paths come next at ~580–750k tok\u002Fsec\u002FW, with M1 Max at the low end. The FPGA at 26.5k still beats Blackwell's 20.7k on watts; the FPGA's 2 W floor wins efficiency even when CUDA wins absolute throughput.\n\n### try it on DGX Spark\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FAlexCheema\u002Ftalos-vs-macbook && cd talos-vs-macbook && .\u002Frun_gx10.sh\n```\n\nBuilds the same Python + C benchmarks plus both CUDA paths. Needs `nvcc` (CUDA 13, `-arch=sm_121` for Blackwell GB10), `gcc 13+`, glibc with `libmvec`, and `python3 + numpy`. The Makefile is gated on `nvcc` being present, so the Apple Silicon path is unchanged.\n\n### how the persistent CUDA kernel works\n\n`bench_cuda_persistent.cu` launches one block of 32 threads — a single warp — and runs the entire timed window in that one launch. On entry the warp co-loads all 4,192 fp32 weights into shared memory and zeros the KV cache. Then a per-token loop runs the full forward pass without leaving the kernel:\n\n- RMSNorm uses warp-shuffle reductions (`__shfl_xor_sync`) — no shared-memory scratch.\n- Each (R, EMBD) matvec is one thread per output row — 16 lanes for the 16-wide outputs, all 32 active when R=64 (each thread handles two MLP rows).\n- Attention runs one thread per head: 4 lanes do the dot-product \u002F softmax \u002F weighted-sum independently across the 4 heads.\n- Sampling is xorshift32 + cumulative scan on lane 0, broadcast back via `__shfl_sync`.\n\nCrucially, the next iteration just continues the loop. 
No relaunch, no host roundtrip, no global memory traffic for activations between tokens. Only the KV cache writes touch shared memory across iterations. The only host involvement during timing is `cudaDeviceSynchronize()` at the very end. The naïve `bench_cuda.cu` loses 22× to this — almost all of that gap is launch overhead, not arithmetic.\n\n### why is Blackwell slower than Grace?\n\nA single Grace core in C+NEON does 4.36M tok\u002Fsec. The persistent CUDA kernel does 413K tok\u002Fsec on Blackwell. **A 10.5× gap on identical work.** The model still fits in cache on both sides, so it's not memory bandwidth.\n\nTwo factors stack. **Clock:** Grace's Cortex-X925 boosts to ~3.9 GHz; a Blackwell SM in GB10 clocks at ~1.5–2 GHz, ~2× behind on raw clock. **Active SIMT lanes:** a single warp does 1 instruction per warp-clock across 32 lanes. On this model that's only 16 lanes during the (R, EMBD) matvecs, 4 during attention, and 1 during the sampler. Effective lane utilisation is roughly 50%, so useful ops per warp-clock end up similar to what one Grace core issues per cycle through its wide out-of-order + 4-lane NEON FMA pipes — but at half the wall-clock rate.\n\n2× clock × 2× IPC + a small per-op `__syncwarp` overhead across ~25 sequential ops ≈ 10×. Matches the measured ratio. The persistent kernel isn't doing anything wrong — single-stream char-by-char inference at this scale just isn't where GPUs live. *Multiple* persistent kernels — one block per SM, ~80 SMs on GB10 — would scale linearly: >30M tok\u002Fsec batched throughput on the same hardware. Different question, different answer.\n\n## todos\n\n- multi-thread version with 12 independent sampling streams (would scale to ~45M tok\u002Fsec, probably)\n- a Metal compute shader version (just to confirm the GPU launch-overhead theory directly on Apple silicon)\n- batched throughput numbers (where MLX would actually shine)\n- multi-stream persistent CUDA: N independent streams, N blocks. 
With ~80 SMs on GB10 and 413k tok\u002Fsec\u002Fstream that's potentially >30M tok\u002Fsec batched throughput.\n- multi-thread C on Grace's 10 X925 cores — extrapolating from 4.36M\u002Fcore, ~40M tok\u002Fsec.\n- direct power measurement on Apple silicon and Grace, to retire the ~5 W and ~3 W per-core estimates\n\n## references\n\n- [TALOS-V2](https:\u002F\u002Fgithub.com\u002FLuthiraa\u002FTALOS-V2) by Luthira Abeykoon, the FPGA implementation we're comparing to. Worth reading; the RTL is genuinely tight.\n- [microGPT](https:\u002F\u002Fgist.github.com\u002Fkarpathy\u002F8627fe009c40f57531cb18360106ce95) by Andrej Karpathy, the 200-line dependency-free transformer + autograd that started this. Trained weights from the TALOS repo.\n- [makemore](https:\u002F\u002Fgithub.com\u002Fkarpathy\u002Fmakemore), the names dataset and the larger family this came from.\n- [MLX](https:\u002F\u002Fgithub.com\u002Fml-explore\u002Fmlx), Apple's array framework.\n- [NVIDIA DGX Spark](https:\u002F\u002Fwww.nvidia.com\u002Fen-us\u002Fproducts\u002Fworkstations\u002Fdgx-spark\u002F) and the [GB10 superchip](https:\u002F\u002Fwww.nvidia.com\u002Fen-us\u002Fdata-center\u002Fgrace-cpu\u002F) — the Grace + Blackwell desktop platform used for the CUDA and Grace ARM rows in the extras section.\n- [Persistent threads \u002F persistent blocks](https:\u002F\u002Fresearch.nvidia.com\u002Fpublication\u002F2012-06_understanding-efficiency-ray-traversal-gpus-kepler-and-fermi-addendum) — the GPU-side pattern that `bench_cuda_persistent.cu` uses. Aila & Laine, NVIDIA Research, 2012.\n- [glibc libmvec](https:\u002F\u002Fsourceware.org\u002Fglibc\u002Fwiki\u002Flibmvec) — auto-vectorized math (`_ZGVnN4v_expf` for AArch64 NEON) that gcc's `-O3 -ffast-math` emits in the attention softmax. 
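Required at link time on Linux: `-lm -lmvec`.\n\n## license\n\nMIT\n","This project compares different implementations of Karpathy's 4,192-parameter microGPT model running on an M4 Max MacBook Pro against the TALOS-V2 FPGA implementation. It runs the model five ways: Python, NumPy, MLX (CPU and GPU), C (fp32+NEON), and C (Q4.12 fixed-point), and evaluates their throughput and performance per watt. The results show that the optimized C version on a single M4 Max P-core far outperforms the FPGA implementation, reaching roughly 71x its throughput. The project is aimed at researchers and developers who want to understand how different programming languages and hardware platforms affect the performance of small neural networks.",2,"2026-06-11 02:31:01","CREATED_QUERY"]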