[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-85138":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":15,"stars7d":15,"stars30d":15,"stars90d":15,"forks30d":15,"starsTrendScore":15,"compositeScore":16,"rankGlobal":10,"rankLanguage":10,"license":10,"archived":17,"fork":17,"defaultBranch":18,"hasWiki":19,"hasPages":17,"topics":20,"createdAt":10,"pushedAt":10,"updatedAt":21,"readmeContent":22,"aiSummary":10,"trendingCount":15,"starSnapshotCount":15,"syncStatus":23,"lastSyncTime":24,"discoverSource":25},85138,"gateGPT","fguzman82\u002FgateGPT","fguzman82","Full Transformer into a custom chip. microGPT in RTL, generating names on a Virtex-5 FPGA at ~56k tokens\u002Fsecond.","",null,"Verilog",237,45,134,0,37.77,false,"main",true,[],"2026-06-15 10:04:46","# gateGPT\n\n**gateGPT** is a hardware (RTL) implementation of [Andrej Karpathy's microGPT](https:\u002F\u002Fkarpathy.github.io\u002F2026\u002F02\u002F12\u002Fmicrogpt\u002F)\n— a small character-level GPT — running entirely on a **Xilinx Virtex-5** FPGA (XC5VLX110T, XUPV5 \u002F\nML509 board, ISE 14.7, Verilog-2001), here trained to generate names. The model (one transformer\nblock: RMSNorm → multi-head causal attention → MLP, in Q5.11 fixed point) executes as a\n**microcode-ROM sequencer** driving modular datapath actuators over a shared dual-port scratchpad;\n**incremental decoding with a persistent KV cache** computes only the new token's K\u002FV each step and\nattends over the cached context, instead of recomputing the whole window. 
It generates names on the\nboard's character LCD at **around 50,000 tokens\u002Fsecond at 80 MHz**, while a rotary encoder sets the\ngeneration speed and the sampling temperature.\n\nThis is an independent design — the RTL, the fixed-point spec, the microcode ISA, and the trained\nweights are all our own. Throughput improved **28×** over the first working version (from ~2.4k to\n**~50–69k tokens\u002Fsecond**, depending on context length), all bit-exact to a Python reference and\nconfirmed generating names on the board (closing timing post place-and-route at 80 MHz).\n\n---\n\n## Architecture\n\nThe inference core is a **microcode-ROM sequencer driving modular datapath actuators** — not\na hand-coded monolithic state machine. A small program ROM (`generated\u002Fucode.hex`, produced by\n`tools\u002Fucode_asm.py`) encodes the transformer schedule as macro-ops; a micro-PC fetches one per\nstep, starts the matching actuator, and waits for `done`. Actuators share a true dual-port\nactivation scratchpad (`vmem`, one Block RAM) that also holds the persistent KV cache.\n\n```mermaid\nflowchart TB\n    IN([token_in, pos_in]) --> PC\n\n    subgraph SEQ[\"microcode sequencer\"]\n        direction LR\n        PC[\"micro-PC\"] --> UROM[\"ucode ROM\u003Cbr\u002F>(macro-ops)\"] --> DEC[\"decode \u002F\u003Cbr\u002F>actuator select\"]\n    end\n\n    DEC -->|start one| ACT\n\n    subgraph ACT[\"datapath actuators — one active per step\"]\n        direction LR\n        EMB[\"embed\"]\n        NRM[\"norm\u003Cbr\u002F>(RMSNorm)\"]\n        MV[\"matvec\u003Cbr\u002F>24x2 MAC tile\"]\n        ATT[\"attn\u003Cbr\u002F>multi-head\"]\n        VEC[\"vecop\u003Cbr\u002F>add \u002F ReLU\"]\n        SMP[\"sampler\u003Cbr\u002F>softmax + LCG\"]\n    end\n\n    WROM[(\"weight ROM\u003Cbr\u002F>(wrom)\")] --> MV\n    GROM[(\"gain ROM\u003Cbr\u002F>(grom)\")] --> NRM\n\n    ACT \u003C-->|\"port A + port B\"| VMEM[(\"vmem — true dual-port BRAM\u003Cbr\u002F>working set + persistent KV 
cache\")]\n\n    SMP --> OUT([next_token, rng_out])\n```\n\nDatapath actuators (`core\u002F`):\n\n| Module | Role |\n|---|---|\n| `matvec` | parallel multiply-accumulate tile — the linear projections (24 lanes × 2 columns\u002Fcycle) |\n| `norm` | RMSNorm (`udiv` + `isqrt` primitives), 2 elements\u002Fcycle on the dual-port vmem |\n| `attn` | single-position multi-head causal attention with per-head parallel dividers |\n| `exp_unit` | fixed-point `exp` via table + linear interpolation |\n| `sampler` | temperature softmax + LCG categorical sampling, or greedy argmax |\n| `embed`, `vecop` | embedding lookup, residual add \u002F ReLU |\n| `wrom`, `grom`, `vmem2` | wide weight ROMs, RMSNorm gains, true dual-port activation scratchpad |\n\n**Model:** 1 transformer block, `n_embed=24`, 4 heads × head-dim 6, MLP hidden 96, context 16,\nvocabulary 27 (`.` + `a`–`z`). All arithmetic is signed **Q5.11** fixed point (FRAC=11). The\nPython integer reference (`tools\u002Ffixedpoint.py`) is the bit-exact specification the RTL matches.\n\n| Parameter | Value |\n|---|---|\n| Blocks \u002F heads \u002F head-dim | 1 \u002F 4 \u002F 6 |\n| Embedding \u002F MLP hidden | 24 \u002F 96 |\n| Context (block size) \u002F vocab | 16 \u002F 27 |\n| Number format | Q5.11 signed 16-bit |\n| RNG \u002F divide | 32-bit LCG \u002F truncate-toward-zero |\n| RMSNorm | integer `isqrt` + reciprocal |\n| `exp` | 17-entry table + linear interpolation |\n\n---\n\n## Results — the optimization journey\n\nEvery step below is **bit-exact** to the Python reference (greedy `alaya`, sampled `rosphod`\nat seed 2, T=0.7) and verified in the iSim oracle. 
Throughput is per-token at the board clock.\n\n| # | Stage | Key change | Cycles\u002Ftoken | tok\u002Fs @ 80 MHz | LUT | DSP | Status |\n|---|---|---|---:|---:|---:|---:|---|\n| 0 | First core | microcode core, recompute full 16-tok context | 32,872 | 2,433 | 8.6k | 15 | 33 MHz board |\n| 1 | Timing rework | vmem→BRAM (registered read), read-ahead, pipelining | 32,872 | 2,433 | ~9k | 15 | **80 MHz** board |\n| 2 | KV cache | incremental decode, absolute positions, persistent K\u002FV | 10,192 | 7,849 | ~9k | 15 | 80 MHz |\n| 3 | Parallel MAC | 24-lane systolic matvec tile | 2,757 | 29,016 | 14k | 35 | 80 MHz |\n| 4 | Parallel attn dividers | per-head concurrent softmax divides | 1,781 | 44,919 | 14k | 35 | **80.2 MHz** board |\n| 5 | radix-4 `udiv` | divider does 2 quotient bits\u002Fcycle | 1,541 | 51,914 | – | – | 80 MHz |\n| 6 | narrow `isqrt` + matvec writeback overlap | 32-bit isqrt; writeback hides behind next tile | 1,428 | 56,022 | 17k | 35 | **80 MHz** board |\n| 7 | dual-port vmem + RMSNorm 2×\u002Fcycle | true dual-port BRAM scratchpad | 1,356 | 58,997 | – | – | (intermediate) |\n| 8 | matvec 2 cols\u002Fcycle + 2 rows\u002Fcycle writeback | double-width weight ROM, dual-port reads\u002Fwrites | 1,145 | 69,869 | 16.7k | 62 | needed pipelining |\n| 9 | **operand pipeline (final)** | extra register stage before the multiply closes timing | 1,156 | **69,204** | 15.5k | 62 | **80 MHz** board ✅ |\n\n**Throughput, final design @ 80 MHz** (bit-exact, post-PAR closed at 12.461 ns, 0 timing errors):\n\n| Metric | Cycles\u002Ftoken | tok\u002Fs |\n|---|---:|---:|\n| First token (best case) | 1,156 | ~69,200 |\n| Average over a full name | 1,321 | ~60,600 |\n| Longest-context token | 1,488 | ~53,800 |\n\n### FPGA resource utilization\n\nFull board (inference core + LCD driver + rotary control + tok\u002Fs meter + DCM) on the\n**XC5VLX110T-1 FF1136**, post-PAR at 80 MHz (min period 12.458 ns, 0 timing errors):\n\n| Resource | Used | Available | Util. 
|\n|---|---:|---:|---:|\n| Slice LUTs | 16,548 | 69,120 | 23% |\n| &nbsp;&nbsp;— as logic | 16,427 | 69,120 | 23% |\n| &nbsp;&nbsp;— as distributed RAM | 56 | 17,920 | \u003C1% |\n| Slice Registers (FF) | 5,530 | 69,120 | 8% |\n| Occupied Slices | 5,362 | 17,280 | 31% |\n| **DSP48E** | **62** | **64** | **96%** |\n| Block RAM (RAMB36) | 2 | 148 | 1% |\n| BUFG | 2 | 32 | 6% |\n| DCM_ADV | 1 | 12 | 8% |\n| Bonded IOBs | 29 | 640 | 5% |\n\n**DSP is the binding resource** — the 24-lane × 2-column matvec tile uses 48 of the 62. Everything\nelse is comfortable (≤31%). The activation scratchpad + KV cache fit in a single dual-port Block RAM;\nthe weight\u002Fembedding\u002Fmicrocode ROMs are LUT-baked constants (see the bring-up note below).\n\n### Logic-gate estimate\n\nFPGA resources don't map 1:1 to ASIC gates, but converting each primitive to **2-input-NAND\nequivalents** (factors in parentheses) puts the whole design's complexity in perspective:\n\n| Element | Count | × gates\u002Felem | Gate-equiv |\n|---|---:|---:|---:|\n| Logic LUT6 | 16,427 | × 12 | ~197,000 |\n| Flip-flops | 5,530 | × 6 | ~33,000 |\n| DSP48E (as a 16×16 MAC) | 62 | × 3,500 | ~217,000 |\n| **Total logic** | | | **≈ 450,000 (~0.45 M) gates** |\n| Block RAM (SRAM) | 2 × 36 Kb | (memory) | ~74 Kbit on-chip |\n\nSo the active design is on the order of **~0.45 million NAND2-equivalent gates** — the LUT fabric and\nthe DSP multipliers each contribute about half — plus ~74 Kbit of on-chip SRAM (the activation\nscratchpad + KV cache). This is a *rough* figure: LUT- and DSP-to-gate conversions vary by roughly\n±2×, and FPGA logic doesn't translate cleanly to a standard-cell count.\n\n---\n\n## Key engineering lessons\n\n- **KV cache is the single biggest win** (3.2×): recomputing the whole context every token is\n  the dominant cost in a naïve decoder. 
Switching to absolute-position training enabled it.\n- **Post-synthesis Fmax lies under congestion.** A 2-columns\u002Fcycle matvec reported 88 MHz\n  post-synth but collapsed to 35 MHz post-PAR — because a mis-written dual-port template made\n  XST infer the 1024×16 scratchpad as **16,384 flip-flops** instead of a Block RAM (look for\n  `N flip-flops were inferred for signal \u003Cmem>` in the HDL report). The fix: **one `always`\n  block per port** for the true-dual-port BRAM template. LUT dropped 46.7k → 16.7k.\n- **Break long BRAM→DSP nets with a register.** The final 0.14 ns to 80 MHz was closed by\n  pipelining the activation\u002Fweight operands one extra stage so the high-fanout BRAM-output net\n  stays off the multiply's critical path.\n- **Exact integer arithmetic is free to parallelize.** radix-4 division and split MAC lanes\n  preserve the floor-divide \u002F saturating results, so the golden never changes.\n\n### Hardware bring-up: two XST 14.7 bugs that pass simulation but hang the board\n\nThe bit-exact iSim golden passed at every step, yet the first board run **hung** (frozen banner,\n`gen_busy` stuck, 0 tok\u002Fs) while the rotary\u002FLEDs still worked. Two XST 14.7 synthesis-vs-sim\ndivergences were the cause — neither shows up in RTL simulation:\n\n- **`$readmemh` ROMs get tied to zero.** XST silently zeroes small `$readmemh` distributed-ROM\n  arrays (look for `Signal \u003Cname> is used but never assigned. Tied to default value` in the `.syr`).\n  This zeroed the **microcode** ROM → the sequencer ran all-NOP, never hit `HALT`, and hung; it\n  also zeroed the weights\u002Fexp\u002Fembeddings → garbage output. `$readmemb` does **not** help (same\n  mechanism). Fix: emit every ROM the core reads as a **combinational `case` function** (explicit\n  constants XST bakes into LUTs reliably) — see `core\u002Fucode_rom.vh`, `wrom_data.vh`, `tok_emb.vh`,\n  `pos_emb.vh`, `exp_data.vh`, `gains.vh`. 
Verify the `.syr` \"tied to default\" list is empty.\n- **A live register can be constant-folded away.** XST trimmed the matvec's tile base `obase` to\n  constant 0 (`has a constant value of 0 ... will be trimmed`), so every **multi-tile** matmul\n  (fc1\u002Flm) looped forever — the core hung at microcode `pc=9` (the fc1 matvec). A `pc`-on-LEDs\n  debug probe localized it. Fix: `(* keep = \"true\" *)` on `obase`\u002F`wbase`. Also avoid bit-selecting\n  an `integer` parameter (`LANES[6:0]`) — assign it to a sized `localparam` first.\n\n**Takeaway:** post-PAR timing closure ≠ a working design. On XST 14.7, never trust `$readmemh` for\nROM init (use `case` functions), and treat \"constant value \u002F tied to default\" warnings as bugs.\nWith both fixed, the board generates names correctly at 80 MHz.\n\n---\n\n## Layout\n\n```\ncore\u002F         independent inference core (RTL) + generated includes (*.vh)\nboard\u002F        XUPV5 top, HD44780 LCD driver, rotary control, tokens\u002Fsec meter, UCF\ntools\u002F        model, training, fixed-point reference, weight\u002Fmicrocode export\ndata\u002F         public makemore names corpus (training data)\ngenerated\u002F    fixed-point weight ROMs (*.hex) + microcode program (ucode.hex)\nsim\u002F          iSim testbenches (per-actuator + end-to-end golden)\n```\n\n## Build & run\n\nTrain and export the model artifacts (Python 3 + numpy + torch):\n\n```bash\npython tools\u002Ftrain.py            # -> tools\u002Fweights.npz\npython tools\u002Fexport.py           # -> generated\u002F*.hex, core\u002Fcore_params.vh, gains.vh\npython tools\u002Fucode_asm.py        # -> generated\u002Fucode.hex, core\u002Fcoremap.vh\n```\n\nSimulate the core against the golden (Xilinx iSim):\n\n```bash\nfuse -incremental -prj tb_core.prj -o sim\u002Ftb_core_sim work.tb_core\n.\u002Fsim\u002Ftb_core_sim -tclbatch sim\u002Fisim_run.tcl     # prints CYCLES_PER_TOKEN + CORE PASS\n```\n\nBuild the board bitstream (ISE 14.7): `xst → ngdbuild → map 
→ par → trce → bitgen` against\n`xupv5_microgpt_top.prj` \u002F `board\u002Fxupv5_microgpt.ucf` for part `xc5vlx110t-1-ff1136`.\n\n## Board\n\nVerified on the XUPV5: names generate and scroll on the LCD at 80 MHz.\n\n- 100 MHz oscillator → DCM CLKFX ×4\u002F5 → **80 MHz** core clock.\n- Names auto-generate; the **rotary encoder** adjusts one of two settings, chosen by\n  **pressing** it:\n  - **RATE** — auto-rotation speed, from ~1 Hz (readable) up to back-to-back (max throughput).\n  - **TEMP** — sampling temperature, `T = 0.5 … 1.2` in 0.1 steps (default 0.7).\n- `led[5]` lights while in TEMP mode. LCD row 1 shows the current name; row 2 shows the active\n  setting (`rate: NNNNN t\u002Fs` measured, or `temp: X.Y`). `led[7]` is a 1 Hz heartbeat.\n",2,"2026-06-15 02:30:07","CREATED_QUERY"]