[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-74710":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":15,"stars30d":18,"stars90d":16,"forks30d":16,"starsTrendScore":19,"compositeScore":20,"rankGlobal":10,"rankLanguage":10,"license":21,"archived":22,"fork":22,"defaultBranch":23,"hasWiki":24,"hasPages":22,"topics":25,"createdAt":10,"pushedAt":10,"updatedAt":35,"readmeContent":36,"aiSummary":37,"trendingCount":16,"starSnapshotCount":16,"syncStatus":38,"lastSyncTime":39,"discoverSource":40},74710,"picolm","RightNow-AI\u002Fpicolm","RightNow-AI","Run a 1-billion parameter LLM on a $10 board with 256MB RAM","https:\u002F\u002Fwww.rightnowai.co\u002Fforge",null,"C",1649,206,28,13,0,6,50,18,78.45,"MIT License",false,"main",true,[26,27,28,29,30,31,32,33,34],"arm","embedded","inference","llm","openclaw","picoclaw","quantization","raspberry-pi","risc-v","2026-06-12 04:01:15","\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLanguage-C11-blue?style=flat-square\" alt=\"C11\">\n  \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FBinary_Size-~80KB-brightgreen?style=flat-square\" alt=\"Binary Size\">\n  \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FRuntime_RAM-45MB-orange?style=flat-square\" alt=\"RAM\">\n  \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDependencies-Zero-success?style=flat-square\" alt=\"Zero Dependencies\">\n  \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-MIT-yellow?style=flat-square\" alt=\"MIT License\">\n\u003C\u002Fp>\n\n\u003Ch1 align=\"center\">PicoLM\u003C\u002Fh1>\n\n\u003Cp align=\"center\">\n  \u003Cstrong>Run a 1-billion parameter LLM on a $10 board with 256MB 
RAM.\u003C\u002Fstrong>\u003Cbr>\n  Pure C. Zero dependencies. One binary. No Python. No cloud.\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  \u003Ccode>echo \"Explain gravity\" | .\u002Fpicolm model.gguf -n 100 -j 4\u003C\u002Fcode>\n\u003C\u002Fp>\n\n---\n\n## The Perfect Match: PicoLM + PicoClaw\n\n\u003Cdiv align=\"center\">\n  \u003Cimg src=\"picolm.jpg\" alt=\"PicoLM — Run a 1-billion parameter LLM on a $10 board\" width=\"640\">\n  \u003Cbr>\u003Cbr>\n\u003C\u002Fdiv>\n\nPicoLM was built as the **local brain** for [PicoClaw](https:\u002F\u002Fgithub.com\u002Fsipeed\u002Fpicoclaw) — an ultra-lightweight AI assistant in Go that runs on $10 hardware. Together, they form a **fully offline AI agent** — no cloud, no API keys, no internet, no monthly bills.\n\n> **Every other LLM provider needs the internet. PicoLM doesn't.**\n\n\u003Ctable align=\"center\">\n  \u003Ctr align=\"center\">\n    \u003Ctd>\u003Cb>The Hardware\u003C\u002Fb>\u003C\u002Ftd>\n    \u003Ctd>\u003Cb>The Architecture\u003C\u002Fb>\u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd align=\"center\">\u003Cimg src=\"https:\u002F\u002Fraw.githubusercontent.com\u002Fsipeed\u002Fpicoclaw\u002Fmain\u002Fassets\u002Flicheervnano.png\" alt=\"$9.90 LicheeRV Nano\" width=\"360\">\u003C\u002Ftd>\n    \u003Ctd align=\"center\">\u003Cimg src=\"https:\u002F\u002Fraw.githubusercontent.com\u002Fsipeed\u002Fpicoclaw\u002Fmain\u002Fassets\u002Farch.jpg\" alt=\"PicoClaw architecture — PicoLM sits in the LLM box\" width=\"420\">\u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd align=\"center\">\u003Cem>$9.90 — that's the entire server\u003C\u002Fem>\u003C\u002Ftd>\n    \u003Ctd align=\"center\">\u003Cem>PicoLM powers the LLM box in PicoClaw's agent loop\u003C\u002Fem>\u003C\u002Ftd>\n  \u003C\u002Ftr>\n\u003C\u002Ftable>\n\n### Why they're a perfect fit\n\n| | Cloud Provider (OpenAI, etc.) 
| PicoLM (Local) |\n|---|---|---|\n| **Cost** | Pay per token, forever | Free forever |\n| **Privacy** | Your data sent to servers | Everything stays on-device |\n| **Internet** | Required for every request | Not needed at all |\n| **Latency** | Network round-trip + inference | Inference only |\n| **Hardware** | Needs a $599 Mac Mini | Runs on a $10 board |\n| **Binary** | N\u002FA | ~80KB single file |\n| **RAM** | N\u002FA | 45 MB total |\n\n### How it works\n\nPicoClaw's agent loop spawns PicoLM as a subprocess. Messages come in from Telegram, Discord, or CLI — PicoClaw formats them into a chat template, pipes the prompt to `picolm` via stdin, and reads the response from stdout. When tools are needed, `--json` grammar mode guarantees valid JSON even from a 1B model.\n\n```\nTelegram \u002F Discord \u002F CLI\n        │\n        ▼\n   ┌──────────┐    stdin: prompt     ┌───────────┐\n   │ PicoClaw │ ──────────────────►  │  picolm   │\n   │   (Go)   │ ◄──────────────────  │   (C)     │\n   └──────────┘    stdout: response  │ + model   │\n        │                            └───────────┘\n        ▼                            45 MB RAM\n   User gets reply                   No internet\n```\n\n### Quick setup\n\n```bash\n# 1. Build PicoLM\ncd picolm && make native    # or: make pi (Raspberry Pi)\n\n# 2. Download model (one-time, 638 MB)\nmake model\n\n# 3. Build PicoClaw\ncd ..\u002Fpicoclaw && make deps && make build\n\n# 4. Configure (~\u002F.picoclaw\u002Fconfig.json)\n```\n\n```json\n{\n  \"agents\": {\n    \"defaults\": {\n      \"provider\": \"picolm\",\n      \"model\": \"picolm-local\"\n    }\n  },\n  \"providers\": {\n    \"picolm\": {\n      \"binary\": \"~\u002F.picolm\u002Fbin\u002Fpicolm\",\n      \"model\": \"~\u002F.picolm\u002Fmodels\u002Ftinyllama-1.1b-chat-v1.0.Q4_K_M.gguf\",\n      \"max_tokens\": 256,\n      \"threads\": 4,\n      \"template\": \"chatml\"\n    }\n  }\n}\n```\n\n```bash\n# 5. 
Chat — fully offline!\npicoclaw agent -m \"What is photosynthesis?\"\n```\n\n### Or install everything in one line\n\n```bash\ncurl -sSL https:\u002F\u002Fraw.githubusercontent.com\u002FRightNow-AI\u002Fpicolm\u002Fmain\u002Finstall.sh | bash\n```\n\n### Performance on real hardware\n\n| Device | Price | Generation Speed | RAM Used |\n|--------|-------|-----------------|----------|\n| **Pi 5** (4-core) | $60 | ~10 tok\u002Fs | 45 MB |\n| **Pi 4** (4-core) | $35 | ~8 tok\u002Fs | 45 MB |\n| **Pi 3B+** | $25 | ~4 tok\u002Fs | 45 MB |\n| **Pi Zero 2W** | $15 | ~2 tok\u002Fs | 45 MB |\n| **LicheeRV Nano** | $10 | ~1 tok\u002Fs | 45 MB |\n\n### JSON tool calling\n\nPicoClaw automatically activates `--json` grammar mode when it needs structured output. This **guarantees syntactically valid JSON** even from a 1B parameter model — essential for reliable tool calling on tiny hardware:\n\n```bash\npicoclaw agent -m \"Search for weather in Tokyo\"\n# → PicoLM generates: {\"tool_calls\": [{\"function\": {\"name\": \"web_search\", \"arguments\": \"{\\\"query\\\": \\\"weather Tokyo\\\"}\"}}]}\n```\n\n> For the full PicoClaw documentation, see the [PicoClaw README](https:\u002F\u002Fgithub.com\u002Fsipeed\u002Fpicoclaw).\n\n---\n\n## What is PicoLM?\n\nPicoLM is a **minimal, from-scratch LLM inference engine** written in ~2,500 lines of C11. It runs [TinyLlama 1.1B](https:\u002F\u002Fhuggingface.co\u002FTinyLlama\u002FTinyLlama-1.1B-Chat-v1.0) (and other LLaMA-architecture models in GGUF format) on hardware that most inference frameworks won't even consider:\n\n- **Raspberry Pi Zero 2W** ($15, 512MB RAM, ARM Cortex-A53)\n- **Sipeed LicheeRV** ($12, 512MB RAM, RISC-V)\n- **Raspberry Pi 3\u002F4\u002F5** (1-8GB RAM, ARM NEON SIMD)\n- Any Linux\u002FWindows\u002FmacOS x86-64 machine\n\nThe model file (638MB) stays on disk. PicoLM **memory-maps** it and streams one layer at a time through RAM. 
Total runtime memory: **~45MB** including the FP16 KV cache.\n\n```\n                    ┌──────────────────────────────────────────┐\n   What goes        │         45 MB Runtime RAM                │\n   in RAM           │  ┌─────────┐ ┌──────────┐ ┌───────────┐  │\n                    │  │ Buffers │ │ FP16 KV  │ │ Tokenizer │  │\n                    │  │  1.2 MB │ │ Cache    │ │   4.5 MB  │  │\n                    │  │         │ │  ~40 MB  │ │           │  │\n                    │  └─────────┘ └──────────┘ └───────────┘  │\n                    └──────────────────────────────────────────┘\n\n                    ┌──────────────────────────────────────────┐\n   What stays       │        638 MB Model on Disk              │\n   on disk          │       (mmap — OS pages in layers         │\n   (via mmap)       │        as needed, ~1 at a time)          │\n                    └──────────────────────────────────────────┘\n```\n\n---\n\n## Features\n\n| Feature | Description |\n|---------|-------------|\n| **GGUF Native** | Reads GGUF v2\u002Fv3 files directly — no conversion needed |\n| **K-Quant Support** | Q2_K, Q3_K, Q4_K, Q5_K, Q6_K, Q8_0, Q4_0, F16, F32 |\n| **mmap Layer Streaming** | Model weights stay on disk; OS pages in one layer at a time |\n| **FP16 KV Cache** | Halves KV cache memory (44MB vs 88MB for 2048 context) |\n| **Flash Attention** | Online softmax — no O(seq_len) attention buffer needed |\n| **Pre-computed RoPE** | cos\u002Fsin lookup tables eliminate transcendentals from hot loop |\n| **SIMD Acceleration** | ARM NEON (Pi 3\u002F4\u002F5) and x86 SSE2 (Intel\u002FAMD) auto-detected |\n| **Fused Dot Products** | Dequantize + dot-product in one pass — no intermediate buffer |\n| **Multi-threaded matmul** | Parallel matrix-vector multiply across CPU cores |\n| **Grammar-Constrained JSON** | `--json` flag forces valid JSON output (for tool calling) |\n| **KV Cache Persistence** | `--cache` saves\u002Floads prompt state — skip prefill on re-runs |\n| **BPE 
Tokenizer** | Score-based byte-pair encoding, loaded from GGUF metadata |\n| **Top-p Sampling** | Temperature + nucleus sampling with configurable seed |\n| **Pipe-friendly** | Reads prompts from stdin: `echo \"Hello\" \\| .\u002Fpicolm model.gguf` |\n| **Zero Dependencies** | Only libc, libm, libpthread. No external libraries. |\n| **Cross-platform** | Linux, Windows (MSVC), macOS. ARM, x86-64, RISC-V. |\n\n---\n\n## Quick Start\n\n### One-liner install (Raspberry Pi \u002F Linux)\n\n```bash\ncurl -sSL https:\u002F\u002Fraw.githubusercontent.com\u002FRightNow-AI\u002Fpicolm\u002Fmain\u002Finstall.sh | bash\n```\n\nThis will:\n1. Detect your platform (ARM64, ARMv7, x86-64)\n2. Install build dependencies (`gcc`, `make`, `curl`)\n3. Build PicoLM with optimal SIMD flags for your CPU\n4. Download TinyLlama 1.1B Q4_K_M (638 MB)\n5. Run a quick test\n6. Generate PicoClaw config\n7. Add `picolm` to your PATH\n\n### Build from source\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Frightnow-ai\u002Fpicolm.git\ncd picolm\u002Fpicolm\n\n# Auto-detect CPU (enables SSE2\u002FAVX on x86, NEON on ARM)\nmake native\n\n# Download a model\nmake model\n\n# Run it\n.\u002Fpicolm \u002Fopt\u002Fpicolm\u002Fmodels\u002Ftinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \\\n    -p \"The meaning of life is\" -n 100\n```\n\n### Build on Windows (MSVC)\n\n```cmd\ncd picolm\nbuild.bat\npicolm.exe model.gguf -p \"Hello world\" -n 50\n```\n\n### Platform-specific builds\n\n```bash\nmake native      # x86\u002FARM auto-detect (recommended for local machine)\nmake pi          # Raspberry Pi 3\u002F4\u002F5 (64-bit ARM + NEON SIMD)\nmake pi-arm32    # Pi Zero \u002F Pi 1 (32-bit ARM)\nmake cross-pi    # Cross-compile for Pi from x86 (static binary)\nmake riscv       # RISC-V (Sipeed LicheeRV, etc.)\nmake static      # Static binary for single-file deployment\nmake debug       # Debug build with symbols, no optimization\n```\n\n---\n\n## Usage\n\n```\nPicoLM — ultra-lightweight LLM inference 
engine\n\nUsage: picolm \u003Cmodel.gguf> [options]\n\nGeneration options:\n  -p \u003Cprompt>    Input prompt (or pipe via stdin)\n  -n \u003Cint>       Max tokens to generate (default: 256)\n  -t \u003Cfloat>     Temperature (default: 0.8, 0=greedy)\n  -k \u003Cfloat>     Top-p \u002F nucleus sampling (default: 0.9)\n  -s \u003Cint>       RNG seed (default: 42)\n  -c \u003Cint>       Context length override\n  -j \u003Cint>       Number of threads (default: 4)\n\nAdvanced options:\n  --json         Grammar-constrained JSON output mode\n  --cache \u003Cfile> KV cache file (saves\u002Floads prompt state)\n```\n\n### Examples\n\n**Basic generation:**\n```bash\n.\u002Fpicolm model.gguf -p \"Once upon a time\" -n 200\n```\n\n**Greedy decoding (deterministic, temperature=0):**\n```bash\n.\u002Fpicolm model.gguf -p \"The capital of France is\" -n 20 -t 0\n# Output: Paris. It is the largest city in France and...\n```\n\n**Chat with TinyLlama (ChatML format):**\n```bash\n.\u002Fpicolm model.gguf -n 200 -t 0.7 -p \"\u003C|user|>\nWhat is photosynthesis?\u003C\u002Fs>\n\u003C|assistant|>\n\"\n```\n\n**Force JSON output (for tool calling \u002F structured data):**\n```bash\n.\u002Fpicolm model.gguf --json -t 0.3 -n 100 -p \"\u003C|user|>\nReturn the current time as JSON.\u003C\u002Fs>\n\u003C|assistant|>\n\"\n# Output: {\"time\": \"12:00 PM\"}\n```\n\n**Pipe from stdin:**\n```bash\necho \"Explain quantum computing in one sentence\" | .\u002Fpicolm model.gguf -n 50\n```\n\n**KV cache — skip repeated prefill:**\n```bash\n# First run: processes prompt + saves cache\n.\u002Fpicolm model.gguf --cache prompt.kvc -p \"Long system prompt here...\" -n 50\n\n# Second run: loads cache, skips prompt prefill (74% faster)\n.\u002Fpicolm model.gguf --cache prompt.kvc -p \"Long system prompt here...\" -n 50\n# Output: \"Skipping 25 cached prompt tokens\"\n```\n\n**Multi-threaded on a Pi 4 (4 cores):**\n```bash\n.\u002Fpicolm model.gguf -p \"Hello\" -n 100 -j 4\n```\n\n---\n\n## 
Performance\n\nMeasured on TinyLlama 1.1B Q4_K_M (638 MB model):\n\n| Metric | x86-64 (8 threads) | Pi 4 (4 cores, NEON) | Pi Zero 2W |\n|--------|--------------------|-----------------------|------------|\n| **Prefill** | ~11 tok\u002Fs | ~6 tok\u002Fs | ~1.5 tok\u002Fs |\n| **Generation** | ~13 tok\u002Fs | ~8 tok\u002Fs* | ~2 tok\u002Fs* |\n| **Runtime RAM** | 45 MB | 45 MB | 45 MB |\n| **First token** | ~2.3s | ~4s | ~16s |\n| **Binary size** | ~80 KB | ~70 KB | ~65 KB |\n\n*\\*Estimated with NEON SIMD enabled. Actual numbers depend on SD card speed and thermal throttling.*\n\n### What makes it fast\n\n```\n Raw C inference          ████████████░░░░░░░░  13.5 tok\u002Fs  (baseline: 1.6)\n + Fused dot products     ████████████████░░░░  (eliminate dequant buffer)\n + Multi-threaded matmul  █████████████████░░░  (4-8 cores in parallel)\n + FP16 KV cache          █████████████████░░░  (halve memory bandwidth)\n + Pre-computed RoPE      ██████████████████░░  (no sin\u002Fcos in hot loop)\n + Flash attention        ██████████████████░░  (no O(n) attention alloc)\n + NEON\u002FSSE2 SIMD         ███████████████████░  (4-wide vector ops)\n + KV cache persistence   ████████████████████  (skip prefill entirely)\n```\n\n---\n\n## Architecture\n\n```\n                          ┌─────────────────────────────────┐\n                          │           picolm.c              │\n                          │     CLI + Generation Loop       │\n                          └──────┬──────────────┬───────────┘\n                                 │              │\n                    ┌────────────┘              └────────────┐\n                    │                                        │\n           ┌────────┴────────┐                    ┌──────────┴──────────┐\n           │    model.h\u002Fc    │                    │    sampler.h\u002Fc      │\n           │  GGUF Parser    │                    │  Temperature +      │\n           │  mmap Layer     │                    │  Top-p Sampling    
 │\n           │  Streaming      │                    └──────────┬──────────┘\n           │  Forward Pass   │                               │\n           │  KV Cache I\u002FO   │                    ┌──────────┴──────────┐\n           └───┬────────┬────┘                    │    grammar.h\u002Fc      │\n               │        │                         │  JSON Constraint    │\n      ┌────────┘        └───────┐                 │  Logit Masking      │\n      │                         │                 └─────────────────────┘\n┌─────┴──────┐          ┌───────┴────────┐\n│ tensor.h\u002Fc │          │ tokenizer.h\u002Fc  │\n│ matmul     │          │ BPE Encode     │\n│ rmsnorm    │          │ Decode         │\n│ softmax    │          │ Vocab Lookup   │\n│ rope       │          └────────────────┘\n│ silu       │\n│ threading  │\n└─────┬──────┘\n      │\n┌─────┴──────┐\n│  quant.h\u002Fc │\n│ Q4_K, Q6_K │\n│ Q3_K, Q2_K │\n│ FP16, F32  │\n│ NEON + SSE │\n│ Fused Dots │\n└────────────┘\n```\n\n### The LLaMA Forward Pass (what happens for each token)\n\n```\nInput Token\n    │\n    ▼\n┌───────────────┐\n│ Embedding     │  Dequantize row from token_embd → x[2048]\n│ Lookup        │\n└───────┬───────┘\n        │\n        ▼\n┌───────────────┐  ×22 layers\n│ RMSNorm       │─────────────────────────────────────────┐\n│               │                                         │\n│ Q = xb @ Wq   │  Matrix-vector multiply (quantized)     │\n│ K = xb @ Wk   │  Store K,V in FP16 KV cache             │\n│ V = xb @ Wv   │                                         │\n│               │                                         │\n│ RoPE(Q, K)    │  Rotary position encoding (table lookup)│\n│               │                                         │\n│ Attention     │  Flash attention with online softmax    │\n│ (GQA 32→4)    │  Grouped-query: 32 Q heads, 4 KV heads  │\n│               │                                         │\n│ x += Out@Wo   │  Output projection + residual           │\n│     
          │                                         │\n│ RMSNorm       │                                         │\n│               │                                         │\n│ SwiGLU FFN    │  gate=SiLU(xb@Wg), up=xb@Wu             │\n│               │  x += (gate*up) @ Wd                    │\n└───────┬───────┘─────────────────────────────────────────┘\n        │\n        ▼\n┌───────────────┐\n│ Final RMSNorm │\n│ x @ W_output  │─→ logits[32000]\n└───────┬───────┘\n        │\n        ▼\n┌───────────────┐\n│ Grammar Mask  │  (if --json: force valid JSON structure)\n│ Sample Token  │  temperature → softmax → top-p → pick\n└───────────────┘\n```\n\n---\n\n## Memory Budget\n\nFor TinyLlama 1.1B Q4_K_M with 2048 context length:\n\n| Component | Size | Notes |\n|-----------|------|-------|\n| FP16 KV cache | ~40 MB | 22 layers x 2 x 2048 x 256 x 2 bytes |\n| Tokenizer | ~4.5 MB | 32K vocab strings + scores + sorted index |\n| Activation buffers | ~0.14 MB | x, xb, xb2, q, hb, hb2 |\n| Logits buffer | ~0.12 MB | 32000 x 4 bytes |\n| Dequant scratch | ~0.02 MB | Max(n_embd, n_ffn) floats |\n| Norm weights (pre-dequant) | ~0.35 MB | 45 norm vectors x 2048 x 4 bytes |\n| RoPE tables | ~0.03 MB | cos + sin x 2048 x 32 entries |\n| **Total runtime** | **~45 MB** | |\n| | | |\n| Model file (on disk) | 638 MB | Memory-mapped, ~1 layer in RAM at a time |\n\nWith 512 context (for constrained devices):\n\n| Component | Size |\n|-----------|------|\n| FP16 KV cache | ~10 MB |\n| Everything else | ~5 MB |\n| **Total** | **~15 MB** |\n\n---\n\n## Optimizations Deep-Dive\n\nPicoLM implements 9 optimizations that brought generation speed from **1.6 tok\u002Fs to 13.5 tok\u002Fs** on x86, with even larger gains expected on ARM with NEON:\n\n### 1. ARM NEON SIMD\n\n4-wide float vector operations for all hot paths. Example: dequantizing Q4_K nibbles with `vmovl_u8` → `vmovl_u16` → `vcvtq_f32_u32`, and RoPE with interleaved `vld2q_f32` \u002F `vst2q_f32`.\n\n### 2. 
x86 SSE2 SIMD\n\nAuto-detected on Intel\u002FAMD. 4-wide `__m128` operations for dot products, RMSNorm, and vector operations.\n\n### 3. FP16 KV Cache\n\nKey and value vectors stored as 16-bit floats instead of 32-bit. Halves KV cache memory from ~88MB to ~44MB. Conversion uses software `fp32_to_fp16()` \u002F `fp16_to_fp32()` — no hardware FP16 support required.\n\n### 4. Pre-computed RoPE Tables\n\nSine and cosine values for all positions computed once at model load. The forward pass does a table lookup instead of calling `sinf()` \u002F `cosf()` \u002F `powf()` 64 times per token.\n\n### 5. Flash Attention (Online Softmax)\n\nSingle-pass attention with running maximum rescaling. Eliminates the `O(seq_len)` attention score buffer — critical for long contexts on memory-constrained devices.\n\n### 6. Fused Dequantize + Dot Product\n\n`vec_dot_q4_K_f32()` dequantizes and accumulates in one pass. No intermediate float buffer for the weight row. Reduces memory traffic by ~50% for matmul.\n\n### 7. Multi-threaded Matrix Multiply\n\n`matmul()` distributes output rows across threads using pthreads. Each thread processes its chunk independently with fused dot products. Scales linearly up to ~8 cores.\n\n### 8. Grammar-Constrained JSON\n\nThe `--json` mode pre-analyzes every token in the vocabulary at load time (brace delta, bracket delta, quote parity). During generation, it masks logits to guarantee syntactically valid JSON — essential for tool-calling with small models.\n\n### 9. KV Cache Persistence\n\n`--cache file.kvc` saves the FP16 KV cache state after prompt processing. On the next run with the same prompt, it loads the cache and skips prefill entirely. 
**74% latency reduction** for repeated system prompts.\n\n---\n\n## Supported Models\n\nPicoLM supports any LLaMA-architecture model in GGUF format:\n\n| Model | Parameters | GGUF Size (Q4_K_M) | RAM Needed |\n|-------|-----------|---------------------|------------|\n| **TinyLlama 1.1B** | 1.1B | 638 MB | ~45 MB |\n| **Llama 2 7B** | 7B | 4.1 GB | ~200 MB |\n| **Phi-2** | 2.7B | 1.6 GB | ~90 MB |\n\n> **Recommended for embedded:** TinyLlama 1.1B Q4_K_M — fits comfortably on devices with 256MB+ RAM.\n\n### Supported quantization formats\n\n`Q2_K` `Q3_K` `Q4_K` `Q4_0` `Q5_K` `Q6_K` `Q8_0` `F16` `F32`\n\n---\n\n## File Structure\n\n```\nPicoLM\u002F\n├── README.md              ← you are here\n├── BLOG.md                ← technical deep-dive blog post\n├── install.sh             ← one-liner Pi installer\n│\n├── picolm\u002F                ← the inference engine (pure C)\n│   ├── picolm.c           ← CLI entry point, generation loop (273 lines)\n│   ├── model.h\u002Fc          ← GGUF parser, mmap, forward pass (146 + 833 lines)\n│   ├── tensor.h\u002Fc         ← matmul, rmsnorm, softmax, rope (44 + 298 lines)\n│   ├── quant.h\u002Fc          ← dequantization, SIMD kernels (140 + 534 lines)\n│   ├── tokenizer.h\u002Fc      ← BPE tokenizer (32 + ~200 lines)\n│   ├── sampler.h\u002Fc        ← temperature + top-p sampling (19 + ~100 lines)\n│   ├── grammar.h\u002Fc        ← JSON grammar constraints (64 + 175 lines)\n│   ├── Makefile           ← build targets for all platforms\n│   └── build.bat          ← Windows MSVC build script\n│\n└── tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf  ← model file (638 MB, not in git)\n```\n\n**Total C source: ~2,500 lines.** That's the entire inference engine — GGUF parsing, mmap, dequantization, matrix math, attention, tokenization, sampling, and grammar constraints.\n\n---\n\n## How It Works\n\n### The mmap trick\n\nTraditional inference engines load the entire model into RAM. PicoLM doesn't. Instead:\n\n1. 
The model file is **memory-mapped** (`mmap` on Linux\u002FmacOS, `MapViewOfFile` on Windows)\n2. Weight pointers point directly into the mapped file — no copying\n3. During the forward pass, each layer's weights are accessed sequentially\n4. The OS automatically pages in the needed weights and evicts old ones\n5. `madvise(MADV_SEQUENTIAL)` hints the access pattern to the kernel\n\n**Result:** A 638MB model runs on a device with 256MB RAM. Only ~30MB of the model is in physical memory at any time.\n\n### Quantization\n\nWeights are stored in 4-bit quantized format (Q4_K_M). For TinyLlama:\n- **Original:** 1.1B parameters x 4 bytes = 4.4 GB\n- **Q4_K:** 1.1B parameters x ~0.56 bytes = 638 MB\n- **Quality loss:** Minimal — Q4_K preserves 6-bit scales per 32-weight sub-block\n\n### Grouped-Query Attention (GQA)\n\nTinyLlama uses 32 query heads but only 4 key\u002Fvalue heads. Each KV head is shared by 8 query heads. This reduces KV cache size by 8x compared to full multi-head attention.\n\n---\n\n## Building & Testing\n\n### Prerequisites\n\n| Platform | Requirements |\n|----------|-------------|\n| **Linux\u002FPi** | `gcc`, `make` (install via `apt install build-essential`) |\n| **macOS** | Xcode Command Line Tools (`xcode-select --install`) |\n| **Windows** | Visual Studio Build Tools (cl.exe) |\n\n### Verify your build\n\n```bash\n# Build\nmake native\n\n# Test with greedy decoding (deterministic output)\n.\u002Fpicolm model.gguf -p \"The capital of France is\" -n 20 -t 0\n# Expected: \"Paris. 
It is the largest city in France...\"\n\n# Test JSON mode\n.\u002Fpicolm model.gguf --json -p \"Return JSON with name and age\" -n 50 -t 0.3\n# Expected: valid JSON like {\"name\": \"...\", \"age\": ...}\n\n# Test KV cache\n.\u002Fpicolm model.gguf --cache test.kvc -p \"Hello\" -n 10 -t 0\n.\u002Fpicolm model.gguf --cache test.kvc -p \"Hello\" -n 10 -t 0\n# Second run should say \"Skipping N cached prompt tokens\"\n```\n\n### Memory verification\n\nPicoLM prints memory stats to stderr:\n\n```\nMemory: 1.17 MB runtime state (FP16 KV cache separate)\n```\n\nTotal = runtime state + FP16 KV cache. For TinyLlama with 2048 context: ~45 MB.\n\n---\n\n## FAQ\n\n**Q: Can this run Llama 2 7B?**\nA: Yes, if you have enough RAM for the KV cache (~1.4 GB for 7B with 4096 context). The model file stays on disk via mmap. On a Pi 4 with 4GB RAM, it works but is slow (~1-2 tok\u002Fs).\n\n**Q: Why not use llama.cpp?**\nA: llama.cpp is excellent but requires ~200MB+ for the runtime on small models, has complex build dependencies, and targets desktop\u002Fserver use cases. PicoLM is purpose-built for embedded: 45MB RAM, 80KB binary, zero dependencies.\n\n**Q: Is the output quality good?**\nA: TinyLlama 1.1B is a small model — it handles simple tasks (Q&A, summarization, basic reasoning, JSON generation) well. It won't match GPT-4, but it runs on a $10 board with no internet. For structured output, the `--json` grammar mode guarantees valid JSON regardless of model quality.\n\n**Q: What about GPU acceleration?**\nA: PicoLM is CPU-only by design. The target hardware ($10-15 boards) doesn't have GPUs. On x86\u002FARM CPUs, SIMD (NEON\u002FSSE2) provides meaningful speedup.\n\n**Q: Can I use a different model?**\nA: Any LLaMA-architecture GGUF model works. Download from [HuggingFace](https:\u002F\u002Fhuggingface.co\u002Fmodels?search=gguf) and point PicoLM at it. 
Recommended quantizations: Q4_K_M (best quality\u002Fsize balance) or Q2_K (smallest, lower quality).\n\n---\n\n## Roadmap\n\n- [ ] AVX2\u002FAVX-512 kernels for x86 (2-4x generation speed on modern CPUs)\n- [ ] Speculative decoding with a draft model\n- [ ] Context sliding window (infinite generation beyond max_seq_len)\n- [ ] Weight pruning for further memory reduction\n- [ ] Continuous batching for server mode\n- [ ] Mistral \u002F Phi architecture support\n\n---\n\n## Technical Blog\n\nFor a detailed writeup of the optimization journey (with code snippets and war stories), see [**BLOG.md**](BLOG.md).\n\n---\n\n## License\n\nMIT License. See [LICENSE](LICENSE) for details.\n\n---\n\n\u003Cp align=\"center\">\n  \u003Cstrong>PicoLM\u003C\u002Fstrong> — because intelligence shouldn't require a data center.\n\u003C\u002Fp>\n","PicoLM is a project that runs a 1-billion-parameter large language model on a $10 development board with only 256MB of RAM. It is written in pure C and has zero dependencies, a single binary of about 80KB, and a runtime footprint of only 45MB of RAM. It is especially suited to applications that need a low-cost, low-power, fully offline AI assistant, such as embedded systems based on Raspberry Pi or RISC-V. Combined with PicoClaw, PicoLM can serve as a local brain that provides efficient natural language processing without relying on the internet, which suits automated task execution and personal assistant services in privacy-sensitive environments.",2,"2026-06-11 03:50:30","high_star"]