[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-81817":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":14,"subscribersCount":14,"size":14,"stars1d":15,"stars7d":16,"stars30d":16,"stars90d":14,"forks30d":14,"starsTrendScore":17,"compositeScore":18,"rankGlobal":9,"rankLanguage":9,"license":19,"archived":20,"fork":20,"defaultBranch":21,"hasWiki":22,"hasPages":20,"topics":23,"createdAt":9,"pushedAt":9,"updatedAt":24,"readmeContent":25,"aiSummary":26,"trendingCount":14,"starSnapshotCount":14,"syncStatus":16,"lastSyncTime":27,"discoverSource":28},81817,"NanoCamelid","timtoole02\u002FNanoCamelid","timtoole02","High-performance, Rust-native LLM inference engine for Raspberry Pi and ARM64.",null,"Rust",35,5,33,0,1,2,3,45.53,"MIT License",false,"main",true,[],"2026-06-12 04:01:35","# NanoCamelid\n\nNanoCamelid is a compact Rust inference runtime for running GGUF local chat\nmodels on Raspberry Pi-class ARM64 hardware.\n\nIt is not a wrapper around a desktop inference stack. The current goal is a\nsmall, inspectable runtime that can load local GGUF files, run model smoke tests,\nchat in a terminal, and make every performance claim traceable to Pi-side\nevidence.\n\n## Current State\n\n- GGUF metadata and tensor layout inspection are available.\n- Q8_0, Q4_0, Q4_1, Q5_0, Q5_1, Q2_K, Q3_K, Q4_K, Q5_K, Q6_K, Q8_K, and\n  IQ4_NL tensor paths are implemented in the runtime. The model catalog below\n  only marks rows as supported after Pi-local, row-specific smoke\u002Fparity\n  evidence exists.\n- Llama, Qwen, ChatML, Mistral, DeepSeek-R1-Qwen, and Gemma turn-template\n  rendering is available for smoke tests and chat.\n- The terminal TUI keeps the model loaded and reuses matching KV-cache prefixes\n  across turns.\n- Prompt ingestion uses guarded batched prefill by default. 
The current default\n  batch size is `16`.\n- Long-context models can be smoke-tested with an explicit\n  `NANOCAMELID_CONTEXT_LIMIT` cap to avoid allocating their full advertised KV\n  cache.\n- On AArch64 boards with dot-product support, NanoCamelid now auto-selects the\n  SDOT Q8 kernel, Q4_0 1x4 swizzled layout, Q4_0\u002FQ6_K SDOT matmuls, and\n  head-parallel attention by default. Scalar and forced-kernel modes remain\n  available for comparison.\n- Scalar reference paths remain in the test suite. Optimized kernels are kept\n  tied to parity tests and Pi-side smoke evidence.\n- The working model catalog lives in\n  [`docs\u002FMODEL_CATALOG.md`](docs\u002FMODEL_CATALOG.md). It separates Pi-smoked\n  supported rows from likely-compatible candidates and broader runtime families\n  that still need more evidence before broad claims.\n\nQuick 1B readiness check on a Pi workspace:\n\n```bash\nCARGO_TARGET_DIR=\u002Fmnt\u002Fnanocamelid\u002Ftarget cargo run -- model 1b --dry-run\nCARGO_TARGET_DIR=\u002Fmnt\u002Fnanocamelid\u002Ftarget cargo run -- inspect 1b --dry-run\n.\u002Fscripts\u002Fpi\u002Fmodel-1b.sh --dry-run\n.\u002Fscripts\u002Fpi\u002Fready-1b.sh\n.\u002Fscripts\u002Fpi\u002Fchat-1b.sh --dry-run\n.\u002Fscripts\u002Fpi\u002Fcontext-pack-1b.sh --dry-run\n.\u002Fscripts\u002Fpi\u002Fevidence-1b.sh --dry-run\nCARGO_TARGET_DIR=\u002Fmnt\u002Fnanocamelid\u002Ftarget cargo run -- chat 1b --dry-run\nCARGO_TARGET_DIR=\u002Fmnt\u002Fnanocamelid\u002Ftarget cargo run -- evidence 1b --dry-run\nCARGO_TARGET_DIR=\u002Fmnt\u002Fnanocamelid\u002Ftarget cargo run -- inspect 1b\nCARGO_TARGET_DIR=\u002Fmnt\u002Fnanocamelid\u002Ftarget cargo run -- smoke 1b chat \"Say hello in one sentence.\" 8\nCARGO_TARGET_DIR=\u002Fmnt\u002Fnanocamelid\u002Ftarget NANOCAMELID_READY_TOKENS=8 cargo run -- ready 1b\n```\n\n`inspect 1b` resolves `NANOCAMELID_SMOKE_GGUF` or `NANOCAMELID_MODEL_GGUF`\nfirst, then the Pi-local `Llama-3.2-1B-Instruct-Q4_0.gguf` or Q8_0 fallback\nunder 
`${NANOCAMELID_WORKSPACE:-\u002Fmnt\u002Fnanocamelid}\u002Fmodels`.\nNon-dry-run `inspect 1b` is also a strict Llama 3.2 1B shape gate: it exits\nnonzero if the selected GGUF does not match the expected 1B architecture,\nmetadata, all 16 block tensor shapes, and `llama3_instruct` chat renderer.\nSuccessful `inspect 1b` runs end with `inspect_1b_status: ok` and a compact\n`json:` status row; dry runs print the same row as `json_on_success:`.\n`model 1b --dry-run` prints the same selected source, Q4_0\u002FQ8_0 default paths,\nexistence checks, selected quantization row, and the exact follow-up `inspect`,\n`smoke`, `ready`, and `evidence` commands from the Rust CLI before the heavier\ngates.\nSuccessful `model 1b` shape-audit runs end with `model_1b_status: ok` and a\ncompact `json:` status row; dry runs print the same row as `json_on_success:`.\n`inspect 1b --dry-run` prints the resolved inspect command and model existence\nchecks plus the expected JSON evidence row without opening the GGUF, so it is\nsafe before the model has been copied.\n`.\u002Fscripts\u002Fpi\u002Fmodel-1b.sh --dry-run` prints the same 1B model resolution plan\nand shows whether the Q4_0, Q8_0, and selected GGUF files exist before printing\nthe exact follow-up `model`, `inspect`, `smoke`, `ready`, and `evidence`\ncommands plus the success markers automation should expect. 
Without\n`--dry-run`, it runs the same strict shape audit as the Rust `model 1b` command.\n`smoke 1b` now runs the strict Llama 3.2 1B shape audit before the\nscalar-vs-selected smoke validation; dry runs print `shape_audit: enabled` so\nautomation can confirm the guard is in the plan without opening the GGUF, plus\nthe exact `model_command` that will run before the smoke gate.\nSuccessful 1B smoke runs end with `smoke_1b_status: ok` and a compact `json:`\nstatus row that records the selected model, quantization row, context cap,\nstrict shape-audit marker, smoke prompt, smoke kind, smoke token count, and\nprefill batch; dry runs print the same row as `json_on_success:`.\nThe `generate 1b`, `chat 1b`, and `tui 1b` commands use the same Pi-local 1B\nmodel resolution, with `NANOCAMELID_SMOKE_GGUF` taking precedence over\n`NANOCAMELID_MODEL_GGUF` for those aliases. Their dry runs print the selected\nmodel source, resolved model path, strict shape-audit command, and launch plan\nwithout loading the GGUF.\nNon-dry-run 1B aliases run the strict Llama 3.2 1B shape audit before direct\ngeneration, chat, or TUI launch.\n`ready 1b` runs the host fast-path probe, strict Llama 3.2 1B shape audit,\ninspect, scalar-vs-selected smoke validation, and one direct chat turn. Set\n`NANOCAMELID_READY_CHAT=0` (also `false`, `no`, or `off`) for\nprobe+audit+inspect+smoke only, or set\n`NANOCAMELID_READY_PROMPT`, `NANOCAMELID_READY_TOKENS`, and\n`NANOCAMELID_READY_TEMP` when the direct chat turn should differ from the smoke\nprompt. Successful readiness runs end with `ready_1b_status: ok`; dry runs\nprint `status_on_success: ready_1b_status: ok` for log collectors. 
Successful\nruns also emit a compact `json:` status row with the selected 1B model, context\ncap, quantization row, probe marker, strict shape-audit marker, smoke kind,\nsmoke prompt, smoke token count, direct-chat prompt, direct-chat token count,\nand direct-chat temperature.\n`.\u002Fscripts\u002Fpi\u002Fchat-1b.sh --dry-run` prints the exact smoke and TUI launch plan\nwithout requiring the GGUF to exist yet. It honors the same\n`NANOCAMELID_SMOKE_GGUF` then `NANOCAMELID_MODEL_GGUF` override order as the\n1B smoke\u002Fready gates, so an exact GGUF selected for smoke validation is also the\ninteractive chat target. If `NANOCAMELID_CHAT_SMOKE=0` skips the pre-chat smoke\ngate, the launcher still runs `nanocamelid model 1b` before TUI so the strict\nLlama 3.2 1B shape gate remains in the launch path.\n`.\u002Fscripts\u002Fpi\u002Fcontext-pack-1b.sh` reruns the 1B smoke gate across context caps\nfrom `NANOCAMELID_CONTEXT_PACKS`, defaulting to `512,1024,2048,4096,8192`.\nSuccessful context-pack runs end with `context_pack_1b_status: ok` and a compact\n`json:` status row listing the selected 1B model, strict shape-audit marker,\nsmoke prompt, smoke kind, token count, prefill batch, and validated context\ncaps; dry runs also print `shape_audit: enabled` and the same row as\n`json_on_success:`.\n`.\u002Fscripts\u002Fpi\u002Fevidence-1b.sh` is the Pi-side evidence bundle for one 1B run. It\ndelegates to `model-1b.sh`, `ready-1b.sh --no-chat`, `context-pack-1b.sh`, and\n`bench-1b-prefill.sh` in that order. It records and forwards any\n`NANOCAMELID_CONTEXT_LIMIT` cap in the delegated dry-run plan, and its JSON\nstatus records the active `NANOCAMELID_PREFILL_BATCH` used by readiness\u002Fcontext\nsmoke separately from the prefill sweep batch list. Successful runs end with\n`evidence_1b_status: ok`; dry runs print `shape_audit: enabled` and the exact\ndelegated command plan. 
The dry run also prints the default Q4_0 and Q8_0 1B\ncandidate paths plus existence checks before selecting the model, so missing Pi\nmodel copies are visible before the full bundle runs.\n`cargo run -- evidence 1b` runs the same bounded 1B evidence bundle from the\nRust CLI when the selected GGUF is present. Dry runs print the selected-model\naudit, readiness no-chat, per-context smoke, and prefill sweep commands plus a\ncompact `json_on_success` row before loading the model. When\n`NANOCAMELID_PREFILL_BATCH` is set, the dry-run prefill sweep command includes\nthat env var so the printed plan matches the inherited smoke preflight batch.\n`.\u002Fscripts\u002Fpi\u002Fbench-1b-prefill.sh --dry-run` prints the strict 1B shape-audit\npreflight, inspect preflight, scalar-vs-selected chat smoke gate, and real\nprefill batch sweep plan, honors the same `NANOCAMELID_SMOKE_GGUF` then\n`NANOCAMELID_MODEL_GGUF` override order as the smoke gate, and validates any\n`NANOCAMELID_CONTEXT_LIMIT` cap before the model is loaded. Batch lists must be\npositive and unique, so Pi sweeps cannot spend time rerunning the same batch.\nSuccessful sweeps end with `prefill_bench_1b_status: ok` and a compact `json:`\nsummary row for log collectors that includes the strict `llama32_1b` shape\nmarker and prefill prompt.\n`cargo run -- bench 1b` runs the same model-backed prefill sweep from the Rust\nCLI when the selected 1B GGUF is present. 
It audits the strict 1B shape first,\nruns inspect and smoke preflights, runs each `NANOCAMELID_PREFILL_BATCH`, emits\nper-batch `json:` rows, and ends with `prefill_bench_1b_status: ok` plus the\nprefill prompt, best observed prefill\u002Fdecode batches, and the best prefill\nprompt tokens\u002Fsec.\n`cargo run -- bench 1b --dry-run` prints the selected 1B GGUF, strict\nshape-audit preflight, inspect command, smoke command, batch commands, context\ncap, and `json_on_success` row before the Pi-local GGUF is present.\nThe `inspect 3b`, `generate 3b`, `chat 3b`, `tui 3b`, and `smoke 3b` aliases\nresolve the Pi-local `Llama-3.2-3B-Instruct-Q4_0.gguf` row.\n\n## High-Performance Architecture\n\nNanoCamelid is tuned around the Raspberry Pi 5's Cortex-A76 cores rather than\nbeing a general desktop inference wrapper. The current fast path is built from a\nsmall set of explicit runtime choices:\n\n- **Auto-detected SDOT kernels.** When `NANOCAMELID_Q8_DOT_KERNEL` is unset,\n  the runtime probes CPU features and selects SDOT on AArch64 systems with\n  dot-product support, NEON when SDOT is unavailable, and scalar otherwise.\n- **Q4_0 1x4 swizzled storage.** Compatible Q4_0 matrices are swizzled at load\n  time so four adjacent rows can be streamed together in cache-friendly chunks.\n- **Q4_0 and Q6_K SDOT matmuls.** The ARM dot-product paths are enabled by\n  default on supported CPUs, with scalar references retained for tests and\n  diagnostics.\n- **Vectorized activation quantization.** AArch64 builds use NEON rounding and\n  saturating-narrowing instructions for Q8 activation blocks, avoiding the\n  scalar per-element round\u002Fclamp loop in the hot path.\n- **Head-parallel attention.** Attention heads can be evaluated across Rayon\n  workers using per-head scratch storage. 
This is most useful on longer prompts;\n  very short prompts are still dominated by matmul work.\n- **Governor telemetry.** `probe` and the TUI surface CPU governor information\n  and recommend the non-overclock `performance` governor when Linux reports\n  `ondemand`.\n\nThe implementation uses stable Rust with targeted `unsafe` AArch64 intrinsics\ninside hot kernels. The goal is not a huge abstraction stack; it is an\ninspectable Pi runtime where each optimization has a fallback, a test, or a\nsmoke path.\n\n## Recent Pi Results\n\nLatest runtime evidence below was captured through `59e374d`\n(`perf(neon): vectorize activation Q8 block quantization using rounding and\nsaturating narrowing instructions`).\n\nOn the Pi 2 benchmark lane, the default Q8 dot benchmark now auto-selects SDOT\nwith no speed environment variables set:\n\n- selected kernel: `sdot`\n- scalar median: about `3.18 ns\u002Fblock`\n- NEON median: about `2.11 ns\u002Fblock`\n- SDOT median: about `1.69 ns\u002Fblock`\n- SDOT speedup: about `1.88x` over the scalar run in that benchmark sample and\n  about `1.25x` over NEON\n\nThe isolated Q4 layout benchmark for a Qwen-sized shape also shows the memory\nlayout win:\n\n- row-major Q4: `90.536ms`\n- swizzled 1x4 Q4: `70.648ms` (`1.28x`)\n- page-aligned swizzled 1x4: `68.337ms` (`1.32x` vs row-major, `1.034x` vs\n  contiguous swizzled)\n\nPage alignment remains opt-in because the incremental real-model gain has not\njustified the duplicate chunk storage.\n\nLlama 3.2 1B Instruct Q4_0 now passes the Pi-local chat smoke path and direct\ngeneration check with the default fast profile:\n\n- `smoke q8-chat` generated text: `\"Hello!\"`\n- `max_logit_delta: 0.00000000`\n- direct generation prompt: `Say hello in one sentence.`\n- model load: about `0.90s`\n- prompt ingest: about `0.38s`\n- generated text: `\"Hello, how are you?\" is`\n- throughput: `8` tokens in `1.91s` (`4.18 tok\u002Fsec`)\n\nLlama 3.2 3B Instruct Q4_0 is also supported as a Pi-local 
single-node row:\n\n- `inspect 3b` reports `readiness: ready` and `tensor_layouts: ok`\n- tokenizer: `llama3_instruct`\n- `smoke 3b chat` generated text: `\"Hello!\"`\n- `max_logit_delta: 0.00000000`\n- direct generation prompt: `Say hello in one sentence.`\n- model load: about `1.45s`\n- prompt ingest: about `1.43s`\n- generated text: `I'd like to introduce you to`\n- direct generation throughput: `8` tokens in `3.60s` (`2.22 tok\u002Fsec`)\n- chat smoke throughput: `2` tokens in `1.02s` (`1.96 tok\u002Fsec`)\n- capped 8096-context smoke: `NANOCAMELID_CONTEXT_LIMIT=8096 .\u002Fscripts\u002Fpi\u002Fsmoke-3b.sh chat ...`\n  passed with `max_logit_delta: 0.00000000` and generated `\"Hello!\"`\n- capped 8096-context TUI launch: `.\u002Fscripts\u002Fpi\u002Fchat-3b.sh` loaded with `ctx 8096`\n\nSmall single-Pi support now has row-level evidence for load, direct completion,\nchat completion, compact model parity, broader chat-template parity,\nmetadata-backed or template-shape chat rendering, unique perf\u002FRSS sampling, and\ncapped context packs. The unique direct\u002Fchat runs below use a 512-token runtime\ncontext cap; the context pack checks run `smoke ... 
chat` at each listed cap.\n\n| Model row | Load + chat renderer | Direct completion | Chat perf\u002FRSS | Compact parity | Broader chat parity | Checked context packs |\n| --- | --- | --- | --- | --- | --- | --- |\n| Qwen2.5 0.5B Instruct Q4_0 | `ready`; `qwen2`; `qwen_im` | 8 tokens at `28.41 tok\u002Fsec` | 4 tokens at `36.30 tok\u002Fsec`; HWM about `1.32 GiB` | `max_logit_delta: 0.00000000`, generated `\" SMART\"` | renderer `qwen_im`; `max_logit_delta: 0.00000000`, generated `\"Hello\"` | `512`, `1024`, `2048`, `4096`, `8192`: all exact parity |\n| Qwen2.5-Coder 0.5B Instruct Q4_0 | `ready`; `qwen2`; `qwen_im` | 8 tokens at `32.16 tok\u002Fsec` | 4 tokens at `37.37 tok\u002Fsec`; HWM about `1.32 GiB` | `max_logit_delta: 0.00000000`, generated `\"nodiscard\"` | renderer `qwen_im`; `max_logit_delta: 0.00000000`, generated `\"To\"` | `512`, `1024`, `2048`, `4096`, `8192`: all exact parity |\n| Qwen2.5-Coder 0.5B Instruct Q5_K_M | `ready`; `qwen2`; `qwen_im` | 8 tokens at `9.98 tok\u002Fsec` | 8 tokens at `9.71 tok\u002Fsec`; direct load about `0.55s` | Not yet covered by scalar-vs-selected smoke helper | renderer `qwen_im`; generated `\"Certainly! 
Below is a simple Rust function\"` | Not yet run |\n| Qwen3 0.6B Instruct Q8_0 | `ready`; `qwen3`; `qwen_im` | 8 tokens at `5.90 tok\u002Fsec` | 8 tokens at `6.41 tok\u002Fsec`; sampled RSS about `1.41 GiB` | `max_logit_delta: 0.00000000`, generated `\"0\"` | renderer `qwen_im`; `max_logit_delta: 0.00000000`, generated `\"0\"` | `512`, `1024`, `2048`, `4096`, `8192`: all exact parity |\n| Qwen3 1.7B Instruct Q8_0 | `ready`; `qwen3`; `qwen_im` | 8 tokens at `2.86 tok\u002Fsec` | 8 tokens at `2.87 tok\u002Fsec`; sampled RSS about `3.56 GiB` | `max_logit_delta: 0.00000000`, generated zero-width space | renderer `qwen_im`; `max_logit_delta: 0.00000000`, generated zero-width space | `512`, `1024`, `2048`, `4096`, `8192`: all exact parity |\n| Qwen3 4B Instruct Q4_0 | `ready`; `qwen3`; `qwen_im` | 8 tokens at `2.22 tok\u002Fsec` | 8 tokens at `2.22 tok\u002Fsec`; sampled RSS about `5.50 GiB` | `max_logit_delta: 0.00000000`, generated `\"ations\"` | renderer `qwen_im`; `max_logit_delta: 0.00000000`, generated `\"ations\"` | `512`, `1024`, `2048`, `4096`, `8192`: all exact parity |\n| SmolLM3 3B Q4_0 | `ready`; `smollm3`; `qwen_im_token_fallback` | 8 tokens at `3.10 tok\u002Fsec` | 8 tokens at `3.09 tok\u002Fsec`; sampled RSS about `3.94 GiB` | `max_logit_delta: 0.00000000`, generated `\"\u003Cthink>\"` | renderer `qwen_im_token_fallback`; `max_logit_delta: 0.00000000`, generated `\"\u003Cthink>\"` | `512`, `1024`, `2048`, `4096`, `8192`: all exact parity |\n| SmolLM2 1.7B Instruct Q4_0 | `ready`; `llama`; `qwen_im` | Prompt ended immediately in direct mode | 2 tokens at `5.82 tok\u002Fsec`; sampled RSS about `2.02 GiB` | `max_logit_delta: 0.00000000`, generated `\"Hello\"` | renderer `qwen_im`; `max_logit_delta: 0.00000000`, generated `\"Hello\"` | `512`, `1024`, `2048`, `4096`, `8192`: all exact parity |\n| Gemma 3 1B IT Q4_0 | `ready`; `gemma3`; `gemma_turn` | 8 tokens at `3.78 tok\u002Fsec` | 8 tokens at `3.83 tok\u002Fsec`; sampled RSS about `2.20 GiB` | 
`max_logit_delta: 0.00000000`, generated `\"擬\"` | renderer `gemma_turn`; `max_logit_delta: 0.00000000`, generated `\"擬\"` | `512`, `1024`, `2048`, `4096`, `8192`: all exact parity |\n| DeepSeek-R1-Distill-Qwen 1.5B Q4_0 | `ready`; `qwen2`; `deepseek_r1_qwen` | 8 tokens at `12.38 tok\u002Fsec` | 4 tokens at `15.59 tok\u002Fsec`; HWM about `2.83 GiB` | `max_logit_delta: 0.00000000`, generated `\"!\"` | renderer `deepseek_r1_qwen`; `max_logit_delta: 0.00000000` | `512`, `1024`, `2048`, `4096`, `8192`: all exact parity |\n| Llama 3.2 1B Instruct Q4_0 | `ready`; `llama`; `llama3_instruct` | 8 tokens at `4.11 tok\u002Fsec` | 2 tokens at `3.61 tok\u002Fsec`; HWM about `2.37 GiB` | `max_logit_delta: 0.00000000`, generated `\",\"` | renderer `llama3_instruct`; `max_logit_delta: 0.00000000`, generated `\"Hello\"` | `512`, `1024`, `2048`, `4096`, `8192`: all exact parity |\n| Llama 3.2 1B Instruct Q8_0 | `ready`; `llama`; `llama3_instruct`; `llama32_1b_shape: ok` | 8 tokens at `3.49 tok\u002Fsec` | End-to-end ready run generated `\"Hello!\"` at `3.20 tok\u002Fsec`; HWM about `3.31 GiB` | `max_logit_delta: 0.00000000`, generated `\",\"` | renderer `llama3_instruct`; `max_logit_delta: 0.00000000`, generated `\"Hello!\"` | `512`, `1024`, `2048`, `4096`, `8192`: all exact parity |\n| Llama 3.2 3B Instruct Q4_0 | `ready`; `llama`; `llama3_instruct` | 8 tokens at `2.29 tok\u002Fsec` | 2 tokens at `2.04 tok\u002Fsec`; HWM about `4.93 GiB` | `max_logit_delta: 0.00000000`, generated `\",\"` | renderer `llama3_instruct`; `max_logit_delta: 0.00000000`, generated `\"Hello\"` | `512`, `1024`, `2048`, `4096`, `8192`: all exact parity |\n| Mistral 7B Instruct v0.1 Q4_0 | `ready`; tested GGUF reports `llama`; `mistral_inst_token_fallback` | 8 tokens at `2.91 tok\u002Fsec` | 4 tokens at `3.53 tok\u002Fsec`; HWM about `8.21 GiB` | `max_logit_delta: 0.00000000`, generated `\",\"` | renderer `mistral_inst_token_fallback`; `max_logit_delta: 0.00000000` | `512`, `1024`, `2048`, `4096`, `8192`: 
all exact parity |\n| Qwen2.5-Coder-7B-Instruct Q4_0 | `ready`; `qwen2`; `qwen_im` | 8 tokens at `2.88 tok\u002Fsec` | 4 tokens at `3.54 tok\u002Fsec`; HWM about `10.14 GiB` | `max_logit_delta: 0.00000000`, generated `\"odzi\"` | renderer `qwen_im`; `max_logit_delta: 0.00000000`, generated `\"One\"` | `512`, `1024`, `2048`, `4096`, `8192`: all exact parity |\n\nThe Mistral v0.1 GGUF row above does not provide `tokenizer.chat_template`\nmetadata. NanoCamelid detects the row shape from GGUF metadata and renders the\nstandard `[INST] ... [\u002FINST]` prompt form through `mistral_inst_token_fallback`;\nthat fallback is covered by scalar-vs-selected parity and every capped context\npack listed above.\n\nThe main prefill improvement came from loop-inverted batched Q4 prefill:\n\n- batch 1, 145-token Qwen prompt: `48.90s`\n- batch 16 before loop inversion: `31.38s`\n- batch 16 after loop inversion: `17.04s`\n- default batch 16 real-model check: about `17.0s`\n\nSynthetic Q4 prefill tuning on the same Pi showed batch 32 slightly ahead in\nthe isolated benchmark, but the real Qwen chat path favored batch 16, so `16` is\nthe production default.\n\nRecent experiments and narrow wins:\n\n- Register-accumulated attention was correct but did not improve the short Qwen\n  decode run (`1.88 tok\u002Fsec` baseline vs `1.87 tok\u002Fsec` experiment).\n- f16 KV-cache storage preserved short Qwen smoke output but did not improve the\n  short 16-token prompt (`1.83 tok\u002Fsec` vs `1.88 tok\u002Fsec` for f32 cache), so it\n  remains an opt-in memory-pressure mode.\n- Vectorized activation quantization is now landed after Pi-side smoke passed;\n  the short 1B run moved from `4.16` to `4.18 tok\u002Fsec`, which is a small\n  positive result but still within normal run noise.\n\nStrand Rust Coder 14B Q6_K now inspects and runs with a capped context on the\nPi 2 benchmark lane. 
It is useful compatibility evidence for Qwen2 + Q6_K, but\nit is not a practical Pi target yet:\n\n- model: `Fortytwo-Network\u002FStrand-Rust-Coder-14B-v1-GGUF`\n- file: `Fortytwo_Strand-Rust-Coder-14B-v1-Q6_K.gguf`\n- size: `12.1 GB`\n- metadata: Qwen2, 48 layers, 5120 hidden width, 32k advertised context\n- short run with `NANOCAMELID_CONTEXT_LIMIT=128`: load about `39-54s`, one-token\n  prompt prefill about `6.6s`, 8-token generation `46.06s` (`0.17 tok\u002Fsec`)\n- Q6_K SDOT preserved the initial smoke output and reduced a capped one-token\n  Strand run from about `78s` to about `54s`.\n\nMixtral Q4_0 now has exact-row three-Pi cluster chat support. NanoCamelid can\nparse the expert-indexed MoE tensors, render the Mistral\u002FMixtral `[INST]` chat\nprompt, route through the top experts, and produce prompt-level chat output\nacross the Pi pipeline. This is not a single-Pi support claim: full Mixtral\nsingle-node generation exceeds 16 GB Pi RAM during eager weight load and needs\nclustered execution or a future lazy expert loader.\nThe cluster TCP path now performs a startup handshake so every node confirms the\nsame protocol, model shape, MoE expert shape, and adjacent layer ranges before\nactivation streaming begins.\n\nLatest Mixtral cluster chat evidence:\n\n- model: `mixtral-8x7b-instruct-v0.1.Q4_0.gguf`\n- mode: three-Pi `master-chat`\n- layer split: `0..11`, `11..22`, `22..32`\n- prompt: `Write one short sentence about Raspberry Pi clusters.`\n- rendered prompt: `\u003Cs>[INST] Write one short sentence about Raspberry Pi clusters. 
[\u002FINST]`\n- generated text: `Raspberry Pi clusters are groups of`\n- handshake: master, middle, and final worker agreed on protocol, layer ranges,\n  hidden width, context cap, KV width, vocab size, and MoE expert shape before\n  activation streaming\n- generated tokens: `8`\n- throughput: about `1.26 tok\u002Fsec`\n\nUse the Pi launcher to print the exact current run plan without hard-coding node\naddresses in the repo:\n\n```bash\n.\u002Fscripts\u002Fpi\u002Fmixtral-cluster.sh --dry-run\n```\n\n## Runtime Design\n\nNanoCamelid keeps the runtime small and explicit:\n\n- Rust CLI only; no Python service dependency and no required C++ build step.\n- Bounded Rayon worker setup tuned for small ARM boards.\n- Optional CPU affinity when the platform exposes it.\n- GGUF tensor bytes are sourced from an mmap-backed view during model loading,\n  avoiding one temporary file-read buffer per tensor while preserving owned\n  runtime weights.\n- NEON\u002FSDOT hot paths guarded by architecture checks and parity tests.\n- Default fast-path Q4_0\u002FQ6_K SDOT, Q4_0 1x4 swizzled storage, Q8 SDOT\n  auto-selection, and head-parallel attention on supported Pi-class ARM64\n  hardware.\n- Repeatable smoke and benchmark commands instead of broad model-family claims.\n\n## Requirements\n\n- Raspberry Pi 5 or another ARM64 Linux machine\n- Rust toolchain\n- A local GGUF model file\n\n## Quick Start\n\nInstall the latest release build from GitHub:\n\n```bash\ncurl -fsSL https:\u002F\u002Fraw.githubusercontent.com\u002Ftimtoole02\u002FNanoCamelid\u002Fmain\u002Fscripts\u002Finstall.sh | bash\n```\n\nThe installer clones NanoCamelid, builds the release binary with Cargo, and\nlinks `nanocamelid` into `~\u002F.local\u002Fbin`. On macOS it refuses to build unless\n`CARGO_TARGET_DIR` or `NANOCAMELID_TARGET_DIR` points at an external `\u002FVolumes`\npath, matching the local validation guard. 
Override paths when needed:\n\n```bash\ncurl -fsSL https:\u002F\u002Fraw.githubusercontent.com\u002Ftimtoole02\u002FNanoCamelid\u002Fmain\u002Fscripts\u002Finstall.sh | \\\n  env NANOCAMELID_INSTALL_DIR=\u002Fmnt\u002Fnanocamelid\u002Fsrc\u002FNanoCamelid \\\n    CARGO_TARGET_DIR=\u002Fmnt\u002Fnanocamelid\u002Ftarget \\\n    bash\n```\n\nOn Pi workspaces mounted at `\u002Fmnt\u002Fnanocamelid`, the installer uses\n`\u002Fmnt\u002Fnanocamelid\u002Ftarget` by default unless `CARGO_TARGET_DIR` or\n`NANOCAMELID_TARGET_DIR` is set.\nRun `.\u002Fscripts\u002Finstall.sh --dry-run` from a checkout to print the resolved plan\nwithout cloning or building.\n\nManual checkout still works:\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Ftimtoole02\u002FNanoCamelid.git\ncd NanoCamelid\n\ncargo run -- probe\ncargo run -- inspect \u002Fpath\u002Fto\u002Fmodel.gguf\ncargo run --release -- smoke q8-model \u002Fpath\u002Fto\u002Fmodel.gguf \"Hello\" 1\ncargo run --release -- smoke q8-chat \u002Fpath\u002Fto\u002Fmodel.gguf \"Say hello in one sentence.\" 8\nNANOCAMELID_MODEL_GGUF=\u002Fpath\u002Fto\u002Fmodel.gguf cargo run --release -- tui 0.0 64\n```\n\n`probe` prints CPU and SIMD feature information. `inspect` reads GGUF metadata\nand tensor layout. `smoke q8-model` loads a Q8_0 model, checks scalar\u002Fruntime\nlogit parity, and runs a short greedy generation path from directly tokenized\nprompt text. `smoke q8-chat` runs the same parity\u002Fgeneration validation through\nthe tokenizer chat template so Llama 3.2 1B Instruct rows can be smoke-tested\nthrough the real instruct prompt path. Set `NANOCAMELID_MODEL_GGUF` to reuse\nthe same GGUF path across repeated `inspect`, `generate`, `chat`, and `tui`\nruns, or `NANOCAMELID_SMOKE_GGUF` to make the 1B\u002F3B alias launchers use the\nsame GGUF selected for smoke validation.\n\nRemote Pi validation can also cap the single readiness pass without changing\ncontext-pack sweeps. 
Set `NANOCAMELID_REMOTE_CONTEXT_LIMIT=512` when\n`remote_build.sh` should pass `NANOCAMELID_CONTEXT_LIMIT` into the default 1B\nreadiness gate and the optional `NANOCAMELID_REMOTE_PREFILL_BENCH=1` sweep.\nSet `NANOCAMELID_REMOTE_PREFILL_BATCH=32` to forward a tuned\n`NANOCAMELID_PREFILL_BATCH` into remote readiness, context-pack, evidence, and\nprefill-sweep smoke preflights without changing the sweep batch list. Evidence\nbundle JSON records that active smoke\u002Freadiness prefill batch as `prefill_batch`.\nSet `NANOCAMELID_REMOTE_EVIDENCE=1` when the remote build should delegate the\nmodel-backed 1B portion to `.\u002Fscripts\u002Fpi\u002Fevidence-1b.sh` after format, tests,\nclippy, release build, probe, and Q8 benchmark complete. The evidence mode uses\nthe same redacted dry-run output and the same model override variables as the\nindividual remote readiness path.\n\nFor the standard local validation gate, use:\n\n```bash\nNANOCAMELID_TARGET_DIR=\"\u002FVolumes\u002FSSK Drive\u002Fnanocamelid-target\" .\u002Fscripts\u002Fvalidate.sh\n.\u002Fscripts\u002Fvalidate.sh --dry-run\n```\n\nOn macOS, `validate.sh` refuses to guess a default target directory or use a\nnon-`\u002FVolumes` target. Set `CARGO_TARGET_DIR` or `NANOCAMELID_TARGET_DIR` to an\nexternal drive path first so the repo does not create build artifacts on the\ninternal disk. On prepared Pi workspaces, the same script defaults to\n`\u002Fmnt\u002Fnanocamelid\u002Ftarget`. The gate also runs the core Cargo checks, the\n`model 1b`, `inspect 1b`, `generate 1b`, `chat 1b`, `smoke 1b`, `ready 1b`,\n`evidence 1b`, `tui 1b`, and `bench 1b` CLI dry runs, the Pi 1B launcher dry\nruns, the Strand\u002FMixtral cluster launcher dry runs, redacted remote-build dry\nruns, and the installer dry run. 
That keeps the default Llama 3.2 1B command\npaths, evidence bundle plan, and build-entry target-dir guard covered without\nrequiring the GGUF during local validation.\n\nSingle-turn generation is available through either raw prompt text or a rendered\nchat prompt:\n\n```bash\nNANOCAMELID_MODEL_GGUF=\u002Fpath\u002Fto\u002Fmodel.gguf \\\n  cargo run --release -- generate \"Hello\" 0.0 32\n\nNANOCAMELID_MODEL_GGUF=\u002Fpath\u002Fto\u002Fmodel.gguf \\\n  cargo run --release -- chat \"Say hello in one sentence.\" 0.0 32\n```\n\nSuccessful `generate` and `chat` runs end with `generation_status: ok` and a\n`json: {\"command\":\"chat\",...}` line that records prompt tokens, generated\ntokens, prefill batch, prefill seconds, generation seconds, tokens\u002Fsec, and the\nselected chat renderer when one is used. This gives Pi smoke logs and prefill\nsweeps a stable machine-readable timing record.\n\n`tui` opens an interactive terminal chat that keeps the model loaded, shows the\nconnected model path\u002Fname, selected Q8 kernel, chat renderer, and per-turn plus\nsession token-in\u002Ftoken-out counters, TTFT, and throughput. The prompt surface is\ncloser to a modern assistant CLI: slash commands expose model switching, nearby\nmodel discovery, live decoding settings, system prompts, transcript saving, and\nsession history without restarting the process.\n\n`NANOCAMELID_PREFILL_BATCH` controls how many prompt tokens are ingested at once\nbefore decode begins. The default is `16`. Set it to `1` for the old\nsingle-token reference behavior, or use `bench q4-prefill` to compare candidate\nbatch sizes on the current host without loading a GGUF model.\nOn a prepared Pi workspace, `.\u002Fscripts\u002Fpi\u002Fbench-1b-prefill.sh` first runs the\nstrict Llama 3.2 1B shape audit, then sweeps the real 1B chat path across\nprefill batch sizes and prints the model-backed prompt ingestion timing plus\none `json:` summary line for each batch. 
When the sweep finishes, it also\nreports the best observed prefill batch and decode throughput batch in a final\nJSON status row.\n\nSet `NANOCAMELID_TRACE=1` on `generate`, `chat`, or `tui` runs to print an\naggregate stage-level timing summary. It is intended for focused tuning: the\nsummary groups decode and batched prefill work by layer stage so slow paths can\nbe identified before changing kernels.\n\nFor very long-context GGUFs, `NANOCAMELID_CONTEXT_LIMIT` can cap the runtime KV\ncache during local smoke tests:\n\n```bash\nNANOCAMELID_CONTEXT_LIMIT=128 \\\n  NANOCAMELID_MODEL_GGUF=\u002Fpath\u002Fto\u002Fmodel.gguf \\\n  cargo run --release -- generate \"Hello\" 0.0 8\n```\n\nThis does not change the model metadata or make broad context-length support\nclaims; it only bounds memory for short validation runs.\n\n![NanoCamelid terminal chat showing model telemetry and token counters](docs\u002Fimages\u002Fnanocamelid-tui.png)\n\nInside the TUI:\n\n- `\u002Fmodel \u003Cpath>` loads a different GGUF without restarting the process. A\n  successful switch resets the conversation and token counters. 
If the new model\n  fails to load, the current model stays active.\n- `\u002Fmodels` lists GGUFs next to the current model.\n- `\u002Ftemp [value]` and `\u002Ftokens [count]` show or change per-turn decoding\n  settings.\n- `\u002Fsystem [prompt]` sets a system prompt and resets chat state; `\u002Fsystem clear`\n  removes it.\n- `\u002Fstatus`, `\u002Fhistory`, `\u002Ftrim \u003Cturns>`, and `\u002Fsave \u003Cpath>` manage the active\n  session.\n- `\u002Fclear`, `\u002Fexit`, and `\u002Fquit` reset or leave the chat.\n\nOn a prepared Pi workspace with the Llama 3.2 1B Instruct Q4_0 or Q8_0 GGUF at\nthe default model path, start the interactive 1B chat directly:\n\n```bash\n.\u002Fscripts\u002Fpi\u002Fchat-1b.sh\n```\n\nFor the matching one-command 1B validation path on that same Pi workspace:\n\n```bash\n.\u002Fscripts\u002Fpi\u002Fsmoke-1b.sh\n```\n\nFor the fuller 1B readiness gate, including host fast-path probe, strict model\nshape audit, inspect, smoke, and one direct chat turn:\n\n```bash\n.\u002Fscripts\u002Fpi\u002Fmodel-1b.sh\n.\u002Fscripts\u002Fpi\u002Fready-1b.sh\n.\u002Fscripts\u002Fpi\u002Fready-1b.sh --no-chat\n.\u002Fscripts\u002Fpi\u002Fready-1b.sh --dry-run\n.\u002Fscripts\u002Fpi\u002Fevidence-1b.sh --dry-run\n```\n\nUse `model-1b.sh --dry-run` as a cheap model-placement preflight when the GGUF\nhas not been copied yet. It prints the exact `model`, `inspect`, `smoke`,\n`ready`, and `evidence` commands for the selected model plus the selected\nquantization row and success markers automation should expect. 
Without\n`--dry-run`, it exits nonzero if the selected 1B GGUF is missing, otherwise it\nruns the strict Llama 3.2 1B shape audit through the Rust CLI.\n\nThe same gate is available through the CLI when you are already using the\nrelease binary or Cargo directly:\n\n```bash\nnanocamelid ready 1b\nnanocamelid ready 1b --no-chat\nnanocamelid ready 1b --dry-run\nnanocamelid evidence 1b --dry-run\nnanocamelid ready 1b \u002Fpath\u002Fto\u002FLlama-3.2-1B-Instruct-Q4_0.gguf chat \"Say hello in one sentence.\" 8\n```\n\nUse `--no-chat` or `--smoke-only` for the audit plus inspect plus smoke-only\nform when you want the gate to validate the model path without launching the\nfinal direct chat turn. Use `--dry-run` to print the resolved 1B model path,\nactive `NANOCAMELID_CONTEXT_LIMIT` cap, smoke settings, host probe command,\nmodel audit command, and direct-chat gate without loading the GGUF. For\nnon-interactive automation, the CLI and `ready-1b.sh` also honor\n`NANOCAMELID_READY_SMOKE_KIND`, `NANOCAMELID_READY_SMOKE_PROMPT`, and\n`NANOCAMELID_READY_SMOKE_TOKENS` as smoke defaults before the final direct chat\nturn. Successful runs print `ready_1b_status: ok` as the final readiness\nmarker.\n\nWhen you want one Pi log that captures the core 1B evidence after a fresh build,\nrun:\n\n```bash\n.\u002Fscripts\u002Fpi\u002Fevidence-1b.sh\n.\u002Fscripts\u002Fpi\u002Fevidence-1b.sh \u002Fpath\u002Fto\u002FLlama-3.2-1B-Instruct-Q4_0.gguf\nnanocamelid evidence 1b\nnanocamelid evidence 1b \u002Fpath\u002Fto\u002FLlama-3.2-1B-Instruct-Q4_0.gguf\n```\n\nThe bundle runs the strict model audit, readiness gate without the final direct\nchat turn, context-pack smoke gate, and prefill batch sweep in order. 
It honors\nthe same model override variables as the individual 1B scripts and CLI aliases\nand finishes with `evidence_1b_status: ok` when every delegated gate passes.\n\nFor the supported Llama 3.2 3B Instruct Q4_0 row, place\n`Llama-3.2-3B-Instruct-Q4_0.gguf` under the same `models\u002F` directory and use the\nmatching launchers:\n\n```bash\n.\u002Fscripts\u002Fpi\u002Fsmoke-3b.sh\n.\u002Fscripts\u002Fpi\u002Fchat-3b.sh\n```\n\nThe launcher prefers a leading `.gguf` argument, then `NANOCAMELID_SMOKE_GGUF`,\nthen `NANOCAMELID_MODEL_GGUF`, then the Pi-local Q4_0 model when present, and\nfinally Q8_0. It defaults the block dot path to SDOT on Pi-class ARM64\nhardware. It runs a `smoke 1b chat` preflight before opening the TUI, so the 1B\ninstruct path keeps the scalar-vs-selected-kernel parity gate in front of\ninteractive chat. It still honors `NANOCAMELID_Q8_DOT_KERNEL` if you want to\nforce a different kernel for comparison. When the helper needs to build through\nCargo, it uses `\u002Fmnt\u002Fnanocamelid\u002Ftarget` by default, or an explicit\n`CARGO_TARGET_DIR` or `NANOCAMELID_TARGET_DIR` override.\n\nOptional arguments set the model path, temperature, and maximum assistant output\ntokens:\n\n```bash\nnanocamelid tui 1b --dry-run\n.\u002Fscripts\u002Fpi\u002Fchat-1b.sh \u002Fpath\u002Fto\u002FLlama-3.2-1B-Instruct-Q4_0.gguf 0.0 64\n.\u002Fscripts\u002Fpi\u002Fchat-1b.sh 0.0 64\n```\n\nUse `nanocamelid tui 1b --dry-run` to verify the resolved 1B shape audit, TUI\nlaunch command, and context cap from the Rust CLI before loading the GGUF.\n\n`smoke-1b.sh` uses the same kernel defaults, but runs only the smoke gate and\nexits. Its model-selection precedence is a leading `.gguf` argument,\n`NANOCAMELID_SMOKE_GGUF`, `NANOCAMELID_MODEL_GGUF`, Pi-local Q4_0, then\nPi-local Q8_0. By default it runs the real instruct prompt path with `chat`, the\nprompt `Say hello in one sentence.`, and an 8-token response budget. 
Optional\narguments let you override the model path, smoke kind, prompt, and token budget\ndirectly. Add `--dry-run` to print the resolved smoke plan without loading the\nmodel:\n\n```bash\n.\u002Fscripts\u002Fpi\u002Fsmoke-1b.sh \u002Fpath\u002Fto\u002FLlama-3.2-1B-Instruct-Q4_0.gguf chat \"Say hello in one sentence.\" 8\n.\u002Fscripts\u002Fpi\u002Fsmoke-1b.sh chat \"Say hello in one sentence.\" 8\n.\u002Fscripts\u002Fpi\u002Fsmoke-1b.sh --dry-run\n.\u002Fscripts\u002Fpi\u002Fsmoke-1b.sh model \"Hello\" 1\n.\u002Fscripts\u002Fpi\u002Fsmoke-3b.sh chat \"Say hello in one sentence.\" 4\n```\n\n`ready-1b.sh` uses the same Pi target directory and model defaults, then runs\nthe host probe, strict 1B shape audit, `inspect`, `smoke 1b`, and `chat` against\nthe resolved GGUF. A leading `.gguf` argument overrides the model path. An\noptional leading `chat`, `model`, `q8-chat`, or `q8-model` argument selects the\nsmoke gate kind. The remaining optional arguments override the final direct-chat\nprompt and token budget; when omitted, direct chat reuses the selected smoke\nprompt and token budget. Set `--no-chat`, `--smoke-only`, or\n`NANOCAMELID_READY_CHAT=0` (also `false`, `no`, or `off`) to stop after\ninspect and smoke when you only need the readiness gate:\n\n```bash\n.\u002Fscripts\u002Fpi\u002Fready-1b.sh \u002Fpath\u002Fto\u002FLlama-3.2-1B-Instruct-Q4_0.gguf \"Say hello in one sentence.\" 8\n.\u002Fscripts\u002Fpi\u002Fready-1b.sh \u002Fpath\u002Fto\u002FLlama-3.2-1B-Instruct-Q4_0.gguf chat \"Say hello in one sentence.\" 8\n.\u002Fscripts\u002Fpi\u002Fready-1b.sh \"Say hello in one sentence.\" 8\n.\u002Fscripts\u002Fpi\u002Fready-1b.sh --smoke-only\n```\n\nFor faster local iteration, disable the preflight smoke gate explicitly. 
The\n1B launcher also accepts `false`, `no`, and `off`, and rejects misspelled\ntoggle values:\n\n```bash\nNANOCAMELID_CHAT_SMOKE=0 .\u002Fscripts\u002Fpi\u002Fchat-1b.sh\nNANOCAMELID_CHAT_SMOKE=0 .\u002Fscripts\u002Fpi\u002Fchat-3b.sh\n```\n\nThe preflight smoke defaults to `chat` with a one-token response budget, and\nyou can override the gate with:\n\n- `NANOCAMELID_CHAT_SMOKE_KIND=model|chat`\n- `NANOCAMELID_CHAT_SMOKE_PROMPT=\"...\"`\n- `NANOCAMELID_CHAT_SMOKE_TOKENS=1`\n\n## Benchmarks\n\nRun benchmarks on the target Pi in release mode.\n\n```bash\ncargo run --release -- bench q8-dot 1000 3\ncargo run --release -- bench q4-layout 32768 3584 3\ncargo run --release -- bench q4-prefill 128 16\n.\u002Fscripts\u002Fpi\u002Fbench-1b-prefill.sh --dry-run\n```\n\nEach cargo benchmark prints human-readable timing plus a JSON summary line.\n`bench q4-prefill` reports both milliseconds per prompt token and prompt\ntokens\u002Fsec so Pi prefill sweeps can be compared without manual conversion.\nTreat results as specific to the exact Pi, model, build, and environment used.\nThe 1B prefill sweep runs the strict 1B shape audit, inspect preflight, and\nchat smoke gate before reporting NanoCamelid's normal prompt ingestion and\ngeneration timing for each selected batch size. It emits a\n`json: {\"benchmark\":\"llama32-1b-prefill\",...}` line for each batch. Successful\nsweeps also end with `prefill_bench_1b_status: ok` and a final JSON summary\nincluding the selected model, strict 1B shape marker, context cap, planned\nbatches, best prefill batch, best prefill prompt tokens\u002Fsec, and best decode\nthroughput batch. 
Duplicate batch sizes are rejected before the model is loaded.\n\nUseful environment controls:\n\n- `NANOCAMELID_PREFILL_BATCH`: prompt-token batch size; default `16`.\n- `NANOCAMELID_CONTEXT_LIMIT`: optional runtime KV-cache context cap for short\n  smoke tests of long-context models.\n- `NANOCAMELID_TRACE=1`: print stage-level inference timing summaries for\n  generation and TUI turns.\n- `NANOCAMELID_RAYON_THREADS`: global Rayon worker count.\n- `NANOCAMELID_WORKER_CORES=1,2,3`: pin Rayon workers to a CPU list. If this is\n  unset and Linux reports isolated CPUs in `\u002Fsys\u002Fdevices\u002Fsystem\u002Fcpu\u002Fisolated`,\n  NanoCamelid uses that isolated set automatically.\n- `NANOCAMELID_MATMUL_MIN_ROWS`: row-count threshold before matmuls enter Rayon.\n- `NANOCAMELID_Q8_DOT_KERNEL=scalar|neon|sdot`: force the selected Q8 kernel.\n- `NANOCAMELID_Q8_DOT_SDOT=0`: disable SDOT candidate selection for comparison.\n- `NANOCAMELID_Q4_1X4_SDOT=0`: disable the Q4_0 1x4 SDOT path for comparison.\n- `NANOCAMELID_Q4_SWIZZLE_1X4=0`: disable compatible Q4_0 tensor swizzling and\n  use the row-major Q4 path for comparison.\n- `NANOCAMELID_Q4_PAGE_ALIGN_1X4=1`: when the swizzled Q4_0 path is enabled,\n  also keep an opt-in page-aligned copy of each 1x4 row chunk. This costs extra\n  memory and is not the default.\n- `NANOCAMELID_Q6K_SDOT=0`: disable the AArch64 SDOT path for Q6_K-by-Q8\n  matmuls when comparing against the scalar route.\n- `NANOCAMELID_ROPE_CACHE=0`: disable the default RoPE angle cache for\n  before\u002Fafter comparisons.\n- `NANOCAMELID_ATTENTION_HEAD_PARALLEL=0`: disable Rayon head-parallel\n  attention for comparison. This uses per-head score scratch space and is most\n  visible on longer prompts.\n- `NANOCAMELID_KV_CACHE_F16=1`: store KV-cache entries as f16 and decode them\n  during attention. 
This halves KV-cache storage and bandwidth for cached keys\n  and values, but it is lossy and remains opt-in until real-model parity and\n  long-context speed evidence justify broader use.\n\nThe swizzled Q4_0 1x4 path and SDOT kernels are the default fast profile on\nsupported Pi-class ARM64 hosts. The environment variables above are primarily\nfor diagnostics and before\u002Fafter benchmark runs.\n\nThe page-aligned Q4_0 1x4 path is narrower: the Pi 2 layout microbenchmark\nshowed a small gain over contiguous swizzled storage, but it duplicates the\nswizzled matrix chunks and should be treated as a measurement switch until\nreal-model runs justify making it broader.\n\nThe f16 KV-cache path is also opt-in. It intentionally compares against an\nexplicitly decoded f16 reference rather than the full-f32 cache, because\nhalf-precision cache storage is a lossy runtime mode.\n\n## Tested Models\n\nThese rows reflect models that have been loaded and smoke-tested on Raspberry Pi\nhardware with the current GGUF path. They are not broad family claims.\n\n| Model | GGUF quant | Status | Notes |\n| --- | --- | --- | --- |\n| Qwen2.5 0.5B Instruct | Q4_0 | Working | Pi smoke reports `ready`; 8-token generation runs at about `33.31 tok\u002Fsec`. |\n| Qwen2.5-Coder 0.5B Instruct | Q4_0 | Working | Pi smoke reports `ready`; 8-token generation runs at about `33.28 tok\u002Fsec`. |\n| Qwen2.5-Coder 0.5B Instruct | Q5_K_M | Working end-to-end | Official Q5_K_M GGUF inspects as `ready`; tensor mix includes `Q5_1`, `Q5_K`, `Q6_K`, and `Q8_0`; direct generation produced 8 tokens at `9.98 tok\u002Fsec`; Qwen chat rendering produced 8 tokens at `9.71 tok\u002Fsec`. |\n| DeepSeek-R1-Distill-Qwen 1.5B | Q4_0 | Working | Pi smoke reports `ready`; 8-token generation runs at about `13.25 tok\u002Fsec`. |\n| Llama 3.2 1B Instruct | Q4_0 | Working | Pi smoke passes with scalar-vs-selected-kernel logit parity and interactive TUI chat at about `4.18 tok\u002Fsec`. 
|\n| Llama 3.2 1B Instruct | Q8_0 | Working end-to-end | Exact Q8_0 row inspects as `ready` with `llama32_1b_shape: ok`; forced Pi readiness run passes host probe, inspect, scalar-vs-SDOT chat smoke, and direct chat generation of `\"Hello!\"` at about `3.20 tok\u002Fsec`. |\n| Llama 3.2 3B Instruct | Q4_0 | Working | Pi smoke passes with scalar-vs-selected-kernel logit parity; direct generation runs at about `2.22 tok\u002Fsec`; capped `8096` context smoke and TUI launch pass. |\n| Mistral 7B Instruct v0.1 | Q4_0 | Working for tested row | Tested GGUF reports a Llama-style architecture; 4-token smoke runs at about `3.68 tok\u002Fsec`. |\n| Qwen2.5-Coder-7B-Instruct | Q4_0 | Smoke passing | Official Q4_0 GGUF loads, Qwen chat rendering runs, and Pi smoke\u002Fchat generation passes with exact scalar-vs-selected logit parity on the smoke gate. |\n| Strand Rust Coder 14B v1 | Q6_K | Supported three-Pi cluster chat | Official Q6_K GGUF inspects and runs on one Pi, but the practical path is the three-Pi split. Current-head `master-chat` with Q6_K SDOT and pre-unpacked batched Q6_K weights generated 6 tokens at about `1.28 tok\u002Fsec` after an `8.38s` prompt ingest with the `0..16`, `16..32`, `32..48` split. |\n| Mixtral 8x7B Instruct v0.1 | Q4_0 | Supported three-Pi cluster chat | Expert-indexed MoE tensors inspect as `ready`; three-Pi `master-chat` handshake validated the `0..11`, `11..22`, `22..32` split, rendered the `[INST]` prompt, and generated 8 tokens at about `1.26 tok\u002Fsec`. Single-Pi full generation OOMs on 16 GB Pi RAM. |\n| Qwen2.5-Coder 32B Instruct | Q4_0 | Cluster smoke only | Three-Pi smoke produced matching code-text tokens at about `0.56 tok\u002Fsec`; this is not a single-Pi claim. 
|\n| Llama 3 70B Instruct | Q4_0 | Supported three-Pi cluster chat | Tested GGUF inspects as `ready`, uses the `llama3_instruct` chat renderer, and three-Pi `master-chat` generated 4 tokens with the `0..27`, `27..54`, `54..80` split at about `0.29 tok\u002Fsec` after a 19-token prompt ingest. This is not a single-Pi claim. |\n\nSee [`docs\u002FMODEL_CATALOG.md`](docs\u002FMODEL_CATALOG.md) for likely-compatible\nrows to test next and model families that are intentionally not claimable yet.\n\n## Pi Performance Snapshot\n\nCurrent Pi 2 evidence, measured on local release builds:\n\n### Fresh GitHub-head Throughput Sweep - 2026-05-26\n\nThese rows were rerun from clean GitHub-head release worktrees at commit\n`3229a15` using Raspberry Pi-class ARM64 nodes. Single-Pi rows used\n`nanocamelid chat \"\u003Cmodel>\" \"Say hello in one sentence.\" 0.0 8` with the\ndefault fast profile and `NANOCAMELID_CONTEXT_LIMIT=4096`, except Strand 14B\nsingle-Pi, which used a `2048` context cap. Cluster rows used three Pi nodes and\nthe documented split ranges. Short deterministic prompts may stop before the\nrequested token count when the model emits EOS.\n\n| Model | Mode | Fresh result |\n| --- | --- | --- |\n| Qwen2.5 0.5B Instruct Q4_0 | Single Pi | 6 tokens in `1.37s` (`4.36 tok\u002Fsec`); prompt ingest `0.12s`; generated `Hello, how are you?`. |\n| Qwen2.5-Coder 0.5B Instruct Q4_0 | Single Pi | 8 tokens in `1.45s` (`5.50 tok\u002Fsec`); prompt ingest `0.12s`. |\n| Qwen2.5-Coder 0.5B Instruct Q5_K_M | Single Pi | 8 tokens in `1.03s` (`7.80 tok\u002Fsec`); prompt ingest `1.48s`. |\n| DeepSeek-R1-Distill-Qwen 1.5B Q4_0 | Single Pi | 8 tokens in `6.05s` (`1.32 tok\u002Fsec`); prompt ingest `0.32s`. |\n| Llama 3.2 1B Instruct Q4_0 | Single Pi | 2 tokens in `0.55s` (`3.66 tok\u002Fsec`); prompt ingest `0.52s`; generated `Hello!`. |\n| Llama 3.2 1B Instruct Q8_0 | Single Pi | 2 tokens in `0.66s` (`3.04 tok\u002Fsec`); prompt ingest `1.01s`; generated `Hello!`. 
|\n| Llama 3.2 3B Instruct Q4_0 | Single Pi | 2 tokens in `0.95s` (`2.11 tok\u002Fsec`); prompt ingest `1.15s`; generated `Hello!`. |\n| Mistral 7B Instruct v0.1 Q4_0 | Single Pi | 8 tokens in `19.70s` (`0.41 tok\u002Fsec`); prompt ingest `2.09s`. |\n| Qwen2.5-Coder 7B Instruct Q4_0 | Single Pi | 2 tokens in `5.90s` (`0.34 tok\u002Fsec`) on one run and `0.33 tok\u002Fsec` on a second Pi; prompt ingest about `1.7s`; generated `Hello!`. |\n| Strand Rust Coder 14B v1 Q6_K | Single Pi, capped context | 2 tokens in `2.38s` (`0.84 tok\u002Fsec`); prompt ingest `8.14s`; generated `Hello!`. |\n| Qwen3 0.6B Q8_0 | Single Pi | 8 tokens in `1.26s` (`6.34 tok\u002Fsec`); prompt ingest `0.52s`. |\n| Qwen3 1.7B Q8_0 | Single Pi | 8 tokens in `2.80s` (`2.85 tok\u002Fsec`); prompt ingest `1.28s`. |\n| Qwen3 4B Q4_0 | Single Pi | 8 tokens in `3.69s` (`2.17 tok\u002Fsec`); prompt ingest `1.26s`. |\n| SmolLM2 1.7B Instruct Q4_0 | Single Pi | 2 tokens in `0.41s` (`4.88 tok\u002Fsec`); prompt ingest `0.53s`; generated `Hello!`. |\n| SmolLM3 3B Q4_0 | Single Pi | 8 tokens in `2.57s` (`3.11 tok\u002Fsec`); prompt ingest `0.98s`; generated `Hello! How can`. |\n| Gemma 3 1B IT Q4_0 | Single Pi | 8 tokens in `4.73s` (`1.69 tok\u002Fsec`); prompt ingest `1.06s`. |\n| LFM2 700M Q4_0 | Single Pi | Not runtime-ready: load failed on missing `output_norm.weight`. |\n| LFM2 1.2B Q4_0 | Single Pi | Not runtime-ready: load failed on missing `output_norm.weight`. |\n| LFM2 2.6B Q4_0 | Single Pi | Not runtime-ready: load failed on missing `output_norm.weight`. |\n| Phi 3.5 Mini Instruct Q4_0 | Single Pi | Not runtime-ready: load failed on missing `blk.0.attn_q.weight`. |\n| Strand Rust Coder 14B v1 Q6_K | Three-Pi cluster, `0..16`, `16..32`, `32..48` | 6 tokens in `5.855s` (`1.025 tok\u002Fsec`); prompt ingest `8.353s`; result `PASS_GENERATE_UNCHECKED`. 
|\n| Mixtral 8x7B Instruct v0.1 Q4_0 | Three-Pi cluster, `0..11`, `11..22`, `22..32` | 8 tokens in `22.913s` (`0.349 tok\u002Fsec`); prompt ingest `18.522s`; result `PASS_GENERATE_UNCHECKED`. |\n| Qwen2.5-Coder 32B Instruct Q4_0 | Three-Pi cluster, `0..21`, `21..43`, `43..64` | 6 tokens in `20.269s` (`0.296 tok\u002Fsec`); prompt ingest `37.796s`; result `PASS_GENERATE_UNCHECKED`. |\n| Llama 3 70B Instruct Q4_0 | Three-Pi cluster, `0..27`, `27..54`, `54..80` | 4 tokens in `24.146s` (`0.166 tok\u002Fsec`); prompt ingest `98.769s`; result `PASS_GENERATE_UNCHECKED`. |\n\n- Llama 3.2 1B Instruct Q4_0 short generation, default fast profile:\n  `4.18 tok\u002Fsec`.\n- Llama 3.2 1B Instruct Q8_0 forced end-to-end readiness run:\n  `ready`, `llama3_instruct`, `llama32_1b_shape: ok`,\n  `max_logit_delta: 0.00000000`, generated `\"Hello!\"` at about\n  `3.20 tok\u002Fsec`.\n- Llama 3.2 3B Instruct Q4_0 direct generation: about `2.22 tok\u002Fsec`;\n  chat smoke generated `\"Hello!\"` with `max_logit_delta: 0.00000000`;\n  capped 8096-context chat smoke and TUI launch pass.\n- Q8 dot microbenchmark, default-selected SDOT: about `1.69 ns\u002Fblock`.\n- Q4 layout microbenchmark: row-major `90.536ms`, swizzled 1x4 `70.648ms`,\n  page-aligned swizzled `68.337ms`.\n- Qwen2.5-Coder-7B-Instruct Q4_0 smoke: exact logit parity,\n  `max_logit_delta: 0.00000000`.\n- Qwen2.5-Coder-7B-Instruct Q4_0 direct generation, short Rust ownership\n  prompt: model load `3.66s`, prefill `4.05s`, generation `14` tokens in\n  `10.45s` (`1.34 tok\u002Fsec`).\n- Qwen2.5-Coder-7B-Instruct Q4_0 145-token chat prompt: prefill improved from\n  `48.90s` at batch 1 to about `17.0s` with loop-inverted batch 16 prefill.\n- Qwen2.5-Coder-0.5B-Instruct Q5_K_M Pi smoke: official\n  `qwen2.5-coder-0.5b-instruct-q5_k_m.gguf` inspects as `ready`; the row\n  exercises mixed `Q5_1`, `Q5_K`, `Q6_K`, and `Q8_0` tensors; direct generation\n  produced 8 tokens at `9.98 tok\u002Fsec`; chat generation used `qwen_im` and\n  
produced 8 tokens at `9.71 tok\u002Fsec`.\n- Strand Rust Coder 14B v1 Q6_K single-Pi capped-context smoke: load about\n  `39-54s`, one-token prompt prefill about `6.6s`, 8 generated tokens in\n  `46.06s` (`0.17 tok\u002Fsec`).\n- Strand Rust Coder 14B v1 Q6_K three-Pi cluster chat: Q6_K SDOT, split\n  `0..16`, `16..32`, `32..48`, with pre-unpacked Q6_K weights reused across\n  batched prompt ingest, prompt ingest about `8.38s`, generated\n  `\"**Generated Code:**\\n\\n```\"` as 6 tokens in `4.71s` (`1.28 tok\u002Fsec`).\n  Use `scripts\u002Fpi\u002Fstrand-cluster.sh` for the repeatable launch plan.\n- Llama 3 70B Instruct Q4_0 three-Pi `master-chat`: 19-token prompt ingest\n  took about `100.39s`; generated `\"Raspberry Pi clusters\"` as 4 tokens in\n  `13.93s` (`0.29 tok\u002Fsec`) with the `0..27`, `27..54`, `54..80` split.\n- Q4_0 page-aligned 1x4 swizzled storage improved the isolated Pi 2 layout\n  microbenchmark from `99.716ms` to `96.445ms` over 7 runs, about `1.034x`\n  versus contiguous swizzled storage. The same Qwen prompt stayed essentially\n  flat end-to-end, so this remains opt-in because the win is small and requires\n  duplicate swizzled chunks.\n- Experimental f16 KV-cache storage preserved the Qwen2.5-Coder-7B-Instruct\n  Q4_0 4-token smoke output with `max_logit_delta: 0.00000000`, but the short\n  16-token Rust prompt was slightly slower than f32 cache (`1.83 tok\u002Fsec` vs\n  `1.88 tok\u002Fsec`). Treat it as a memory-pressure option until longer-context\n  runs prove a speed win.\n- mmap-backed source reads improve the warm Qwen2.5-Coder-7B-Instruct Q4_0\n  load path to `2.63s`, but they do not make large models instant. 
Strand 14B\n  Q6_K still takes about `47s` to load because the current runtime still\n  decodes\u002Fcopies quantized blocks and materializes embedding vectors.\n- Q8 SDOT single-block microkernel: split accumulators moved the Pi 2 SDOT\n  median from about `1.683 ns\u002Fblock` to about `1.679 ns\u002Fblock`.\n- Vectorized NEON activation quantization preserved the 1B smoke path and moved\n  a short default 1B run from `4.16` to `4.18 tok\u002Fsec`. Treat this as a safe\n  kernel cleanup, not a proven end-to-end breakthrough.\n\nThe Q4_0 1B path is faster than Q8_0 on the same prompt, but the measured\nend-to-end gain is still far below the theoretical memory-traffic ceiling. The\nnext useful performance work should be driven by real prompt\u002Fdecode timings, not\nisolated kernel wins alone.\n\nUse `nanocamelid probe` on Raspberry Pi hosts to inspect CPU max frequency,\ngovernor, isolated CPU state, selected worker-core policy, and SIMD support. The\ntool reports telemetry only; boot parameters and overclock settings remain an\noperator decision outside NanoCamelid. 
When Linux reports the `ondemand`\ngovernor, `probe` and the TUI banner recommend the safe non-overclock command\nfor repeatable low-latency decode:\n\n```bash\necho performance | sudo tee \u002Fsys\u002Fdevices\u002Fsystem\u002Fcpu\u002Fcpu*\u002Fcpufreq\u002Fscaling_governor\n```\n\n## Raspberry Pi Deployment\n\nPrepare a Pi workspace:\n\n```bash\n.\u002Fscripts\u002Fpi\u002Fbootstrap.sh\n```\n\nBuild and test remotely:\n\n```bash\n.\u002Fscripts\u002Fremote_build.sh \u003Cpi-host> [ssh-key] [pi-user]\n```\n\nPreview the resolved deploy\u002Fbuild\u002Freadiness plan without SSH:\n\n```bash\n.\u002Fscripts\u002Fremote_build.sh \u003Cpi-host> [ssh-key] [pi-user] git-ff --dry-run\n```\n\nOn a prepared Pi workspace, `remote_build.sh` now runs the same 1B readiness\ngate as `scripts\u002Fpi\u002Fready-1b.sh`: it prefers the Pi-local\n`Llama-3.2-1B-Instruct-Q4_0.gguf`, falls back to `...Q8_0.gguf`, then runs\ninspect, scalar-vs-selected smoke validation, and one direct chat turn by\ndefault. 
Disable that model-backed gate explicitly with:\n\n```bash\nNANOCAMELID_REMOTE_SMOKE=0 .\u002Fscripts\u002Fremote_build.sh \u003Cpi-host> [ssh-key] [pi-user]\n```\n\n`false` and `no` are also accepted falsey values for\n`NANOCAMELID_REMOTE_SMOKE`.\n\nTo run the capped 1B context-pack smoke sweep after the readiness gate, set:\n\n```bash\nNANOCAMELID_REMOTE_CONTEXT_PACKS=512,1024,2048,4096,8192 \\\n.\u002Fscripts\u002Fremote_build.sh \u003Cpi-host> [ssh-key] [pi-user]\n```\n\nTo add the real 1B prefill batch sweep after readiness, set:\n\n```bash\nNANOCAMELID_REMOTE_PREFILL_BENCH=1 \\\nNANOCAMELID_REMOTE_PREFILL_BATCHES=1,16,32,64 \\\n.\u002Fscripts\u002Fremote_build.sh \u003Cpi-host> [ssh-key] [pi-user]\n```\n\nTo force a specific GGUF path that already exists on the Pi:\n\n```bash\nNANOCAMELID_REMOTE_SMOKE_GGUF=\u002Fpath\u002Fon\u002Fpi\u002Fmodel.gguf \\\n.\u002Fscripts\u002Fremote_build.sh \u003Cpi-host> [ssh-key] [pi-user]\n```\n\nTo override the default smoke kind, prompt, or token budget:\n\n```bash\nNANOCAMELID_REMOTE_SMOKE_KIND=model \\\nNANOCAMELID_SMOKE_PROMPT=\"Hello\" \\\nNANOCAMELID_SMOKE_TOKENS=1 \\\n.\u002Fscripts\u002Fremote_build.sh \u003Cpi-host> [ssh-key] [pi-user]\n```\n\nTo keep the remote gate at inspect+smoke only:\n\n```bash\nNANOCAMELID_REMOTE_READY_CHAT=0 .\u002Fscripts\u002Fremote_build.sh \u003Cpi-host> [ssh-key] [pi-user]\n```\n\nWhen direct chat is disabled, `remote_build.sh` does not forward\n`NANOCAMELID_READY_PROMPT`, `NANOCAMELID_READY_TOKENS`, or\n`NANOCAMELID_READY_TEMP` into the Pi readiness command.\n\nRemote builds default to clean git fast-forward deployment so Pi-local edits are\nnot overwritten by validation runs. For explicit snapshot deployment, pass\n`rsync` as the fourth argument or set `NANOCAMELID_DEPLOY_MODE=rsync`.\n\n## Raspberry Pi Clustering\n\nNanoCamelid includes experimental pipeline-parallel cluster tools for splitting\nlarge dense models across multiple Raspberry Pi-class nodes. 
The cluster path is\nintended for smoke tests and runtime experiments, not polished interactive chat\nyet. Use the single-node `chat 1b` or `tui 1b` path for the current practical\n1B experience.\n\nCluster assumptions:\n\n- Each Pi has the same NanoCamelid checkout and release build.\n- Each Pi can read the same GGUF path, either from copied local files or a shared\n  mount.\n- Nodes can reach each other over TCP on the chosen ports.\n- Large-context GGUFs should use `NANOCAMELID_CLUSTER_CONTEXT_LIMIT` until full\n  advertised-context memory behavior is validated.\n- Split layers must be inside the model layer range. Passing `0` to the\n  two-node worker\u002Fmaster tools means \"use the midpoint.\"\n\nOn each Pi, prepare the repo and build release binaries:\n\n```bash\ncd \u002Fmnt\u002Fnanocamelid\u002Fsrc\u002FNanoCamelid\nexport CARGO_TARGET_DIR=\u002Fmnt\u002Fnanocamelid\u002Ftarget\ncargo build --release --bins\n```\n\nBefore using multiple Pis, validate that split execution matches full execution\ninside one process. This loads a full model plus two partial views and should end\nwith `result: PASS`:\n\n```bash\nNANOCAMELID_CLUSTER_CONTEXT_LIMIT=4 \\\ncargo run --release --bin cluster_split_smoke -- \u002Fpath\u002Fto\u002Fmodel.gguf 1 0\n```\n\nCheck network latency between Pis with the TCP benchmark. On the target worker:\n\n```bash\ncargo run --release --bin cluster_bench -- server 5005\n```\n\nOn the client\u002Fmaster Pi:\n\n```bash\ncargo run --release --bin cluster_bench -- client \u003Cworker-ip> 5005 1000\n```\n\nFor a two-Pi token-level smoke, start the upper-layer worker first. Bind to\n`0.0.0.0` so the master can connect from another Pi:\n\n```bash\nNANOCAMELID_CLUSTER_CONTEXT_LIMIT=128 \\\ncargo run --release --bin cluster_tcp_smoke -- \\\n  worker \u002Fpath\u002Fto\u002Fmodel.gguf 0.0.0.0:5005 0\n```\n\nThen run the master on the other Pi. 
`master` compares the worker result against\na local full forward pass and fails if the split token diverges:\n\n```bash\nNANOCAMELID_CLUSTER_CONTEXT_LIMIT=128 \\\ncargo run --release --bin cluster_tcp_smoke -- \\\n  master \u002Fpath\u002Fto\u002Fmodel.gguf \u003Cworker-ip>:5005 1 0 2\n```\n\nFor prompt text generation through the same two-node split, use\n`master-generate`:\n\n```bash\nNANOCAMELID_CLUSTER_CONTEXT_LIMIT=128 \\\ncargo run --release --bin cluster_tcp_smoke -- \\\n  master-generate \u002Fpath\u002Fto\u002Fmodel.gguf \u003Cworker-ip>:5005 \"fn hello_world\" 0 8\n```\n\nFor chat-template prompt generation, use `master-chat`. This renders the GGUF\nchat template when NanoCamelid recognizes it, including the Mistral\u002FMixtral\n`[INST] ... [\u002FINST]` format:\n\n```bash\nNANOCAMELID_CLUSTER_CONTEXT_LIMIT=128 \\\ncargo run --release --bin cluster_tcp_smoke -- \\\n  master-chat \u002Fpath\u002Fto\u002Fmodel.gguf \u003Cworker-ip>:5005 \"Write one short sentence.\" 0 8\n```\n\nFor a three-Pi pipeline, choose explicit layer ranges. For a 48-layer model,\n`0..16`, `16..32`, and `32..48` is the usual first split to try. Start the final\nworker first:\n\n```bash\nNANOCAMELID_CLUSTER_CONTEXT_LIMIT=128 \\\ncargo run --release --bin cluster_tcp_smoke -- \\\n  worker \u002Fpath\u002Fto\u002Fmodel.gguf 0.0.0.0:5007 32\n```\n\nStart the middle worker next. 
It accepts the master connection and forwards to\nthe final worker:\n\n```bash\nNANOCAMELID_CLUSTER_CONTEXT_LIMIT=128 \\\ncargo run --release --bin cluster_tcp_smoke -- \\\n  middle-worker \u002Fpath\u002Fto\u002Fmodel.gguf 0.0.0.0:5006 \u003Cfinal-worker-ip>:5007 16 32\n```\n\nRun the master against the middle worker:\n\n```bash\nNANOCAMELID_CLUSTER_CONTEXT_LIMIT=128 \\\ncargo run --release --bin cluster_tcp_smoke -- \\\n  master-generate \u002Fpath\u002Fto\u002Fmodel.gguf \u003Cmiddle-worker-ip>:5006 \"fn hello_world\" 16 8\n```\n\nUse `master-chat` instead of `master-generate` for chat-template rows such as\nMixtral.\n\nFor Strand Rust Coder 14B v1 Q6_K specifically, use the 48-layer split that has\nbeen Pi-smoked with Q6_K SDOT enabled: `0..16` on the master, `16..32` on the\nmiddle worker, and `32..48` on the final worker. The launcher below wraps the\nsame commands and keeps the node addresses in environment variables:\n\n```bash\n.\u002Fscripts\u002Fpi\u002Fstrand-cluster.sh --dry-run\n```\n\nStart the final worker:\n\n```bash\n.\u002Fscripts\u002Fpi\u002Fstrand-cluster.sh final \u002Fpath\u002Fto\u002FFortytwo_Strand-Rust-Coder-14B-v1-Q6_K.gguf\n```\n\nStart the middle worker:\n\n```bash\nNANOCAMELID_CLUSTER_FINAL_ADDR=\u003Cfinal-worker-host>:5007 \\\n.\u002Fscripts\u002Fpi\u002Fstrand-cluster.sh middle \u002Fpath\u002Fto\u002FFortytwo_Strand-Rust-Coder-14B-v1-Q6_K.gguf\n```\n\nRun chat generation from the master:\n\n```bash\nNANOCAMELID_CLUSTER_MIDDLE_ADDR=\u003Cmiddle-worker-host>:5006 \\\n.\u002Fscripts\u002Fpi\u002Fstrand-cluster.sh master \u002Fpath\u002Fto\u002FFortytwo_Strand-Rust-Coder-14B-v1-Q6_K.gguf\n```\n\nFor Mixtral 8x7B Q4_0 specifically, use the 32-layer split that has been\nPi-smoked: `0..11` on the master, `11..22` on the middle worker, and `22..32`\non the final worker. 
The launcher below wraps the same commands and keeps the\nnode addresses in environment variables:\n\n```bash\n.\u002Fscripts\u002Fpi\u002Fmixtral-cluster.sh --dry-run\n```\n\nStart the final worker:\n\n```bash\n.\u002Fscripts\u002Fpi\u002Fmixtral-cluster.sh final \u002Fpath\u002Fto\u002Fmixtral-8x7b-instruct-v0.1.Q4_0.gguf\n```\n\nStart the middle worker:\n\n```bash\nNANOCAMELID_CLUSTER_FINAL_ADDR=\u003Cfinal-worker-host>:5007 \\\n.\u002Fscripts\u002Fpi\u002Fmixtral-cluster.sh middle \u002Fpath\u002Fto\u002Fmixtral-8x7b-instruct-v0.1.Q4_0.gguf\n```\n\nRun chat generation from the master:\n\n```bash\nNANOCAMELID_CLUSTER_MIDDLE_ADDR=\u003Cmiddle-worker-host>:5006 \\\n.\u002Fscripts\u002Fpi\u002Fmixtral-cluster.sh master \u002Fpath\u002Fto\u002Fmixtral-8x7b-instruct-v0.1.Q4_0.gguf\n```\n\nUseful output to check:\n\n- `result: PASS` from `cluster_split_smoke`\n- `*_cluster_peer` handshake lines from every TCP role\n- `generated_tokens`, streamed generated text, and `cluster_tokens_per_sec` from\n  `master-generate` or `master-chat`\n- `worker_generated_tokens` from the final worker\n- `middle_feedback_tokens` from a middle worker\n- `cluster_tcp_round_trip_total_ms` when comparing network overhead\n\n## Project Status\n\n- Host feature probing is available.\n- GGUF metadata and tensor layout inspection are available.\n- Q8_0 scalar, NEON, and auto-selected SDOT dot-product paths are available.\n- Q4_0 loading and Q4_0 weight x Q8_0 activation matmul paths are available.\n- Q5_0, Q5_1, Q2_K, Q3_K, Q4_K, Q5_K, Q8_K, and IQ4_NL loading plus\n  Q8-activation matmul paths are available through scalar reference kernels.\n- Q6_K loading and Q8-activation matmul paths are available through scalar and\n  Pi SDOT kernels.\n- `inspect` reports runtime-supported and unsupported tensor types before a\n  model is marked ready.\n- Single-turn chat prompt rendering is available for recognized instruct templates.\n- Interactive terminal chat is available with model\u002Fkernel, token, 
TTFT, and throughput telemetry.\n- The TUI can switch GGUFs at runtime with `\u002Fmodel \u003Cpath>`.\n- The default Pi fast profile enables SDOT, Q4 swizzling, Q4\u002FQ6 SDOT matmuls,\n  head-parallel attention, and NEON activation quantization when the host\n  supports them.\n- The Pi 1B chat launcher preserves scalar-vs-selected-kernel parity through\n  the smoke gate.\n- Q8_0 and Q4_0 model smoke validation is available for the tested GGUF rows above.\n- Broader model support and performance claims require Pi-local artifacts and row-specific validation.\n\n## More Details\n\n- [Pi porting notes](docs\u002FPI_PORTING.md)\n- [Camelid porting map](docs\u002FCAMELID_PORTING_MAP.md)\n\n## License\n\nNanoCamelid is licensed under the MIT License. See [LICENSE](LICENSE).\n","NanoCamelid is a high-performance, Rust-native local large language model inference engine designed for Raspberry Pi and ARM64 architectures. It supports loading and running GGUF files at multiple quantization levels, provides model metadata and tensor layout inspection, and allows chat testing in a terminal. The engine is specifically optimized for low-resource environments and automatically selects kernels suited to the hardware's features to improve performance. It is suited to scenarios that need to run natural language processing tasks efficiently on edge devices, such as Raspberry Pi-based home automation, educational tool development, or lightweight AI application deployment.","2026-06-11 04:06:50","CREATED_QUERY"]