[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-75133":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":25,"hasPages":23,"topics":26,"createdAt":10,"pushedAt":10,"updatedAt":42,"readmeContent":43,"aiSummary":44,"trendingCount":16,"starSnapshotCount":16,"syncStatus":45,"lastSyncTime":46,"discoverSource":47},75133,"lucebox-hub","Luce-Org\u002Flucebox-hub","Luce-Org","Fast LLM speculative inference server for consumer hardware.","https:\u002F\u002Fwww.lucebox.com\u002Fblog",null,"C++",2381,220,20,23,0,45,82,449,135,29.03,"Apache License 2.0",false,"main",true,[27,28,29,30,31,32,33,34,35,36,37,38,39,40,41],"cuda","cuda-kernels","dflash","kernel","llama-cpp","local-ai","luce","lucebox","megakernel","nvidia-cuda","pflash","qwen","rtx3090","speculative-decoding","speculative-prefill","2026-06-12 02:03:33","\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002Fbanner.png\" alt=\"Lucebox\" width=\"85%\">\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Flucebox.com\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Flucebox.com-f5c842?style=for-the-badge&logo=safari&logoColor=f5c842&labelColor=090909\" alt=\"lucebox.com\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fdiscord.gg\u002FyHfswqZmJQ\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDiscord-f5c842?style=for-the-badge&logo=discord&logoColor=f5c842&labelColor=090909\" alt=\"Discord\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Flucebox.com\u002Fblog\">\u003Cimg 
src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FBlog-f5c842?style=for-the-badge&logo=rss&logoColor=f5c842&labelColor=090909\" alt=\"Blog\">\u003C\u002Fa>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  \u003Ca href=\"LICENSE\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-Apache_2.0-e8e8ed?style=for-the-badge&labelColor=090909\" alt=\"Apache 2.0\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fdeveloper.nvidia.com\u002Fcuda-toolkit\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FCUDA-12%2B-76b900?style=for-the-badge&logo=nvidia&logoColor=76b900&labelColor=090909\" alt=\"CUDA 12+\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Frocm.docs.amd.com\u002Fprojects\u002FHIP\u002Fen\u002Flatest\u002F\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FHIP-7%2B-ed1c24?style=for-the-badge&logo=amd&logoColor=ed1c24&labelColor=090909\" alt=\"HIP 7+\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fisocpp.org\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FC%2B%2B-17-e8e8ed?style=for-the-badge&logo=cplusplus&logoColor=e8e8ed&labelColor=090909\" alt=\"C++17\">\u003C\u002Fa>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  \u003Cstrong>Local LLM inference server built for speed. 
Custom kernels, speculative prefill & decoding, quantized GGUF paths.\u003C\u002Fstrong>\u003Cbr\u002F>\n  Each project is a new optimization to our engine for a specific model family and hardware target.\n\u003C\u002Fp>\n\n---\n\n## Projects\n\nEach directory is a self-contained project with setup instructions and benchmark notes.\n\n\u003Cp align=\"center\">\n  \u003Ca href=\"megakernel\u002F\">\u003Cimg src=\"assets\u002Fsvg\u002Fcard-megakernel-dark.svg\" alt=\"Megakernel\" width=\"46%\">\u003C\u002Fa>\n  &nbsp;&nbsp;\n  \u003Ca href=\"dflash\u002F\">\u003Cimg src=\"assets\u002Fsvg\u002Fcard-dflash-dark.svg\" alt=\"DFlash 27B\" width=\"46%\">\u003C\u002Fa>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  \u003Ca href=\"pflash\u002F\">\u003Cimg src=\"assets\u002Fsvg\u002Fcard-pflash-dark.svg\" alt=\"PFlash speculative prefill\" width=\"46%\">\u003C\u002Fa>\n\u003C\u002Fp>\n\n---\n\n## Supported models\n\nAll speedups measured vs vendored llama.cpp (`-fa 1`, matching KV quant).\n\n| GPU | Model | TTFT speedup | Decode speedup |\n|-----|-------|:------------:|:--------------:|\n| RTX 3090 | Qwen 3.5-0.8B (Megakernel) | — | **~2×** vs F16 |\n| RTX 3090 | Qwen 3.5-27B Q4_K_M (DFlash + DDTree) | — | **3.43×** vs AR |\n| RTX 3090 | Qwen 3.6-27B Q4_K_M (DFlash + PFlash) | **10.4×** @ 128K | **~3×** vs AR |\n| RTX 3090 | Laguna-XS.2 33B-A3B Q4_K_M (DFlash + PFlash) | **5.4×** @ 128K | AR (draft pending) |\n| RTX 5090 | Qwen 3.6-27B Q4_K_M (DFlash + DDTree) | — | **4.84×** vs AR (205 tok\u002Fs) |\n| Ryzen AI MAX+ 395 (gfx1151) | Qwen 3.5-27B Q4_K_M (DFlash + PFlash, HIP) | **2.24×** @ 16K | **3.08×** vs llama.cpp HIP AR (37 tok\u002Fs) |\n\n## Client harnesses\n\n[`harness\u002F`](harness\u002F) contains RTX 3090 client launchers and regression tests\nfor Lucebox server compatibility. 
Use it to run Lucebox inside Claude Code,\nCodex, OpenCode, Hermes, Pi, OpenClaw, or Open WebUI, or to check that a server\nchange still works with those clients.\n\n```bash\nharness\u002Fclients\u002Frun_codex.sh\nharness\u002Fclients\u002Frun_claude_code.sh\npython3 harness\u002Fclient_test_runner.py probe --url http:\u002F\u002F127.0.0.1:8000\n```\n\n## 01 · Megakernel Qwen3.5 0.8B on RTX 3090\n\nSingle-kernel CUDA inference for Qwen 3.5-0.8B on RTX 3090. All 24 layers run in one persistent dispatch.\n\n```bash\n# 1. clone + enter\ngit clone https:\u002F\u002Fgithub.com\u002FLuce-Org\u002Flucebox-hub && cd lucebox-hub\u002Fmegakernel\n\n# 2. install (Python 3.10+, CUDA 12+, PyTorch 2.0+). Weights stream from HF on first run.\npython -m venv .venv && source .venv\u002Fbin\u002Factivate   # required on Ubuntu 24+ system Python (PEP 668)\npip install --upgrade pip\npip install torch                          # install BEFORE the next step; setup.py imports torch at build time\npip install -e . --no-build-isolation      # --no-build-isolation lets the build see the torch you just installed\n\n# 3. run the benchmark (prefill pp520 + decode tg128 vs llama.cpp BF16 + PyTorch HF)\npython final_bench.py\n```\n\n| Method | Prefill pp520 | Decode tg128 | tok\u002FJ |\n|--------|:-------------:|:------------:|:-----:|\n| **Megakernel** `@220W` | **21,347** | **413** | **1.87** |\n| llama.cpp BF16 `@350W` | 11,247 | 267 | 0.76 |\n| PyTorch HF | 7,578 | 108 | n\u002Fa |\n\nImplementation notes: 82 blocks, 512 threads, cooperative grid sync, no CPU round trips between layers, and weights streamed from Hugging Face on first run.\n\n[Full writeup →](megakernel\u002FREADME.md) · [Benchmarks →](megakernel\u002FRESULTS.md) · [Blog post →](https:\u002F\u002Flucebox.com\u002Fblog\u002Fmegakernel)\n\n> **Blackwell (RTX 5090, DGX Spark \u002F GB10):** auto-detected by setup; NVFP4 decode path lands ~194 tok\u002Fs tg128 on GB10. 
See [megakernel\u002FREADME.md#blackwell-sm_120--sm_121a](megakernel\u002FREADME.md).\n\n---\n\n## 02 · DFlash DDtree Qwen3.5 & Qwen3.6 27B GGUF on RTX 3090\n\nDFlash speculative decoding for Qwen3.5\u002FQwen3.6 27B GGUF targets on a single GPU. The default setup uses Qwen3.6-27B Q4_K_M plus the Lucebox Q8_0 GGUF DFlash draft.\n\n- **Up to 207 tok\u002Fs** in the demo (207.6 tok\u002Fs DFlash vs 38.0 tok\u002Fs AR, 5.46×)\n- **129.5 tok\u002Fs mean** on the HumanEval 10-prompt bench\n- **3.43× faster than autoregressive** (+15% over chain speculative decoding)\n- **2.8× faster than SGLang AWQ** on the same hardware\n- **Up to 256K context in 24 GB** via TurboQuant TQ3_0 KV cache (128K Q4_0 bench: 134.78 tok\u002Fs at ctx=131072)\n\n```bash\n# 1. clone with submodules (pulls the pinned Luce-Org\u002Fllama.cpp@luce-dflash fork)\ngit clone --recurse-submodules https:\u002F\u002Fgithub.com\u002FLuce-Org\u002Flucebox-hub && cd lucebox-hub\u002Fdflash\n\n# 2. build the C++\u002FCUDA decoder (CUDA 12+, CMake 3.18+)\n# Default compiles for Pascal\u002FVolta\u002FTuring\u002FAmpere (60\u002F61\u002F62\u002F70\u002F75\u002F86; +120 on CUDA 12.8+, +sm_121\u002FDGX Spark on CUDA 12.9+, +sm_110\u002FThor on CUDA 13.0+) so the binary runs on every supported card.\n# 3090-only users can add -DCMAKE_CUDA_ARCHITECTURES=86 to skip the other archs and build faster (~3 min).\ncmake -B build -S . -DCMAKE_BUILD_TYPE=Release\ncmake --build build --target test_dflash -j\ncmake --build build --target test_generate -j\n\n# 3. fetch weights: ~16 GB Q4_K_M target + 1.84 GB Lucebox Q8_0 GGUF DFlash draft\nhf download unsloth\u002FQwen3.6-27B-GGUF Qwen3.6-27B-Q4_K_M.gguf --local-dir models\u002F\nhf download Lucebox\u002FQwen3.6-27B-DFlash-GGUF dflash-draft-3.6-q8_0.gguf --local-dir models\u002Fdraft\u002F\n\n# 4a. one-shot streaming generate\npython3 scripts\u002Frun.py --prompt \"def fibonacci(n):\"\n\n# 4b. 
or reproduce the paper-style bench (HumanEval + GSM8K + Math500, ~15 min)\npython3 scripts\u002Fbench_llm.py\n```\n\n| Benchmark | AR (tok\u002Fs) | DFlash+DDTree (tok\u002Fs) | Speedup |\n|-----------|:----------:|:---------------------:|:-------:|\n| **HumanEval** | 37.8 | **129.5** | **3.43×** |\n| Math500 | 37.7 | 110.5 | 2.93× |\n| GSM8K | 37.7 | 96.2 | 2.55× |\n\n**Why GGUF\u002FQ4_K_M:** on 24 GB GPUs, the target, draft, DDTree verify state, and KV cache need to fit together. The default Qwen3.6 setup uses a ~16 GB Q4_K_M target and a 1.84 GB GGUF draft.\n\nAlgorithms used:\n- [**DFlash**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.06036) (z-lab, 2026): block-diffusion draft conditioned on target hidden states.\n- [**DDTree**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2604.12989) (Ringel et al., 2026): tree-structured verify that beats chain verify at the same compute budget.\n\nImplemented here:\n- C++\u002FCUDA decode engine on top of ggml (no libllama, no Python runtime, Q4_K_M target path).\n- Three custom CUDA kernels for tree-aware SSM state rollback: `ggml_ssm_conv_tree`, `ggml_gated_delta_net_tree`, `ggml_gated_delta_net_tree_persist`.\n- DDTree budget swept for RTX 3090 + Q4_K_M target: **budget=22** is the sweet spot.\n- TQ3_0 KV cache (TurboQuant 3.5 bpv, default) + sliding `target_feat` ring to fit up to 256K context in 24 GB (Q4_0 available as legacy, tops out near 128K).\n\n### Running on other GPUs (4090, 5090, DGX Spark \u002F GB10, Jetson AGX Thor)\n\nSupported out of the box; the build just needs the right CUDA toolkit. 
`dflash\u002FCMakeLists.txt` already auto-adds Blackwell archs when your nvcc is new enough, so the main quickstart above works as-is on newer cards.\n\n| GPU | Arch | Min CUDA | Status |\n|-----|:----:|:--------:|--------|\n| Tesla P40 Pascal | `sm_61` | 12.0 | supported with scalar F16 fallback; needs 24 GB for the 27B stack |\n| Tesla V100 Volta | `sm_70` | 12.0 | supported with F16 WMMA kernels |\n| RTX 3090 Ampere | `sm_86` | 12.0 | **reference, all numbers above** |\n| RTX 2080 Ti Turing | `sm_75` | 12.0 | supported, 53 tok\u002Fs DFlash verified (FP16 draft) |\n| RTX 4090 Ada | `sm_89` | 12.0 | should work, unverified, pass `-DCMAKE_CUDA_ARCHITECTURES=89` |\n| RTX 5090 Blackwell consumer | `sm_120` | 12.8 | **205 tok\u002Fs DFlash, 4.84× vs AR** (Q4_K_M, budget=40) |\n| DGX Spark \u002F GB10 | `sm_121` (compute capability 12.1) | 12.9 | supported, auto-added by CMake |\n| Jetson AGX Thor | `sm_110` | 13.0 | supported, auto-added by CMake |\n\nVerify your target:\n```bash\npython -c \"import torch; p=torch.cuda.get_device_properties(0); print(p.name, 'sm_%d%d'%(p.major,p.minor), p.multi_processor_count,'SMs', round(p.total_memory\u002F1e9,1),'GB')\"\nnvcc --version\n```\n\n**DGX Spark \u002F GB10 quick start:**\n```bash\n# CUDA 12.9+ required for sm_121\nnvcc --version  # must show >= 12.9\ngit clone --recurse-submodules https:\u002F\u002Fgithub.com\u002FLuce-Org\u002Flucebox-hub && cd lucebox-hub\u002Fdflash\ncmake -B build -S . -DCMAKE_BUILD_TYPE=Release   # CMake auto-adds sm_121\ncmake --build build --target test_dflash -j\n```\n\n**Jetson AGX Thor quick start:**\n```bash\n# CUDA 13.0+ required for sm_110 \u002F AGX Thor.\nnvcc --version\ngit clone --recurse-submodules https:\u002F\u002Fgithub.com\u002FLuce-Org\u002Flucebox-hub && cd lucebox-hub\u002Fdflash\ncmake -B build -S . 
-DCMAKE_BUILD_TYPE=Release   # CMake auto-adds the Thor arch your nvcc supports\ncmake --build build --target test_dflash -j\n```\n\n**Retune per GPU:**\n- **DDTree `budget=22`** tuned for 3090 + Q4_K_M + 24 GB. On the RTX 5090, budget=40 is optimal (swept). On GB10 (128 GB unified), re-sweep — larger tree = more verify throughput until memory bandwidth saturates. `scripts\u002Fbench_llm.py --budget N` has the sweep hooks.\n- **TQ3_0 KV cache + sliding `target_feat` ring** was shaped by 24 GB (fits up to 256K context on a 3090). On GB10 (128 GB unified) \u002F 5090 (32 GB) you can push context further or skip quantization entirely and keep F16 KV.\n- **Perf numbers** (207 tok\u002Fs demo, 129.5 HumanEval, 2.8× vs SGLang AWQ) are RTX 3090 @ stock. RTX 5090 numbers (205 tok\u002Fs HumanEval, 4.84×) are in [RESULTS.md](dflash\u002FRESULTS.md). Ada\u002FGB10\u002FThor not yet swept, PRs with `RESULTS.md` entries welcome.\n\n[Full writeup →](dflash\u002FREADME.md) · [Benchmarks →](dflash\u002FRESULTS.md) · [Blog post →](https:\u002F\u002Flucebox.com\u002Fblog\u002Fdflash27b)\n\n---\n\n## 03 · PFlash speculative prefill on RTX 3090\n\nSpeculative prefill for long prompts. A Qwen3-0.6B BF16 drafter scores token importance, then the 27B target prefills only the retained spans. Runtime is C++\u002FCUDA through the dflash binaries; no PyTorch is required at serving time.\n\n- **~10.4× TTFT** on 128K context: **24.8 s** dflash daemon vs **~257 s** llama.cpp (FA on, Q4_0 KV).\n- **10.0× TTFT** on 64K context: **13.5 s** dflash vs **134.95 s** llama.cpp.\n- **NIAH single-needle retrieved** at every measured context (32K → 128K), `keep_ratio=0.05`, `DFLASH_FP_ALPHA=0.85`.\n\n```bash\n# 1. build dflash + BSA kernel (sm_80+ required for BSA, ~10 min cold compile)\ngit clone --recurse-submodules https:\u002F\u002Fgithub.com\u002FLuce-Org\u002Flucebox-hub && cd lucebox-hub\u002Fdflash\ncmake -B build -S . 
-DCMAKE_BUILD_TYPE=Release \\\n                    -DCMAKE_CUDA_ARCHITECTURES=86 \\\n                    -DDFLASH27B_ENABLE_BSA=ON\ncmake --build build --target test_dflash test_flashprefill_kernels -j\n\n# 2. fetch weights: 27B Q4_K_M target + 0.6B BF16 drafter (GGUF) + DFlash spec-decode draft\nhf download unsloth\u002FQwen3.6-27B-GGUF Qwen3.6-27B-Q4_K_M.gguf --local-dir models\u002F\nhf download unsloth\u002FQwen3-0.6B-GGUF Qwen3-0.6B-BF16.gguf --local-dir models\u002F\nhf download Lucebox\u002FQwen3.6-27B-DFlash-GGUF dflash-draft-3.6-q8_0.gguf --local-dir models\u002Fdraft\u002F\n\n# 3. run the daemon: compress (drafter scoring) + generate (target spec decode)\nDFLASH_FP_USE_BSA=1 DFLASH_FP_ALPHA=0.85 \\\n.\u002Fbuild\u002Ftest_dflash models\u002FQwen3.6-27B-Q4_K_M.gguf models\u002Fdraft\u002Fdflash-draft-3.6-q8_0.gguf --daemon\n# stdin protocol: `compress \u003Cids.bin> \u003Ckeep_x1000> \u003Cdrafter.gguf>` →\n#                 stream of compressed token ids, then `generate \u003C…>` →\n#                 stream of generated tokens.\n```\n\n| Source S | dflash TTFT | llama.cpp baseline | Speedup | NIAH |\n|----------|:-----------:|:------------------:|:-------:|:----:|\n| **64K**  | **13.5 s** | 134.95 s (FA off, dense) | **10.0×** | ✅ |\n| **128K** | **24.8 s** | ~257 s (FA on, Q4_0 KV)  | **~10.4×** | ✅ |\n\nDaemon stdin commands: `compress` runs the drafter with FlashPrefill block-sparse attention and returns the compressed token-id stream; `generate` runs the target on that stream with normal speculative decode + DDTree. 
`park` \u002F `unpark` \u002F `free drafter` swap weights in and out of VRAM so target + drafter coexist on a 24 GB card.\n\n**Runtime tunables** (full list in [`dflash\u002Fsrc\u002Fflashprefill.h`](dflash\u002Fsrc\u002Fflashprefill.h)):\n```\nDFLASH_FP_USE_BSA=1     # dispatch sparse FA forward through BSA (sm_80+)\nDFLASH_FP_ALPHA=0.85    # block-selection threshold; higher = stricter = fewer K-blocks per Q-row\nDFLASH_FP_PROFILE=1     # log mean \u002F score \u002F select \u002F forward stage timings\n```\n\n**What's ours, what isn't.** Algorithms are from [Cross-Family Speculative Prefill (Liu et al., ICLR 2026)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.02631) for the scoring + selection layer and [FlashPrefill (Fan et al., 2026)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.06199) for the drafter sparse-attention forward. What we built:\n- C++\u002FCUDA daemon-resident speculative prefill in front of a quantized GGUF target — no PyTorch, no Triton, no per-request subprocess.\n- BSA wired without `libtorch` via a 3-header ATen\u002Fc10 stub set under `dflash\u002Fdeps\u002Fbsa_stubs\u002F`.\n- Custom Qwen3-0.6B forward (`qwen3_0p6b_*`) so the drafter runs through the same ggml allocator as the 27B target.\n- 4 CUDA kernels (`flashprefill_kernels.cu`) for the FlashPrefill `mean_K \u002F score \u002F select \u002F sparse_fwd` algorithm.\n\n[Full writeup →](pflash\u002FREADME.md) · [Daemon-side build \u002F tunables →](dflash\u002Fdocs\u002FSPEC_PREFILL.md) · [Blog post →](https:\u002F\u002Flucebox.com\u002Fblog\u002Fpflash)\n\n---\n\n## AMD Strix Halo (HIP backend)\n\n**Same DFlash + PFlash stack on an AMD iGPU.** PR #119 ports the Phase 2 rocWMMA flashprefill kernels to HIP. End-to-end on a single Ryzen AI MAX+ 395 box (Radeon 8060S iGPU, gfx1151, 128 GiB LPDDR5X-8000 unified): **37.0 tok\u002Fs** DFlash decode on Qwen3.5-27B Q4_K_M, **27.6 s** TTFT at 16K context with NIAH retrieval intact. 
That is **3.08×** decode and **2.24×** prefill over llama.cpp HIP AR on the same iGPU. End-to-end wall clock at a realistic 16K prompt + 1K generation workload: **2.66×** faster than vanilla llama.cpp.\n\n```bash\ngit clone --recurse-submodules https:\u002F\u002Fgithub.com\u002FLuce-Org\u002Flucebox-hub && cd lucebox-hub\u002Fdflash\n\n# Build for gfx1151 (Strix Halo). Swap the arch for gfx1100 \u002F gfx1201.\ncmake -B build -S . \\\n  -DCMAKE_BUILD_TYPE=Release \\\n  -DDFLASH27B_GPU_BACKEND=hip \\\n  -DDFLASH27B_HIP_ARCHITECTURES=gfx1151 \\\n  -DDFLASH27B_HIP_SM80_EQUIV=ON\ncmake --build build --target test_dflash -j\n```\n\n`DFLASH27B_HIP_SM80_EQUIV=ON` enables the rocWMMA Phase 2 flashprefill kernels (the path that delivers the prefill speedup). `OFF` falls back to ggml's `flash_attn_ext` (slower but no rocwmma headers needed).\n\n**Per-arch DDTree tuning**: gfx1151 (Strix Halo iGPU, bandwidth-bound on LPDDR5X) peaks at `--ddtree-budget=22`. gfx1100 (7900 XTX, GDDR6) prefers `budget=8` per the [PR #156 cross-arch perf plan](https:\u002F\u002Fgithub.com\u002FLuce-Org\u002Flucebox-hub\u002Fpull\u002F156). Run `scripts\u002Fbench_he.py --ddtree-budget N` to verify on your card.\n\n**Drafter recipe for max decode**: target = Qwen3.5-27B Q4_K_M, drafter = same gen quantized to Q8_0 via `dflash\u002Fscripts\u002Fquantize_draft_q8.py`. The matching Q8_0 GGUF on the unsloth Qwen3.6 target needs `DFLASH27B_DRAFT_SWA=2048` for sliding-window correctness.\n\n[Blog post →](https:\u002F\u002Flucebox.com\u002Fblog\u002Famd) · [PR #119 →](https:\u002F\u002Fgithub.com\u002FLuce-Org\u002Flucebox-hub\u002Fpull\u002F119) · [PR #156 cross-arch perf plan →](https:\u002F\u002Fgithub.com\u002FLuce-Org\u002Flucebox-hub\u002Fpull\u002F156)\n\n---\n\n## Why this exists\n\nLocal AI should be a default, not a privilege: private data, no per-token bill, no vendor lock-in. The hardware to run capable models already sits on desks. 
The software to run those chips well doesn't.\n\nGeneral-purpose frameworks dominated the last decade because hand-tuning kernels per chip was too expensive to justify. One stack, decent on everything, great on nothing. Most of the silicon's capability stays on the floor.\n\nAI-assisted development flips that calculus. Rewrites that took a quarter now fit in a release cycle. Lucebox is where we publish them, one chip and one model family at a time. Apache 2.0 source, full writeup, reproducible benchmarks.\n\n---\n\n## Requirements\n\nAll experiments in this repo are built, tuned, and benchmarked on NVIDIA RTX 3090 (2020), the reference target. Supported GPU families:\n\n- **Ampere** (sm_86, RTX 3090 \u002F A-series): reference, CUDA 12+.\n- **Ada** (sm_89, RTX 40xx): should work, unverified, CUDA 12+.\n- **Blackwell consumer** (sm_120, RTX 50xx incl. 5090): supported, CUDA 12.8+.\n- **DGX Spark \u002F GB10** (sm_121, compute capability 12.1): supported, CUDA 12.9+.\n- **Jetson AGX Thor** (sm_110): supported, CUDA 13+.\n- **Turing** (sm_75, RTX 2080): supported, CUDA 12+.\n\nPyTorch 2.0+. `dflash\u002F` needs CMake 3.18+ and `--recurse-submodules` for the pinned `Luce-Org\u002Fllama.cpp@luce-dflash` fork (three tree-mode ggml ops); multi-arch build is automatic (see [Running on other GPUs](#running-on-other-gpus-4090-5090-dgx-spark--gb10-jetson-agx-thor)).\n\n**Megakernel porting note.** `megakernel\u002Fsetup.py` auto-detects the GPU arch and SM count at build time via `torch.cuda.get_device_capability()`. The decode grid is persistent (one block per SM) and is clamped to the resident-block ceiling at runtime, so no manual tuning is needed. On SM \u003C 80 (Turing), the kernel uses FP16 instead of BF16 via a compile-time `TARGET_SM` flag; on SM ≥ 80 (Ampere+), BF16 is used. Just `pip install -e . 
--no-build-isolation` and the right code path is selected automatically.\n\n**Optional, find your GPU's sweet spot:** `sudo nvidia-smi -pl 220` (megakernel hits best tok\u002FJ at 220 W on 3090; re-sweep for other cards).\n\n---\n\n## Repository layout\n\n```\nlucebox-hub\u002F\n├── megakernel\u002F    · fused forward pass for Qwen 3.5-0.8B\n├── dflash\u002F        · DFlash speculative decoding port for Qwen 3.5\u002F3.6-27B on RTX 3090\n├── pflash\u002F        · speculative-prefill harness in front of dflash (~10.4× TTFT at 128K)\n└── assets\u002F        · banners, cards, diagrams\n```\n\n---\n\n## Roadmap\n\n```\n  Q1 2026    ▮▮▮▮▮▮▮▮▮▮    RTX 3090 kernels & optimizations\n  Q2 2026    ▮▮▮▮▮▯▯▯▯▯    Ryzen AI MAX+ 395 optimizations\n  Q2 2026    ▮▮▯▯▯▯▯▯▯▯    Heterogeneous CPU + GPU latency optimizations\n  Q2 2026    ▮▯▯▯▯▯▯▯▯▯    Lucebox OS for local AI machines\n  Q3 2026    ▯▯▯▯▯▯▯▯▯▯    Lucebox official launch\n```\n\n---\n\n## Citation\n\n```bibtex\n@software{lucebox_2026,\n  title  = {Lucebox: Open LLM Inference, Rewritten by Hand for One Specific Chip at a Time},\n  author = {Lucebox},\n  url    = {https:\u002F\u002Fgithub.com\u002FLuce-Org\u002Flucebox-hub},\n  year   = {2026}\n}\n```\n\nPer-project citations live in each subproject's README.\n\n---\n\n## Inspired by\n\n- [Hazy Research](https:\u002F\u002Fhazyresearch.stanford.edu\u002Fblog\u002F2025-05-27-no-bubbles): megakernel idea and the intelligence-per-watt methodology.\n- [z-lab\u002FDFlash](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.06036) (Wang et al., 2026): block-diffusion speculative decoding algorithm. We use their published Qwen3.5\u002FQwen3.6-27B-DFlash draft weights as-is.\n- [DDTree](https:\u002F\u002Farxiv.org\u002Fabs\u002F2604.12989) (Ringel & Romano, 2026): tree-structured verify that DFlash 27B uses for its 3.43× speedup over autoregressive decoding (+15% over chain speculative decoding). 
[liranringel\u002Fddtree](https:\u002F\u002Fgithub.com\u002Fliranringel\u002Fddtree).\n- [AlpinDale\u002Fqwen_megakernel](https:\u002F\u002Fgithub.com\u002FAlpinDale\u002Fqwen_megakernel), [Infatoshi\u002FMegaQwen](https:\u002F\u002Fgithub.com\u002FInfatoshi\u002FMegaQwen): prior art on fused Qwen kernels.\n\n---\n\n## Community\n\n- **Discord**: [discord.gg\u002FyHfswqZmJQ](https:\u002F\u002Fdiscord.gg\u002FyHfswqZmJQ)\n- **Website**: [lucebox.com](https:\u002F\u002Flucebox.com)\n- **Issues**: [github.com\u002FLuce-Org\u002Flucebox-hub\u002Fissues](https:\u002F\u002Fgithub.com\u002FLuce-Org\u002Flucebox-hub\u002Fissues)\n- **Blog**: [lucebox.com\u002Fblog](https:\u002F\u002Flucebox.com\u002Fblog)\n\n---\n\n\u003Cp align=\"center\">\n  \u003Csub>\u003Ca href=\"LICENSE\">Apache 2.0\u003C\u002Fa> · \u003Ca href=\"https:\u002F\u002Flucebox.com\">Lucebox.com\u003C\u002Fa>\u003C\u002Fsub>\n\u003C\u002Fp>\n","Lucebox Hub is a large language model (LLM) inference engine optimized for specific consumer hardware. Using custom kernels, speculative prefill and decoding, and quantized GGUF paths, the project significantly improves the speed and efficiency of running LLMs locally. It supports multiple GPU platforms, including NVIDIA CUDA 12+ and AMD HIP 7+, and is specifically optimized for several popular models such as Qwen and Laguna. It is particularly suited to individual developers or small teams who need high-performance LLM inference but are constrained by their hardware, and it has broad potential in areas such as game development and content generation.",2,"2026-06-11 03:52:28","high_star"]