[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-83892":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":8,"htmlUrl":8,"language":9,"languages":8,"totalLinesOfCode":8,"stars":10,"forks":11,"watchers":12,"openIssues":13,"contributorsCount":13,"subscribersCount":13,"size":13,"stars1d":13,"stars7d":14,"stars30d":14,"stars90d":13,"forks30d":13,"starsTrendScore":15,"compositeScore":16,"rankGlobal":8,"rankLanguage":8,"license":17,"archived":18,"fork":18,"defaultBranch":19,"hasWiki":20,"hasPages":18,"topics":21,"createdAt":8,"pushedAt":8,"updatedAt":22,"readmeContent":23,"aiSummary":8,"trendingCount":13,"starSnapshotCount":13,"syncStatus":15,"lastSyncTime":24,"discoverSource":25},83892,"InstinctRazor","General-Instinct\u002FInstinctRazor","General-Instinct",null,"Python",55,7,1,0,4,2,2.71,"Apache License 2.0",false,"main",true,[],"2026-06-12 02:04:36","# InstinctRazor\n\n**Sub-4-bit quantization of a frontier MoE — and a reproducible way to prove it didn't lose anything that matters.**\n\n> 📦 **Ready-to-run weights:** the deployable **48 GB IQ3_XXS GGUF** (runs on one 80 GB GPU, or a small card + CPU expert-offload) is on the Hugging Face Hub — **[General-Instinct\u002FInstinctRazor-Qwen3.5-122B-A10B-GGUF](https:\u002F\u002Fhuggingface.co\u002FGeneral-Instinct\u002FInstinctRazor-Qwen3.5-122B-A10B-GGUF)**. The framework in this repo reproduces it from the original BF16. *(The Python package\u002Fmodule namespace is `moe-lowbit`.)*\n\nWe take **Qwen3.5-122B-A10B** (a 122B hybrid Gated-DeltaNet MoE, ~245 GB BF16) down to a **~47 GB**\n3-bit-expert \u002F 4-bit-backbone checkpoint (`q122_ptq3b_clip`), and show that on a shared, validation-gated\nharness it **beats the footprint-matched Gemma-4-26B-A4B** (~52 GB) on knowledge, reasoning, multilingual,\nand multimodal-MMMU — tracking the uncompressed BF16 teacher within ~1 point, **with no training**. 
Where\nit doesn't win (LiveCodeBench v6, MATH-Vision — both *truncation*-driven, not capability loss), there's an\noptional **Lightning-OPD** on-policy-distillation path.\n\n> The headline isn't just \"we compressed it.\" It's: a 122B MoE *inverts* the dense low-bit regime — at ~3\n> effective bits, PTQ is near-lossless because only 8\u002F256 experts fire per token and the ~6% always-on path\n> is protected. A 122B model that needed 4×80 GB now fits on **one 80 GB GPU** and out-scores a same-size dense-ish MoE.\n\n## Headline result\n\n~47 GB clip vs the ~52 GB A4B baseline, same harness (vLLM 0.22, TP=4, thinking mode, seed 0). Full\nprovenance + caveats in [`results\u002FRESULTS.md`](results\u002FRESULTS.md).\n\n| Benchmark | teacher BF16 | **clip (~47 GB)** | A4B (~52 GB) | verdict |\n|-----------|-------------|-------------------|--------------|---------|\n| MMLU-Pro | 87.6 | **88.5** | 85.6 | ✅ clip ≥ A4B |\n| GPQA-Diamond | 83.8 | **84.8** | 79.3 | ✅ clip ≥ A4B |\n| MMMLU | 88.8 | **87.2** | 85.4 | ✅ clip ≥ A4B |\n| MMMU-Pro | — | **80.8** | 73.8† | ✅ clip ≥ A4B |\n| LiveCodeBench v6 | 65.5 | 57.0 (75.7 finished) | 66.0 | ⚠️ recoverable gap (clip truncates 30%) |\n| MATH-Vision | — | 70.0 → 77.5 hi-budget | 82.4† | ⚠️ recoverable gap (truncation) |\n| HLE (no-tools) | 18.0 | 13.3 | 12.3 | ✅ clip ≥ A4B (below teacher) |\n| τ²-Bench | — | — | — | ⏳ no in-tree harness |\n\n†A4B official figure (a same-harness A4B multimodal run was not done). All comparisons are subject to the\n**validation gate** — see caveats below and [`docs\u002FEVAL_PROTOCOL.md`](docs\u002FEVAL_PROTOCOL.md).\n\n## Quickstart\n\n### A. 
Verify the whole framework on ONE GPU (~15 min, no 122B, no 4×H100)\n\nRuns the *exact same* load → PTQ → dequant-save → vLLM-eval code path on a small dense model:\n\n```bash\npython3.12 -m venv vllm_venv && source vllm_venv\u002Fbin\u002Factivate && pip install -r requirements.txt\nMOE_LOWBIT_VENV=$PWD\u002Fvllm_venv source env.sh\nbash pipelines\u002Fsmoke.sh        # quantizes Qwen3.5-2B to 4-bit, evals it; prints \"SMOKE OK\"\n```\n\n### B. Reproduce the headline (4×H100-80GB)\n\nThree commands: set up env → build the 47 GB clip from BF16 → eval clip and the A4B baseline.\n\n```bash\n# 0. env (once): create the eval venv, then source env.sh (sets PYTHONPATH, HF token, caches)\npython3.12 -m venv vllm_venv && source vllm_venv\u002Fbin\u002Factivate && pip install -r requirements.txt\nHF_TOKEN=hf_xxx MOE_LOWBIT_VENV=$PWD\u002Fvllm_venv source env.sh\n\n# 1. BF16 -> ~47 GB clip  (recipe in configs\u002Fclip_122b.env)\nbash pipelines\u002Fquantize.sh\n\n# 2. eval clip vs the footprint-matched A4B on the gated axes\nbash pipelines\u002Feval.sh --model models\u002Fq122_ptq3b_clip --tag clip --benchmarks mmlu_pro,gpqa,mmmlu\nbash pipelines\u002Feval.sh --model google\u002Fgemma-4-26b-a4b-it --tag a4b  --benchmarks mmlu_pro,gpqa,mmmlu\n```\n\nThen read `results\u002Fvllm_eval\u002F{clip,a4b}.json` (or regenerate `results\u002FRESULTS.md`). For the long-context\naxes add `--budget 64k` (math\u002Fcode\u002FHLE). 
**Optional on-policy distillation** for the LCB code gap:\n\n```bash\nbash pipelines\u002Fdistill.sh --smoke   # validate the 4-GPU FSDP path first\nbash pipelines\u002Fdistill.sh           # gen -> score -> FSDP train -> merge+requant -> ~47 GB\n```\n\n## Repo layout\n\n```\nInstinctRazor\u002F           # (Python module namespace: moe-lowbit)\n  env.sh                 # source first: PYTHONPATH(quant:eval:distill) + venv + HF\u002Fcaches (overridable)\n  requirements.txt       # EVAL\u002Fquantize venv (torch 2.11, vllm 0.22, transformers 5.9)\n  requirements-train.txt # OPD FSDP train venv (torch 2.7 + flash-linear-attention) — separate by necessity\n  src\u002F\n    quant\u002F   moe_quant.py (fake-quant\u002FSTE\u002Fclip-search) · quant_save.py (the builder) ·\n             moe_probe.py + moe_alloc.py + moe_study.py (the salience research path) · moe_ptq.py\n    eval\u002F    vllm_eval8.py (text harness) · mm_eval.py (multimodal) · vllm_eval.py · bench8_loaders.py ·\n             moe_eval.py · bench_mm.py\n    distill\u002F opd_gen.py · opd_score.py · opd_train_fsdp.py · merge_adapter.py · moe_lora.py ·\n             cache_teacher.py · fsdp_setup.py (reference; live logic is inline in opd_train_fsdp.py)\n  pipelines\u002F smoke.sh · quantize.sh · eval.sh · distill.sh\n  configs\u002F   clip_122b.env · smoke_2b.env · eval.env · opd_r1.env (+ README explaining each)\n  results\u002F   the result JSONs + RESULTS.md (each number -> its source JSON + command)\n  docs\u002F      EVAL_PROTOCOL.md + MOE_FINDINGS\u002FMOE_PIPELINE\u002FLADDER_RESULTS\u002FOPD_INTEGRATION\u002FPOSITIONING\u002FSURVEY_MOE\n  archive\u002F   quarantined dead-ends (device_map OPD, EP\u002FFSDP smokes, diagnostics) — kept, marked deprecated\n```\n\n**Note on imports:** the modules cross-import by bare name (`import moe_quant`, `import vllm_eval`, …),\nso `env.sh` puts all three `src\u002F` subdirs on `PYTHONPATH`. 
That is the *only* packaging change — no source\nlogic was rewritten.\n\n## The recipe (what actually ships)\n\n```\nexpert_bits = 3.0   (int3, all 256 routed experts, both gate_up + down blocks)\nlinear_bits = 4.0   (int4, all non-expert nn.Linear + embed\u002Flm_head)\ngroup = 128 · per-block symmetric · MSE clip-search 16 steps\nprotected @ bf16:  vision tower · router\u002Fgate · all norms\n=>  ~3.05 effective bits  ~=  ~47 GB packed  (under A4B's ~52 GB)\n```\n\nThis is **uniform** quantization, deliberately. We built per-expert salience allocation\n(`moe_probe`+`moe_alloc`+`moe_study`) and the study (finding **F4\u002FF6**, `docs\u002FMOE_FINDINGS.md`) showed that\non a load-balanced router (0\u002F256 cold experts) allocation is a real lever for *distribution fidelity* but\n**second-order for downstream capability** — uniform 3b already lands near-lossless. So the probe\u002Falloc path\nis the *research that justified uniform*, run optionally via `PROBE=1 bash pipelines\u002Fquantize.sh`.\n\n## Deployment: GGUF\u002Fllama.cpp (the packed, single-GPU-runnable form) — VALIDATED\n\nThe fake-quant checkpoint above is the **capability ceiling** (what `eval.sh` measures), stored dequantized.\nThe **shipped deployable artifact** is a real low-bit GGUF, quantized **from the original BF16** (not the\ndequant clip — that would double-quantize) with an **imatrix**, via llama.cpp (`qwen3_5_moe` support merged\nupstream, PR #19468):\n\n**Don't want to rebuild? 
Download it.** The built artifact (+ the `mmproj` vision projector) is on the Hub:\n[**General-Instinct\u002FInstinctRazor-Qwen3.5-122B-A10B-GGUF**](https:\u002F\u002Fhuggingface.co\u002FGeneral-Instinct\u002FInstinctRazor-Qwen3.5-122B-A10B-GGUF).\nTo rebuild it from the original BF16 instead:\n\n```bash\nBASE_TYPE=IQ3_XXS bash pipelines\u002Fpack_gguf.sh   # convert orig BF16 → imatrix → IQ3_XXS protected\n```\n\n**Validated artifact** `q122_orig-IQ3XXS-protected.gguf` — **48.04 GiB** (IQ3_XXS 3.06 bpw experts; protected\npath shared int8 \u002F attn int4 \u002F router+SSM f16 \u002F embed+lm_head q8_0; recipe `configs\u002Fgguf_tensor_types_iq3.txt`):\n\n| metric | value | vs A4B | vs fake-quant clip |\n|--------|-------|--------|--------------------|\n| MMLU-Pro | **90.7** (n=150, 0 trunc) | ≥ 85.6 ✅ | tracks (88.5–90) |\n| GPQA-D | **80.8** (n=198, 0 trunc) | ≥ 79.3 ✅ | ~4 pt under (84.8) |\n| decode tok\u002Fs, 1×80 GB H100 | **115.9** (prefill 2541) | — | — |\n| decode tok\u002Fs, expert-offload (peak ~7.6 GiB VRAM, fits 8 GB to 8k ctx) | **45.7 ± 0.5** (prefill ~154) | — | runs on an 8 GB card + ~48 GiB RAM |\n\nThe deployable GGUF **preserves the win** (≥ A4B on both axes) and **runs on one 80 GB GPU at ~116 tok\u002Fs** (or\na small card + CPU at ~46 tok\u002Fs with all routed experts offloaded). GPQA is ~4 pt below the fake-quant ceiling\n— an honest, small i-quant loss, not a collapse. Accuracy is measured by `src\u002Feval\u002Fllamacpp_eval8.py`\n(llama-server + our exact prompt-build\u002Fgrading — vLLM can't load this arch's GGUF). Pipeline notes baked in:\n`--no-mtp` (no `mtp.*` tensors despite config); `#` comments stripped before `--tensor-type-file`; vision tower\n→ separate `*-mmproj-f16.gguf`.\n\n*Superseded fallback:* an earlier `q122_clip-Q3K-protected.gguf` (57 GB) was built from the **dequant clip**\n(double-quantized, Q3_K 3.4 bpw, **no imatrix**). 
The IQ3_XXS-from-original build above replaces it as the\nshipped artifact.\n\n## Hardware\n\n- **4× NVIDIA H100-80 GB** (NVLink), driver 580.105.08, CUDA 13.0, ~885 GB CPU RAM. Quantize + eval use\n  `device_map=\"auto\"` \u002F vLLM TP=4; OPD training uses FSDP2 across all 4 GPUs via `torchrun`.\n- The **packed ~47 GB** recipe is designed to run on a **single 80 GB GPU**; the BF16 teacher needs 4.\n- The one-GPU smoke (path A) runs on any ≥16 GB GPU.\n\n## Methodology & roadmap\n\n- **Eval numbers are weight-only fake-quant — a capability *ceiling*, measured at full fidelity.**\n  `quant_save.py` bakes the 3b\u002F4b quantization, then saves a *dequantized BF16* checkpoint so vLLM measures\n  capability exactly (no low-bit eval kernel). The **deployable artifact is the GGUF\u002Fllama.cpp pack**\n  (`docs\u002FMOE_PIPELINE.md`) — the ~47 GB on-disk, single-GPU-runnable form, and the one\n  [published on the Hub](https:\u002F\u002Fhuggingface.co\u002FGeneral-Instinct\u002FInstinctRazor-Qwen3.5-122B-A10B-GGUF).\n- **Same-harness deltas are the comparison.** Some axes (e.g. LiveCodeBench v6 extraction) read below their\n  official absolutes for *all* models; the gated quantity is the same-harness *delta* vs A4B, not the\n  absolute. See `docs\u002FEVAL_PROTOCOL.md` for the validation gate.\n- **Remaining gaps are an OPD target, not a wall.** Where the clip still trails A4B — code (LCB v6) and\n  math \u002F multimodal-math — the loss is largely token-inefficiency introduced by quantization. We close it\n  with **OPD (on-policy distillation)**, a separate framework we'll open-source later; `pipelines\u002Fdistill.sh`\n  is the in-tree reference path (gen → score → FSDP train → merge + requant), trained on the gap's own\n  domain (e.g. code CoT for LCB).\n\n## Where the numbers come from\n\nEvery value in `results\u002FRESULTS.md` maps to a JSON in `results\u002F` and the `eval.sh` command that\nproduced it. 
Start with [`docs\u002FEVAL_PROTOCOL.md`](docs\u002FEVAL_PROTOCOL.md) (the validation gate), then\n[`docs\u002FMOE_FINDINGS.md`](docs\u002FMOE_FINDINGS.md) (why the recipe is what it is).\n","2026-06-11 04:11:46","CREATED_QUERY"]