[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-1483":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":16,"stars7d":16,"stars30d":17,"stars90d":16,"forks30d":16,"starsTrendScore":16,"compositeScore":18,"rankGlobal":10,"rankLanguage":10,"license":10,"archived":19,"fork":19,"defaultBranch":20,"hasWiki":21,"hasPages":19,"topics":22,"createdAt":10,"pushedAt":10,"updatedAt":23,"readmeContent":24,"aiSummary":25,"trendingCount":16,"starSnapshotCount":16,"syncStatus":26,"lastSyncTime":27,"discoverSource":28},1483,"structured-cot","andthattoo\u002Fstructured-cot","andthattoo","Structured Chain-of-Thought","",null,"Python",218,15,5,3,0,7,3.61,false,"main",true,[],"2026-06-12 02:00:28","# structured-cot\n\n**Grammar-constrained chain-of-thought for reasoning LMs. Zero training. 22× compression on HumanEval+; +14 pp on a recent LiveCodeBench public-test slice.**\n\n## The finding\n\nOn HumanEval+ (164 problems) with `unsloth\u002FQwen3.6-35B-A3B-GGUF` at Q4_K_M, running on one H100 via `llama-cpp-python`:\n\n| Mode | pass@1 | mean thinking tokens | mean total tokens |\n|---|---|---|---|\n| Free-form `\u003Cthink>...\u003C\u002Fthink>` | **151 \u002F 164 = 92.1%** | 3087 | 3410 |\n| Grammar-constrained `\u003Cthink>GOAL\u002FAPPROACH\u002FEDGE\u003C\u002Fthink>` | **152 \u002F 164 = 92.7%** | **138** | **408** |\n| Prompt-only terse `GOAL\u002FAPPROACH\u002FEDGE` | **153 \u002F 164 = 93.3%** | 2298 | 2764 |\n| **FSM vs FREE Δ** | **+0.6 pp** | **22.4× shorter** | **8.4× shorter** |\n\nNo distillation. No fine-tuning. No reward-model. A ~20-line GBNF grammar applied to the `\u003Cthink>` block at inference time matches full-thinking accuracy with an order-of-magnitude fewer tokens. 
The prompt-only terse control slightly improves pass@1 on this run, but still uses 2298 thinking tokens on average; the grammar is what reliably enforces the compact token regime.\n\nOn a harder recent LiveCodeBench v6 LeetCode slice (50 problems, `contest_date >= 2025-01-01`, public functional tests), the richer [`grammars\u002Ffsm_grammar_lcb_plan.gbnf`](grammars\u002Ffsm_grammar_lcb_plan.gbnf) grammar improves both reliability and token use:\n\n| Mode | pass@1 | mean thinking tokens | mean total tokens | extraction issues |\n|---|---:|---:|---:|---|\n| Free-form `\u003Cthink>...\u003C\u002Fthink>` | 25 \u002F 50 = 50.0% | 11553 | 13632 | empty_code=18 |\n| FSM_PLAN `GOAL\u002FSTATE\u002FALGO\u002FEDGE\u002FVERIFY` | **32 \u002F 50 = 64.0%** | **267** | **2743** | **none** |\n| **FSM_PLAN vs FREE Δ** | **+14.0 pp** | **43.3× shorter** | **5.0× shorter** | - |\n\nThe LiveCodeBench result is a public-test result, not an official private leaderboard score. It also reveals a useful caveat: on harder tasks, the model can move some deliberation into comments or post-think answer text, so total tokens and comment bloat need to be tracked alongside `\u003Cthink>` tokens.\n\nSee [RESULTS.md](RESULTS.md) for the full experimental writeup, per-problem breakdown, and discussion of limitations.\n\n## Why this matters\n\nReasoning models like Qwen3, DeepSeek-R1, QwQ spend thousands of tokens in verbose prose thinking — exploring alternatives, restating, hedging. This work shows, on one benchmark, that for a large chunk of that reasoning, the verbose scaffolding isn't doing real work. The model already has the reasoning capability internally; grammar constraint just extracts it in a denser form.\n\nThe engineering implication is direct: **inference-time thinking compute can be cut dramatically via a grammar file alone**, with no training pipeline or serving changes beyond a GBNF argument. 
HumanEval+ shows the clean compression case; LiveCodeBench shows the harder-task behavior, where the grammar also prevents many no-code failures but can displace some reasoning into the answer channel.\n\n## How it works\n\nSingle GBNF grammar (see [`grammars\u002Ffsm_grammar.gbnf`](grammars\u002Ffsm_grammar.gbnf)):\n\n```gbnf\nroot  ::= think code\nthink ::= \"\u003Cthink>\\n\" \"GOAL: \" line \"APPROACH: \" line \"EDGE: \" line \"\u003C\u002Fthink>\\n\\n\"\nline  ::= [^\\n]+ \"\\n\"\ncode  ::= [\\x09\\x0A\\x0D\\x20-\\x7E]+\n```\n\nThree lines of structured plan. Code after is unconstrained.\n\nThe grammar is applied via llama-cpp-python's server which accepts `grammar` in the request body. At each generation step inside `\u003Cthink>`, logits for tokens that violate the grammar are masked to -∞; the model samples from the constrained distribution.\n\n## Setup\n\nTested on an H100 with CUDA 12.4.\n\n```bash\nuv sync\n# llama-cpp-python is pulled from the prebuilt cu124 index via tool.uv.sources.\n# CUDA runtime libs pinned so it coexists with torch's cu13 default.\n\n# Before running, expose the bundled CUDA libs to the dynamic loader:\nexport LD_LIBRARY_PATH=$(uv run python -c \"\nimport site, os\nsp = site.getsitepackages()[0]\nnv = os.path.join(sp, 'nvidia')\nprint(':'.join(os.path.join(nv, d, 'lib') for d in os.listdir(nv)\n               if os.path.isdir(os.path.join(nv, d, 'lib'))))\n\")\n```\n\n### Native llama-server (recommended for speed)\n\nFor long generations, native `llama-server` is faster than `llama-cpp-python`, and can use llama.cpp speculative decoding:\n\n```bash\nsudo apt-get update\nsudo apt-get install -y git cmake build-essential libcurl4-openssl-dev\n\n.\u002Fscripts\u002Fbuild_llama_cpp.sh native\n.local\u002Fbin\u002Fllama-server --help | head\n```\n\nThe build script clones llama.cpp into `.deps\u002F`, builds `llama-server`, and\ninstalls the binary under `.local\u002Fbin\u002F`. 
Set\n`LLAMA_CPP_REF=\u003Cbranch|tag|sha>` to build a specific upstream revision.\n\nThe repo includes [`run_llama_server.sh`](run_llama_server.sh), or the unified\n[`run_server.sh`](run_server.sh) mode:\n\n```bash\nSERVER_MODE=llama.cpp BACKGROUND=1 .\u002Frun_server.sh\n```\n\nThis defaults to:\n\n```bash\nllama-server -hf ggml-org\u002FQwen3.6-27B-GGUF --spec-default \\\n    --host 127.0.0.1 --port 8000 -c 32768 -ngl 999 \\\n    --flash-attn on --reasoning-format none\n```\n\nBy default, [`run_server.sh`](run_server.sh) still starts `llama-cpp-python`\nand is kept for reproducing the original run.\n\n### Experimental tools + grammar llama.cpp build\n\nUpstream `llama-server` currently rejects requests that combine OpenAI-style\n`tools` with a custom `grammar`, because tool calling uses its own generated\ngrammar. This repo vendors the experimental discussion patch at\n[`third_party\u002Fpatches\u002Fllama.cpp\u002Fpre-trigger-grammar-78433f6.patch`](third_party\u002Fpatches\u002Fllama.cpp\u002Fpre-trigger-grammar-78433f6.patch).\nIt adds a pre-trigger grammar slot that constrains the reasoning phase and then\nyields to llama.cpp's tool grammar.\n\nBuild and run the patched server:\n\n```bash\n.\u002Fscripts\u002Fbuild_llama_cpp.sh pre-trigger-grammar\nSERVER_MODE=pre-trigger-grammar BACKGROUND=1 .\u002Frun_server.sh\n```\n\nPatched mode defaults to `--reasoning-format deepseek`, so reasoning arrives in\nthe OpenAI-compatible `reasoning_content` field rather than visible\n`\u003Cthink>...\u003C\u002Fthink>` text. 
The eval harness preserves that field for token\naccounting, but treat this mode as experimental: overly tight reasoning\ngrammars can improve structure while hurting answer quality.\n\n### Model download\n\n```bash\nexport HF_TOKEN=hf_...\nhuggingface-cli download unsloth\u002FQwen3.6-35B-A3B-GGUF \\\n    --include \"*Q4_K_M*\" --local-dir ~\u002Fmodels\u002Fqwen3.6-gguf\n```\n\n## Run\n\nThe server can run in the foreground in one pane, or in the background from a single terminal.\n\n### Start the server\n\n```bash\nBACKGROUND=1 SERVER_MODE=llama.cpp .\u002Frun_server.sh\ntail -f server.log\ncurl http:\u002F\u002F127.0.0.1:8000\u002Fv1\u002Fmodels\n```\n\nBy default this downloads\u002Fserves `ggml-org\u002FQwen3.6-27B-GGUF` through native `llama-server` with `--spec-default` and `--reasoning-format none`, so `\u003Cthink>...\u003C\u002Fthink>` remains visible for token accounting. Override with env vars:\n\n```bash\nHF_REPO=ggml-org\u002FQwen3.6-27B-GGUF N_CTX=32768 BACKGROUND=1 SERVER_MODE=llama.cpp .\u002Frun_server.sh\nMODEL_PATH=\u002Fpath\u002Fto\u002Fmodel.gguf BACKGROUND=1 SERVER_MODE=llama.cpp .\u002Frun_server.sh\nKV_TYPE=q8_0 BACKGROUND=1 SERVER_MODE=llama.cpp .\u002Frun_server.sh\nREASONING_FORMAT=deepseek-legacy BACKGROUND=1 SERVER_MODE=llama.cpp .\u002Frun_server.sh\nBACKGROUND=1 SERVER_MODE=pre-trigger-grammar .\u002Frun_server.sh\n```\n\nBackground mode writes `server.log` and `server.pid`. Stop it with:\n\n```bash\nkill \"$(cat server.pid)\"\n```\n\nIf you prefer the original Python server, use `.\u002Frun_server.sh` without\n`SERVER_MODE`. 
It auto-discovers the GGUF from `~\u002F.cache\u002Fhuggingface\u002Fhub\u002F` or\n`~\u002Fmodels\u002F`, uses an 8-bit KV cache (`q8_0`), flash attention, and `n_ctx=65536`.\n\n### Run the comparison\n\n```bash\n# Smoke test\nuv run python fsm_vs_free_eval.py --n-problems 10 --max-tokens 4096\n\n# Full HumanEval+ with FREE + FSM + PROMPT_TERSE controls\nuv run python fsm_vs_free_eval.py --n-problems 164 --max-tokens 8192 --only all\n\n# MBPP+\nuv run python fsm_vs_free_eval.py --n-problems 100 --dataset mbpp --max-tokens 8192\n\n# LiveCodeBench v6 recent subset (public functional tests)\nuv run python fsm_vs_free_eval.py --dataset livecodebench \\\n    --lcb-version release_v6 --date-cutoff 2025-01-01 --platform leetcode \\\n    --n-problems 50 --max-tokens 16384 --only all \\\n    --out-dir lcb_v6_2025_01_01_n50_all\n\n# LiveCodeBench FSM-only baseline grammar\nuv run python fsm_vs_free_eval.py --dataset livecodebench \\\n    --lcb-version release_v6 --date-cutoff 2025-01-01 --platform leetcode \\\n    --n-problems 50 --max-tokens 16384 --only fsm \\\n    --grammar-file grammars\u002Ffsm_grammar.gbnf \\\n    --out-dir lcb_v6_2025_01_01_fsm_base_grammar\n\n# LiveCodeBench FSM-only plan grammar\nuv run python fsm_vs_free_eval.py --dataset livecodebench \\\n    --lcb-version release_v6 --date-cutoff 2025-01-01 --platform leetcode \\\n    --n-problems 50 --max-tokens 16384 --only fsm \\\n    --grammar-file grammars\u002Ffsm_grammar_lcb_plan.gbnf \\\n    --out-dir lcb_v6_2025_01_01_fsm_lcb_plan\n```\n\nEach run writes the following files to `fsm_vs_free\u002F`:\n- `results.jsonl`: per-problem raw generations, extracted think\u002Fcode, pass\u002Ffail, errors, extraction metadata\n- `summary.json`: aggregate stats, pass-set overlap, and failure accounting\n- `per_problem.md`: human-readable report with outcome tags (🔺 \u002F 🔻 \u002F 🟰 \u002F ❌)\n\nTo make a side-by-side generation animation for one problem:\n\n```bash\nuv run python make_tps_animation.py \\\n    --task-id 3781 \\\n    
--left-results lcb_v6_2025_01_01_free_n50\u002Fresults.jsonl \\\n    --left-mode free --left-label FREE --left-seconds 237 \\\n    --right-results lcb_v6_2025_01_01_fsm_lcb_plan_n50\u002Fresults.jsonl \\\n    --right-mode fsm --right-label FSM_PLAN --right-seconds 279 \\\n    --out animations\u002Flcb_3781_free_vs_fsm_plan.html\n```\n\nThe summary also reports `post_think_tokens_mean`, `answer_channel_bloat`,\n`code_comment_tokens_mean`, and `comment_bloat`. Those are useful on\nLiveCodeBench because a model can obey a short `\u003Cthink>` grammar while moving\nsome of the missing scratchpad into comments or other post-think answer text.\n\n## Architecture notes\n\nQwen3.6-35B-A3B is a **MoE hybrid**:\n- 40 layers total, pattern `10 × (3 × GatedDeltaNet + 1 × GatedAttention)`\n- 256 experts, 9 active per token (3B active params)\n- Native 262K context, extensible to 1M via YaRN\n\nFor the FSM experiment, the base-LM architecture doesn't matter much — we're constraining what tokens are emitted, not how they're processed internally. The same GBNF approach should work with any model served through llama.cpp, vLLM, SGLang, TGI, etc., as long as the server accepts a grammar parameter.\n\n## Limitations and open questions\n\n1. **Contamination.** HumanEval has been in training corpora for years. All modes may be recalling solutions, not reasoning. The \"FSM matches FREE\" result could mean \"grammar extracts the same memorized solution in fewer tokens\" rather than \"grammar preserves reasoning capability.\" LiveCodeBench release v6 currently reaches April 2025, so it is a useful recent\u002Flower-contamination check but not a strict post-Qwen3.6-release cutoff.\n\n2. **Grammar specificity.** The GOAL\u002FAPPROACH\u002FEDGE format was tuned for short coding tasks. 
The active LiveCodeBench grammar, [`grammars\u002Ffsm_grammar_lcb_plan.gbnf`](grammars\u002Ffsm_grammar_lcb_plan.gbnf), adds `GOAL\u002FSTATE\u002FALGO\u002FEDGE\u002FVERIFY` while leaving the answer permissive. Math \u002F logic \u002F planning domains may need different symbolic formats. Unclear whether a single \"universal\" compressed-thinking grammar exists or whether each domain needs its own.\n\n3. **Reasoning depth.** The current result is on problems solvable in one forward pass. For multi-step problems (SWE-Bench, long-horizon planning, agentic tasks), whether grammar compression preserves capability is untested.\n\n4. **Model dependency.** Run with one model (Qwen3.6-35B-A3B). We don't know yet whether smaller models (1B-7B) can be grammar-constrained the same way or whether the capability requires scale.\n\n## Status\n\n- ✅ HumanEval+ full (164 problems): FSM 152\u002F164 vs FREE 151\u002F164, 22× think-token compression\n- ✅ LiveCodeBench v6 recent LeetCode subset, public tests (50 problems): FREE 25\u002F50 vs `grammars\u002Ffsm_grammar_lcb_plan.gbnf` 32\u002F50; FSM_PLAN improves pass@1 while cutting mean total tokens from 13632 to 2743\n- ⏳ MBPP+ (planned)\n- 🔲 Other domains (math, logic, planning)\n- 🔲 Cross-model transfer (smaller models)\n\n## References\n\nThe direct technical antecedent is [Coconut: Training Large Language Models to Reason in a Continuous Latent Space (Hao et al., 2024)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.06769), which compresses CoT into continuous latents via fine-tuning. 
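This work tests whether a much cheaper intervention (grammar constraint at inference, no training) captures a meaningful fraction of the same benefit.\n","structured-cot is a project for strengthening language-model reasoning, improving efficiency by applying grammar constraints during the thinking process. Its core feature is a GBNF grammar file that imposes structure on the thinking block, substantially reducing the number of thinking tokens required without any training or fine-tuning. For example, it achieves 22× compression on the HumanEval+ dataset and raises accuracy by 14 percentage points on a LiveCodeBench subset. The project is particularly suited to scenarios that need efficient use of compute for complex problem solving, such as code generation and debugging. Simply introducing a GBNF file of about 20 lines substantially lowers inference-time compute cost while maintaining or even improving model performance.",2,"2026-06-11 02:44:05","CREATED_QUERY"]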