[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-11255":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":16,"stars7d":17,"stars30d":18,"stars90d":15,"forks30d":15,"starsTrendScore":19,"compositeScore":20,"rankGlobal":9,"rankLanguage":9,"license":21,"archived":22,"fork":22,"defaultBranch":23,"hasWiki":24,"hasPages":22,"topics":25,"createdAt":9,"pushedAt":9,"updatedAt":26,"readmeContent":27,"aiSummary":28,"trendingCount":15,"starSnapshotCount":15,"syncStatus":29,"lastSyncTime":30,"discoverSource":31},11255,"ds4","antirez\u002Fds4","antirez","DeepSeek 4 Flash and PRO local inference engine for Metal, CUDA and ROCm",null,"C",13458,1182,104,57,0,73,532,5947,375,119.22,"MIT License",false,"main",true,[],"2026-06-12 04:00:54","# DwarfStar 4\n\nDwarfStar 4 is a small native inference engine specific for **DeepSeek V4 Flash**. It is\nintentionally narrow: not a generic GGUF runner, not a wrapper around another\nruntime: it is completely self-contained. Other than running the model in a\ncorrect and fast way, the project goal is to provide DS4 specific loading,\nprompt rendering, tool calling, KV state handling (RAM and on-disk), and server\nAPI, all ready to work with coding agents or with the provided CLI interface.\nThere are also tools for GGUF and imatrix generation, and for quality and\nspeed testing.\n\nWe support the following backends:\n* **Metal** is our primary target. Starting from MacBooks with 96GB of RAM.\n* **NVIDIA CUDA** with special care for the DGX Spark.\n* **AMD ROCm** is only supported in the [rocm](https:\u002F\u002Fgithub.com\u002Fantirez\u002Fds4\u002Ftree\u002Frocm) branch. 
It is kept separate from main since I (antirez) don't have direct hardware access, so the community rebases the branch as needed.\n\nThis project would not exist without **llama.cpp and GGML**. Make sure to read\nthe acknowledgements section: a big thank you to Georgi Gerganov and all the\nother contributors.\n\n## Motivations\n\nNow, back to this project. Why do we believe DeepSeek v4 Flash to be a pretty special\nmodel deserving a standalone engine? Because after comparing it with powerful smaller\ndense models, we can report that:\n\n1. DeepSeek v4 Flash is faster because of fewer active parameters.\n2. In thinking mode, if you avoid *max thinking*, it produces a thinking section that is a lot shorter than that of other models, even 1\u002F5 the length in many cases, and crucially, the thinking section length is **proportional to the problem complexity**. This makes DeepSeek v4 Flash usable with thinking enabled when other models are practically impossible to use in the same conditions.\n3. The model features a context window of **1 million tokens**.\n4. Being so large, it knows more things if you go sampling at the edge of knowledge. For instance, asking about Italian shows or political questions soon reveals that 284B parameters are a lot more than 27B or 35B parameters.\n5. It writes much better English and Italian. It *feels* like a quasi-frontier model.\n6. The KV cache is incredibly compressed, allowing long context inference on local computers and **on disk KV cache persistence**.\n7. It works well with 2-bit quantization, if quantized in a special way (read later). This allows running it on MacBooks with 128GB of RAM (and many people reported it working with 96GB as well, even with a 250k context window!).\n8. 
We expect DeepSeek to release **updated versions of v4 Flash** in the future, even better than the current one.\n\nThat said, a few important things about this project:\n\n* The local inference landscape contains many excellent projects, but new models are released continuously, and the attention immediately gets captured by the next model to implement. This project takes a deliberately narrow bet: one model at a time, official-vector validation (logits obtained with the official implementation), long-context tests, and enough agent integration to know if it really works. The exact model may change as the landscape evolves, but the constraint remains: local inference credible on high end personal machines or Mac Studios, starting from 96\u002F128GB of memory.\n* This software is developed with **strong assistance from GPT 5.5** and with humans leading the ideas, testing, and debugging. We say this openly because it shaped how the project was built. If you are not happy with AI-developed code, this software is not for you. The acknowledgement below is equally important: this would not exist without `llama.cpp` and GGML, largely written by hand.\n* This implementation is based on the idea that compressed KV caches like the one of DeepSeek v4 and the fast SSD disks of modern MacBooks should change our idea that KV cache belongs to RAM. **The KV cache is actually a first-class disk citizen**.\n* Our vision is that local inference should be a set of three things working well together, out of the box: A) inference engine with HTTP API + B) GGUF specially crafted to run well under a given engine and given assumptions + C) testing and validation with coding agents implementations. This inference engine only runs with the GGUF files provided. It gets tested against officially obtained logits at different context sizes. This project exists because we wanted to make one local model feel finished end to end, not just runnable. 
However, this is just alpha-quality code, so we are probably not there yet.\n* The optimized graph path targets **Metal on macOS** and **CUDA on Linux**. The CPU path is only for correctness checks and model\u002Ftokenizer diagnostics. For CPU-only Linux builds, use `make cpu`; it builds the normal `.\u002Fds4` and `.\u002Fds4-server` binaries without CUDA or Metal. On macOS, **warning: current macOS versions have a bug in the virtual memory implementation that will crash the kernel** if you try to run the CPU code. Remember? Software sucks. It was not possible to fix the CPU inference to avoid crashing, since each time you have to restart the computer, which is not fun. Help us, if you have the guts.\n\n## Acknowledgements to llama.cpp and GGML\n\n`ds4.c` does not link against GGML, but it **exists thanks to the path opened by the\nllama.cpp project and the kernels, quantization formats, GGUF ecosystem, and hard-won\nengineering knowledge developed there**.\nWe are thankful and indebted to [`llama.cpp`](https:\u002F\u002Fgithub.com\u002Fggml-org\u002Fllama.cpp)\nand its contributors. Their implementation, kernels, tests, and design choices were\nan essential reference while building this DeepSeek V4 Flash-specific inference path.\nSome source-level pieces are retained or adapted here under the MIT license: GGUF\nquant layouts and tables, CPU quant\u002Fdot logic, and certain kernels. For this\nreason, and because we are genuinely grateful, we keep the GGML authors' copyright\nnotice in our `LICENSE` file.\n\n## Status\n\nThe code and GGUF files are to be considered of **alpha quality** because\ninference and model serving are complicated matters and all this has existed\nfor only a few days. It will take months to reach a more stable form.\nHowever, we try to keep the project in a usable state, and we are making\nprogress. 
If you have issues, make sure to use `--trace` to log the\nsessions, and open issues including the full trace.\n\n## More Documentation\n\nIf you are looking for very specific things, we have other\nsub-README files. Otherwise for normal usage keep reading the\nnext sections.\n\n- [CONTRIBUTING.md](CONTRIBUTING.md): correctness and speed regression testing\n  guide for contributors. **Read this before sending a pull request**.\n- [gguf-tools\u002FREADME.md](gguf-tools\u002FREADME.md): offline GGUF generation,\n  imatrix collection, quantization tooling, and quality checks.\n- [gguf-tools\u002Fimatrix\u002FREADME.md](gguf-tools\u002Fimatrix\u002FREADME.md): how the\n  routed-MoE imatrix is collected and used.\n- [gguf-tools\u002Fimatrix\u002Fdataset\u002FREADME.md](gguf-tools\u002Fimatrix\u002Fdataset\u002FREADME.md):\n  how the calibration prompt corpus is generated.\n- [gguf-tools\u002Fquality-testing\u002FREADME.md](gguf-tools\u002Fquality-testing\u002FREADME.md):\n  how local GGUFs are scored against official DeepSeek V4 Flash continuations.\n- [dir-steering\u002FREADME.md](dir-steering\u002FREADME.md): directional steering data,\n  vector generation, and usage.\n- [speed-bench\u002FREADME.md](speed-bench\u002FREADME.md): benchmark CSV files and graph\n  generation.\n- [tests\u002Ftest-vectors\u002FREADME.md](tests\u002Ftest-vectors\u002FREADME.md): official\n  continuation vectors used for regression checks.\n\n## Model Weights\n\nThis implementation only works with the DeepSeek V4 Flash GGUFs published for\nthis project. It is not a general GGUF loader, and arbitrary DeepSeek\u002FGGUF files\nwill not have the tensor layout, quantization mix, metadata, or optional MTP\nstate expected by the engine. 
The 2 bit quantizations provided here are not\na joke: they behave well, work under coding agents, call tools in a reliable way.\nThe 2 bit quants use a very asymmetrical quantization: only the routed MoE\nexperts are quantized, up\u002Fgate at `IQ2_XXS`, down at `Q2_K`. They are the\nmajority of all the model space: the other components (shared experts,\nprojections, routing) are left untouched to guarantee quality.\n\nDownload one main model. **Prefer the imatrix versions.**\n\n```sh\n.\u002Fdownload_model.sh q2-imatrix   # 96\u002F128 GB RAM machines, imatrix-tuned q2\n.\u002Fdownload_model.sh q4-imatrix   # >= 256 GB RAM machines, imatrix-tuned q4\n```\n\nLegacy GGUF files are still available if you specifically need the older\nnon-imatrix quants:\n\n```sh\n.\u002Fdownload_model.sh q2           # 96\u002F128 GB RAM machines, legacy non-imatrix\n.\u002Fdownload_model.sh q4           # >= 256 GB RAM machines, legacy non-imatrix\n```\n\nThe script downloads from `https:\u002F\u002Fhuggingface.co\u002Fantirez\u002Fdeepseek-v4-gguf`,\nstores files under `.\u002Fgguf\u002F`, resumes partial downloads with `curl -C -`, and\nupdates `.\u002Fds4flash.gguf` to point at the selected q2-imatrix\u002Fq4-imatrix\u002Fq2\u002Fq4\nmodel. The plain q2 XXS weights are produced with the weights importance vector\nonly, without an imatrix. The imatrix variants are preferred.\nAuthentication is optional for public downloads, but `--token TOKEN`,\n`HF_TOKEN`, or the local Hugging Face token cache are used when present.\n\nIf you want to regenerate GGUF files or collect a new imatrix, see\n[gguf-tools\u002FREADME.md](gguf-tools\u002FREADME.md). Those tools are meant for offline\nmodel-building work and can take a long time on the full DeepSeek V4 Flash\nweights.\n\n`.\u002Fdownload_model.sh mtp` fetches the optional speculative decoding support\nGGUF. It can be used with q2-imatrix, q4-imatrix, q2, and q4, but must be\nenabled explicitly with `--mtp`. 
The current MTP\u002Fspeculative decoding path is\nstill experimental: it is correctness-gated and currently provides at most a\nslight speedup, not a meaningful generation-speed win.\n\nThen build:\n\n```sh\nmake                  # macOS Metal\nmake cuda-spark       # Linux CUDA, DGX Spark \u002F GB10\nmake cuda-generic     # Linux CUDA, other local CUDA GPUs\nmake cpu              # CPU-only diagnostics build\n```\n\n`.\u002Fds4flash.gguf` is the default model path used by both binaries. Pass `-m` to\nselect another supported GGUF from `.\u002Fgguf\u002F`. Run `.\u002Fds4 --help` and\n`.\u002Fds4-server --help` for the full flag list.\n\n## Speed\n\nThese are single-run Metal CLI numbers with `--ctx 32768`, `--nothink`, greedy\ndecoding, and `-n 256`. The short prompt is a normal small Italian story\nprompt. The long prompts exercise chunked prefill plus long-context decode.\nQ4 requires the larger-memory machine class, so M3 Max Q4 numbers are `N\u002FA`.\n\n| Machine | Quant | Prompt | Prefill | Generation |\n| --- | ---: | ---: | ---: | ---: |\n| MacBook Pro M3 Max, 128 GB | q2 | short | 58.52 t\u002Fs | 26.68 t\u002Fs |\n| MacBook Pro M3 Max, 128 GB | q2 | 11709 tokens | 250.11 t\u002Fs | 21.47 t\u002Fs |\n| MacBook Pro M3 Max, 128 GB | q4 | short | N\u002FA | N\u002FA |\n| MacBook Pro M3 Max, 128 GB | q4 | long | N\u002FA | N\u002FA |\n| Mac Studio M3 Ultra, 512 GB | q2 | short | 84.43 t\u002Fs | 36.86 t\u002Fs |\n| Mac Studio M3 Ultra, 512 GB | q2 | 11709 tokens | 468.03 t\u002Fs | 27.39 t\u002Fs |\n| Mac Studio M3 Ultra, 512 GB | q4 | short | 78.95 t\u002Fs | 35.50 t\u002Fs |\n| Mac Studio M3 Ultra, 512 GB | q4 | 12018 tokens | 448.82 t\u002Fs | 26.62 t\u002Fs |\n| DGX Spark GB10, 128 GB | q2 | 7047 tokens | 343.81 t\u002Fs | 13.75 t\u002Fs |\n\n![M3 Max t\u002Fs](speed-bench\u002Fm3_max_ts.svg)\n\n## Benchmarking\n\n`ds4-bench` measures instantaneous prefill and generation throughput at context\nfrontiers instead of reporting one whole-run average. 
It loads the model once,\nwalks a fixed token sequence to frontiers such as 2048, 4096, 6144, and uses\nincremental prefill so each row measures only the newly-added token interval.\nAfter each frontier it saves the live KV state to memory, generates a fixed\ngreedy non-EOS probe, restores the memory snapshot, and continues prefill.\n\n```sh\n.\u002Fds4-bench \\\n  -m ds4flash.gguf \\\n  --prompt-file speed-bench\u002Fpromessi_sposi.txt \\\n  --ctx-start 2048 \\\n  --ctx-max 65536 \\\n  --step-incr 2048 \\\n  --gen-tokens 128\n```\n\nThe example file is a cleaned public-domain Project Gutenberg text of\nAlessandro Manzoni's *I Promessi Sposi* (ebook #45334), with the Gutenberg\nheader and footer removed: \u003Chttps:\u002F\u002Fwww.gutenberg.org\u002Febooks\u002F45334>.\n\nUse `--step-incr N` for different linear spacing, or `--step-mul F` for\nexponential sweeps. Output is CSV with one row per frontier: latest prefill\ninterval tokens\u002Fsec, generation tokens\u002Fsec at that frontier, and\n`kvcache_bytes`.\n\n## Capability Evaluation\n\n`ds4-eval` is a small real-model integration benchmark. It is not a leaderboard\nrunner and should not be reported as an official GPQA, SuperGPQA, AIME, or\nsecurity benchmark score: the questions are an embedded 92-item subset chosen\nto make local regression testing useful and visually inspectable. 
The program\nloads the real GGUF,\nrenders DS4 chat prompts, streams sampled tokens in a split-screen TUI, grades\nthe final answer, and prints a per-question report with prompt tokens,\ngenerated tokens, pass\u002Ffail state, the model answer, and the correct answer.\n\n```sh\n.\u002Fds4-eval -m ds4flash.gguf --trace \u002Ftmp\u002Fds4-eval.txt\n```\n\nThe default run uses `--tokens 16000`, thinking mode enabled, and a soft\u002Fhard\n`\u003C\u002Fthink>` budget cutoff so the model has room to produce a visible answer.\n`ds4-eval` sizes the context internally from the largest selected prompt plus\nthe generation budget, and refuses runs that would need more than 1M context\ntokens. Press `p` to pause, `q` to exit and print the report, Up\u002FDown to\ninspect or select another question, and Enter to run the selected question next.\n`--plain` disables the TUI.\n\nThe first 75 embedded questions are interleaved as 25 GPQA Diamond, 25 audited\nSuperGPQA, and 25 AIME 2025 problems. The final 17 are an audited COMPSEC\nsubset of reduced single-function C\u002FC++ vulnerability-localization questions.\nThe model is asked for the single best source line, or the smallest exact line\nset only when the bug cannot be localized to one line; the scorer accepts small\naudited ranges only when adjacent lines are equivalent locations for the same\nbug. The order is\nintentionally progressive: early questions are useful smoke tests, while later\nquestions are hard enough that a strong reasoning model should still miss some\nof them. The SuperGPQA slice is curated rather than blind: upstream rows with\nwrong keys, missing figures, or underspecified prompts are replaced with cleaner\nrows.\n\nFor a model like DeepSeek V4 Flash, the set should be treated as a hard\ncapability regression suite rather than a pass\u002Ffail unit test:\n\n- **GPQA Diamond** contributes graduate-level science questions with\n  multiple-choice answers. 
DeepSeek's model card reports strong Flash results\n  on full GPQA Diamond in thinking mode, but individual items still require\n  careful physics, chemistry, or biology reasoning and are easy to lose with a\n  small prompt\u002Frendering or sampling regression.\n- **SuperGPQA** contributes broad specialist knowledge and domain-transfer\n  questions. The model-card SuperGPQA number is much lower than GPQA Diamond,\n  so these items are expected to be uneven: some look mundane, others require\n  niche professional knowledge or exact interpretation of a translated-style\n  exam question.\n- **AIME 2025** contributes exact-answer contest math. These are often the most\n  unforgiving items in the set: no multiple-choice prior, no partial credit, and\n  a single arithmetic or algebraic slip changes the grade.\n- **COMPSEC** contributes single-function C\u002FC++ security reasoning items\n  reduced from public CVE writeups. These are not exploit prompts: the task is\n  to identify the best source line where the defensive code flaw is introduced,\n  or return `0` for a safe function.\n\nIn practice this means `ds4-eval` should not be expected to produce a perfect\n92\u002F92 run. It is meant to answer a more useful engineering question: after a\nkernel, quantization, prompt-rendering, KV-cache, or tool-streaming change, does\nDeepSeek V4 Flash still solve a representative mix of hard science, broad\nknowledge, exact math, and security-code problems while using the same inference\npath users run?\n\n## CLI\n\nOne-shot prompt:\n\n```sh\n.\u002Fds4 -p \"Explain Redis streams in one paragraph.\"\n```\n\nNo `-p` starts the interactive prompt:\n\n```sh\n.\u002Fds4\nds4>\n```\n\nThe interactive CLI is a real multi-turn DS4 chat. It keeps the rendered chat\ntranscript and the live graph KV checkpoint, so each turn extends the previous\nconversation. 
Useful commands are `\u002Fhelp`, `\u002Fthink`, `\u002Fthink-max`, `\u002Fnothink`,\n`\u002Fctx N`, `\u002Fread FILE`, and `\u002Fquit`. Ctrl+C interrupts the current generation\nand returns to `ds4>`.\n\nThe CLI defaults to thinking mode. Use `\u002Fnothink` or `--nothink` for direct\nanswers. `--mtp MTP.gguf --mtp-draft 2` enables the optional MTP speculative\npath; it is useful only for greedy decoding, currently uses a confidence gate\n(`--mtp-margin`) to avoid slow partial accepts, and should be treated as an\nexperimental slight-speedup path.\n\n## Server\n\nStart a local OpenAI\u002FAnthropic-compatible server:\n\n```sh\n.\u002Fds4-server --ctx 100000 --kv-disk-dir \u002Ftmp\u002Fds4-kv --kv-disk-space-mb 8192\n```\n\nUse `--chdir \u002Fpath\u002Fto\u002Fds4` when launching `ds4-server` from another directory,\nso relative runtime files such as `metal\u002F*.metal` resolve from the project tree.\n\nThe server keeps one mutable backend\u002FKV checkpoint in memory,\nso stateless clients that resend a longer version of the same prompt can reuse\nthe shared prefix instead of pre-filling from token zero.\n\nRequest parsing and sockets run in client threads, but inference itself is\nserialized through one graph worker. 
The current server does not batch multiple\nindependent requests together; concurrent requests wait their turn on the single\nlive graph\u002Fsession.\n\nSupported endpoints:\n\n- `GET \u002Fv1\u002Fmodels`\n- `GET \u002Fv1\u002Fmodels\u002Fdeepseek-v4-flash`\n- `POST \u002Fv1\u002Fchat\u002Fcompletions`\n- `POST \u002Fv1\u002Fresponses`\n- `POST \u002Fv1\u002Fcompletions`\n- `POST \u002Fv1\u002Fmessages`\n\n`\u002Fv1\u002Fchat\u002Fcompletions` accepts the usual OpenAI-style `messages`,\n`max_tokens`\u002F`max_completion_tokens`, `temperature`, `top_p`, `top_k`, `min_p`,\n`seed`, `stream`, `stream_options.include_usage`, `tools`, and `tool_choice`.\nTool schemas are rendered into DeepSeek's DSML tool format, and generated DSML\ntool calls are mapped back to OpenAI tool calls.\n\n`\u002Fv1\u002Fresponses` accepts OpenAI Responses-style `input`, `instructions`,\n`tools`, `tool_choice`, `max_output_tokens`, `temperature`, `top_p`, `stream`,\nand `reasoning`. It is the preferred endpoint for Codex CLI. The server keeps\nResponses continuations bound to live state when possible, and can fall back to\nthe same DSML rendering and KV prefix reuse used by chat completions.\n\n`\u002Fv1\u002Fmessages` is the Anthropic-compatible endpoint used by Claude Code style\nclients. It accepts `system`, `messages`, `tools`, `tool_choice`, `max_tokens`,\n`temperature`, `top_p`, `top_k`, `stream`, `stop_sequences`, and thinking\ncontrols. Tool uses are returned as Anthropic `tool_use` blocks.\n\nDefault sampled API generation uses `temperature=1`, `top_p=1`, and\n`min_p=0.05`, so the default filter is relative probability rather than\nnucleus mass. In thinking mode DS4 uses those fixed sampling defaults and\nignores client sampling knobs, matching DeepSeek's fixed-thinking API behavior.\n\nThe chat, Responses, and Anthropic endpoints support SSE streaming. In thinking\nmode, reasoning is streamed in the native API shape instead of being mixed into\nfinal text. 
OpenAI chat streaming\nalso streams tool calls as soon as the DSML invocation is recognized: the tool\nheader is sent first, then parameter bytes are forwarded as\n`tool_calls[].function.arguments` deltas while generation continues. The\nAnthropic endpoint streams thinking and text live, then emits structured\n`tool_use` blocks when the generated tool block is complete.\nThe Responses endpoint streams the Responses event lifecycle expected by Codex,\nincluding `response.output_text.delta`, function-call argument events, and\nterminal `response.completed` \u002F `response.incomplete` \u002F `response.failed`\nevents.\n\nFor browser JavaScript clients served from another origin, start the server with\n`--cors` to emit `Access-Control-Allow-*` headers. This only changes HTTP\nheaders; it does not expose the server on the LAN. Use `--host 0.0.0.0`\nexplicitly when remote machines should be able to connect.\n\n### Tool call handling and canonicalization\n\nDeepSeek V4 Flash emits tool calls as [DSML text](https:\u002F\u002Fhuggingface.co\u002Fdeepseek-ai\u002FDeepSeek-V4-Pro\u002Fblob\u002Fmain\u002Fencoding\u002FREADME.md). Agent clients do not send that\nsame text back on the next request: they send normalized OpenAI\u002FAnthropic JSON\ntool-call objects. **If the server re-rendered those objects slightly\ndifferently, the rendered byte prefix would no longer match the live KV\ncheckpoint** and the next turn would have to be rebuilt.\n\nThe first line of defense is exact replay. Every tool call gets an unguessable\nAPI tool ID, and the server remembers `tool id -> exact sampled DSML block` in\na bounded in-memory map backed by radix trees. When the client later sends that\ntool ID back, the prompt renderer uses the exact DSML bytes the model sampled,\nnot a freshly formatted approximation. This map can also be saved inside KV\ncache files, so exact replay survives server restarts for cached histories.\n\n**Canonicalization is only the backup path**. 
If the exact DSML block is missing,\nor exact replay is disabled with `--disable-exact-dsml-tool-replay`, the server\nrenders a deterministic DSML form from the JSON tool object. After a tool-call\nturn, it compares the live sampled token stream with the prompt that the next\nclient request will render. If needed, it rewrites the live checkpoint, or\nfalls back to an older disk KV snapshot and replays only the suffix. This keeps\nthe model continuation aligned with the stateless API transcript.\n\nDuring generation, the server also treats DSML syntax differently from payload.\nWhen the model is emitting stable protocol structure such as DSML tags,\nparameter headers, JSON punctuation, or closing markers, sampling is forced to\n`temperature=0` so the tool call stays parseable. This greedy mode does **not**\napply to argument payloads: `string=true` parameter bodies and JSON string\nvalues, including file contents and edit text, use the request's normal sampling\nsettings. That separation is important: deterministic decoding is helpful for\nsyntax, but can create repeated text when applied to long code or file bodies.\n\nMinimal OpenAI example:\n\n```sh\ncurl http:\u002F\u002F127.0.0.1:8000\u002Fv1\u002Fchat\u002Fcompletions \\\n  -H 'Content-Type: application\u002Fjson' \\\n  -d '{\n    \"model\":\"deepseek-v4-flash\",\n    \"messages\":[{\"role\":\"user\",\"content\":\"List three Redis design principles.\"}],\n    \"stream\":true\n  }'\n```\n\n### Agent Client Usage\n\n`ds4-server` can be used by local coding agents that speak OpenAI-compatible\nchat completions. Start the server first, and set the client context limit no\nhigher than the `--ctx` value you started the server with:\n\n```sh\n.\u002Fds4-server --ctx 100000 --kv-disk-dir \u002Ftmp\u002Fds4-kv --kv-disk-space-mb 8192\n```\n\nYou can use larger context and larger cache if you wish. 
A full context of\n1M tokens is going to use more or less 26GB of memory (the compressed indexer\nalone will be like 22GB), so configure a context which makes sense for\nyour system. With 128GB of RAM you would run the 2-bit quants, which are\nalready 81GB, so 26GB is likely to be too much, and a context window\nof 100~300k tokens is wiser. However, users reported being able to run 2-bit\nquants with a 250k ctx window on Macs with just 96GB of system memory: make sure\nto kill processes that use too much memory, if you plan on doing so ;)\n\nThe `384000` output limit below avoids token caps since the model is able\nto generate very long replies otherwise (up to 384k tokens). The server\nstill stops when the configured context window is full.\n\nFor **opencode**, add a provider and agent entry to\n`~\u002F.config\u002Fopencode\u002Fopencode.json`:\n\n```json\n{\n  \"$schema\": \"https:\u002F\u002Fopencode.ai\u002Fconfig.json\",\n  \"provider\": {\n    \"ds4\": {\n      \"name\": \"ds4.c (local)\",\n      \"npm\": \"@ai-sdk\u002Fopenai-compatible\",\n      \"options\": {\n        \"baseURL\": \"http:\u002F\u002F127.0.0.1:8000\u002Fv1\",\n        \"apiKey\": \"dsv4-local\"\n      },\n      \"models\": {\n        \"deepseek-v4-flash\": {\n          \"name\": \"DeepSeek V4 Flash (ds4.c local)\",\n          \"limit\": {\n            \"context\": 100000,\n            \"output\": 384000\n          }\n        }\n      }\n    }\n  },\n  \"agent\": {\n    \"ds4\": {\n      \"description\": \"DeepSeek V4 Flash served by local ds4-server\",\n      \"model\": \"ds4\u002Fdeepseek-v4-flash\",\n      \"temperature\": 0\n    }\n  }\n}\n```\n\nFor **Pi**, add a provider to `~\u002F.pi\u002Fagent\u002Fmodels.json`:\n\n```json\n{\n  \"providers\": {\n    \"ds4\": {\n      \"name\": \"ds4.c local\",\n      \"baseUrl\": \"http:\u002F\u002F127.0.0.1:8000\u002Fv1\",\n      \"api\": \"openai-completions\",\n      \"apiKey\": \"dsv4-local\",\n      \"compat\": {\n        \"supportsStore\": false,\n   
     \"supportsDeveloperRole\": false,\n        \"supportsReasoningEffort\": true,\n        \"supportsUsageInStreaming\": true,\n        \"maxTokensField\": \"max_tokens\",\n        \"supportsStrictMode\": false,\n        \"thinkingFormat\": \"deepseek\",\n        \"requiresReasoningContentOnAssistantMessages\": true\n      },\n      \"models\": [\n        {\n          \"id\": \"deepseek-v4-flash\",\n          \"name\": \"DeepSeek V4 Flash (ds4.c local)\",\n          \"reasoning\": true,\n          \"thinkingLevelMap\": {\n            \"off\": null,\n            \"minimal\": \"low\",\n            \"low\": \"low\",\n            \"medium\": \"medium\",\n            \"high\": \"high\",\n            \"xhigh\": \"xhigh\"\n          },\n          \"input\": [\"text\"],\n          \"contextWindow\": 100000,\n          \"maxTokens\": 384000,\n          \"cost\": {\n            \"input\": 0,\n            \"output\": 0,\n            \"cacheRead\": 0,\n            \"cacheWrite\": 0\n          }\n        }\n      ]\n    }\n  }\n}\n```\n\nOptionally make it the default Pi model in `~\u002F.pi\u002Fagent\u002Fsettings.json`:\n\n```json\n{\n  \"defaultProvider\": \"ds4\",\n  \"defaultModel\": \"deepseek-v4-flash\"\n}\n```\n\nFor **Codex CLI**, use the Responses wire API:\n\n```toml\n[model_providers.ds4]\nname = \"DS4\"\nbase_url = \"http:\u002F\u002F127.0.0.1:8000\u002Fv1\"\nwire_api = \"responses\"\nstream_idle_timeout_ms = 1000000\n```\n\nThen run:\n\n```sh\ncodex --model deepseek-v4-flash -c model_provider=ds4\n```\n\nFor **Claude Code**, use the Anthropic-compatible endpoint. 
A wrapper like this\nmatches the local `~\u002Fbin\u002Fclaude-ds4` setup:\n\n```sh\n#!\u002Fbin\u002Fsh\nunset ANTHROPIC_API_KEY\n\nexport ANTHROPIC_BASE_URL=\"${DS4_ANTHROPIC_BASE_URL:-http:\u002F\u002F127.0.0.1:8000}\"\nexport ANTHROPIC_AUTH_TOKEN=\"${DS4_API_KEY:-dsv4-local}\"\nexport ANTHROPIC_MODEL=\"deepseek-v4-flash\"\n\nexport ANTHROPIC_CUSTOM_MODEL_OPTION=\"deepseek-v4-flash\"\nexport ANTHROPIC_CUSTOM_MODEL_OPTION_NAME=\"DeepSeek V4 Flash local ds4\"\nexport ANTHROPIC_CUSTOM_MODEL_OPTION_DESCRIPTION=\"ds4.c local GGUF\"\n\nexport ANTHROPIC_DEFAULT_SONNET_MODEL=\"deepseek-v4-flash\"\nexport ANTHROPIC_DEFAULT_HAIKU_MODEL=\"deepseek-v4-flash\"\nexport ANTHROPIC_DEFAULT_OPUS_MODEL=\"deepseek-v4-flash\"\nexport CLAUDE_CODE_SUBAGENT_MODEL=\"deepseek-v4-flash\"\n\nexport CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1\nexport CLAUDE_CODE_DISABLE_NONSTREAMING_FALLBACK=1\nexport CLAUDE_STREAM_IDLE_TIMEOUT_MS=600000\n\nexec \"$HOME\u002F.local\u002Fbin\u002Fclaude\" \"$@\"\n```\n\nClaude Code may send a large initial prompt, often around 25k tokens, before it\nstarts doing useful work. Keep `--kv-disk-dir` enabled: after the first expensive\nprefill, the disk KV cache lets later continuations or restarted sessions reuse\nthe saved prefix instead of processing the whole prompt again.\n\n## Thinking Modes\n\nDeepSeek V4 Flash has distinct non-thinking, thinking, and Think Max modes.\nThe server defaults to thinking mode. `reasoning_effort=max` requests Think\nMax, but it is only applied when the context size is large enough for the model\ncard recommendation; smaller contexts fall back to normal thinking. OpenAI\n`reasoning_effort=xhigh` still maps to normal thinking, not Think Max.\n\nFor direct replies, use `thinking: {\"type\":\"disabled\"}`, `think:false`, or a\nnon-thinking model alias such as `deepseek-chat`.\n\n## Disk KV Cache\n\nChat\u002Fcompletion APIs are stateless: agent clients usually resend the whole\nconversation every request. 
`ds4-server` first tries the cheap exact token-prefix\ncheck, then falls back to comparing rendered prompt bytes with decoded\ncheckpoint bytes. The live in-memory checkpoint covers the current session; the\ndisk KV cache makes useful prefixes survive session switches and server\nrestarts.\n\nFor RAM reasons there is currently only one live KV cache in memory. When a new\nunrelated session replaces it, the old checkpoint can only be resumed without\nre-processing if it was written to the disk KV cache. In other words, memory\ncache handles the active session; disk cache is the resume mechanism for\ndifferent sessions.\n\nEnable it with:\n\n```sh\n.\u002Fds4-server --kv-disk-dir \u002Ftmp\u002Fds4-kv --kv-disk-space-mb 8192\n```\n\nThe cache key is the SHA1 of the rendered byte prefix, and files are named\n`\u003Csha1>.kv`. The DS4 payload still stores the exact token IDs and graph state\nfor that prefix. This matters for continued chats: the model may have generated\none token whose decoded text is later sent back by a client as two canonical\nprompt tokens. A rendered byte-prefix hit can still reuse the checkpoint and\ntokenize only the new suffix.\nThe file is intentionally written with ordinary `read`\u002F`write` I\u002FO, not\n`mmap`, so restoring cache entries does not add more VM mappings to a process\nthat already maps the model.\n\nTool calls also keep a bounded exact-DSML replay map keyed by unguessable tool\nIDs, so client JSON history can be rendered back to the exact sampled text. 
The\nRAM map keeps up to 100000 IDs by default; tune it with `--tool-memory-max-ids`.\nUse `--disable-exact-dsml-tool-replay` to disable this and fall back to\ncanonical JSON-to-DSML rendering.\n\nOn disk, a cache file is:\n\n```text\nKVC fixed header, 48 bytes\nu32 rendered_text_bytes\nrendered_text_bytes of UTF-8-ish token text\nDS4 session payload, payload_bytes from the KVC header\noptional tool-id map section\n```\n\nThe fixed header is little-endian:\n\n```text\n0   u8[3]  magic = \"KVC\"\n3   u8     version = 1\n4   u8     routed expert quant bits, currently 2 or 4\n5   u8     save reason: 0 unknown, 1 cold, 2 continued, 3 evict, 4 shutdown\n6   u8     extension flags, bit 0 = appended tool-id map\n7   u8     reserved\n8   u32    cached token count\n12  u32    hit count\n16  u32    context size the snapshot was written for\n20  u8[4]  reserved\n24  u64    creation Unix time\n32  u64    last-used Unix time\n40  u64    DS4 session payload byte count\n```\n\nThe rendered text is the tokenizer-decoded text for the cached token prefix.\nIt is both the human-inspectable prefix and the lookup identity: its SHA1 is\nthe filename, and a file is reusable only when those bytes are a prefix of the\nincoming rendered prompt. After load, the exact checkpoint tokens from the DS4\npayload remain authoritative, and only the incoming text suffix after the cached\nbytes is tokenized.\n\nThe optional tool-id map is present only when header extension bit 0 is set.\nAppended sections use fixed bit order, so future extension bits can add fields\nwithout ambiguity. The map links unguessable API tool call IDs back to the\nexact DSML block the model sampled. Only mappings whose DSML block is present\nin the rendered cached text are stored. 
This lets restarted servers render\nlater client history byte-for-byte like the original model output, even if the\nclient reorders JSON arguments.\n\nThe current tool-id map section is:\n\n```text\n0   u8[3]  magic = \"KTM\"\n3   u8     version = 1\n4   u32    entry count\n\nFor each entry:\n0   u32    tool id byte length\n4   u32    sampled DSML byte length\n8   bytes  tool id\n... bytes  exact sampled DSML block\n```\n\nThe section is auxiliary replay memory, not model state. A cache hit restores\nthe session payload first, then loads the map if present. Before rendering a\nrequest, the server can also scan cache files for the tool IDs present in the\nclient history and load just those mappings, so an exact DSML replay can survive\nserver restarts even when the matching KV snapshot is not the one ultimately\nused for the rendered-prefix hit.\n\nThe DS4 session payload starts with thirteen little-endian `u32` fields:\n\n```text\n0   magic = \"DSV4\"\n1   payload version = 1\n2   saved context size\n3   prefill chunk size\n4   raw KV ring capacity\n5   raw sliding-window length\n6   compressed KV capacity\n7   checkpoint token count\n8   layer count\n9   raw\u002Fhead KV dimension\n10  indexer head dimension\n11  vocabulary size\n12  live raw rows serialized below\n```\n\nThen it stores:\n\n- `u32[token_count]` checkpoint token IDs.\n- `float32[vocab_size]` logits for the next token after that checkpoint.\n- `u32[layer_count]` compressed attention row counts.\n- `u32[layer_count]` ratio-4 indexer row counts.\n- For every layer: the live raw sliding-window KV rows, written in logical\n  position order rather than physical ring order.\n- For compressed layers: live compressed KV rows and compressor frontier\n  tensors.\n- For ratio-4 compressed layers: live indexer compressed rows and indexer\n  frontier tensors.\n\nThe logits are raw IEEE-754 `float32` values from the host `ds4_session`\nbuffer. 
They are saved immediately after the checkpoint tokens so a loaded\nsnapshot can sample or continue from the exact next-token distribution without\nrunning one extra decode step. MTP draft logits\u002Fstate are not persisted; after\nloading a disk checkpoint the draft state is invalidated and rebuilt by normal\ngeneration.\n\nThe tensor payload is DS4-specific KV\u002Fsession state, not a generic inference\ngraph dump. It is expected to be portable only across compatible `ds4.c`\nbuilds for this model layout.\n\nThe cache stores checkpoints at four moments:\n\n- `cold`: after a long first prompt reaches a stable prefix, before generation.\n- `continued`: when prefill or generation reaches the next absolute aligned frontier.\n- `evict`: before an unrelated request replaces the live in-memory session.\n- `shutdown`: when the server exits cleanly.\n\nCold saves intentionally trim a small token suffix and align down to a prefill\nchunk boundary. This avoids common BPE boundary retokenization misses when a\nfuture request appends text to the same prompt. The defaults are conservative:\nstore prefixes of at least 512 tokens, cold-save prompts up to 30000 tokens,\ntrim 32 tail tokens, and align to 2048-token chunks.\n\nContinued saves use the same alignment and are written only when the live graph\nnaturally reaches an absolute frontier. With the defaults this means roughly\nevery 10k tokens, independent of where the first cold checkpoint landed, so long\ngenerations leave restart points behind without persisting the fragile final few\ntokens.\n\nThe important knobs are:\n\n- `--kv-cache-min-tokens`\n- `--kv-cache-cold-max-tokens`\n- `--kv-cache-continued-interval-tokens`\n- `--kv-cache-boundary-trim-tokens`\n- `--kv-cache-boundary-align-tokens`\n- `--tool-memory-max-ids`\n- `--disable-exact-dsml-tool-replay`\n\nBy default, checkpoints may be reused across the 2-bit and 4-bit routed-expert\nvariants if the rendered prefix matches. 
Use `--kv-cache-reject-different-quant`\nwhen you want strict same-quant reuse only.\n\nThe cache directory is disposable. If behavior looks suspicious, stop the\nserver and remove it. You can inspect what is cached with `hexdump`, because\nthe KV cache files include the cached prompt verbatim.\n\n## Backends\n\nThe default graph backend is Metal on macOS and CUDA in CUDA builds:\n\n```sh\n.\u002Fds4 -p \"Hello\" --metal\n.\u002Fds4 -p \"Hello\" --cuda\n```\n\nOn Linux, plain `make` prints the available build targets instead of selecting a\nCUDA target implicitly. Use `make cuda-spark` for DGX Spark \u002F GB10. It omits an\nexplicit `nvcc -arch` because that is currently the fastest path on GB10. Use\n`make cuda-generic` for a normal local CUDA build, or set `CUDA_ARCH` explicitly\nwhen cross-building or when you need a known target:\n\n```sh\nmake cuda CUDA_ARCH=sm_120\nmake cuda CUDA_ARCH=native\n```\n\nThere is also a CPU reference\u002Fdebug path:\n\n```sh\n.\u002Fds4 -p \"Hello\" --cpu\nmake cpu\n.\u002Fds4\n.\u002Fds4 -p \"Hello\"\n```\n\nDo not treat the CPU path as the production target. The CLI and `ds4-server`\nsupport the CPU backend for reference\u002Fdebug use and share the same KV session\nand snapshot format as Metal and CUDA, but normal inference should use Metal or\nCUDA.\n\n## Steering\n\nThis project supports steering with single-vector activation directions; see the\n`dir-steering` directory for more information. This follows the core idea of the\n[Refusal in Language Models Is Mediated by a Single Direction](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.11717)\npaper. 
You can use it to make the model more or less verbose, less likely to\nanswer programming questions if it is a chatbot for your car rental web site,\nand so forth, much faster than fine-tuning.\nThis is also useful for cybersecurity researchers who want to reduce a model's\nwillingness to provide dual-use or offensive security guidance.\n\n## Test Vectors\n\n`tests\u002Ftest-vectors` contains short and long-context continuation vectors\ncaptured from the official DeepSeek V4 Flash API. The requests use\n`deepseek-v4-flash`, greedy decoding, thinking disabled, and the maximum\n`top_logprobs` slice exposed by the API. Local vectors are generated with\n`.\u002Fds4 --dump-logprobs` and compared by token bytes, so tokenizer\u002Ftemplate or\nattention regressions show up before they become long generation failures.\n\nAll project tests are driven by the C runner:\n\n```sh\nmake test                  # .\u002Fds4_test --all\n.\u002Fds4_test --logprob-vectors\n.\u002Fds4_test --server\n```\n\n## Debugging Notes\n\nWhen a generation looks wrong, three small tools are usually enough to get a\nfirst answer:\n\n```sh\n.\u002Fds4 --dump-tokens -p \"...\"\n.\u002Fds4 --dump-logprobs \u002Ftmp\u002Fout.json --logprobs-top-k 20 --temp 0 -p \"...\"\n.\u002Fds4-server --trace \u002Ftmp\u002Fds4-trace.txt ...\n```\n\n- `--dump-tokens` tokenizes the `-p` or `--prompt-file` string exactly as\n  written, recognizes DS4 protocol specials, and then exits before inference\n  starts. 
For example, the DSML tool close marker starts as two tokens: `\u003C\u002F`\n  and `｜DSML｜`.\n- `--dump-logprobs` stores a greedy continuation with the top local\n  alternatives at each step, which helps separate sampling choices from\n  logit\u002Fmodel issues.\n- `ds4-server --trace` writes the rendered prompts, cache decisions, generated\n  text, and tool-parser events for a whole agent session.\n","DwarfStar 4 是一个专为 DeepSeek V4 Flash 设计的本地推理引擎，主要针对 Metal 后端进行了优化。该项目完全自包含，不依赖于其他运行时环境，旨在高效且正确地运行模型，并提供特定的加载、提示渲染、工具调用、KV状态管理（内存和磁盘）以及服务器API等功能，适用于编码代理或通过提供的CLI接口操作。此外，还支持NVIDIA CUDA和AMD ROCm后端（后者仅在单独分支中维护），并具备生成GGUF和imatrix文件及质量速度测试的能力。其特点包括：高效的参数利用使得模型运行更快；在思考模式下，思考部分长度与问题复杂度成正比，显著短于同类模型；拥有1百万token的上下文窗口；支持2位量化以适应有限内存环境；以及高度压缩的KV缓存支持长时间上下文推理。适合需要高性能本地推理能力的应用场景，尤其是对大容量文本处理有需求的情况。",2,"2026-06-11 03:31:34","CREATED_QUERY"]