[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-73352":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":16,"stars7d":17,"stars30d":18,"stars90d":16,"forks30d":16,"starsTrendScore":16,"compositeScore":19,"rankGlobal":10,"rankLanguage":10,"license":20,"archived":21,"fork":21,"defaultBranch":22,"hasWiki":23,"hasPages":21,"topics":24,"createdAt":10,"pushedAt":10,"updatedAt":30,"readmeContent":31,"aiSummary":32,"trendingCount":16,"starSnapshotCount":16,"syncStatus":17,"lastSyncTime":33,"discoverSource":34},73352,"deepseek-ocr.rs","TimmyOVO\u002Fdeepseek-ocr.rs","TimmyOVO","Rust multi‑backend OCR\u002FVLM engine (DeepSeek‑OCR-1\u002F2, PaddleOCR‑VL, DotsOCR) with DSQ quantization and an OpenAI‑compatible server & CLI – run locally without Python.","",null,"Rust",2167,169,12,15,0,2,8,28.69,"Apache License 2.0",false,"master",true,[25,26,27,28,29],"candle","ocr","ocr-recognition","openai","rust","2026-06-12 02:03:12","# deepseek-ocr.rs 🚀\n\nRust implementation of the DeepSeek-OCR inference stack with a fast CLI and an OpenAI-compatible HTTP server. The workspace packages multiple OCR backends, prompt tooling, and a serving layer so you can build document understanding pipelines that run locally on CPU, Apple Metal, or (alpha) NVIDIA CUDA GPUs.\n\n> 中文文档请看 [README_CN.md](README_CN.md)。  \n\n> Want ready-made binaries? Latest macOS (Metal-enabled) and Windows bundles live in the [build-binaries workflow artifacts](https:\u002F\u002Fgithub.com\u002FTimmyOVO\u002Fdeepseek-ocr.rs\u002Factions\u002Fworkflows\u002Fbuild-binaries.yml). 
Grab them from the newest green run.\n\n## Choosing a Model 🔬\n\n| Model | Memory footprint* | Best on | When to pick it |\n| --- | --- | --- | --- |\n| **DeepSeek‑OCR** | **≈6.3GB** FP16 weights, **≈13GB** RAM\u002FVRAM with cache & activations (512-token budget) | Apple Silicon + Metal (FP16), high-VRAM NVIDIA GPUs, 32GB+ RAM desktops | Highest accuracy, SAM+CLIP global\u002Flocal context, MoE DeepSeek‑V2 decoder (3B params, ~570M active per token). Use when latency is secondary to quality. |\n| **PaddleOCR‑VL** | **≈4.7GB** FP16 weights, **≈9GB** RAM\u002FVRAM with cache & activations | 16GB laptops, CPU-only boxes, mid-range GPUs | Dense 0.9B Ernie decoder with SigLIP vision tower. Faster startup, lower memory, great for batch jobs or lightweight deployments. |\n| **DotsOCR** | **≈9GB** FP16 weights, but expect **30–50GB** RAM\u002FVRAM for high-res docs due to huge vision tokens | Apple Silicon + Metal BF16, ≥24GB CUDA cards, or 64GB RAM CPU workstations | Unified VLM (DotsVision + Qwen2) that nails layout, reading order, grounding, and multilingual math if you can tolerate the latency and memory bill. |\n\n\\*Measured from the default FP16 safetensors. Runtime footprint varies with sequence length.\n\nGuidance:\n\n- **Need maximum fidelity, multi-region reasoning, or already have 16–24GB VRAM?** Use **DeepSeek‑OCR**. The hybrid SAM+CLIP tower plus DeepSeek‑V2 MoE decoder handles complex layouts best, but expect higher memory\u002Flatency.\n- **Deploying to CPU-only nodes, 16GB laptops, or latency-sensitive services?** Choose **PaddleOCR‑VL**. Its dense Ernie decoder (18 layers, hidden 1024) activates fewer parameters per token and keeps memory under 10GB while staying close in quality on most docs.\n- **Chasing reading-order accuracy, layout grounding, or multi-page multilingual PDFs on roomy hardware?** Pick **DotsOCR** with BF16 on Metal\u002FCUDA. 
Prefill runs around 40–50 tok\u002Fs on M-series GPUs but can fall to ~12 tok\u002Fs on CPU because of the heavy vision tower.\n\n## Why Rust? 💡\n\nThe original DeepSeek-OCR ships as a Python + Transformers stack—powerful, but hefty to deploy and awkward to embed. Rewriting the pipeline in Rust gives us:\n\n- Smaller deployable artifacts with zero Python runtime or conda baggage.\n- Memory-safe, thread-friendly infrastructure that blends into native Rust backends.\n- Unified tooling (CLI + server) running on Candle + Rocket without the Python GIL overhead.\n- Drop-in compatibility with OpenAI-style clients while tuned for single-turn OCR prompts.\n\n## Technical Stack ⚙️\n\n- **Candle** for tensor compute, with Metal and CUDA backends and FlashAttention support.\n- **Rocket** + async streaming for OpenAI-compatible `\u002Fv1\u002Fresponses` and `\u002Fv1\u002Fchat\u002Fcompletions`.\n- **tokenizers** (upstream DeepSeek release) wrapped by `crates\u002Fassets` for deterministic caching via Hugging Face and ModelScope mirrors.\n- **Pure Rust vision\u002Fprompt pipeline** shared by CLI and server to avoid duplicated logic.\n\n## Advantages over the Python Release 🥷\n\n- Faster cold-start on Apple Silicon, lower RSS, and native binary distribution.\n- Deterministic dual-source (Hugging Face + ModelScope) asset download + verification built into the workspace.\n- Automatic single-turn chat compaction so OCR outputs stay stable even when clients send history.\n- Ready-to-use OpenAI compatibility for tools like Open WebUI without adapters.\n\n## Highlights ✨\n\n- **One repo, two entrypoints** – a batteries-included CLI for batch jobs and a Rocket-based server that speaks `\u002Fv1\u002Fresponses` and `\u002Fv1\u002Fchat\u002Fcompletions`.\n- **Works out of the box** – pulls model weights, configs, and tokenizer from whichever of Hugging Face or ModelScope responds fastest on first run.\n- **Optimised for Apple Silicon** – optional Metal backend with FP16 execution for 
real-time OCR on laptops.\n- **CUDA (alpha)** – experimental support via `--features cuda` + `--device cuda --dtype f16`; expect rough edges while we finish kernel coverage.\n- **Intel MKL (preview)** – faster BLAS on x86 via `--features mkl` (install Intel oneMKL beforehand).\n- **OpenAI client compatibility** – drop-in replacement for popular SDKs; the server automatically collapses chat history to the latest user turn for OCR-friendly prompts.\n\n## Model Matrix 📦\n\nThe workspace exposes three base model IDs plus DSQ-quantized variants for DeepSeek‑OCR, PaddleOCR‑VL, and DotsOCR:\n\n| Model ID | Base Model | Precision | Suggested Use Case |\n| --- | --- | --- | --- |\n| `deepseek-ocr` | `deepseek-ocr` | FP16 (select via `--dtype`) | Full-fidelity DeepSeek‑OCR stack with SAM+CLIP + MoE decoder; use when you prioritise quality on capable Metal\u002FCUDA\u002FCPU hosts. |\n| `deepseek-ocr-q4k` | `deepseek-ocr` | `Q4_K` | Tight VRAM, local deployments, and batch jobs that still want DeepSeek’s SAM+CLIP pipeline. |\n| `deepseek-ocr-q6k` | `deepseek-ocr` | `Q6_K` | Day‑to‑day balance of quality and size on mid‑range GPUs. |\n| `deepseek-ocr-q8k` | `deepseek-ocr` | `Q8_0` | Stay close to full‑precision quality with manageable memory savings. |\n| `paddleocr-vl` | `paddleocr-vl` | FP16 (select via `--dtype`) | Default choice for lighter hardware; 0.9B Ernie + SigLIP tower with strong doc\u002Ftable OCR and low latency. |\n| `paddleocr-vl-q4k` | `paddleocr-vl` | `Q4_K` | Heavily compressed doc\u002Ftable deployments with aggressive memory budgets. |\n| `paddleocr-vl-q6k` | `paddleocr-vl` | `Q6_K` | Common engineering setups; blends accuracy and footprint. |\n| `paddleocr-vl-q8k` | `paddleocr-vl` | `Q8_0` | Accuracy‑leaning deployments that still want a smaller footprint than FP16. 
|\n| `dots-ocr` | `dots-ocr` | FP16 \u002F BF16 (via `--dtype`) | DotsVision + Qwen2 VLM for high‑precision layout, reading order, grounding, and multilingual docs; expect high memory (30–50GB on large pages). |\n| `dots-ocr-q4k` | `dots-ocr` | `Q4_K` | Sidecar DSQ snapshot over the DotsOCR baseline; reduces weight memory\u002Fcompute while keeping the heavy vision token profile unchanged. |\n| `dots-ocr-q6k` | `dots-ocr` | `Q6_K` | Recommended balance of size and quality when you already accept DotsOCR’s memory footprint but want cheaper weights. |\n| `dots-ocr-q8k` | `dots-ocr` | `Q8_0` | Accuracy‑leaning DotsOCR deployment that stays close to FP16\u002FBF16 quality with modest memory savings. |\n\n\n## Quick Start 🏁\n\n### Prerequisites\n\n- Rust 1.85+ (the first stable release with edition 2024 support)\n- Git\n- Optional: Apple Silicon running macOS 13+ for Metal acceleration\n- Optional: CUDA 12.2+ toolkit + driver for experimental NVIDIA GPU acceleration on Linux\u002FWindows\n- Optional: Intel oneAPI MKL for preview x86 acceleration (see below)\n- (Recommended) Hugging Face account with `HF_TOKEN` when pulling from the `deepseek-ai\u002FDeepSeek-OCR` repo (ModelScope is used automatically when it’s faster\u002Freachable).\n\n### Clone the Workspace\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FTimmyOVO\u002Fdeepseek-ocr.rs.git\ncd deepseek-ocr.rs\ncargo fetch\n```\n\n### Model Assets\n\nThe first invocation of the CLI or server downloads the config, tokenizer, and `model-00001-of-000001.safetensors` (~6.3GB) into `DeepSeek-OCR\u002F`. To prefetch manually:\n\n```bash\ncargo run -p deepseek-ocr-cli --release -- --help # dev profile is extremely slow; always prefer --release\n```\n\n> Always include `--release` when running from source; debug builds on this model are extremely slow.\n\nSet `HF_HOME`\u002F`HF_TOKEN` if you store Hugging Face caches elsewhere (ModelScope downloads land alongside the same asset tree). 
The full model package is ~6.3GB on disk and typically requires ~13GB of RAM headroom during inference (model + activations).\n\n## Configuration & Overrides 🗂️\n\nThe CLI and server share the same configuration. On first launch we create a `config.toml` populated with defaults; later runs reuse it so both entrypoints stay in sync.\n\n| Platform | Config file (default) | Model cache root |\n| --- | --- | --- |\n| Linux | `~\u002F.config\u002Fdeepseek-ocr\u002Fconfig.toml` | `~\u002F.cache\u002Fdeepseek-ocr\u002Fmodels\u002F\u003Cid>\u002F…` |\n| macOS | `~\u002FLibrary\u002FApplication Support\u002Fdeepseek-ocr\u002Fconfig.toml` | `~\u002FLibrary\u002FCaches\u002Fdeepseek-ocr\u002Fmodels\u002F\u003Cid>\u002F…` |\n| Windows | `%APPDATA%\\deepseek-ocr\\config.toml` | `%LOCALAPPDATA%\\deepseek-ocr\\models\\\u003Cid>\\…` |\n\n- Override the location with `--config \u002Fpath\u002Fto\u002Fconfig.toml` (available on both CLI and server). Missing files are created automatically.\n- Each `[models.entries.\"\u003Cid>\"]` record can point to custom `config`, `tokenizer`, or `weights` files. When omitted we fall back to the cache directory above and download\u002Fupdate assets as required.\n- Runtime values resolve in this order: command-line flags → values stored in `config.toml` → built-in defaults. 
The HTTP API adds a final layer where request payload fields (for example `max_tokens`) override everything else for that call.\n\nThe generated file starts with the defaults below; adjust them to persistently change behaviour:\n\n```toml\n[models]\nactive = \"deepseek-ocr\"\n\n[models.entries.deepseek-ocr]\n\n[inference]\ndevice = \"cpu\"\ntemplate = \"plain\"\nbase_size = 1024\nimage_size = 640\ncrop_mode = true\nmax_new_tokens = 512\nuse_cache = true\n\n[server]\nhost = \"0.0.0.0\"\nport = 8000\n```\n\n- `[models]` picks the active model and lets you add more entries (each entry can point to its own config\u002Ftokenizer\u002Fweights).\n- `[inference]` controls notebook-friendly defaults shared by the CLI and server (device, template, vision sizing, decoding budget, cache usage).\n- `[server]` sets the network binding and the model identifier reported by `\u002Fv1\u002Fmodels`.\n\nSee `crates\u002Fcli\u002FREADME.md` and `crates\u002Fserver\u002FREADME.md` for concise override tables.\n\n## Benchmark Snapshot 📊\n\nSingle-request Rust CLI (Accelerate backend on macOS) compared with the reference Python pipeline on the same prompt and image:\n\n| Stage                                             | ref total (ms) | ref avg (ms) | python total | python\u002Fref |\n|---------------------------------------------------|----------------|--------------|--------------|------------|\n| Decode – Overall (`decode.generate`)              | 30077.840      | 30077.840    | 56554.873    | 1.88x      |\n| Decode – Token Loop (`decode.iterative`)          | 26930.216      | 26930.216    | 39227.974    | 1.46x      |\n| Decode – Prompt Prefill (`decode.prefill`)        | 3147.337       | 3147.337     | 5759.684     | 1.83x      |\n| Prompt – Build Tokens (`prompt.build_tokens`)     | 0.466          | 0.466        | 45.434       | 97.42x     |\n| Prompt – Render Template (`prompt.render`)        | 0.005          | 0.005        | 0.019        | 3.52x      |\n| Vision – Embed Images 
(`vision.compute_embeddings`)| 6391.435      | 6391.435     | 3953.459     | 0.62x      |\n| Vision – Prepare Inputs (`vision.prepare_inputs`) | 62.524         | 62.524       | 45.438       | 0.73x      |\n\n## Command-Line Interface 🖥️\n\nBuild and run directly from the workspace:\n\n```bash\ncargo run -p deepseek-ocr-cli --release -- \\\n  --prompt \"\u003Cimage>\\n\u003C|grounding|>Convert this receipt to markdown.\" \\\n  --image baselines\u002Fsample\u002Fimages\u002Ftest.png \\\n  --device cpu --max-new-tokens 512\n```\n\n> Tip: `--release` is required for reasonable throughput; debug builds can be 10x slower.\n\n> macOS tip: append `--features metal` to the `cargo run`\u002F`cargo build` commands to compile with Accelerate + Metal backends.\n>\n> CUDA tip (Linux\u002FWindows): append `--features cuda` and run with `--device cuda --dtype f16` to target NVIDIA GPUs—feature is still alpha, so be ready for quirks.\n>\n> Intel MKL preview: install Intel oneMKL, then build with `--features mkl` for faster CPU matmuls on x86.\n\nInstall the CLI as a binary:\n\n```bash\ncargo install --path crates\u002Fcli\ndeepseek-ocr-cli --help\n```\n\nKey flags:\n\n- `--prompt` \u002F `--prompt-file`: text with `\u003Cimage>` slots\n- `--image`: path(s) matching `\u003Cimage>` placeholders\n- `--device` and `--dtype`: choose `metal` + `f16` on Apple Silicon or `cuda` + `f16` on NVIDIA GPUs\n- `--max-new-tokens`: decoding budget\n- Sampling controls: `--do-sample`, `--temperature`, `--top-p`, `--top-k`, `--repetition-penalty`, `--no-repeat-ngram-size`, `--seed`\n  - By default decoding stays deterministic (`do_sample=false`, `temperature=0.0`, `no_repeat_ngram_size=20`)\n  - To use stochastic sampling set `--do-sample true --temperature 0.8` (and optionally adjust the other knobs)\n\n### Switching Models\n\nThe autogenerated `config.toml` now lists three entries:\n\n- `deepseek-ocr` (default) – the original DeepSeek vision-language stack.\n- `paddleocr-vl` – the PaddleOCR-VL 0.9B 
SigLIP + Ernie release.\n- `dots-ocr` – the Candle port of dots.ocr with DotsVision + Qwen2 (use BF16 on Metal\u002FCUDA if possible; see the release matrix for memory notes).\n\nPick which one to load via `--model`:\n\n```bash\ndeepseek-ocr-cli --model paddleocr-vl --prompt \"\u003Cimage> Summarise\"\n```\n\nThe CLI (and server) will download the matching config\u002Ftokenizer\u002Fweights from the appropriate repository (`deepseek-ai\u002FDeepSeek-OCR`, `PaddlePaddle\u002FPaddleOCR-VL`, or `dots-ocr`) into your cache on first use. You can still override paths with `--model-config`, `--tokenizer`, or `--weights` if you maintain local fine-tunes.\n\n## HTTP Server ☁️\n\nLaunch an OpenAI-compatible endpoint:\n\n```bash\ncargo run -p deepseek-ocr-server --release -- \\\n  --host 0.0.0.0 --port 8000 \\\n  --device cpu --max-new-tokens 512\n```\n\n> Keep `--release` on the server as well; the debug profile is far too slow for inference workloads.\n> macOS tip: add `--features metal` to the `cargo run -p deepseek-ocr-server` command when you want the server binary to link against Accelerate + Metal (and pair it with `--device metal` at runtime).\n>\n> CUDA tip: add `--features cuda` and start the server with `--device cuda --dtype f16` to offload inference to NVIDIA GPUs (alpha-quality support).\n>\n> Intel MKL preview: install Intel oneMKL before building with `--features mkl` to accelerate CPU workloads on x86.\n\nNotes:\n\n- Use `data:` URLs or remote `http(s)` links; local paths are rejected.\n- The server collapses multi-turn chat inputs to the latest user message to keep prompts OCR-friendly.\n- Works out of the box with tools such as [Open WebUI](https:\u002F\u002Fgithub.com\u002Fopen-webui\u002Fopen-webui) or any OpenAI-compatible client—just point the base URL to your server (`http:\u002F\u002Flocalhost:8000\u002Fv1`) and select either the `deepseek-ocr` or `paddleocr-vl` model ID exposed in `\u002Fv1\u002Fmodels`.\n- Adjust the request body limit with Rocket 
config if you routinely send large images.\n\n![Open WebUI connected to deepseek-ocr.rs](.\u002Fassets\u002Fsample_1.png)\n\n## GPU Acceleration ⚡\n\n- **Metal (macOS 13+ Apple Silicon)** – pass `--device metal --dtype f16` and build binaries with `--features metal` so Candle links against Accelerate + Metal.\n- **CUDA (alpha, NVIDIA GPUs)** – install CUDA 12.2+ toolkits, build with `--features cuda`, and launch the CLI\u002Fserver with `--device cuda --dtype f16`; still experimental.\n- **Intel MKL (preview)** – install Intel oneMKL and build with `--features mkl` to speed up CPU workloads on x86.\n- For either backend, prefer release builds (e.g. `cargo build --release -p deepseek-ocr-cli --features metal|cuda`) to maximise throughput.\n- Combine GPU runs with `--max-new-tokens` and crop tuning flags to balance latency vs. quality.\n\n## Repository Layout 🗂️\n\n- `crates\u002Fcore` – shared inference pipeline, model loaders, conversation templates.\n- `crates\u002Fcli` – command-line frontend (`deepseek-ocr-cli`).\n- `crates\u002Fserver` – Rocket server exposing OpenAI-compatible endpoints.\n- `crates\u002Fassets` – asset management (configuration, tokenizer, Hugging Face + ModelScope download helpers).\n- `baselines\u002F` – reference inputs and outputs for regression testing.\n\nDetailed CLI usage lives in [`crates\u002Fcli\u002FREADME.md`](crates\u002Fcli\u002FREADME.md). 
The server’s OpenAI-compatible interface is covered in [`crates\u002Fserver\u002FREADME.md`](crates\u002Fserver\u002FREADME.md).\n\n## Troubleshooting 🛠️\n\n- **Where do assets come from?** – downloads automatically pick between Hugging Face and ModelScope based on latency; the CLI prints the chosen source for each file.\n- **Slow first response** – model load and GPU warm-up (Metal\u002FCUDA alpha) happen on the initial request; later runs are faster.\n- **Large image rejection** – increase Rocket JSON limits in `crates\u002Fserver\u002Fsrc\u002Fmain.rs` or downscale the input.\n\n## Roadmap 🗺️\n\n- ✅ Apple Metal backend with FP16 support and CLI\u002Fserver parity on macOS.\n- ✅ NVIDIA CUDA backend (alpha) – build with `--features cuda`, run with `--device cuda --dtype f16` for Linux\u002FWindows GPUs; polishing in progress.\n- 🔄 **Parity polish** – finish projector normalisation + crop tiling alignment; extend intermediate-tensor diff suite beyond the current sample baseline.\n- 🔄 **Grounding & streaming** – port the Python post-processing helpers (box extraction, markdown polish) and refine SSE streaming ergonomics.\n- 🔄 **Cross-platform acceleration** – continue tuning CUDA kernels, add automatic device detection across CPU\u002FMetal\u002FCUDA, and publish opt-in GPU benchmarks.\n- 🔄 **Packaging & Ops** – ship binary releases with deterministic asset checksums, richer logging\u002Fmetrics, and Helm\u002Fdocker references for server deploys.\n- 🔜 **Structured outputs** – optional JSON schema tools for downstream automation once parity gaps close.\n\n## License 📄\n\nThis repository inherits the licenses of its dependencies and the upstream DeepSeek-OCR model. 
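Refer to `DeepSeek-OCR\u002FLICENSE` for model terms and apply the same restrictions to downstream use.\n","deepseek-ocr.rs is a multi-backend OCR\u002FVLM engine written in Rust. It supports DeepSeek-OCR-1\u002F2, PaddleOCR-VL, and DotsOCR, and integrates DSQ quantization together with an OpenAI-compatible server and command-line tool. Its core features include a choice of OCR models, a fast command-line interface, and HTTP serving, and it can be deployed locally on different hardware (CPU, Apple Metal, or NVIDIA CUDA GPUs). It is particularly suited to running high-accuracy document understanding tasks without a Python environment, for example enterprise document processing, academic research, or personal projects with demanding OCR requirements.","2026-06-11 03:45:09","high_star"]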