[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-74697":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":10,"archived":22,"fork":22,"defaultBranch":23,"hasWiki":24,"hasPages":22,"topics":25,"createdAt":10,"pushedAt":10,"updatedAt":36,"readmeContent":37,"aiSummary":38,"trendingCount":16,"starSnapshotCount":16,"syncStatus":39,"lastSyncTime":40,"discoverSource":41},74697,"pdf-inspector","firecrawl\u002Fpdf-inspector","firecrawl","Fast Rust library for PDF inspection, classification, and text extraction. Intelligently detects scanned vs text-based PDFs to enable smart routing decisions.","https:\u002F\u002Fwww.npmjs.com\u002Fpackage\u002F@firecrawl\u002Fpdf-inspector",null,"Rust",1450,135,6,4,0,64,86,443,192,19.4,false,"main",true,[26,27,28,29,30,31,32,33,34,35],"markdown","nodejs","ocr-routing","pdf","pdf-classification","pdf-extraction","pdf-parser","python","rust","text-extraction","2026-06-12 02:03:27","# pdf-inspector\n\nFast Rust library for PDF classification and text extraction. Detects whether a PDF is text-based or scanned, extracts text with position awareness, and converts to clean Markdown — all without OCR. Includes bindings for [Python](docs\u002Fpython.md) and [Node.js](napi\u002FREADME.md).\n\nBuilt by [Firecrawl](https:\u002F\u002Ffirecrawl.dev) to handle text-based PDFs locally in under 200ms, skipping expensive OCR services for the ~54% of PDFs that don't need them.\n\n## Features\n\n- **Smart classification** — Detect TextBased, Scanned, ImageBased, or Mixed PDFs in ~10-50ms by sampling content streams. 
Returns a confidence score (0.0-1.0) and per-page OCR routing.\n- **Text extraction** — Position-aware extraction with font info, X\u002FY coordinates, and automatic multi-column reading order.\n- **Markdown conversion** — Headings (H1-H4 via font size ratios), bullet\u002Fnumbered\u002Fletter lists, code blocks (monospace font detection), tables (rectangle-based and heuristic), bold\u002Fitalic formatting, URL linking, and page breaks.\n- **Table detection** — Dual-mode: rectangle-based detection from PDF drawing ops, plus heuristic detection from text alignment. Handles financial tables, footnotes, and continuation tables across pages.\n- **CID font support** — ToUnicode CMap decoding for Type0\u002FIdentity-H fonts, UTF-16BE, UTF-8, and Latin-1 encodings.\n- **Multi-column layout** — Automatic detection of newspaper-style columns, sequential reading order, and RTL text support.\n- **Encoding issue detection** — Automatically flags broken font encodings so callers can fall back to OCR.\n- **Single document load** — The document is parsed once and shared between detection and extraction, avoiding redundant I\u002FO.\n- **Lightweight** — Pure Rust, no ML models, no external services. Single dependency on `lopdf` for PDF parsing.\n\n## Benchmark\n\nEvaluated on the [opendataloader-bench](https:\u002F\u002Fgithub.com\u002Fopendataloader-project\u002Fopendataloader-bench) corpus (200 PDFs). Only direct text extraction engines are shown — no OCR, no ML models. 
Scores are 0-1, higher is better.\n\n| Engine | Overall | Reading Order (NID) | Tables (TEDS) | Headings (MHS) | Speed (200 docs) |\n|---|---|---|---|---|---|\n| pdf-inspector | 0.78 | 0.87 | 0.59 | 0.57 | 4s |\n| opendataloader | 0.84 | 0.91 | 0.49 | 0.74 | 11s |\n| pymupdf4llm | 0.73 | 0.89 | 0.40 | 0.41 | 18s |\n| markitdown | 0.58 | 0.88 | 0.00 | 0.00 | 8s |\n\nFor context, engines that use OCR\u002FML (docling, marker, mineru) score 0.83-0.88 overall but take 2-180 minutes on the same corpus.\n\n**Where we do well:** Speed (fastest of all engines), reading order, table detection vs other direct-text tools.\n\n**Where we lag:** Heading detection trails opendataloader — many PDFs use bold text at body font size for headings, or headings that are only slightly larger than body text. Table detection trails OCR-based engines that can see visual table structure.\n\n## Quick start\n\n### Python\n\n```bash\npip install maturin\nmaturin develop --release\n```\n\n```python\nimport pdf_inspector\n\nresult = pdf_inspector.process_pdf(\"document.pdf\")\nprint(result.pdf_type)   # \"text_based\", \"scanned\", \"image_based\", \"mixed\"\nprint(result.markdown)   # Markdown string or None\n```\n\n> Full API reference: [docs\u002Fpython.md](docs\u002Fpython.md)\n\n### Node.js\n\n```bash\nnpm install @firecrawl\u002Fpdf-inspector\n```\n\n```javascript\nimport { readFileSync } from 'fs';\nimport { processPdf, classifyPdf } from '@firecrawl\u002Fpdf-inspector';\n\nconst result = processPdf(readFileSync('document.pdf'));\nconsole.log(result.pdfType);   \u002F\u002F \"TextBased\", \"Scanned\", \"ImageBased\", \"Mixed\"\nconsole.log(result.markdown);  \u002F\u002F Markdown string or null\n```\n\n> Full API reference: [napi\u002FREADME.md](napi\u002FREADME.md)\n\n### Rust\n\n```toml\n[dependencies]\npdf-inspector = { git = \"https:\u002F\u002Fgithub.com\u002Ffirecrawl\u002Fpdf-inspector\" }\n```\n\n```rust\nuse pdf_inspector::process_pdf;\n\nlet result = 
process_pdf(\"document.pdf\")?;\nprintln!(\"Type: {:?}\", result.pdf_type);\nif let Some(markdown) = &result.markdown {\n    println!(\"{}\", markdown);\n}\n```\n\n> Full API reference: [docs\u002Frust-api.md](docs\u002Frust-api.md)\n\n### CLI\n\n```bash\n# Convert PDF to Markdown\ncargo run --bin pdf2md -- document.pdf\n\n# JSON output (for piping)\ncargo run --bin pdf2md -- document.pdf --json\n\n# Raw markdown only (no headers)\ncargo run --bin pdf2md -- document.pdf --raw\n\n# Insert page break markers (\u003C!-- Page N -->)\ncargo run --bin pdf2md -- document.pdf --pages\n\n# Process only specific pages\ncargo run --bin pdf2md -- document.pdf --select-pages 1,3,5-10\n\n# Detection only (no extraction)\ncargo run --bin detect-pdf -- document.pdf\ncargo run --bin detect-pdf -- document.pdf --json\n\n# Detection + layout analysis (tables, columns)\ncargo run --bin detect-pdf -- document.pdf --analyze --json\n```\n\n## Architecture\n\n```\nPDF bytes\n  │\n  ├─► detector         → PdfType (TextBased \u002F Scanned \u002F ImageBased \u002F Mixed)\n  │\n  └─► extractor\n        ├─ fonts        → font widths, encodings\n        ├─ content_stream → walk PDF operators → TextItems + PdfRects\n        ├─ xobjects     → Form XObject text, image placeholders\n        ├─ links        → hyperlinks, AcroForm fields\n        └─ layout       → column detection → line grouping → reading order\n              │\n              ├─► tables\n              │     ├─ detect_rects      → rectangle-based tables (union-find)\n              │     ├─ detect_heuristic  → alignment-based tables\n              │     ├─ grid              → column\u002Frow assignment → cells\n              │     └─ format            → cells → Markdown table\n              │\n              └─► markdown\n                    ├─ analysis     → font stats, heading tiers\n                    ├─ preprocess   → merge headings, drop caps\n                    ├─ convert      → line loop + table\u002Fimage insertion\n         
           ├─ classify     → captions, lists, code\n                    └─ postprocess  → cleanup → final Markdown\n```\n\nThe document is loaded **once** via `load_document_from_path` \u002F `load_document_from_mem` and shared between the detection and extraction stages, so there's no redundant parsing.\n\n### Project structure\n\n```\nsrc\u002F\n  lib.rs                — Public API, PdfOptions builder, convenience functions\n  python.rs             — PyO3 Python bindings\n  types.rs              — Shared types: TextItem, TextLine, PdfRect, ItemType\n  text_utils.rs         — Character\u002Ftext helpers (CJK, RTL, ligatures, bold\u002Fitalic)\n  process_mode.rs       — ProcessMode enum (DetectOnly, Analyze, Full)\n  detector.rs           — Fast PDF type detection without full document load\n  glyph_names.rs        — Adobe Glyph List → Unicode mapping\n  tounicode.rs          — ToUnicode CMap parsing for CID-encoded text\n  extractor\u002F            — Text extraction pipeline\n  tables\u002F               — Table detection and formatting\n  markdown\u002F             — Markdown conversion and structure detection\n  bin\u002F                  — CLI tools (pdf2md, detect_pdf)\nnapi\u002F                   — Node.js\u002FBun bindings (napi-rs)\n```\n\n## How classification works\n\n1. Parse the xref table and page tree (no full object load)\n2. Select pages based on `ScanStrategy` (default: all pages with early exit)\n3. Look for `Tj`\u002F`TJ` (text operators) and `Do` (image operators) in content streams\n4. Classify based on text operator presence across sampled pages\n\nThis detects 300+ page PDFs in milliseconds. 
The result includes `pages_needing_ocr`, a list of specific page numbers that lack text, enabling per-page OCR routing instead of all-or-nothing.\n\n### Scan strategies\n\n| Strategy | Behavior | Best for |\n|---|---|---|\n| `EarlyExit` (default) | Scan all pages, stop on first non-text page | Pipelines routing TextBased PDFs to fast extraction |\n| `Full` | Scan all pages, no early exit | Accurate Mixed vs Scanned classification |\n| `Sample(n)` | Sample `n` evenly distributed pages (first, last, middle) | Very large PDFs where speed matters more than precision |\n| `Pages(vec)` | Only scan specific 1-indexed page numbers | When the caller knows which pages to check |\n\n## Markdown output\n\nThe converter handles:\n\n| Element | How it's detected |\n|---|---|\n| Headings (H1-H4) | Font size tiers relative to body text, with 0.5pt clustering |\n| Bold\u002Fitalic | Font name patterns (Bold, Italic, Oblique) |\n| Bullet lists | `•`, `-`, `*`, `○`, `●`, `◦` prefixes |\n| Numbered lists | `1.`, `1)`, `(1)` patterns |\n| Letter lists | `a.`, `a)`, `(a)` patterns |\n| Code blocks | Monospace fonts (Courier, Consolas, Monaco, Menlo, Fira Code, JetBrains Mono) and keyword detection |\n| Tables | Rectangle-based detection from PDF drawing ops + heuristic detection from text alignment |\n| Financial tables | Token splitting for consolidated numeric values |\n| Captions | \"Figure\", \"Table\", \"Source:\" prefix detection |\n| Sub\u002Fsuperscript | Font size and Y-offset relative to baseline |\n| URLs | Converted to Markdown links |\n| Hyphenation | Rejoins words broken across lines |\n| Page numbers | Filtered from output |\n| Drop caps | Large initial letters merged with following text |\n| Dot leaders | TOC-style dots collapsed to \" ... \" |\n\n## Use case: smart PDF routing\n\npdf-inspector was built for pipelines that process PDFs at scale. 
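Instead of sending every PDF through OCR:\n\n```\nPDF arrives\n  → pdf-inspector classifies it (~20ms)\n  → TextBased + high confidence?\n      YES → extract locally (~150ms), done\n      NO  → send to OCR service (2-10s)\n```\n\nThis saves cost and latency for the majority of PDFs that are already text-based (reports, papers, invoices, legal docs).\n\n## Debugging\n\nSee [docs\u002Fdebugging.md](docs\u002Fdebugging.md) for `RUST_LOG` environment variable usage.\n\n## License\n\nMIT\n","pdf-inspector is a fast Rust library for PDF classification, text extraction, and smart routing decisions. It distinguishes scanned PDFs from text-based ones and provides position-aware text extraction and Markdown conversion without relying on OCR. The project includes Python and Node.js bindings and suits applications that need to process large volumes of PDF documents efficiently and route them intelligently. Its core features include smart classification, accurate table detection, and multi-column layout recognition. It is particularly well suited to fast local processing of text-based PDFs, which saves the cost of OCR services.",2,"2026-06-11 03:50:28","high_star"]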