[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-72080":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":25,"hasPages":23,"topics":26,"createdAt":10,"pushedAt":10,"updatedAt":31,"readmeContent":32,"aiSummary":33,"trendingCount":16,"starSnapshotCount":16,"syncStatus":34,"lastSyncTime":35,"discoverSource":36},72080,"pdf-craft","oomol-lab\u002Fpdf-craft","oomol-lab","PDF craft can convert PDF files into various other formats. This project will focus on processing PDF files of scanned books.","https:\u002F\u002Finkora.oomol.com\u002F",null,"Python",5748,397,21,46,0,16,37,123,48,38.8,"MIT License",false,"main",true,[27,28,29,30],"deepseek-ocr","document","ocr","pdf","2026-06-12 02:02:58","\u003Cdiv align=center>\n  \u003Ch1>PDF Craft\u003C\u002Fh1>\n  \u003Cp>\n    \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Foomol-lab\u002Fpdf-craft\u002Factions\u002Fworkflows\u002Fmerge-build.yml\" target=\"_blank\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Factions\u002Fworkflow\u002Fstatus\u002Foomol-lab\u002Fpdf-craft\u002Fmerge-build.yml\" alt=\"ci\" \u002F>\u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fpypi.org\u002Fproject\u002Fpdf-craft\u002F\" target=\"_blank\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpip_install-pdf--craft-blue\" alt=\"pip install pdf-craft\" \u002F>\u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fpypi.org\u002Fproject\u002Fpdf-craft\u002F\" target=\"_blank\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fv\u002Fpdf-craft.svg\" alt=\"pypi pdf-craft\" 
\u002F>\u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fpypi.org\u002Fproject\u002Fpdf-craft\u002F\" target=\"_blank\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fpyversions\u002Fpdf-craft.svg\" alt=\"python versions\" \u002F>\u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fdeepwiki.com\u002Foomol-lab\u002Fpdf-craft\" target=\"_blank\">\u003Cimg src=\"https:\u002F\u002Fdeepwiki.com\u002Fbadge.svg\" alt=\"Ask DeepWiki\" \u002F>\u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Foomol-lab\u002Fpdf-craft\u002Fblob\u002Fmain\u002FLICENSE\" target=\"_blank\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Flicense\u002Foomol-lab\u002Fpdf-craft\" alt=\"license\" \u002F>\u003C\u002Fa>\n  \u003C\u002Fp>\n  \u003Cp>\u003Ca href=\"https:\u002F\u002Fhub.oomol.com\u002Fpackage\u002Fpdf-craft?open=true\" target=\"_blank\">\u003Cimg src=\"https:\u002F\u002Fstatic.oomol.com\u002Fassets\u002Fbutton.svg\" alt=\"Open in OOMOL Studio\" \u002F>\u003C\u002Fa>\u003C\u002Fp>\n  \u003Cp>English | \u003Ca href=\".\u002FREADME_zh-CN.md\">中文\u003C\u002Fa>\u003C\u002Fp>\n\u003C\u002Fdiv>\n\n## Introduction\n\npdf-craft converts PDF files into various other formats, with a focus on handling scanned book PDFs.\n\nThis project is based on [DeepSeek OCR](https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FDeepSeek-OCR) for document recognition. It supports the recognition of complex content such as tables and formulas. With GPU acceleration, pdf-craft can complete the entire conversion process from PDF to Markdown or EPUB locally. During the conversion, pdf-craft automatically identifies document structure, accurately extracts body text, and filters out interfering elements like headers and footers. For academic or technical documents containing footnotes, formulas, and tables, pdf-craft handles them properly, preserving these important elements (including images and other assets within footnotes). 
When converting to EPUB, the table of contents is automatically generated. The final Markdown or EPUB files maintain the content integrity and readability of the original book.\n\n## Lightweight and Fast\n\nStarting from the official v1.0.0 release, pdf-craft fully embraces [DeepSeek OCR](https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FDeepSeek-OCR) and no longer relies on LLM for text correction. This change brings significant performance improvements: the entire conversion process is completed locally without network requests, eliminating the long waits and occasional network failures of the old version.\n\nHowever, the new version has also removed the LLM text correction feature. If your use case still requires this functionality, you can continue using the old version [v0.2.8](https:\u002F\u002Fgithub.com\u002Foomol-lab\u002Fpdf-craft\u002Ftree\u002Fv0.2.8).\n\n### Online Version\n\nIf you'd like to explore pdf-craft without setting it up locally, you can try [Inkora - PDF Craft](https:\u002F\u002Finkora.oomol.com\u002Fpdf-craft\u002F), an online app built around the same PDF conversion workflow. It lets you upload PDF files and try the main experience directly in your browser.\n\n[![PDF Craft Online Version](docs\u002Fimages\u002Fwebsite-en.png)](https:\u002F\u002Finkora.oomol.com\u002Fpdf-craft\u002F)\n\n## Quick Start\n\n### Installation\n\n```bash\npip install torch torchvision --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcpu\npip install pdf-craft\n```\n\nThe above commands are for quick setup only. To actually use pdf-craft, you need to **install Poppler** for PDF parsing (required for all use cases) and **configure a CUDA environment** for OCR recognition (required for actual conversion). 
Please refer to the [Installation Guide](docs\u002FINSTALLATION.md) for detailed instructions.\n\n### Quick Start\n\n#### Convert to Markdown\n\n```python\nfrom pdf_craft import transform_markdown\n\ntransform_markdown(\n    pdf_path=\"input.pdf\",\n    markdown_path=\"output.md\",\n    markdown_assets_path=\"images\",\n)\n```\n\n![mdmd](https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Fd7082496-13b8-4728-9e79-44e2888e57fd)\n\n#### Convert to EPUB\n\n```python\nfrom pdf_craft import transform_epub, BookMeta\n\ntransform_epub(\n    pdf_path=\"input.pdf\",\n    epub_path=\"output.epub\",\n    book_meta=BookMeta(\n        title=\"Book Title\",\n        authors=[\"Author\"],\n    ),\n)\n```\n\n![20251218-162533](https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F7f6df04a-1fa7-48b3-aa5e-d2d056304ad6)\n\n## Detailed Usage\n\n### Convert to Markdown\n\n```python\nfrom pdf_craft import transform_markdown\n\ntransform_markdown(\n    pdf_path=\"input.pdf\",\n    markdown_path=\"output.md\",\n    markdown_assets_path=\"images\",\n    analysing_path=\"temp\",  # Optional: specify temporary folder\n    ocr_size=\"gundam\",  # Optional: tiny, small, base, large, gundam\n    models_cache_path=\"models\",  # Optional: model cache path\n    dpi=300,  # Optional: DPI for rendering PDF pages (default: 300)\n    max_page_image_file_size=None,  # Optional: max image file size in bytes, auto-adjust DPI if exceeded\n    includes_cover=False,  # Optional: include cover\n    includes_footnotes=True,  # Optional: include footnotes\n    ignore_pdf_errors=False,  # Optional: continue on PDF rendering errors\n    ignore_ocr_errors=False,  # Optional: continue on OCR recognition errors\n    generate_plot=False,  # Optional: generate visualization charts\n    toc_llm=None,  # Optional: LLM instance for enhanced TOC extraction\n    toc_assumed=False,  # Optional: whether to assume TOC pages exist (default: False)\n)\n```\n\n### Convert to EPUB\n\n```python\nfrom 
pdf_craft import transform_epub, BookMeta, TableRender, LaTeXRender\n\ntransform_epub(\n    pdf_path=\"input.pdf\",\n    epub_path=\"output.epub\",\n    analysing_path=\"temp\",  # Optional: specify temporary folder\n    ocr_size=\"gundam\",  # Optional: tiny, small, base, large, gundam\n    models_cache_path=\"models\",  # Optional: model cache path\n    dpi=300,  # Optional: DPI for rendering PDF pages (default: 300)\n    max_page_image_file_size=None,  # Optional: max image file size in bytes, auto-adjust DPI if exceeded\n    includes_cover=True,  # Optional: include cover\n    includes_footnotes=True,  # Optional: include footnotes\n    ignore_pdf_errors=False,  # Optional: continue on PDF rendering errors\n    ignore_ocr_errors=False,  # Optional: continue on OCR recognition errors\n    generate_plot=False,  # Optional: generate visualization charts\n    toc_llm=None,  # Optional: LLM instance for enhanced TOC extraction\n    toc_assumed=True,  # Optional: whether to assume TOC pages exist (default: True for EPUB)\n    book_meta=BookMeta(\n        title=\"Book Title\",\n        authors=[\"Author 1\", \"Author 2\"],\n        publisher=\"Publisher\",\n        language=\"en\",\n    ),\n    lan=\"en\",  # Optional: language (zh\u002Fen)\n    table_render=TableRender.HTML,  # Optional: table rendering method\n    latex_render=LaTeXRender.MATHML,  # Optional: formula rendering method\n    inline_latex=True,  # Optional: preserve inline LaTeX expressions\n)\n```\n\n### Model Management\n\npdf-craft depends on DeepSeek OCR models, which are automatically downloaded from Hugging Face on first run. 
You can control model storage and loading behavior through the `models_cache_path` and `local_only` parameters.\n\n#### Pre-download Models\n\nIn production environments, it is recommended to download models in advance to avoid downloading on first run:\n\n```python\nfrom pdf_craft import predownload_models\n\npredownload_models(\n    models_cache_path=\"models\",  # Specify model cache directory\n    revision=None,  # Optional: specify model version\n)\n```\n\n#### Specify Model Cache Path\n\nBy default, models are downloaded to the system's Hugging Face cache directory. You can customize the cache location through the `models_cache_path` parameter:\n\n```python\nfrom pdf_craft import transform_markdown\n\ntransform_markdown(\n    pdf_path=\"input.pdf\",\n    markdown_path=\"output.md\",\n    models_cache_path=\".\u002Fmy_models\",  # Custom model cache directory\n)\n```\n\n#### Offline Mode\n\nIf you have pre-downloaded the models, you can use `local_only=True` to disable network downloads and ensure only local models are used:\n\n```python\nfrom pdf_craft import transform_markdown\n\ntransform_markdown(\n    pdf_path=\"input.pdf\",\n    markdown_path=\"output.md\",\n    models_cache_path=\".\u002Fmy_models\",\n    local_only=True,  # Use local models only, do not download from network\n)\n```\n\n## API Reference\n\n### OCR Models\n\nThe `ocr_size` parameter accepts a `DeepSeekOCRSize` type:\n\n- `tiny` - Smallest model, fastest speed\n- `small` - Small model\n- `base` - Base model\n- `large` - Large model\n- `gundam` - Largest model, highest quality (default)\n\n### Table Rendering Methods\n\n- `TableRender.HTML` - HTML format (default)\n- `TableRender.CLIPPING` - Clipping format (directly clips table images from the original PDF scan)\n\n### Formula Rendering Methods\n\n- `LaTeXRender.MATHML` - MathML format (default)\n- `LaTeXRender.SVG` - SVG format\n- `LaTeXRender.CLIPPING` - Clipping format (directly clips formula images from the original PDF scan)\n\n### 
Inline LaTeX\n\nThe `inline_latex` parameter (EPUB only, default: `True`) controls whether to preserve inline LaTeX expressions in the output. When enabled, inline mathematical formulas are preserved as LaTeX code, which can be rendered by compatible EPUB readers.\n\n### Table of Contents Detection\n\nThe `toc_assumed` parameter controls how pdf-craft handles table of contents extraction:\n\n- `False` (default for Markdown): Assumes no TOC pages exist. The conversion generates TOC based on document headings only, without detecting or processing TOC pages.\n- `True` (default for EPUB): Assumes TOC pages exist. The conversion uses statistical analysis to detect TOC pages and extract chapter structure.\n\nFor books with complex chapter hierarchies, you can configure the optional `toc_llm` parameter to enable LLM-powered chapter title analysis, which provides more accurate TOC hierarchy detection.\n\n#### LLM-Enhanced TOC Extraction\n\nTo use LLM-enhanced TOC extraction, you need to configure an LLM instance:\n\n```python\nfrom pdf_craft import transform_epub, BookMeta, LLM\n\n# Configure LLM for TOC extraction\ntoc_llm = LLM(\n    key=\"your-api-key\",\n    url=\"https:\u002F\u002Fapi.openai.com\u002Fv1\",  # Or your LLM provider URL\n    model=\"gpt-4\",\n    token_encoding=\"cl100k_base\",\n    timeout=60.0,\n    retry_times=3,\n    retry_interval_seconds=5.0,\n)\n\ntransform_epub(\n    pdf_path=\"input.pdf\",\n    epub_path=\"output.epub\",\n    toc_assumed=True,  # Enable TOC detection\n    toc_llm=toc_llm,  # Enable LLM-powered chapter title analysis\n    book_meta=BookMeta(\n        title=\"Book Title\",\n        authors=[\"Author\"],\n    ),\n)\n```\n\n### Custom PDF Handler\n\nBy default, pdf-craft uses Poppler (via `pdf2image`) for PDF parsing and rendering. 
If Poppler is not in your system PATH, you can specify a custom path:\n\n```python\nfrom pdf_craft import transform_markdown, DefaultPDFHandler\n\n# Specify custom Poppler path\ntransform_markdown(\n    pdf_path=\"input.pdf\",\n    markdown_path=\"output.md\",\n    pdf_handler=DefaultPDFHandler(poppler_path=\"\u002Fpath\u002Fto\u002Fpoppler\u002Fbin\"),\n)\n```\n\nIf not specified, pdf-craft will use Poppler from your system PATH. For advanced use cases, you can also implement the `PDFHandler` protocol to use alternative PDF libraries.\n\n### Error Handling\n\nThe `ignore_pdf_errors` and `ignore_ocr_errors` parameters provide flexible error handling options. You can use them in two ways:\n\n**1. Boolean Mode** - Simple on\u002Foff control:\n\n```python\nfrom pdf_craft import transform_markdown\n\ntransform_markdown(\n    pdf_path=\"input.pdf\",\n    markdown_path=\"output.md\",\n    ignore_pdf_errors=True,  # Ignore all PDF rendering errors\n    ignore_ocr_errors=True,  # Ignore all OCR recognition errors\n)\n```\n\nWhen set to `True`, processing continues when errors occur on individual pages, inserting a placeholder message instead of stopping the entire conversion.\n\n**2. 
Custom Function Mode** - Fine-grained control:\n\n```python\nfrom pdf_craft import transform_markdown, OCRError, PDFError\n\ndef should_ignore_ocr_error(error: OCRError) -> bool:\n    # Only ignore specific types of OCR errors\n    return error.kind == \"recognition_failed\"\n\ndef should_ignore_pdf_error(error: PDFError) -> bool:\n    # Custom logic to decide which PDF errors to ignore\n    return \"timeout\" in str(error)\n\ntransform_markdown(\n    pdf_path=\"input.pdf\",\n    markdown_path=\"output.md\",\n    ignore_ocr_errors=should_ignore_ocr_error,  # Pass custom function\n    ignore_pdf_errors=should_ignore_pdf_error,  # Pass custom function\n)\n```\n\nThis allows you to implement custom logic for deciding which specific errors should be ignored during conversion.\n\n## Related Projects\n\n- [EPUB Translator](https:\u002F\u002Fgithub.com\u002Foomol-lab\u002Fepub-translator): If you want to translate the EPUB generated by PDF Craft into a bilingual edition, EPUB Translator preserves the original layout, images, and table of contents. See this [demo video](https:\u002F\u002Fwww.bilibili.com\u002Fvideo\u002FBV1tMQZY5EYY\u002F) for the full scanned PDF to bilingual EPUB workflow.\n- [SpineDigest](https:\u002F\u002Fgithub.com\u002Foomol-lab\u002Fspinedigest): If you want to distill the converted book into a structured digest, SpineDigest can turn EPUB or Markdown into summaries, chapter topology, and a knowledge graph.\n\n## License\n\nThis project is licensed under the MIT License. See the [LICENSE](.\u002FLICENSE) file for details.\n\nStarting from v1.0.0, pdf-craft has fully migrated to DeepSeek OCR (MIT license), removing the previous AGPL-3.0 dependency, allowing the entire project to be released under the more permissive MIT license. Note that pdf-craft has a transitive dependency on easydict (LGPLv3) via DeepSeek OCR. 
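Thanks to the community for their support and contributions!\n\n## Acknowledgments\n\n- [DeepSeekOCR](https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FDeepSeek-OCR)\n- [doc-page-extractor](https:\u002F\u002Fgithub.com\u002FMoskize91\u002Fdoc-page-extractor)\n- [pyahocorasick](https:\u002F\u002Fgithub.com\u002FWojciechMula\u002Fpyahocorasick)\n","PDF Craft is a tool focused on converting PDF files of scanned books into various other formats, such as Markdown and EPUB. The project is based on DeepSeek OCR, recognizes complex document content including tables and formulas, and supports GPU acceleration to speed up processing. It automatically identifies document structure, accurately extracts body text, and filters out interfering elements such as headers and footers, which makes it especially suitable for converting academic or technical documents that contain footnotes, formulas, and tables. The generated files preserve the content integrity and readability of the original book. In addition, starting from v1.0.0, pdf-craft no longer relies on an LLM for text correction and instead fully adopts DeepSeek OCR, resulting in a faster local conversion process.",2,"2026-06-11 03:40:16","high_star"]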