[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-72462":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":17,"stars30d":18,"stars90d":16,"forks30d":16,"starsTrendScore":19,"compositeScore":20,"rankGlobal":10,"rankLanguage":10,"license":21,"archived":22,"fork":22,"defaultBranch":23,"hasWiki":22,"hasPages":22,"topics":24,"createdAt":10,"pushedAt":10,"updatedAt":25,"readmeContent":26,"aiSummary":27,"trendingCount":16,"starSnapshotCount":16,"syncStatus":28,"lastSyncTime":29,"discoverSource":30},72462,"OCRFlux","chatdoc-com\u002FOCRFlux","chatdoc-com","OCRFlux is a lightweight yet powerful multimodal toolkit that significantly advances PDF-to-Markdown conversion, excelling in complex layout handling, complicated table parsing and cross-page content merging.","",null,"Python",2513,151,17,69,0,1,19,3,62.95,"Apache License 2.0",false,"main",[],"2026-06-12 04:01:05","\u003Cdiv align=\"center\">\n\u003Cimg src=\".\u002Fimages\u002FOCRFlux.png\" alt=\"OCRFlux Logo\" width=\"300\"\u002F>\n\u003Chr\u002F>\n\u003C\u002Fdiv>\n\u003Cp align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fchatdoc-com\u002FOCRFlux\u002Fblob\u002Fmain\u002FLICENSE\">\n    \u003Cimg alt=\"GitHub License\" src=\".\u002Fimages\u002Flicense.svg\" height=\"20\">\n  \u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fchatdoc-com\u002FOCRFlux\u002Freleases\">\n    \u003Cimg alt=\"GitHub release\" src=\".\u002Fimages\u002Frelease.svg\" height=\"20\">\n  \u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Focrflux.pdfparser.io\u002F\">\n    \u003Cimg alt=\"Demo\" src=\".\u002Fimages\u002Fdemo.svg\" height=\"20\">\n  \u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fdiscord.gg\u002FF33mhsAqqg\">\n    \u003Cimg alt=\"Discord\" 
src=\".\u002Fimages\u002Fdiscord.svg\" height=\"20\">\n  \u003C\u002Fa>\n\u003C\u002Fp>\n\nOCRFlux is a toolkit based on a multimodal large language model for converting PDFs and images into clean, readable, plain Markdown text. It aims to push the current state of the art to a significantly higher level.\n\n\nFunctions: **Whole file parsing**\n- On each page\n    - Converts content into text with a natural reading order, even in the presence of multi-column layouts, figures, and insets\n    - Supports complicated tables and equations\n    - Automatically removes headers and footers\n\n- Cross-page table\u002Fparagraph merging\n    - Cross-page table merging\n    - Cross-page paragraph merging\n\n\nKey features:\n- Superior parsing quality on each page\n\n    On our released benchmark [OCRFlux-bench-single](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FChatDOC\u002FOCRFlux-bench-single), it achieves an Edit Distance Similarity (EDS) of 0.967, which is 0.095 higher than [olmOCR-7B-0225-preview](https:\u002F\u002Fhuggingface.co\u002Fallenai\u002FolmOCR-7B-0225-preview) (0.872), 0.109 higher than [Nanonets-OCR-s](https:\u002F\u002Fhuggingface.co\u002Fnanonets\u002FNanonets-OCR-s) (0.858) and 0.187 higher than [MonkeyOCR](https:\u002F\u002Fhuggingface.co\u002Fecho840\u002FMonkeyOCR) (0.780).\n\n- Native support for cross-page table\u002Fparagraph merging (to the best of our knowledge, this is the first open-source project to support this feature).\n\n- Based on a 3B parameter VLM, so it can run even on an RTX 3090 GPU.\n\nRelease:\n- [OCRFlux-3B](https:\u002F\u002Fhuggingface.co\u002FChatDOC\u002FOCRFlux-3B) - 3B parameter VLM\n- Benchmark for evaluation\n    - [OCRFlux-bench-single](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FChatDOC\u002FOCRFlux-bench-single)\n    - [OCRFlux-pubtabnet-single](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FChatDOC\u002FOCRFlux-pubtabnet-single)\n    - 
[OCRFlux-bench-cross](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FChatDOC\u002FOCRFlux-bench-cross)\n    - [OCRFlux-pubtabnet-cross](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FChatDOC\u002FOCRFlux-pubtabnet-cross)\n\n\n### News\n - Jun 17, 2025 - v0.1.0 -  Initial public launch and demo.\n\n### Benchmark for single-page parsing\n\nWe ship two comprehensive benchmarks to help measure the performance of our OCR system in single-page parsing:\n\n  - [OCRFlux-bench-single](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FChatDOC\u002FOCRFlux-bench-single): Containing 2000 pdf pages (1000 English pages and 1000 Chinese pages) and their ground-truth Markdowns (manually labeled with multi-round check).\n\n  - [OCRFlux-pubtabnet-single](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FChatDOC\u002FOCRFlux-pubtabnet-single): Derived from the public [PubTabNet](https:\u002F\u002Fgithub.com\u002Fibm-aur-nlp\u002FPubTabNet) benchmark with some format transformation. It contains 9064 HTML table samples, which are split into simple tables and complex tables according to whether they have rowspan and colspan cells.\n\nWe emphasize that the released benchmarks are NOT included in our training and evaluation data. The following is the main result:\n\n\n1. 
In [OCRFlux-bench-single](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FChatDOC\u002FOCRFlux-bench-single), we calculated the Edit Distance Similarity (EDS) between the generated Markdowns and the ground-truth Markdowns as the metric.\n\n    \u003Ctable>\n      \u003Cthead>\n        \u003Ctr>\n          \u003Cth>Language\u003C\u002Fth>\n          \u003Cth>Model\u003C\u002Fth>\n          \u003Cth>Avg EDS ↑\u003C\u002Fth>\n        \u003C\u002Ftr>\n      \u003C\u002Fthead>\n      \u003Ctbody>\n        \u003Ctr>\n          \u003Ctd rowspan=\"4\">English\u003C\u002Ftd>\n          \u003Ctd>olmOCR-7B-0225-preview\u003C\u002Ftd>\n          \u003Ctd>0.885\u003C\u002Ftd>\n        \u003C\u002Ftr>\n        \u003Ctr>\n          \u003Ctd>Nanonets-OCR-s\u003C\u002Ftd>\n          \u003Ctd>0.870\u003C\u002Ftd>\n        \u003C\u002Ftr>\n        \u003Ctr>\n          \u003Ctd>MonkeyOCR\u003C\u002Ftd>\n          \u003Ctd>0.828\u003C\u002Ftd>\n        \u003C\u002Ftr>\n        \u003Ctr>\n          \u003Ctd>\u003Cstrong>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FChatDOC\u002FOCRFlux-3B\">OCRFlux-3B\u003C\u002Fa>\u003C\u002Fstrong>\u003C\u002Ftd>\n          \u003Ctd>0.971\u003C\u002Ftd>\n        \u003C\u002Ftr>\n        \u003Ctr>\n          \u003Ctd rowspan=\"4\">Chinese\u003C\u002Ftd>\n          \u003Ctd>olmOCR-7B-0225-preview\u003C\u002Ftd>\n          \u003Ctd>0.859\u003C\u002Ftd>\n        \u003C\u002Ftr>\n        \u003Ctr>\n          \u003Ctd>Nanonets-OCR-s\u003C\u002Ftd>\n          \u003Ctd>0.846\u003C\u002Ftd>\n        \u003C\u002Ftr>\n        \u003Ctr>\n          \u003Ctd>MonkeyOCR\u003C\u002Ftd>\n          \u003Ctd>0.731\u003C\u002Ftd>\n        \u003C\u002Ftr>\n        \u003Ctr>\n          \u003Ctd>\u003Cstrong>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FChatDOC\u002FOCRFlux-3B\">OCRFlux-3B\u003C\u002Fa>\u003C\u002Fstrong>\u003C\u002Ftd>\n          \u003Ctd>0.962\u003C\u002Ftd>\n        \u003C\u002Ftr>\n        \u003Ctr>\n          \u003Ctd 
rowspan=\"4\">Total\u003C\u002Ftd>\n          \u003Ctd>olmOCR-7B-0225-preview\u003C\u002Ftd>\n          \u003Ctd>0.872\u003C\u002Ftd>\n        \u003C\u002Ftr>\n        \u003Ctr>\n          \u003Ctd>Nanonets-OCR-s\u003C\u002Ftd>\n          \u003Ctd>0.858\u003C\u002Ftd>\n        \u003C\u002Ftr>\n        \u003Ctr>\n          \u003Ctd>MonkeyOCR\u003C\u002Ftd>\n          \u003Ctd>0.780\u003C\u002Ftd>\n        \u003C\u002Ftr>\n        \u003Ctr>\n          \u003Ctd>\u003Cstrong>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FChatDOC\u002FOCRFlux-3B\">OCRFlux-3B\u003C\u002Fa>\u003C\u002Fstrong>\u003C\u002Ftd>\n          \u003Ctd>0.967\u003C\u002Ftd>\n        \u003C\u002Ftr>\n      \u003C\u002Ftbody>\n    \u003C\u002Ftable>\n\n2. In [OCRFlux-pubtabnet-single](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FChatDOC\u002FOCRFlux-pubtabnet-single), we calculated the Tree Edit Distance-based Similarity (TEDS) between the generated HTML tables and the ground-truth HTML tables as the metric.\n    \u003Ctable>\n      \u003Cthead>\n        \u003Ctr>\n          \u003Cth>Type\u003C\u002Fth>\n          \u003Cth>Model\u003C\u002Fth>\n          \u003Cth>Avg TEDS ↑\u003C\u002Fth>\n        \u003C\u002Ftr>\n      \u003C\u002Fthead>\n      \u003Ctbody>\n        \u003Ctr>\n          \u003Ctd rowspan=\"4\">Simple\u003C\u002Ftd>\n          \u003Ctd>olmOCR-7B-0225-preview\u003C\u002Ftd>\n          \u003Ctd>0.810\u003C\u002Ftd>\n        \u003C\u002Ftr>\n        \u003Ctr>\n          \u003Ctd>Nanonets-OCR-s\u003C\u002Ftd>\n          \u003Ctd>0.882\u003C\u002Ftd>\n        \u003C\u002Ftr>\n        \u003Ctr>\n          \u003Ctd>MonkeyOCR\u003C\u002Ftd>\n          \u003Ctd>0.880\u003C\u002Ftd>\n        \u003C\u002Ftr>\n        \u003Ctr>\n          \u003Ctd>\u003Cstrong>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FChatDOC\u002FOCRFlux-3B\">OCRFlux-3B\u003C\u002Fa>\u003C\u002Fstrong>\u003C\u002Ftd>\n          \u003Ctd>0.912\u003C\u002Ftd>\n        \u003C\u002Ftr>\n        \u003Ctr>\n    
      \u003Ctd rowspan=\"4\">Complex\u003C\u002Ftd>\n          \u003Ctd>olmOCR-7B-0225-preview\u003C\u002Ftd>\n          \u003Ctd>0.676\u003C\u002Ftd>\n        \u003C\u002Ftr>\n        \u003Ctr>\n          \u003Ctd>Nanonets-OCR-s\u003C\u002Ftd>\n          \u003Ctd>0.772\u003C\u002Ftd>\n        \u003C\u002Ftr>\n        \u003Ctr>\n          \u003Ctd>\u003Cstrong>MonkeyOCR\u003C\u002Fstrong>\u003C\u002Ftd>\n          \u003Ctd>0.826\u003C\u002Ftd>\n        \u003C\u002Ftr>\n        \u003Ctr>\n          \u003Ctd>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FChatDOC\u002FOCRFlux-3B\">OCRFlux-3B\u003C\u002Fa>\u003C\u002Ftd>\n          \u003Ctd>0.807\u003C\u002Ftd>\n        \u003C\u002Ftr>\n        \u003Ctr>\n          \u003Ctd rowspan=\"4\">Total\u003C\u002Ftd>\n          \u003Ctd>olmOCR-7B-0225-preview\u003C\u002Ftd>\n          \u003Ctd>0.744\u003C\u002Ftd>\n        \u003C\u002Ftr>\n        \u003Ctr>\n          \u003Ctd>Nanonets-OCR-s\u003C\u002Ftd>\n          \u003Ctd>0.828\u003C\u002Ftd>\n        \u003C\u002Ftr>\n        \u003Ctr>\n          \u003Ctd>MonkeyOCR\u003C\u002Ftd>\n          \u003Ctd>0.853\u003C\u002Ftd>\n        \u003C\u002Ftr>\n        \u003Ctr>\n          \u003Ctd>\u003Cstrong>\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FChatDOC\u002FOCRFlux-3B\">OCRFlux-3B\u003C\u002Fa>\u003C\u002Fstrong>\u003C\u002Ftd>\n          \u003Ctd>0.861\u003C\u002Ftd>\n        \u003C\u002Ftr>\n      \u003C\u002Ftbody>\n    \u003C\u002Ftable>\n\nWe also conduct some case studies to show the superiority of our model in the [blog](https:\u002F\u002Focrflux.pdfparser.io\u002F#\u002Fblog) article.\n\n### Benchmark for cross-page table\u002Fparagraph merging\n\nPDF documents are typically paginated, which often results in tables or paragraphs being split across consecutive pages. Accurately detecting and merging such cross-page structures is crucial to avoid generating incomplete or fragmented content. 
\n\nThe detection task can be formulated as follows: given the Markdowns of two consecutive pages, each structured as a list of Markdown elements (e.g., paragraphs and tables), the goal is to identify the indexes of the elements that should be merged across the pages.\n\nFor the merging task, if the elements to be merged are paragraphs, we can simply concatenate them. Merging two table fragments is much more challenging. For example, a table spanning multiple pages may repeat the header from the first page on the second page. Another difficult scenario is a table cell with long content that spans multiple lines, where the first few lines appear on the previous page and the remaining lines continue on the next page. We also observe cases where tables with a large number of columns are split vertically and placed on two consecutive pages. More examples of cross-page tables can be found in our [blog](https:\u002F\u002Focrflux.pdfparser.io\u002F#\u002Fblog) article. To address these issues, we developed an LLM for cross-page table merging. Specifically, this model takes two split table fragments as input and generates a complete, well-structured table as output.\n\nWe ship two comprehensive benchmarks to help measure the performance of our OCR system in the cross-page table\u002Fparagraph detection and merging tasks, respectively:\n\n  - [OCRFlux-bench-cross](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FChatDOC\u002FOCRFlux-bench-cross): Contains 1000 samples (500 English samples and 500 Chinese samples). Each sample contains the Markdown element lists of two consecutive pages, along with the indexes of the elements that need to be merged (manually labeled through multiple rounds of review). 
If no tables or paragraphs require merging, the indexes in the annotation data are left empty.\n\n  - [OCRFlux-pubtabnet-cross](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FChatDOC\u002FOCRFlux-pubtabnet-cross): Contains 9064 pairs of split table fragments, along with their corresponding ground-truth merged versions.\n\nThe released benchmarks are NOT included in our training and evaluation data either. The following is the main result:\n\n1. In [OCRFlux-bench-cross](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FChatDOC\u002FOCRFlux-bench-cross), we calculated the Accuracy, Precision, Recall and F1 score as the metrics. Note that a detection result counts as correct only when the model both correctly judges whether any elements need to be merged across the two pages and outputs the correct indexes for them.\n\n    | Language | Precision ↑ | Recall ↑ | F1 ↑  | Accuracy ↑ |\n    |----------|-------------|----------|-------|------------|\n    | English  | 0.992       | 0.964    | 0.978 | 0.978      |\n    | Chinese  | 1.000       | 0.988    | 0.994 | 0.994      |\n    | Total    | 0.996       | 0.976    | 0.986 | 0.986      |\n\n2. 
In [OCRFlux-pubtabnet-cross](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FChatDOC\u002FOCRFlux-pubtabnet-cross), we calculate the Tree Edit Distance-based Similarity (TEDS) between the generated merged table and the ground-truth merged table as the metric.\n\n    | Table type | Avg TEDS ↑   |\n    |------------|--------------|\n    | Simple     | 0.965        |\n    | Complex    | 0.935        |\n    | Total      | 0.950        |\n\n### Installation\n\nRequirements:\n - Recent NVIDIA GPU (tested on RTX 3090, 4090, L40S, A100, H100) with at least 12 GB of GPU RAM\n - 20GB of free disk space\n\nYou will need to install poppler-utils and additional fonts for rendering PDF images.\n\nInstall dependencies (Ubuntu\u002FDebian)\n```bash\nsudo apt-get update\nsudo apt-get install poppler-utils poppler-data ttf-mscorefonts-installer msttcorefonts fonts-crosextra-caladea fonts-crosextra-carlito gsfonts lcdf-typetools\n```\n\nSet up a conda environment and install OCRFlux. The requirements for running OCRFlux\nare difficult to install in an existing python environment, so please do make a clean python environment to install into.\n```bash\nconda create -n ocrflux python=3.11\nconda activate ocrflux\n\ngit clone https:\u002F\u002Fgithub.com\u002Fchatdoc-com\u002FOCRFlux.git\ncd OCRFlux\n\npip install -e . --find-links https:\u002F\u002Fflashinfer.ai\u002Fwhl\u002Fcu124\u002Ftorch2.5\u002Fflashinfer\u002F\n```\n\n### Usage\n\nFor quick testing, try the [web demo](https:\u002F\u002F5f65ccdc2d4fd2f364.gradio.live). 
To run locally, a GPU is required, as inference is powered by [vllm](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fvllm) under the hood.\n\n\n#### Pipeline\n\n- For a pdf document:\n    ```bash\n    python -m ocrflux.pipeline .\u002Flocalworkspace --data test.pdf --model \u002Fmodel_dir\u002FOCRFlux-3B\n    ```\n\n- For an image:\n    ```bash\n    python -m ocrflux.pipeline .\u002Flocalworkspace --data test_page.png --model \u002Fmodel_dir\u002FOCRFlux-3B\n    ```\n\n- For a directory of pdfs or images:\n    ```bash\n    python -m ocrflux.pipeline .\u002Flocalworkspace --data test_pdf_dir\u002F* --model \u002Fmodel_dir\u002FOCRFlux-3B\n    ```\n\nNotes:\n- You can set `--skip_cross_page_merge` to skip cross-page merging and speed up parsing. The pipeline then simply concatenates the parsing results of each page to generate the final Markdown of the document.\n\n- You can set `--gpu_memory_utilization` to control GPU memory utilization, e.g. `--gpu_memory_utilization 0.9`. The default is 0.8.\n\n- We recommend running OCRFlux on a GPU with more than 24GB of VRAM. However, if you have multiple smaller GPUs (e.g., 12GB), you can set `--tensor_parallel_size N` to run it on N GPUs.\n\n- When using OCRFlux on GPUs that do not support `bf16`, such as the V100, you can set `--dtype float32` instead.\n\n\nResults will be stored as JSONL files in the `.\u002Flocalworkspace\u002Fresults` directory. 
\n\nEach line in the JSONL files is a JSON object with the following fields:\n\n```\n{\n    \"orig_path\": str,  # the path to the raw pdf or image file\n    \"num_pages\": int,  # the number of pages in the pdf file\n    \"document_text\": str, # the Markdown text of the converted pdf or image file\n    \"page_texts\": dict, # the Markdown texts of each page in the pdf file, the key is the page index and the value is the Markdown text of the page\n    \"fallback_pages\": [int], # the page indexes that are not converted successfully\n}\n```\n\nGenerate the final Markdown files by running the following command. The generated Markdown files will be in the `.\u002Flocalworkspace\u002Fmarkdowns\u002FDOCUMENT_NAME` directory.\n\n```bash\npython -m ocrflux.jsonl_to_markdown .\u002Flocalworkspace\n```\n\n\n#### Offline Inference\nYou can use the inference API to call OCRFlux directly in your code, without an online vllm server, as follows:\n\n```python\nfrom vllm import LLM\nfrom ocrflux.inference import parse\n\nfile_path = 'test.pdf'\n# file_path = 'test.png'\nllm = LLM(model=\"model_dir\u002FOCRFlux-3B\", gpu_memory_utilization=0.8, max_model_len=8192)\nresult = parse(llm, file_path)\nif result is not None:\n    document_markdown = result['document_text']\n    print(document_markdown)\n    # Save the converted Markdown next to the script\n    with open('test.md', 'w') as f:\n        f.write(document_markdown)\nelse:\n    print(\"Parse failed.\")\n```\n\nIf parsing fails or the result contains fallback pages, you can try setting the `max_page_retries` argument of the `parse` function to a positive integer to get a better result, at the cost of a longer inference time.\n\n#### Online Deployment\n\nRun the following command to start the server:\n\n```bash\nbash ocrflux\u002Fserver.sh \u002Fpath\u002Fto\u002Fmodel port\n```\n\nFor example:\n\n```bash\nbash ocrflux\u002Fserver.sh ChatDOC\u002FOCRFlux-3B 30024\n```\n\nThis starts a vllm server on port 30024. 
You can also start the server yourself with another serving framework such as `sglang`.\n\nAfter the server has started, you can use the `request` API to ask it to parse a pdf file or an image file, as follows:\n\n```python\nimport asyncio\nfrom argparse import Namespace\nfrom ocrflux.client import request\nargs = Namespace(\n    model=\"\u002Fpath\u002Fto\u002FOCRFlux-3B\",\n    skip_cross_page_merge=False,\n    max_page_retries=1,\n    url=\"http:\u002F\u002Flocalhost\",\n    port=30024,\n)\nfile_path = 'test.pdf'\n# file_path = 'test.png'\nresult = asyncio.run(request(args, file_path))\nif result is not None:\n    document_markdown = result['document_text']\n    print(document_markdown)\n    # Save the converted Markdown next to the script\n    with open('test.md', 'w') as f:\n        f.write(document_markdown)\nelse:\n    print(\"Parse failed.\")\n```\n\n\n#### Docker Usage\n\nRequirements:\n\n- Docker with GPU support [(NVIDIA Toolkit)](https:\u002F\u002Fdocs.nvidia.com\u002Fdatacenter\u002Fcloud-native\u002Fcontainer-toolkit\u002Flatest\u002Finstall-guide.html)\n- Pre-downloaded model: [OCRFlux-3B](https:\u002F\u002Fhuggingface.co\u002FChatDOC\u002FOCRFlux-3B)\n\nTo use OCRFlux in a docker container, first start the container with a command like the following:\n\n```bash\ndocker run -it --gpus all \\\n  -v \u002Fpath\u002Fto\u002Flocalworkspace:\u002Flocalworkspace \\\n  -v \u002Fpath\u002Fto\u002Ftest_pdf_dir:\u002Ftest_pdf_dir \\\n  -v \u002Fpath\u002Fto\u002FOCRFlux-3B:\u002FOCRFlux-3B \\\n  --entrypoint bash \\\n  chatdoc\u002Focrflux:latest\n```\n\nThen run the following command inside the container to parse document files:\n\n```bash\npython3.12 -m ocrflux.pipeline \u002Flocalworkspace\u002Focrflux_results --data \u002Ftest_pdf_dir\u002F*.pdf --model \u002FOCRFlux-3B\u002F\n```\n\nThe parsing results will be stored in the `\u002Flocalworkspace\u002Focrflux_results` directory. 
Run the following command to generate the final Markdown files:\n```bash\npython -m ocrflux.jsonl_to_markdown .\u002Flocalworkspace\u002Focrflux_results\n```\n\n### Full documentation for the pipeline\n\n```bash\npython -m ocrflux.pipeline --help\nusage: pipeline.py [-h] [--task {pdf2markdown,merge_pages,merge_tables}] [--data [DATA ...]] [--pages_per_group PAGES_PER_GROUP] [--max_page_retries MAX_PAGE_RETRIES]\n                   [--max_page_error_rate MAX_PAGE_ERROR_RATE] [--gpu_memory_utilization GPU_MEMORY_UTILIZATION] [--tensor_parallel_size TENSOR_PARALLEL_SIZE]\n                   [--dtype {auto,half,float16,float,bfloat16,float32}] [--workers WORKERS] [--model MODEL] [--model_max_context MODEL_MAX_CONTEXT] [--model_chat_template MODEL_CHAT_TEMPLATE]\n                   [--target_longest_image_dim TARGET_LONGEST_IMAGE_DIM] [--skip_cross_page_merge] [--port PORT]\n                   workspace\n\nManager for running millions of PDFs through a batch inference pipeline\n\npositional arguments:\n  workspace             The filesystem path where work will be stored, can be a local folder\n\noptions:\n  -h, --help            show this help message and exit\n  --task {pdf2markdown,merge_pages,merge_tables}\n                        task names, could be 'pdf2markdown', 'merge_pages' or 'merge_tables'\n  --data [DATA ...]     
List of paths to files to process\n  --pages_per_group PAGES_PER_GROUP\n                        Aiming for this many pdf pages per work item group\n  --max_page_retries MAX_PAGE_RETRIES\n                        Max number of times we will retry rendering a page\n  --max_page_error_rate MAX_PAGE_ERROR_RATE\n                        Rate of allowable failed pages in a document, 1\u002F250 by default\n  --gpu_memory_utilization GPU_MEMORY_UTILIZATION\n                        Fraction of GPU memory to use, default is 0.8\n  --tensor_parallel_size TENSOR_PARALLEL_SIZE\n                        Number of tensor parallel replicas\n  --dtype {auto,half,float16,float,bfloat16,float32}\n                        Data type for model weights and activations.\n  --workers WORKERS     Number of workers to run at a time\n  --model MODEL         The path to the model\n  --model_max_context MODEL_MAX_CONTEXT\n                        Maximum context length that the model was fine tuned under\n  --model_chat_template MODEL_CHAT_TEMPLATE\n                        Chat template to pass to vllm server\n  --target_longest_image_dim TARGET_LONGEST_IMAGE_DIM\n                        Dimension on longest side to use for rendering the pdf pages\n  --skip_cross_page_merge\n                        Whether to skip cross-page merging\n  --port PORT           Port to use for the VLLM server\n```\n\n## Code overview\n\nThere are some nice reusable pieces of the code that may be useful for your own projects:\n - Processing millions of PDFs through our released model using VLLM - [pipeline.py](https:\u002F\u002Fgithub.com\u002Fchatdoc-com\u002FOCRFlux\u002Fblob\u002Fmain\u002Focrflux\u002Fpipeline.py)\n - Generating final Markdowns from jsonl files - [jsonl_to_markdown.py](https:\u002F\u002Fgithub.com\u002Fchatdoc-com\u002FOCRFlux\u002Fblob\u002Fmain\u002Focrflux\u002Fjsonl_to_markdown.py)\n - Running offline inference using vllm - 
[inference.py](https:\u002F\u002Fgithub.com\u002Fchatdoc-com\u002FOCRFlux\u002Fblob\u002Fmain\u002Focrflux\u002Finference.py)\n - Launching a vllm server - [server.sh](https:\u002F\u002Fgithub.com\u002Fchatdoc-com\u002FOCRFlux\u002Fblob\u002Fmain\u002Focrflux\u002Fserver.sh)\n - Running online inference using vllm - [client.py](https:\u002F\u002Fgithub.com\u002Fchatdoc-com\u002FOCRFlux\u002Fblob\u002Fmain\u002Focrflux\u002Fclient.py)\n - Evaluating the model on the single-page parsing task - [eval_page_to_markdown.py](https:\u002F\u002Fgithub.com\u002Fchatdoc-com\u002FOCRFlux\u002Fblob\u002Fmain\u002Feval\u002Feval_page_to_markdown.py)\n - Evaluating the model on the table parsing task - [eval_table_to_html.py](https:\u002F\u002Fgithub.com\u002Fchatdoc-com\u002FOCRFlux\u002Fblob\u002Fmain\u002Feval\u002Feval_table_to_html.py)\n - Evaluating the model on the paragraphs\u002Ftables merging detection task - [eval_element_merge_detect.py](https:\u002F\u002Fgithub.com\u002Fchatdoc-com\u002FOCRFlux\u002Fblob\u002Fmain\u002Feval\u002Feval_element_merge_detect.py)\n - Evaluating the model on the table merging task - [eval_html_table_merge.py](https:\u002F\u002Fgithub.com\u002Fchatdoc-com\u002FOCRFlux\u002Fblob\u002Fmain\u002Feval\u002Feval_html_table_merge.py)\n\n\n## Team\n\n\u003C!-- start team -->\n\n**OCRFlux** is developed and maintained by the ChatDOC team, backed by [ChatDOC](https:\u002F\u002Fchatdoc.com\u002F).\n\n\u003C!-- end team -->\n\n## License\n\n\u003C!-- start license -->\n\n**OCRFlux** is licensed under [Apache 2.0](https:\u002F\u002Fwww.apache.org\u002Flicenses\u002FLICENSE-2.0).\nA full copy of the license can be found [on GitHub](https:\u002F\u002Fgithub.com\u002Fchatdoc-com\u002FOCRFlux\u002Fblob\u002Fmain\u002FLICENSE).\n\n\u003C!-- end license -->\n","OCRFlux is a lightweight yet powerful multimodal toolkit focused on converting PDFs and images into clean, readable Markdown text. Its core capabilities include support for complex layouts, complex table parsing and cross-page content merging; it handles multi-column layouts, figures and insets in a natural reading order and automatically removes headers and footers. In particular, OCRFlux 
surpasses existing baseline models in single-page parsing quality, with a notable gain in Edit Distance Similarity (EDS); it is also one of the first open-source projects to support cross-page table\u002Fparagraph merging. Based on a 3-billion-parameter vision-language model, it can run on GPUs such as the RTX 3090. It suits scenarios that need high-quality PDF-to-Markdown conversion, such as processing academic papers and technical documents.",2,"2026-06-11 03:42:08","high_star"]