[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-72570":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":25,"hasPages":23,"topics":26,"createdAt":10,"pushedAt":10,"updatedAt":27,"readmeContent":28,"aiSummary":29,"trendingCount":16,"starSnapshotCount":16,"syncStatus":30,"lastSyncTime":31,"discoverSource":32},72570,"DocLayout-YOLO","opendatalab\u002FDocLayout-YOLO","opendatalab","DocLayout-YOLO: Enhancing Document Layout Analysis through Diverse Synthetic Data and Global-to-Local Adaptive Perception","https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fopendatalab\u002FDocLayout-YOLO",null,"Python",2182,163,10,50,0,6,7,33,18,77.44,"GNU Affero General Public License v3.0",false,"main",true,[],"2026-06-12 04:01:06","\u003Cdiv align=\"center\">\n\nEnglish | [简体中文](.\u002FREADME-zh_CN.md)\n\n\n\u003Ch1>DocLayout-YOLO: Enhancing Document Layout Analysis through Diverse Synthetic Data and Global-to-Local Adaptive Perception\u003C\u002Fh1>\n\nOfficial PyTorch implementation of [DocLayout-YOLO](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.12628).\n\n[![arXiv](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2405.14458-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.12628) [![Online Demo](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%A4%97-Online%20Demo-yellow)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fopendatalab\u002FDocLayout-YOLO) [![Hugging Face 
Spaces](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%A4%97-Models%20and%20Data-yellow)](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Fjuliozhao\u002Fdoclayout-yolo-670cdec674913d9a6f77b542)\n\n\u003C\u002Fdiv>\n    \n## Abstract\n\n> We present DocLayout-YOLO, a real-time and robust layout detection model for diverse documents, based on YOLO-v10. This model is enriched with diversified document pre-training and structural optimization tailored for layout detection. In the pre-training phase, we introduce Mesh-candidate BestFit, viewing document synthesis as a two-dimensional bin packing problem, and create a large-scale diverse synthetic document dataset, DocSynth-300K. In terms of model structural optimization, we propose a module with Global-to-Local Controllability for precise detection of document elements across varying scales. \n\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002Fcomp.png\" width=52%>\n  \u003Cimg src=\"assets\u002Fradar.png\" width=44%> \u003Cbr>\n\u003C\u002Fp>\n\n## News 🚀🚀🚀\n\n**2024.10.25** 🎉🎉  **Mesh-candidate Bestfit** code is released. Mesh-candidate Bestfit is an automatic pipeline that can synthesize large-scale, high-quality, and visually appealing document layout detection datasets. 
Tutorial and example data are available [here](.\u002Fmesh-candidate_bestfit).\n\n**2024.10.23** 🎉🎉  **DocSynth300K dataset** is released on [🤗Huggingface](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fjuliozhao\u002FDocSynth300K). DocSynth300K is a large-scale and diverse document layout analysis pre-training dataset that can substantially boost model performance.\n\n**2024.10.21** 🎉🎉  **Online demo** is available on [🤗Huggingface](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fopendatalab\u002FDocLayout-YOLO).\n\n**2024.10.18** 🎉🎉  DocLayout-YOLO is implemented in **[PDF-Extract-Kit](https:\u002F\u002Fgithub.com\u002Fopendatalab\u002FPDF-Extract-Kit)** for document context extraction.\n\n**2024.10.16** 🎉🎉  **Paper** is now available on [ArXiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.12628).   \n\n\n## Quick Start\n\n[Online Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fopendatalab\u002FDocLayout-YOLO) is now available. For local development, follow the steps below:\n\n### 1. Environment Setup\n\nFollow these steps to set up your environment:\n\n```bash\nconda create -n doclayout_yolo python=3.10\nconda activate doclayout_yolo\npip install -e .\n```\n\n**Note:** If you only need the package for inference, you can simply install it via pip:\n\n```bash\npip install doclayout-yolo\n```\n\n### 2. 
Prediction\n\nYou can make predictions using either a script or the SDK:\n\n- **Script**\n\n  Run the following command to make a prediction using the script:\n\n  ```bash\n  python demo.py --model path\u002Fto\u002Fmodel --image-path path\u002Fto\u002Fimage\n  ```\n\n- **SDK**\n\n  Here is an example of how to use the SDK for prediction:\n\n  ```python\n  import cv2\n  from doclayout_yolo import YOLOv10\n\n  # Load the pre-trained model\n  model = YOLOv10(\"path\u002Fto\u002Fprovided\u002Fmodel\")\n\n  # Perform prediction\n  det_res = model.predict(\n      \"path\u002Fto\u002Fimage\",   # Image to predict\n      imgsz=1024,        # Prediction image size\n      conf=0.2,          # Confidence threshold\n      device=\"cuda:0\"    # Device to use (e.g., 'cuda:0' or 'cpu')\n  )\n\n  # Annotate and save the result\n  annotated_frame = det_res[0].plot(pil=True, line_width=5, font_size=20)\n  cv2.imwrite(\"result.jpg\", annotated_frame)\n  ```\n\n\nWe provide a model fine-tuned on **DocStructBench** for prediction, **which is capable of handling various document types**. 
The model can be downloaded from [here](https:\u002F\u002Fhuggingface.co\u002Fjuliozhao\u002FDocLayout-YOLO-DocStructBench\u002Ftree\u002Fmain) and example images can be found under ```assets\u002Fexample```.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002Fshowcase.png\" width=100%> \u003Cbr>\n\u003C\u002Fp>\n\n\n**Note:** For PDF content extraction, please refer to [PDF-Extract-Kit](https:\u002F\u002Fgithub.com\u002Fopendatalab\u002FPDF-Extract-Kit\u002Ftree\u002Fmain) and [MinerU](https:\u002F\u002Fgithub.com\u002Fopendatalab\u002FMinerU).\n\n**Note:** Thanks to [NielsRogge](https:\u002F\u002Fgithub.com\u002FNielsRogge), DocLayout-YOLO now supports loading directly from 🤗Huggingface. You can load the model as follows:\n\n```python\nfrom huggingface_hub import hf_hub_download\nfrom doclayout_yolo import YOLOv10\n\nfilepath = hf_hub_download(repo_id=\"juliozhao\u002FDocLayout-YOLO-DocStructBench\", filename=\"doclayout_yolo_docstructbench_imgsz1024.pt\")\nmodel = YOLOv10(filepath)\n```\n\nor load it directly using ```from_pretrained```:\n\n```python\nmodel = YOLOv10.from_pretrained(\"juliozhao\u002FDocLayout-YOLO-DocStructBench\")\n```\n\nMore details can be found in [this PR](https:\u002F\u002Fgithub.com\u002Fopendatalab\u002FDocLayout-YOLO\u002Fpull\u002F6).\n\n**Note:** Thanks to [luciaganlulu](https:\u002F\u002Fgithub.com\u002Fluciaganlulu), DocLayout-YOLO can perform batch inference and prediction. Instead of passing a single image into ```model.predict``` in ```demo.py```, pass a **list of image paths**. 
In addition, because batch inference is not implemented before ```YOLOv11```, you should manually change ```batch_size``` [here](doclayout_yolo\u002Fengine\u002Fmodel.py#L431).\n\n## DocSynth300K Dataset\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002Fdocsynth300k.png\" width=100%>\n\u003C\u002Fp>\n\n### Data Download\n\nUse the following command to download the dataset (about 113G):\n\n```python\nfrom huggingface_hub import snapshot_download\n# Download DocSynth300K\nsnapshot_download(repo_id=\"juliozhao\u002FDocSynth300K\", local_dir=\".\u002Fdocsynth300k-hf\", repo_type=\"dataset\")\n# If the download was disrupted and the file is not complete, you can resume the download\nsnapshot_download(repo_id=\"juliozhao\u002FDocSynth300K\", local_dir=\".\u002Fdocsynth300k-hf\", repo_type=\"dataset\", resume_download=True)\n```\n\n### Data Formatting & Pre-training\n\nIf you want to perform DocSynth300K pretraining, use ```format_docsynth300k.py``` to convert the original ```.parquet``` format into ```YOLO``` format. The converted data will be stored at ```.\u002Flayout_data\u002Fdocsynth300k```.\n\n```bash\npython format_docsynth300k.py\n```\n\nTo perform DocSynth300K pre-training, use this [command](assets\u002Fscript.sh#L2). By default, we use 8 GPUs for pretraining. To reach optimal performance, you can adjust hyper-parameters such as ```imgsz``` and ```lr``` according to your downstream fine-tuning data distribution or setting.\n\n**Note:** Due to a memory leak in the original YOLO data loading code, pretraining on a large-scale dataset may be interrupted unexpectedly. Use ```--pretrain last_checkpoint.pt --resume``` to resume the pretraining process.\n\n## Training and Evaluation on Public DLA Datasets\n\n### Data Preparation\n\n1. Specify the data root path\n\nFind your ultralytics config file (for Linux users, at ```$HOME\u002F.config\u002FUltralytics\u002Fsettings.yaml```) and change ```datasets_dir``` to the project root path.\n\n2. 
Download the prepared YOLO-format D4LA and DocLayNet data from the links below and put them under ```.\u002Flayout_data```:\n\n| Dataset | Download |\n|:--:|:--:|\n| D4LA | [link](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fjuliozhao\u002Fdoclayout-yolo-D4LA) |\n| DocLayNet | [link](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fjuliozhao\u002Fdoclayout-yolo-DocLayNet) |\n\nThe file structure is as follows:\n\n```bash\n.\u002Flayout_data\n├── D4LA\n│   ├── images\n│   ├── labels\n│   ├── test.txt\n│   └── train.txt\n└── doclaynet\n    ├── images\n    ├── labels\n    ├── val.txt\n    └── train.txt\n```\n\n### Training and Evaluation\n\nTraining is conducted on 8 GPUs with a global batch size of 64 (8 images per device). The detailed settings and checkpoints are as follows:\n\n| Dataset | Model | DocSynth300K Pretrained? | imgsz | Learning rate | Finetune | Evaluation | AP50 | mAP | Checkpoint |\n|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|\n| D4LA | DocLayout-YOLO | &cross; | 1600 | 0.04 | [command](assets\u002Fscript.sh#L5) | [command](assets\u002Fscript.sh#L11) | 81.7 | 69.8 | [checkpoint](https:\u002F\u002Fhuggingface.co\u002Fjuliozhao\u002FDocLayout-YOLO-D4LA-from_scratch) |\n| D4LA | DocLayout-YOLO | &check; | 1600 | 0.04 | [command](assets\u002Fscript.sh#L8) | [command](assets\u002Fscript.sh#L11) | 82.4 | 70.3 | [checkpoint](https:\u002F\u002Fhuggingface.co\u002Fjuliozhao\u002FDocLayout-YOLO-D4LA-Docsynth300K_pretrained) |\n| DocLayNet | DocLayout-YOLO | &cross; | 1120 | 0.02 | [command](assets\u002Fscript.sh#L14) | [command](assets\u002Fscript.sh#L20) | 93.0 | 77.7 | [checkpoint](https:\u002F\u002Fhuggingface.co\u002Fjuliozhao\u002FDocLayout-YOLO-DocLayNet-from_scratch) |\n| DocLayNet | DocLayout-YOLO | &check; | 1120 | 0.02 | [command](assets\u002Fscript.sh#L17) | [command](assets\u002Fscript.sh#L20) | 93.4 | 79.7 | [checkpoint](https:\u002F\u002Fhuggingface.co\u002Fjuliozhao\u002FDocLayout-YOLO-DocLayNet-Docsynth300K_pretrained) |\n\nThe DocSynth300K 
pretrained model can be downloaded from [here](https:\u002F\u002Fhuggingface.co\u002Fjuliozhao\u002FDocLayout-YOLO-DocSynth300K-pretrain). During evaluation, change ```checkpoint.pt``` to the path of the model to be evaluated.\n\n\n## Acknowledgement\n\nThe code base is built with [ultralytics](https:\u002F\u002Fgithub.com\u002Fultralytics\u002Fultralytics) and [YOLO-v10](https:\u002F\u002Fgithub.com\u002Flyuwenyu\u002FRT-DETR).\n\nThanks for their great work!\n\n## Star History\n\nIf you find our project useful, please add a \"star\" to the repo. It's exciting to us when we see your interest, which keeps us motivated to continue investing in the project!\n\n\u003Cpicture>\n  \u003Csource\n    media=\"(prefers-color-scheme: dark)\"\n    srcset=\"\n      https:\u002F\u002Fapi.star-history.com\u002Fsvg?repos=opendatalab\u002FDocLayout-YOLO&type=Date&theme=dark\n    \"\n  \u002F>\n  \u003Csource\n    media=\"(prefers-color-scheme: light)\"\n    srcset=\"\n      https:\u002F\u002Fapi.star-history.com\u002Fsvg?repos=opendatalab\u002FDocLayout-YOLO&type=Date\n    \"\n  \u002F>\n  \u003Cimg\n    alt=\"Star History Chart\"\n    src=\"https:\u002F\u002Fapi.star-history.com\u002Fsvg?repos=opendatalab\u002FDocLayout-YOLO&type=Date\"\n  \u002F>\n\u003C\u002Fpicture>\n\n## Citation\n\n```bibtex\n@misc{zhao2024doclayoutyoloenhancingdocumentlayout,\n      title={DocLayout-YOLO: Enhancing Document Layout Analysis through Diverse Synthetic Data and Global-to-Local Adaptive Perception}, \n      author={Zhiyuan Zhao and Hengrui Kang and Bin Wang and Conghui He},\n      year={2024},\n      eprint={2410.12628},\n      archivePrefix={arXiv},\n      primaryClass={cs.CV},\n      url={https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.12628}, \n}\n\n@article{wang2024mineru,\n  title={MinerU: An Open-Source Solution for Precise Document Content Extraction},\n  author={Wang, Bin and Xu, Chao and Zhao, Xiaomeng and Ouyang, Linke and Wu, Fan and Zhao, Zhiyuan and Xu, Rui and Liu, Kaiwen and Qu, Yuan and 
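Shang, Fukai and others},\n  journal={arXiv preprint arXiv:2409.18839},\n  year={2024}\n}\n\n```\n","DocLayout-YOLO is a document layout analysis model based on YOLO-v10 that aims to enhance document layout detection through diverse synthetic data and global-to-local adaptive perception. Its core features include using the Mesh-candidate BestFit method to generate DocSynth-300K, a large-scale and diverse synthetic document dataset, and introducing a module with global-to-local controllability into the model structure for precise detection across scales. It suits applications that need real-time, robust layout analysis across many document types, such as document information extraction and PDF content parsing.",2,"2026-06-11 03:42:37","high_star"]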