[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-71995":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":16,"stars7d":17,"stars30d":18,"stars90d":15,"forks30d":15,"starsTrendScore":17,"compositeScore":19,"rankGlobal":9,"rankLanguage":9,"license":9,"archived":20,"fork":20,"defaultBranch":21,"hasWiki":20,"hasPages":20,"topics":22,"createdAt":9,"pushedAt":9,"updatedAt":23,"readmeContent":24,"aiSummary":25,"trendingCount":15,"starSnapshotCount":15,"syncStatus":16,"lastSyncTime":26,"discoverSource":27},71995,"GOT-OCR2.0","Ucas-HaoranWei\u002FGOT-OCR2.0","Ucas-HaoranWei","Official code implementation of General OCR Theory:  Towards OCR-2.0 via a Unified End-to-end Model",null,"Python",8138,704,66,225,0,2,6,21,39.54,false,"main",[],"2026-06-12 02:02:57","\u003Ch3>\u003Ca href=\"\">General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model\u003C\u002Fa>\u003C\u002Fh3>\n\n\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fucaslcl\u002FGOT-OCR2_0\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FHuggingface-yellow\">\u003C\u002Fa>\n\u003Ca href=\"https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002Fstepfun-ai\u002FGOT-OCR2_0\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FModelscope-red\">\u003C\u002Fa>\n\u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.01704\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPaper-PDF-orange\">\u003C\u002Fa> \n\u003Ca href=\"https:\u002F\u002Fzhuanlan.zhihu.com\u002Fp\u002F718163422\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fzhihu-red\">\u003C\u002Fa> \n\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fucaslcl\u002FGOT_online\">\u003Cimg 
src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fdemo-green\">\u003C\u002Fa> \n\n[Haoran Wei*](https:\u002F\u002Fscholar.google.com\u002Fcitations?user=J4naK0MAAAAJ&hl=en), Chenglong Liu*, Jinyue Chen, Jia Wang, Lingyu Kong, Yanming Xu,  [Zheng Ge](https:\u002F\u002Fjoker316701882.github.io\u002F), Liang Zhao, [Jianjian Sun](https:\u002F\u002Fscholar.google.com\u002Fcitations?user=MVZrGkYAAAAJ&hl=en), [Yuang Peng](https:\u002F\u002Fyuangpeng.com), Chunrui Han, [Xiangyu Zhang](https:\u002F\u002Fscholar.google.com\u002Fcitations?user=yuB-cfoAAAAJ&hl=en)\n\n\u003Cp align=\"center\">\n\u003Cimg src=\"assets\u002Fgot_logo.png\" style=\"width: 200px\" align=center>\n\u003C\u002Fp>\n\n\n## Release\n- [2025\u002F2\u002F1] 🚀🚀🚀 GOT-OCR2.0 has been merged into [Huggingface-transformers](https:\u002F\u002Fhuggingface.co\u002Fstepfun-ai\u002FGOT-OCR-2.0-hf)\u002F[space](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fyonigozlan\u002FGOT-OCR-Transformers). It supports batched inference. Thanks to Huggingface MLE [Yoni](https:\u002F\u002Fgithub.com\u002Fyonigozlan).\n- [2024\u002F12\u002F24] 🔥🔥🔥 My new work on system-2 perception, [slow-perception](https:\u002F\u002Fgithub.com\u002FUcas-HaoranWei\u002FSlow-Perception), is released.\n- [2024\u002F12\u002F18] 🚀🚀🚀 GOT-OCR2.0 is supported in [PaddleMIX](https:\u002F\u002Fgithub.com\u002FPaddlePaddle\u002FPaddleMIX\u002Ftree\u002Fdevelop\u002Fpaddlemix\u002Fexamples\u002FGOT_OCR_2_0) by the Paddle team. 
Thanks to the Paddle team!\n- [2024\u002F12\u002F8] 🔥🔥🔥 Model downloads have exceeded 1M on [Huggingface](https:\u002F\u002Fhuggingface.co\u002Fstepfun-ai\u002FGOT-OCR2_0).\n- [2024\u002F12\u002F5] The seventh WeChat [group](https:\u002F\u002Fgithub.com\u002FUcas-HaoranWei\u002FGOT-OCR2.0\u002Fblob\u002Fmain\u002Fassets\u002FWechat7.jpg).\n- [2024\u002F11\u002F4] The sixth WeChat [group](https:\u002F\u002Fgithub.com\u002FUcas-HaoranWei\u002FGOT-OCR2.0\u002Fblob\u002Fmain\u002Fassets\u002Fwechat6-2.jpg).\n- [2024\u002F10\u002F24] The previous four WeChat groups are full, so we created a fifth [group](https:\u002F\u002Fgithub.com\u002FUcas-HaoranWei\u002FGOT-OCR2.0\u002Fblob\u002Fmain\u002Fassets\u002Fwechat5.png).\n- [2024\u002F10\u002F11] Too many friends want to join the WeChat group, so we created a fourth [group](https:\u002F\u002Fgithub.com\u002FUcas-HaoranWei\u002FGOT-OCR2.0\u002Fblob\u002Fmain\u002Fassets\u002Fwechat4.jpg).\n- [2024\u002F10\u002F2] [onnx](https:\u002F\u002Fgithub.com\u002FBaofengZan\u002FGOT-OCRv2-onnx) and [mnn](https:\u002F\u002Fgithub.com\u002FBaofengZan\u002Fmnn-llm-GOT-OCR2.0) versions of GOT-OCR2.0.\n- [2024\u002F9\u002F29]🔥🔥🔥 The community has implemented the first version of [llama_cpp_inference](https:\u002F\u002Fgithub.com\u002F1694439208\u002FGOT-OCR-Inference).\n- [2024\u002F9\u002F24]🔥🔥🔥 [ms-swift](https:\u002F\u002Fgithub.com\u002Fmodelscope\u002Fms-swift\u002Fissues\u002F2122) now supports quick [Fine-tune](#fine-tune) on your own data. \n- [2024\u002F9\u002F23]🔥🔥🔥 We release the official [Modelscope demo](https:\u002F\u002Fmodelscope.cn\u002Fstudios\u002Fstepfun-ai\u002FGOT_official_online_demo). Many thanks to Modelscope for providing the GPU resources.\n- [2024\u002F9\u002F19]🔥🔥🔥 GOT-OCR2.0 reaches #1 on Huggingface trending.\n- [2024\u002F9\u002F14]🔥🔥🔥 We release the official [demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fucaslcl\u002FGOT_online). Many thanks to Huggingface for providing the GPU resources. 
\n- [2024\u002F9\u002F13]🔥🔥🔥 We release the [Huggingface](https:\u002F\u002Fhuggingface.co\u002Fucaslcl\u002FGOT-OCR2_0) deployment. \n- [2024\u002F9\u002F03]🔥🔥🔥 We open-source the codes, weights, and benchmarks. The paper can be found in this [repo](https:\u002F\u002Fgithub.com\u002FUcas-HaoranWei\u002FGOT-OCR2.0\u002Fblob\u002Fmain\u002FGOT-OCR-2.0-paper.pdf). We also have submitted it to Arxiv. \n- [2024\u002F9\u002F03]🔥🔥🔥 We release the OCR-2.0 model GOT! \n\n\n[![Code License](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FCode%20License-Apache_2.0-green.svg)](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Fstanford_alpaca\u002Fblob\u002Fmain\u002FLICENSE)\n[![Data License](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FData%20License-CC%20By%20NC%204.0-red.svg)](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Fstanford_alpaca\u002Fblob\u002Fmain\u002FDATA_LICENSE)\n\n\n\n\n## Community contributions\nWe encourage everyone to develop GOT applications based on this repo. Thanks for the following contributions :\n\n[OpenVINO](https:\u002F\u002Fgithub.com\u002Fcan-gaa-hou\u002FGOT-OCR2.0-OpenVINO)~ contributor: [@can-gaa-hou](https:\u002F\u002Fgithub.com\u002Fcan-gaa-hou)\n\n[GGUF and Llama.cpp inference](https:\u002F\u002Fgithub.com\u002FMosRat\u002Fgot.cpp)~ contributor: [@MosRat](https:\u002F\u002Fgithub.com\u002FMosRat)\n\n[vllm reference](https:\u002F\u002Fgithub.com\u002Fliunian-Jay\u002FMU-GOT\u002Fblob\u002Fmaster\u002FPDF_parsing\u002FGOT\u002FGOT\u002Fmodel\u002Fmodeling_GOT_vllm.py) ~ contributor: [@Jay](https:\u002F\u002Fgithub.com\u002Fliunian-Jay)\n\n[onnx and mnn supports](https:\u002F\u002Fgithub.com\u002FBaofengZan\u002FGOT-OCRv2-onnx) ~ contributor: [@BaofengZan](https:\u002F\u002Fgithub.com\u002FBaofengZan)\n\n[llama_cpp inference](https:\u002F\u002Fgithub.com\u002F1694439208\u002FGOT-OCR-Inference) ~ contributor: [@1694439208](https:\u002F\u002Fgithub.com\u002F1694439208)\n\n[Colab of 
GOT](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1nmiNciZ5ugQVp4rFbL9ZWpEPd92Y9o7p?usp=sharing)   ~      contributor: [@Zizhe Wang](https:\u002F\u002Fgithub.com\u002FPaperPlaneDeemo)\n\n[CPU version of GOT](https:\u002F\u002Fgithub.com\u002FElvisClaros\u002FGOT-OCR2.0) ~ contributor: [@ElvisClaros](https:\u002F\u002Fgithub.com\u002FElvisClaros)\n\n[Online demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FTonic\u002FGOT-OCR) ~ contributor: [@Joseph Pollack](https:\u002F\u002Fhuggingface.co\u002FTonic)\n\n[Docker & client demo](https:\u002F\u002Fgithub.com\u002FQIN2DIM\u002FGOT-OCR2.0) ~ contributor: [@QIN2DIM](https:\u002F\u002Fgithub.com\u002FQIN2DIM) \n\n[GUI of GOT](https:\u002F\u002Fgithub.com\u002FXJF2332\u002FGOT-OCR-2-GUI) ~ contributor: [@XJF2332](https:\u002F\u002Fgithub.com\u002FXJF2332) \n\n## Contents\n- [Install](#install)\n- [GOT Weights](#got-weights)\n- [Benchmarks](#benchmarks)\n- [Demo](#demo)\n- [Train](#train)\n- [Fine-tune](#fine-tune)\n- [Eval](#eval)\n\n***\n\u003Cp align=\"center\">\n\u003Cimg src=\"assets\u002Fgot_support.jpg\" style=\"width: 800px\" align=center>\n\u003C\u002Fp>\n\u003Cp align=\"center\">\n\u003Ca href=\"\">Towards OCR-2.0 via a Unified End-to-end Model\u003C\u002Fa>       \n\u003C\u002Fp>\n\n***\n\n\n## Install\n0. Our environment is CUDA 11.8 + torch 2.0.1.\n1. Clone this repository and navigate to the GOT folder:\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FUcas-HaoranWei\u002FGOT-OCR2.0.git\ncd 'the GOT folder'\n```\n2. Install the package:\n```Shell\nconda create -n got python=3.10 -y\nconda activate got\npip install -e .\n```\n\n3. 
Install Flash-Attention\n```\npip install ninja\npip install flash-attn --no-build-isolation\n```\n## GOT Weights\n- [Huggingface](https:\u002F\u002Fhuggingface.co\u002Fucaslcl\u002FGOT-OCR2_0)\n- [Google Drive](https:\u002F\u002Fdrive.google.com\u002Fdrive\u002Ffolders\u002F1OdDtsJ8bFJYlNUzCQG4hRkUL6V-qBQaN?usp=sharing)\n- [BaiduYun](https:\u002F\u002Fpan.baidu.com\u002Fs\u002F1G4aArpCOt6I_trHv_1SE2g) code: OCR2\n\n## Benchmarks\n- [Google Drive](https:\u002F\u002Fdrive.google.com\u002Fdrive\u002Ffolders\u002F1OdDtsJ8bFJYlNUzCQG4hRkUL6V-qBQaN?usp=sharing)\n- [BaiduYun](https:\u002F\u002Fpan.baidu.com\u002Fs\u002F1G4aArpCOt6I_trHv_1SE2g) code: OCR2\n\n## Demo\n1. Plain-text OCR:\n```Shell\npython3 GOT\u002Fdemo\u002Frun_ocr_2.0.py  --model-name  \u002FGOT_weights\u002F  --image-file  \u002Fan\u002Fimage\u002Ffile.png  --type ocr\n```\n2. Formatted-text OCR:\n```Shell\npython3 GOT\u002Fdemo\u002Frun_ocr_2.0.py  --model-name  \u002FGOT_weights\u002F  --image-file  \u002Fan\u002Fimage\u002Ffile.png  --type format\n```\n3. Fine-grained OCR (by box or by color):\n```Shell\npython3 GOT\u002Fdemo\u002Frun_ocr_2.0.py  --model-name  \u002FGOT_weights\u002F  --image-file  \u002Fan\u002Fimage\u002Ffile.png  --type format\u002Focr --box [x1,y1,x2,y2]\n```\n```Shell\npython3 GOT\u002Fdemo\u002Frun_ocr_2.0.py  --model-name  \u002FGOT_weights\u002F  --image-file  \u002Fan\u002Fimage\u002Ffile.png  --type format\u002Focr --color red\u002Fgreen\u002Fblue\n```\n4. Multi-crop OCR:\n```Shell\npython3 GOT\u002Fdemo\u002Frun_ocr_2.0_crop.py  --model-name  \u002FGOT_weights\u002F --image-file  \u002Fan\u002Fimage\u002Ffile.png \n```\n5. Multi-page OCR (the image path contains multiple .png files). **Note**: This feature is not batch inference; it works at the token level. Please read the paper to use it correctly:\n```Shell\npython3 GOT\u002Fdemo\u002Frun_ocr_2.0_crop.py  --model-name  \u002FGOT_weights\u002F --image-file  \u002Fimages\u002Fpath\u002F  --multi-page\n```\n6. 
Render the formatted OCR results:\n```Shell\npython3 GOT\u002Fdemo\u002Frun_ocr_2.0.py  --model-name  \u002FGOT_weights\u002F  --image-file  \u002Fan\u002Fimage\u002Ffile.png  --type format --render\n```\n**Note**:\nThe rendering results are written to \u002Fresults\u002Fdemo.html. Open demo.html to view them.\n\n\n## Train\n0. A training sample can be found [here](https:\u002F\u002Fgithub.com\u002FUcas-HaoranWei\u002FGOT-OCR2.0\u002Fblob\u002Fmain\u002Fassets\u002Ftrain_sample.jpg). Note that the '\\\u003Cimage>' in the 'conversations'-'human'-'value' field is required!\n1. This codebase only supports post-training (stage-2\u002Fstage-3) on top of our GOT weights.\n2. If you want to train from stage-1 as described in our paper, you need this [repo](https:\u002F\u002Fgithub.com\u002FUcas-HaoranWei\u002FVary-tiny-600k).\n\n```Shell\ndeepspeed   \u002FGOT-OCR-2.0-master\u002FGOT\u002Ftrain\u002Ftrain_GOT.py \\\n --deepspeed \u002FGOT-OCR-2.0-master\u002Fzero_config\u002Fzero2.json    --model_name_or_path \u002FGOT_weights\u002F \\\n --use_im_start_end True   \\\n --bf16 True   \\\n --gradient_accumulation_steps 2    \\\n --evaluation_strategy \"no\"   \\\n --save_strategy \"steps\"  \\\n --save_steps 200   \\\n --save_total_limit 1   \\\n --weight_decay 0.    \\\n --warmup_ratio 0.001     \\\n --lr_scheduler_type \"cosine\"    \\\n --logging_steps 1    \\\n --tf32 True     \\\n --model_max_length 8192    \\\n --gradient_checkpointing True   \\\n --dataloader_num_workers 8    \\\n --report_to none  \\\n --per_device_train_batch_size 2    \\\n --num_train_epochs 1  \\\n --learning_rate 2e-5   \\\n --datasets pdf-ocr+scence \\\n --output_dir \u002Fyour\u002Foutput\u002Fpath\n```\n\n\n**Note**:\n1. Change the corresponding data information in [constant.py](https:\u002F\u002Fgithub.com\u002FUcas-HaoranWei\u002FGOT-OCR2.0\u002Ftree\u002Fmain\u002FGOT-OCR-2.0-master\u002FGOT\u002Futils).\n2. 
Change line 37 in [conversation_dataset_qwen.py](https:\u002F\u002Fgithub.com\u002FUcas-HaoranWei\u002FGOT-OCR2.0\u002Ftree\u002Fmain\u002FGOT-OCR-2.0-master\u002FGOT\u002Fdata) to your data_name.\n\n## Fine-tune\nQuick Fine-tune with ms-swift:\n\n```Shell\ngit clone https:\u002F\u002Fgithub.com\u002Fmodelscope\u002Fms-swift.git\ncd ms-swift\npip install -e .[llm]\n```\n```Shell\n# default: sft LLM & projector, freeze vision encoder\nCUDA_VISIBLE_DEVICES=0 swift sft \\\n--model_type got-ocr2 \\\n--model_id_or_path stepfun-ai\u002FGOT-OCR2_0 \\\n--sft_type lora \\\n--dataset latex-ocr-print#5000\n\n# Deepspeed ZeRO2\nNPROC_PER_NODE=4 \\\nCUDA_VISIBLE_DEVICES=0,1,2,3 swift sft \\\n--model_type got-ocr2 \\\n--model_id_or_path stepfun-ai\u002FGOT-OCR2_0 \\\n--sft_type lora \\\n--dataset latex-ocr-print#5000 \\\n--deepspeed default-zero2\n```\n\n**With your data**:\n```Shell\n--dataset train.jsonl\n--val_dataset val.jsonl (optional)\n```\n**Data format**:\n```Shell\n{\"query\": \"\u003Cimage>55555\", \"response\": \"66666\", \"images\": [\"image_path\"]}\n{\"query\": \"\u003Cimage>\u003Cimage>eeeee\", \"response\": \"fffff\", \"history\": [], \"images\": [\"image_path1\", \"image_path2\"]}\n{\"query\": \"EEEEE\", \"response\": \"FFFFF\", \"history\": [[\"query1\", \"response1\"], [\"query2\", \"response2\"]]}\n```\nMore details can be found in [ms-swift](https:\u002F\u002Fgithub.com\u002Fmodelscope\u002Fms-swift\u002Fissues\u002F2122).\n\n## Eval\n1. We use the [Fox](https:\u002F\u002Fgithub.com\u002Fucaslcl\u002FFox) and [OneChart](https:\u002F\u002Fgithub.com\u002FLingyvKong\u002FOneChart) benchmarks; other benchmarks can be found in the weights download link.\n2. The eval code can be found in GOT\u002Feval.\n3. You can use evaluate_GOT.py to run the eval. 
If you have 8 GPUs, --num-chunks can be set to 8.\n```Shell\npython3 GOT\u002Feval\u002Fevaluate_GOT.py --model-name \u002FGOT_weights\u002F --gtfile_path xxxx.json --image_path  \u002Fimage\u002Fpath\u002F --out_path \u002Fdata\u002Feval_results\u002FGOT_mathpix_test\u002F --num-chunks 8 --datatype OCR\n```\n\n## Contact\nIf you are interested in this work or have questions about the code or the paper, please join our [WeChat](https:\u002F\u002Fgithub.com\u002FUcas-HaoranWei\u002FGOT-OCR2.0\u002Fblob\u002Fmain\u002Fassets\u002Fwechat.jpg) group.\n\n**Note**:\nAll six WeChat groups are full; please join [group 7](https:\u002F\u002Fgithub.com\u002FUcas-HaoranWei\u002FGOT-OCR2.0\u002Fblob\u002Fmain\u002Fassets\u002FWechat7.jpg).\n\nDon't hesitate to contact me by email, weihaoran18@mails.ucas.ac.cn, if you have any questions.\n\n## Acknowledgement\n- [Vary](https:\u002F\u002Fgithub.com\u002FUcas-HaoranWei\u002FVary\u002F): the codebase we built upon!\n- [Qwen](https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen): the LLM base model of Vary, which is good at both English and Chinese!\n\n\n## Citation\n```bibtex\n@article{wei2024general,\n  title={General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model},\n  author={Wei, Haoran and Liu, Chenglong and Chen, Jinyue and Wang, Jia and Kong, Lingyu and Xu, Yanming and Ge, Zheng and Zhao, Liang and Sun, Jianjian and Peng, Yuang and others},\n  journal={arXiv preprint arXiv:2409.01704},\n  year={2024}\n}\n@article{wei2023vary,\n  title={Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models},\n  author={Wei, Haoran and Kong, Lingyu and Chen, Jinyue and Zhao, Liang and Ge, Zheng and Yang, Jinrong and Sun, Jianjian and Han, Chunrui and Zhang, Xiangyu},\n  journal={arXiv preprint arXiv:2312.06109},\n  
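year={2023}\n}\n```\n","GOT-OCR2.0 is an implementation of a unified end-to-end model for general optical character recognition (OCR). By building a comprehensive OCR framework, the project automates the full pipeline from image input to text output and supports text detection and recognition across multiple languages and complex scenes. Its core techniques include efficient feature extraction, adaptive text localization, and strong sequence modeling, which give the model strong accuracy and speed. It suits applications that need high-precision text recognition, such as document digitization and reading text in natural scenes.","2026-06-11 03:39:53","high_star"]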