Official code implementation of General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model


Haoran Wei*, Chenglong Liu*, Jinyue Chen, Jia Wang, Lingyu Kong, Yanming Xu, Zheng Ge, Liang Zhao, Jianjian Sun, Yuang Peng, Chunrui Han, Xiangyu Zhang

Release

  • [2024/9/03]🔥🔥🔥 We open-source the code, weights, and benchmarks. The paper can be found in this repo. We have also submitted it to arXiv.
  • [2024/9/03]🔥🔥🔥 We release the OCR-2.0 model GOT!


Usage and License Notices: The data, code, and checkpoints are intended and licensed for research use only. They are also restricted to uses that follow the license agreement of Vary.

Contents

  • Install
  • GOT Weights
  • Demo
  • Train
  • Eval
  • Contact
  • Acknowledgement
  • Citation


Install

  1. Our environment is CUDA 11.8 + torch 2.0.1.
  2. Clone this repository and navigate to the GOT folder:
git clone https://github.com/Ucas-HaoranWei/GOT-OCR2.0.git
cd 'the GOT folder'
  3. Install the package:
conda create -n got python=3.10 -y
conda activate got
pip install -e .
  4. Install Flash-Attention:
pip install ninja
pip install flash-attn --no-build-isolation
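
If the flash-attn build fails, the cause is usually an environment mismatch. A quick sanity check with plain PyTorch (nothing repo-specific) before building:

import torch

# The repo targets CUDA 11.8 + torch 2.0.1; flash-attn compiles against these.
print("torch:", torch.__version__)        # expect 2.0.1
print("CUDA build:", torch.version.cuda)  # expect 11.8
print("GPU available:", torch.cuda.is_available())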

GOT Weights

Download the GOT weights and pass the local folder to --model-name; the commands below use /GOT_weights/ as a placeholder.

Demo

  1. plain-text OCR:
python3 GOT/demo/run_ocr_2.0.py  --model-name  /GOT_weights/  --image-file  /an/image/file.png  --type ocr
  2. formatted-text OCR:
python3 GOT/demo/run_ocr_2.0.py  --model-name  /GOT_weights/  --image-file  /an/image/file.png  --type format
  3. fine-grained OCR (restrict recognition to a box or a colored region):
python3 GOT/demo/run_ocr_2.0.py  --model-name  /GOT_weights/  --image-file  /an/image/file.png  --type format/ocr --box [x1,y1,x2,y2]
python3 GOT/demo/run_ocr_2.0.py  --model-name  /GOT_weights/  --image-file  /an/image/file.png  --type format/ocr --color red/green/blue
  4. multi-crop OCR:
python3 GOT/demo/run_ocr_2.0_crop.py  --model-name  /GOT_weights/ --image-file  /an/image/file.png
  5. multi-page OCR (the image path contains multiple .png files):
python3 GOT/demo/run_ocr_2.0_crop.py  --model-name  /GOT_weights/ --image-file  /images/path/  --multi-page
  6. render the formatted OCR results:
python3 GOT/demo/run_ocr_2.0.py  --model-name  /GOT_weights/  --image-file  /an/image/file.png  --type format --render

Note: The rendering results are written to /results/demo.html. Open demo.html in a browser to view them.
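
To run the plain-text demo over a whole folder of images, a thin wrapper around the command above is enough. A minimal sketch; the paths are the same placeholders used throughout this README:

import subprocess
from pathlib import Path

WEIGHTS = "/GOT_weights/"          # placeholder: your downloaded GOT weights
IMAGE_DIR = Path("/images/path/")  # placeholder: folder of images to OCR

# Invoke the demo script once per image, exactly as in the commands above.
for image in sorted(IMAGE_DIR.glob("*.png")):
    subprocess.run(
        ["python3", "GOT/demo/run_ocr_2.0.py",
         "--model-name", WEIGHTS,
         "--image-file", str(image),
         "--type", "ocr"],
        check=True,
    )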

Train

  1. This codebase only supports post-training (stage-2/stage-3) on top of our GOT weights.
  2. If you want to train from the stage-1 described in our paper, you need this repo.

An example post-training command with DeepSpeed:
deepspeed   /GOT-OCR-2.0-master/GOT/train/train_GOT.py \
 --deepspeed /GOT-OCR-2.0-master/zero_config/zero2.json    --model_name_or_path /GOT_weights/ \
 --use_im_start_end True   \
 --bf16 True   \
 --gradient_accumulation_steps 2    \
 --evaluation_strategy "no"   \
 --save_strategy "steps"  \
 --save_steps 200   \
 --save_total_limit 1   \
 --weight_decay 0.    \
 --warmup_ratio 0.001     \
 --lr_scheduler_type "cosine"    \
 --logging_steps 1    \
 --tf32 True     \
 --model_max_length 8192    \
 --gradient_checkpointing True   \
 --dataloader_num_workers 8    \
 --report_to none  \
 --per_device_train_batch_size 2    \
 --num_train_epochs 1  \
 --learning_rate 2e-5   \
 --datasets pdf-ocr+scence \
 --output_dir /your/output.path
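
The zero_config/zero2.json referenced above is a DeepSpeed ZeRO stage-2 config. For orientation, the sketch below writes a representative one; the field values are common DeepSpeed defaults and an assumption, not necessarily the repo's exact file:

import json

# Representative ZeRO stage-2 settings. "auto" lets DeepSpeed inherit values
# from the HuggingFace Trainer arguments (e.g. bf16, batch size). These are
# common defaults, not a copy of the repo's zero2.json.
zero2 = {
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "bf16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
}

with open("zero2.json", "w") as f:
    json.dump(zero2, f, indent=2)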

Note:

  1. Change the corresponding data information in constant.py (see the sketch after this list).
  2. Change line 37 in conversation_dataset_qwen.py to your data_name.
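
For orientation, dataset registration in constant.py typically maps the name passed via --datasets to image and annotation paths. The entry below is a hypothetical example; the key names and schema are assumptions, so check constant.py for the exact format:

# Hypothetical entry in constant.py; the real schema may differ.
CONVERSATION_DATA = {
    "pdf-ocr": {  # the name passed via --datasets
        "images": "/path/to/images/",                # image root folder
        "annotations": "/path/to/annotations.json",  # conversation-style labels
    },
}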

Eval

  1. We use the Fox and OneChart benchmarks; other benchmarks can be found in the weights download link.
  2. The eval code is in GOT/eval.
  3. You can use evaluate_GOT.py to run the eval. If you have 8 GPUs, set --num-chunks to 8, as in the example below.
python3 GOT/eval/evaluate_GOT.py --model-name /GOT_weights/ --gtfile_path xxxx.json --image_path  /image/path/ --out_path /data/eval_results/GOT_mathpix_test/ --num-chunks 8 --datatype OCR
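
--num-chunks follows the common pattern of splitting the ground-truth samples into N contiguous slices, one per GPU, and merging the per-chunk outputs afterwards. An illustrative sketch of that pattern (not the repo's exact code):

import math

def get_chunk(samples, num_chunks, chunk_idx):
    """Return the chunk_idx-th of num_chunks contiguous slices of samples."""
    chunk_size = math.ceil(len(samples) / num_chunks)
    return samples[chunk_idx * chunk_size : (chunk_idx + 1) * chunk_size]

# e.g. with 8 GPUs, process k evaluates get_chunk(all_samples, 8, k),
# and the per-chunk result files are merged after all processes finish.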

Contact

If you are interested in this work or have questions about the code or the paper, please join our WeChat communication group.

Acknowledgement

  • Vary: the codebase we built upon!
  • Qwen: the LLM base model of Vary, which is good at both English and Chinese!

Citation

@article{wei2024general,
  title={General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model},
  author={Wei, Haoran and Liu, Chenglong and Chen, Jinyue and Wang, Jia and Kong, Lingyu and Xu, Yanming and Ge, Zheng and Zhao, Liang and Sun, Jianjian and Peng, Yuang and others},
  journal={arXiv preprint arXiv:2409.01704},
  year={2024}
}
@article{wei2023vary,
  title={Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models},
  author={Wei, Haoran and Kong, Lingyu and Chen, Jinyue and Zhao, Liang and Ge, Zheng and Yang, Jinrong and Sun, Jianjian and Han, Chunrui and Zhang, Xiangyu},
  journal={arXiv preprint arXiv:2312.06109},
  year={2023}
}

