Official code implementation of General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model


Haoran Wei*, Chenglong Liu*, Jinyue Chen, Jia Wang, Lingyu Kong, Yanming Xu, Zheng Ge, Liang Zhao, Jianjian Sun, Yuang Peng, Chunrui Han, Xiangyu Zhang

Release

  • [2024/9/03]🔥🔥🔥 We open-source the code, weights, and benchmarks. The paper can be found in this repo. We have also submitted it to arXiv.
  • [2024/9/03]🔥🔥🔥 We release the OCR-2.0 model GOT!


Usage and License Notices: The data, code, and checkpoints are intended and licensed for research use only. They are also restricted to uses that follow the license agreement of Vary.

Contents

  • Install
  • GOT Weights
  • Demo
  • Train
  • Eval
  • Contact
  • Acknowledgement
  • Citation


Install

  1. Our environment is CUDA 11.8 + torch 2.0.1.
  2. Clone this repository and navigate to the GOT folder:
git clone https://github.com/Ucas-HaoranWei/GOT-OCR2.0.git
cd 'the GOT folder'
  3. Install the package:
conda create -n got python=3.10 -y
conda activate got
pip install -e .
  4. Install Flash-Attention:
pip install ninja
pip install flash-attn --no-build-isolation
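
If the flash-attn build fails, the cause is usually an environment mismatch. A quick sanity check with plain PyTorch (nothing repo-specific) before building:

import torch

# The repo targets CUDA 11.8 + torch 2.0.1; flash-attn compiles against these.
print("torch:", torch.__version__)        # expect 2.0.1
print("CUDA build:", torch.version.cuda)  # expect 11.8
print("GPU available:", torch.cuda.is_available())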

GOT Weights

Download the GOT weights and pass the local folder to --model-name; the commands below use /GOT_weights/ as a placeholder.

Demo

  1. plain-text OCR:
python3 GOT/demo/run_ocr_2.0.py  --model-name  /GOT_weights/  --image-file  /an/image/file.png  --type ocr
  2. formatted-text OCR:
python3 GOT/demo/run_ocr_2.0.py  --model-name  /GOT_weights/  --image-file  /an/image/file.png  --type format
  3. fine-grained OCR (restrict recognition to a box or a colored region):
python3 GOT/demo/run_ocr_2.0.py  --model-name  /GOT_weights/  --image-file  /an/image/file.png  --type format/ocr --box [x1,y1,x2,y2]
python3 GOT/demo/run_ocr_2.0.py  --model-name  /GOT_weights/  --image-file  /an/image/file.png  --type format/ocr --color red/green/blue
  4. multi-crop OCR:
python3 GOT/demo/run_ocr_2.0_crop.py  --model-name  /GOT_weights/ --image-file  /an/image/file.png
  5. multi-page OCR (the image path contains multiple .png files):
python3 GOT/demo/run_ocr_2.0_crop.py  --model-name  /GOT_weights/ --image-file  /images/path/  --multi-page
  6. render the formatted OCR results:
python3 GOT/demo/run_ocr_2.0.py  --model-name  /GOT_weights/  --image-file  /an/image/file.png  --type format --render

Note: The rendering results are written to /results/demo.html. Open demo.html in a browser to view them.
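
To run the plain-text demo over a whole folder of images, a thin wrapper around the command above is enough. A minimal sketch; the paths are the same placeholders used throughout this README:

import subprocess
from pathlib import Path

WEIGHTS = "/GOT_weights/"          # placeholder: your downloaded GOT weights
IMAGE_DIR = Path("/images/path/")  # placeholder: folder of images to OCR

# Invoke the demo script once per image, exactly as in the commands above.
for image in sorted(IMAGE_DIR.glob("*.png")):
    subprocess.run(
        ["python3", "GOT/demo/run_ocr_2.0.py",
         "--model-name", WEIGHTS,
         "--image-file", str(image),
         "--type", "ocr"],
        check=True,
    )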

Train

  1. This codebase only supports post-training (stage-2/stage-3) on top of our GOT weights.
  2. If you want to train from the stage-1 described in our paper, you need this repo.

An example post-training command with DeepSpeed:
deepspeed   /GOT-OCR-2.0-master/GOT/train/train_GOT.py \
 --deepspeed /GOT-OCR-2.0-master/zero_config/zero2.json    --model_name_or_path /GOT_weights/ \
 --use_im_start_end True   \
 --bf16 True   \
 --gradient_accumulation_steps 2    \
 --evaluation_strategy "no"   \
 --save_strategy "steps"  \
 --save_steps 200   \
 --save_total_limit 1   \
 --weight_decay 0.    \
 --warmup_ratio 0.001     \
 --lr_scheduler_type "cosine"    \
 --logging_steps 1    \
 --tf32 True     \
 --model_max_length 8192    \
 --gradient_checkpointing True   \
 --dataloader_num_workers 8    \
 --report_to none  \
 --per_device_train_batch_size 2    \
 --num_train_epochs 1  \
 --learning_rate 2e-5   \
 --datasets pdf-ocr+scence \
 --output_dir /your/output.path
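
The zero_config/zero2.json referenced above is a DeepSpeed ZeRO stage-2 config. For orientation, the sketch below writes a representative one; the field values are common DeepSpeed defaults and an assumption, not necessarily the repo's exact file:

import json

# Representative ZeRO stage-2 settings. "auto" lets DeepSpeed inherit values
# from the HuggingFace Trainer arguments (e.g. bf16, batch size). These are
# common defaults, not a copy of the repo's zero2.json.
zero2 = {
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "bf16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
}

with open("zero2.json", "w") as f:
    json.dump(zero2, f, indent=2)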

Note:

  1. Change the corresponding data information in constant.py (see the sketch after this list).
  2. Change line 37 in conversation_dataset_qwen.py to your data_name.
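
For orientation, dataset registration in constant.py typically maps the name passed via --datasets to image and annotation paths. The entry below is a hypothetical example; the key names and schema are assumptions, so check constant.py for the exact format:

# Hypothetical entry in constant.py; the real schema may differ.
CONVERSATION_DATA = {
    "pdf-ocr": {  # the name passed via --datasets
        "images": "/path/to/images/",                # image root folder
        "annotations": "/path/to/annotations.json",  # conversation-style labels
    },
}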

Eval

  1. We use the Fox and OneChart benchmarks; other benchmarks can be found in the weights download link.
  2. The eval code is in GOT/eval.
  3. You can use evaluate_GOT.py to run the eval. If you have 8 GPUs, set --num-chunks to 8, as in the example below.
python3 GOT/eval/evaluate_GOT.py --model-name /GOT_weights/ --gtfile_path xxxx.json --image_path  /image/path/ --out_path /data/eval_results/GOT_mathpix_test/ --num-chunks 8 --datatype OCR
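
--num-chunks follows the common pattern of splitting the ground-truth samples into N contiguous slices, one per GPU, and merging the per-chunk outputs afterwards. An illustrative sketch of that pattern (not the repo's exact code):

import math

def get_chunk(samples, num_chunks, chunk_idx):
    """Return the chunk_idx-th of num_chunks contiguous slices of samples."""
    chunk_size = math.ceil(len(samples) / num_chunks)
    return samples[chunk_idx * chunk_size : (chunk_idx + 1) * chunk_size]

# e.g. with 8 GPUs, process k evaluates get_chunk(all_samples, 8, k),
# and the per-chunk result files are merged after all processes finish.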

Contact

If you are interested in this work or have questions about the code or the paper, please join our WeChat communication group.

Acknowledgement

  • Vary: the codebase we built upon!
  • Qwen: the LLM base model of Vary, which is good at both English and Chinese!

Citation

@article{wei2024general,
  title={General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model},
  author={Wei, Haoran and Liu, Chenglong and Chen, Jinyue and Wang, Jia and Kong, Lingyu and Xu, Yanming and Ge, Zheng and Zhao, Liang and Sun, Jianjian and Peng, Yuang and others},
  journal={arXiv preprint arXiv:2409.01704},
  year={2024}
}
@article{wei2023vary,
  title={Vary: Scaling up the Vision Vocabulary for Large Vision-Language Models},
  author={Wei, Haoran and Kong, Lingyu and Chen, Jinyue and Zhao, Liang and Ge, Zheng and Yang, Jinrong and Sun, Jianjian and Han, Chunrui and Zhang, Xiangyu},
  journal={arXiv preprint arXiv:2312.06109},
  year={2023}
}

