install.sh is for Debian Linux. Prerequisites: apt and Python above 3.8.

The project includes 3 main parts:

  1. PDF Text Extractor - extracts text from a PDF
  2. Image Extractor from PDF - extracts images and saves them to a folder
  3. Text Visualizer - visualizes the text to show what the computer recognizes

On Debian Linux, run:

sudo bash install.sh

Steps:

  1. Install tesseract-ocr and libtesseract-dev using your OS package manager
  2. Create a virtual environment: python3 -m venv venv
  3. Activate it: source venv/bin/activate
  4. Install the required libraries: pip install -r requirments.txt
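
Before installing the Python libraries, it can help to confirm that the prerequisites are in place. The following is a minimal sketch, not part of the project: `check_prerequisites` is a hypothetical helper, it treats Python 3.8 as the minimum version, and it assumes the tesseract-ocr package puts a `tesseract` executable on your PATH.

```python
import shutil
import sys

# Assumption: 3.8 is treated as the lowest acceptable Python version.
MIN_PYTHON = (3, 8)


def check_prerequisites(version_info=sys.version_info, which=shutil.which):
    """Return a list of problems found; an empty list means nothing is missing."""
    problems = []
    if tuple(version_info[:2]) < MIN_PYTHON:
        problems.append("Python %d.%d or newer is required" % MIN_PYTHON)
    # Assumption: installing tesseract-ocr provides a `tesseract` binary on PATH.
    if which("tesseract") is None:
        problems.append("tesseract not found on PATH (install tesseract-ocr)")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("Missing prerequisite:", problem)
```

The two parameters exist only so the check can be exercised without changing the system; called with no arguments, it inspects the running interpreter and the real PATH.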

Depending on your workflow, use main.py if you want a graphical interface, or mainCLI.py if you want to use command-line arguments.

For mainCLI.py you can use either syntax:
python3 mainCLI.py PDFfile
or
python3 mainCLI.py PDFfile -o outputFileName
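
For readers curious how a command line of this shape (one required PDF path plus an optional `-o` output name) is typically handled, here is a small argparse sketch. It is illustrative only and may not match how the project parses its arguments; `build_parser`, `pdf_file`, and `output` are hypothetical names.

```python
import argparse


def build_parser():
    """Build a parser for: PDFfile [-o outputFileName]."""
    parser = argparse.ArgumentParser(
        description="Illustrative sketch of the CLI shape described in the README."
    )
    # Required positional argument: the PDF to process.
    parser.add_argument("pdf_file", help="path to the input PDF")
    # Optional output name; None means the caller did not pass -o.
    parser.add_argument(
        "-o",
        dest="output",
        default=None,
        help="name of the output file (optional)",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print("input:", args.pdf_file, "output:", args.output)
```

Leaving `output` as None when `-o` is absent lets the calling code decide on a default name, which keeps both documented invocations valid.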

For visualizer.py, the syntax is:
python3 visualizer.py PDFfile