PDF Accounts Extractor

Utilities for extracting financial data from PDFs of company accounts.

The input PDFs must have readable text - i.e. the pages must either be from a native PDF or one that has been run through an optical character recognition process to extract the text.
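Whether a PDF has a usable text layer can be checked before running the extractor. The sketch below is not part of the tool: has_readable_text and its min_chars parameter are hypothetical names, and it assumes only that the object passed in exposes .pages whose items have an extract_text() method, as a pdfplumber.PDF does.

```python
def has_readable_text(pdf, min_chars=1):
    """Return True if any page yields at least min_chars of extractable text.

    `pdf` is assumed to behave like a pdfplumber.PDF: it has a `.pages`
    sequence whose items provide `extract_text()`.
    """
    for page in pdf.pages:
        # extract_text() may return None or an empty string for image-only pages
        text = page.extract_text() or ""
        if len(text.strip()) >= min_chars:
            return True
    return False
```

With pdfplumber installed, this could be called on the object returned by pdfplumber.open("test_accounts.pdf"). A False result suggests the file is a scanned image that needs OCR first.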

Installation and requirements

The only requirement is pdfplumber, which can be installed using either of the following:

pip install pdfplumber
# or
pip install -r requirements.txt

Jupyter Notebook and pandas are also needed, but only to run the working notebook that was used to develop the tool.

Using the tool

At the moment the tool consists of a single script, extract_financial_lines.py. It attempts to find lines within a PDF document that match the pattern <text> <number> <number>. This pattern reflects how financial data is commonly laid out in accounts, in the form <item description> <current year value> <previous year value>, as found in a balance sheet, for example. An example item that might be extracted is:

Trade Creditors 57,054 62,853
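To illustrate the <text> <number> <number> pattern, here is a minimal sketch using a regular expression. It is not the code in extract_financial_lines.py: the names NUMBER, LINE_RE, to_number and parse_line are hypothetical, and the handling of thousands separators and of parentheses as negative values is an assumption about how accounts are typically formatted.

```python
import re

# Assumed number format: optional opening bracket or minus sign, digits with
# optional thousands separators, optional decimal part, optional closing bracket.
NUMBER = r"\(?-?\d[\d,]*(?:\.\d+)?\)?"

# A description followed by exactly two numbers at the end of the line.
LINE_RE = re.compile(
    rf"^(?P<label>.*?\S)\s+(?P<current>{NUMBER})\s+(?P<previous>{NUMBER})\s*$"
)


def to_number(token):
    """Convert a token such as '57,054' or '(1,234)' to a float."""
    # Assumption: a value wrapped in brackets is negative.
    negative = token.startswith("(") and token.endswith(")")
    value = float(token.strip("()").replace(",", ""))
    return -value if negative else value


def parse_line(line):
    """Return (label, current, previous), or None if the line does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    return (
        match.group("label"),
        to_number(match.group("current")),
        to_number(match.group("previous")),
    )


print(parse_line("Trade Creditors 57,054 62,853"))
# ('Trade Creditors', 57054.0, 62853.0)
```

Lines without two trailing numbers, such as a heading, return None in this sketch.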

The tool can be run from the command line against a PDF file:

python extract_financial_lines.py test_accounts.pdf

The get_finances function can also be imported and used from another script:

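import pdfplumber
from extract_financial_lines import get_finances

# get_finances requires a pdfplumber.PDF object; opening the file with a
# context manager ensures it is closed once the rows have been printed
with pdfplumber.open("test_accounts.pdf") as pdf:
    rows = get_finances(pdf)
    for r in rows:
        print(r)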
