Extracting financial data from PDFs of company accounts

drkane/pdf-accounts

PDF Accounts Extractor

Utilities for extracting financial data from PDFs of company accounts.

The input PDFs must contain readable text, i.e. the pages must either come from a native PDF or have been run through an optical character recognition (OCR) process to extract the text.
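Before running the extractor it can be useful to check that a PDF actually contains text. The sketch below is not part of this repository: `pages_without_text` is a hypothetical helper, and it assumes only that each page object exposes pdfplumber's `extract_text()` method.

```python
def pages_without_text(pdf, min_chars=1):
    """Return 1-based numbers of pages whose extracted text is shorter than min_chars.

    `pdf` is expected to be a pdfplumber.PDF, but any object with a `.pages`
    sequence whose items have an `extract_text()` method will work.
    """
    missing = []
    for number, page in enumerate(pdf.pages, start=1):
        # Treat a None result as empty text
        text = page.extract_text() or ""
        if len(text.strip()) < min_chars:
            missing.append(number)
    return missing
```

Passing the result of `pdfplumber.open("test_accounts.pdf")` to this function should return an empty list for a fully readable document. Any page numbers it does return are likely scanned images that need OCR first.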

Installation and requirements

The only requirement is pdfplumber, which can be installed using:

pip install pdfplumber
# or
pip install -r requirements.txt

Jupyter Notebook and Pandas are only needed to run the working notebook that was used to develop the tool.

Using the tool

At the moment the tool consists of a single script, extract_financial_lines.py. It attempts to find lines within a PDF document that match the pattern <text> <number> <number>. This pattern is intended to match the way financial data is presented in accounts, in the form <item description> <current year value> <previous year value>, as found in a balance sheet, for example. An example item that might be extracted is:

Trade Creditors 57,054 62,853
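To illustrate the <text> <number> <number> pattern, here is a minimal sketch of how such a line could be matched with a regular expression. This is an assumption for illustration only, not the pattern used in extract_financial_lines.py. The names `NUMBER`, `LINE_PATTERN`, `parse_number` and `parse_line` are hypothetical, and treating a bracketed figure as negative is a common accounting convention rather than something this tool is known to handle.

```python
import re

# Assumed pattern, NOT the one used in extract_financial_lines.py:
# a number may contain commas and decimals, with brackets or a minus sign
# marking a negative value.
NUMBER = r"\(?-?\d[\d,]*(?:\.\d+)?\)?"
LINE_PATTERN = re.compile(
    rf"^(?P<label>.*?\S)\s+(?P<current>{NUMBER})\s+(?P<previous>{NUMBER})\s*$"
)


def parse_number(text):
    """Convert '57,054' to 57054.0 and '(1,200)' to -1200.0."""
    negative = text.startswith("(") or text.startswith("-")
    value = float(text.strip("()-").replace(",", ""))
    return -value if negative else value


def parse_line(line):
    """Return (description, current year, previous year), or None if no match."""
    match = LINE_PATTERN.match(line.strip())
    if match is None:
        return None
    return (
        match["label"],
        parse_number(match["current"]),
        parse_number(match["previous"]),
    )


print(parse_line("Trade Creditors 57,054 62,853"))
```

A line with no trailing pair of numbers, such as a heading, returns None. Real accounts also contain note references, dashes for zero values and multi-line descriptions, which a simple pattern like this will not capture.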

The tool can be run from the command line against a PDF file:

python extract_financial_lines.py test_accounts.pdf

You can also import the function into another script:

import pdfplumber

from extract_financial_lines import get_finances

# get_finances requires an open pdfplumber.PDF object
with pdfplumber.open("test_accounts.pdf") as pdf:
    # iterate inside the with block so the PDF is still open
    for row in get_finances(pdf):
        print(row)
