Utilities for extracting financial data from accounts PDFs.
The input PDFs must contain readable text, i.e. the pages must come either from a native PDF or from one that has been run through optical character recognition (OCR) to extract the text.
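A quick way to check whether a PDF meets this requirement is to see which pages yield no text at all. The sketch below is illustrative and not part of the tool: pages_without_text is a hypothetical helper, and it assumes it is handed an object that behaves like an open pdfplumber PDF (a pages list whose items have an extract_text() method).

```python
def pages_without_text(pdf):
    """Return the 1-based numbers of pages that yield no extractable text.

    `pdf` is assumed to behave like an open pdfplumber PDF: it exposes a
    `pages` sequence whose items have an `extract_text()` method.
    """
    missing = []
    for number, page in enumerate(pdf.pages, start=1):
        text = page.extract_text()
        # Depending on the version, an empty page may give None or "".
        if not text or not text.strip():
            missing.append(number)
    return missing


# Example usage (requires pdfplumber and a real file):
# import pdfplumber
# with pdfplumber.open("test_accounts.pdf") as pdf:
#     print(pages_without_text(pdf))
```

Any page numbers returned are likely scanned images that need OCR before the tool can read them.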
The only requirement is pdfplumber, which can be installed using:
pip install pdfplumber
# or
pip install -r requirements.txt
Jupyter Notebook and pandas are also needed to run the working notebook that was used to develop the tool.
At the moment the tool consists of a single script, extract_financial_lines.py. This attempts to find lines within a PDF document that match the pattern <text> <number> <number>.
This pattern is intended to match the way financial data is commonly laid out in
accounts, in the form <item description> <current year value> <previous year value>,
as found in a balance sheet, for example. An example item that might be extracted is:
Trade Creditors 57,054 62,853
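To make the <text> <number> <number> pattern concrete, here is a minimal sketch of how one such line could be matched with a regular expression. It is not the expression the script actually uses: LINE_PATTERN, to_number and parse_financial_line are hypothetical names, and the handling of thousands separators and bracketed negatives is an assumption about typical accounts formatting.

```python
import re

# Assumed number format: digits with optional thousands separators and
# decimals, optionally wrapped in brackets to indicate a negative value.
NUMBER = r"\(?\d[\d,]*(?:\.\d+)?\)?"

# A description followed by exactly two numbers at the end of the line.
LINE_PATTERN = re.compile(
    rf"^(?P<label>.*?\S)\s+(?P<current>{NUMBER})\s+(?P<previous>{NUMBER})\s*$"
)


def to_number(text):
    """Convert '57,054' to 57054.0 and '(1,200)' to -1200.0."""
    negative = text.startswith("(") and text.endswith(")")
    value = float(text.strip("()").replace(",", ""))
    return -value if negative else value


def parse_financial_line(line):
    """Return (label, current, previous), or None if the line does not match."""
    match = LINE_PATTERN.match(line.strip())
    if match is None:
        return None
    return (
        match["label"],
        to_number(match["current"]),
        to_number(match["previous"]),
    )


print(parse_financial_line("Trade Creditors 57,054 62,853"))
# -> ('Trade Creditors', 57054.0, 62853.0)
```

Lines with fewer than two trailing numbers, such as headings or dates, return None and can simply be skipped.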
The tool can be run from the command line against a PDF file:
python extract_financial_lines.py test_accounts.pdf
You can also import the get_finances function and call it from your own script:
from extract_financial_lines import get_finances
import pdfplumber
# get_finances requires an open pdfplumber.PDF object
with pdfplumber.open("test_accounts.pdf") as pdf:
    rows = get_finances(pdf)
    for r in rows:
        print(r)