Add feature to merge an hOCR file into an existing PDF file #255

Merged
merged 1 commit into from
Dec 3, 2023

@maneau maneau commented Dec 2, 2023

The main objective is to keep the PDF small. When Tesseract produces a searchable PDF, it regenerates the document from 300 DPI page images, which tends to inflate the file size.
In addition, the original PDF is discarded once it has been OCRed, so any original PDF properties or formatting are lost too.
To avoid both problems, we parse the hOCR file and walk through the pages of the original PDF, writing the recognized words onto each page as invisible text.

The method mergeHocrIntoAPdf is added to PdfBoxUtilities.

PdfBoxUtilities.mergeHocrIntoAPdf(outputbase1 + ".hocr", pdfFilename, outputbase2, false);
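
To illustrate the parsing step described above, here is a minimal sketch, not the code from this PR. It pulls word boxes out of hOCR markup and converts them from image pixels (origin at the top-left) to PDF points (origin at the bottom-left). The class `HocrWordBoxes`, its `Word` type, and the `parse` signature are hypothetical names chosen for this example. It assumes Tesseract-style `ocrx_word` spans with `class` before `title`, uses a regex instead of a real HTML parser, and does not decode HTML entities. The drawing step, where each word is written onto the page as invisible text, is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of Tess4J): hOCR word boxes mapped to PDF space. */
class HocrWordBoxes {

    /** One recognized word; the box is in PDF points with the origin at the bottom-left. */
    static final class Word {
        final String text;
        final double x, y, width, height;

        Word(String text, double x, double y, double width, double height) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }
    }

    // Assumed shape: <span class='ocrx_word' ... title='bbox x0 y0 x1 y1; ...'>text</span>
    private static final Pattern WORD = Pattern.compile(
            "<span[^>]*class=['\"]ocrx_word['\"][^>]*title=['\"]bbox (\\d+) (\\d+) (\\d+) (\\d+)"
                    + "[^'\"]*['\"][^>]*>(.*?)</span>",
            Pattern.DOTALL);

    /**
     * @param hocr         hOCR markup for a single page
     * @param dpi          resolution of the image that was OCRed (e.g. 300)
     * @param pageHeightPt height of the target PDF page in points
     */
    static List<Word> parse(String hocr, double dpi, double pageHeightPt) {
        List<Word> words = new ArrayList<>();
        Matcher m = WORD.matcher(hocr);
        while (m.find()) {
            int x0 = Integer.parseInt(m.group(1));
            int y0 = Integer.parseInt(m.group(2));
            int x1 = Integer.parseInt(m.group(3));
            int y1 = Integer.parseInt(m.group(4));
            // Drop nested markup such as <strong> or <em> around the word text.
            String text = m.group(5).replaceAll("<[^>]+>", "").trim();
            if (text.isEmpty()) {
                continue;
            }
            // Pixels to points: 72 points per inch, dpi pixels per inch.
            double x = x0 * 72.0 / dpi;
            double width = (x1 - x0) * 72.0 / dpi;
            double height = (y1 - y0) * 72.0 / dpi;
            // Flip the vertical axis: hOCR measures down from the top, PDF up from the bottom.
            double y = pageHeightPt - y1 * 72.0 / dpi;
            words.add(new Word(text, x, y, width, height));
        }
        return words;
    }

    public static void main(String[] args) {
        String hocr = "<span class='ocrx_word' id='word_1_1' "
                + "title='bbox 300 600 600 750; x_wconf 95'>Hello</span>";
        for (Word w : parse(hocr, 300, 792)) {
            System.out.println(w.text + " at (" + w.x + ", " + w.y + ") size "
                    + w.width + " x " + w.height);
        }
    }
}
```

Each resulting box gives the position and size at which the word would be drawn on the original page, so the visible content stays untouched while the text becomes searchable.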

@nguyenq nguyenq merged commit 62d7238 into nguyenq:master Dec 3, 2023
Owner

nguyenq commented Dec 3, 2023

@maneau

I see some warnings when running the unit tests. Are they a cause for concern?

Dec 03, 2023 8:22:43 AM org.apache.fontbox.ttf.GlyphSubstitutionTable readLookupTable
SEVERE: The expected SubstFormat for ExtensionSubstFormat1 subtable is 6 but should be 1

Author

maneau commented Dec 3, 2023

It seems to be a PDFBox regression in version 3.0; it doesn't appear in 2.0. See https://issues.apache.org/jira/browse/PDFBOX-5689
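
If the message is only noise in the test output, one possible stopgap is to silence that single logger. This is a sketch under assumptions: the log format above suggests the messages go through java.util.logging, so the level is set there. The class name `FontBoxLogQuiet` and its method are hypothetical. Turning the logger off also hides any genuine errors it would report, so it is probably best limited to test runs.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical test helper: mutes the logger that emits the SubstFormat message. */
class FontBoxLogQuiet {

    // Keep a strong reference: java.util.logging holds loggers weakly,
    // so the level could otherwise be lost after garbage collection.
    private static final Logger GSUB_LOGGER =
            Logger.getLogger("org.apache.fontbox.ttf.GlyphSubstitutionTable");

    /** Turns off all output from the GlyphSubstitutionTable logger and returns it. */
    static Logger silenceGsubWarnings() {
        GSUB_LOGGER.setLevel(Level.OFF);
        return GSUB_LOGGER;
    }
}
```

Calling `FontBoxLogQuiet.silenceGsubWarnings()` once before the tests run should be enough, provided no other logging backend is configured.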
