Add feature to merge a hocr file into a existing pdf file #255

maneau · 2023-12-02T21:52:47Z

The main objective is to keep the PDF small: Tesseract regenerates the PDF from 300 DPI page images, which inflates the file size.
In addition, the original PDF is lost once it has been OCRed, so any original PDF properties or formatting are lost too.
To avoid this, we parse the hOCR file and go through the PDF pages, writing the recognized words as invisible text.

The method mergeHocrIntoAPdf is added to PdfBoxUtilities.

PdfBoxUtilities.mergeHocrIntoAPdf(outputbase1 + ".hocr", pdfFilename, outputbase2, false);
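For readers unfamiliar with hOCR, here is a minimal sketch of the parsing and coordinate-mapping step described above. It is not the code in this PR: the class name HocrWordSketch, the regex-based parsing, and the 300 DPI and 792 pt page height used in main are all assumptions for illustration. It relies only on the fact that hOCR bounding boxes are in image pixels with a top-left origin, while PDF user space is in points (1/72 inch) with a bottom-left origin.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PdfBoxUtilities.
public class HocrWordSketch {

    /** One recognized word and its hOCR bounding box in image pixels (origin top-left). */
    public static final class Word {
        public final String text;
        public final int x0, y0, x1, y1;

        Word(String text, int x0, int y0, int x1, int y1) {
            this.text = text;
            this.x0 = x0; this.y0 = y0; this.x1 = x1; this.y1 = y1;
        }
    }

    // Matches <span class='ocrx_word' ... title='bbox x0 y0 x1 y1; ...'>text</span>.
    // Assumption: class precedes title, as in Tesseract's hOCR output.
    private static final Pattern WORD = Pattern.compile(
            "<span[^>]*class=['\"]ocrx_word['\"][^>]*"
            + "title=['\"]bbox (\\d+) (\\d+) (\\d+) (\\d+)[^'\"]*['\"][^>]*>(.*?)</span>",
            Pattern.DOTALL);

    /** Extracts every ocrx_word span from an hOCR document given as a string. */
    public static List<Word> parseWords(String hocr) {
        List<Word> words = new ArrayList<>();
        Matcher m = WORD.matcher(hocr);
        while (m.find()) {
            // Drop nested markup such as <strong> or <em> around the word text.
            String text = m.group(5).replaceAll("<[^>]+>", "").trim();
            words.add(new Word(text,
                    Integer.parseInt(m.group(1)), Integer.parseInt(m.group(2)),
                    Integer.parseInt(m.group(3)), Integer.parseInt(m.group(4))));
        }
        return words;
    }

    /**
     * Converts a pixel bounding box to PDF points.
     * Returns {x, y, width, height}, where (x, y) is the lower-left corner
     * measured from the bottom-left of the page.
     */
    public static double[] toPdfRect(Word w, int dpi, double pageHeightPt) {
        double x = w.x0 * 72.0 / dpi;
        double y = pageHeightPt - w.y1 * 72.0 / dpi; // flip the y axis
        double width = (w.x1 - w.x0) * 72.0 / dpi;
        double height = (w.y1 - w.y0) * 72.0 / dpi;
        return new double[] {x, y, width, height};
    }

    public static void main(String[] args) {
        String hocr = "<span class='ocrx_word' id='word_1_1' "
                + "title='bbox 300 600 600 650; x_wconf 96'>Hello</span>";
        for (Word w : parseWords(hocr)) {
            double[] r = toPdfRect(w, 300, 792.0); // assumed: 300 DPI, US Letter height
            System.out.printf("%s at (%.1f, %.1f), %.1f x %.1f pt%n",
                    w.text, r[0], r[1], r[2], r[3]);
        }
    }
}
```

Each rectangle gives the position at which a word would be drawn on the existing page. Making the text invisible (for example with PDF text rendering mode 3) is left to the PDF library and is not shown here.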

nguyenq · 2023-12-03T13:54:11Z

@maneau

I see some warnings when running the unit tests. Are they a cause for concern?

Dec 03, 2023 8:22:43 AM org.apache.fontbox.ttf.GlyphSubstitutionTable readLookupTable
SEVERE: The expected SubstFormat for ExtensionSubstFormat1 subtable is 6 but should be 1

maneau · 2023-12-03T20:28:15Z

It seems to be a PDFBox regression in version 3.0. It doesn't appear in 2.0. https://issues.apache.org/jira/browse/PDFBOX-5689

Add feature to merge a hocr file into a existing pdf file

9817036

nguyenq merged commit 62d7238 into nguyenq:master Dec 3, 2023
