PDFTextStripper - parsing incorrectness #458

fungc · 2020-03-12T18:20:00Z

Hello,

I am using PDFTextStripper, from the PDFBox library, to extract the text from a PDF generated from HTML using openhtmltopdf.

Code for parsing:
// try-with-resources closes the document once the text has been extracted
try (PDDocument document = PDDocument.load(pdfBytes)) {
    final PDFTextStripper pdfTextStripper = new PDFTextStripper();
    return pdfTextStripper.getText(document);
}

However, I am seeing a few problems:

Invisible, redundant text
Sometimes the PDF has invisible text in front of the actual text.
e.g.

HTML:
line1
line2
line3

PDF:
line1
line2 (<--- invisible)
line2
line3

This happens even when you just open the pdf and select / copy the text.
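
One way to spot this symptom automatically is to scan the extracted text for a line that repeats the line before it. The following is only a diagnostic sketch in plain Java: the class and method names are hypothetical, it works on whatever string the text extraction returned, and it assumes the invisible copy comes out as an identical, adjacent line, as in the example above.

```java
import java.util.ArrayList;
import java.util.List;

public class RepeatedLineCheck {

    /**
     * Returns the 1-based numbers of lines whose trimmed text is identical
     * to the previous non-blank line. Blank lines are skipped.
     */
    public static List<Integer> findRepeatedLines(String extractedText) {
        List<Integer> repeated = new ArrayList<>();
        String previous = null;
        String[] lines = extractedText.split("\\R"); // split on any line break
        for (int i = 0; i < lines.length; i++) {
            String current = lines[i].trim();
            if (current.isEmpty()) {
                continue;
            }
            if (current.equals(previous)) {
                repeated.add(i + 1);
            }
            previous = current;
        }
        return repeated;
    }

    public static void main(String[] args) {
        String text = "line1\nline2\nline2\nline3\n";
        System.out.println(findRepeatedLines(text)); // prints [3]
    }
}
```

A document that legitimately repeats a line would also be flagged, so this is only useful as a hint about where to look.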

Commas are placed in the wrong position when parsed
Commas are displayed correctly in the PDF, but in the parsed text they appear in the wrong position.
e.g.
HTML:
hello, my name, is

PDF:
,,hello my name is

Note: this does not happen when you open the PDF and select / copy the text.

Interestingly, the comma problem goes away when I parse like this:
// try-with-resources closes the document once the text has been extracted
try (PDDocument document = PDDocument.load(pdfBytes)) {
    final PDFTextStripper pdfTextStripper = new PDFTextStripper();
    pdfTextStripper.setSortByPosition(true);
    return pdfTextStripper.getText(document);
}

However, all superscripts / subscripts then get messed up in the output,
e.g. receptiońs becomes receptións

Do you know why these happen?

Thank you!
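
To illustrate why sorting by position can change the comma result described above, here is a conceptual sketch in plain Java. It is not PDFBox's implementation. The `Fragment` type, the coordinates, and the premise that the commas are written to the content stream before the surrounding words are all assumptions made for illustration; the thread does not confirm the actual cause.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class SortByPositionSketch {

    /** Hypothetical text fragment: a string plus the position where it is drawn. */
    static final class Fragment {
        final String text;
        final double x;
        final double y;

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Joins fragments in the order they are given (the order they were written). */
    public static String inStreamOrder(List<Fragment> fragments) {
        StringBuilder sb = new StringBuilder();
        for (Fragment f : fragments) {
            sb.append(f.text);
        }
        return sb.toString();
    }

    /** Sorts a copy by y, then by x, and joins the result. Here y grows downward (a simplification). */
    public static String inPositionOrder(List<Fragment> fragments) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y)
                .thenComparingDouble(f -> f.x));
        return inStreamOrder(sorted);
    }

    /** Assumed write order: both commas first, then the words. All fragments share one line. */
    public static List<Fragment> sample() {
        return Arrays.asList(
                new Fragment(",", 29, 0),
                new Fragment(",", 78, 0),
                new Fragment("hello", 0, 0),
                new Fragment(" my name", 32, 0),
                new Fragment(" is", 81, 0));
    }

    public static void main(String[] args) {
        System.out.println(inStreamOrder(sample()));   // prints ,,hello my name is
        System.out.println(inPositionOrder(sample())); // prints hello, my name, is
    }
}
```

A viewer places every glyph at its drawn position, which would be consistent with select / copy showing the commas correctly. A real extractor also has to decide which glyphs belong to the same line, so glyphs drawn at a slightly different height could plausibly be reordered by a position sort; that part is speculation and is not modelled here.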

danfickle · 2020-03-16T12:19:45Z

Number 1 may be a serious bug in this library, so I'd love to get the html to reproduce it.

Number 2 and 3, I'm not sure. Does this happen with other PDFs or just ones produced by this library?

fungc · 2020-03-16T21:43:05Z

Financier-Extraordinaire.pdf

I can't get you the HTML at the moment, but here is an output PDF.
I think (1) has to do with paging: it always happens at the end or at the beginning of a page.

(2) and (3) do not happen with other PDFs; I was testing with Apache FOP.

Do you have an email address where we can chat?

fungc · 2020-03-17T17:54:37Z

Financier-Extraordinaire-long.pdf

Found another bug. For extra-long strings, the end of the string becomes invisible but can still be copied.

leonorader · 2020-03-22T14:49:33Z

@fungc could you please provide the HTML code for these issues?

Seems to be confined to ordered lists as far as I can tell.

danfickle · 2020-08-21T12:32:27Z

@fungc, I know it has been a while, but I was able to reproduce this, though only with ordered list items. Was that your experience?

Anyway, I will try to debug.

This fixes repeating content in page margins when line-height is other than one. It also fixes the PDF UA crash caused by the repeating content. However, it is a behavior changing fix. Documents with text split over two pages (usually undesired) will now get a forced page break before the split text.

With changes to get it working and test proof.
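
The behavior change described above (a forced page break before text that would otherwise be split over two pages) can be pictured with a small sketch. This is not openhtmltopdf's code: `placeLine`, its parameters, and the flat coordinate model (one continuous vertical axis cut into pages of equal height) are assumptions made for illustration only.

```java
public class PageBreakSketch {

    /**
     * Returns the top position to use for a line box. If the box would
     * straddle a page boundary, it is pushed to the top of the next page.
     * A box taller than a whole page is left where it is.
     */
    public static double placeLine(double top, double lineHeight, double pageHeight) {
        double pageIndex = Math.floor(top / pageHeight);
        double pageBottom = (pageIndex + 1) * pageHeight;
        boolean straddles = top + lineHeight > pageBottom;
        if (straddles && lineHeight <= pageHeight) {
            return pageBottom; // forced break: start at the top of the next page
        }
        return top;
    }

    public static void main(String[] args) {
        System.out.println(placeLine(80, 10, 100)); // fits on the page, prints 80.0
        System.out.println(placeLine(95, 10, 100)); // would be split, prints 100.0
    }
}
```

Under this model a line is never drawn partly on one page and partly on the next, which is the trade-off the description mentions: documents that previously had text split across pages get an extra break instead.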

danfickle added a commit that referenced this issue Aug 21, 2020

#458 Failing test for repeated content in page margins. [ci skip]

09f9e5f

danfickle added a commit that referenced this issue Aug 22, 2020

#458 - Fix for list markers being output in page margin area.

23606dd

danfickle added a commit that referenced this issue Aug 22, 2020

#458 - Make test cross platform.

b329580

danfickle added a commit that referenced this issue Aug 22, 2020

#458 Take two at making test cross platform.

b037492

danfickle added a commit that referenced this issue Nov 13, 2020

#594 #458 Failing test for more repeating content where it should not…

774b8e6

… [ci skip]

danfickle mentioned this issue Nov 13, 2020

Getting a nullptr exception, reproduction case included #594

Closed

danfickle mentioned this issue Nov 27, 2020

#594 #458 Fix for repeating content and PDF/UA crash. #610

Merged

danfickle added a commit that referenced this issue Nov 28, 2020

#594 #458 Test with font-size much larger than line height.

d06ccc5

danfickle closed this as completed in a9ba3af Nov 28, 2020

danfickle mentioned this issue Nov 30, 2020

Upload to maven central via bintray. #7

Open
