PDFTextStripper - parsing incorrectness #458

Closed
fungc opened this issue Mar 12, 2020 · 5 comments

@fungc

fungc commented Mar 12, 2020

Hello,

I am using PDFTextStripper, from the PDFBox library, to extract the text from a PDF generated from HTML using openhtmltopdf.

Code for parsing:
final PDDocument document = PDDocument.load(pdfBytes);
final PDFTextStripper pdfTextStripper = new PDFTextStripper();
return pdfTextStripper.getText(document);

However, I am seeing a few problems:

  1. Invisible, redundant text
    Sometimes the PDF has invisible text in front of the actual text.
    e.g.

HTML:
line1
line2
line3

PDF:
line1
line2 (<--- invisible)
line2
line3

This happens even when you just open the PDF and select / copy the text.

  2. Commas are placed in the wrong position when parsed
    Commas render correctly in the PDF, but in the parsed text they appear in the wrong position.
    e.g.
    HTML:
    hello, my name, is

PDF:
,,hello my name is

NOTE: this does not happen when you open the PDF and select / copy the text.

  3. Interestingly, the comma problem goes away when I parse like this:
    final PDDocument document = PDDocument.load(pdfBytes);
    final PDFTextStripper pdfTextStripper = new PDFTextStripper();
    pdfTextStripper.setSortByPosition(true);
    return pdfTextStripper.getText(document);

However, all superscripts / subscripts then get messed up in the output,
e.g. receptiońs becomes receptións

Do you know why these happen?

Thank you!
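
One possible reading of items 2 and 3, which nothing in this thread confirms, is that the fragments are written to the page in a different order from the one in which they appear visually. The sketch below is a toy model in plain Java, not PDFBox code: `Fragment`, `inStreamOrder`, `sortedByPosition`, and both sample layouts are hypothetical names and made-up coordinates chosen only to mimic the reported output.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Toy model only: this is not PDFBox code and does not reproduce its real sorting logic.
final class ExtractionOrderSketch {

    // Hypothetical type: a text fragment and where it is drawn.
    // y is measured from the top of the page, x from the left edge.
    static final class Fragment {
        final String text;
        final double x;
        final double y;

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Joins fragments in list order, standing in for content-stream order.
    static String inStreamOrder(List<Fragment> fragments) {
        StringBuilder out = new StringBuilder();
        for (Fragment f : fragments) {
            out.append(f.text);
        }
        return out.toString();
    }

    // Joins fragments top to bottom, then left to right, ignoring stream order.
    static String sortedByPosition(List<Fragment> fragments) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator
                .comparingDouble((Fragment f) -> f.y)
                .thenComparingDouble(f -> f.x));
        return inStreamOrder(sorted);
    }

    // Assumed layout: both commas are written to the stream before the words.
    static List<Fragment> commaSample() {
        List<Fragment> list = new ArrayList<>();
        list.add(new Fragment(",", 30, 0));
        list.add(new Fragment(",", 80, 0));
        list.add(new Fragment("hello", 0, 0));
        list.add(new Fragment(" my name", 35, 0));
        list.add(new Fragment(" is", 85, 0));
        return list;
    }

    // Assumed layout: a combining acute accent (U+0301) follows "n" in the
    // stream but is drawn slightly to the left of it.
    static List<Fragment> accentSample() {
        List<Fragment> list = new ArrayList<>();
        list.add(new Fragment("receptio", 0, 0));
        list.add(new Fragment("n", 50, 0));
        list.add(new Fragment("\u0301", 49, 0));
        list.add(new Fragment("s", 56, 0));
        return list;
    }

    public static void main(String[] args) {
        System.out.println(inStreamOrder(commaSample()));
        System.out.println(sortedByPosition(commaSample()));
        System.out.println(inStreamOrder(accentSample()));
        System.out.println(sortedByPosition(accentSample()));
    }
}
```

Under these assumed layouts, stream order gives `,,hello my name is` while position order gives `hello, my name, is`. For the accent sample the effect is reversed: stream order keeps the accent on the `n`, and position order moves it onto the `o`. Whether the real PDF is laid out this way would need to be checked against the attached files.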

@danfickle
Owner

Number 1 may be a serious bug in this library, so I'd love to get the html to reproduce it.

Number 2 and 3, I'm not sure. Does this happen with other PDFs or just ones produced by this library?

@fungc
Author

fungc commented Mar 16, 2020

Financier-Extraordinaire.pdf

I can't get you the HTML at the moment, but here is an output PDF.
I think (1) has to do with paging; it always happens at the end or the beginning of a page.

(2) and (3) do not happen with other PDFs; I was testing with Apache FOP.

Do you have an email where we can chat?

@fungc
Author

fungc commented Mar 17, 2020

Financier-Extraordinaire-long.pdf

Found another bug. For extra-long strings, the end of the string becomes invisible but can still be copied.

@leonorader
Contributor

leonorader commented Mar 22, 2020

@fungc could you please provide the HTML for these issues?

danfickle added a commit that referenced this issue Aug 21, 2020
Seems to be confined to ordered lists as far as I can tell.
@danfickle
Owner

@fungc, I know it has been a while, but I was able to reproduce this, though only with ordered list items. Was that your experience?

Anyway, I will try to debug.

danfickle added a commit that referenced this issue Aug 22, 2020
danfickle added a commit that referenced this issue Nov 27, 2020
This fixes repeating content in page margins when line-height is other than one. It also fixes the PDF UA crash caused by the repeating content.

However, it is a behavior changing fix. Documents with text split over two pages (usually undesired) will now get a forced page break before the split text.
danfickle added a commit that referenced this issue Nov 28, 2020
With changes to get it working and test proof.