Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: image may be scaled too large for tesseract #2252

Merged
merged 4 commits into from
Dec 13, 2023

Conversation

badGarnet
Copy link
Collaborator

@badGarnet badGarnet commented Dec 12, 2023

This PR addresses CORE-2965 by limiting zoom factor so that the scaled image can still be processed by tesseract.

  • tesseract has a 2^31 byte limit on image data
  • occasionally an image may be scaled too much and larger than that size
  • fix limits the scaling factor so that we never scale an image larger than what tesseract can handle

test

A unit test is added in this PR to test a unlikely case where we'd scale an image a few thousand times and massively exceed the limit without the fix.

Unstructured reviewers can also use the document in the ticket to test.

- tesseract has a 2^31 byte limit on image data
- occasionally an image may be scaled too much and larger than that size
- fix limits the scaling factor so that we never scale an image larger
  than what tesseract can handle
Copy link
Contributor

@qued qued left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few questions.

Comment on lines +491 to +495
assert [region.text for region in ocr.get_ocr_layout_tesseract(image)] == [
"Hello",
"World",
"!",
]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's being tested here, that Tesseract doesn't error out during this operation?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right; and scaled image still returns expected results

@@ -459,9 +460,13 @@ def get_ocr_layout_tesseract(
text_height < env_config.TESSERACT_MIN_TEXT_HEIGHT
or text_height > env_config.TESSERACT_MAX_TEXT_HEIGHT
):
max_zoom = max(0, np.round(np.sqrt(TESSERACT_MAX_SIZE / np.prod(image.size) / 32), 1))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does 32 represent here? It seems like it should be bytes per pixel but 32 seems too big to be that. Is it 32 bits per pixel and maybe it should be 4 bytes per pixel?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

color depth; ideally we compute that from the image property but conveniently we always pass in a 32 color image in by the PIL open as RGB mode.

@badGarnet badGarnet requested a review from qued December 13, 2023 18:38
Copy link
Contributor

@awalker4 awalker4 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Verified that the file in the ticket works, as well as the one attached in CORE-2696.

Copy link
Contributor

@qued qued left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confirmed the test fails on main and passes on this branch.

Merged via the queue into main with commit 36e4639 Dec 13, 2023
51 checks passed
@badGarnet badGarnet deleted the fix/image-can-zoom-too-far-for-tesseract branch December 13, 2023 20:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants