-
Notifications
You must be signed in to change notification settings - Fork 707
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: image may be scaled too large for tesseract #2252
Conversation
- tesseract has a 2^31 byte limit on image data - occasionally an image may be scaled too much and larger than that size - fix limits the scaling factor so that we never scale an image larger than what tesseract can handle
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few questions.
assert [region.text for region in ocr.get_ocr_layout_tesseract(image)] == [ | ||
"Hello", | ||
"World", | ||
"!", | ||
] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's being tested here, that Tesseract doesn't error out during this operation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right; and scaled image still returns expected results
@@ -459,9 +460,13 @@ def get_ocr_layout_tesseract( | |||
text_height < env_config.TESSERACT_MIN_TEXT_HEIGHT | |||
or text_height > env_config.TESSERACT_MAX_TEXT_HEIGHT | |||
): | |||
max_zoom = max(0, np.round(np.sqrt(TESSERACT_MAX_SIZE / np.prod(image.size) / 32), 1)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does 32 represent here? It seems like it should be bytes per pixel but 32 seems too big to be that. Is it 32 bits per pixel and maybe it should be 4 bytes per pixel?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
color depth; ideally we compute that from the image property but conveniently we always pass in a 32 color image in by the PIL open as RGB mode.
…o-far-for-tesseract
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Verified that the file in the ticket works, as well as the one attached in CORE-2696.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Confirmed the test fails on main
and passes on this branch.
This PR addresses CORE-2965 by limiting zoom factor so that the scaled image can still be processed by tesseract.
test
A unit test is added in this PR to test a unlikely case where we'd scale an image a few thousand times and massively exceed the limit without the fix.
Unstructured reviewers can also use the document in the ticket to test.