fix: check existence of variable res before iteration #2063

Merged (3 commits) on Nov 14, 2023

@ahmetmeleq ahmetmeleq commented Nov 13, 2023

Handles the case where PaddleOCR returns None for a list item in ocr_data during partitioning.

  • While parsing PaddleOCR data, the code assumed that PaddleOCR never returns None for any list item in ocr_data.
  • This PR removes that assumption by skipping the text region whenever this happens.
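
The skip described above can be sketched roughly as follows. This is a minimal illustration, not the merged diff: `TextRegion` and `parse_ocr_data` are hypothetical stand-ins, and the shape of `ocr_data` (one entry per image, each entry a list of `[box_points, (text, confidence)]` lines) is an assumption about PaddleOCR's output.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class TextRegion:
    """Hypothetical stand-in for the library's text region type."""
    x1: float
    y1: float
    x2: float
    y2: float
    text: str


def parse_ocr_data(ocr_data: Sequence[Optional[list]]) -> List[TextRegion]:
    """Convert OCR results to text regions, skipping entries that are None."""
    regions: List[TextRegion] = []
    for res in ocr_data:
        # Assumed failure mode: the OCR engine yields None for an entry.
        # Iterating over None raises TypeError, so skip the entry instead.
        if res is None:
            continue
        for line in res:
            points, (text, _confidence) = line
            xs = [p[0] for p in points]
            ys = [p[1] for p in points]
            regions.append(TextRegion(min(xs), min(ys), max(xs), max(ys), text))
    return regions


sample = [
    None,  # an entry the OCR engine returned as None
    [[[[0, 0], [10, 0], [10, 5], [0, 5]], ("hello", 0.99)]],
]
regions = parse_ocr_data(sample)
```

Skipping the entry keeps the remaining regions on the page usable, at the cost of silently dropping whatever text the None entry would have held.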

@cragwolfe
Contributor

Is there an existing GH issue for this, and/or a file where this causes the error?

@ahmetmeleq
Contributor Author

ahmetmeleq commented Nov 14, 2023

@cragwolfe

This PR addresses a community response thread in which the user could not share their file, so we don't know exactly why PaddleOCR returns that portion of ocr_data as None (which causes the error; see the copied stack trace below):

https://unstructuredw-kbe4326.slack.com/archives/C044N0YV08G/p1699891467623009

However, while reading through the code, I saw that we were assuming that PaddleOCR never returns None for any list item in ocr_data.

I've removed that assumption here, addressing the case by skipping the text region when this happens (similar to how we handle another case here).

Copied stack trace:

Error in partitioning content: 'NoneType' object is not iterable
Traceback (most recent call last):
  File "/home/notebook-user/unstructured/ingest/error.py", line 19, in wrapper
    return f(*args, **kwargs)
  File "/home/notebook-user/unstructured/ingest/interfaces.py", line 438, in partition_file
    elements = partition(
  File "/home/notebook-user/unstructured/partition/auto.py", line 383, in partition
    elements = _partition_pdf(
  File "/home/notebook-user/unstructured/documents/elements.py", line 371, in wrapper
    elements = func(*args, **kwargs)
  File "/home/notebook-user/unstructured/file_utils/filetype.py", line 591, in wrapper
    elements = func(*args, **kwargs)
  File "/home/notebook-user/unstructured/file_utils/filetype.py", line 546, in wrapper
    elements = func(*args, **kwargs)
  File "/home/notebook-user/unstructured/chunking/title.py", line 297, in wrapper
    elements = func(*args, **kwargs)
  File "/home/notebook-user/unstructured/partition/pdf.py", line 182, in partition_pdf
    return partition_pdf_or_image(
  File "/home/notebook-user/unstructured/partition/pdf.py", line 339, in partition_pdf_or_image
    _layout_elements = _partition_pdf_or_image_with_ocr(
  File "/home/notebook-user/unstructured/utils.py", line 179, in wrapper
    return func(*args, **kwargs)
  File "/home/notebook-user/unstructured/partition/pdf.py", line 769, in _partition_pdf_or_image_with_ocr
    page_elements = _partition_pdf_or_image_with_ocr_from_image(
  File "/home/notebook-user/unstructured/partition/pdf.py", line 800, in _partition_pdf_or_image_with_ocr_from_image
    ocr_data = get_layout_elements_from_ocr(
  File "/home/notebook-user/unstructured/partition/ocr.py", line 359, in get_layout_elements_from_ocr
    ocr_regions = get_ocr_layout_from_image(
  File "/home/notebook-user/unstructured/partition/ocr.py", line 462, in get_ocr_layout_from_image
    ocr_regions = get_ocr_layout_paddle(image)
  File "/home/notebook-user/unstructured/partition/ocr.py", line 521, in get_ocr_layout_paddle
    ocr_regions = parse_ocr_data_paddle(ocr_data)
  File "/home/notebook-user/unstructured/partition/ocr.py", line 601, in parse_ocr_data_paddle
    for line in res:
TypeError: 'NoneType' object is not iterable
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/home/notebook-user/unstructured/ingest/pipeline/partition.py", line 48, in run
    elements = doc.process_file(
  File "/home/notebook-user/unstructured/ingest/interfaces.py", line 479, in process_file
    isd_elems_raw = self.partition_file(partition_config=partition_config, **partition_kwargs)
  File "/home/notebook-user/unstructured/ingest/error.py", line 22, in wrapper
    raise cls(cls.error_string.format(str(error))) from error
unstructured.ingest.error.PartitionError: Error in partitioning content: 'NoneType' object is not iterable
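
The two-part traceback above can be reproduced in isolation. The following is a minimal sketch under stated assumptions: `PartitionError`, its `wrap` decorator, and `parse_unguarded` are hypothetical stand-ins modeled on the messages in the trace, not the library's actual code, and `[None]` stands in for the OCR output that the user could not share.

```python
import functools


class PartitionError(Exception):
    """Hypothetical stand-in for the ingest error type seen in the trace."""

    error_string = "Error in partitioning content: {}"

    @classmethod
    def wrap(cls, f):
        @functools.wraps(f)
        def wrapper(*args, **kwargs):
            try:
                return f(*args, **kwargs)
            except Exception as error:
                # "from error" stores the original exception as __cause__,
                # which yields the "direct cause" line in the traceback.
                raise cls(cls.error_string.format(str(error))) from error

        return wrapper


@PartitionError.wrap
def parse_unguarded(ocr_data):
    """Iterate every entry without checking for None (assumed pre-fix behavior)."""
    lines = []
    for res in ocr_data:
        for line in res:  # raises TypeError when res is None
            lines.append(line)
    return lines


try:
    parse_unguarded([None])
except PartitionError as exc:
    message = str(exc)
    cause = exc.__cause__
```

Here `message` carries the same text as the final line of the trace, and `cause` is the underlying `TypeError`, which is why both tracebacks appear in the log.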

Contributor

@yuming-long yuming-long left a comment

LGTM!

@ahmetmeleq ahmetmeleq added this pull request to the merge queue Nov 14, 2023
Merged via the queue into main with commit 68686e2 Nov 14, 2023
46 checks passed
@ahmetmeleq ahmetmeleq deleted the ahmet/bug/ocr-data-nonetype-iter branch November 14, 2023 16:59
3 participants