Tags: TJX2014/unstructured
chore: bump unstructured-inference 0.7.31 (Unstructured-IO#2981)
fix: parse URL response Content-Type according to RFC 9110 (Unstructured-IO#2950)

Currently, `file_and_type_from_url()` does not correctly handle the `Content-Type` header. Specifically, it assumes that the header contains only the mime-type (e.g. `text/html`). However, [RFC 9110](https://www.rfc-editor.org/rfc/rfc9110#field.content-type) allows additional parameters, such as `charset`, to be returned in the header. This leads to a `ValueError` when loading a URL whose response `Content-Type` header is, for example, `text/html; charset=UTF-8`.

To reproduce the issue:

```python
from unstructured.partition.auto import partition

url = "https://arstechnica.com/space/2024/04/nasa-still-doesnt-understand-root-cause-of-orion-heat-shield-issue/"
partition(url=url)
```

This results in the following exception:

```python
{
  "name": "ValueError",
  "message": "Invalid file. The FileType.UNK file type is not supported in partition.",
  "stack": "--------------------------------------------------------------------------- ValueError Traceback (most recent call last) Cell In[1], line 4 1 from unstructured.partition.auto import partition 3 url = \"https://arstechnica.com/space/2024/04/nasa-still-doesnt-understand-root-cause-of-orion-heat-shield-issue/\" ----> 4 partition(url=url) File ~/miniconda3/envs/ai-tasks/lib/python3.11/site-packages/unstructured/partition/auto.py:541, in partition(filename, content_type, file, file_filename, url, include_page_breaks, strategy, encoding, paragraph_grouper, headers, skip_infer_table_types, ssl_verify, ocr_languages, languages, detect_language_per_element, pdf_infer_table_structure, extract_images_in_pdf, extract_image_block_types, extract_image_block_output_dir, extract_image_block_to_payload, xml_keep_tags, data_source_metadata, metadata_filename, request_timeout, hi_res_model_name, model_name, date_from_file_object, starting_page_number, **kwargs) 539 else: 540 msg = \"Invalid file\" if not filename else f\"Invalid file {filename}\" --> 541 raise ValueError(f\"{msg}. The {filetype} file type is not supported in partition.\") 543 for element in elements: 544 element.metadata.url = url ValueError: Invalid file. The FileType.UNK file type is not supported in partition."
}
```

This PR fixes the issue by parsing the mime-type out of the `Content-Type` header string.

Closes Unstructured-IO#2257
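
As a rough illustration of the kind of parsing described above, the following sketch extracts the bare mime-type from a `Content-Type` header value by discarding everything after the first `;`. The helper name `mime_type_from_content_type` is hypothetical and is not taken from the `unstructured` code base; the actual fix in the PR may be implemented differently.

```python
def mime_type_from_content_type(header_value: str) -> str:
    """Return the media type portion of a Content-Type header value.

    Hypothetical helper for illustration only: anything after the first ';'
    (e.g. 'charset=UTF-8') is treated as a parameter and dropped.
    """
    # Split once on ';' so parameters never end up in the mime-type.
    media_type = header_value.split(";", 1)[0]
    # Media types are case-insensitive, so normalise whitespace and case.
    return media_type.strip().lower()


content_type = "text/html; charset=UTF-8"
print(mime_type_from_content_type(content_type))
```

With a helper like this, a header of `text/html; charset=UTF-8` and a header of `text/html` resolve to the same mime-type, so file-type detection no longer depends on whether the server sends parameters.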
Fix: avoid elements sharing the same memory address (Unstructured-IO#2940)

This PR attempts to fix a memory issue, which resulted in errors like this: Unstructured-IO#2931

The root cause seems to be in how ListItems are being combined, not in how hashes or parent IDs are updated. When `assign_and_map_hash_ids()` is called and elements (or elements' metadata) do not have unique memory addresses, then updating the parent_id of one element will also overwrite the parent_id of some other element.

---------

Co-authored-by: cragwolfe <crag@unstructured.io>
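
The aliasing problem described above can be reproduced with plain Python objects. The sketch below uses stand-in `Element` and `Metadata` dataclasses, which are assumptions for illustration and not the real `unstructured` classes. Copying the metadata with `copy.deepcopy` is shown as one possible remedy, not necessarily the one this PR applies.

```python
import copy
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Metadata:
    # Stand-in for element metadata; only the field relevant here.
    parent_id: Optional[str] = None


@dataclass
class Element:
    # Stand-in for a document element that owns a metadata object.
    text: str
    metadata: Metadata = field(default_factory=Metadata)


def demo_shared() -> Optional[str]:
    """Two elements point at the SAME metadata object."""
    shared = Metadata()
    a, b = Element("first item", shared), Element("second item", shared)
    a.metadata.parent_id = "p1"  # intended for `a` only
    return b.metadata.parent_id  # but `b` sees the change too


def demo_copied() -> Optional[str]:
    """Each element gets its own copy of the metadata."""
    template = Metadata()
    a = Element("first item", copy.deepcopy(template))
    b = Element("second item", copy.deepcopy(template))
    a.metadata.parent_id = "p1"
    return b.metadata.parent_id  # `b` is unaffected


print(demo_shared(), demo_copied())
```

The first function shows how an update meant for one element leaks into another when both share a memory address; the second shows that the leak disappears once each element owns a distinct object.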
fix: reqs arm64 friendly again. release 0.13.4 (Unstructured-IO#2935) Cut a release. Run pip-compile on mac to avoid `nvidia-*` requirements creeping into `requirements/extra-pdf-image.txt`. This should fix arm64 image builds that have been breaking on main.
chore: bump unstructured-inference pin (Unstructured-IO#2913) **Summary** Update dependencies to use the new version of `unstructured-inference` released yesterday. Remedy a few small problems with `make pip-compile` that stood in the way.
fix: Brings back missing word list files (Unstructured-IO#2857) Fixes Unstructured-IO#2855
build(release): release commit for 0.13.1 (Unstructured-IO#2850)
build(release): release commit for 0.13.0 (Unstructured-IO#2732)
Unstructured v0.12.6 release (Unstructured-IO#2626)

## 0.12.6

### Enhancements

* **Improve ability to capture embedded links in `partition_pdf()` for `fast` strategy** Previously, a threshold value that affects the capture of embedded links was fixed by default. Users can now specify the threshold value to improve link capture.
* **Refactor `add_chunking_strategy` decorator to dispatch by name.** Add `chunk()` function to be used by the `add_chunking_strategy` decorator to dispatch chunking call based on a chunking-strategy name (that can be dynamic at runtime). This decouples chunking dispatch from only those chunkers known at "compile" time and enables runtime registration of custom chunkers.

### Features

* **Added Unstructured Platform Documentation** The Unstructured Platform is currently in beta. The documentation provides how-to guides for setting up workflow automation, job scheduling, and configuring source and destination connectors.

### Fixes

* **Partitioning raises on file-like object with `.name` not a local file path.** When partitioning a file using the `file=` argument, and `file` is a file-like object (e.g. `io.BytesIO`) having a `.name` attribute, and the value of `file.name` is not a valid path to a file present on the local filesystem, `FileNotFoundError` is raised. This prevents use of the `file.name` attribute for downstream purposes to, for example, describe the source of a document retrieved from a network location via HTTP.
* **Fix SharePoint dates with inconsistent formatting** Adds logic to conditionally support dates returned by office365 that may vary in date formatting or may be a datetime rather than a string.
* **Include warnings** about the potential risk of installing a version of `pandoc` which does not support RTF files, plus instructions that will help resolve that issue.
* **Incorporate the `install-pandoc` Makefile recipe** into relevant stages of CI workflow, ensuring it is a version that supports RTF input files.
* **Fix Google Drive source key** Allow passing a string for the source connector key.
* **Fix table structure evaluations calculations** Replaced the special value `-1.0` with `np.nan` and corrected the row filtering of file metrics accordingly.
* **Fix Sharepoint-with-permissions test** Ignore permissions metadata, update test.
* **Fix table structure evaluations for edge case** Fixes the case where the prediction does not contain any table; this no longer raises an error.
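
The "dispatch by name" idea in the `add_chunking_strategy` entry above can be sketched as a small registry keyed by strategy name. Everything here is an assumption for illustration: the `register_chunker` helper, the `_CHUNKERS` dictionary, the signature of `chunk()`, and the use of plain strings as elements are hypothetical and do not reflect the actual `unstructured` API.

```python
from typing import Callable, Dict, List

# A chunker takes a list of elements and returns a list of chunks.
# Plain strings stand in for document elements in this sketch.
Chunker = Callable[[List[str]], List[str]]

# Hypothetical registry mapping a strategy name to its chunker.
_CHUNKERS: Dict[str, Chunker] = {}


def register_chunker(name: str, chunker: Chunker) -> None:
    """Register a chunker at runtime under the given strategy name."""
    _CHUNKERS[name] = chunker


def chunk(elements: List[str], chunking_strategy: str) -> List[str]:
    """Look up the chunker by name and apply it to the elements."""
    try:
        chunker = _CHUNKERS[chunking_strategy]
    except KeyError:
        raise ValueError(
            f"unrecognized chunking strategy {chunking_strategy!r}"
        ) from None
    return chunker(elements)


def pairs(elements: List[str]) -> List[str]:
    """Toy custom chunker: join elements two at a time."""
    return [" ".join(elements[i:i + 2]) for i in range(0, len(elements), 2)]


register_chunker("pairs", pairs)
print(chunk(["a", "b", "c"], "pairs"))
```

Because the lookup happens when `chunk()` is called, a chunker registered after import time is dispatched the same way as a built-in one, which is the decoupling the release note describes.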
build(release): release commit for 0.12.5 (Unstructured-IO#2585)