Tags: TJX2014/unstructured

0.13.7

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature.
chore: bump unstructured-inference 0.7.31 (Unstructured-IO#2981)

0.13.6

fix: parse URL response Content-Type according to RFC 9110 (Unstructured-IO#2950)

Currently, `file_and_type_from_url()` does not correctly handle the
`Content-Type` header. Specifically, it assumes that the header contains
only the MIME type (e.g. `text/html`). However, [RFC
9110](https://www.rfc-editor.org/rfc/rfc9110#field.content-type) allows
additional parameters, such as `charset`, to be returned in the header.
This leads to a `ValueError` when loading a URL with a response
`Content-Type` header such as `text/html; charset=UTF-8`.

To reproduce the issue:

```python
from unstructured.partition.auto import partition

url = "https://arstechnica.com/space/2024/04/nasa-still-doesnt-understand-root-cause-of-orion-heat-shield-issue/"
partition(url=url)
```

This results in the following exception:

```text
{
	"name": "ValueError",
	"message": "Invalid file. The FileType.UNK file type is not supported in partition.",
	"stack": "---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[1], line 4
      1 from unstructured.partition.auto import partition
      3 url = \"https://arstechnica.com/space/2024/04/nasa-still-doesnt-understand-root-cause-of-orion-heat-shield-issue/\"
----> 4 partition(url=url)

File ~/miniconda3/envs/ai-tasks/lib/python3.11/site-packages/unstructured/partition/auto.py:541, in partition(filename, content_type, file, file_filename, url, include_page_breaks, strategy, encoding, paragraph_grouper, headers, skip_infer_table_types, ssl_verify, ocr_languages, languages, detect_language_per_element, pdf_infer_table_structure, extract_images_in_pdf, extract_image_block_types, extract_image_block_output_dir, extract_image_block_to_payload, xml_keep_tags, data_source_metadata, metadata_filename, request_timeout, hi_res_model_name, model_name, date_from_file_object, starting_page_number, **kwargs)
    539 else:
    540     msg = \"Invalid file\" if not filename else f\"Invalid file {filename}\"
--> 541     raise ValueError(f\"{msg}. The {filetype} file type is not supported in partition.\")
    543 for element in elements:
    544     element.metadata.url = url

ValueError: Invalid file. The FileType.UNK file type is not supported in partition."
}
```

This PR fixes the issue by parsing the mime-type out of the
`Content-Type` header string.
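
As a minimal sketch of the idea (not the code from the PR itself), the media type can be separated from any trailing parameters by splitting on the first `;`. The helper name `mime_type_from_content_type` is hypothetical and is used here only for illustration.

```python
def mime_type_from_content_type(header_value: str) -> str:
    """Return the bare media type from a Content-Type header value.

    Hypothetical helper for illustration; it is not part of the
    unstructured API. Parameters such as `charset` are dropped.
    """
    # Everything after the first ";" is a parameter (e.g. charset=UTF-8).
    media_type = header_value.split(";", 1)[0]
    # Media types are case-insensitive, so normalize to lowercase.
    return media_type.strip().lower()


html_type = mime_type_from_content_type("text/html; charset=UTF-8")
pdf_type = mime_type_from_content_type("application/pdf")
```

With this approach, a header of `text/html; charset=UTF-8` and a header of `text/html` both resolve to the same media type, so file-type detection no longer depends on whether the server sends a `charset`.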


Closes Unstructured-IO#2257

0.13.5

Fix: avoid elements sharing the same memory address (Unstructured-IO#2940)

This PR attempts to fix a memory issue that resulted in errors like the
one reported in Unstructured-IO#2931.
The root cause seems to be in how ListItems are being combined, not in
how hashes or parent IDs are updated.

When `assign_and_map_hash_ids()` is called and elements (or elements'
metadata) do not have unique memory addresses, then updating the
`parent_id` of one element will also overwrite the `parent_id` of some
other element.
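
The aliasing problem described above can be reproduced in isolation. The sketch below uses stand-in `Element` and `Metadata` dataclasses, which are assumptions for illustration and not the library's real classes, and the `combine_*` helpers are hypothetical as well.

```python
import copy
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Metadata:
    # Stand-in for element metadata; only the field relevant here.
    parent_id: Optional[str] = None


@dataclass
class Element:
    text: str
    metadata: Metadata = field(default_factory=Metadata)


def combine_shared(texts: List[str], metadata: Metadata) -> List[Element]:
    # Problematic pattern: every element references the same Metadata object.
    return [Element(text, metadata) for text in texts]


def combine_copied(texts: List[str], metadata: Metadata) -> List[Element]:
    # Safer pattern: each element gets its own independent copy.
    return [Element(text, copy.deepcopy(metadata)) for text in texts]


shared = combine_shared(["first item", "second item"], Metadata())
shared[0].metadata.parent_id = "p1"  # also changes shared[1].metadata

copied = combine_copied(["first item", "second item"], Metadata())
copied[0].metadata.parent_id = "p1"  # copied[1].metadata is unaffected
```

In the shared case both elements report the same `parent_id` after a single assignment, which matches the symptom described in this commit message.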

---------

Co-authored-by: cragwolfe <crag@unstructured.io>

0.13.4

fix: reqs arm64 friendly again. release 0.13.4 (Unstructured-IO#2935)

Cut a release.

Run `pip-compile` on macOS to keep `nvidia-*` requirements from creeping
into `requirements/extra-pdf-image.txt`. This should fix the arm64 image
builds that have been breaking on main.

0.13.3

chore: bump unstructured-inference pin (Unstructured-IO#2913)

**Summary**
Update dependencies to use the new version of `unstructured-inference`
released yesterday. Remedy a few small problems with `make pip-compile`
that stood in the way.

0.13.2

fix: Brings back missing word list files (Unstructured-IO#2857)

Fixes Unstructured-IO#2855

0.13.1

build(release): release commit for 0.13.1 (Unstructured-IO#2850)

0.13.0

build(release): release commit for 0.13.0 (Unstructured-IO#2732)

0.12.6

Unstructured v0.12.6 release (Unstructured-IO#2626)

## 0.12.6

### Enhancements

* **Improve ability to capture embedded links in `partition_pdf()` for
`fast` strategy.** Previously, the threshold that affects the capture of
embedded links was fixed. Users can now specify the threshold value to
capture links more reliably.
* **Refactor `add_chunking_strategy` decorator to dispatch by name.**
Add `chunk()` function to be used by the `add_chunking_strategy`
decorator to dispatch chunking call based on a chunking-strategy name
(that can be dynamic at runtime). This decouples chunking dispatch from
only those chunkers known at "compile" time and enables runtime
registration of custom chunkers.
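
Dispatch by name, as described in the second item above, can be sketched with a plain registry dictionary. This is an illustrative sketch only: the `register_chunker` helper, the `_CHUNKERS` registry, and the signature of `chunk()` shown here are assumptions, not the library's actual implementation.

```python
from typing import Callable, Dict, List

# A chunker takes a list of element texts and returns a list of chunks.
Chunker = Callable[[List[str]], List[str]]

# Hypothetical registry mapping a strategy name to its chunker.
_CHUNKERS: Dict[str, Chunker] = {}


def register_chunker(name: str, chunker: Chunker) -> None:
    """Register a chunker at runtime under a strategy name."""
    _CHUNKERS[name] = chunker


def chunk(elements: List[str], chunking_strategy: str) -> List[str]:
    """Look up the chunker by name and apply it to the elements."""
    try:
        chunker = _CHUNKERS[chunking_strategy]
    except KeyError:
        raise ValueError(
            f"unrecognized chunking strategy {chunking_strategy!r}"
        ) from None
    return chunker(elements)


def pair_up(elements: List[str]) -> List[str]:
    # Toy custom chunker: join consecutive elements two at a time.
    return [" ".join(elements[i:i + 2]) for i in range(0, len(elements), 2)]


register_chunker("pairs", pair_up)
chunks = chunk(["a", "b", "c"], "pairs")
```

Because the lookup happens when `chunk()` is called, a chunker registered after import time is still reachable by name.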

### Features

* **Added Unstructured Platform Documentation** The Unstructured
Platform is currently in beta. The documentation provides how-to guides
for setting up workflow automation, job scheduling, and configuring
source and destination connectors.

### Fixes

* **Partitioning raises on file-like object with `.name` not a local
file path.** When partitioning a file using the `file=` argument, and
`file` is a file-like object (e.g. `io.BytesIO`) having a `.name`
attribute, and the value of `file.name` is not a valid path to a file
present on the local filesystem, `FileNotFoundError` is raised. This
prevents use of the `file.name` attribute for downstream purposes, for
example to describe the source of a document retrieved from a network
location via HTTP.
* **Fix SharePoint dates with inconsistent formatting** Adds logic to
conditionally support dates returned by office365 that may vary in date
formatting or may be a datetime rather than a string.
* **Include warnings** about the potential risk of installing a version
of `pandoc` that does not support RTF files, plus instructions to help
resolve that issue.
* **Incorporate the `install-pandoc` Makefile recipe** into relevant
stages of CI workflow, ensuring it is a version that supports RTF input
files.
* **Fix Google Drive source key** Allow passing string for source
connector key.
* **Fix table structure evaluation calculations** Replaced the special
value `-1.0` with `np.nan` and corrected the row filtering of file
metrics based on that.
* **Fix Sharepoint-with-permissions test** Ignore permissions metadata,
update test.
* **Fix table structure evaluations for edge case** Fixes the issue that
occurred when the prediction does not contain any table; this case no
longer raises an error.
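
For the SharePoint date fix listed above, conditional handling of a value that may be either a `datetime` or a string in varying formats could look roughly like the following. This is a sketch under stated assumptions: the helper name `normalize_date` and the fallback format are hypothetical, since the formats actually returned by office365 are not given here.

```python
from datetime import datetime
from typing import Optional, Union

# Assumed fallback format for illustration only; the real set of formats
# returned by the service is not specified in these release notes.
_FALLBACK_FORMATS = ("%m/%d/%Y %H:%M:%S",)


def normalize_date(value: Union[datetime, str, None]) -> Optional[str]:
    """Return an ISO 8601 string, or None if the value cannot be parsed."""
    if value is None:
        return None
    if isinstance(value, datetime):
        # Already a datetime: no string parsing needed.
        return value.isoformat()
    try:
        # Try ISO 8601 first.
        return datetime.fromisoformat(value).isoformat()
    except ValueError:
        pass
    for fmt in _FALLBACK_FORMATS:
        try:
            return datetime.strptime(value, fmt).isoformat()
        except ValueError:
            continue
    return None


from_datetime = normalize_date(datetime(2024, 3, 1, 12, 30))
from_iso = normalize_date("2024-03-01T12:30:00")
from_other = normalize_date("03/01/2024 12:30:00")
```

All three inputs normalize to the same string, so downstream metadata does not depend on which representation the source returned.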

0.12.5

build(release): release commit for 0.12.5 (Unstructured-IO#2585)