0.5.5-dev0

Enhancements

Features

Add clean_non_ascii_chars to remove non-ascii characters from unicode string
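
A minimal sketch of what stripping non-ASCII characters can look like in plain Python. The helper name `strip_non_ascii` is hypothetical, and this is not necessarily how `clean_non_ascii_chars` is implemented:

```python
def strip_non_ascii(text: str) -> str:
    """Return `text` with every non-ASCII character removed."""
    # errors="ignore" silently drops any code point outside the ASCII range.
    return text.encode("ascii", errors="ignore").decode("ascii")


if __name__ == "__main__":
    print(strip_non_ascii("\x88This text contains non-ascii characters!\x88"))
```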

Fixes

0.5.4

Enhancements

Added Biomedical literature connector for ingest cli.
Add FsspecConnector to easily integrate any existing fsspec filesystem as a connector.
Rename s3_connector.py to s3.py for readability and consistency with the rest of the connectors.
Now S3Connector relies on s3fs instead of on boto3, and it inherits from FsspecConnector.
Adds an UNSTRUCTURED_LANGUAGE_CHECKS environment variable to control whether language-specific checks like vocabulary and POS tagging are applied. Set to "true" for higher resolution partitioning and "false" for faster processing.
Improves detect_filetype warning to include filename when provided.
Adds a "fast" strategy for partitioning PDFs with PDFMiner. Also falls back to the "fast" strategy if detectron2 is not available.
Start deprecation life cycle for unstructured-ingest --s3-url option, to be deprecated in favor of --remote-url.
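
The UNSTRUCTURED_LANGUAGE_CHECKS toggle listed above can be set from Python before any partitioning happens. The sketch below sets the variable and shows one way such a flag might be read; `language_checks_enabled` is a hypothetical helper, and treating an unset variable as "false" is an assumption, not the library's documented default:

```python
import os

# Favor speed: ask for vocabulary and POS-tagging checks to be skipped.
os.environ["UNSTRUCTURED_LANGUAGE_CHECKS"] = "false"


def language_checks_enabled() -> bool:
    """Hypothetical reader for the flag; only the string "true" enables checks."""
    value = os.environ.get("UNSTRUCTURED_LANGUAGE_CHECKS", "false")
    return value.strip().lower() == "true"


if __name__ == "__main__":
    print(language_checks_enabled())
```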

Features

Add AzureBlobStorageConnector based on its fsspec implementation inheriting from FsspecConnector
Add partition_epub for partitioning e-books in EPUB3 format.

Fixes

Fixes processing for text files with message/rfc822 MIME type.
Open xml files in read-only mode when reading contents to construct an XMLDocument.

0.5.3

Enhancements

auto.partition() can now load Unstructured ISD json documents.
Simplify partitioning functions.
Improve logging for ingest CLI.

Features

Add --wikipedia-auto-suggest argument to the ingest CLI to disable automatic redirection to pages with similar names.
Add setup script for Amazon Linux 2
Add optional encoding argument to the partition_(text/email/html) functions.
Added Google Drive connector for ingest cli.
Added Gitlab connector for ingest cli.

Fixes

0.5.2

Enhancements

Fully move from printing to logging.
unstructured-ingest now uses a default --download_dir of $HOME/.cache/unstructured/ingest rather than a "tmp-ingest-" dir in the working directory.

Features

Fixes

setup_ubuntu.sh no longer fails in some contexts by interpreting DEBIAN_FRONTEND=noninteractive as a command
unstructured-ingest no longer re-downloads files when --preserve-downloads is used without --download-dir.
Fixed an issue that was causing text to be skipped in some HTML documents.

0.5.1

Enhancements

Features

Fixes

Fixes an error causing JavaScript to appear in the output of partition_html sometimes.
Fix several issues with the requires_dependencies decorator, including the error message and how it was used, which had caused an error for unstructured-ingest --github-url ....

0.5.0

Enhancements

Add requires_dependencies Python decorator to check dependencies are installed before instantiating a class or running a function
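
A rough sketch of how a dependency-checking decorator of this kind can work. It is an illustration only: the signature, the error message, and the use of `importlib.util.find_spec` are assumptions and may differ from the library's actual `requires_dependencies`. The check only handles top-level module names.

```python
import functools
import importlib.util
from typing import Callable, List, Union


def requires_dependencies(dependencies: Union[str, List[str]]) -> Callable:
    """Raise ImportError at call time if any named top-level module is missing."""
    if isinstance(dependencies, str):
        dependencies = [dependencies]

    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # find_spec returns None when a top-level module cannot be located.
            missing = [d for d in dependencies if importlib.util.find_spec(d) is None]
            if missing:
                raise ImportError(
                    f"Missing dependencies for {func.__name__}: {', '.join(missing)}"
                )
            return func(*args, **kwargs)

        return wrapper

    return decorator


@requires_dependencies(["json"])
def dump(value):
    import json

    return json.dumps(value)
```

Checking inside the wrapper rather than at decoration time means importing the module never fails; the error only surfaces when the decorated function is actually called.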

Features

Added Wikipedia connector for ingest cli.

Fixes

Fix process_document file cleaning on failure
Fixes an error introduced in the metadata tracking commit that caused NarrativeText and FigureCaption elements to be represented as Text in HTML documents.

0.4.16

Enhancements

Fall back to using file extensions for filetype detection if libmagic is not present
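
The fallback idea can be sketched as follows. The mapping and the function names are hypothetical and much smaller than anything a real detector would use; the only external call is `magic.from_file(..., mime=True)` from the python-magic package, which is used when it can be imported:

```python
import os
from typing import Optional

# Hypothetical, deliberately small extension table for illustration.
EXTENSION_TO_MIME = {
    ".pdf": "application/pdf",
    ".html": "text/html",
    ".txt": "text/plain",
    ".md": "text/markdown",
}


def guess_from_extension(filename: str) -> Optional[str]:
    """Look up a MIME type from the file extension alone."""
    _, extension = os.path.splitext(filename)
    return EXTENSION_TO_MIME.get(extension.lower())


def guess_mime_type(filename: str) -> Optional[str]:
    """Prefer libmagic when available, otherwise fall back to the extension."""
    try:
        import magic  # python-magic; importing fails if libmagic is absent
    except ImportError:
        return guess_from_extension(filename)
    return magic.from_file(filename, mime=True)
```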

Features

Added setup script for Ubuntu
Added GitHub connector for ingest cli.
Added partition_md partitioner.
Added Reddit connector for ingest cli.

Fixes

Initializes connector properly in ingest.main::MainProcess
Restricts version of unstructured-inference to avoid multithreading issue

0.4.15

Enhancements

Added elements_to_json and elements_from_json for easier serialization/deserialization
convert_to_dict, dict_to_elements and convert_to_csv are now aliases for functions that use the ISD terminology.

Fixes

Update to ensure all elements are preserved during serialization/deserialization

0.4.14

Automatically install nltk models in the tokenize module.

0.4.13

Fixes unstructured-ingest cli.

0.4.12

Adds console_entrypoint for unstructured-ingest, other structure/doc updates related to ingest.
Add parser parameter to partition_html.

0.4.11

Adds partition_doc for partitioning Word documents in .doc format. Requires libreoffice.
Adds partition_ppt for partitioning PowerPoint documents in .ppt format. Requires libreoffice.

0.4.10

Fixes ElementMetadata so that it's JSON serializable when the filename is a Path object.
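
For context, `json.dumps` rejects `pathlib.Path` values, so metadata holding a `Path` has to be converted first. The sketch below shows the general idea with a plain dictionary; `to_jsonable` is a hypothetical helper, not the actual ElementMetadata fix:

```python
import json
from pathlib import Path


def to_jsonable(metadata: dict) -> dict:
    """Return a copy of `metadata` with Path values converted to strings."""
    return {
        key: str(value) if isinstance(value, Path) else value
        for key, value in metadata.items()
    }


if __name__ == "__main__":
    metadata = {"filename": Path("report.pdf"), "page_number": 1}
    # json.dumps(metadata) would raise TypeError because of the Path value.
    print(json.dumps(to_jsonable(metadata)))
```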

0.4.9

Added ingest modules and s3 connector, sample ingest script
Default to url=None for partition_pdf and partition_image
Add ability to skip English specific check by setting the UNSTRUCTURED_LANGUAGE env var to "".
Document Element objects now track metadata

0.4.8

Modified XML and HTML parsers not to load comments.

0.4.7

Added the ability to pull an HTML document from a url in partition_html.
Added the ability to get file summary info from lists of filenames and lists of file contents.
Added optional page break to partition for .pptx, .pdf, images, and .html files.
Added to_dict method to document elements.
Include more unicode quotes in replace_unicode_quotes.

0.4.6

Loosen the default cap threshold to 0.5.
Add a UNSTRUCTURED_NARRATIVE_TEXT_CAP_THRESHOLD environment variable for controlling the cap ratio threshold.
Unknown text elements are identified as Text for HTML and plain text documents.
Body Text styles no longer default to NarrativeText for Word documents. The style information is insufficient to determine that the text is narrative.
Upper cased text is lower cased before checking for verbs. This helps avoid some missed verbs.
Adds an Address element for capturing elements that only contain an address.
Suppress the UserWarning when detectron is called.
Checks that titles and narrative text have at least one English word.
Checks that titles and narrative text are at least 50% alpha characters.
Restricts titles to a maximum word length. Adds a UNSTRUCTURED_TITLE_MAX_WORD_LENGTH environment variable for controlling the max number of words in a title.
Updated partition_pptx to order the elements on the page
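
Two of the heuristics in this release, the cap ratio threshold and the 50% alpha character check, can be sketched roughly as below. The exact definitions are assumptions made for illustration (capitalized words over all words, alphabetic characters over non-whitespace characters), and `under_alpha_ratio` is a hypothetical name; only the environment variable name and the 0.5 default come from the entries above:

```python
import os


def exceeds_cap_ratio(text: str, threshold: float = 0.5) -> bool:
    """True if the share of capitalized words is above the threshold."""
    # The environment variable, when set, overrides the default threshold.
    threshold = float(
        os.environ.get("UNSTRUCTURED_NARRATIVE_TEXT_CAP_THRESHOLD", threshold)
    )
    words = text.split()
    if not words:
        return False
    capitalized = sum(1 for word in words if word[0].isupper())
    return capitalized / len(words) > threshold


def under_alpha_ratio(text: str, threshold: float = 0.5) -> bool:
    """True if fewer than `threshold` of the non-space characters are letters."""
    characters = [char for char in text if not char.isspace()]
    if not characters:
        return False
    alpha = sum(1 for char in characters if char.isalpha())
    return alpha / len(characters) < threshold
```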

0.4.4

Updated partition_pdf and partition_image to return unstructured Element objects
Fixed the healthcheck url path when partitioning images and PDFs via API
Adds an optional coordinates attribute to document objects
Adds FigureCaption and CheckBox document elements
Added ability to split lists detected in LayoutElement objects
Adds partition_pptx for partitioning PowerPoint documents
LayoutParser models now download from Hugging Face Hub instead of Dropbox
Fixed file type detection for XML and HTML files on Amazon Linux

0.4.3

Adds requests as a base dependency
Fix in exceeds_cap_ratio so the function doesn't break with empty text
Fix bug in _parse_received_data.
Update detect_filetype to properly handle .doc, .xls, and .ppt.

0.4.2

Added partition_image to process documents in an image format.
Fixed utf-8 encoding error in partition_email with attachments for text/html

0.4.1

Added support for text files in the partition function
Pinned opencv-python for easier installation on Linux

0.4.0

Added generic partition brick that detects the file type and routes a file to the appropriate partitioning brick.
Added a file type detection module.
Updated partition_html and partition_eml to support file-like objects in 'rb' mode.
Cleaning brick for removing ordered bullets clean_ordered_bullets.
Extract brick method for ordered bullets extract_ordered_bullets.
Test for clean_ordered_bullets.
Test for extract_ordered_bullets.
Added partition_docx for pre-processing Word Documents.
Added new REGEX patterns to extract email header information
Added new functions to extract header information parse_received_data and partition_header
Added new function to parse plain text files partition_text
Added new cleaner functions extract_ip_address, extract_ip_address_name, extract_mapi_id, extract_datetimetz
Add new Image element and function to find embedded images find_embedded_images
Added get_directory_file_info for summarizing information about source documents

0.3.5

Add support for local inference
Add new pattern to recognize plain text dash bullets
Add test for bullet patterns
Fix for partition_html that allows for processing div tags that have both text and child elements
Add ability to extract document metadata from .docx, .xlsx, and .jpg files.
Helper functions for identifying and extracting phone numbers
Add new function extract_attachment_info that extracts and decodes the attachment of an email.
Staging brick to convert a list of Elements to a pandas dataframe.
Add plain text functionality to partition_email

0.3.4

Python-3.7 compat

0.3.3

Removes BasicConfig from logger configuration
Adds the partition_email partitioning brick
Adds the replace_mime_encodings cleaning bricks
Small fix to HTML parsing related to processing list items with sub-tags
Add EmailElement data structure to store email documents

0.3.2

Added translate_text brick for translating text between languages
Add an apply method to make it easier to apply cleaners to elements

0.3.1

Added __init__.py to partition

0.3.0

Implement staging brick for Argilla. Converts lists of Text elements to argilla dataset classes.
Removing the local PDF parsing code and any dependencies and tests.
Reorganizes the staging bricks in the unstructured.partition module
Allow entities to be passed into the Datasaur staging brick
Added HTML escapes to the replace_unicode_quotes brick
Fix bad responses in partition_pdf to raise ValueError
Adds partition_html for partitioning HTML documents.

0.2.6

Small change to how _read is placed within the inheritance structure since it doesn't really apply to pdf
Add partitioning brick for calling the document image analysis API

0.2.5

Update python requirement to >=3.7

0.2.4

Add alternative way of importing Final to support google colab

0.2.3

Add cleaning bricks for removing prefixes and postfixes
Add cleaning bricks for extracting text before and after a pattern

0.2.2

Add staging brick for Datasaur

0.2.1

Added brick to convert an ISD dictionary to a list of elements
Update PDFDocument to use the from_file method
Added staging brick for CSV format for ISD (Initial Structured Data) format.
Added staging brick for separating text into attention window size chunks for transformers.
Added staging brick for LabelBox.
Added ability to upload LabelStudio predictions
Added utility function for JSONL reading and writing
Added staging brick for CSV format for Prodigy
Added staging brick for Prodigy
Added ability to upload LabelStudio annotations
Added text_field and id_field to stage_for_label_studio signature

0.2.0

Initial release of unstructured
