0.9.1-dev10

Enhancements

  • Enable partition_html to skip headers and footers with the skip_headers_and_footers flag.
  • Update partition_doc and partition_docx to track emphasized texts in the output
  • Adds post processing function filter_element_types
  • Set the default strategy for partitioning images to hi_res
  • Add page break parameter section in API documentation to sync with change in Prod API
  • Update partition_html to track emphasized texts in the output
  • Update XMLDocument._read_xml to create <p> tag element for the text enclosed in the <pre> tag
  • Add parameter include_tail_text to _construct_text to enable (skip) tail text inclusion

Features

Fixes

  • Remove unused _partition_via_api function
  • Fixed emoji bug in partition_xlsx.
  • Pass file_filename metadata when partitioning file object
  • Skip ingest test on missing Slack token
  • Add Dropbox variables to CI environments
  • Remove default encoding for ingest
  • Adds new element type EmailAddress for recognising email address in the text
  • Simplifies min_partition logic; makes partitions falling below the min_partition less likely.
  • Fix bug where ingest test check for number of files fails in smoke test

0.9.0

Enhancements

  • Dependencies are now split by document type, creating a slimmer base installation.

0.8.8

Enhancements

Features

Fixes

  • Rename "date" field to "last_modified"
  • Adds Box connector

0.8.7

Enhancements

  • Put back useful function split_by_paragraph

Features

Fixes

  • Fix argument order in NLTK download step

0.8.6

Enhancements

Features

Fixes

  • Remove debug print lines and non-functional code

0.8.5

Enhancements

  • Add parameter skip_infer_table_types to enable (skip) table extraction for other doc types
  • Adds optional Unstructured API unit tests in CI
  • Tracks last modified date for all document types.
  • Refactor the ingest cli to better support expanding supported connectors

0.8.3

Enhancements

Features

Fixes

  • NLTK now only gets downloaded if necessary.
  • Handling for empty tables in Word Documents and PowerPoints.

0.8.4

Enhancements

  • Additional tests and refactor of JSON detection.
  • Update functionality to retrieve image metadata from a page for document_to_element_list
  • Links are now tracked in partition_html output.
  • Set the file's current position to the beginning after reading the file in convert_to_bytes
  • Add min_partition kwarg that combines elements below a specified threshold and modifies splitting of strings longer than max partition so words are not split.
  • Add slide notes to pptx
  • Add --encoding directive to ingest
  • Improve json detection by detect_filetype
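
The min_partition entry above describes two behaviours: splitting long strings on word boundaries and merging short pieces. The sketch below illustrates both with standard-library Python only. The helper names `split_on_words` and `combine_small` are hypothetical, and the library's actual logic may differ (for example in how it treats a trailing short piece).

```python
from typing import List


def split_on_words(text: str, max_partition: int) -> List[str]:
    """Split text into chunks of at most max_partition characters,
    breaking only at whitespace so that no word is cut in half."""
    chunks: List[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= max_partition or not current:
            # A single word longer than max_partition is kept whole.
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


def combine_small(chunks: List[str], min_partition: int) -> List[str]:
    """Merge each chunk shorter than min_partition into the chunk that follows it."""
    combined: List[str] = []
    pending = ""
    for chunk in chunks:
        pending = f"{pending} {chunk}" if pending else chunk
        if len(pending) >= min_partition:
            combined.append(pending)
            pending = ""
    if pending:
        # A short trailing chunk has nothing to merge into, so it is kept as-is.
        combined.append(pending)
    return combined


print(split_on_words("the quick brown fox", 10))  # ['the quick', 'brown fox']
print(combine_small(["Hi", "there", "a longer sentence here"], 8))
```

In this sketch a merged chunk can exceed max_partition; a fuller implementation would need to decide which limit takes priority.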

Features

  • Adds Outlook connector
  • Add support for dpi parameter in inference library
  • Adds Onedrive connector.
  • Add Confluence connector for ingest cli to pull the body text from all documents from all spaces in a confluence domain.

Fixes

  • Fixes issue with email partitioning where From field was being assigned the To field value.
  • Use the image_metadata property of the PageLayout instance to get the page image info in the document_to_element_list
  • Add functionality to write images to computer storage temporarily instead of keeping them in memory for ocr_only strategy
  • Add functionality to convert a PDF in small chunks of pages at a time for ocr_only strategy
  • Adds .txt, .text, and .tab to list of extensions to check if file has a text/plain MIME type.
  • Enables filters to be passed to partition_doc so it doesn't error with LibreOffice7.
  • Removed old error message that's superseded by requires_dependencies.
  • Removes using hi_res as the default strategy value for partition_via_api and partition_multiple_via_api

0.8.1

Enhancements

  • Add support for Python 3.11

Features

Fixes

  • Fixed an issue where the auto strategy detected a scanned document as having extractable text and used the fast strategy, resulting in no output.
  • Fix list detection in MS Word documents.
  • Don't instantiate an element with a coordinate system when there isn't a way to get its location data.

0.8.0

Enhancements

  • Allow model used for hi res pdf partition strategy to be chosen when called.
  • Updated inference package

Features

  • Add metadata_filename parameter across all partition functions

Fixes

  • Update to ensure convert_to_dataframe grabs all of the metadata fields.

  • Adjust encoding recognition threshold value in detect_file_encoding

  • Fix KeyError when isd_to_elements doesn't find a type

  • Fix _output_filename for local connector, allowing single files to be written correctly to the disk

  • Fix for cases where an invalid encoding is extracted from an email header.

BREAKING CHANGES

  • Information about an element's location is no longer returned as top-level attributes of an element. Instead, it is returned in the coordinates attribute of the element's metadata.

0.7.12

Enhancements

  • Adds include_metadata kwarg to partition_doc, partition_docx, partition_email, partition_epub, partition_json, partition_msg, partition_odt, partition_org, partition_pdf, partition_ppt, partition_pptx, partition_rst, and partition_rtf

Features

  • Add Elasticsearch connector for ingest cli to pull specific fields from all documents in an index.
  • Adds Dropbox connector

Fixes

  • Fix tests that call unstructured-api by passing through an api-key
  • Fixed page breaks being given (incorrect) page numbers
  • Fix skipping download on ingest when a source document exists locally

0.7.11

Enhancements

  • More deterministic element ordering when using hi_res PDF parsing strategy (from unstructured-inference bump to 0.5.4)
  • Make large model available (from unstructured-inference bump to 0.5.3)
  • Combine inferred elements with extracted elements (from unstructured-inference bump to 0.5.2)
  • partition_email and partition_msg will now process attachments if process_attachments=True and an attachment partitioning function is passed through with attachment_partitioner=partition.

Features

Fixes

  • Fix tests that call unstructured-api by passing through an api-key
  • Fixed page breaks being given (incorrect) page numbers
  • Fix skipping download on ingest when a source document exists locally

0.7.10

Enhancements

  • Adds a max_partition parameter to partition_text, partition_pdf, partition_email, partition_msg and partition_xml that sets a limit for the size of individual document elements. Defaults to 1500 for everything except partition_xml, which has a default value of None.
  • DRY connector refactor

Features

  • hi_res model for pdfs and images is selectable via environment variable.

Fixes

  • CSV check now ignores escaped commas.
  • Fix for filetype exploration util when file content does not have a comma.
  • Adds negative lookahead to bullet pattern to avoid detecting plain text line breaks like ------- as list items.
  • Fix pre tag parsing for partition_html
  • Fix lookup error for annotated Arabic and Hebrew encodings
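
The negative lookahead fix above can be illustrated with a small regular expression. This is a minimal sketch, not the pattern the library ships: `BULLET_RE` and `looks_like_list_item` are hypothetical names, and the set of bullet characters is an assumption.

```python
import re

# Hypothetical pattern: a dash, asterisk, or bullet character at the start of a
# line counts as a list marker, unless it is immediately followed by another
# dash. The (?!-) lookahead is what rejects separator lines such as "-------".
BULLET_RE = re.compile(r"^\s*[-*\u2022](?!-)\s*\S")


def looks_like_list_item(line: str) -> bool:
    """Return True if the line starts with a bullet marker followed by text."""
    return BULLET_RE.match(line) is not None


print(looks_like_list_item("- first item"))  # True
print(looks_like_list_item("-------"))       # False
```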

0.7.9

Enhancements

  • Improvements to string check for leafs in partition_xml.
  • Adds --partition-ocr-languages to unstructured-ingest.

Features

  • Adds partition_org for processing Org Mode documents.

Fixes

0.7.8

Enhancements

Features

  • Adds Google Cloud Service connector

Fixes

  • Updates the parse_email for partition_eml so that unstructured-api passes the smoke tests
  • partition_email now works if there is no message content
  • Updates the "fast" strategy for partition_pdf so that it's able to recursively
  • Adds recursive functionality to all fsspec connectors
  • Adds generic --recursive ingest flag

0.7.7

Enhancements

  • Adds functionality to replace the MIME encodings for eml files with one of the common encodings if a unicode error occurs
  • Adds missed file-like object handling in detect_file_encoding
  • Adds functionality to extract charset info from eml files

Features

  • Added coordinate system class to track coordinate types and convert to different coordinate

Fixes

  • Adds an html_assemble_articles kwarg to partition_html to enable users to control whether content outside of <article> tags is captured when <article> tags are present.
  • Check for the xml attribute on element before looking for pagebreaks in partition_docx.

0.7.6

Enhancements

  • Convert fast strategy to ocr_only for images
  • Adds support for page numbers in .docx and .doc when user or renderer created page breaks are present.
  • Adds retry logic for the unstructured-ingest Biomed connector

Features

  • Provides users with the ability to extract additional metadata via regex.
  • Updates partition_docx to include headers and footers in the output.
  • Create partition_tsv and associated tests. Make additional changes to detect_filetype.

Fixes

  • Remove fake api key in test partition_via_api since we now require valid/empty api keys
  • Page number defaults to None instead of 1 when page number is not present in the metadata. A page number of None indicates that page numbers are not being tracked for the document or that page numbers do not apply to the element in question.
  • Fixes an issue with some pptx files. Assume pptx shapes are found in top left position of slide in case the shape.top and shape.left attributes are None.

0.7.5

Enhancements

  • Adds functionality to sort elements in partition_pdf for fast strategy
  • Adds ingest tests with --fast strategy on PDF documents
  • Adds --api-key to unstructured-ingest

Features

  • Adds partition_rst for processing reStructuredText documents.

Fixes

  • Adds handling for emails that do not have a datetime to extract.
  • Adds pdf2image package as core requirement of unstructured (with no extras)

0.7.4

Enhancements

  • Allows passing kwargs to request data field for partition_via_api and partition_multiple_via_api
  • Enable MIME type detection if libmagic is not available
  • Adds handling for empty files in detect_filetype and partition.

Features

Fixes

  • Resolve grpcio import issue on weaviate.schema.validate_schema for python 3.9 and 3.10
  • Remove building detectron2 from source in Dockerfile

0.7.3

Enhancements

  • Update IngestDoc abstractions and add data source metadata in ElementMetadata

Features

Fixes

  • Pass strategy parameter down from partition for partition_image
  • Filetype detection if a CSV has a text/plain MIME type
  • convert_office_doc no longer prints file conversion info messages to stdout.
  • partition_via_api reflects the actual filetype for the file processed in the API.

0.7.2

Enhancements

  • Adds an optional encoding kwarg to elements_to_json and elements_from_json
  • Bump version of base image to use new stable version of tesseract

Features

Fixes

  • Update the read_txt_file utility function to keep using spooled_to_bytes_io_if_needed for xml
  • Add functionality to the read_txt_file utility function to handle file-like object from URL
  • Remove the unused parameter encoding from partition_pdf
  • Change auto.py to have a None default for encoding
  • Add functionality to try other common encodings for html and xml files if an error related to the encoding is raised and the user has not specified an encoding.
  • Adds benchmark test with test docs in example-docs
  • Re-enable test_upload_label_studio_data_with_sdk
  • File detection now detects code files as plain text
  • Adds tabulate explicitly to dependencies
  • Fixes an issue in metadata.page_number of pptx files
  • Adds showing help if no parameters passed

0.7.1

Enhancements

Features

  • Add stage_for_weaviate to stage unstructured outputs for upload to Weaviate, along with a helper function for defining a class to use in Weaviate schemas.
  • Builds from the Unstructured base image, built off of Rocky Linux 8.7. This resolves almost all CVEs in the image.

Fixes

0.7.0

Enhancements

  • Installing detectron2 from source is no longer required when using the local-inference extra.
  • Updates .pptx parsing to include text in tables.

Features

Fixes

  • Fixes an issue in _add_element_metadata that caused all elements to have page_number=1 in the element metadata.
  • Adds .log as a file extension for TXT files.
  • Adds functionality to try other common encodings for email (.eml) files if an error related to the encoding is raised and the user has not specified an encoding.
  • Allow passed encoding to be used in the replace_mime_encodings
  • Fixes page metadata for partition_html when include_metadata=False
  • A ValueError now raises if file_filename is not specified when you use partition_via_api with a file-like object.

0.6.11

Enhancements

  • Supports epub tests since pandoc is updated in base image

Features

Fixes

0.6.10

Enhancements

  • XLS support from auto partition

Features

Fixes

0.6.9

Enhancements

  • fast strategy for pdf now keeps element bounding box data
  • setup.py refactor

Features

Fixes

  • Adds functionality to try other common encodings if an error related to the encoding is raised and the user has not specified an encoding.
  • Adds additional MIME types for CSV
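
The encoding fallback described above can be sketched as follows. This changelog does not list the encodings the library tries, so `COMMON_ENCODINGS` and `decode_with_fallback` are hypothetical and only show the general technique: honour an explicit encoding, otherwise try candidates in order.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical fallback order, chosen for illustration only.
COMMON_ENCODINGS: Sequence[str] = ("utf-8", "cp1252", "latin-1")


def decode_with_fallback(data: bytes, encoding: Optional[str] = None) -> Tuple[str, str]:
    """Return (text, encoding_used).

    If the caller specified an encoding, it is used as-is and any decode error
    propagates. Otherwise each common encoding is tried in turn.
    """
    if encoding is not None:
        return data.decode(encoding), encoding
    for candidate in COMMON_ENCODINGS:
        try:
            return data.decode(candidate), candidate
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


text, used = decode_with_fallback("café".encode("cp1252"))
print(text, used)  # café cp1252
```

Order matters in a list like this: latin-1 accepts every byte sequence, so it only makes sense as the last candidate.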

0.6.8

Enhancements

Features

  • Add partition_csv for CSV files.

Fixes

0.6.7

Enhancements

  • Deprecate --s3-url in favor of --remote-url in CLI
  • Refactor out non-connector-specific config variables
  • Add file_directory to metadata
  • Add page_name to metadata. Currently used for the sheet name in XLSX documents.
  • Added a --partition-strategy parameter to unstructured-ingest so that users can specify partition strategy in CLI. For example, --partition-strategy fast.
  • Added metadata for filetype.
  • Add Discord connector to pull messages from a list of channels
  • Refactor unstructured/file-utils/filetype.py to better utilise hashmap to return mime type.
  • Add local declaration of DOCX_MIME_TYPES and XLSX_MIME_TYPES for test_filetype.py.

Features

  • Add partition_xml for XML files.
  • Add partition_xlsx for Microsoft Excel documents.

Fixes

  • Supports hml filetype for partition as a variation of html filetype.
  • Makes pytesseract a function level import in partition_pdf so you can use the "fast" or "hi_res" strategies if pytesseract is not installed. Also adds the requires_dependencies decorator for the "hi_res" and "ocr_only" strategies.
  • Fix to ensure filename is tracked in metadata for docx tables.

0.6.6

Enhancements

  • Adds an "auto" strategy that chooses the partitioning strategy based on document characteristics and function kwargs. This is the new default strategy for partition_pdf and partition_image. Users can maintain existing behavior by explicitly setting strategy="hi_res".
  • Added an additional trace logger for NLP debugging.
  • Add get_date method to ElementMetadata for converting the datestring to a datetime object.
  • Cleanup the filename attribute on ElementMetadata to remove the full filepath.

Features

  • Added table reading as html with URL parsing to partition_docx in docx
  • Added metadata field for text_as_html for docx files

Fixes

  • fileutils/file_type check json and eml decode ignore error
  • partition_email was updated to more flexibly handle deviations from the RFC-2822 standard. The time in the metadata returns None if the time does not match RFC-2822 at all.
  • Include all metadata fields when converting to dataframe or CSV
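
The RFC-2822 behaviour noted above (returning None when the time does not match the standard) can be approximated with the standard library. This is a sketch under that assumption; `parse_email_date` is a hypothetical helper, not the function partition_email uses.

```python
from datetime import datetime
from email.utils import parsedate_to_datetime
from typing import Optional


def parse_email_date(value: str) -> Optional[datetime]:
    """Parse an RFC-2822 style date header, returning None if it cannot be parsed."""
    try:
        return parsedate_to_datetime(value)
    except (TypeError, ValueError):
        # Older Python versions raise TypeError for unparseable input,
        # newer ones raise ValueError; both are treated as "no date".
        return None


print(parse_email_date("Fri, 16 Dec 2022 17:04:16 -0500"))
print(parse_email_date("not a date"))  # None
```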

0.6.5

Enhancements

  • Added support for SpooledTemporaryFile file argument.

Features

Fixes

0.6.4

Enhancements

  • Added an "ocr_only" strategy for partition_pdf. Refactored the strategy decision logic into its own module.

Features

Fixes

0.6.3

Enhancements

  • Add an "ocr_only" strategy for partition_image.

Features

  • Added partition_multiple_via_api for partitioning multiple documents in a single REST API call.
  • Added stage_for_baseplate function to prepare outputs for ingestion into Baseplate.
  • Added partition_odt for processing Open Office documents.

Fixes

  • Updates the grouping logic in the partition_pdf fast strategy to group together text in the same bounding box.

0.6.2

Enhancements

  • Added logic to partition_pdf for detecting copy protected PDFs and falling back to the hi res strategy when necessary.

Features

  • Add partition_via_api for partitioning documents through the hosted API.

Fixes

  • Fix how exceeds_cap_ratio handles empty text (returns True instead of False)
  • Updates detect_filetype to properly detect JSONs when the MIME type is text/plain.

0.6.1

Enhancements

  • Updated the table extraction parameter name to be more descriptive

Features

Fixes

0.6.0

Enhancements

  • Adds an ssl_verify kwarg to partition and partition_html to enable turning off SSL verification for HTTP requests. SSL verification is on by default.
  • Allows users to pass in ocr language to partition_pdf and partition_image through the ocr_language kwarg. ocr_language corresponds to the code for the language pack in Tesseract. You will need to install the relevant Tesseract language pack to use a given language.

Features

  • Table extraction is now possible for pdfs from partition and partition_pdf.
  • Adds support for extracting attachments from .msg files

Fixes

  • Adds an ssl_verify kwarg to partition and partition_html to enable turning off SSL verification for HTTP requests. SSL verification is on by default.

0.5.13

Enhancements

  • Allow headers to be passed into partition when url is used.

Features

  • bytes_string_to_string cleaning brick for bytes string output.

Fixes

  • Fixed typo in call to exactly_one in partition_json
  • unstructured-documents encode xml string if document_tree is None in _read_xml.
  • Update to _read_xml so that Markdown files with embedded HTML process correctly.
  • Fallback to "fast" strategy only emits a warning if the user specifies the "hi_res" strategy.
  • unstructured-partition-text_type exceeds_cap_ratio fix returns and how capitalization ratios are calculated
  • partition_pdf and partition_text group broken paragraphs to avoid fragmented NarrativeText elements.
  • .json files resolved as "application/json" on centos7 (or other installs with older libmagic libs)

0.5.12

Enhancements

  • Add OS mimetypes DB to docker image, mainly for unstructured-api compat.
  • Use the image registry as a cache when building Docker images.
  • Adds the ability for partition_text to group together broken paragraphs.
  • Added method to utils to allow date time format validation

Features

  • Add Slack connector to pull messages for a specific channel

  • Add --partition-by-api parameter to unstructured-ingest

  • Added partition_rtf for processing rich text files.

  • partition now accepts a url kwarg in addition to file and filename.

Fixes

  • Allow encoding to be passed into replace_mime_encodings.
  • unstructured-ingest connector-specific dependencies are imported on demand.
  • unstructured-ingest --flatten-metadata supported for local connector.
  • unstructured-ingest fix runtime error when using --metadata-include.

0.5.11

Enhancements

Features

Fixes

  • Guard against null style attribute in docx document elements
  • Update HTML encoding to better support foreign language characters

0.5.10

Enhancements

  • Updated inference package
  • Add sender, recipient, date, and subject to element metadata for emails

Features

  • Added --download-only parameter to unstructured-ingest

Fixes

  • FileNotFound error when filename is provided but file is not on disk

0.5.9

Enhancements

Features

Fixes

  • Convert file to str in helper split_by_paragraph for partition_text

0.5.8

Enhancements

  • Update elements_to_json to return string when filename is not specified
  • elements_from_json may take a string instead of a filename with the text kwarg
  • detect_filetype now does a final fallback to file extension.
  • Empty tags are now skipped during the depth check for HTML processing.

Features

  • Add local file system to unstructured-ingest
  • Add --max-docs parameter to unstructured-ingest
  • Added partition_msg for processing MSFT Outlook .msg files.

Fixes

  • convert_file_to_text now passes through the source_format and target_format kwargs. Previously they were hard coded.
  • Partitioning functions that accept a text kwarg no longer raise an error if an empty string is passed (an empty list of elements is returned instead).
  • partition_json no longer fails if the input is an empty list.
  • Fixed bug in chunk_by_attention_window that caused the last word in segments to be cut-off in some cases.

BREAKING CHANGES

  • stage_for_transformers now returns a list of elements, making it consistent with other staging bricks

0.5.7

Enhancements

  • Refactored codebase using exactly_one
  • Adds ability to pass headers when passing a url in partition_html()
  • Added optional content_type and file_filename parameters to partition() to bypass file detection

Features

  • Add --flatten-metadata parameter to unstructured-ingest
  • Add --fields-include parameter to unstructured-ingest

Fixes

0.5.6

Enhancements

  • contains_english_word(), used heavily in text processing, is 10x faster.

Features

  • Add --metadata-include and --metadata-exclude parameters to unstructured-ingest
  • Add clean_non_ascii_chars to remove non-ascii characters from unicode string

Fixes

  • Fix problem with PDF partition (duplicated test)

0.5.4

Enhancements

  • Added Biomedical literature connector for ingest cli.
  • Add FsspecConnector to easily integrate any existing fsspec filesystem as a connector.
  • Rename s3_connector.py to s3.py for readability and consistency with the rest of the connectors.
  • Now S3Connector relies on s3fs instead of on boto3, and it inherits from FsspecConnector.
  • Adds an UNSTRUCTURED_LANGUAGE_CHECKS environment variable to control whether or not language specific checks like vocabulary and POS tagging are applied. Set to "true" for higher resolution partitioning and "false" for faster processing.
  • Improves detect_filetype warning to include filename when provided.
  • Adds a "fast" strategy for partitioning PDFs with PDFMiner. Also falls back to the "fast" strategy if detectron2 is not available.
  • Start deprecation life cycle for unstructured-ingest --s3-url option, to be deprecated in favor of --remote-url.

Features

  • Add AzureBlobStorageConnector based on its fsspec implementation inheriting from FsspecConnector
  • Add partition_epub for partitioning e-books in EPUB3 format.

Fixes

  • Fixes processing for text files with message/rfc822 MIME type.
  • Open xml files in read-only mode when reading contents to construct an XMLDocument.

0.5.3

Enhancements

  • auto.partition() can now load Unstructured ISD json documents.
  • Simplify partitioning functions.
  • Improve logging for ingest CLI.

Features

  • Add --wikipedia-auto-suggest argument to the ingest CLI to disable automatic redirection to pages with similar names.
  • Add setup script for Amazon Linux 2
  • Add optional encoding argument to the partition_(text/email/html) functions.
  • Added Google Drive connector for ingest cli.
  • Added Gitlab connector for ingest cli.

Fixes

0.5.2

Enhancements

  • Fully move from printing to logging.
  • unstructured-ingest now uses a default --download_dir of $HOME/.cache/unstructured/ingest rather than a "tmp-ingest-" dir in the working directory.

Features

Fixes

  • setup_ubuntu.sh no longer fails in some contexts by interpreting DEBIAN_FRONTEND=noninteractive as a command
  • unstructured-ingest no longer re-downloads files when --preserve-downloads is used without --download-dir.
  • Fixed an issue that was causing text to be skipped in some HTML documents.

0.5.1

Enhancements

Features

Fixes

  • Fixes an error causing JavaScript to appear in the output of partition_html sometimes.
  • Fix several issues with the requires_dependencies decorator, including the error message and how it was used, which had caused an error for unstructured-ingest --github-url ....

0.5.0

Enhancements

  • Add requires_dependencies Python decorator to check dependencies are installed before instantiating a class or running a function
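
A dependency-checking decorator of this kind can be sketched with `importlib`. The version below is an illustration only: `requires_dependencies_sketch` is a hypothetical name, its signature and error message are assumptions, and it is not the library's `requires_dependencies` implementation.

```python
import functools
import importlib.util
from typing import Any, Callable, Sequence


def requires_dependencies_sketch(dependencies: Sequence[str]) -> Callable:
    """Check that top-level modules are importable before running the function."""

    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            # find_spec returns None when a top-level module cannot be found.
            missing = [
                name for name in dependencies
                if importlib.util.find_spec(name) is None
            ]
            if missing:
                raise ImportError(
                    f"{func.__name__} requires missing dependencies: {', '.join(missing)}"
                )
            return func(*args, **kwargs)

        return wrapper

    return decorator


@requires_dependencies_sketch(["json"])
def dump_ok() -> str:
    import json
    return json.dumps({"ok": True})


print(dump_ok())  # {"ok": true}
```

Checking at call time rather than import time lets a module that lists optional dependencies still be imported when they are absent.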

Features

  • Added Wikipedia connector for ingest cli.

Fixes

  • Fix process_document file cleaning on failure
  • Fixes an error introduced in the metadata tracking commit that caused NarrativeText and FigureCaption elements to be represented as Text in HTML documents.

0.4.16

Enhancements

  • Fallback to using file extensions for filetype detection if libmagic is not present
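
An extension-based fallback like the one described above might look like the following. `detect_mime_type` and its `libmagic_available` parameter are hypothetical; the sketch assumes the python-magic package when libmagic is present and uses the standard-library `mimetypes` module otherwise.

```python
import importlib.util
import mimetypes
from typing import Optional

# True only if the python-magic package (import name "magic") is installed.
LIBMAGIC_AVAILABLE = importlib.util.find_spec("magic") is not None


def detect_mime_type(filename: str, libmagic_available: bool = LIBMAGIC_AVAILABLE) -> Optional[str]:
    """Return a MIME type from file contents if libmagic is usable,
    otherwise guess from the file extension alone."""
    if libmagic_available:
        import magic  # imported lazily so the module loads without it
        return magic.from_file(filename, mime=True)
    mime_type, _encoding = mimetypes.guess_type(filename)
    return mime_type


print(detect_mime_type("report.pdf", libmagic_available=False))  # application/pdf
```

The extension path never opens the file, so it is only as reliable as the filename.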

Features

  • Added setup script for Ubuntu
  • Added GitHub connector for ingest cli.
  • Added partition_md partitioner.
  • Added Reddit connector for ingest cli.

Fixes

  • Initializes connector properly in ingest.main::MainProcess
  • Restricts version of unstructured-inference to avoid multithreading issue

0.4.15

Enhancements

  • Added elements_to_json and elements_from_json for easier serialization/deserialization
  • convert_to_dict, dict_to_elements and convert_to_csv are now aliases for functions that use the ISD terminology.

Fixes

  • Update to ensure all elements are preserved during serialization/deserialization

0.4.14

  • Automatically install nltk models in the tokenize module.

0.4.13

  • Fixes unstructured-ingest cli.

0.4.12

  • Adds console_entrypoint for unstructured-ingest, other structure/doc updates related to ingest.
  • Add parser parameter to partition_html.

0.4.11

  • Adds partition_doc for partitioning Word documents in .doc format. Requires libreoffice.
  • Adds partition_ppt for partitioning PowerPoint documents in .ppt format. Requires libreoffice.

0.4.10

  • Fixes ElementMetadata so that it's JSON serializable when the filename is a Path object.

0.4.9

  • Added ingest modules and s3 connector, sample ingest script
  • Default to url=None for partition_pdf and partition_image
  • Add ability to skip English specific check by setting the UNSTRUCTURED_LANGUAGE env var to "".
  • Document Element objects now track metadata

0.4.8

  • Modified XML and HTML parsers not to load comments.

0.4.7

  • Added the ability to pull an HTML document from a url in partition_html.
  • Added the ability to get file summary info from lists of filenames and lists of file contents.
  • Added optional page break to partition for .pptx, .pdf, images, and .html files.
  • Added to_dict method to document elements.
  • Include more unicode quotes in replace_unicode_quotes.

0.4.6

  • Loosen the default cap threshold to 0.5.
  • Add a UNSTRUCTURED_NARRATIVE_TEXT_CAP_THRESHOLD environment variable for controlling the cap ratio threshold.
  • Unknown text elements are identified as Text for HTML and plain text documents.
  • Body Text styles no longer default to NarrativeText for Word documents. The style information is insufficient to determine that the text is narrative.
  • Upper cased text is lower cased before checking for verbs. This helps avoid some missed verbs.
  • Adds an Address element for capturing elements that only contain an address.
  • Suppress the UserWarning when detectron is called.
  • Checks that titles and narrative text have at least one English word.
  • Checks that titles and narrative text are at least 50% alpha characters.
  • Restricts titles to a maximum word length. Adds a UNSTRUCTURED_TITLE_MAX_WORD_LENGTH environment variable for controlling the max number of words in a title.
  • Updated partition_pptx to order the elements on the page
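
The 50% alpha character check mentioned above can be sketched as a simple ratio. The helper names are hypothetical, and excluding whitespace from the denominator is an assumption; the library may count characters differently.

```python
def alpha_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are alphabetic."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c.isalpha() for c in chars) / len(chars)


def passes_alpha_check(text: str, threshold: float = 0.5) -> bool:
    """True if at least `threshold` of the characters are alphabetic."""
    return alpha_ratio(text) >= threshold


print(passes_alpha_check("Quarterly results"))  # True
print(passes_alpha_check("1234 5678 ab"))       # False
```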

0.4.4

  • Updated partition_pdf and partition_image to return unstructured Element objects
  • Fixed the healthcheck url path when partitioning images and PDFs via API
  • Adds an optional coordinates attribute to document objects
  • Adds FigureCaption and CheckBox document elements
  • Added ability to split lists detected in LayoutElement objects
  • Adds partition_pptx for partitioning PowerPoint documents
  • LayoutParser models now download from Hugging Face Hub instead of Dropbox
  • Fixed file type detection for XML and HTML files on Amazon Linux

0.4.3

  • Adds requests as a base dependency
  • Fix in exceeds_cap_ratio so the function doesn't break with empty text
  • Fix bug in _parse_received_data.
  • Update detect_filetype to properly handle .doc, .xls, and .ppt.

0.4.2

  • Added partition_image to process documents in an image format.
  • Fixed utf-8 encoding error in partition_email with attachments for text/html

0.4.1

  • Added support for text files in the partition function
  • Pinned opencv-python for easier installation on Linux

0.4.0

  • Added generic partition brick that detects the file type and routes a file to the appropriate partitioning brick.
  • Added a file type detection module.
  • Updated partition_html and partition_eml to support file-like objects in 'rb' mode.
  • Cleaning brick for removing ordered bullets clean_ordered_bullets.
  • Extract brick method for ordered bullets extract_ordered_bullets.
  • Test for clean_ordered_bullets.
  • Test for extract_ordered_bullets.
  • Added partition_docx for pre-processing Word Documents.
  • Added new REGEX patterns to extract email header information
  • Added new functions to extract header information parse_received_data and partition_header
  • Added new function to parse plain text files partition_text
  • Added new cleaners functions extract_ip_address, extract_ip_address_name, extract_mapi_id, extract_datetimetz
  • Add new Image element and function to find embedded images find_embedded_images
  • Added get_directory_file_info for summarizing information about source documents

0.3.5

  • Add support for local inference
  • Add new pattern to recognize plain text dash bullets
  • Add test for bullet patterns
  • Fix for partition_html that allows for processing div tags that have both text and child elements
  • Add ability to extract document metadata from .docx, .xlsx, and .jpg files.
  • Helper functions for identifying and extracting phone numbers
  • Add new function extract_attachment_info that extracts and decodes the attachment of an email.
  • Staging brick to convert a list of Elements to a pandas dataframe.
  • Add plain text functionality to partition_email

0.3.4

  • Python-3.7 compat

0.3.3

  • Removes BasicConfig from logger configuration
  • Adds the partition_email partitioning brick
  • Adds the replace_mime_encodings cleaning bricks
  • Small fix to HTML parsing related to processing list items with sub-tags
  • Add EmailElement data structure to store email documents

0.3.2

  • Added translate_text brick for translating text between languages
  • Add an apply method to make it easier to apply cleaners to elements

0.3.1

  • Added __init__.py to partition

0.3.0

  • Implement staging brick for Argilla. Converts lists of Text elements to argilla dataset classes.
  • Removing the local PDF parsing code and any dependencies and tests.
  • Reorganizes the staging bricks in the unstructured.partition module
  • Allow entities to be passed into the Datasaur staging brick
  • Added HTML escapes to the replace_unicode_quotes brick
  • Fix bad responses in partition_pdf to raise ValueError
  • Adds partition_html for partitioning HTML documents.

0.2.6

  • Small change to how _read is placed within the inheritance structure since it doesn't really apply to pdf
  • Add partitioning brick for calling the document image analysis API

0.2.5

  • Update python requirement to >=3.7

0.2.4

  • Add alternative way of importing Final to support google colab

0.2.3

  • Add cleaning bricks for removing prefixes and postfixes
  • Add cleaning bricks for extracting text before and after a pattern

0.2.2

  • Add staging brick for Datasaur

0.2.1

  • Added brick to convert an ISD dictionary to a list of elements
  • Update PDFDocument to use the from_file method
  • Added staging brick for CSV format for ISD (Initial Structured Data) format.
  • Added staging brick for separating text into attention window size chunks for transformers.
  • Added staging brick for LabelBox.
  • Added ability to upload LabelStudio predictions
  • Added utility function for JSONL reading and writing
  • Added staging brick for CSV format for Prodigy
  • Added staging brick for Prodigy
  • Added ability to upload LabelStudio annotations
  • Added text_field and id_field to stage_for_label_studio signature

0.2.0

  • Initial release of unstructured