Document Parquet and ORC compression support #550

Conversation

@sameerz (Collaborator) commented Aug 12, 2020

  1. Please write a description in this text box of the changes that are being
    made.

Add documentation on compression formats supported for Parquet and ORC on read and write. Closes issue #341

  2. Please ensure that you have written unit tests for the changes made/features
    added.

Documentation-only change; reviewed as a rendered .md file on GitHub.

Signed-off-by: Sameer Raheja <sraheja@nvidia.com>
@sameerz added the `documentation` label (Improvements or additions to documentation) Aug 12, 2020
@sameerz added this to the Aug 3 - Aug 14 milestone Aug 12, 2020
@sameerz self-assigned this Aug 12, 2020
@@ -184,6 +186,8 @@ Parquet will not be GPU-accelerated. If the INT96 timestamp format is not requir
compatibility with other tools then set `spark.sql.parquet.outputTimestampType` to
`TIMESTAMP_MICROS`.

The plugin supports reading `uncompressed`, `snappy` and `gzip` Parquet files and writing `uncompressed` and `snappy` Parquet files.
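As an illustration of the documented behavior (not part of the PR diff), here is a minimal sketch of writing Parquet with a codec the plugin supports for both reads and writes, using standard Spark configuration; the session setup and output path are assumptions for the example:

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: the master, app name, and output path are placeholders.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("parquet-compression-sketch")
  // Avoid INT96 timestamps so Parquet writes can stay on the GPU.
  .config("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")
  // snappy is supported by the plugin for both Parquet reads and writes.
  .config("spark.sql.parquet.compression.codec", "snappy")
  .getOrCreate()

val df = spark.range(1000).toDF("id")
df.write.mode("overwrite").parquet("/tmp/example_parquet_snappy")
```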
Collaborator

I think the important thing we are missing is that we cannot tell what the file is compressed with until we start to read it and at that point we don't have the ability to fall back to the CPU so you will get an error. This is also true for ORC.

Collaborator Author

Updated to note that we will error out on reading or writing an unsupported format

Add note about erroring out in case of reading / writing unsupported format

Signed-off-by: Sameer Raheja <sraheja@nvidia.com>
@@ -164,6 +164,10 @@ similar issue exists for writing dates as described
appears to work for dates after the epoch as described
[here](https://github.com/NVIDIA/spark-rapids/issues/140).

The plugin supports reading `uncompressed`, `snappy` and `zlib` ORC files and writing `uncompressed`
and `snappy` ORC files. At this point, the plugin does not have the ability to fall back to the
CPU when reading or writing an unsupported compression format, and will error out in that case.
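Again only as an illustration (not part of the diff), a comparable sketch for ORC: snappy is listed above as supported for both reads and writes, while zlib is read-only on the GPU. The session setup and output path are assumptions:

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: the master, app name, and output path are placeholders.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("orc-compression-sketch")
  // snappy ORC files can be both read and written by the plugin; zlib is read-only.
  .config("spark.sql.orc.compression.codec", "snappy")
  .getOrCreate()

val df = spark.range(1000).toDF("id")
df.write.mode("overwrite").orc("/tmp/example_orc_snappy")
```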
Collaborator

Sorry, it is just reading. For writing, we will fall back to the CPU for an unsupported format.

Collaborator Author

Corrected to indicate that only reading will error out.

Signed-off-by: Sameer Raheja <sraheja@nvidia.com>
@sameerz (Collaborator Author) commented Aug 12, 2020

build

@sameerz merged commit 458d27e into NVIDIA:branch-0.2 Aug 13, 2020
@sameerz linked an issue Aug 20, 2020 that may be closed by this pull request
@sameerz deleted the documentation-parquet-orc-compression-support branch August 24, 2020 22:32
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
* Compatibility notes for parquet and orc compression on read + write

Signed-off-by: Sameer Raheja <sraheja@nvidia.com>

* Fixing grammar

Signed-off-by: Sameer Raheja <sraheja@nvidia.com>

* Edit / rewrite

Signed-off-by: Sameer Raheja <sraheja@nvidia.com>

* Add note about erroring out in case of reading / writing unsupported format

Signed-off-by: Sameer Raheja <sraheja@nvidia.com>

* Correction - only reading will error out.

Signed-off-by: Sameer Raheja <sraheja@nvidia.com>
tgravescs pushed a commit to tgravescs/spark-rapids that referenced this pull request Nov 30, 2023
… (NVIDIA#550)

Signed-off-by: spark-rapids automation <70000568+nvauto@users.noreply.github.com>

Successfully merging this pull request may close these issues.

[BUG] Document compression formats for readers/writers