Document Parquet and ORC compression support #550

Conversation

@sameerz (Collaborator) commented Aug 12, 2020

  1. Please write a description in this text box of the changes that are being
    made.

Add documentation on compression formats supported for Parquet and ORC on read and write. Closes issue #341

  2. Please ensure that you have written unit tests for the changes made/features
    added.

Documentation-only change; reviewed as a rendered .md file on GitHub.

Signed-off-by: Sameer Raheja <sraheja@nvidia.com>
@sameerz added the `documentation` label (Improvements or additions to documentation) Aug 12, 2020
@sameerz added this to the Aug 3 - Aug 14 milestone Aug 12, 2020
@sameerz self-assigned this Aug 12, 2020
@@ -184,6 +186,8 @@ Parquet will not be GPU-accelerated. If the INT96 timestamp format is not requir
compatibility with other tools then set `spark.sql.parquet.outputTimestampType` to
`TIMESTAMP_MICROS`.

The plugin supports reading `uncompressed`, `snappy` and `gzip` Parquet files and writing `uncompressed` and `snappy` Parquet files.
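As an illustration of the documented behavior (not part of the PR diff), here is a minimal sketch of writing Parquet with a codec the plugin supports for both reads and writes, using standard Spark configuration; the session setup and output path are assumptions for the example:

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: the master, app name, and output path are placeholders.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("parquet-compression-sketch")
  // Avoid INT96 timestamps so Parquet writes can stay on the GPU.
  .config("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")
  // snappy is supported by the plugin for both Parquet reads and writes.
  .config("spark.sql.parquet.compression.codec", "snappy")
  .getOrCreate()

val df = spark.range(1000).toDF("id")
df.write.mode("overwrite").parquet("/tmp/example_parquet_snappy")
```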
Collaborator

I think the important thing we are missing is that we cannot tell what the file is compressed with until we start to read it and at that point we don't have the ability to fall back to the CPU so you will get an error. This is also true for ORC.

Collaborator Author

Updated to note that we will error out on reading or writing an unsupported format

Add note about erroring out in case of reading / writing unsupported format

Signed-off-by: Sameer Raheja <sraheja@nvidia.com>
@@ -164,6 +164,10 @@ similar issue exists for writing dates as described
appears to work for dates after the epoch as described
[here](https://github.com/NVIDIA/spark-rapids/issues/140).

The plugin supports reading `uncompressed`, `snappy` and `zlib` ORC files and writing `uncompressed`
and `snappy` ORC files. At this point, the plugin does not have the ability to fall back to the
CPU when reading or writing an unsupported compression format, and will error out in that case.
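Again only as an illustration (not part of the diff), a comparable sketch for ORC: snappy is listed above as supported for both reads and writes, while zlib is read-only on the GPU. The session setup and output path are assumptions:

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: the master, app name, and output path are placeholders.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("orc-compression-sketch")
  // snappy ORC files can be both read and written by the plugin; zlib is read-only.
  .config("spark.sql.orc.compression.codec", "snappy")
  .getOrCreate()

val df = spark.range(1000).toDF("id")
df.write.mode("overwrite").orc("/tmp/example_orc_snappy")
```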
Collaborator

Sorry, it is just reading. For writing, we will fall back to the CPU for an unsupported format.

Collaborator Author

Corrected to indicate that only reading will error out.

Signed-off-by: Sameer Raheja <sraheja@nvidia.com>
@sameerz (Collaborator Author) commented Aug 12, 2020

build

@sameerz merged commit 458d27e into NVIDIA:branch-0.2 Aug 13, 2020
@sameerz linked an issue Aug 20, 2020 that may be closed by this pull request
@sameerz deleted the documentation-parquet-orc-compression-support branch August 24, 2020 22:32
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
* Compatibility notes for parquet and orc compression on read + write

Signed-off-by: Sameer Raheja <sraheja@nvidia.com>

* Fixing grammar

Signed-off-by: Sameer Raheja <sraheja@nvidia.com>

* Edit / rewrite

Signed-off-by: Sameer Raheja <sraheja@nvidia.com>

* Add note about erroring out in case of reading / writing unsupported format

Signed-off-by: Sameer Raheja <sraheja@nvidia.com>

* Correction - only reading will error out.

Signed-off-by: Sameer Raheja <sraheja@nvidia.com>
tgravescs pushed a commit to tgravescs/spark-rapids that referenced this pull request Nov 30, 2023
… (NVIDIA#550)

Signed-off-by: spark-rapids automation <70000568+nvauto@users.noreply.github.com>

Successfully merging this pull request may close these issues.

[BUG] Document compression formats for readers/writers