Document Parquet and ORC compression support #550
Conversation
Signed-off-by: Sameer Raheja <sraheja@nvidia.com>
docs/compatibility.md
Outdated
@@ -184,6 +186,8 @@ Parquet will not be GPU-accelerated. If the INT96 timestamp format is not required for
compatibility with other tools then set `spark.sql.parquet.outputTimestampType` to
`TIMESTAMP_MICROS`.

The plugin supports reading `uncompressed`, `snappy` and `gzip` Parquet files and writing `uncompressed` and `snappy` Parquet files.
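To make the documented behavior concrete, here is a minimal sketch (not part of the PR itself) of a Spark job that stays on the settings the note recommends; the application name, output path, and toy DataFrame are illustrative assumptions:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical session and data, only to illustrate the documented settings.
val spark = SparkSession.builder()
  .appName("parquet-compression-sketch")
  .getOrCreate()

// Avoid the INT96 timestamp path by writing microsecond timestamps, as the doc suggests.
spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")

val df = spark.range(1000).toDF("id")

// `snappy` (or no compression) keeps the Parquet write on a codec the plugin supports.
df.write
  .option("compression", "snappy")
  .parquet("/tmp/compression-sketch/parquet")
```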
I think the important thing we are missing is that we cannot tell what the file is compressed with until we start to read it, and at that point we don't have the ability to fall back to the CPU, so you will get an error. This is also true for ORC.
Updated to note that we will error out on reading or writing an unsupported format
…format Signed-off-by: Sameer Raheja <sraheja@nvidia.com>
docs/compatibility.md
Outdated
@@ -164,6 +164,10 @@ similar issue exists for writing dates as described
appears to work for dates after the epoch as described
[here](https://github.com/NVIDIA/spark-rapids/issues/140).

The plugin supports reading `uncompressed`, `snappy` and `zlib` ORC files and writing `uncompressed`
and `snappy` ORC files. At this point, the plugin does not have the ability to fall back to the
CPU when reading or writing an unsupported compression format, and will error out in that case.
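As a companion sketch for the ORC note (again illustrative only, reusing the hypothetical `spark` session and made-up paths from the Parquet example above):

```scala
// Reading is only supported for uncompressed, snappy, or zlib ORC input; per the
// review discussion, an unsupported input codec errors out rather than falling back.
val orcInput = spark.read.orc("/tmp/compression-sketch/orc-input")

// Writing with snappy (or uncompressed) stays on a codec the plugin can handle.
orcInput.write
  .option("compression", "snappy")
  .orc("/tmp/compression-sketch/orc-output")
```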
Sorry, it is just reading. For writing, we will fall back to the CPU for an unsupported format.
Corrected to indicate that only reading will error out.
Signed-off-by: Sameer Raheja <sraheja@nvidia.com>
build
* Compatibility notes for parquet and orc compression on read + write
* Fixing grammar
* Edit / rewrite
* Add note about erroring out in case of reading / writing unsupported format
* Correction - only reading will error out.

Signed-off-by: Sameer Raheja <sraheja@nvidia.com>
…IDIA#550) Signed-off-by: spark-rapids automation <70000568+nvauto@users.noreply.github.com>
Add documentation on compression formats supported for Parquet and ORC on read and write. Closes issue #341
Reviewed the rendered .md file on GitHub.