[FEA] Binary type encoding for Parquet writes #10778
Comments
I cannot find in the Parquet spec how to distinguish between string and binary type when the physical type is BYTE_ARRAY. This means you can just encode your current data as a binary type and pass it as a string column; you'd have to pass this type information to the reader on your own. Do you have any Parquet files I can look at that contain binary data?
Here's a sample Parquet file for reference, written by Apache Spark, that contains a string column, a list-of-bytes column, and a binary column: string-list-binary.parquet.zip. It's clear from the metadata that Spark encodes extra information to distinguish strings from raw binary types, but otherwise, in the raw Parquet metadata, they appear identical. Here's the output of the metadata from Parquet's CLI tool:
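For illustration only (this is not the verbatim dump of the attached file): a Spark schema of that shape typically prints along these lines, and the STRING annotation on the first column is the only visible difference from the unannotated binary column:

```
message spark_schema {
  optional binary s (STRING);
  optional group l (LIST) {
    repeated group list {
      optional int32 element (INT_8);
    }
  }
  optional binary b;
}
```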
I agree this is not a feature requirement of the libcudf writer, and instead is more of a requirement of the high-level reader code to reinterpret string data as binary data (or vice versa) based on what schema is expected.
In the metadata dump above, the string column is printed with the STRING/UTF8 annotation, while the binary column has none. This is the difference: if we write the binary columns out as strings, they will have the UTF8 annotation. Technically, for reads we should be looking at this metadata too and only return a DType.STRING if UTF8/StringType is set, but older versions of Parquet didn't use that, so we need a flag to say whether you care about it or not.
Actually, I need to correct that a bit, at least for what Spark does on reads: BINARY with an annotation of STRING, ENUM, UUID, or JSON should also be returned as DType.STRING.
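A minimal sketch of that read-side rule, assuming a hypothetical annotation enum (illustrative only, not cudf or Spark API):

```cpp
// Decide whether a Parquet BINARY column should surface as a string type.
// Hypothetical names; the cases mirror the rule described above.
enum class BinaryAnnotation { NONE, STRING, ENUM, UUID, JSON };

bool surface_as_string(BinaryAnnotation ann, bool require_annotation)
{
  switch (ann) {
    case BinaryAnnotation::STRING:
    case BinaryAnnotation::ENUM:
    case BinaryAnnotation::UUID:
    case BinaryAnnotation::JSON:
      return true;  // annotated BINARY is returned as DType.STRING
    default:
      // Older Parquet writers emitted no annotation at all, so a flag
      // decides whether unannotated BINARY is still treated as a string.
      return !require_annotation;
  }
}
```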
Ok, so you want the Parquet writer to not write any converted type or logical type when you pass the column option in cudf/cpp/include/cudf/io/types.hpp (line 252 at 6acf226)?
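For context, that option was later uncommented and shipped as set_output_as_binary on column_in_metadata. A minimal writer-side sketch, assuming the post-#11160 API (builder signatures have varied slightly across cudf releases):

```cpp
#include <cudf/io/parquet.hpp>
#include <cudf/io/types.hpp>
#include <cudf/table/table_view.hpp>

// `table` holds one strings column that should be written as raw binary.
void write_as_binary(cudf::table_view const& table)
{
  cudf::io::table_input_metadata metadata{table};
  // Request plain BYTE_ARRAY output with no STRING/UTF8 annotation.
  metadata.column_metadata[0].set_name("payload").set_output_as_binary(true);

  auto opts = cudf::io::parquet_writer_options::builder(
                cudf::io::sink_info{"out.parquet"}, table)
                .metadata(metadata)  // some releases take &metadata instead
                .build();
  cudf::io::write_parquet(opts);
}
```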
This issue has been labeled.
Still needed.
There are a couple of issues (#11044 and #10778) revolving around adding support for binary writes and reads to Parquet. The desire is to be able to write strings and lists of int8 values as binary. This PR adds support for strings to be written as binary and for binary data to be read as binary or strings. I have left the default for binary data to read as a string to prevent any surprises upon upgrade. Single-depth list columns of int8 and uint8 values are not written as binary with this change; that will be another PR after discussions about the possible impact of the change.

Closes #11044
Issue #10778

Authors:
- Mike Wilson (https://github.com/hyperbolic2346)

Approvers:
- Karthikeyan (https://github.com/karthikeyann)
- MithunR (https://github.com/mythrocks)
- Yunsong Wang (https://github.com/PointKernel)
- Vukasin Milovanovic (https://github.com/vuule)
- https://github.com/nvdbaranec
- Vyas Ramasubramani (https://github.com/vyasr)

URL: #11160
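On the read side, the PR exposes the choice through reader_column_schema; a minimal sketch, assuming the API as merged (exact signatures may differ between releases):

```cpp
#include <cudf/io/parquet.hpp>
#include <cudf/io/types.hpp>

#include <string>
#include <vector>

// Read one BINARY column back as LIST<UINT8> instead of the default string.
cudf::io::table_with_metadata read_binary(std::string const& path)
{
  std::vector<cudf::io::reader_column_schema> schema(1);
  // Default is true so existing readers keep getting strings on upgrade.
  schema[0].set_convert_binary_to_strings(false);

  auto opts = cudf::io::parquet_reader_options::builder(
                cudf::io::source_info{path})
                .build();
  opts.set_column_schema(schema);
  return cudf::io::read_parquet(opts);
}
```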
Is your feature request related to a problem? Please describe.
Apache Spark supports writing binary data to Parquet as Parquet's binary (BYTE_ARRAY) type. The RAPIDS Accelerator needs to emulate this behavior for binary data in order to accelerate Parquet writes containing binary types.

Describe the solution you'd like
The cuio column_metadata already contains a commented-out field that would indicate whether the data should be encoded as a binary type. Incoming data that is either a string or a LIST of INT8/UINT8 would be encoded as Parquet binary if this flag is set.

Describe alternatives you've considered
cuio could infer that a column that is a LIST of INT8/UINT8 should be written as binary, but there are some use cases that would break with that assumption. For example, if someone used a Spark type of ArrayType(ByteType), it would appear in cudf as LIST of INT8, but it should not be binary-encoded.