
[BUG] Fixed-point types with precision < 10 (for Spark) cannot be successfully read back from parquet #7152

Closed
razajafri opened this issue Jan 15, 2021 · 26 comments
Labels: bug (Something isn't working), cuIO (cuIO issue), libcudf (Affects libcudf (C++/CUDA) code), Spark (Functionality that helps Spark RAPIDS)

Comments

@razajafri
Contributor

Describe the bug
When writing parquet files with fixed-point types, columns with precision < 10 (Spark uses precision) cannot be successfully read back. I think the columns are being written as ints, because Spark tries to read them back with readLong when a readDecimal was expected. This is just a hunch and the real problem might be completely different.

Steps/Code to reproduce bug
Create a parquet file with Decimals using precisions < 10
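A minimal reproduction sketch (not from the original report), assuming the cuDF Python API with `cudf.Decimal64Dtype` and a local PySpark installation; the column name, values, and file name are illustrative:

```python
# Hypothetical reproduction sketch -- assumes the installed cuDF supports
# casting integers to Decimal64Dtype and that PySpark is available locally.
import cudf
from pyspark.sql import SparkSession

# precision < 10: the Parquet spec expects int32 storage for this range,
# but cuDF writes the decimal64 column with an int64 physical type.
dtype = cudf.Decimal64Dtype(precision=7, scale=2)
gdf = cudf.DataFrame({"amount": cudf.Series([123, 456, 789]).astype(dtype)})
gdf.to_parquet("round_trip.parquet")

# With an affected Spark version, the vectorized reader sizes its column
# buffer from the precision (int32) while the file stores int64 values,
# so this read fails or returns corrupted data.
spark = SparkSession.builder.getOrCreate()
spark.read.parquet("round_trip.parquet").show()
```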

Expected behavior
Reading back the parquet file with a 3rd-party reader (I used Spark) should work without any problem.

Additional context
Please see the attached parquet file
ROUND_TRIP.tar.gz

@razajafri razajafri added bug Something isn't working Needs Triage Need team to review and classify Spark Functionality that helps Spark RAPIDS labels Jan 15, 2021
@hyperbolic2346 hyperbolic2346 self-assigned this Jan 15, 2021
@hyperbolic2346
Contributor

hyperbolic2346 commented Jan 16, 2021

@revans2 helped track this down to an issue reading parquet with precision < 10 when the value is stored as an int64. Storing as int64 isn't necessary for this data, but cuDF writes the column as-is and doesn't downsize to a smaller data size when that is possible. Spark will just fail to load this data. This is a bug in Spark, but it leads to the question: should we handle this by writing the data in the smallest format that can hold it?

@kkraus14, what would you suggest here based on Python needs? We can have the Java plugin scale the column data down to 32 bits before calling the parquet write, or cuDF can notice this situation and write the data as 32-bit values as an optimization. I can see value in both, but I worry that something not explicit could lead to surprises.
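To make the "downsize before writing" option concrete, here is a rough Python-level sketch. It is a hypothetical helper, and it assumes a cuDF version that exposes `cudf.Decimal32Dtype` and supports decimal-to-decimal casts, which was not the case on the Python side at the time of this discussion:

```python
import cudf

def downcast_narrow_decimals(df: cudf.DataFrame) -> cudf.DataFrame:
    """Hypothetical pre-write pass: store precision <= 9 columns as decimal32,
    matching the Parquet int32 annotation range."""
    out = df.copy()
    for name, dtype in df.dtypes.items():
        if isinstance(dtype, cudf.Decimal64Dtype) and dtype.precision <= 9:
            out[name] = out[name].astype(
                cudf.Decimal32Dtype(dtype.precision, dtype.scale)
            )
    return out
```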

@kkraus14
Collaborator

Spark will just fail to load this data.

Do you mean vanilla CPU Spark or Spark-RAPIDS?

This is a bug in spark, but leads to the question: Should we handle this by writing the data in the smallest format that can hold the data?

I think it would be a bit weird if someone writes a decimal64 type column and then when reading back the data they get a decimal32 type column. Is there anything in the parquet specification regarding this?

Would welcome other thoughts here.

What would you suggest here based on python needs? We can have the java plugin scale the column data down to 32 bits before calling parquet write or cuDF can notice this situation and write the data as 32-bit values as an optimization. I can see value to both, but I worry that something not explicit could lead to surprises.

My initial thought is that I'd want my column types to round trip successfully, but I don't think we've decided if in Python we want distinct decimal32 vs decimal64 types or if we just want a decimal type that we hide promotion logic underneath.

cc @shwina @brandon-b-miller @codereport for their thoughts here

@shwina
Contributor

shwina commented Jan 18, 2021

I don't think we've decided if in Python we want distinct decimal32 vs decimal64 types or if we just want a decimal type that we hide promotion logic underneath.

Right now, we just have a decimal64 type that we're calling decimal. If we want to include decimal32, my preference would be explicit, separate decimal32 and decimal64 types. That would have more predictable behaviour w.r.t. memory usage, which is typically important for cuDF users.

@revans2
Contributor

revans2 commented Jan 19, 2021

@razajafri I think we need to spend some more time verifying that my quick look at the code is correct and that it is actually causing the error/data corruption you saw.

In the parquet reader, it looks like Spark is able to read in the values correctly regardless of the precision.

https://github.com/apache/spark/blob/32dad1d5a6d77aef64e2968c28195ca2bbafc1c1/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedColumnReader.java#L552-L555

DecimalType.is64BitDecimalType(column.dataType()) returns true for all precision values that could fit in a long, including those that would fit in a 32 bit decimal too.

https://github.com/apache/spark/blob/32dad1d5a6d77aef64e2968c28195ca2bbafc1c1/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DecimalType.scala#L236-L242

The problem is when we try to read them back out. WritableColumnVector is the base class used by the vectorized parquet reader, and when we read in a decimal value it assumes that the precision determines how the data was stored (32-bit vs 64-bit).

https://github.com/apache/spark/blob/32dad1d5a6d77aef64e2968c28195ca2bbafc1c1/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java#L362-L375

@shwina I don't think the question is around supporting decimal32 and decimal64 as separate types vs. a single type. I think the real question is: do we want cuDF to work around the bug in Spark, and possibly other platforms, so that if we write out a parquet file, Spark can read it back in again in all cases? We already know that we are not going to be compatible with some systems that only support reading decimal as fixed-length byte arrays (Hive and Impala, according to the Spark documentation). So if we are also incompatible with Spark it is probably OK, but it would be good to document how to work around it.

Either way I think we want to push for Spark to fix their bug, but it is likely going to take a long time for that fix to be released.

@shwina
Contributor

shwina commented Jan 19, 2021

@shwina I don't think the question is around supporting decimal32 and decimal64 as separate types vs a single type.

Understood -- my comment was specifically about the handling on the Python side. Apologies, I think that's orthogonal to this PR.

@razajafri
Contributor Author

@revans2 you are correct.

Here are my findings. The VectorizedParquetRecordReader reads in the parquet file correctly because it bases the read on the requestedSchema, which is a MessageType and has the underlying data stored correctly as INT64 (precisely, as optional INT64 Decimal(_,_)). The OnHeapColumnVector, on the other hand, is initialized based on the batchSchema, which comes from org.apache.spark.sql.parquet.row.requested_schema set by the reader; that is a StructType and only has Decimal(_,_) in it.

https://github.com/apache/spark/blob/a44e008de3ae5aecad9e0f1a7af6a1e8b0d97f4e/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java#L224

So, as @revans2 said, it initializes the OnHeapColumnVector with int32, basing it on the precision < 10, but the VectorizedParquetRecordReader, which has read the data as longs, expects the OnHeapColumnVector to be initialized with int64.

The right way to fix this would be to file a bug against Spark to handle this correctly, but as already mentioned, that could take a long time and the fix possibly won't make it into the upcoming release.

@kkraus14 kkraus14 added cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Jan 20, 2021
@kkraus14
Collaborator

cc @vuule

This sounds like another situation where an option / passable Schema would be desired.

To me, it doesn't make sense to change libcudf's behavior in general.

@hyperbolic2346
Contributor

To merge these two threads, there is now a Spark bug and a PR to fix it.

@razajafri
Contributor Author

@hyperbolic2346 Thanks, Mike.

@revans2
Contributor

revans2 commented Jan 28, 2021

With some help from the Spark community it looks like this is actually a violation of the parquet specification.

From
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md

DECIMAL can be used to annotate the following types:

    int32: for 1 <= precision <= 9
    int64: for 1 <= precision <= 18; precision < 10 will produce a warning
    fixed_len_byte_array: precision is limited by the array size. Length n can store <= floor(log_10(2^(8*n - 1) - 1)) base-10 digits
    binary: precision is not limited, but is required. The minimum number of bytes to store the unscaled value should be used.

It probably would be good for us to fix this in our writer.
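For reference, the precision-to-physical-type rules quoted above boil down to roughly the following (a simplified sketch that ignores the fixed_len_byte_array sizing formula):

```python
def parquet_physical_type_for_decimal(precision: int) -> str:
    # Simplified mapping of the Parquet DECIMAL annotation rules quoted above.
    if 1 <= precision <= 9:
        return "INT32"
    if precision <= 18:
        # int64 is allowed down to precision 1, but precision < 10 stored as
        # int64 "will produce a warning" per the spec.
        return "INT64"
    # Larger precisions need FIXED_LEN_BYTE_ARRAY (size-limited) or BINARY.
    return "FIXED_LEN_BYTE_ARRAY or BINARY"
```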

@kkraus14
Collaborator

int64: for 1 <= precision <= 18; precision < 10 will produce a warning

Why wouldn't it just be specified as 10 <= precision <= 18? I don't interpret a warning as a violation of the specification...

With some help from the Spark community it looks like this is actually a violation of the parquet specification.

Could you point me to the discussion in the Spark community? I don't see anything in the JIRA issue or the PR.

@razajafri
Contributor Author

To merge these two threads, there is now a Spark bug and a PR to fix it.

Is this the link you want @kkraus14?

@kkraus14
Collaborator

kkraus14 commented Feb 9, 2021

There was a claim that it was a violation of the Parquet specification, but the specification specifically says "int64: for 1 <= precision <= 18; precision < 10 will produce a warning". I interpret this as being within the specification, albeit in a way that will produce a warning.

I'm wondering where there was discussion that this is a violation of the specification as opposed to a bug / limitation of Spark in handling the case that produces a warning?

@revans2
Contributor

revans2 commented Feb 11, 2021

Sorry, it is a warning, not a full-on violation. We are working to fix it in Spark, but I still believe that other tools will make similar mistakes, and having cuDF fix it is ideal.

@kkraus14
Collaborator

I think the challenge here is that Spark is coming from the perspective of having a single Decimal type whose memory is either opaque or statically defined as 128 bits (I'm not sure which), so regardless of what the precision is, Spark will read it into that type.

On the other hand, libcudf has decided to support decimal32, decimal64, and in the future decimal128 explicitly. If someone is using decimal64 or decimal128 I imagine they'd expect something to round trip through Parquet in the type specified. Is there a logical type in Parquet or some other mechanism that could be used in this situation?

@kkraus14
Collaborator

cc @vuule @devavret for your thoughts / opinions here as well

@razajafri
Contributor Author

FYI, the Spark PR is now merged.

@razajafri
Contributor Author

Spark has fixed the issue; it will now read the Decimal value as it was written instead of basing the read on the precision of the Decimal value.
@kkraus14 I know you pinged @vuule and @devavret, but do we still need to keep this open?

@kkraus14
Collaborator

My weak opinion is that we should keep the current behavior and possibly add a warning if someone is writing decimal64 data that fits in a decimal32 space per the Parquet spec.

What do you think @revans2 @vuule @devavret?

@vuule
Contributor

vuule commented Mar 17, 2021

+1 for keeping the current behavior + a warning, mainly because we expose the type size to users, and keeping the input type seems like the least surprising behavior.

@kkraus14
Collaborator

We can easily implement this warning on the Python side, but it would be ideal to have some form of mechanism in libcudf if possible. Is there a clean mechanism for warning from libcudf?
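On the Python side, such a warning could look something like the following (a hypothetical helper, not an existing cuDF API; it relies on `cudf.Decimal64Dtype` exposing a `precision` attribute):

```python
import warnings
import cudf

def warn_on_narrow_decimal64(df: cudf.DataFrame) -> None:
    """Hypothetical pre-write check: warn when a decimal64 column's precision
    fits the Parquet int32 range (1 <= precision <= 9)."""
    for name, dtype in df.dtypes.items():
        if isinstance(dtype, cudf.Decimal64Dtype) and dtype.precision <= 9:
            warnings.warn(
                f"Column '{name}' has precision {dtype.precision} (< 10) but "
                "will be stored as int64; per the Parquet spec this produces "
                "a warning and some readers may mishandle it."
            )
```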

@devavret
Contributor

FWIW, if this is widespread, it's easy to add that option.

@vuule
Contributor

vuule commented Mar 17, 2021

Is there a clean mechanism for warning from libcudf?

Not that I know of (or that I could find in the code).

FWIW, if this is widespread, it's easy to add that option.

The option to convert to 32bit if precision < 10?

@devavret
Contributor

The option to convert to 32bit if precision < 10?

The option to specify the physical type for any input type, and a toggle in the input schema to control it.

@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@GregoryKimball
Contributor

Closing for now because Spark-RAPIDS has several good workarounds.
