
[BUG] Fixed-point types with precision < 10 (for Spark) cannot be successfully read back from parquet #7152

Closed
razajafri opened this issue Jan 15, 2021 · 26 comments
Labels: bug (Something isn't working), cuIO (cuIO issue), libcudf (Affects libcudf (C++/CUDA) code), Spark (Functionality that helps Spark RAPIDS)

Comments

@razajafri
Contributor

Describe the bug
When writing parquet files with fixed-point types, columns with precision < 10 (Spark uses precision) cannot be successfully read back. I think the columns are being written as ints, because Spark tries to read them back with readLong when a readDecimal was expected. This is just a hunch and the real problem might be completely different.

Steps/Code to reproduce bug
Create a parquet file with Decimals using precisions < 10
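A minimal reproduction sketch (not from the original report), assuming the cuDF Python API with `cudf.Decimal64Dtype` and a local PySpark installation; the column name, values, and file name are illustrative:

```python
# Hypothetical reproduction sketch -- assumes the installed cuDF supports
# casting integers to Decimal64Dtype and that PySpark is available locally.
import cudf
from pyspark.sql import SparkSession

# precision < 10: the Parquet spec expects int32 storage for this range,
# but cuDF writes the decimal64 column with an int64 physical type.
dtype = cudf.Decimal64Dtype(precision=7, scale=2)
gdf = cudf.DataFrame({"amount": cudf.Series([123, 456, 789]).astype(dtype)})
gdf.to_parquet("round_trip.parquet")

# With an affected Spark version, the vectorized reader sizes its column
# buffer from the precision (int32) while the file stores int64 values,
# so this read fails or returns corrupted data.
spark = SparkSession.builder.getOrCreate()
spark.read.parquet("round_trip.parquet").show()
```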

Expected behavior
Reading back the parquet file with a 3rd-party reader (I used Spark) should work without any problem.

Additional context
Please see the attached parquet file
ROUND_TRIP.tar.gz

@razajafri razajafri added bug Something isn't working Needs Triage Need team to review and classify Spark Functionality that helps Spark RAPIDS labels Jan 15, 2021
@hyperbolic2346 hyperbolic2346 self-assigned this Jan 15, 2021
@hyperbolic2346
Contributor

hyperbolic2346 commented Jan 16, 2021

@revans2 helped track this down to an issue reading parquet with precision < 10 when the value is stored as an int64. Storing as int64 isn't necessary for this data, but cuDF writes the column as-is and doesn't downsize to a smaller data size when that is possible. Spark will just fail to load this data. This is a bug in Spark, but it leads to the question: should we handle this by writing the data in the smallest format that can hold it?

@kkraus14, what would you suggest here based on Python needs? We can have the Java plugin scale the column data down to 32 bits before calling the parquet write, or cuDF can notice this situation and write the data as 32-bit values as an optimization. I can see value in both, but I worry that something not explicit could lead to surprises.
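To make the "downsize before writing" option concrete, here is a rough Python-level sketch. It is a hypothetical helper, and it assumes a cuDF version that exposes `cudf.Decimal32Dtype` and supports decimal-to-decimal casts, which was not the case on the Python side at the time of this discussion:

```python
import cudf

def downcast_narrow_decimals(df: cudf.DataFrame) -> cudf.DataFrame:
    """Hypothetical pre-write pass: store precision <= 9 columns as decimal32,
    matching the Parquet int32 annotation range."""
    out = df.copy()
    for name, dtype in df.dtypes.items():
        if isinstance(dtype, cudf.Decimal64Dtype) and dtype.precision <= 9:
            out[name] = out[name].astype(
                cudf.Decimal32Dtype(dtype.precision, dtype.scale)
            )
    return out
```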

@kkraus14
Collaborator

Spark will just fail to load this data.

Do you mean vanilla CPU Spark or Spark-RAPIDS?

This is a bug in spark, but leads to the question: Should we handle this by writing the data in the smallest format that can hold the data?

I think it would be a bit weird if someone writes a decimal64 type column and then when reading back the data they get a decimal32 type column. Is there anything in the parquet specification regarding this?

Would welcome other thoughts here.

What would you suggest here based on python needs? We can have the java plugin scale the column data down to 32 bits before calling parquet write or cuDF can notice this situation and write the data as 32-bit values as an optimization. I can see value to both, but I worry that something not explicit could lead to surprises.

My initial thought is that I'd want my column types to round trip successfully, but I don't think we've decided if in Python we want distinct decimal32 vs decimal64 types or if we just want a decimal type that we hide promotion logic underneath.

cc @shwina @brandon-b-miller @codereport for their thoughts here

@shwina
Contributor

shwina commented Jan 18, 2021

I don't think we've decided if in Python we want distinct decimal32 vs decimal64 types or if we just want a decimal type that we hide promotion logic underneath.

Right now, we just have a decimal64 type that we're calling decimal. If we want to include decimal32, my preference would be explicit, separate decimal32 and decimal64 types. That would have more predictable behaviour w.r.t. memory usage, which is typically important for cuDF users.

@revans2
Contributor

revans2 commented Jan 19, 2021

@razajafri I think we need to spend some more time verifying that my quick look at the code is correct and that it is actually causing the error/data corruption you saw.

In the parquet reader, it looks like Spark is able to read in the values correctly regardless of the precision.

https://github.com/apache/spark/blob/32dad1d5a6d77aef64e2968c28195ca2bbafc1c1/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedColumnReader.java#L552-L555

DecimalType.is64BitDecimalType(column.dataType()) returns true for all precision values that could fit in a long, including those that would fit in a 32 bit decimal too.

https://github.com/apache/spark/blob/32dad1d5a6d77aef64e2968c28195ca2bbafc1c1/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DecimalType.scala#L236-L242

The problem is when we try to read them back out. WritableColumnVector is the base class used by the vectorized parquet reader, and when we read in a decimal value it assumes that the precision determines how the data was stored (32-bit vs 64-bit).

https://github.com/apache/spark/blob/32dad1d5a6d77aef64e2968c28195ca2bbafc1c1/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java#L362-L375

@shwina I don't think the question is around supporting decimal32 and decimal64 as separate types vs. a single type. I think the real question is: do we want cuDF to work around the bug in Spark, and possibly other platforms, so that if we write out a parquet file, Spark can read it back in again in all cases? We already know that we are not going to be compatible with some systems that only support reading decimal as fixed-length byte arrays (Hive and Impala, according to the Spark documentation). So if we are also incompatible with Spark it is probably OK, but it would be good to document how to work around it.

Either way I think we want to push for Spark to fix their bug, but it is likely going to take a long time for that fix to be released.

@shwina
Contributor

shwina commented Jan 19, 2021

@shwina I don't think the question is around supporting decimal32 and decimal64 as separate types vs a single type.

Understood -- my comment was specifically about the handling on the Python side. Apologies, I think that's orthogonal to this PR.

@razajafri
Contributor Author

@revans2 you are correct.

Here are my findings. The VectorizedParquetRecordReader reads in the parquet file correctly because it bases the read on the requestedSchema, which is a MessageType and has the underlying data stored correctly as INT64 (precisely, as optional INT64 Decimal(_,_)). The OnHeapColumnVector, on the other hand, is initialized based on the batchSchema, which comes from org.apache.spark.sql.parquet.row.requested_schema set by the reader; that is a StructType and only has Decimal(_,_) in it.

https://github.com/apache/spark/blob/a44e008de3ae5aecad9e0f1a7af6a1e8b0d97f4e/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java#L224

So, as @revans2 said, it initializes the OnHeapColumnVector with int32, basing it on the precision < 10, but the VectorizedParquetRecordReader, which has read the data as longs, expects the OnHeapColumnVector to be initialized with int64.

The right way to fix this would be to file a bug against Spark to handle this correctly, but as already mentioned, that could take a long time and the fix possibly won't make it into the upcoming release.

@kkraus14 kkraus14 added cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Jan 20, 2021
@kkraus14
Collaborator

cc @vuule

This sounds like another situation where an option / passable Schema would be desired.

To me, it doesn't make sense to change libcudf's behavior in general.

@hyperbolic2346
Contributor

To merge these two threads, there is now a Spark bug and a PR to fix it.

@razajafri
Contributor Author

@hyperbolic2346 Thanks, Mike.

@revans2
Contributor

revans2 commented Jan 28, 2021

With some help from the Spark community it looks like this is actually a violation of the parquet specification.

From
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md

DECIMAL can be used to annotate the following types:

    int32: for 1 <= precision <= 9
    int64: for 1 <= precision <= 18; precision < 10 will produce a warning
    fixed_len_byte_array: precision is limited by the array size. Length n can store <= floor(log_10(2^(8*n - 1) - 1)) base-10 digits
    binary: precision is not limited, but is required. The minimum number of bytes to store the unscaled value should be used.

It probably would be good for us to fix this in our writer.
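For reference, the precision-to-physical-type rules quoted above boil down to roughly the following (a simplified sketch that ignores the fixed_len_byte_array sizing formula):

```python
def parquet_physical_type_for_decimal(precision: int) -> str:
    # Simplified mapping of the Parquet DECIMAL annotation rules quoted above.
    if 1 <= precision <= 9:
        return "INT32"
    if precision <= 18:
        # int64 is allowed down to precision 1, but precision < 10 stored as
        # int64 "will produce a warning" per the spec.
        return "INT64"
    # Larger precisions need FIXED_LEN_BYTE_ARRAY (size-limited) or BINARY.
    return "FIXED_LEN_BYTE_ARRAY or BINARY"
```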

@kkraus14
Collaborator

int64: for 1 <= precision <= 18; precision < 10 will produce a warning

Why wouldn't it just be specified as 10 <= precision <= 18? I don't interpret a warning as a violation of the specification...

With some help from the Spark community it looks like this is actually a violation of the parquet specification.

Could you point me to the discussion in the Spark community? I don't see anything in the JIRA issue or the PR.

@razajafri
Contributor Author

To merge these two threads, there is now a Spark bug and a PR to fix it.

Is this the link you want @kkraus14?

@kkraus14
Collaborator

kkraus14 commented Feb 9, 2021

There was a claim that it was a violation of the Parquet specification, but the specification specifically says "int64: for 1 <= precision <= 18; precision < 10 will produce a warning". I interpret this as being within the specification, albeit in a way that will produce a warning.

I'm wondering where there was discussion that this is a violation of the specification as opposed to a bug / limitation of Spark in handling the case that produces a warning?

@revans2
Contributor

revans2 commented Feb 11, 2021

Sorry, it is a warning, not a full-on violation. We are working to fix it in Spark, but I still believe that other tools will make similar mistakes, and having cuDF fix it is ideal.

@kkraus14
Collaborator

I think the challenge here is that Spark is coming from the perspective of having a single Decimal type whose memory is either opaque or statically defined as 128 bits (I'm not sure which), so regardless of what the precision is, Spark will read it into that type.

On the other hand, libcudf has decided to support decimal32, decimal64, and in the future decimal128 explicitly. If someone is using decimal64 or decimal128 I imagine they'd expect something to round trip through Parquet in the type specified. Is there a logical type in Parquet or some other mechanism that could be used in this situation?

@kkraus14
Collaborator

cc @vuule @devavret for your thoughts / opinions here as well

@razajafri
Contributor Author

FYI, the Spark PR is now merged.

@razajafri
Contributor Author

Spark has fixed the issue; it will now read the Decimal value as it was written instead of basing the read on the precision of the Decimal value.
@kkraus14 I know you pinged @vuule and @devavret, but do we still need to keep this open?

@kkraus14
Collaborator

My weak opinion is that we should keep the current behavior and possibly add a warning if someone is writing decimal64 data that fits in a decimal32 space per the Parquet spec.

What do you think @revans2 @vuule @devavret?

@vuule
Contributor

vuule commented Mar 17, 2021

+1 for keeping the current behavior + a warning, mainly because we expose the type size to users, and keeping the input type seems like the least surprising behavior.

@kkraus14
Collaborator

We can easily implement this warning on the Python side, but it would be ideal to have some form of mechanism in libcudf if possible. Is there a clean mechanism for warning from libcudf?
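On the Python side, such a warning could look something like the following (a hypothetical helper, not an existing cuDF API; it relies on `cudf.Decimal64Dtype` exposing a `precision` attribute):

```python
import warnings
import cudf

def warn_on_narrow_decimal64(df: cudf.DataFrame) -> None:
    """Hypothetical pre-write check: warn when a decimal64 column's precision
    fits the Parquet int32 range (1 <= precision <= 9)."""
    for name, dtype in df.dtypes.items():
        if isinstance(dtype, cudf.Decimal64Dtype) and dtype.precision <= 9:
            warnings.warn(
                f"Column '{name}' has precision {dtype.precision} (< 10) but "
                "will be stored as int64; per the Parquet spec this produces "
                "a warning and some readers may mishandle it."
            )
```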

@devavret
Contributor

FWIW, if this is widespread, it's easy to add that option.

@vuule
Contributor

vuule commented Mar 17, 2021

Is there a clean mechanism for warning from libcudf?

Not that I know of (or that I could find in the code).

FWIW, if this is widespread, it's easy to add that option.

The option to convert to 32bit if precision < 10?

@devavret
Contributor

The option to convert to 32bit if precision < 10?

The option to specify the physical type for any input type, and a toggle in the input schema to control it.

@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@GregoryKimball
Contributor

Closing for now because Spark-RAPIDS has several good workarounds.
