[BUG] Fixed-point types with precision < 10 (for Spark) cannot be successfully read back from parquet #7152
Comments
@revans2 helped track this down to an issue reading parquet with precision < 10 when the value is stored as an int64. Storing the data as int64 is not necessary, but cuDF writes the column as-is and doesn't downsize to a smaller data size when that is possible. Spark will just fail to load this data. This is a bug in Spark, but it leads to the question: should we handle this by writing the data in the smallest format that can hold it? @kkraus14, what would you suggest here based on Python needs? We can have the Java plugin scale the column data down to 32 bits before calling the parquet writer, or cuDF can notice this situation and write the data as 32-bit values as an optimization. I can see value in both, but I worry that something not explicit could lead to surprises.
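For context, the Parquet format maps decimal precision to the physical storage sizes mentioned here. A minimal sketch of the "smallest format that can hold the data" decision (the helper name is invented for illustration):

```python
# Illustrative only: the Parquet format allows a DECIMAL logical type to be
# backed by INT32 for precision 1-9, INT64 for precision up to 18 (int64 is
# also permitted for smaller precisions, though that draws a warning), or a
# fixed-length byte array for anything larger. The helper name is made up.
def smallest_parquet_physical_type(precision: int) -> str:
    if precision <= 9:
        return "INT32"
    if precision <= 18:
        return "INT64"
    return "FIXED_LEN_BYTE_ARRAY"

# A decimal64 column with precision 5 could legally be written as INT32,
# which is what Spark expects; cuDF currently keeps the storage width (INT64).
assert smallest_parquet_physical_type(5) == "INT32"
```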
Do you mean vanilla CPU Spark or Spark-RAPIDS?
I think it would be a bit weird if someone writes a decimal64 column and gets a different type back when reading. Would welcome other thoughts here.
My initial thought is that I'd want my column types to round trip successfully, but I don't think we've decided if in Python we want distinct decimal32 and decimal64 types. cc @shwina @brandon-b-miller @codereport for their thoughts here
Right now, we just have a
@razajafri I think we need to spend some more time to verify that my quick look at the code is correct and is actually causing the error/data corruption you saw. In the parquet reader it looks like it is able to read in values correctly despite the precision.
The problem is when we try to read them back out. @shwina I don't think the question is around supporting decimal32 and decimal64 as separate types vs a single type. I think the question really is: do we want cudf to work around the bug in Spark, and possibly other platforms, so that if we write out a parquet file, Spark can read it back in again in all cases? We already know that we are not going to be compatible with some systems that only support reading decimal as fixed length byte arrays (Hive and Impala, according to the Spark documentation). So if we are also incompatible with Spark it is probably OK, but it would be good to document how to work around it. Either way I think we want to push for Spark to fix their bug, but it is likely going to take a long time for that fix to be released.
Understood -- my comment was specifically about the handling on the Python side. Apologies, I think that's orthogonal to this PR.
@revans2 you are correct. Here are my findings. So as @revans2 said, Spark initializes its reader from the declared precision and does not expect the values to be stored as int64 when the precision is less than 10. The right way to fix this would be to file a bug against Spark to handle this correctly, but as already mentioned that could take a long time and possibly won't make it into the upcoming release.
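One quick way to confirm what the writer produced is to inspect the file's schema; a sketch assuming pyarrow is installed, with the file name as a placeholder for the attached repro file:

```python
# Print the physical and logical type of every column cuDF wrote.
import pyarrow.parquet as pq

schema = pq.ParquetFile("decimals.parquet").schema
for i in range(len(schema)):
    col = schema.column(i)
    # For a precision-5 decimal64 column this shows physical_type INT64 with a
    # decimal(5, ...) logical type, the combination Spark's reader rejects.
    print(col.name, col.physical_type, col.logical_type)
```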
cc @vuule This sounds like another situation where an option / passable Schema would be desired. It doesn't make sense to me to change libcudf's behavior generally.
To merge these two threads, there is now a Spark bug and a PR to fix it.
@hyperbolic2346 thanks mike
With some help from the Spark community, it looks like this is actually a violation of the Parquet specification.
It probably would be good for us to fix this in our writer.
Why wouldn't it just be specified as
Could you point me to the discussion in the Spark community? I don't see anything in the JIRA issue or the PR.
Is this the link you want @kkraus14?
There was a claim that it was a violation of the Parquet specification, but the specification specifically says that case only produces a warning. I'm wondering where there was discussion that this is a violation of the specification, as opposed to a bug / limitation of Spark in handling a case that produces a warning?
Sorry, it is a warning, not a full-on violation. We are working to fix it in Spark, but I still believe that other tools will make similar mistakes and having cudf fix it is ideal.
I think the challenge here is Spark is coming from the perspective of having a single Decimal type whose memory is either opaque or statically defined as 128 bits (I'm not sure which), so regardless of what the precision is Spark will read it into that type. On the other hand, libcudf has decided to support decimal32, decimal64, and in the future decimal128 explicitly. If someone is using decimal64 or decimal128 I imagine they'd expect something to round trip through Parquet in the type specified. Is there a logical type in Parquet or some other mechanism that could be used in this situation?
FYI, the Spark PR is now merged.
+1 for keeping the current behavior + a warning. Mainly because we expose the type size to the users and keeping the input type seems like the least surprising behavior.
We can easily implement this warning on the Python side, but it would be ideal to have some form of mechanism in libcudf if possible. Is there a clean mechanism for warning from libcudf?
FWIW, if this is widespread, it's easy to add that option.
Not that I know of (or that I could find in the code).
The option to convert to 32-bit if precision < 10?
The option to specify the physical type for any input type, and a toggle in the input schema to control that.
This issue has been labeled |
Closing for now because Spark-RAPIDS has several good workarounds. |
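For reference, one workaround of this kind is to rewrite the file with a writer that uses a physical type every reader accepts. A sketch assuming pyarrow is available (its parquet writer stores decimals as fixed-length byte arrays by default); the file names are placeholders:

```python
import pyarrow.parquet as pq

# pyarrow honors the DECIMAL logical type regardless of the physical encoding,
# so the int64-backed columns are read back as decimal values...
table = pq.read_table("cudf_written.parquet")

# ...and writing them out again re-encodes the decimals with pyarrow's default
# physical type (FIXED_LEN_BYTE_ARRAY), which Spark accepts for any precision.
pq.write_table(table, "rewritten.parquet")
```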
Describe the bug
When writing parquet files with fixed-point types, fixed-point types with precision < 10 (Spark uses precision) cannot be successfully read back after being written to parquet. I think the columns are being written as ints, as Spark tries to read them back with `readLong` when a `readDecimal` was expected. This is just a hunch and the real problem might be completely different.

Steps/Code to reproduce bug
Create a parquet file with Decimals using precisions < 10
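A minimal sketch of that step, assuming cudf's Python decimal support (the precision/scale values are arbitrary and the cast from float is only for convenience):

```python
import cudf

# Build a small column and cast it to a decimal type with precision < 10.
df = cudf.DataFrame({"price": [1.23, 45.60, 7.89]})
df["price"] = df["price"].astype(cudf.Decimal64Dtype(precision=5, scale=2))

# cuDF writes this column with an INT64 physical type; reading the file back
# in Spark (without the Spark-side fix) is where the failure shows up.
df.to_parquet("decimals.parquet")
```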
Expected behavior
Reading back the parquet file with a 3rd party reader (I used Spark) should work without any problem
Additional context
Please see the attached parquet file
ROUND_TRIP.tar.gz