[FEA] support reading decimal data stored as byte array from parquet files #6909

Closed
sperlingxx opened this issue Dec 4, 2020 · 2 comments
Labels
cuIO cuIO issue feature request New feature or request improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS

Comments

@sperlingxx
Contributor

Currently, we can read decimal columns from parquet files only if their storage type is INT32 or INT64. But in real-world applications, many parquet files contain decimal columns stored as FIXED_LEN_BYTE_ARRAY.

I think we should support reading them as a fixed-point data type, just like integer-based decimal columns. For values exceeding 8 bytes, perhaps we could apply a rounding cast to fit them into DECIMAL64?
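To make the request concrete, here is a minimal sketch of the conversion involved, assuming the Parquet convention that a byte-array decimal holds the unscaled value as a big-endian two's-complement integer. The helper names `decode_unscaled` and `to_decimal64` are hypothetical and only illustrate the idea; this is not the libcudf implementation, and it rejects values wider than 64 bits rather than rounding them.

```python
from decimal import Decimal

INT64_MIN = -(1 << 63)
INT64_MAX = (1 << 63) - 1


def decode_unscaled(raw: bytes) -> int:
    """Interpret big-endian two's-complement bytes as a signed integer."""
    return int.from_bytes(raw, byteorder="big", signed=True)


def to_decimal64(raw: bytes, scale: int) -> Decimal:
    """Hypothetical helper: decode a byte-array decimal whose unscaled
    value must fit in a signed 64-bit integer."""
    unscaled = decode_unscaled(raw)
    if not INT64_MIN <= unscaled <= INT64_MAX:
        # Assumption for this sketch: fail instead of applying a rounding cast.
        raise OverflowError("unscaled value does not fit in 64 bits")
    # Shift the decimal point left by `scale` digits.
    return Decimal(unscaled).scaleb(-scale)


# A 16-byte field whose value still fits comfortably in 64 bits.
raw = bytes.fromhex("00000000000000000000000000003039")  # unscaled 12345
value = to_decimal64(raw, scale=2)
```

The point of the sketch is that the byte width of the stored field does not decide whether a value fits; only the decoded unscaled integer does.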

@sperlingxx sperlingxx added feature request New feature or request Needs Triage Need team to review and classify labels Dec 4, 2020
@sperlingxx sperlingxx added the improvement Improvement / enhancement to an existing function label Dec 4, 2020
@kkraus14 kkraus14 added cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS and removed Needs Triage Need team to review and classify labels Dec 7, 2020
@revans2
Contributor

revans2 commented Dec 7, 2020

To be clear, we have noticed that files written by pyarrow and older versions of Spark can contain decimal columns that would fit in INT32 or INT64 but are stored as FIXED_LEN_BYTE_ARRAY, simply because the INT32 and INT64 formats are newer.

Supporting these is a priority for us from the Spark side. Reading > 64-bit decimal values is not something we are planning on trying to support until we have 128-bit decimal support.
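As a rough illustration of which columns fall into the supported case, the sketch below relates decimal precision to storage width. It relies on the Parquet rule that INT32 can hold precision up to 9 and INT64 up to 18; the function names `min_bytes_for_precision` and `fits_in_64_bits` are hypothetical and not part of cuDF or Spark.

```python
def min_bytes_for_precision(precision: int) -> int:
    """Smallest two's-complement width, in bytes, that can hold every
    unscaled value with the given number of decimal digits."""
    if precision < 1:
        raise ValueError("precision must be at least 1")
    max_unscaled = 10 ** precision - 1
    # Magnitude bits plus one sign bit, rounded up to whole bytes.
    bits = max_unscaled.bit_length() + 1
    return (bits + 7) // 8


def fits_in_64_bits(precision: int) -> bool:
    """True when every value of this precision fits in a signed 64-bit integer."""
    return min_bytes_for_precision(precision) <= 8


# Precision 18 is the widest that always fits in 64 bits; 19 needs 9 bytes.
widths = {p: min_bytes_for_precision(p) for p in (9, 10, 18, 19, 38)}
```

Under these assumptions, a byte-array column with precision 18 or less can always be read into a 64-bit decimal, while precision 19 and above would need the 128-bit support mentioned above.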

rapids-bot bot pushed a commit that referenced this issue Dec 15, 2020
This pull request is to address #6909.

Authors:
  - sperlingxx <lovedreamf@gmail.com>
  - Alfred Xu <lovedreamf@gmail.com>

Approvers:
  - Robert (Bobby) Evans
  - Mike Wilson
  - Devavret Makkar

URL: #6969
@sperlingxx
Contributor Author

Closing this issue since the corresponding pull request has been merged.


3 participants