Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Parquet support for seconds and milliseconds duration types #11854

Merged
merged 28 commits into from
Nov 1, 2022

Conversation

vuule
Copy link
Contributor

@vuule vuule commented Oct 3, 2022

Description

Fixes #11833

Parquet writer used int64 for second and millisecond durations. This does not match the Parquet spec, which requires int32 to be used here.

Changed the physical type of time_millis to int32 to match specs.
Set logical type for time(duration) types.
Using the logical types allows us to write nanosecond durations as nanoseconds, so no precision loss any more.
Parquet writer option timestamp_type does not apply to durations any more.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@vuule vuule added bug Something isn't working cuIO cuIO issue non-breaking Non-breaking change labels Oct 3, 2022
@vuule vuule self-assigned this Oct 3, 2022
@github-actions github-actions bot added Python Affects Python cuDF API. libcudf Affects libcudf (C++/CUDA) code. labels Oct 3, 2022
@vuule
Copy link
Contributor Author

vuule commented Oct 4, 2022

Keeping as draft until I evaluate the impact on the file size.

@vuule vuule changed the title Fix writing of seconds and milliseconds duration types to Parquet Fix Parquet support for seconds and milliseconds duration types Oct 5, 2022
@codecov
Copy link

codecov bot commented Oct 6, 2022

Codecov Report

Base: 88.09% // Head: 86.87% // Decreases project coverage by -1.22% ⚠️

Coverage data is based on head (83a23c5) compared to base (f0b4c4f).
Patch coverage: 87.71% of modified lines in pull request are covered.

❗ Current head 83a23c5 differs from pull request most recent head cd89006. Consider uploading reports for the commit cd89006 to get more accurate results

Additional details and impacted files
@@               Coverage Diff                @@
##           branch-22.12   #11854      +/-   ##
================================================
- Coverage         88.09%   86.87%   -1.23%     
================================================
  Files               133      133              
  Lines             21982    22003      +21     
================================================
- Hits              19366    19115     -251     
- Misses             2616     2888     +272     
Impacted Files Coverage Δ
python/cudf/cudf/io/text.py 91.66% <ø> (ø)
python/cudf/cudf/io/json.py 92.06% <75.00%> (-2.68%) ⬇️
python/cudf/cudf/io/parquet.py 90.45% <80.95%> (-0.39%) ⬇️
python/cudf/cudf/core/dataframe.py 93.63% <100.00%> (ø)
python/cudf/cudf/io/avro.py 81.25% <100.00%> (+2.67%) ⬆️
python/cudf/cudf/io/csv.py 92.30% <100.00%> (+0.20%) ⬆️
python/cudf/cudf/io/orc.py 92.94% <100.00%> (ø)
python/cudf/cudf/utils/ioutils.py 79.80% <100.00%> (+0.32%) ⬆️
python/cudf/cudf/core/udf/strings_lowering.py 0.00% <0.00%> (-100.00%) ⬇️
python/cudf/cudf/core/udf/strings_typing.py 0.00% <0.00%> (-95.78%) ⬇️
... and 5 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@vuule
Copy link
Contributor Author

vuule commented Oct 6, 2022

Ran benchmarks with and without the fix - There's big difference in file size for high cardinality + low run length. Also, both the reader and the writer are faster (probably mostly because of the file size diff).

Comment on lines +880 to +882
if (s->col.converted_type == TIMESTAMP_MILLIS) {
units = cudf::timestamp_ms::period::den;
} else if (s->col.converted_type == TIME_MICROS or
s->col.converted_type == TIMESTAMP_MICROS) {
} else if (s->col.converted_type == TIMESTAMP_MICROS) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DURATION types excluded because we never scale when reading - timestamp_type does not apply any more.

@vuule vuule marked this pull request as ready for review October 10, 2022 20:14
@vuule vuule requested a review from a team as a code owner October 10, 2022 20:15
@@ -1666,7 +1674,10 @@ __global__ void __launch_bounds__(block_size) gpuDecodePageData(
} else if (dtype == INT96) {
gpuOutputInt96Timestamp(s, val_src_pos, static_cast<int64_t*>(dst));
} else if (dtype_len == 8) {
if (s->ts_scale) {
if (s->dtype_len_in == 4) {
// Reading INT32 TIME_MILLIS in 64-bit DURATION_MILLISECONDS
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to link the Parquet specification or something like that? It's not obvious why this choice is made.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added link

case FLOAT: {
int32_t v;
if (dtype_len_in == 4)
if (dtype_len_in == 8)
v = s->col.leaf_column->element<int64_t>(val_idx);
Copy link
Contributor

@bdice bdice Oct 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do we know this won't overflow? (Should we static_cast to make the downconversion explicit?) If this is a special case just for millisecond timestamps, should we guard it with a second check against that type?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, there's no guarantee that it won't overflow. I've been thinking only in terms of round-trip of Parquet file so I didn't realize this.
What should be done in case of overflow (based on what the rest of cuDF does)?

I don't think we need another check for the exact type here. This is generic code, only depends on the input size. If we ever supported casting in the Parquet writer, this code would cover int64 to int32 casts (overflow issues and all :)).

python/cudf/cudf/tests/test_parquet.py Outdated Show resolved Hide resolved
@vuule vuule requested a review from bdice October 26, 2022 20:53
Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One suggestion for the Python test, otherwise LGTM.

python/cudf/cudf/tests/test_parquet.py Outdated Show resolved Hide resolved
Co-authored-by: Bradley Dice <bdice@bradleydice.com>
@vuule
Copy link
Contributor Author

vuule commented Oct 31, 2022

rerun tests

Comment on lines 1114 to 1116
int32_t v;
if (dtype_len_in == 4)
if (dtype_len_in == 8)
v = s->col.leaf_column->element<int64_t>(val_idx);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this better?

auto const v = [&] {
  switch(....) {
    case ....: return ...;
  ...
  }
}();

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ended up combining the two suggestions.

@vuule vuule requested a review from ttnghia November 1, 2022 08:05
Copy link
Member

@PointKernel PointKernel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@vuule vuule added the 5 - Ready to Merge Testing and reviews complete, ready to merge label Nov 1, 2022
@vuule
Copy link
Contributor Author

vuule commented Nov 1, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 1c2ad6a into rapidsai:branch-22.12 Nov 1, 2022
@vuule vuule deleted the bug-parquet-writer-timedelta branch November 1, 2022 22:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

[BUG] Resolve parquet writer issue with "timedelta64[s]" and "timedelta64[ms]"
5 participants