-
Notifications
You must be signed in to change notification settings - Fork 891
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix Parquet support for seconds and milliseconds duration types #11854
Fix Parquet support for seconds and milliseconds duration types #11854
Conversation
Keeping as draft until I evaluate the impact on the file size. |
…bug-parquet-writer-timedelta
…bug-parquet-writer-timedelta
…bug-parquet-writer-timedelta
Codecov ReportBase: 88.09% // Head: 86.87% // Decreases project coverage by
Additional details and impacted files@@ Coverage Diff @@
## branch-22.12 #11854 +/- ##
================================================
- Coverage 88.09% 86.87% -1.23%
================================================
Files 133 133
Lines 21982 22003 +21
================================================
- Hits 19366 19115 -251
- Misses 2616 2888 +272
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report at Codecov. |
Ran benchmarks with and without the fix - There's big difference in file size for high cardinality + low run length. Also, both the reader and the writer are faster (probably mostly because of the file size diff). |
if (s->col.converted_type == TIMESTAMP_MILLIS) { | ||
units = cudf::timestamp_ms::period::den; | ||
} else if (s->col.converted_type == TIME_MICROS or | ||
s->col.converted_type == TIMESTAMP_MICROS) { | ||
} else if (s->col.converted_type == TIMESTAMP_MICROS) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DURATION types excluded because we never scale when reading - timestamp_type does not apply any more.
cpp/src/io/parquet/page_data.cu
Outdated
@@ -1666,7 +1674,10 @@ __global__ void __launch_bounds__(block_size) gpuDecodePageData( | |||
} else if (dtype == INT96) { | |||
gpuOutputInt96Timestamp(s, val_src_pos, static_cast<int64_t*>(dst)); | |||
} else if (dtype_len == 8) { | |||
if (s->ts_scale) { | |||
if (s->dtype_len_in == 4) { | |||
// Reading INT32 TIME_MILLIS in 64-bit DURATION_MILLISECONDS |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to link the Parquet specification or something like that? It's not obvious why this choice is made.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added link
cpp/src/io/parquet/page_enc.cu
Outdated
case FLOAT: { | ||
int32_t v; | ||
if (dtype_len_in == 4) | ||
if (dtype_len_in == 8) | ||
v = s->col.leaf_column->element<int64_t>(val_idx); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do we know this won't overflow? (Should we static_cast
to make the downconversion explicit?) If this is a special case just for millisecond timestamps, should we guard it with a second check against that type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, there's no guarantee that it won't overflow. I've been thinking only in terms of round-trip of Parquet file so I didn't realize this.
What should be done in case of overflow (based on what the rest of cuDF does)?
I don't think we need another check for the exact type here. This is generic code, only depends on the input size. If we ever supported casting in the Parquet writer, this code would cover int64 to int32 casts (overflow issues and all :)).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One suggestion for the Python test, otherwise LGTM.
Co-authored-by: Bradley Dice <bdice@bradleydice.com>
rerun tests |
cpp/src/io/parquet/page_enc.cu
Outdated
int32_t v; | ||
if (dtype_len_in == 4) | ||
if (dtype_len_in == 8) | ||
v = s->col.leaf_column->element<int64_t>(val_idx); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this better?
auto const v = [&] {
switch(....) {
case ....: return ...;
...
}
}();
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ended up combining the two suggestions.
…bug-parquet-writer-timedelta
Co-authored-by: Nghia Truong <nghiatruong.vn@gmail.com>
…le/cudf into bug-parquet-writer-timedelta
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@gpucibot merge |
Description
Fixes #11833
Parquet writer used int64 for
second
andmillisecond
durations. This does not match the Parquet spec, which requires int32 to be used here.Changed the physical type of time_millis to int32 to match specs.
Set logical type for time(duration) types.
Using the logical types allows us to write nanosecond durations as nanoseconds, so no precision loss any more.
Parquet writer option
timestamp_type
does not apply to durations any more.Checklist