Fix Parquet support for seconds and milliseconds duration types #11854

vuule · 2022-10-03T23:01:23Z

Description

The Parquet writer used int64 for second and millisecond durations. This does not match the Parquet specification, which requires int32 for these types.

Changed the physical type of time_millis to int32 to match the spec.
Set the logical type for time (duration) types.
Using the logical types allows us to write nanosecond durations as nanoseconds, so there is no longer any precision loss.
The Parquet writer option timestamp_type no longer applies to durations.
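For illustration, the mapping described above can be sketched as a small lookup. All names here (`DurationUnit`, `PhysicalType`, `TimeUnit`, `time_column_type`) are hypothetical and are not libcudf identifiers. Parquet's TIME type only has millisecond, microsecond, and nanosecond units, so the sketch assumes that second durations are scaled by 1000 and stored as milliseconds; the description does not spell that out.

```cpp
#include <cstdint>

// Illustrative names only; none of these are libcudf or Parquet library identifiers.
enum class DurationUnit { Seconds, Milliseconds, Microseconds, Nanoseconds };
enum class PhysicalType { Int32, Int64 };
enum class TimeUnit { Millis, Micros, Nanos };

struct TimeColumnType {
  PhysicalType physical;  // width of the stored value
  TimeUnit unit;          // unit carried by the TIME logical type
  std::int64_t scale;     // factor applied to source values when writing
};

// Assumption: seconds are scaled to milliseconds because TIME has no seconds unit.
constexpr TimeColumnType time_column_type(DurationUnit unit)
{
  switch (unit) {
    case DurationUnit::Seconds: return {PhysicalType::Int32, TimeUnit::Millis, 1000};
    case DurationUnit::Milliseconds: return {PhysicalType::Int32, TimeUnit::Millis, 1};
    case DurationUnit::Microseconds: return {PhysicalType::Int64, TimeUnit::Micros, 1};
    case DurationUnit::Nanoseconds: return {PhysicalType::Int64, TimeUnit::Nanos, 1};
  }
  return {PhysicalType::Int64, TimeUnit::Nanos, 1};  // unreachable for valid input
}

// Millisecond values use the 32-bit physical type; nanoseconds keep their unit.
static_assert(time_column_type(DurationUnit::Milliseconds).physical == PhysicalType::Int32,
              "TIME_MILLIS is stored as int32");
static_assert(time_column_type(DurationUnit::Nanoseconds).unit == TimeUnit::Nanos,
              "nanoseconds are written without rescaling");
```

Keeping the unit in the logical type is what lets nanosecond values be written unscaled in this sketch.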

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

vuule · 2022-10-04T00:49:16Z

Keeping as draft until I evaluate the impact on the file size.

codecov · 2022-10-06T01:26:58Z

Codecov Report

Base: 88.09% // Head: 86.87% // Decreases project coverage by -1.22% ⚠️

Coverage data is based on head (83a23c5) compared to base (f0b4c4f).
Patch coverage: 87.71% of modified lines in pull request are covered.

❗ Current head 83a23c5 differs from pull request most recent head cd89006. Consider uploading reports for the commit cd89006 to get more accurate results

Additional details and impacted files

@@               Coverage Diff                @@
##           branch-22.12   #11854      +/-   ##
================================================
- Coverage         88.09%   86.87%   -1.23%     
================================================
  Files               133      133              
  Lines             21982    22003      +21     
================================================
- Hits              19366    19115     -251     
- Misses             2616     2888     +272

Impacted Files	Coverage Δ
python/cudf/cudf/io/text.py	`91.66% <ø> (ø)`
python/cudf/cudf/io/json.py	`92.06% <75.00%> (-2.68%)`	⬇️
python/cudf/cudf/io/parquet.py	`90.45% <80.95%> (-0.39%)`	⬇️
python/cudf/cudf/core/dataframe.py	`93.63% <100.00%> (ø)`
python/cudf/cudf/io/avro.py	`81.25% <100.00%> (+2.67%)`	⬆️
python/cudf/cudf/io/csv.py	`92.30% <100.00%> (+0.20%)`	⬆️
python/cudf/cudf/io/orc.py	`92.94% <100.00%> (ø)`
python/cudf/cudf/utils/ioutils.py	`79.80% <100.00%> (+0.32%)`	⬆️
python/cudf/cudf/core/udf/strings_lowering.py	`0.00% <0.00%> (-100.00%)`	⬇️
python/cudf/cudf/core/udf/strings_typing.py	`0.00% <0.00%> (-95.78%)`	⬇️
... and 5 more

vuule · 2022-10-06T02:20:37Z

Ran benchmarks with and without the fix. There's a big difference in file size for high cardinality + low run length. Also, both the reader and the writer are faster (probably mostly because of the file size difference).

vuule · 2022-10-07T19:08:53Z

cpp/src/io/parquet/page_data.cu

+            if (s->col.converted_type == TIMESTAMP_MILLIS) {
              units = cudf::timestamp_ms::period::den;
-            } else if (s->col.converted_type == TIME_MICROS or
-                       s->col.converted_type == TIMESTAMP_MICROS) {
+            } else if (s->col.converted_type == TIMESTAMP_MICROS) {


DURATION types are excluded because we never scale when reading; timestamp_type no longer applies to them.

bdice · 2022-10-24T19:23:16Z

cpp/src/io/parquet/page_data.cu

@@ -1666,7 +1674,10 @@ __global__ void __launch_bounds__(block_size) gpuDecodePageData(
        } else if (dtype == INT96) {
          gpuOutputInt96Timestamp(s, val_src_pos, static_cast<int64_t*>(dst));
        } else if (dtype_len == 8) {
-          if (s->ts_scale) {
+          if (s->dtype_len_in == 4) {
+            // Reading INT32 TIME_MILLIS in 64-bit DURATION_MILLISECONDS


Do we need to link the Parquet specification or something like that? It's not obvious why this choice is made.
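As a rough illustration of the case the new comment describes, the sketch below decodes one PLAIN-encoded INT32 value (4 little-endian bytes) and widens it to a 64-bit millisecond count. The helper name `read_int32_millis_as_int64` is hypothetical and this is host-side C++, not the kernel code from the PR.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper, not libcudf code: decode one PLAIN-encoded INT32 value
// (4 little-endian bytes) and widen it to a 64-bit millisecond count.
inline std::int64_t read_int32_millis_as_int64(const unsigned char* src)
{
  // Assemble the bytes explicitly so the result does not depend on host endianness.
  std::uint32_t const bits = static_cast<std::uint32_t>(src[0]) |
                             (static_cast<std::uint32_t>(src[1]) << 8) |
                             (static_cast<std::uint32_t>(src[2]) << 16) |
                             (static_cast<std::uint32_t>(src[3]) << 24);
  std::int32_t value;
  std::memcpy(&value, &bits, sizeof(value));  // reinterpret the bit pattern as signed
  return static_cast<std::int64_t>(value);    // widening sign-extends negative values
}
```

The widening step is lossless, so no scaling or range check is needed on this path.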

bdice · 2022-10-24T19:24:32Z

cpp/src/io/parquet/page_enc.cu

          case FLOAT: {
            int32_t v;
-            if (dtype_len_in == 4)
+            if (dtype_len_in == 8)
+              v = s->col.leaf_column->element<int64_t>(val_idx);


How do we know this won't overflow? (Should we static_cast to make the downconversion explicit?) If this is a special case just for millisecond timestamps, should we guard it with a second check against that type?

vuule · 2022-10-24

Right, there's no guarantee that it won't overflow. I was thinking only in terms of round-tripping a Parquet file, so I didn't realize this.
What should be done in case of overflow (based on what the rest of cuDF does)?

I don't think we need another check for the exact type here. This is generic code, only depends on the input size. If we ever supported casting in the Parquet writer, this code would cover int64 to int32 casts (overflow issues and all :)).
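To make the overflow concern concrete: an int32 millisecond count covers roughly ±24.8 days, so a 64-bit duration outside that range cannot be narrowed without loss. The sketch below shows two generic ways to handle that, a range check and a saturating conversion. Both helper names are hypothetical, and neither is claimed to be what cuDF does; the thread leaves that question open.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helpers for illustration; not part of libcudf.

// True when a 64-bit value can be stored in 32 bits without loss.
constexpr bool fits_in_int32(std::int64_t v)
{
  return v >= std::numeric_limits<std::int32_t>::min() &&
         v <= std::numeric_limits<std::int32_t>::max();
}

// Clamp to the int32 range instead of wrapping around on overflow.
constexpr std::int32_t narrow_saturating(std::int64_t v)
{
  return v > std::numeric_limits<std::int32_t>::max()
           ? std::numeric_limits<std::int32_t>::max()
           : (v < std::numeric_limits<std::int32_t>::min()
                ? std::numeric_limits<std::int32_t>::min()
                : static_cast<std::int32_t>(v));
}
```

A plain `static_cast<std::int32_t>` makes the narrowing explicit but still discards the upper bits for out-of-range input.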

bdice

One suggestion for the Python test, otherwise LGTM.

vuule · 2022-10-31T22:43:48Z

rerun tests

ttnghia · 2022-11-01T04:37:57Z

cpp/src/io/parquet/page_enc.cu

            int32_t v;
-            if (dtype_len_in == 4)
+            if (dtype_len_in == 8)
+              v = s->col.leaf_column->element<int64_t>(val_idx);


Is this better?

auto const v = [&] { switch(....) { case ....: return ...; ... } }();

vuule · 2022-11-01

Ended up combining the two suggestions.
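One way the two suggestions (an explicit `static_cast` and an immediately invoked lambda) might be combined is sketched below. This is not the merged code: `load_as_int32` and its raw-pointer input are hypothetical stand-ins for the encoder's column access, and the 2-byte and 1-byte branches are assumed.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the encoder's column access; not the merged code.
inline std::int32_t load_as_int32(const void* data, int dtype_len_in, std::size_t idx)
{
  // The immediately invoked lambda lets `v` be const while still being
  // selected by a switch on the input width.
  auto const v = [&]() -> std::int32_t {
    switch (dtype_len_in) {
      case 8:
        // Explicit narrowing: out-of-range values lose their upper bits.
        return static_cast<std::int32_t>(static_cast<const std::int64_t*>(data)[idx]);
      case 4: return static_cast<const std::int32_t*>(data)[idx];
      case 2: return static_cast<const std::int16_t*>(data)[idx];
      default: return static_cast<const std::int8_t*>(data)[idx];
    }
  }();
  return v;
}
```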

PointKernel

LGTM

vuule · 2022-11-01T22:32:47Z

@gpucibot merge

fix physical type for s and ms; test

9dda63b

vuule added the bug, cuIO, and non-breaking labels Oct 3, 2022

vuule self-assigned this Oct 3, 2022

github-actions bot added the Python and libcudf labels Oct 3, 2022

vuule added 8 commits October 3, 2022 18:09

Merge branch 'branch-22.12' of https://github.com/rapidsai/cudf into bug-parquet-writer-timedelta

2705aaa

Merge branch 'branch-22.12' of https://github.com/rapidsai/cudf into bug-parquet-writer-timedelta

5c3a0b1

don't apply timestamp_type to durations

318daf7

Merge branch 'branch-22.12' of https://github.com/rapidsai/cudf into bug-parquet-writer-timedelta

6afa3ed

writer fix

f02b09e

reader fix

4bbbcb7

disable C++ tests w/ timestamp_type for durations

05687aa

Python test

47f4db0

vuule changed the title ~~Fix writing of seconds and milliseconds duration types to Parquet~~ Fix Parquet support for seconds and milliseconds duration types Oct 5, 2022

vuule added 4 commits October 5, 2022 23:18

C++ test

32e2172

fix data range

8051e42

Merge branch 'branch-22.12' of https://github.com/rapidsai/cudf into bug-parquet-writer-timedelta

bb6c378

set logical type; don't lose nano precision

d68f300

vuule commented Oct 7, 2022

vuule added 4 commits October 7, 2022 12:13

no negative scaling for durations

aeaa4d6

expand test; remove unneeded casts

05b8a47

statistics fix; test updates

6662853

comment; style

ec8b242

vuule marked this pull request as ready for review October 10, 2022 20:14

vuule requested a review from a team as a code owner October 10, 2022 20:15

bdice reviewed Oct 24, 2022

vuule added 5 commits October 24, 2022 16:18

link in comment

c295451

Merge branch 'branch-22.12' of https://github.com/rapidsai/cudf into bug-parquet-writer-timedelta

8b99c6b

check pandas output shape

0afbc64

also check column names

e80f41e

check. everything.

3a95a6d

vuule requested a review from bdice October 26, 2022 20:53

bdice approved these changes Oct 26, 2022

python/cudf/cudf/tests/test_parquet.py (outdated)

Update python/cudf/cudf/tests/test_parquet.py

eb2fa7b

Co-authored-by: Bradley Dice <bdice@bradleydice.com>

galipremsagar approved these changes Oct 26, 2022

ttnghia reviewed Nov 1, 2022

cpp/src/io/parquet/page_enc.cu (outdated)

ttnghia reviewed Nov 1, 2022

cpp/tests/io/parquet_test.cpp

vuule and others added 4 commits October 31, 2022 23:45

Merge branch 'branch-22.12' of https://github.com/rapidsai/cudf into bug-parquet-writer-timedelta

eaa0e7a

Update cpp/src/io/parquet/page_enc.cu

7f841f0

Co-authored-by: Nghia Truong <nghiatruong.vn@gmail.com>

Merge branch 'bug-parquet-writer-timedelta' of https://github.com/vuule/cudf into bug-parquet-writer-timedelta

154a064

code review changes

971a7f1

vuule requested a review from ttnghia November 1, 2022 08:05

ttnghia reviewed Nov 1, 2022

cpp/tests/io/parquet_test.cpp (outdated)

PointKernel approved these changes Nov 1, 2022

add masks to duration tests

cd89006

ttnghia approved these changes Nov 1, 2022

vuule added the 5 - Ready to Merge label Nov 1, 2022

rapids-bot bot merged commit 1c2ad6a into rapidsai:branch-22.12 Nov 1, 2022

vuule deleted the bug-parquet-writer-timedelta branch November 1, 2022 22:32

GregoryKimball mentioned this pull request Jun 7, 2023

[BUG] Unable to write timedelta64[s] type correctly with parquet writer #13409

Closed
