Fix null count in statistics for parquet #9303

Merged: 10 commits into rapidsai:branch-21.12 on Oct 6, 2021

devavret (Contributor)

Fixes #9221

The bug was in how nulls were counted. The count was computed as num_rows - num_valid, which required num_rows to be summed in typed_statistics_chunk because it is not part of the untyped struct. The serial typed_statistics_chunk.reduce() performed that summation, but the parallel block_reduce(typed_statistics_chunk) did not, so the parallel path produced an incorrect null count.

As part of the fix, num_rows is removed entirely, and null_count is incremented directly whenever a null is encountered in the column.
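
To make the description above concrete, here is a minimal Python sketch of the failure mode and the fix. It is a model, not the libcudf code: the class and function names (OldChunk, NewChunk, old_serial_reduce, old_parallel_reduce, new_accumulate, new_reduce) are hypothetical, and the way the parallel path "forgets" num_rows (keeping only the first partial's value) is an assumption chosen to illustrate one plausible outcome of the missing summation.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass
class OldChunk:
    """Hypothetical pre-fix statistics: nulls are derived, not stored."""
    num_rows: int = 0
    num_valid: int = 0

    @property
    def null_count(self) -> int:
        return self.num_rows - self.num_valid


def old_serial_reduce(a: OldChunk, b: OldChunk) -> OldChunk:
    # Serial path: both counters are summed, so the derived count is correct.
    return OldChunk(a.num_rows + b.num_rows, a.num_valid + b.num_valid)


def old_parallel_reduce(a: OldChunk, b: OldChunk) -> OldChunk:
    # Modelled parallel path (assumption): num_rows is NOT summed and simply
    # keeps the first partial's value, while num_valid is summed as usual.
    return OldChunk(a.num_rows, a.num_valid + b.num_valid)


@dataclass
class NewChunk:
    """Hypothetical post-fix statistics: null_count is stored, num_rows is gone."""
    num_valid: int = 0
    null_count: int = 0


def new_accumulate(values) -> NewChunk:
    # Increment null_count directly whenever a null is encountered.
    chunk = NewChunk()
    for v in values:
        if v is None:
            chunk.null_count += 1
        else:
            chunk.num_valid += 1
    return chunk


def new_reduce(a: NewChunk, b: NewChunk) -> NewChunk:
    # Every field is a plain sum, so any reduction path gives the same result.
    return NewChunk(a.num_valid + b.num_valid, a.null_count + b.null_count)


column = [1, None, 3, None, 5, 6, None, 8]   # 3 nulls in total
parts = [column[0:4], column[4:8]]           # two partial chunks

old_partials = [OldChunk(len(p), sum(v is not None for v in p)) for p in parts]
serial = reduce(old_serial_reduce, old_partials)
parallel = reduce(old_parallel_reduce, old_partials)
fixed = reduce(new_reduce, [new_accumulate(p) for p in parts])

print(serial.null_count, parallel.null_count, fixed.null_count)  # 3 -1 3
```

In this model the serial and parallel reductions disagree only because one field is treated differently. Storing null_count directly removes the derived quantity, so the result no longer depends on a counter that one of the paths fails to combine.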

devavret added labels on Sep 24, 2021: bug, 3 - Ready for Review, 4 - Needs cuIO Reviewer, non-breaking
devavret requested review from a team as code owners on September 24, 2021 18:29
github-actions bot added labels on Sep 24, 2021: Python, libcudf

codecov bot commented Sep 24, 2021

Codecov Report

Merging #9303 (9095cdb) into branch-21.12 (ab4bfaa) will decrease coverage by 0.03%.
The diff coverage is 1.72%.

@@               Coverage Diff                @@
##           branch-21.12    #9303      +/-   ##
================================================
- Coverage         10.79%   10.75%   -0.04%     
================================================
  Files               116      116              
  Lines             18869    19482     +613     
================================================
+ Hits               2036     2096      +60     
- Misses            16833    17386     +553     
Impacted Files                               Coverage Δ
python/cudf/cudf/__init__.py                 0.00% <0.00%> (ø)
python/cudf/cudf/_lib/__init__.py            0.00% <ø> (ø)
python/cudf/cudf/core/_base_index.py         0.00% <0.00%> (ø)
python/cudf/cudf/core/column/categorical.py  0.00% <0.00%> (ø)
python/cudf/cudf/core/column/column.py       0.00% <0.00%> (ø)
python/cudf/cudf/core/column/datetime.py     0.00% <0.00%> (ø)
python/cudf/cudf/core/column/lists.py        0.00% <0.00%> (ø)
python/cudf/cudf/core/column/numerical.py    0.00% <0.00%> (ø)
python/cudf/cudf/core/column/string.py       0.00% <0.00%> (ø)
python/cudf/cudf/core/column/timedelta.py    0.00% <0.00%> (ø)
... and 82 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 88eefe5...9095cdb.

devavret requested review from vuule and kaatish on October 5, 2021 22:40

vuule (Contributor) left a comment:

🔥

python/cudf/cudf/tests/test_orc.py (review thread resolved)

isVoid (Contributor) left a comment:

One small question.

python/cudf/cudf/tests/test_parquet.py (outdated review thread, resolved)
Co-authored-by: Michael Wang <isVoid@users.noreply.github.com>

vuule (Contributor) commented Oct 6, 2021

@gpucibot merge

rapids-bot bot merged commit 3f09f96 into rapidsai:branch-21.12 on Oct 6, 2021
vyasr added 4 - Needs Review and removed 4 - Needs cuIO Reviewer on Feb 23, 2024
Labels
3 - Ready for Review: Ready for review by team
4 - Needs Review: Waiting for reviewer to review or respond
bug: Something isn't working
libcudf: Affects libcudf (C++/CUDA) code.
non-breaking: Non-breaking change
Python: Affects Python cuDF API.
Linked issue that this pull request may close: [BUG] cuDF to_parquet Writing Incorrect Column-Chunk Statistics
6 participants