Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HostColumnVectoreCore#isNull should return true for out-of-range rows #10779

Merged

Conversation

gerashegalov
Copy link
Contributor

This PR suggests a 3VL way of interpreting isNull for a rowId out of bounds. Such a value is unknown and therefore isNull should be true.

NVIDIA/spark-rapids#5140 shows that SpecificUnsafeProjection may probe child columns for NULL even though the parent column row is also NULL.

However there are no rows in the child CV when the parent row is NULL leading to an assert violation if asserts are enabled or an NPE if disabled.

Signed-off-by: Gera Shegalov gera@apache.org

Signed-off-by: Gera Shegalov <gera@apache.org>
@gerashegalov gerashegalov added bug Something isn't working Java Affects Java cuDF API. labels May 3, 2022
@gerashegalov gerashegalov requested a review from a team as a code owner May 3, 2022 21:44
@gerashegalov gerashegalov added the breaking Breaking change label May 3, 2022
@gerashegalov gerashegalov self-assigned this May 3, 2022
@codecov
Copy link

codecov bot commented May 3, 2022

Codecov Report

Merging #10779 (b350219) into branch-22.06 (a9eb47c) will increase coverage by 0.03%.
The diff coverage is n/a.

@@               Coverage Diff                @@
##           branch-22.06   #10779      +/-   ##
================================================
+ Coverage         86.40%   86.43%   +0.03%     
================================================
  Files               143      143              
  Lines             22448    22448              
================================================
+ Hits              19396    19403       +7     
+ Misses             3052     3045       -7     
Impacted Files Coverage Δ
python/cudf/cudf/testing/_utils.py 93.98% <0.00%> (ø)
python/cudf/cudf/core/dataframe.py 93.74% <0.00%> (+0.04%) ⬆️
python/cudf/cudf/core/column/string.py 89.21% <0.00%> (+0.12%) ⬆️
python/cudf/cudf/core/groupby/groupby.py 91.79% <0.00%> (+0.22%) ⬆️
python/cudf/cudf/core/column/numerical.py 96.17% <0.00%> (+0.29%) ⬆️
python/cudf/cudf/core/tools/datetimes.py 84.49% <0.00%> (+0.30%) ⬆️
python/cudf/cudf/core/column/lists.py 92.91% <0.00%> (+0.83%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a9eb47c...b350219. Read the comment docs.

@ttnghia ttnghia changed the title HostColumnVectoreCore#isNull should return true for out-of-range rowId HostColumnVectoreCore#isNull should return true for out-of-range rows May 4, 2022
return BitVectorHelper.isNull(offHeap.valid, rowIndex);
}
return false;
return rowIndex < 0 || rowIndex >= rows // unknown, hence NULL
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is logically correct. Accessing out of bounds is different from having null.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We get to decide what is logically correct because we define the logic. E.g., we have different out-of-bound policies in other use cases. We could make it configurable here as well once there is a different use case necessitating it.

I need to instrument codegen in Spark to get a better grasp of how we end up in this lookup. However, in terms of tests it appears to be a fix that strictly improves the plugin because it does not break existing tests.

@gerashegalov gerashegalov added the 5 - Ready to Merge Testing and reviews complete, ready to merge label May 6, 2022
@gerashegalov
Copy link
Contributor Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit b12fd56 into rapidsai:branch-22.06 May 6, 2022
@gerashegalov gerashegalov deleted the nullForRowIdOutOfRange branch May 6, 2022 08:13
gerashegalov added a commit to NVIDIA/spark-rapids that referenced this pull request May 9, 2022
This PR adds tests verifying the fix for correctly dealing with all-null data partitions in rapidsai/cudf#10779. 

Signed-off-by: Gera Shegalov <gera@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge breaking Breaking change bug Something isn't working Java Affects Java cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants