Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add struct type support for drop_list_duplicates #9202

Merged

Conversation

ttnghia
Copy link
Contributor

@ttnghia ttnghia commented Sep 9, 2021

This PR add support for struct type into the existing drop_list_duplicates API. This is the first time a nested type is supported in this function. Some more code cleanup has also been done.

To be clear: Only structs of basic types and structs of structs are supported. Structs of lists are not, due to their complex nature.

Closes #8972.
Blocked by #9218 (it is merged).

@ttnghia ttnghia added feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS non-breaking Non-breaking change labels Sep 9, 2021
@ttnghia ttnghia self-assigned this Sep 9, 2021
@ttnghia ttnghia added the 3 - Ready for Review Ready for review by team label Sep 14, 2021
@ttnghia ttnghia marked this pull request as ready for review September 14, 2021 18:20
@ttnghia ttnghia requested a review from a team as a code owner September 14, 2021 18:20
@rapidsai rapidsai deleted a comment from codecov bot Sep 18, 2021
@rapidsai rapidsai deleted a comment from codecov bot Sep 18, 2021
cpp/src/lists/drop_list_duplicates.cu Show resolved Hide resolved

// If nans are considered as NOT equal, even both element(i) and element(j) are NaNs this
// comparison will still return `false`. This is the desired behavior in Apache Spark.
return lhs_val == rhs_val;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the purposes of checking for duplicates, are we expecting to be checking for exact duplicates or does potential floating point precision come into play? If this is the latter, it might useful in the future to examine the overlap between this code and the equality/equivalency checking code in cudf_test/column_utilities.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an interesting question. I thought about this before when firstly implemented it, but since there is no requirement for adding precision control thus I just did an exact comparison. If there is a request for adding precision control then we can go back and add it easily.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is something we want to entertain. The implication would then be that all libcudf functions that are comparing floating point values for equality should allow controlling the precision. This is just part of how floating point values work and while it makes sense to have that control in testing utilities, putting it in actual compute APIs would be very detrimental to maintenance/performance/testing.

cpp/src/lists/drop_list_duplicates.cu Outdated Show resolved Hide resolved
Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of minor questions, but this looks pretty close to ready.

cpp/src/lists/drop_list_duplicates.cu Show resolved Hide resolved
{
// Two entries are not considered for equality if they belong to different lists
// If both element(i) and element(j) are NaNs and nans are considered as equal value then this
// comparison will return `true`. This is the desired behavior in Pandas.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this is something we can improve in a future PR if it's too big a change now, but do we want to embed pandas/Spark specific behavior this deep in the call stack rather than having some configuration value passed down? I can see that it would need to get passed through about 4 levels of calls to get here, which would be annoying, but it also seems more in line with our general design philosophy to have an enum to configure this behavior rather than encoding it here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Absolutely agree with this. When this API was first reviewed, there was a long discussion about how to satisfy both Pandas' users and Spark's users, since each side requires a different behavior (NaN are equal or not). In the meantime, I don't have a better idea how to avoid this deep call stack. I'm happy to do it (or anybody can do it) if there is a better way.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I'm missing something, but would it not be possible to define an enum for different nan behaviors and then pass that down (all the way through get_unique_entries_and_list_offsets->get_unique_entries_dispatch... I know it's a lot of forwarding) to control this behavior based on a templated version of the column_row_comparator_fn? I'm not saying that we need to do it in this PR, but if that solution would work and be preferable we should at least make an issue for this to avoid losing track once the PR is merged.

Copy link
Contributor Author

@ttnghia ttnghia Sep 20, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I'll take a note on this. There should be a room for improvement. I'll create an issue for this too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Here it is: #9257

@codecov
Copy link

codecov bot commented Sep 20, 2021

Codecov Report

Merging #9202 (5eac9a9) into branch-21.10 (3ee3ecf) will decrease coverage by 0.00%.
The diff coverage is 0.00%.

❗ Current head 5eac9a9 differs from pull request most recent head 2296f1a. Consider uploading reports for the commit 2296f1a to get more accurate results
Impacted file tree graph

@@               Coverage Diff                @@
##           branch-21.10    #9202      +/-   ##
================================================
- Coverage         10.85%   10.84%   -0.01%     
================================================
  Files               115      116       +1     
  Lines             19158    19171      +13     
================================================
  Hits               2080     2080              
- Misses            17078    17091      +13     
Impacted Files Coverage Δ
python/cudf/cudf/__init__.py 0.00% <ø> (ø)
python/cudf/cudf/_lib/__init__.py 0.00% <ø> (ø)
python/cudf/cudf/io/__init__.py 0.00% <0.00%> (ø)
python/cudf/cudf/io/text.py 0.00% <0.00%> (ø)
python/cudf/cudf/utils/ioutils.py 0.00% <0.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4defd25...2296f1a. Read the comment docs.

@vyasr
Copy link
Contributor

vyasr commented Sep 20, 2021

rerun tests

@ttnghia
Copy link
Contributor Author

ttnghia commented Sep 20, 2021

Rerun tests.

Copy link
Member

@harrism harrism left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comments inline. For the record, I really dislike the -NaN replacement requirement. Ideally this should be factored out into an opt-in extension for platform-specific NaN behavior. But I guess libcudf is already riddled with platform-specific special casing at a very low level.

cpp/src/lists/drop_list_duplicates.cu Show resolved Hide resolved
cpp/src/lists/drop_list_duplicates.cu Show resolved Hide resolved
cpp/src/lists/drop_list_duplicates.cu Outdated Show resolved Hide resolved
cpp/src/lists/drop_list_duplicates.cu Show resolved Hide resolved
cpp/src/lists/drop_list_duplicates.cu Outdated Show resolved Hide resolved
@ttnghia
Copy link
Contributor Author

ttnghia commented Sep 21, 2021

@gpucibot merge

@rapids-bot rapids-bot bot merged commit ba2cbd9 into rapidsai:branch-21.10 Sep 21, 2021
@ttnghia ttnghia deleted the drop_list_duplicates_for_structs branch September 21, 2021 01:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Spark Functionality that helps Spark RAPIDS
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] add nested struct support for collect_set and merge_set aggregations
5 participants