Add struct type support for `drop_list_duplicates` #9202

ttnghia · 2021-09-09T03:11:43Z

This PR adds struct type support to the existing `drop_list_duplicates` API. This is the first nested type the function supports. It also includes some code cleanup.

To be clear: only structs of basic types and structs of structs are supported. Structs of lists are not, because of their complexity.
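For readers unfamiliar with the operation, here is a minimal host-side sketch of what "drop duplicates within each list of structs" means. It is not the libcudf implementation: the type aliases and function names are hypothetical, a struct is modelled as a `std::tuple`, and the order of surviving entries in the real API is not modelled.

```cpp
#include <algorithm>
#include <cassert>
#include <tuple>
#include <utility>
#include <vector>

// Hypothetical host-side model: a struct row is a tuple, and a struct of
// structs is a tuple that contains another tuple.
using inner_struct    = std::tuple<int, int>;
using outer_struct    = std::tuple<int, inner_struct>;
using list_of_structs = std::vector<outer_struct>;

// Remove duplicate struct entries from one list. std::tuple compares field by
// field and recurses into nested tuples, so nested structs need no extra code.
inline list_of_structs drop_duplicates(list_of_structs list)
{
  std::sort(list.begin(), list.end());
  list.erase(std::unique(list.begin(), list.end()), list.end());
  return list;
}

// Apply the per-list operation to every list; duplicates are only detected
// within a list, never across lists.
inline std::vector<list_of_structs> drop_list_duplicates_host(std::vector<list_of_structs> lists)
{
  for (auto& list : lists) { list = drop_duplicates(std::move(list)); }
  return lists;
}

// Usage: the same struct value may survive in two different lists.
inline void example_struct_lists()
{
  auto const s = std::make_tuple(1, std::make_tuple(2, 3));
  auto const out = drop_list_duplicates_host({{s, s}, {s}});
  assert(out[0].size() == 1 && out[1].size() == 1);
}
```

A struct of lists would need a variable-length comparison per field, which hints at why that case is harder than the fixed-shape nesting shown here.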

Closes #8972.
Blocked by #9218 (now merged).

cpp/include/cudf/detail/gather.cuh

cpp/src/lists/drop_list_duplicates.cu

nvdbaranec · 2021-09-20T15:19:51Z

cpp/src/lists/drop_list_duplicates.cu

+
+    // If nans are considered as NOT equal, even both element(i) and element(j) are NaNs this
+    // comparison will still return `false`. This is the desired behavior in Apache Spark.
+    return lhs_val == rhs_val;


For the purposes of checking for duplicates, are we expecting to check for exact duplicates, or does floating point precision come into play? If it is the latter, it might be useful in the future to examine the overlap between this code and the equality/equivalency checking code in cudf_test/column_utilities.

This is an interesting question. I thought about this when I first implemented it, but since there is no requirement for precision control, I just did an exact comparison. If there is a request for precision control, we can go back and add it easily.

I don't think this is something we want to entertain. The implication would then be that all libcudf functions that are comparing floating point values for equality should allow controlling the precision. This is just part of how floating point values work and while it makes sense to have that control in testing utilities, putting it in actual compute APIs would be very detrimental to maintenance/performance/testing.
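To make the trade-off in this thread concrete, here is a small sketch contrasting exact comparison with a tolerance-based one. The function names and the absolute-tolerance rule are assumptions for illustration; this is not the code in cudf_test/column_utilities.

```cpp
#include <cassert>
#include <cmath>

// Exact comparison, as in the reviewed line `return lhs_val == rhs_val;`.
inline bool exactly_equal(double lhs, double rhs) { return lhs == rhs; }

// Hypothetical tolerance-based comparison of the kind a test utility might use.
inline bool approximately_equal(double lhs, double rhs, double abs_tol)
{
  return std::fabs(lhs - rhs) <= abs_tol;
}

inline void example_precision()
{
  // Rounding makes 0.1 + 0.2 differ from 0.3 in IEEE 754 double precision.
  assert(!exactly_equal(0.1 + 0.2, 0.3));
  assert(approximately_equal(0.1 + 0.2, 0.3, 1e-12));

  // Approximate equality is not transitive: a ~ b and b ~ c, but not a ~ c.
  assert(approximately_equal(0.0, 0.6, 1.0));
  assert(approximately_equal(0.6, 1.2, 1.0));
  assert(!approximately_equal(0.0, 1.2, 1.0));
}
```

The last three assertions show one further difficulty with tolerances in a deduplication routine: without transitivity, which entries count as "the same" can depend on the order in which they are compared.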

cpp/src/lists/drop_list_duplicates.cu

vyasr

A couple of minor questions, but this looks pretty close to ready.

cpp/src/lists/drop_list_duplicates.cu

vyasr · 2021-09-20T18:19:55Z

cpp/src/lists/drop_list_duplicates.cu

  {
-    // Two entries are not considered for equality if they belong to different lists
+    // If both element(i) and element(j) are NaNs and nans are considered as equal value then this
+    // comparison will return `true`. This is the desired behavior in Pandas.


Maybe this is something we can improve in a future PR if it's too big a change now, but do we want to embed pandas/Spark specific behavior this deep in the call stack rather than having some configuration value passed down? I can see that it would need to get passed through about 4 levels of calls to get here, which would be annoying, but it also seems more in line with our general design philosophy to have an enum to configure this behavior rather than encoding it here.

Absolutely agree with this. When this API was first reviewed, there was a long discussion about how to satisfy both Pandas users and Spark users, since each side requires a different behavior (NaNs are equal or not). For now, I don't have a better idea for avoiding this deep call stack. I'm happy to do it (or anybody can) if there is a better way.

Maybe I'm missing something, but would it not be possible to define an enum for different nan behaviors and then pass that down (all the way through get_unique_entries_and_list_offsets->get_unique_entries_dispatch... I know it's a lot of forwarding) to control this behavior based on a templated version of the column_row_comparator_fn? I'm not saying that we need to do it in this PR, but if that solution would work and be preferable we should at least make an issue for this to avoid losing track once the PR is merged.

Thanks. I'll make a note of this. There should be room for improvement. I'll create an issue for this too.

Done. Here it is: #9257
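A rough sketch of the suggestion in this thread, assuming a policy enum that is converted once into a template parameter of the comparator. The names `nan_policy`, `element_equal`, and `elements_equal` are hypothetical and are not taken from libcudf; the real comparator operates on column elements rather than plain doubles.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical policy enum; the name and enumerators are illustrative only.
enum class nan_policy { NANS_EQUAL, NANS_UNEQUAL };

// Comparator templated on the policy, so the NaN handling is fixed at compile
// time and the comparator itself mentions neither Pandas nor Spark.
template <nan_policy Policy>
struct element_equal {
  bool operator()(double lhs, double rhs) const
  {
    if (Policy == nan_policy::NANS_EQUAL && std::isnan(lhs) && std::isnan(rhs)) { return true; }
    // IEEE 754: NaN == NaN is false, which gives the NANS_UNEQUAL behavior.
    return lhs == rhs;
  }
};

// A single runtime-to-compile-time dispatch near the top of the call chain.
inline bool elements_equal(nan_policy policy, double lhs, double rhs)
{
  return policy == nan_policy::NANS_EQUAL ? element_equal<nan_policy::NANS_EQUAL>{}(lhs, rhs)
                                          : element_equal<nan_policy::NANS_UNEQUAL>{}(lhs, rhs);
}

inline void example_nan_policy()
{
  double const nan = std::nan("");
  assert(elements_equal(nan_policy::NANS_EQUAL, nan, nan));     // Pandas-style
  assert(!elements_equal(nan_policy::NANS_UNEQUAL, nan, nan));  // Spark-style
}
```

The cost noted above remains: the policy still has to be forwarded through each intermediate function until it reaches the dispatch point.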

codecov · 2021-09-20T18:55:34Z

Codecov Report

Merging #9202 (5eac9a9) into branch-21.10 (3ee3ecf) will decrease coverage by 0.00%.
The diff coverage is 0.00%.

❗ Current head 5eac9a9 differs from pull request most recent head 2296f1a. Consider uploading reports for the commit 2296f1a to get more accurate results

@@               Coverage Diff                @@
##           branch-21.10    #9202      +/-   ##
================================================
- Coverage         10.85%   10.84%   -0.01%     
================================================
  Files               115      116       +1     
  Lines             19158    19171      +13     
================================================
  Hits               2080     2080              
- Misses            17078    17091      +13

Impacted Files	Coverage Δ
python/cudf/cudf/__init__.py	`0.00% <ø> (ø)`
python/cudf/cudf/_lib/__init__.py	`0.00% <ø> (ø)`
python/cudf/cudf/io/__init__.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/io/text.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/utils/ioutils.py	`0.00% <0.00%> (ø)`

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4defd25...2296f1a.

vyasr · 2021-09-20T20:11:56Z

rerun tests

ttnghia · 2021-09-20T20:35:10Z

Rerun tests.

harrism

Minor comments inline. For the record, I really dislike the -NaN replacement requirement. Ideally this should be factored out into an opt-in extension for platform-specific NaN behavior. But I guess libcudf is already riddled with platform-specific special casing at a very low level.
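For context on the -NaN replacement mentioned here, a minimal sketch of what such a normalization could look like for a single value. The helper names are hypothetical and this is not the PR's `has_negative_nans_fn` / `replace_negative_nans_fn` code. It only illustrates that -NaN and +NaN differ in the sign bit, so any bitwise or sign-aware comparison could treat them as distinct values.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>
#include <limits>

// Raw bit pattern of a double, for inspecting the sign bit.
inline std::uint64_t bits_of(double value)
{
  std::uint64_t bits;
  std::memcpy(&bits, &value, sizeof bits);
  return bits;
}

// True only for NaNs whose sign bit is set.
inline bool is_negative_nan(double value) { return std::isnan(value) && std::signbit(value); }

// Hypothetical normalization: clear the sign bit of negative NaNs and leave
// every other value (including negative numbers) untouched.
inline double replace_negative_nan(double value)
{
  return is_negative_nan(value) ? std::copysign(value, 1.0) : value;
}

inline void example_negative_nan()
{
  double const neg_nan = std::copysign(std::numeric_limits<double>::quiet_NaN(), -1.0);
  double const fixed   = replace_negative_nan(neg_nan);
  assert(std::isnan(fixed) && !std::signbit(fixed));
  assert(bits_of(neg_nan) != bits_of(fixed));
}
```

An opt-in extension of the kind proposed above would presumably run a step like this only when the caller asks for it.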

cpp/src/lists/drop_list_duplicates.cu


ttnghia · 2021-09-21T00:48:48Z

@gpucibot merge

ttnghia added 8 commits September 3, 2021 14:50

Update doxygen

aa92eb4

Change preconditioning

669a8a4

Rewrite tests

206b300

Implement has_negative_nans_fn for structs

82f31cf

Implement replace_negative_nans_fn for structs

dca7b74

Implementation is working

75b246f

Add test

2a7efce

Merge branch 'branch-21.10' into drop_list_duplicates_for_structs

b6b0c45

ttnghia added the feature request, libcudf, Spark, and non-breaking labels Sep 9, 2021

ttnghia self-assigned this Sep 9, 2021

ttnghia added 7 commits September 12, 2021 14:33

Add tests for structs

edca8d5

Fix test for structs

73f4823

Rename structs

d00c192

Access children of structs column by sliced child

5c56282

Rewrite doxygen, rename variable, and various other small changes

a27e186

Add sliced input test

9d54708

Apply upstream gather.cuh

064c958

ttnghia commented Sep 14, 2021

cpp/include/cudf/detail/gather.cuh (outdated)

ttnghia added the 3 - Ready for Review label Sep 14, 2021

ttnghia marked this pull request as ready for review September 14, 2021 18:20

ttnghia requested a review from a team as a code owner September 14, 2021 18:20

ttnghia requested review from harrism and nvdbaranec September 14, 2021 18:20

ttnghia added the 5 - Merge After Dependencies label Sep 14, 2021

Fix offsets with non-zero base

c89b40b

jrhemstad reviewed Sep 17, 2021

cpp/src/lists/drop_list_duplicates.cu

rapidsai deleted a comment from codecov bot Sep 18, 2021

nvdbaranec reviewed Sep 20, 2021

cpp/src/lists/drop_list_duplicates.cu

nvdbaranec reviewed Sep 20, 2021


nvdbaranec requested changes Sep 20, 2021

cpp/src/lists/drop_list_duplicates.cu (outdated)

vyasr requested changes Sep 20, 2021


ttnghia added 2 commits September 20, 2021 12:53

Address review comments

f8767f3

Merge branch 'branch-21.10' into drop_list_duplicates_for_structs

ad15972

ttnghia requested review from vyasr and nvdbaranec September 20, 2021 18:55

vyasr approved these changes Sep 20, 2021


This was referenced Sep 20, 2021

[FEA] Simplify code for NaN handling in lists/drop_list_duplicates #9257

Closed

[FEA] Add an internal utility API to return an offsets column of a sliced column starting with zero #9256

Open

nvdbaranec approved these changes Sep 20, 2021


Remove debug printing

8a2e993

harrism requested changes Sep 20, 2021


Add constructors for the functors and add comments for `has_negative_nans_dispatch`

2296f1a

ttnghia requested a review from harrism September 20, 2021 22:49

harrism approved these changes Sep 21, 2021


rapids-bot bot merged commit ba2cbd9 into rapidsai:branch-21.10 Sep 21, 2021

ttnghia deleted the drop_list_duplicates_for_structs branch September 21, 2021 01:20
