Optimize compaction operations #10030

Merged: 55 commits into rapidsai:branch-22.04 on Feb 2, 2022

PointKernel
Member

@PointKernel PointKernel commented Jan 12, 2022

Related to #9413.

This PR adds hash-based unordered_drop_duplicates and unordered_distinct_count APIs. It does not close the original issue, because a std::unique-like drop_duplicates is not addressed here. The PR makes the following changes:

  • Change the behavior of the existing distinct_count: it now counts the number of consecutive groups of equivalent rows instead of the total number of unique rows.
  • Add hash-based unordered_distinct_count: this new API counts unique rows across the whole table by using a hash map. It requires a newer version of cuco that includes two bug fixes: "Fix an insert count bug" (NVIDIA/cuCollections#132) and "Get rid of std::move when using cuco::make_pair" (NVIDIA/cuCollections#138).
  • Add hash-based unordered_drop_duplicates: similar to drop_duplicates, but this API does not support the keep option, and the order of its output is unspecified.
  • Replace all C++-side uses of drop_duplicates/distinct_count with the unordered_ versions.
  • Update and replace the existing compaction benchmark with nvbench.
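
To illustrate the difference between the two counting behaviors described above, here is a minimal host-side sketch in plain C++. It is not libcudf code: the function names `consecutive_group_count`, `hash_distinct_count`, and `hash_drop_duplicates` are hypothetical, a single `int` column stands in for table rows, and `std::unordered_set` stands in for the cuco hash map.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for the new distinct_count behavior:
// counts runs of equal adjacent values, like std::unique would keep.
std::size_t consecutive_group_count(std::vector<int> const& rows)
{
  std::size_t count = 0;
  for (std::size_t i = 0; i < rows.size(); ++i) {
    if (i == 0 || rows[i] != rows[i - 1]) { ++count; }
  }
  return count;
}

// Hypothetical stand-in for unordered_distinct_count:
// counts unique values across the whole input using a hash set.
std::size_t hash_distinct_count(std::vector<int> const& rows)
{
  return std::unordered_set<int>(rows.begin(), rows.end()).size();
}

// Hypothetical stand-in for unordered_drop_duplicates:
// keeps one copy of each value; the output order is unspecified.
std::vector<int> hash_drop_duplicates(std::vector<int> const& rows)
{
  std::unordered_set<int> const unique_rows(rows.begin(), rows.end());
  return std::vector<int>(unique_rows.begin(), unique_rows.end());
}
```

For the input {1, 1, 2, 1, 3, 3}, the consecutive version sees four groups (1, 2, 1, 3), while the hash-based version sees three unique values. The two agree only when equal rows are adjacent, for example after sorting.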

@PointKernel added the feature request, 2 - In Progress, code quality, libcudf, Performance, and breaking labels on Jan 12, 2022
@PointKernel self-assigned this on Jan 12, 2022
@PointKernel requested review from a team as code owners on January 12, 2022 23:14
@PointKernel marked this pull request as draft on January 12, 2022 23:14
@github-actions bot added the Java and Python labels on Jan 12, 2022
Contributor

@bdice bdice left a comment

@PointKernel I started another round of review on Friday. I'm going to submit these comments now, but I haven't completed the second pass of review yet. Some of these comments may be outdated by now; I apologize!

Resolved review threads:
cpp/src/stream_compaction/drop_duplicates.cu (outdated)
cpp/include/cudf/stream_compaction.hpp (outdated)
cpp/include/cudf/stream_compaction.hpp (outdated)
cpp/include/cudf/stream_compaction.hpp (outdated)
cpp/include/cudf/stream_compaction.hpp
cpp/src/stream_compaction/drop_duplicates.cu (outdated)
cpp/src/stream_compaction/drop_duplicates.cu (outdated)
@codecov

codecov bot commented Jan 24, 2022

Codecov Report

❗ No coverage uploaded for pull request base (branch-22.04@baff5cf).
The diff coverage is n/a.

@@               Coverage Diff               @@
##             branch-22.04   #10030   +/-   ##
===============================================
  Coverage                ?   10.48%           
===============================================
  Files                   ?      122           
  Lines                   ?    20496           
  Branches                ?        0           
===============================================
  Hits                    ?     2148           
  Misses                  ?    18348           
  Partials                ?        0           


Contributor

@bdice bdice left a comment

Another round of feedback attached. Thanks for your persistence; it is a large PR!

Resolved review threads:
cpp/src/stream_compaction/distinct_count.cu (outdated)
cpp/src/stream_compaction/distinct_count.cu
cpp/src/stream_compaction/distinct_count.cu (outdated)
cpp/src/stream_compaction/distinct_count.cu (outdated)
cpp/src/stream_compaction/distinct_count.cu (outdated)
cpp/tests/stream_compaction/distinct_count_tests.cpp (outdated)
cpp/tests/stream_compaction/distinct_count_tests.cpp (outdated)
cpp/tests/stream_compaction/distinct_count_tests.cpp (outdated)
cpp/tests/stream_compaction/distinct_count_tests.cpp (outdated)
@PointKernel
Member Author

rerun tests

Contributor

@bdice bdice left a comment

I have only a few minor comments for style/clarity left. I am really happy with the progress this has seen through several iterations, and I appreciate your effort on this PR! I will approve this once the minor comments have been resolved.

Resolved review threads:
cpp/include/cudf/stream_compaction.hpp
cpp/src/dictionary/add_keys.cu (outdated)
cpp/src/dictionary/add_keys.cu
cpp/src/stream_compaction/distinct_count.cu (outdated)
cpp/src/stream_compaction/distinct_count.cu (outdated)
cpp/src/stream_compaction/distinct_count.cu (outdated)
cpp/src/stream_compaction/distinct_count.cu (outdated)
cpp/src/stream_compaction/distinct_count.cu
cpp/tests/stream_compaction/drop_duplicates_tests.cpp (outdated)
cpp/tests/stream_compaction/drop_duplicates_tests.cpp (outdated)
@bdice
Contributor

bdice commented Feb 2, 2022

Suggested edit:

diff --git a/cpp/src/stream_compaction/distinct_count.cu b/cpp/src/stream_compaction/distinct_count.cu
--- cpp/src/stream_compaction/distinct_count.cu
+++ cpp/src/stream_compaction/distinct_count.cu
@@ -233,21 +233,24 @@
                                          rmm::cuda_stream_view stream)
 {
   if (0 == input.size() or input.null_count() == input.size()) { return 0; }
 
-  // Check for NaNs
-  // Checking for nulls in input and flag nan_handling, as the count will
-  // only get affected if these two conditions are true. NaN will only be
-  // double-counted as a null if nan_handling was NAN_IS_NULL and input also
-  // had null values. If so, we decrement the count.
+  auto count = detail::unordered_distinct_count(table_view{{input}}, null_equality::EQUAL, stream);
+
+  // Check for nulls. If the null policy is EXCLUDE and null values were found,
+  // we decrement the count.
+  auto const has_null = input.has_nulls();
+  if (has_null and null_handling == null_policy::EXCLUDE) { --count; }
+
+  // Check for NaNs. There are two cases that can lead to decrementing the
+  // count. The first case is when the input has no nulls, but has NaN values
+  // handled as a null via NAN_IS_NULL and has a policy to EXCLUDE null values
+  // from the count. The second case is when the input has null values and NaN
+  // values handled as nulls via NAN_IS_NULL. Regardless of whether the null
+  // policy is set to EXCLUDE, we decrement the count to avoid double-counting
+  // null and NaN as distinct entities.
   auto const has_nan_as_null = (nan_handling == nan_policy::NAN_IS_NULL) and
                                cudf::type_dispatcher(input.type(), has_nans{}, input, stream);
-  auto const has_null = input.has_nulls();
-
-  auto count = detail::unordered_distinct_count(table_view{{input}}, null_equality::EQUAL, stream);
-
-  // if nan is considered null and there are already null values
-  if (null_handling == null_policy::EXCLUDE and has_null) { --count; }
   if (has_nan_as_null and (has_null or null_handling == null_policy::EXCLUDE)) { --count; }
   return count;
 }
 }  // namespace detail
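
The count adjustments in the suggested diff can be sketched on the host in plain C++. This is a simplified model under stated assumptions, not the libcudf implementation: `distinct_count_sketch` is a hypothetical name, the two enums are local stand-ins that only mirror the names used in the diff, `std::optional<double>` models a nullable column, and the sketch assumes that the hash-based count treats all nulls as one distinct entry and all NaNs as one distinct entry.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <optional>
#include <unordered_set>
#include <vector>

// Local stand-ins that mirror the enum names in the diff (not the cudf types).
enum class null_policy { EXCLUDE, INCLUDE };
enum class nan_policy { NAN_IS_NULL, NAN_IS_VALID };

// Hypothetical host-side model of the column distinct count in the diff.
std::size_t distinct_count_sketch(std::vector<std::optional<double>> const& input,
                                  null_policy null_handling,
                                  nan_policy nan_handling)
{
  std::unordered_set<double> values;  // distinct non-null, non-NaN values
  bool has_null = false;
  bool has_nan  = false;
  for (auto const& v : input) {
    if (!v.has_value()) {
      has_null = true;
    } else if (std::isnan(*v)) {
      has_nan = true;
    } else {
      values.insert(*v);
    }
  }

  // Mirrors the early return: empty input or all-null input yields 0.
  if (values.empty() && !has_nan) { return 0; }

  // Assumed hash-based count: nulls collapse to one entry, NaNs to one entry.
  std::size_t count = values.size() + (has_null ? 1 : 0) + (has_nan ? 1 : 0);

  // Nulls are excluded from the count.
  if (has_null && null_handling == null_policy::EXCLUDE) { --count; }

  // NaN handled as null: drop it if nulls are excluded, or if a null is
  // already counted, so null and NaN are not counted as distinct entities.
  bool const has_nan_as_null = (nan_handling == nan_policy::NAN_IS_NULL) && has_nan;
  if (has_nan_as_null && (has_null || null_handling == null_policy::EXCLUDE)) { --count; }
  return count;
}
```

For the input {1.0, null, NaN, 1.0, 2.0}, the unadjusted count is 4. With INCLUDE and NAN_IS_NULL it drops to 3, because null and NaN merge into one entry; with EXCLUDE and NAN_IS_NULL it drops to 2.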

Contributor

@bdice bdice left a comment

@PointKernel I had one final suggestion relating to #10030 (comment). Diff is above. Approving.

Contributor

@robertmaynard robertmaynard left a comment

CMake changes LGTM

@PointKernel
Member Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit b6bb463 into rapidsai:branch-22.04 Feb 2, 2022
@PointKernel PointKernel deleted the optimize-compaction branch May 26, 2022 17:43
Labels
3 - Ready for Review Ready for review by team breaking Breaking change CMake CMake build issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Performance Performance related issue Python Affects Python cuDF API.