Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Nested struct binop comparison #9452

Closed
wants to merge 79 commits into from

Conversation

rwlee
Copy link
Contributor

@rwlee rwlee commented Oct 15, 2021

Partial solution to #8964

Adding column to column nested struct comparison for binary comparison operations. The existing binops cannot currently handle structs or nested types -- this PR does not attempt to add support for list types. It also does not include scalar to column and column to scalar comparison support.

Assumptions

Code currently assumes that the nested structs have the same type structure, if this is a bad assumption they can be added with a CUDF_FAIL result.

Alternatives considered

Using existing binops for the nested child columns and then merging the results was considered and abandoned because of its performance and its handling of nulls within the child column comparisons. The child column comparisons would evaluate a comparison involving null on either side as null, causing invalid results for the struct struct(2, null) == struct(2, null) should be true and struct(2, null) == struct(2, 1) should be false. However the child comparisons for both would result in [true, null] which could not be combined unambiguously.

Design

Leverages the row_equality_comparator and row_lexicographic_comparator to compare flattened tables of the struct column. To support the output type and different operator arguments, a type dispatch and a case switch are used. I'm hopeful that there is a better way to dispatch on the operator type than the case switch that's currently used. Right now each operator is a unique function, though with some cleanup I think they can be merged.

Open questions that need to be answered

Is there a better way to dispatch on the operator type than the case switch currently being used?
Currently this sits separate from the binop operators, but can be integrated into binary_ops.cu. How should this be integrated and should the code be part of struct utilities or the existing binops code?
Should nested null equality be configurable?

Follow ons

Add column to scalar and scalar to columnar support
Add null_equal operator support
JNI support for the new functionality

@rwlee rwlee added 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS labels Oct 15, 2021
@github-actions github-actions bot added the CMake CMake build issue label Oct 15, 2021
@rwlee rwlee added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Oct 15, 2021
@codecov
Copy link

codecov bot commented Oct 15, 2021

Codecov Report

Merging #9452 (1dd1159) into branch-22.06 (8d861ce) will decrease coverage by 0.09%.
The diff coverage is 94.84%.

@@               Coverage Diff                @@
##           branch-22.06    #9452      +/-   ##
================================================
- Coverage         86.40%   86.31%   -0.10%     
================================================
  Files               143      144       +1     
  Lines             22448    22648     +200     
================================================
+ Hits              19396    19548     +152     
- Misses             3052     3100      +48     
Impacted Files Coverage Δ
python/cudf/cudf/core/column/string.py 88.78% <ø> (-0.31%) ⬇️
python/cudf/cudf/core/frame.py 93.41% <ø> (ø)
python/cudf/cudf/core/indexed_frame.py 91.70% <ø> (ø)
python/cudf/cudf/io/__init__.py 100.00% <ø> (ø)
python/cudf/cudf/testing/testing.py 81.81% <50.00%> (+0.12%) ⬆️
python/cudf/cudf/io/parquet.py 90.83% <86.60%> (-1.87%) ⬇️
python/cudf/cudf/core/index.py 92.06% <88.88%> (-0.25%) ⬇️
python/cudf/cudf/core/scalar.py 89.01% <90.90%> (-0.31%) ⬇️
python/cudf/cudf/core/dataframe.py 93.78% <96.36%> (+0.08%) ⬆️
python/cudf/cudf/__init__.py 90.69% <100.00%> (+0.22%) ⬆️
... and 29 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0cc29a0...1dd1159. Read the comment docs.

@sameerz sameerz linked an issue Oct 19, 2021 that may be closed by this pull request
@rwlee rwlee marked this pull request as ready for review October 21, 2021 23:07
@rwlee rwlee requested a review from a team as a code owner October 21, 2021 23:07
@rwlee rwlee added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Oct 21, 2021
cpp/include/cudf/detail/structs/utilities.hpp Outdated Show resolved Hide resolved
cpp/include/cudf/detail/structs/utilities.hpp Outdated Show resolved Hide resolved
cpp/src/structs/utilities.cpp Outdated Show resolved Hide resolved
cpp/src/binaryop/compiled/binary_ops.hpp Outdated Show resolved Hide resolved
rapids-bot bot pushed a commit that referenced this pull request May 4, 2022
This PR adds some struct utility functions. This change is needed for the eventual support of structs in binary operations. See also: PR #9452.

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Jake Hemstad (https://github.com/jrhemstad)

URL: #10776
rapids-bot bot pushed a commit that referenced this pull request May 4, 2022
…#10780)

This PR changes the internal APIs used for binary operations to use `column_view` objects instead of `column_device_view` objects. This change is needed for the eventual support of structs in binary operations. See also: PR #9452.

Authors:
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Ryan Lee (https://github.com/rwlee)
  - Nghia Truong (https://github.com/ttnghia)
  - Jake Hemstad (https://github.com/jrhemstad)

URL: #10780
Comment on lines +315 to +317
template <typename Comparator, weak_ordering... values>
struct weak_ordering_comparator_impl {
__device__ bool operator()(size_type const& lhs, size_type const& rhs)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can actually static_assert this requirement to be sure it doesn't happen.

Double check that I'm using the right names here.

Suggested change
template <typename Comparator, weak_ordering... values>
struct weak_ordering_comparator_impl {
__device__ bool operator()(size_type const& lhs, size_type const& rhs)
template <typename Comparator, weak_ordering... values>
struct weak_ordering_comparator_impl {
static_assert( not (sizeof...(values) == 1 and values == weak_ordering::equivalent), "weak_ordering_comparator should not be used for pure equality comparisons. The `row_equality_comparator` should be used instead");
__device__ bool operator()(size_type const& lhs, size_type const& rhs)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For reference, the compiler complains about values == weak_ordering::equivalent in the context of the static assert. This forces an expansion --> ((weak_ordering::EQUIVALENT == values) && ...), which subsequently makes the sizeof...(values) == 1 check irrelevant.

binary_op_device_dispatcher<ops::NotEqual>{
*common_dtype, *outd, *lhsd, *rhsd, is_lhs_scalar, is_rhs_scalar});
}
if (is_struct(lhs.type()) && is_struct(rhs.type())) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is about 65 lines long. Could we separate out the struct logic into another function and call it here / return early, to reduce the long if/else branches in this function body?

void dispatch_equality_op(...) {
  CUDF_EXPECTS(...);
  if (is_struct(lhs.type()) && is_struct(rhs.type())) {
    // Handle struct data
    dispatch_equality_op_structs(...);
    return;
  }
  // existing function body.
}

@@ -190,6 +192,26 @@ std::optional<data_type> get_common_type(data_type out, data_type lhs, data_type

bool is_supported_operation(data_type out, data_type lhs, data_type rhs, binary_operator op)
{
return double_type_dispatcher(lhs, rhs, is_supported_operation_functor{}, out, op);
return double_type_dispatcher(lhs, rhs, is_supported_operation_functor{}, out, op) ||
(is_struct(lhs) && is_struct(rhs) &&
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic should be baked into the is_supported_operation_functor rather than treated as a special case here.

Comment on lines +202 to +205
bool is_supported_operation(data_type out,
column_view const& lhs,
column_view const& rhs,
binary_operator op)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer to keep only one API for is_supported_operation, rather than one for data types and one for column views. If the above function can be refactored to remove the special case for struct types, then we can replace the call to is_supported_operation(data_type out, data_type lhs, data_type rhs, binary_operator op) with that double type dispatcher call to the impl.

@@ -284,27 +290,71 @@ void apply_binary_op(mutable_column_view& out,
bool is_rhs_scalar,
rmm::cuda_stream_view stream)
{
auto common_dtype = get_common_type(out.type(), lhs.type(), rhs.type());
if (is_struct(lhs.type()) && is_struct(rhs.type())) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as below: This function is about 65 lines long. Could we separate out the struct logic into another function and call it here / return early, to reduce the long if/else branches in this function body?

*/
template <typename Nullate>
template <typename Nullate, bool NanConfig = false>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's try to pull out the NanConfig and related changes into a separate PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rapids-bot bot pushed a commit that referenced this pull request May 11, 2022
This PR changes the experimental `device_row_comparator` to return `weak_ordering` instead of `bool`.

Originally part of PR #9452. Aids PR #10730, which builds strongly-typed two table comparators and should return a `weak_ordering`.

Authors:
  - Ryan Lee (https://github.com/rwlee)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Nghia Truong (https://github.com/ttnghia)

URL: #10793
rapids-bot bot pushed a commit that referenced this pull request Jun 3, 2022
Further splitting up #9452 -- split off at the suggestion of @bdice 

Related to #10781 and #4760 -- issues and discussions related to NaN comparison behavior.

Authors:
  - Ryan Lee (https://github.com/rwlee)
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Jake Hemstad (https://github.com/jrhemstad)
  - Bradley Dice (https://github.com/bdice)
  - Nghia Truong (https://github.com/ttnghia)

URL: #10870
@rwlee
Copy link
Contributor Author

rwlee commented Jun 27, 2022

Closing in favor of #11153

@rwlee rwlee closed this Jun 27, 2022
rapids-bot bot pushed a commit that referenced this pull request Jul 8, 2022
Replaces #9452
Takes the core cudf work from 9452 and integrates it into the JNI while removing struct support integration into cudf's binary operator API. Includes scalar comparison support in addition to vector to vector comparison support.

Follow on needed
Add null_equal operator support

Authors:
  - Ryan Lee (https://github.com/rwlee)
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - Raza Jafri (https://github.com/razajafri)
  - Bradley Dice (https://github.com/bdice)
  - Nghia Truong (https://github.com/ttnghia)

URL: #11153
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team CMake CMake build issue improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Spark Functionality that helps Spark RAPIDS
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Add nested struct support for comparison operations