
Occupancy improvement for Hash table build #15700

Open

wants to merge 50 commits into branch-24.10

Conversation

tgujar
Contributor

@tgujar tgujar commented May 8, 2024

Description

Implements specialized template dispatch for hash joins and mixed semi joins to fix the issue described in #15502.

At a high level, this PR typedefs some types to void, depending on the column types in the rows, to avoid the high register usage for comparator and hasher operations associated with more involved types (lists, structs, strings, ...). This is done by dynamic dispatch on the CPU side using std::variant + std::visit and dispatching with a specialized template.

This pattern can later be extended to other joins and also to the groupby operation. Any operator using the row hasher and row comparator should be able to see an improvement in occupancy for the hash table build/probe operation.
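As a minimal sketch of the CPU-side dispatch idea described above (the names here are hypothetical stand-ins, not the actual libcudf code), a runtime "has complex columns" check can select one of two template instantiations once, with std::visit routing all later calls:

```cpp
#include <variant>

// Hypothetical sketch: two comparator instantiations, one compiled with
// complex-type support (heavy register use), one with those paths disabled.
template <bool SupportsComplex>
struct row_comparator {
  bool operator()(int lhs, int rhs) const { return lhs == rhs; }
};

using comparator_variant =
    std::variant<row_comparator<true>, row_comparator<false>>;

// CPU-side dynamic dispatch: inspect the column types once at runtime,
// pick the matching instantiation, then std::visit into it.
inline comparator_variant make_comparator(bool has_complex_types) {
  if (has_complex_types) { return row_comparator<true>{}; }
  return row_comparator<false>{};
}

inline bool compare_rows(comparator_variant const& cmp, int lhs, int rhs) {
  return std::visit([&](auto const& c) { return c(lhs, rhs); }, cmp);
}
```

The point of the pattern is that the branch on column types happens once on the host, so each device instantiation only carries the code paths it actually needs.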

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.


copy-pr-bot bot commented May 8, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label May 8, 2024
@tgujar
Contributor Author

tgujar commented May 8, 2024

I think the approach of specializing the type dispatcher is very cumbersome and will lead to a lot of code duplication. Currently, I have the conditional dispatch working for device_row_hasher, but I am unsure if there is a better way to implement this. We could introduce a macro here to generate the code; what do you think?

@PointKernel PointKernel added non-breaking Non-breaking change 3 - Ready for Review Ready for review by team improvement Improvement / enhancement to an existing function Performance Performance related issue labels May 8, 2024
@PointKernel
Member

/ok to test


@PointKernel
Member

PointKernel commented May 14, 2024

@tgujar I've updated the docs to unblock CI. Have you noticed any performance regressions for other use cases? It seems to improve performance for mixed join, but performance drops significantly in other cases using the row hasher.

Comment on lines 48 to 50
id_to_type<type_id::DECIMAL128>,
id_to_type<type_id::DECIMAL64>,
id_to_type<type_id::DECIMAL32>,
Contributor

I don't think decimal types are complex types. They are just a wrapper around some integer type.

Contributor Author

The equality operator for decimal performs scaling, which uses exponentiation.

CUDF_HOST_DEVICE inline bool operator==(fixed_point<Rep1, Rad1> const& lhs,

I see a reduction in register usage if I comment out decimal types in #15502. I think we can still decide on the types excluded in the branches later on.
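To illustrate why decimal equality is heavier than plain integer equality (this is a simplified sketch, not the cudf fixed_point implementation), comparing two values at different scales requires rescaling one side via an integer exponentiation:

```cpp
#include <cassert>
#include <cstdint>

// Illustrative sketch: a decimal is an unscaled integer plus a base-10 scale,
// so the represented number is value * 10^scale.
constexpr int64_t ipow10(int32_t exp) {
  int64_t result = 1;
  while (exp-- > 0) { result *= 10; }
  return result;
}

struct fixed_point {
  int64_t value;  // unscaled integer representation
  int32_t scale;  // represented number is value * 10^scale
};

constexpr bool operator==(fixed_point lhs, fixed_point rhs) {
  if (lhs.scale == rhs.scale) { return lhs.value == rhs.value; }
  // Rescale the side with the larger exponent down to the smaller one,
  // which costs a runtime exponentiation and a multiply.
  if (lhs.scale > rhs.scale) {
    return lhs.value * ipow10(lhs.scale - rhs.scale) == rhs.value;
  }
  return lhs.value == rhs.value * ipow10(rhs.scale - lhs.scale);
}
```

The extra arithmetic in the mixed-scale path is the kind of code that inflates register usage in a fused comparator kernel.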

Contributor Author

Let me know if we could resolve this. I have addressed this here #15700 (comment)

@PointKernel
Member

/ok to test

@PointKernel
Member

@tgujar Could you take a look at the failing tests?

@PointKernel
Member

/ok to test


@davidwendt
Contributor

This PR needs to be rebased on branch-24.08.

@tgujar
Contributor Author

tgujar commented May 30, 2024

Specializing both the comparator and the hasher drops the register usage to 54 instead of the expected 46 for the mixed semi join case. I am investigating why the register pressure differs from simply commenting out the code paths.
The current plan is to avoid using a macro (as mentioned here) and instead do dynamic dispatch on the CPU side using std::variant and std::visit.

@davidwendt
Contributor

/ok to test

@ttnghia
Contributor

ttnghia commented Aug 2, 2024

/ok to test

@davidwendt
Contributor

/ok to test

@PointKernel
Member

/ok to test

Comment on lines +130 to +132
id_to_type<type_id::DECIMAL128>,
id_to_type<type_id::DECIMAL64>,
id_to_type<type_id::DECIMAL32>,
Member

Decimals are fixed-width types, not compound. If this is intended, we probably want to use another term here.

Member

Out of curiosity, will removing the decimal types from the current compound definition impact performance?

Contributor

I believe the row-operators may benefit from being dispatched using the dispatch_storage_type defined here:

struct dispatch_storage_type {

Since we do not compare columns of different types, the scale values are the same, and so comparing on the storage type for decimals/fixed-point will give the same result with less generated code.
Here are a few examples of dispatching with the dispatch_storage_type:
cudf::type_dispatcher<dispatch_storage_type>(input.type(),

cudf::type_dispatcher<dispatch_storage_type>(source_column.type(),

Maybe this can be combined with the dispatch mechanism here and eliminate any need for special handling for fixed-point types.
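A simplified illustration of the storage-type idea (the decimal types below are stand-ins, not the actual cudf definitions): a trait maps each decimal wrapper to its underlying integer representation and passes every other type through unchanged, so the dispatcher instantiates operators on plain integers.

```cpp
#include <cstdint>
#include <type_traits>

// Stand-in decimal wrappers; illustrative only.
struct decimal32 { int32_t value; int32_t scale; };
struct decimal64 { int64_t value; int32_t scale; };

// Map a decimal type to its underlying integer storage type; other types
// pass through unchanged. Comparing on the storage type is valid when both
// columns share the same scale, and generates far less code.
template <typename T> struct storage_type { using type = T; };
template <> struct storage_type<decimal32> { using type = int32_t; };
template <> struct storage_type<decimal64> { using type = int64_t; };

template <typename T>
using storage_type_t = typename storage_type<T>::type;
```

Dispatching on storage_type_t<T> instead of T means the decimal comparators collapse into the already-instantiated integer comparators.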

Contributor Author

@tgujar tgujar Sep 18, 2024

I measured this by selectively commenting out the types in type_dispatcher.hpp, and I see a non-trivial speedup. In both of the measured benchmarks, the other complex types (list, struct, string, dict) are commented out.
For future work, I think it would be good to see if we are still latency bound.
Decimal effect on Hash join occupancy - Sheet1.csv

(Speedup numbers are in the sheet because I learned later that the nvbench compare script only supports JSON format.)

cpp/include/cudf/table/experimental/row_operators.cuh Outdated Show resolved Hide resolved
cpp/include/cudf/table/experimental/row_operators.cuh Outdated Show resolved Hide resolved
cpp/include/cudf/table/experimental/row_operators.cuh Outdated Show resolved Hide resolved
cpp/include/cudf/table/experimental/row_operators.cuh Outdated Show resolved Hide resolved
@@ -0,0 +1,44 @@
/*
* Copyright (c) 2022-2024, NVIDIA CORPORATION.
Member

Suggested change
* Copyright (c) 2022-2024, NVIDIA CORPORATION.
* Copyright (c) 2024, NVIDIA CORPORATION.

cpp/src/join/mixed_join_kernels_semi_nested.cu Outdated Show resolved Hide resolved
cpp/src/join/mixed_join_semi.cu Outdated Show resolved Hide resolved
cpp/src/join/mixed_join_semi.cu Outdated Show resolved Hide resolved
@PointKernel
Member

@tgujar can you please resolve the merge conflicts against ToT? The build time still appears to be an issue, but we need a successful CI run to confirm.

Contributor

@bdice bdice left a comment

Comments attached. Thanks for this! There's a lot of heavy templating but it's fairly readable in spite of that.

I am also interested in build time comparisons to the previous code.

src/join/mixed_join_semi.cu
src/join/mixed_join_kernels_semi.cu
Contributor

Keep these filenames alphabetized. If you like, you could rename this to mixed_join_semi_kernels.cu.

@@ -109,8 +130,8 @@ struct distinct_hash_join {
std::shared_ptr<cudf::experimental::row::equality::preprocessed_table>
_preprocessed_build; ///< input table preprocssed for row operators
std::shared_ptr<cudf::experimental::row::equality::preprocessed_table>
_preprocessed_probe; ///< input table preprocssed for row operators
hash_table_type _hash_table; ///< hash table built on `_build`
_preprocessed_probe; ///< input table preprocssed for row operators
Contributor

Suggested change
_preprocessed_probe; ///< input table preprocssed for row operators
_preprocessed_probe; ///< input table preprocessed for row operators

cpp/include/cudf/table/experimental/row_operators.cuh Outdated Show resolved Hide resolved
struct dispatch_void_conditional_generator {
/// The underlying type
template <typename T>
using type = dispatch_void_conditional_t<std::disjunction<std::is_same<T, Types>...>::value, T>;
Contributor

Suggested change
using type = dispatch_void_conditional_t<std::disjunction<std::is_same<T, Types>...>::value, T>;
using type = dispatch_void_conditional_t<std::disjunction_v<std::is_same<T, Types>...>, T>;

/// The type to dispatch to if the type is nested
using type = std::conditional_t<t == type_id::STRUCT or t == type_id::LIST, void, id_to_type<t>>;
/// The underlying type
using type = dispatch_void_if_nested_t<id_to_type<t>>;
Contributor

Typically we define things the other way around: define the dispatch_void_if_nested struct first, then define the using dispatch_void_if_nested_t alias in terms of that struct's ::type member.
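A short sketch of the struct-then-alias convention being suggested (the trait and the nested-type stand-ins below are illustrative, not the cudf definitions, which are keyed on cudf::type_id):

```cpp
#include <type_traits>

// Stand-ins for nested column view types; illustrative only.
struct list_view {};
struct struct_view {};

template <typename T>
inline constexpr bool is_nested_v =
    std::is_same_v<T, list_view> || std::is_same_v<T, struct_view>;

// Define the struct with a ::type member first...
template <typename T>
struct dispatch_void_if_nested {
  using type = std::conditional_t<is_nested_v<T>, void, T>;
};

// ...then define the alias in terms of it.
template <typename T>
using dispatch_void_if_nested_t = typename dispatch_void_if_nested<T>::type;
```

This ordering keeps the struct available for partial specialization while the alias stays the convenient spelling at call sites.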

Contributor Author

Okay yep, but I think here I need dispatch_void_if_nested to be templated on cudf::type_id but I need dispatch_void_if_nested_t to be templated on some type T. Maybe they should be named differently?


auto const output_begin =
thrust::make_transform_output_iterator(build_indices->begin(), output_fn{});
// TODO conditional find for nulls once `cuco::static_set::find_if` is added
Contributor

This feature now exists in cuCollections, I think. Let's refactor if we can.

Contributor Author

I think maybe we could address this in a separate PR, since the change wouldn't be covered by this PR's description. What do you think?

cpp/src/join/mixed_join_kernel_semi_impl.cuh Outdated Show resolved Hide resolved
cpp/src/join/mixed_join_kernel_semi_impl.cuh Outdated Show resolved Hide resolved
cpp/src/join/mixed_join_kernels_semi.cuh Outdated Show resolved Hide resolved
cpp/src/join/mixed_join_kernels_semi_compound.cu Outdated Show resolved Hide resolved
@tgujar
Contributor Author

tgujar commented Aug 27, 2024

@tgujar can you please resolve the merge conflicts against ToT? The build time still appears to be an issue, but we need a successful CI run to confirm.

Unsure how to handle this. #16603 says that we would like the launch and compilation to happen in the same TU for CUDA whole-compilation mode. In this PR's case, that means all the kernel instantiations would happen in the same TU, but we split the instantiations in this PR to reduce compilation time for the mixed semi join kernels. I think multiple launch functions wouldn't be a good design.

Contributor

@robertmaynard robertmaynard left a comment

This PR reintroduces the ODR violations corrected in #16603.

It needs to be refactored so that all kernels are only launched from the TU that holds the implementation.

@robertmaynard
Contributor

@tgujar can you please resolve the merge conflicts against ToT? The build time still appears to be an issue, but we need a successful CI run to confirm.

Unsure how to handle this. #16603 says that we would like the launch and compilation to happen in the same TU for CUDA whole compilation mode. In this PR case, it means that all the instantiation of the kernels happen in same TU. But we split the instantiation in this PR to reduce compilation time for mixed semi join kernels. I think multiple launch functions wouldn't be good design.

You should be able to follow the updated pattern seen in cpp/src/join/mixed_join_kernel_nulls.cu, cpp/src/join/mixed_join_kernel.cu, cpp/src/join/mixed_join_kernel.cuh, and cpp/src/join/mixed_join_kernel.hpp.

That restructuring uses separate TUs for the mixed join kernel based on the nullability of the input. This was done by giving the intermediate host launch code a specialization in each TU.
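The per-TU specialization pattern can be sketched with explicit template instantiation (collapsed here into one file for illustration; the function names are hypothetical stand-ins for the actual cudf launchers, and the kernel launch is replaced by plain host code):

```cpp
// --- mixed_join_kernel.hpp equivalent: declaration only ---
template <bool HasNulls>
int launch_mixed_join(int row_count);

// --- mixed_join_kernel.cuh equivalent: the template body ---
// In the real layout this body (and the __global__ kernel it would launch)
// is only included by the .cu files that instantiate it, so each kernel is
// launched from the TU that compiles it and no ODR violation can occur.
template <bool HasNulls>
int launch_mixed_join(int row_count) {
  // Stand-in for configuring and launching the kernel.
  return HasNulls ? row_count + 1 : row_count;
}

// --- mixed_join_kernel.cu / mixed_join_kernel_nulls.cu equivalents ---
// One explicit instantiation per TU, keyed on input nullability.
template int launch_mixed_join<false>(int);
template int launch_mixed_join<true>(int);
```

Callers elsewhere only see the declaration, link against whichever instantiation their input's nullability selects, and the per-flavor compile cost stays split across TUs.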

Labels
3 - Ready for Review Ready for review by team CMake CMake build issue improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Performance Performance related issue
Projects
Status: In Progress
Status: Slip

8 participants