Add accurate hash join size functions #8453

Merged
merged 21 commits into from
Jun 14, 2021
Changes from 20 commits
Commits
bbf7ad4
Add hash_join::inner_join_size API
PointKernel Jun 4, 2021
b3d2781
Add left_join_size & full_join_size APIs in hash_join class
PointKernel Jun 4, 2021
e08be99
Add detail::compute_join_output_size function
PointKernel Jun 4, 2021
5213eae
Implement inner/left/full_join_size functions for hash_join class
PointKernel Jun 4, 2021
59c55c8
Add default parameters to the existing join APIs
PointKernel Jun 4, 2021
1a41ea6
Add optional output_size argument to external join APIs
PointKernel Jun 4, 2021
e0df4f6
Add *_join_size APIs in the join header file + doc updates
PointKernel Jun 4, 2021
27be6d6
Updates:
PointKernel Jun 4, 2021
85b109e
Remove the deprecated estimate_join_output_size function
PointKernel Jun 4, 2021
f35d43c
Use std::nullopt instead of uninitialized std::optional variable
PointKernel Jun 4, 2021
a56f2e4
Update join unit tests for new inner/left/full_join_size APIs
PointKernel Jun 4, 2021
52d4f4e
Early exit for trivial left join cases
PointKernel Jun 4, 2021
24f4bfd
Fix bugs in full_join_size: create get_full_join_size function as a t…
PointKernel Jun 7, 2021
34e4ac4
Use optional::value_or to get rid of if-else branches
PointKernel Jun 8, 2021
dc51d99
Merge remote-tracking branch 'upstream/branch-21.08' into hash-join-size
PointKernel Jun 8, 2021
da98f84
Merge remote-tracking branch 'upstream/branch-21.08' into hash-join-size
PointKernel Jun 8, 2021
9a2d99e
Minor doc updates
PointKernel Jun 8, 2021
7c7366c
Minor update: pass stream to device_scalar::value
PointKernel Jun 8, 2021
89f068f
Merge remote-tracking branch 'upstream/branch-21.08' into hash-join-size
PointKernel Jun 9, 2021
8d2c8b1
Remove redundant CUDF_EXPECTS
PointKernel Jun 9, 2021
6ce1562
Merge remote-tracking branch 'upstream/branch-21.08' into hash-join-size
PointKernel Jun 14, 2021
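
Commits f35d43c and 34e4ac4 mention two `std::optional` idioms: defaulting an optional argument to `std::nullopt`, and replacing an if-else with `optional::value_or`. A minimal sketch of both, assuming a caller-supplied size that falls back to a computed one; `compute_size`, `resolve_with_branch`, and `resolve_with_value_or` are hypothetical names for illustration and do not come from the PR.

```cpp
#include <cstddef>
#include <optional>

// Hypothetical stand-in for an exact size computation (not a cudf function).
std::size_t compute_size() { return 42; }

// Explicit branch: the fallback runs only when no value was supplied.
std::size_t resolve_with_branch(std::optional<std::size_t> output_size = std::nullopt)
{
  if (output_size.has_value()) { return *output_size; }
  return compute_size();
}

// value_or: same result without the branch. The argument to value_or is
// evaluated eagerly, so compute_size() runs even when a value was supplied.
std::size_t resolve_with_value_or(std::optional<std::size_t> output_size = std::nullopt)
{
  return output_size.value_or(compute_size());
}
```

Both functions return the supplied value when there is one and the computed value otherwise. They differ only in whether the fallback is evaluated when it is not needed, which matters if the fallback is expensive.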
89 changes: 74 additions & 15 deletions cpp/include/cudf/join.hpp
@@ -22,6 +22,7 @@
#include <rmm/cuda_stream_view.hpp>
#include <rmm/device_uvector.hpp>

#include <optional>
#include <vector>

namespace cudf {
@@ -522,13 +523,15 @@ class hash_join {

/**
* Returns the row indices that can be used to construct the result of performing
* an inner join between two tables. @see cudf::inner_join().
* an inner join between two tables. @see cudf::inner_join(). Behavior is undefined if the
* provided `output_size` is smaller than the actual output size.
*
* @param probe The probe table, from which the tuples are probed.
* @param compare_nulls Controls whether null join-key values should match or not.
* @param output_size Optional value which allows users to specify the exact output size.
* @param stream CUDA stream used for device memory operations and kernel launches
* @param mr Device memory resource used to allocate the returned table and columns' device
* memory.
* @param stream CUDA stream used for device memory operations and kernel launches
*
* @return A pair of columns [`left_indices`, `right_indices`] that can be used to construct
* the result of performing an inner join between two tables with `build` and `probe`
@@ -537,19 +540,22 @@ class hash_join {
std::pair<std::unique_ptr<rmm::device_uvector<size_type>>,
std::unique_ptr<rmm::device_uvector<size_type>>>
inner_join(cudf::table_view const& probe,
null_equality compare_nulls = null_equality::EQUAL,
rmm::cuda_stream_view stream = rmm::cuda_stream_default,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource()) const;
null_equality compare_nulls = null_equality::EQUAL,
std::optional<std::size_t> output_size = {},
rmm::cuda_stream_view stream = rmm::cuda_stream_default,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource()) const;

/**
* Returns the row indices that can be used to construct the result of performing
* a left join between two tables. @see cudf::left_join().
* a left join between two tables. @see cudf::left_join(). Behavior is undefined if the
* provided `output_size` is smaller than the actual output size.
*
* @param probe The probe table, from which the tuples are probed.
* @param compare_nulls Controls whether null join-key values should match or not.
* @param output_size Optional value which allows users to specify the exact output size.
* @param stream CUDA stream used for device memory operations and kernel launches
* @param mr Device memory resource used to allocate the returned table and columns' device
* memory.
* @param stream CUDA stream used for device memory operations and kernel launches
*
* @return A pair of columns [`left_indices`, `right_indices`] that can be used to construct
* the result of performing a left join between two tables with `build` and `probe`
@@ -558,19 +564,22 @@ class hash_join {
std::pair<std::unique_ptr<rmm::device_uvector<size_type>>,
std::unique_ptr<rmm::device_uvector<size_type>>>
left_join(cudf::table_view const& probe,
null_equality compare_nulls = null_equality::EQUAL,
rmm::cuda_stream_view stream = rmm::cuda_stream_default,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource()) const;
null_equality compare_nulls = null_equality::EQUAL,
std::optional<std::size_t> output_size = {},
rmm::cuda_stream_view stream = rmm::cuda_stream_default,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource()) const;

/**
* Returns the row indices that can be used to construct the result of performing
* a full join between two tables. @see cudf::full_join().
* a full join between two tables. @see cudf::full_join(). Behavior is undefined if the
* provided `output_size` is smaller than the actual output size.
*
* @param probe The probe table, from which the tuples are probed.
* @param compare_nulls Controls whether null join-key values should match or not.
* @param output_size Optional value which allows users to specify the exact output size.
* @param stream CUDA stream used for device memory operations and kernel launches
* @param mr Device memory resource used to allocate the returned table and columns' device
* memory.
* @param stream CUDA stream used for device memory operations and kernel launches
*
* @return A pair of columns [`left_indices`, `right_indices`] that can be used to construct
* the result of performing a full join between two tables with `build` and `probe`
@@ -579,9 +588,59 @@ class hash_join {
std::pair<std::unique_ptr<rmm::device_uvector<size_type>>,
std::unique_ptr<rmm::device_uvector<size_type>>>
full_join(cudf::table_view const& probe,
null_equality compare_nulls = null_equality::EQUAL,
rmm::cuda_stream_view stream = rmm::cuda_stream_default,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource()) const;
null_equality compare_nulls = null_equality::EQUAL,
std::optional<std::size_t> output_size = {},
rmm::cuda_stream_view stream = rmm::cuda_stream_default,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource()) const;

/**
* Returns the exact number of matches (rows) when performing an inner join with the specified
* probe table.
*
* @param probe The probe table, from which the tuples are probed.
* @param compare_nulls Controls whether null join-key values should match or not.
* @param stream CUDA stream used for device memory operations and kernel launches
*
* @return The exact number of output rows when performing an inner join between two tables with
* `build` and `probe` as the join keys.
*/
std::size_t inner_join_size(cudf::table_view const& probe,
null_equality compare_nulls = null_equality::EQUAL,
rmm::cuda_stream_view stream = rmm::cuda_stream_default) const;

/**
* Returns the exact number of matches (rows) when performing a left join with the specified probe
* table.
*
* @param probe The probe table, from which the tuples are probed.
* @param compare_nulls Controls whether null join-key values should match or not.
* @param stream CUDA stream used for device memory operations and kernel launches
*
* @return The exact number of output rows when performing a left join between two tables with `build`
* and `probe` as the join keys.
*/
std::size_t left_join_size(cudf::table_view const& probe,
null_equality compare_nulls = null_equality::EQUAL,
rmm::cuda_stream_view stream = rmm::cuda_stream_default) const;

/**
* Returns the exact number of matches (rows) when performing a full join with the specified probe
* table.
*
* @param probe The probe table, from which the tuples are probed.
* @param compare_nulls Controls whether null join-key values should match or not.
* @param stream CUDA stream used for device memory operations and kernel launches
* @param mr Device memory resource used to allocate the intermediate table and columns' device
* memory.
*
* @return The exact number of output rows when performing a full join between two tables with `build`
* and `probe` as the join keys.
*/
std::size_t full_join_size(
cudf::table_view const& probe,
null_equality compare_nulls = null_equality::EQUAL,
rmm::cuda_stream_view stream = rmm::cuda_stream_default,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource()) const;

private:
struct hash_join_impl;
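
The header changes above pair each join with a `*_join_size` function and let the join accept an optional `output_size`. A minimal CPU sketch of that two-pass pattern follows, assuming integer keys and a `std::unordered_multimap`. `toy_hash_join` is a hypothetical stand-in written for illustration: it is not `cudf::hash_join`, it does not reflect how the PR implements the functions, and the order of the returned index vectors is a choice made for this sketch.

```cpp
#include <cstddef>
#include <optional>
#include <unordered_map>
#include <utility>
#include <vector>

// Toy CPU model of a hash join object (hypothetical; NOT cudf::hash_join).
class toy_hash_join {
 public:
  // Build phase: map each key of the build table to its row index.
  explicit toy_hash_join(std::vector<int> const& build)
  {
    for (std::size_t i = 0; i < build.size(); ++i) { map_.emplace(build[i], i); }
  }

  // Counting pass: exact number of (probe row, build row) matches.
  std::size_t inner_join_size(std::vector<int> const& probe) const
  {
    std::size_t n = 0;
    for (int key : probe) { n += map_.count(key); }
    return n;
  }

  // Gather pass. A ternary is used instead of value_or so that the counting
  // pass is skipped when the caller already supplies the size.
  std::pair<std::vector<std::size_t>, std::vector<std::size_t>> inner_join(
    std::vector<int> const& probe, std::optional<std::size_t> output_size = {}) const
  {
    std::size_t const size = output_size ? *output_size : inner_join_size(probe);
    std::vector<std::size_t> probe_idx;
    std::vector<std::size_t> build_idx;
    // In this toy an undersized value only costs a reallocation; the doc
    // comments in the diff state that the real API has undefined behavior.
    probe_idx.reserve(size);
    build_idx.reserve(size);
    for (std::size_t i = 0; i < probe.size(); ++i) {
      auto range = map_.equal_range(probe[i]);
      for (auto it = range.first; it != range.second; ++it) {
        probe_idx.push_back(i);
        build_idx.push_back(it->second);
      }
    }
    return {std::move(probe_idx), std::move(build_idx)};
  }

 private:
  std::unordered_multimap<int, std::size_t> map_;
};
```

A caller that wants to plan memory first would call `inner_join_size(probe)`, then pass the result as `output_size` so the join does not repeat the count. Omitting the argument leaves the join to compute the size itself.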