Add row hasher with nested column support #10641

Merged Apr 29, 2022 (182 commits)
Changes from 172 commits
Commits
933c974
First commit
devavret Aug 26, 2021
a1636e5
testing and profiling deep single hierarchy struct
devavret Aug 27, 2021
d59f54c
Merge branch 'branch-22.02' into struct-row-comp
devavret Jan 12, 2022
765dd8d
Merge branch 'branch-22.02' into struct-row-comp
devavret Jan 12, 2022
3d21daf
Make the sandboxed test compile again
devavret Jan 14, 2022
9f32e6b
Update my row_comparator with nullate
devavret Jan 15, 2022
53d3c90
Merge branch 'branch-22.02' into struct-row-comp
devavret Jan 21, 2022
022e2a4
Basic verticalization utility and experimental namespace
devavret Jan 24, 2022
7fef643
clean up most of row operators that I didn't change.
devavret Jan 26, 2022
930d8de
Sliced column test
devavret Jan 27, 2022
0ecc4f8
column order and null precendence support
devavret Jan 28, 2022
ff36d2d
Manually managed stack
devavret Jan 28, 2022
cd0f938
New depth based method to avoid superimpose nulls
devavret Feb 2, 2022
7b8e060
Put sort2 impl in separate TU
devavret Feb 2, 2022
25eb237
Merge branch 'branch-22.04' into struct-row-comp
devavret Feb 2, 2022
c8e527e
Basic working list == comp
devavret Feb 4, 2022
eb87ed7
Merge branch 'branch-22.04' into list-row-eq
devavret Feb 4, 2022
cc1584d
deeper list test
devavret Feb 4, 2022
925481a
benchmark list ==
devavret Feb 7, 2022
b2b41c7
small cleanups
devavret Feb 7, 2022
8aaf6f3
List hash working with drop duplicates
devavret Feb 9, 2022
d2937cf
Merge branch 'branch-22.04' into struct-row-comp
devavret Feb 10, 2022
d55c9c7
Move verticalization code to row_comparator.cpp
devavret Feb 15, 2022
b7cdfe0
Merge branch 'struct-row-comp' into list-row-eq
devavret Feb 15, 2022
8309151
Use regular type dispatcher with new id type map
devavret Feb 15, 2022
8717b9c
Early return from unequal leaf elements
devavret Feb 15, 2022
21df6cf
Combined struct and list equality operator
devavret Feb 17, 2022
fa35461
Add null testing to list level also
devavret Feb 18, 2022
3bd749e
Owning row lex operator
devavret Feb 22, 2022
613d664
merge fixes
devavret Feb 23, 2022
2ef3ac7
Move struct logic out of main row loop and into element_relational_co…
devavret Feb 24, 2022
5577431
pushing even more logic into element_relational_comparator
devavret Feb 24, 2022
f037bc0
More optimizations.
devavret Feb 24, 2022
8c54a85
review changes
devavret Feb 24, 2022
9d24a87
Checks to ensure tables can be compared
devavret Feb 24, 2022
4e5fe21
Merge branch 'struct-row-comp' into list-row-eq
devavret Feb 24, 2022
693dbca
Owning row eq operator
devavret Feb 24, 2022
294b0cf
Another attempt at new API
devavret Mar 2, 2022
a4c799a
Remove stack based struct comparator + cleanups
devavret Mar 7, 2022
ecb2eb0
thrust::pair -> cuda::std::pair
devavret Mar 7, 2022
34a6564
optional device spans
devavret Mar 7, 2022
fa4abb4
Prevent device comparator construction from any table_device_view
devavret Mar 7, 2022
b213210
Nullate default and fix for non nested depth
devavret Mar 7, 2022
6f9bedd
Fix an unsurfaced bug about depth passing
devavret Mar 7, 2022
be69ffa
Switch over sort impl to new comparator
devavret Mar 8, 2022
76d535a
Copyright changes to satiate ci
devavret Mar 8, 2022
78d10fc
Migrate struct sort benchmark to nvbench
devavret Mar 8, 2022
15920ee
Avoid optional::value in favor of *
devavret Mar 8, 2022
d01fc30
throw when trying to sort List
devavret Mar 8, 2022
ac2eb0d
Leftover change for struct sort nvbench
devavret Mar 8, 2022
076c4c1
struct without null pushdown test
devavret Mar 9, 2022
e8a9202
Remove temporary sort2_test
devavret Mar 9, 2022
a4b1167
Remove temporary sort2 files
devavret Mar 9, 2022
62f6914
leftover sort2 in cmake
devavret Mar 9, 2022
8f628ae
cleanup benchmark headers
devavret Mar 9, 2022
dc7d125
Docs
devavret Mar 9, 2022
fa7d940
Merge branch 'branch-22.04' into struct-row-comp
devavret Mar 10, 2022
bdc1cb6
Merge branch 'struct-row-comp' into list-row-eq
devavret Mar 10, 2022
83ba4bf
Match API with self lex comparator
devavret Mar 10, 2022
9c0060f
Guard null check code with nullate
devavret Mar 10, 2022
57fdd1e
remove redundant size check
devavret Mar 10, 2022
76c883f
Apply suggestions from code review
devavret Mar 14, 2022
5fc82a9
Docs
devavret Mar 14, 2022
82db9d6
port benchmark to nvbench
devavret Mar 14, 2022
7871c48
privatise row_equality_comparator's ctor
devavret Mar 14, 2022
c9e5dc3
List rank test cleanup and merge with reduction test
devavret Mar 14, 2022
98b253b
rmm pool in benchmark + style fixes
devavret Mar 14, 2022
38fa66f
Merge branch 'struct-row-comp' into list-row-eq
devavret Mar 14, 2022
3255dc5
Merge branch 'branch-22.04' into struct-row-comp
devavret Mar 14, 2022
44d3735
Merge branch 'struct-row-comp' into list-row-eq
devavret Mar 14, 2022
e2d4b93
run cmake-format
devavret Mar 14, 2022
78e5f4e
Merge branch 'list-row-eq' into list-row-hash
devavret Mar 14, 2022
00fdb00
Update the PR for latest changes to hashing
devavret Mar 15, 2022
96e9855
Remove hash_combine arg
devavret Mar 15, 2022
52e3a35
Review changes
devavret Mar 15, 2022
9470f06
More review changes
devavret Mar 15, 2022
8cb324c
Move benchmark to correct source. Also fix issues in benchmark to be …
devavret Mar 16, 2022
264c9d1
add null list hashing and test. Move test to proper src
devavret Mar 17, 2022
7c897c3
Review changes req by @vyasr
devavret Mar 17, 2022
e0467c7
add a runtime is_relationally_comparable funtion
devavret Mar 17, 2022
fc1e993
Review changes
devavret Mar 18, 2022
096593f
Review changes
devavret Mar 18, 2022
f539647
Avoid WAR of storing a table_device_view
devavret Mar 18, 2022
01be0bc
Rename struct_linearize to decompose_structs and Improve docs
devavret Mar 18, 2022
de95530
review changes req by @ttnghia
devavret Mar 21, 2022
6c45cd4
Namespace changes and making element comparator private
devavret Mar 21, 2022
f72ce8b
Merge branch 'struct-row-comp' into list-row-eq
devavret Mar 21, 2022
c624317
put in row namespace
devavret Mar 21, 2022
0ca2d14
Review on lex applied to equality
devavret Mar 21, 2022
f309837
create method just like struct lex
devavret Mar 21, 2022
81f9ab8
loop 0 to size -> start off to end off
devavret Mar 22, 2022
9bfd08e
Update cpp/include/cudf/table/experimental/row_operators.cuh
devavret Mar 22, 2022
70e4581
Merge branch 'struct-row-comp' into list-row-eq
devavret Mar 22, 2022
dd8650d
We also need to keep the null mask buffers around
devavret Mar 22, 2022
6cab5c4
Fix slices struct issues
devavret Mar 22, 2022
14f9d25
Handle sliced list column
devavret Mar 23, 2022
bcd6962
Merge branch 'branch-22.04' into list-row-eq
devavret Mar 23, 2022
0c12c15
Move equality comparator to experimental header
devavret Mar 23, 2022
b41b3fa
Style fixes
devavret Mar 23, 2022
4919b04
Merge branch 'branch-22.06' into list-row-eq
devavret Mar 23, 2022
3dfc133
Review changes
devavret Mar 24, 2022
9031900
Review changes requested by @hyperbolic2346
devavret Mar 24, 2022
119d830
Add an equality comparable check similar to lex comparable check
devavret Mar 24, 2022
1cefa5a
Move linked column to a common header in utilities
devavret Mar 24, 2022
3781d01
Merge branch 'list-row-eq' into list-row-hash
devavret Mar 24, 2022
34aa66b
Review changes
devavret Mar 28, 2022
e7ea7f9
Change to progressive slicing logic
devavret Mar 28, 2022
e60fbd4
pull type dispatcher out of element compare loop
devavret Mar 28, 2022
42319ad
Move slicing logic to lists_column_device_view and new structs_column…
devavret Mar 28, 2022
67c035c
Move list size iterator and make it only constructible from list_colu…
devavret Mar 28, 2022
e5fe24c
push slice logic into columns_device_view
devavret Mar 29, 2022
85861a9
Add validity safe iterator
devavret Mar 29, 2022
9db7479
Move element_range_comparator to element_comparator's private
devavret Mar 29, 2022
db3b79b
style fixes
devavret Mar 29, 2022
6ca7deb
Docs for the newly added stuff
devavret Mar 29, 2022
2fc3d3d
Merge branch 'branch-22.06' into list-row-eq
devavret Mar 29, 2022
1c3a99d
review changes
devavret Mar 29, 2022
a0e581c
review changes
devavret Mar 30, 2022
046c407
Review changes
devavret Mar 30, 2022
6a282a6
linked_column_view inherit from column_view_base
devavret Mar 30, 2022
0f768ac
spell check
devavret Mar 30, 2022
6c64915
Merge branch 'list-row-eq' into list-row-hash
devavret Mar 30, 2022
92c1ff5
Change composition to private inheritance
devavret Mar 31, 2022
4c0e7fa
Replace __host__ __device__ with macro
devavret Mar 31, 2022
75104bb
Add more null frequencies to benchmark
devavret Mar 31, 2022
1e1053b
Templatize make_validity_iterator
devavret Mar 31, 2022
bcfe91b
Increase testing for null frequency
devavret Mar 31, 2022
981438d
curr_col -> temp_col
devavret Mar 31, 2022
5bbf18e
element_range_comparator -> column_comparator
devavret Mar 31, 2022
8e18d66
cleaner column_view conversion
devavret Mar 31, 2022
7207b0b
Merge branch 'list-row-eq' into list-row-hash
devavret Mar 31, 2022
1c525d8
Make distinct work again after merges
devavret Apr 1, 2022
75eaed4
delete copy ctor and assignment operator
devavret Apr 1, 2022
be98357
iterator docs
devavret Apr 1, 2022
9b7f3a0
use owner API for hash
devavret Apr 4, 2022
dda3c1d
Enable nulls not equal path
devavret Apr 4, 2022
370d3b3
Add struct support to hashing
devavret Apr 4, 2022
f4c509a
Handle empty struct in list equality
devavret Apr 8, 2022
d1386cf
Handle empty list (without offsets)
devavret Apr 8, 2022
6aef29f
Merge branch 'branch-22.06' into list-row-eq
devavret Apr 8, 2022
1ca9bcd
Merge branch 'list-row-eq' into list-row-hash
devavret Apr 8, 2022
2c53501
Add seed support to element hasher
devavret Apr 9, 2022
2a988e4
Change murmur hash table to use new row operators
devavret Apr 10, 2022
124c7df
switch over serial hash to experimental::row::hash::element_hasher to…
devavret Apr 10, 2022
7863017
limit changes to distinct and not other stream compaction ops
devavret Apr 10, 2022
8bb7572
small ctor changes in row hasher
devavret Apr 10, 2022
3fb6865
Use accumulate wherever possible
devavret Apr 11, 2022
3cc1159
Merge branch 'branch-22.06' into list-row-eq
devavret Apr 11, 2022
e76a2f3
Add nulls in benchmark
devavret Apr 12, 2022
4f46db9
Add seed support to row hasher
devavret Apr 12, 2022
cbe757a
adding noexcept speeds up a bit
devavret Apr 12, 2022
92c77a5
hide device_row_hasher ctor to disallow use without owning row_hasher
devavret Apr 12, 2022
655bedd
Move nested traversal into an adapter class
devavret Apr 12, 2022
8078e3c
Column_device_view review changes
devavret Apr 12, 2022
aa5385b
Merge branch 'list-row-eq' into list-row-hash
devavret Apr 12, 2022
9c1b0d0
Add empty nesting tests
devavret Apr 12, 2022
e314601
Merge branch 'branch-22.06' into list-row-hash
devavret Apr 13, 2022
b5ca2e7
style check copyright
devavret Apr 13, 2022
2c40182
Review comments by @bdice
devavret Apr 14, 2022
48938c5
Doc updates
devavret Apr 14, 2022
bede9db
Review changes
devavret Apr 14, 2022
405975c
Fix empty struct bug
devavret Apr 14, 2022
a7f6963
more review changes
devavret Apr 14, 2022
4388ab4
iterate over structs directly instead of transforming from indices
devavret Apr 14, 2022
0150c47
remove null hash as a param from adapter
devavret Apr 14, 2022
65a5cde
review changes
devavret Apr 15, 2022
51789d2
has_nulls -> check_nulls
devavret Apr 15, 2022
023ec1f
Fix a problem with struct offsets being already applied
devavret Apr 15, 2022
1b0238f
leftover has_nulls -> check_nulls, and docs
devavret Apr 17, 2022
4a97709
Change how seed is included in row hashin
devavret Apr 17, 2022
8bdcd9e
change adapter to stored element hasher
devavret Apr 18, 2022
70086b6
Fix broken pytest
devavret Apr 18, 2022
b750402
Merge remote-tracking branch 'origin/branch-22.06' into list-row-hash
vyasr Apr 26, 2022
17e6ec2
default_hash is now in the detail namespace.
vyasr Apr 26, 2022
c0e1cc8
Address remaining critical and simple PR comments.
vyasr Apr 26, 2022
3f9e063
Fix docstring.
bdice Apr 27, 2022
0851dd7
Rename struc -> struct_col.
bdice Apr 27, 2022
a7a3b12
Use hash_value_type.
bdice Apr 27, 2022
5a1ceec
Use device accumulate instead of thrust::reduce because the hash comb…
bdice Apr 28, 2022
e61d781
Update cpp/tests/stream_compaction/distinct_tests.cpp
vyasr Apr 28, 2022
6dbda5f
Update cpp/tests/stream_compaction/distinct_tests.cpp
vyasr Apr 28, 2022
100a180
Merge remote-tracking branch 'upstream/branch-22.06' into list-row-hash
bdice Apr 28, 2022
41 changes: 41 additions & 0 deletions cpp/benchmarks/stream_compaction/distinct.cpp
@@ -19,6 +19,7 @@

#include <cudf/column/column_view.hpp>
#include <cudf/detail/stream_compaction.hpp>
#include <cudf/lists/list_view.cuh>
#include <cudf/types.hpp>

#include <nvbench/nvbench.cuh>
@@ -55,3 +56,43 @@ NVBENCH_BENCH_TYPES(nvbench_distinct, NVBENCH_TYPE_AXES(data_type))
.set_name("distinct")
.set_type_axes_names({"Type"})
.add_int64_axis("NumRows", {10'000, 100'000, 1'000'000, 10'000'000});

template <typename Type>
void nvbench_distinct_list(nvbench::state& state, nvbench::type_list<Type>)
{
  cudf::rmm_pool_raii pool_raii;

  auto const size = state.get_int64("ColumnSize");
  auto const dtype = cudf::type_to_id<Type>();
  double const null_frequency = state.get_float64("null_frequency");

  data_profile table_data_profile;
  if (dtype == cudf::type_id::LIST) {
    table_data_profile.set_distribution_params(dtype, distribution_id::UNIFORM, 0, 4);
    table_data_profile.set_distribution_params(
      cudf::type_id::INT32, distribution_id::UNIFORM, 0, 4);
    table_data_profile.set_list_depth(1);
  } else {
    // We're comparing distinct() on a non-nested column to that on a list column with the same
    // number of distinct rows. The max list size is 4 and the number of distinct values in the
    // list's child is 5. So the number of distinct rows in the list = 1 + 5 + 5^2 + 5^3 + 5^4 = 781
    // We want this column to also have 781 distinct values.
    table_data_profile.set_distribution_params(dtype, distribution_id::UNIFORM, 0, 781);
  }
  table_data_profile.set_null_frequency(null_frequency);

  auto const table = create_random_table(
    {dtype}, table_size_bytes{static_cast<size_t>(size)}, table_data_profile, 0);

  state.exec(nvbench::exec_tag::sync, [&](nvbench::launch& launch) {
    rmm::cuda_stream_view stream_view{launch.get_stream()};
    auto result = cudf::detail::distinct(*table, {0}, cudf::null_equality::EQUAL, stream_view);
  });
}

NVBENCH_BENCH_TYPES(nvbench_distinct_list,
NVBENCH_TYPE_AXES(nvbench::type_list<int32_t, cudf::list_view>))
.set_name("distinct_list")
.set_type_axes_names({"Type"})
.add_float64_axis("null_frequency", {0.0, 0.1})
.add_int64_axis("ColumnSize", {100'000'000});
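
The benchmark comment derives 781 distinct list rows from a maximum list size of 4 and 5 distinct child values. A minimal sketch of that arithmetic follows; it ignores nulls, and `count_distinct_lists` is a hypothetical helper written for illustration, not part of cudf or this PR.

```cpp
#include <cstdint>

// Hypothetical helper: counts the distinct lists of length 0..max_len whose
// elements come from `num_values` distinct values (nulls are ignored).
// This is the geometric sum num_values^0 + num_values^1 + ... + num_values^max_len.
constexpr std::int64_t count_distinct_lists(std::int64_t num_values, int max_len)
{
  std::int64_t total = 0;
  std::int64_t term  = 1;  // num_values^0 accounts for the empty list
  for (int len = 0; len <= max_len; ++len) {
    total += term;
    term *= num_values;
  }
  return total;
}

// 1 + 5 + 25 + 125 + 625, the figure quoted in the benchmark comment.
static_assert(count_distinct_lists(5, 4) == 781, "expected 781 distinct lists");
```

The leading 1 in the sum is the empty list, which is why the total is 781 rather than 780.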
5 changes: 3 additions & 2 deletions cpp/include/cudf/detail/hashing.hpp
@@ -33,19 +33,20 @@ namespace detail {
std::unique_ptr<column> hash(
table_view const& input,
hash_id hash_function = hash_id::HASH_MURMUR3,
- uint32_t seed = 0,
+ uint32_t seed = cudf::DEFAULT_HASH_SEED,
rmm::cuda_stream_view stream = rmm::cuda_stream_default,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

std::unique_ptr<column> murmur_hash3_32(
table_view const& input,
+ uint32_t seed = cudf::DEFAULT_HASH_SEED,
rmm::cuda_stream_view stream = rmm::cuda_stream_default,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

template <template <typename> class hash_function>
std::unique_ptr<column> serial_murmur_hash3_32(
table_view const& input,
- uint32_t seed = 0,
+ uint32_t seed = cudf::DEFAULT_HASH_SEED,
rmm::cuda_stream_view stream = rmm::cuda_stream_default,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

8 changes: 4 additions & 4 deletions cpp/include/cudf/detail/iterator.cuh
@@ -120,7 +120,7 @@ struct null_replaced_value_accessor {
* @brief validity accessor of column with null bitmask
* A unary functor that returns validity at index `i`.
*
- * @tparam safe If false, the accessor with throw logic_error if the column is not nullable. If
+ * @tparam safe If false, the accessor will throw a logic_error if the column is not nullable. If
* true, the accessor checks for nullability and if col is not nullable, returns true.
*/
template <bool safe = false>
@@ -306,12 +306,12 @@ auto make_pair_rep_iterator(column_device_view const& column)
*
* Dereferencing the returned iterator for element `i` will return the validity
* of `column[i]`
- * This iterator is only allowed for nullable columns if `safe` = false
+ * If `safe` = false, the column must be nullable.
* When safe = true, if the column is not nullable then the validity is always true.
*
- * @throws cudf::logic_error if the column is not nullable when safe = false
+ * @throws cudf::logic_error if the column is not nullable and safe = false
*
- * @tparam safe If false, the accessor with throw logic_error if the column is not nullable. If
+ * @tparam safe If false, the accessor will throw a logic_error if the column is not nullable. If
* true, the accessor checks for nullability and if col is not nullable, returns true.
* @param column The column to iterate
* @return auto Iterator that returns validities of column elements.
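
The `safe` template parameter documented above can be illustrated with a host-only toy. This is a sketch under stated assumptions: `toy_column` and `toy_validity_accessor` are hypothetical stand-ins invented here, not cudf types, and the real accessor operates on a `column_device_view` with a null bitmask.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical host-side stand-in for a column (not a cudf type).
// An empty null_mask means the column is not nullable.
struct toy_column {
  std::size_t size;
  std::vector<bool> null_mask;
  bool nullable() const { return !null_mask.empty(); }
};

// Hypothetical accessor mirroring the documented contract:
//   safe == false: constructing over a non-nullable column throws logic_error.
//   safe == true : a non-nullable column reports every element as valid.
template <bool safe = false>
struct toy_validity_accessor {
  toy_column const& col;

  explicit toy_validity_accessor(toy_column const& c) : col(c)
  {
    if (!safe && !col.nullable()) { throw std::logic_error("Unexpected non-nullable column."); }
  }

  bool operator()(std::size_t i) const
  {
    if (safe && !col.nullable()) { return true; }
    return col.null_mask[i];
  }
};
```

With `safe = true`, callers can iterate validity uniformly without first branching on nullability, at the cost of one extra check per access.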
28 changes: 28 additions & 0 deletions cpp/include/cudf/detail/utilities/algorithm.cuh
@@ -0,0 +1,28 @@
/*
* Copyright (c) 2022, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
#pragma once

namespace cudf::detail {

template <typename Iterator, typename T, typename BinaryOp>
__device__ __forceinline__ T accumulate(Iterator first, Iterator last, T init, BinaryOp op)
{
  for (; first != last; ++first) {
    init = op(std::move(init), *first);
  }
  return init;
}
} // namespace cudf::detail
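
The `accumulate` above is a strict left fold: it applies `op` to the elements in sequence order. A host-only sketch of the same loop shows why that matters for an order-sensitive operation. `accumulate_sketch`, `toy_combine`, and `fold_hashes` are hypothetical names used for illustration, and `toy_combine` is deliberately simple and is not cudf's hash-combining function.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Host-only analogue of the device accumulate in the diff: a strict left fold.
template <typename Iterator, typename T, typename BinaryOp>
T accumulate_sketch(Iterator first, Iterator last, T init, BinaryOp op)
{
  for (; first != last; ++first) {
    init = op(std::move(init), *first);
  }
  return init;
}

// Illustrative combiner that is neither commutative nor associative.
// Unsigned arithmetic wraps on overflow, so this is well defined.
inline std::uint32_t toy_combine(std::uint32_t lhs, std::uint32_t rhs)
{
  return lhs * 31u + rhs;
}

// Folds a sequence of per-element hashes into one value, starting from 0.
inline std::uint32_t fold_hashes(std::vector<std::uint32_t> const& hashes)
{
  return accumulate_sketch(hashes.begin(), hashes.end(), std::uint32_t{0}, toy_combine);
}
```

With this combiner, `fold_hashes({1, 2, 3})` and `fold_hashes({3, 2, 1})` give different results, so an algorithm that is free to regroup or reorder operands would not be a drop-in replacement for the sequential loop.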
10 changes: 3 additions & 7 deletions cpp/include/cudf/detail/utilities/column.hpp
@@ -72,13 +72,9 @@ struct linked_column_view : public column_view_base {
*/
inline LinkedColVector table_to_linked_columns(table_view const& table)
{
-  LinkedColVector result;
-  result.reserve(table.num_columns());
-  std::transform(table.begin(), table.end(), std::back_inserter(result), [&](column_view const& c) {
-    return std::make_shared<linked_column_view>(c);
-  });
-
-  return result;
+  auto linked_it = thrust::make_transform_iterator(
+    table.begin(), [](auto const& c) { return std::make_shared<linked_column_view>(c); });
+  return LinkedColVector(linked_it, linked_it + table.num_columns());
}

} // namespace cudf::detail