Refactor distinct with hashset-based algorithms #15984

srinivasyadav18 · 2024-06-11T23:05:38Z

Description

Refactor distinct algorithm to use cuco::static_set.

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

ttnghia · 2024-06-11T23:47:12Z

Please post a benchmark showing how the performance is changed. Thanks.

cpp/src/stream_compaction/distinct_helpers.hpp

cpp/src/stream_compaction/distinct.cu

srinivasyadav18 · 2024-06-17T16:35:54Z

Benchmark comparision of distinct algorithm with static_map (old) vs static_set (new)

['static_map.json', 'static_set.json']

distinct

[0] Tesla T4

Type	NumRows	Ref Time	Ref Noise	Cmp Time	Cmp Noise	Diff	%Diff	Status
bool	10000	210.785 us	1.30%	198.679 us	25.30%	-12.106 us	-5.74%	�[31mFAIL�[39m
bool	100000	253.103 us	7.79%	222.088 us	26.73%	-31.015 us	-12.25%	�[31mFAIL�[39m
bool	1000000	687.758 us	18.16%	342.133 us	26.54%	-345.625 us	-50.25%	�[31mFAIL�[39m
bool	10000000	5.538 ms	1.36%	1.669 ms	4.68%	-3869.028 us	-69.87%	�[31mFAIL�[39m
I8	10000	174.922 us	2.65%	175.165 us	1.58%	0.243 us	0.14%	�[32mPASS�[39m
I8	100000	220.113 us	1.65%	191.545 us	9.57%	-28.567 us	-12.98%	�[31mFAIL�[39m
I8	1000000	678.456 us	0.97%	327.453 us	1.63%	-351.003 us	-51.74%	�[31mFAIL�[39m
I8	10000000	5.514 ms	0.79%	1.670 ms	0.93%	-3843.922 us	-69.71%	�[31mFAIL�[39m
I32	10000	174.878 us	3.58%	175.502 us	1.31%	0.624 us	0.36%	�[32mPASS�[39m
I32	100000	217.380 us	1.86%	188.653 us	3.68%	-28.727 us	-13.22%	�[31mFAIL�[39m
I32	1000000	685.808 us	1.09%	328.207 us	1.16%	-357.601 us	-52.14%	�[31mFAIL�[39m
I32	10000000	5.636 ms	0.76%	1.722 ms	1.19%	-3914.285 us	-69.45%	�[31mFAIL�[39m
I64	10000	175.257 us	2.47%	175.281 us	1.25%	0.024 us	0.01%	�[32mPASS�[39m
I64	100000	217.467 us	2.04%	188.937 us	2.14%	-28.530 us	-13.12%	�[31mFAIL�[39m
I64	1000000	711.008 us	1.20%	335.716 us	2.40%	-375.292 us	-52.78%	�[31mFAIL�[39m
I64	10000000	5.868 ms	0.96%	1.805 ms	1.22%	-4062.509 us	-69.23%	�[31mFAIL�[39m
F32	10000	178.242 us	3.19%	174.956 us	1.34%	-3.287 us	-1.84%	�[31mFAIL�[39m
F32	100000	238.333 us	0.71%	199.978 us	0.85%	-38.355 us	-16.09%	�[31mFAIL�[39m
F32	1000000	1.099 ms	1.84%	573.145 us	1.12%	-525.641 us	-47.84%	�[31mFAIL�[39m
F32	10000000	13.033 ms	0.56%	7.656 ms	0.43%	-5376.663 us	-41.25%	�[31mFAIL�[39m
cudf::timestamp_ms	10000	175.692 us	2.43%	175.332 us	10.54%	-0.360 us	-0.21%	�[32mPASS�[39m
cudf::timestamp_ms	100000	222.362 us	2.93%	188.929 us	2.17%	-33.433 us	-15.04%	�[31mFAIL�[39m
cudf::timestamp_ms	1000000	723.055 us	1.77%	333.130 us	1.34%	-389.925 us	-53.93%	�[31mFAIL�[39m
cudf::timestamp_ms	10000000	5.952 ms	0.67%	1.835 ms	1.27%	-4116.537 us	-69.16%	�[31mFAIL�[39m

distinct_list

[0] Tesla T4

Type	null_probability	ColumnSize	Ref Time	Ref Noise	Cmp Time	Cmp Noise	Diff	%Diff	Status
I32	0	100000000	12.206 ms	0.58%	3.692 ms	0.45%	-8513.937 us	-69.75%	�[31mFAIL�[39m
I32	0.1	100000000	13.764 ms	0.58%	4.242 ms	0.45%	-9522.336 us	-69.18%	�[31mFAIL�[39m
cudf::list_view	0	100000000	13.145 ms	0.67%	3.567 ms	1.02%	-9578.041 us	-72.86%	�[31mFAIL�[39m
cudf::list_view	0.1	100000000	15.093 ms	0.64%	4.173 ms	0.73%	-10920.317 us	-72.35%	�[31mFAIL�[39m

Summary

Total Matches: 28
- Pass (diff <= min_noise): 4
- Unknown (infinite noise): 0
- Failure (diff > min_noise): 24

cpp/src/stream_compaction/distinct.cu

cpp/src/stream_compaction/distinct_helpers.hpp

cpp/src/stream_compaction/distinct_helpers.cu

ttnghia

reduce_by_row is instantiated of hash_reduce_by_row which is shared across multiple TUs. You should not separate them into multiple functions taking different sets of parameters. That is very difficult to join them again after refactor is done.

cpp/src/stream_compaction/distinct_helpers.cu

cpp/src/stream_compaction/distinct_helpers.hpp

PointKernel

Nice work!

Good to see the build time also reduced.

cpp/src/stream_compaction/distinct.cu

ttnghia · 2024-06-24T22:39:02Z

I thought that we agreed to keep the signature of reduce_by_row unchanged for potential reusing hash_reduce_by_row in the future?

PointKernel · 2024-06-25T18:04:33Z

I thought that we agreed to keep the signature of reduce_by_row unchanged for potential reusing hash_reduce_by_row in the future?

I'm not sure what you mean by keeping the signature of reduce_by_row unchanged. The goal of this PR is to refactor distinct with set. The current reduce_by_row takes map as input thus we need to update the signature accordingly.

PointKernel

Put this on hold since we find more performance optimization to do based on offline discussions.

ttnghia

Approval and excited waiting for the next refactor work.

cpp/src/stream_compaction/distinct_helpers.cu

srinivasyadav18 · 2024-06-26T01:10:27Z

Further performance improvements using set.insert_async with reference as static_set_new from previous comparions results.

distinct

Ref Time = With `insert_async` operation

Cmp Time = With `insert` operation (from previous comment ).

[0] Tesla T4

Type	NumRows	Ref Time	Ref Noise	Cmp Time	Cmp Noise	Diff	%Diff	Status
bool	10000	169.117 us	32.86%	203.653 us	1.88%	34.536 us	20.42%	FAIL
bool	100000	190.318 us	28.92%	232.877 us	5.05%	42.559 us	22.36%	FAIL
bool	1000000	294.805 us	16.77%	357.525 us	7.37%	62.720 us	21.28%	FAIL
bool	10000000	1.414 ms	6.62%	1.676 ms	9.63%	261.566 us	18.49%	FAIL
I8	10000	150.012 us	2.80%	187.305 us	29.75%	37.293 us	24.86%	FAIL
I8	100000	161.144 us	4.72%	195.348 us	5.06%	34.204 us	21.23%	FAIL
I8	1000000	283.225 us	1.66%	330.493 us	8.80%	47.268 us	16.69%	FAIL
I8	10000000	1.418 ms	2.14%	1.671 ms	2.87%	252.325 us	17.79%	FAIL
I32	10000	149.916 us	2.75%	177.923 us	2.51%	28.006 us	18.68%	FAIL
I32	100000	165.040 us	14.86%	193.024 us	2.05%	27.985 us	16.96%	FAIL
I32	1000000	283.267 us	1.57%	327.740 us	1.13%	44.473 us	15.70%	FAIL
I32	10000000	1.466 ms	2.07%	1.721 ms	1.94%	255.151 us	17.41%	FAIL
I64	10000	150.541 us	2.75%	178.182 us	4.02%	27.642 us	18.36%	FAIL
I64	100000	158.679 us	2.61%	193.019 us	2.71%	34.340 us	21.64%	FAIL
I64	1000000	287.044 us	2.76%	345.719 us	16.33%	58.675 us	20.44%	FAIL
I64	10000000	1.551 ms	1.90%	1.830 ms	5.21%	278.500 us	17.95%	FAIL
F32	10000	151.368 us	5.03%	179.430 us	5.58%	28.062 us	18.54%	FAIL
F32	100000	169.741 us	0.93%	210.323 us	4.70%	40.582 us	23.91%	FAIL
F32	1000000	531.013 us	4.36%	590.247 us	10.19%	59.235 us	11.16%	FAIL
F32	10000000	7.422 ms	0.16%	7.640 ms	0.29%	217.326 us	2.93%	FAIL
cudf::timestamp_ms	10000	150.855 us	7.18%	179.409 us	4.51%	28.554 us	18.93%	FAIL
cudf::timestamp_ms	100000	158.212 us	2.58%	193.409 us	2.16%	35.198 us	22.25%	FAIL
cudf::timestamp_ms	1000000	289.823 us	2.29%	336.280 us	1.06%	46.457 us	16.03%	FAIL
cudf::timestamp_ms	10000000	1.581 ms	1.64%	1.823 ms	1.33%	242.283 us	15.32%	FAIL

distinct_list

[0] Tesla T4

Type	null_probability	ColumnSize	Ref Time	Ref Noise	Cmp Time	Cmp Noise	Diff	%Diff	Status
I32	0	100000000	3.183 ms	1.17%	3.666 ms	0.48%	483.897 us	15.20%	FAIL
I32	0.1	100000000	3.632 ms	1.40%	4.199 ms	0.99%	567.874 us	15.64%	FAIL
cudf::list_view	0	100000000	3.512 ms	0.88%	3.584 ms	0.68%	72.714 us	2.07%	FAIL
cudf::list_view	0.1	100000000	4.124 ms	0.91%	4.206 ms	1.78%	81.445 us	1.97%	FAIL

Summary

Total Matches: 28
- Pass (diff <= min_noise): 0
- Unknown (infinite noise): 0
- Failure (diff > min_noise): 28

PointKernel

Excellent! 🔥 🔥 🔥

Thank you!

srinivasyadav18 · 2024-06-27T00:16:54Z

/merge

PointKernel and others added 3 commits March 12, 2024 15:52

Remove redundant header file

46c0fa0

Migrate to set-based algorithms

7b4d007

add support for KEEP FIRST,LAST and NONE

9da0daf

srinivasyadav18 added feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Jun 11, 2024

srinivasyadav18 requested a review from a team as a code owner June 11, 2024 23:05

srinivasyadav18 requested review from hyperbolic2346 and ttnghia June 11, 2024 23:05

srinivasyadav18 mentioned this pull request Jun 11, 2024

Refactor distinct with hashset-based algorithms #15285

Closed

3 tasks

srinivasyadav18 removed the feature request New feature or request label Jun 11, 2024

PointKernel requested changes Jun 12, 2024

View reviewed changes

cpp/src/stream_compaction/distinct_helpers.hpp Outdated Show resolved Hide resolved

cpp/src/stream_compaction/distinct.cu Outdated Show resolved Hide resolved

PointKernel added the Performance Performance related issue label Jun 12, 2024

PointKernel and others added 6 commits June 17, 2024 15:05

Remove redundant header file

d1f7b80

Migrate to set-based algorithms

a5af297

add support for KEEP FIRST,LAST and NONE

0e2a5cd

cleanup

250998a

Move process_keep_option to distinct_helpers

a00b0db

Merge branch 'set-distinct-v3' into set-distinct

5d86726

header include cleanup

6455411