Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor distinct with hashset-based algorithms #15984

Merged
merged 17 commits into from
Jun 27, 2024

Conversation

srinivasyadav18
Copy link
Contributor

Description

Refactor distinct algorithm to use cuco::static_set.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@srinivasyadav18 srinivasyadav18 added feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Jun 11, 2024
@srinivasyadav18 srinivasyadav18 requested a review from a team as a code owner June 11, 2024 23:05
@srinivasyadav18 srinivasyadav18 removed the feature request New feature or request label Jun 11, 2024
@ttnghia
Copy link
Contributor

ttnghia commented Jun 11, 2024

Please post a benchmark showing how the performance is changed. Thanks.

cpp/src/stream_compaction/distinct_helpers.hpp Outdated Show resolved Hide resolved
cpp/src/stream_compaction/distinct.cu Outdated Show resolved Hide resolved
@PointKernel PointKernel added the Performance Performance related issue label Jun 12, 2024
@srinivasyadav18
Copy link
Contributor Author

Benchmark comparision of distinct algorithm with static_map (old) vs static_set (new)

['static_map.json', 'static_set.json']

distinct

[0] Tesla T4

Type NumRows Ref Time Ref Noise Cmp Time Cmp Noise Diff %Diff Status
bool 10000 210.785 us 1.30% 198.679 us 25.30% -12.106 us -5.74% �[31mFAIL�[39m
bool 100000 253.103 us 7.79% 222.088 us 26.73% -31.015 us -12.25% �[31mFAIL�[39m
bool 1000000 687.758 us 18.16% 342.133 us 26.54% -345.625 us -50.25% �[31mFAIL�[39m
bool 10000000 5.538 ms 1.36% 1.669 ms 4.68% -3869.028 us -69.87% �[31mFAIL�[39m
I8 10000 174.922 us 2.65% 175.165 us 1.58% 0.243 us 0.14% �[32mPASS�[39m
I8 100000 220.113 us 1.65% 191.545 us 9.57% -28.567 us -12.98% �[31mFAIL�[39m
I8 1000000 678.456 us 0.97% 327.453 us 1.63% -351.003 us -51.74% �[31mFAIL�[39m
I8 10000000 5.514 ms 0.79% 1.670 ms 0.93% -3843.922 us -69.71% �[31mFAIL�[39m
I32 10000 174.878 us 3.58% 175.502 us 1.31% 0.624 us 0.36% �[32mPASS�[39m
I32 100000 217.380 us 1.86% 188.653 us 3.68% -28.727 us -13.22% �[31mFAIL�[39m
I32 1000000 685.808 us 1.09% 328.207 us 1.16% -357.601 us -52.14% �[31mFAIL�[39m
I32 10000000 5.636 ms 0.76% 1.722 ms 1.19% -3914.285 us -69.45% �[31mFAIL�[39m
I64 10000 175.257 us 2.47% 175.281 us 1.25% 0.024 us 0.01% �[32mPASS�[39m
I64 100000 217.467 us 2.04% 188.937 us 2.14% -28.530 us -13.12% �[31mFAIL�[39m
I64 1000000 711.008 us 1.20% 335.716 us 2.40% -375.292 us -52.78% �[31mFAIL�[39m
I64 10000000 5.868 ms 0.96% 1.805 ms 1.22% -4062.509 us -69.23% �[31mFAIL�[39m
F32 10000 178.242 us 3.19% 174.956 us 1.34% -3.287 us -1.84% �[31mFAIL�[39m
F32 100000 238.333 us 0.71% 199.978 us 0.85% -38.355 us -16.09% �[31mFAIL�[39m
F32 1000000 1.099 ms 1.84% 573.145 us 1.12% -525.641 us -47.84% �[31mFAIL�[39m
F32 10000000 13.033 ms 0.56% 7.656 ms 0.43% -5376.663 us -41.25% �[31mFAIL�[39m
cudf::timestamp_ms 10000 175.692 us 2.43% 175.332 us 10.54% -0.360 us -0.21% �[32mPASS�[39m
cudf::timestamp_ms 100000 222.362 us 2.93% 188.929 us 2.17% -33.433 us -15.04% �[31mFAIL�[39m
cudf::timestamp_ms 1000000 723.055 us 1.77% 333.130 us 1.34% -389.925 us -53.93% �[31mFAIL�[39m
cudf::timestamp_ms 10000000 5.952 ms 0.67% 1.835 ms 1.27% -4116.537 us -69.16% �[31mFAIL�[39m

distinct_list

[0] Tesla T4

Type null_probability ColumnSize Ref Time Ref Noise Cmp Time Cmp Noise Diff %Diff Status
I32 0 100000000 12.206 ms 0.58% 3.692 ms 0.45% -8513.937 us -69.75% �[31mFAIL�[39m
I32 0.1 100000000 13.764 ms 0.58% 4.242 ms 0.45% -9522.336 us -69.18% �[31mFAIL�[39m
cudf::list_view 0 100000000 13.145 ms 0.67% 3.567 ms 1.02% -9578.041 us -72.86% �[31mFAIL�[39m
cudf::list_view 0.1 100000000 15.093 ms 0.64% 4.173 ms 0.73% -10920.317 us -72.35% �[31mFAIL�[39m

Summary

  • Total Matches: 28
    • Pass (diff <= min_noise): 4
    • Unknown (infinite noise): 0
    • Failure (diff > min_noise): 24

Copy link
Contributor

@ttnghia ttnghia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reduce_by_row is instantiated of hash_reduce_by_row which is shared across multiple TUs. You should not separate them into multiple functions taking different sets of parameters. That is very difficult to join them again after refactor is done.

Copy link
Member

@PointKernel PointKernel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work!

Good to see the build time also reduced.

cpp/src/stream_compaction/distinct.cu Outdated Show resolved Hide resolved
@ttnghia
Copy link
Contributor

ttnghia commented Jun 24, 2024

I thought that we agreed to keep the signature of reduce_by_row unchanged for potential reusing hash_reduce_by_row in the future?

@PointKernel
Copy link
Member

I thought that we agreed to keep the signature of reduce_by_row unchanged for potential reusing hash_reduce_by_row in the future?

I'm not sure what you mean by keeping the signature of reduce_by_row unchanged. The goal of this PR is to refactor distinct with set. The current reduce_by_row takes map as input thus we need to update the signature accordingly.

Copy link
Member

@PointKernel PointKernel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Put this on hold since we find more performance optimization to do based on offline discussions.

Copy link
Contributor

@ttnghia ttnghia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approval and excited waiting for the next refactor work.

@srinivasyadav18
Copy link
Contributor Author

Further performance improvements using set.insert_async with reference as static_set_new from previous comparions results.

distinct

Ref Time = With insert_async operation

Cmp Time = With insert operation (from previous comment ).

[0] Tesla T4

Type NumRows Ref Time Ref Noise Cmp Time Cmp Noise Diff %Diff Status
bool 10000 169.117 us 32.86% 203.653 us 1.88% 34.536 us 20.42% FAIL
bool 100000 190.318 us 28.92% 232.877 us 5.05% 42.559 us 22.36% FAIL
bool 1000000 294.805 us 16.77% 357.525 us 7.37% 62.720 us 21.28% FAIL
bool 10000000 1.414 ms 6.62% 1.676 ms 9.63% 261.566 us 18.49% FAIL
I8 10000 150.012 us 2.80% 187.305 us 29.75% 37.293 us 24.86% FAIL
I8 100000 161.144 us 4.72% 195.348 us 5.06% 34.204 us 21.23% FAIL
I8 1000000 283.225 us 1.66% 330.493 us 8.80% 47.268 us 16.69% FAIL
I8 10000000 1.418 ms 2.14% 1.671 ms 2.87% 252.325 us 17.79% FAIL
I32 10000 149.916 us 2.75% 177.923 us 2.51% 28.006 us 18.68% FAIL
I32 100000 165.040 us 14.86% 193.024 us 2.05% 27.985 us 16.96% FAIL
I32 1000000 283.267 us 1.57% 327.740 us 1.13% 44.473 us 15.70% FAIL
I32 10000000 1.466 ms 2.07% 1.721 ms 1.94% 255.151 us 17.41% FAIL
I64 10000 150.541 us 2.75% 178.182 us 4.02% 27.642 us 18.36% FAIL
I64 100000 158.679 us 2.61% 193.019 us 2.71% 34.340 us 21.64% FAIL
I64 1000000 287.044 us 2.76% 345.719 us 16.33% 58.675 us 20.44% FAIL
I64 10000000 1.551 ms 1.90% 1.830 ms 5.21% 278.500 us 17.95% FAIL
F32 10000 151.368 us 5.03% 179.430 us 5.58% 28.062 us 18.54% FAIL
F32 100000 169.741 us 0.93% 210.323 us 4.70% 40.582 us 23.91% FAIL
F32 1000000 531.013 us 4.36% 590.247 us 10.19% 59.235 us 11.16% FAIL
F32 10000000 7.422 ms 0.16% 7.640 ms 0.29% 217.326 us 2.93% FAIL
cudf::timestamp_ms 10000 150.855 us 7.18% 179.409 us 4.51% 28.554 us 18.93% FAIL
cudf::timestamp_ms 100000 158.212 us 2.58% 193.409 us 2.16% 35.198 us 22.25% FAIL
cudf::timestamp_ms 1000000 289.823 us 2.29% 336.280 us 1.06% 46.457 us 16.03% FAIL
cudf::timestamp_ms 10000000 1.581 ms 1.64% 1.823 ms 1.33% 242.283 us 15.32% FAIL

distinct_list

[0] Tesla T4

Type null_probability ColumnSize Ref Time Ref Noise Cmp Time Cmp Noise Diff %Diff Status
I32 0 100000000 3.183 ms 1.17% 3.666 ms 0.48% 483.897 us 15.20% FAIL
I32 0.1 100000000 3.632 ms 1.40% 4.199 ms 0.99% 567.874 us 15.64% FAIL
cudf::list_view 0 100000000 3.512 ms 0.88% 3.584 ms 0.68% 72.714 us 2.07% FAIL
cudf::list_view 0.1 100000000 4.124 ms 0.91% 4.206 ms 1.78% 81.445 us 1.97% FAIL

Summary

  • Total Matches: 28
    • Pass (diff <= min_noise): 0
    • Unknown (infinite noise): 0
    • Failure (diff > min_noise): 28

Copy link
Member

@PointKernel PointKernel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent! 🔥 🔥 🔥

Thank you!

@srinivasyadav18
Copy link
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit 6eac920 into rapidsai:branch-24.08 Jun 27, 2024
74 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Performance Performance related issue
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

3 participants