[FEA] Address performance regression in semi/anti joins from switching to cuco #9973

Open
vyasr opened this issue Jan 5, 2022 · 5 comments
Labels: feature request (New feature or request), libcudf (Affects libcudf (C++/CUDA) code.)

Comments

@vyasr
Contributor

vyasr commented Jan 5, 2022

Is your feature request related to a problem? Please describe.
cuco::static_map is currently slower than cudf's internal cudf::concurrent_unordered_map. However, as part of a general move towards relying on cuCollections for our data structures, we made the replacement in #9666.
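For context, these joins boil down to building a hash table over the right table's keys and then filtering the left table's rows by membership, so the map's probe throughput is essentially the hot path. A host-side analogue of the pattern, purely illustrative (made-up function and key types; the GPU code does the same filtering with a device hash map):

```cpp
// Host-side analogue of a hash-based left semi/anti join, for illustration
// only; libcudf performs the same membership filtering on the GPU.
#include <unordered_set>
#include <vector>

std::vector<int> left_semi_join_indices(std::vector<int> const& left_keys,
                                        std::vector<int> const& right_keys,
                                        bool anti = false)
{
  // Build phase: hash the right table's keys.
  std::unordered_set<int> right_set(right_keys.begin(), right_keys.end());

  // Probe phase: one membership test per left row.
  std::vector<int> gather_map;
  for (int i = 0; i < static_cast<int>(left_keys.size()); ++i) {
    bool const found = right_set.count(left_keys[i]) > 0;
    // A semi join keeps matching rows; an anti join keeps the rest.
    if (found != anti) { gather_map.push_back(i); }
  }
  return gather_map;
}
```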

Describe the solution you'd like
cuco::static_multimap was optimized as part of its incorporation into cudf in #8934, but because of the current structure of cuCollections those optimizations have not been ported to cuco::static_map. While addressing NVIDIA/cuCollections#110, we should profile and make sure that the resulting optimizations are sufficient to close the performance gap.

Additional context
The performance loss was deemed acceptable in order to progress faster towards #9695, which will rely on features that cuco::static_map offers that cudf::concurrent_unordered_map does not.

@abellina
Contributor

abellina commented Jan 9, 2022

I am not entirely sure this is the same regression, but I am seeing one after #9666 was merged. I see fewer invocations of this kernel, but they are ~10x slower than before the PR was merged. In one of the TPCDS @ 3TB queries, q87 (and possibly others), I see a loss of 4 seconds (before: 14263 ms, after: 18194 ms):

void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__copy_if::CopyIfAgent<thrust::counting_iterator<int, thrust::use_default, thrust::use_default, thrust::use_default>, thrust::cuda_cub::__copy_if::no_stencil_tag_*, int*, cudf::detail::left_semi_anti_join(cudf::detail::join_kind, cudf::table_view const&, cudf::table_view const&, cudf::null_equality, rmm::cuda_stream_view, rmm::mr::device_memory_resource*)::{lambda(int)#2}, int, int*>, thrust::counting_iterator<int, thrust::use_default, thrust::use_default, thrust::use_default>, thrust::cuda_cub::__copy_if::no_stencil_tag_*, int*, {lambda(int)#2}, int, int*, cub::ScanTileState<int, true>, unsigned long>(thrust::counting_iterator<int, thrust::use_default, thrust::use_default, thrust::use_default>, thrust::cuda_cub::__copy_if::no_stencil_tag_*, int*, {lambda(int)#2}, int, int*, cub::ScanTileState<int, true>, unsigned long)
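For reference, my understanding (possibly wrong) of where this kernel comes from: left_semi_anti_join builds its output gather map with a thrust::copy_if over left row indices, and the copy_if predicate is what probes the hash table, so the probe cost shows up as time in this kernel. A simplified sketch of that pattern, with the real probe replaced by a placeholder predicate (not the actual libcudf code; compile with nvcc):

```cpp
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/iterator/counting_iterator.h>

// Stand-in for the hash-table membership test done per left row; the real
// predicate probes the map built from the right table.
struct in_right_table {
  __device__ bool operator()(int left_row_index) const
  {
    return (left_row_index % 2) == 0;  // placeholder logic only
  }
};

thrust::device_vector<int> make_gather_map(int left_num_rows)
{
  thrust::device_vector<int> gather_map(left_num_rows);
  auto end = thrust::copy_if(thrust::device,
                             thrust::make_counting_iterator(0),
                             thrust::make_counting_iterator(left_num_rows),
                             gather_map.begin(),
                             in_right_table{});
  gather_map.resize(end - gather_map.begin());
  return gather_map;
}
```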

Does this seem to be the same issue as in this ticket?

I also tried with an early version of the mixed condition PR and this kernel seems to be back to normal. That said, I don't see us invoking the mixed condition kernels. Perhaps there are general performance improvements in #9917 that help.

@abellina
Contributor

> I also tried with an early version of the mixed condition PR and this kernel seems to be back to normal. That said, I don't see us invoking the mixed condition kernels. Perhaps there are general performance improvements in #9917 that help.

I realize now that this also means the change (#9666) wasn't merged into that branch, so I don't think the mixed join change will address it.

I would still like to confirm that the place I saw the regression is expected of this issue.

@vyasr
Contributor Author

vyasr commented Jan 10, 2022

The slowdowns that I observed were more in the 1-3.5x range than 10x, so I'm surprised that you're seeing something so significant, but a regression is expected because of this issue. I'm especially surprised because I observed larger slowdowns for smaller data sizes, whereas for larger data sizes the relative slowdown was much smaller (although in absolute terms a 4 second slowdown is of course possible). I wonder, though I have no specific evidence for this, whether #9912 may also have slowed down the specific algorithms used here, which would be exacerbating the issue.

The mixed condition PR is independent and will not change the perf here.

@abellina
Contributor

Thanks @vyasr, I don't think that PR is related, as I used b1ae789 as my baseline, and that was after the upgrade to Thrust. I compared against 33f7f0d.

@github-actions

github-actions bot commented Feb 9, 2022

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@vyasr vyasr added this to the Conditional Joins milestone Apr 26, 2022
@GregoryKimball GregoryKimball added the libcudf (Affects libcudf (C++/CUDA) code.) label and removed the Needs Triage (Need team to review and classify) label Jun 24, 2022