[FEA] Address performance regression in semi/anti joins from switching to cuco #9973

Open
vyasr opened this issue Jan 5, 2022 · 5 comments
Labels: feature request (New feature or request), libcudf (Affects libcudf (C++/CUDA) code.)

Comments

@vyasr
Contributor

vyasr commented Jan 5, 2022

Is your feature request related to a problem? Please describe.
cuco::static_map is currently slower than cudf's internal cudf::concurrent_unordered_map. However, as part of a general move towards relying on cuCollections for our data structures, we made the replacement in #9666.
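For context, these joins boil down to building a hash table over the right table's keys and then filtering the left table's rows by membership, so the map's probe throughput is essentially the hot path. A host-side analogue of the pattern, purely illustrative (made-up function and key types; the GPU code does the same filtering with a device hash map):

```cpp
// Host-side analogue of a hash-based left semi/anti join, for illustration
// only; libcudf performs the same membership filtering on the GPU.
#include <unordered_set>
#include <vector>

std::vector<int> left_semi_join_indices(std::vector<int> const& left_keys,
                                        std::vector<int> const& right_keys,
                                        bool anti = false)
{
  // Build phase: hash the right table's keys.
  std::unordered_set<int> right_set(right_keys.begin(), right_keys.end());

  // Probe phase: one membership test per left row.
  std::vector<int> gather_map;
  for (int i = 0; i < static_cast<int>(left_keys.size()); ++i) {
    bool const found = right_set.count(left_keys[i]) > 0;
    // A semi join keeps matching rows; an anti join keeps the rest.
    if (found != anti) { gather_map.push_back(i); }
  }
  return gather_map;
}
```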

Describe the solution you'd like
cuco::static_multimap was optimized as part of its incorporation into cudf in #8934, but because of the current structure of cuCollections those optimizations have not been ported to cuco::static_map. While addressing NVIDIA/cuCollections#110, we should profile and make sure that the resulting optimizations are sufficient to close the performance gap.

Additional context
The performance loss was deemed acceptable in order to progress faster towards #9695, which will rely on features that cuco::static_map offers that cudf::concurrent_unordered_map does not.

@abellina
Contributor

abellina commented Jan 9, 2022

I am not entirely sure this is the same regression, but I am seeing one after #9666 was merged. I see fewer invocations of this kernel, but they are ~10x slower than before the PR was merged. In one of the TPCDS @ 3TB queries, q87 (and possibly others), I see a loss of 4 seconds (before: 14263 ms, after: 18194 ms):

void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__copy_if::CopyIfAgent<thrust::counting_iterator<int, thrust::use_default, thrust::use_default, thrust::use_default>, thrust::cuda_cub::__copy_if::no_stencil_tag_*, int*, cudf::detail::left_semi_anti_join(cudf::detail::join_kind, cudf::table_view const&, cudf::table_view const&, cudf::null_equality, rmm::cuda_stream_view, rmm::mr::device_memory_resource*)::{lambda(int)#2}, int, int*>, thrust::counting_iterator<int, thrust::use_default, thrust::use_default, thrust::use_default>, thrust::cuda_cub::__copy_if::no_stencil_tag_*, int*, {lambda(int)#2}, int, int*, cub::ScanTileState<int, true>, unsigned long>(thrust::counting_iterator<int, thrust::use_default, thrust::use_default, thrust::use_default>, thrust::cuda_cub::__copy_if::no_stencil_tag_*, int*, {lambda(int)#2}, int, int*, cub::ScanTileState<int, true>, unsigned long)
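For reference, my understanding (possibly wrong) of where this kernel comes from: left_semi_anti_join builds its output gather map with a thrust::copy_if over left row indices, and the copy_if predicate is what probes the hash table, so the probe cost shows up as time in this kernel. A simplified sketch of that pattern, with the real probe replaced by a placeholder predicate (not the actual libcudf code; compile with nvcc):

```cpp
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/iterator/counting_iterator.h>

// Stand-in for the hash-table membership test done per left row; the real
// predicate probes the map built from the right table.
struct in_right_table {
  __device__ bool operator()(int left_row_index) const
  {
    return (left_row_index % 2) == 0;  // placeholder logic only
  }
};

thrust::device_vector<int> make_gather_map(int left_num_rows)
{
  thrust::device_vector<int> gather_map(left_num_rows);
  auto end = thrust::copy_if(thrust::device,
                             thrust::make_counting_iterator(0),
                             thrust::make_counting_iterator(left_num_rows),
                             gather_map.begin(),
                             in_right_table{});
  gather_map.resize(end - gather_map.begin());
  return gather_map;
}
```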

Does this seem to be the same issue as in this ticket?

I also tried with an early version of the mixed condition PR and this kernel seems to be back to normal. That said, I don't see us invoking the mixed condition kernels. Perhaps there are general performance improvements in #9917 that help.

@abellina
Contributor

> I also tried with an early version of the mixed condition PR and this kernel seems to be back to normal. That said, I don't see us invoking the mixed condition kernels. Perhaps there are general performance improvements in #9917 that help.

I realize now that this also means the change (#9666) wasn't merged into that branch, so I don't think the mixed join change will address it.

I would still like to confirm that the place I saw the regression is expected of this issue.

@vyasr
Contributor Author

vyasr commented Jan 10, 2022

The slowdowns that I observed were more in the 1-3.5x range than 10x, so I'm surprised that you're seeing something so significant, but a regression is expected because of this issue. I'm especially surprised because I observed larger slowdowns for smaller data sizes, whereas for larger data sizes the relative slowdown was much smaller (although in absolute terms a 4 second slowdown is of course possible). I wonder, though I have no specific evidence for this, whether #9912 may also have slowed down the specific algorithms used here, which would be exacerbating the issue.

The mixed condition PR is independent and will not change the perf here.

@abellina
Contributor

Thanks @vyasr, I don't think that PR is related, as I used b1ae789 as my baseline, and that was after the upgrade to Thrust. I compared against 33f7f0d.

@github-actions

github-actions bot commented Feb 9, 2022

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@vyasr vyasr added this to the Conditional Joins milestone Apr 26, 2022
@GregoryKimball GregoryKimball added the libcudf (Affects libcudf (C++/CUDA) code.) label and removed the Needs Triage (Need team to review and classify) label Jun 24, 2022