
Use cudf to compute exact hash join output row sizes #3288

Merged: 1 commit into NVIDIA:branch-21.10 on Aug 27, 2021

Conversation

@jlowe (Member) commented on Aug 24, 2021

Fixes #2440

Removes the hash join code that estimates the join output size and replaces it with the cudf join output size APIs. This also removes the OOM catch-and-retry logic, since in theory it is no longer necessary now that we produce an exact size instead of an estimate.

Posting as a draft for the following reasons:

Signed-off-by: Jason Lowe <jlowe@nvidia.com>
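For context, a minimal sketch of the pattern this change moves to, written in Scala against the cudf Java bindings. The method names here (`innerJoinRowCount`, `innerJoinGatherMaps`) follow the cudf Java API of this era, but treat the exact signatures as assumptions; this is a simplified illustration, not the actual plugin code.

```scala
import ai.rapids.cudf.{GatherMap, Table}

// Sketch only: ask cudf for the exact join output row count first, then
// materialize the gather maps. Because the size is exact rather than an
// estimate, no OOM catch-and-retry loop around the join is needed.
def exactInnerJoin(
    leftKeys: Table,
    rightKeys: Table,
    compareNullsEqual: Boolean): Array[GatherMap] = {
  // Exact output row count from cudf, not an estimate (assumed method name).
  val outputRows: Long = leftKeys.innerJoinRowCount(rightKeys, compareNullsEqual)
  // A real implementation could use outputRows to decide whether to split
  // the batch before materializing the join output.
  leftKeys.innerJoinGatherMaps(rightKeys, compareNullsEqual)
}
```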
@jlowe jlowe added the cudf_dependency An issue or PR with this label depends on a new feature in cudf label Aug 24, 2021
@jlowe jlowe added this to the Aug 16 - Aug 27 milestone Aug 24, 2021
@jlowe jlowe self-assigned this Aug 24, 2021
@revans2 (Collaborator) left a comment


From the look of things it appears that the builtHash does not outlive a single batch being returned. The rest of the changes look good.

@jlowe (Member, Author) commented on Aug 24, 2021

From the look of things it appears that the builtHash does not outlive a single batch being returned.

Yeah, the built hash isn't spillable, and the code tries to make everything spillable before it produces a batch. Therefore I thought it prudent to throw away the built hash to avoid a potential OOM.
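A hypothetical sketch of the lifetime rule being described here; the names (`builtHash`, `buildJoinBatch`, `yieldBatch`) are illustrative only and do not correspond to the actual plugin code:

```scala
import org.apache.spark.sql.vectorized.ColumnarBatch

// Illustrative only: the built hash table is not spillable, so release it
// before the output batch is handed downstream, where the rest of the
// state has already been made spillable.
def yieldBatch(builtHash: AutoCloseable)(buildJoinBatch: => ColumnarBatch): ColumnarBatch = {
  try {
    buildJoinBatch // produce the join output while the hash table is still alive
  } finally {
    builtHash.close() // drop the non-spillable hash table before returning
  }
}
```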

@jlowe (Member, Author) commented on Aug 27, 2021

build

@jlowe (Member, Author) commented on Aug 27, 2021

I ran Q75, which has quite a few joins in it, at SF=100 on my local desktop. It's in the ballpark of 5% faster: around 33 seconds before and 31.5 seconds after. I verified with an Nsight Systems trace that the hash table is being reused between the output size and join calculations for an individual batch.
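One way to express the reuse the trace confirms is with a stateful built hash table, sketched below using the `HashJoin` class that appeared in the cudf Java bindings around this period. Its availability at the time of this PR and the exact signatures are assumptions; the code is a simplified illustration, not the plugin implementation.

```scala
import ai.rapids.cudf.{GatherMap, HashJoin, Table}

// Sketch: build the hash table once per build-side batch and reuse it for
// both the exact-size pass and the gather-map pass.
def joinReusingHash(probeKeys: Table, buildKeys: Table): Array[GatherMap] = {
  val builtHash = new HashJoin(buildKeys, true) // compareNullsEqual = true
  try {
    val outputRows = probeKeys.innerJoinRowCount(builtHash) // size pass
    probeKeys.innerJoinGatherMaps(builtHash, outputRows)    // join pass reuses the hash
  } finally {
    builtHash.close() // per the discussion above: not spillable, so free it promptly
  }
}
```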

@jlowe jlowe marked this pull request as ready for review August 27, 2021 15:34
@jlowe jlowe merged commit 25bad3d into NVIDIA:branch-21.10 Aug 27, 2021
@jlowe jlowe deleted the hash-join-output-size branch September 10, 2021 15:37
abellina added a commit to abellina/spark-rapids that referenced this pull request Sep 23, 2021
jlowe added a commit to jlowe/spark-rapids that referenced this pull request Sep 24, 2021
jlowe added a commit to jlowe/spark-rapids that referenced this pull request Sep 24, 2021
…3288)"

This reverts commit 25bad3d.

Signed-off-by: Jason Lowe <jlowe@nvidia.com>
tgravescs pushed a commit that referenced this pull request Sep 24, 2021
…#3657)

This reverts commit 25bad3d.

Signed-off-by: Jason Lowe <jlowe@nvidia.com>
Labels
cudf_dependency An issue or PR with this label depends on a new feature in cudf
Development

Successfully merging this pull request may close these issues.

[FEA] Use CUDF API for getting join output size
2 participants