Use cudf to compute exact hash join output row sizes #3288
Conversation
Signed-off-by: Jason Lowe <jlowe@nvidia.com>
From the look of things it appears that the builtHash does not outlive a single batch being returned. The rest of the changes look good.
Yeah, the built hash isn't spillable, and the code tries to make everything spillable before it produces a batch. Therefore I thought it prudent to throw away the built hash to avoid a potential OOM.
build
I ran Q75, which has quite a few joins in it, at SF=100 on my local desktop. It's in the ballpark of 5% faster: around 33 seconds before and 31.5 seconds after. I verified with an Nsight Systems trace that the hash table is getting reused between the output size and join calculations for an individual batch.
Fixes #2440
Removes the hash join code that estimated the join output size and replaces it with the cudf join output size APIs. This also removes the OOM catch-and-retry logic, which should no longer be necessary now that we compute an exact output row count rather than an estimate.
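To illustrate the idea behind the change (this is a hedged sketch in plain Python, not the cudf API): an exact inner hash join output size can be computed by building a hash table of build-side key counts and summing, over the probe rows, the number of matching build rows. Sizing the output from this exact count removes the need for estimate-driven OOM retry logic. The function name `hash_join_output_size` is hypothetical.

```python
# Sketch: exact inner hash join output row count, computed before
# materializing the join. Not the cudf API -- just the concept.
from collections import Counter

def hash_join_output_size(build_keys, probe_keys):
    """Exact inner-join output size: for each probe key, add the
    number of build rows with the same key."""
    build_counts = Counter(build_keys)  # hash table of build-side keys
    return sum(build_counts[k] for k in probe_keys)

# probe key 2 matches 2 build rows; each probe 3 matches 1; 4 matches 0
print(hash_join_output_size([1, 2, 2, 3], [2, 3, 3, 4]))  # prints 4
```

The same build-side hash table can then be reused for the actual join of that batch, which is what the Nsight Systems trace above confirms happens here.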
Posting as a draft for the following reasons: