-
Notifications
You must be signed in to change notification settings - Fork 891
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Match pandas join ordering obligations in pandas-compatible mode #14428
Match pandas join ordering obligations in pandas-compatible mode #14428
Conversation
Note that there is still one case where we don't match pandas, but I think that is a pandas bug: pandas-dev/pandas#55992 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Can you share benchmarks? Do we still need to propose moving this logic to libcudf, if the performance is reduced? It sounds like @shwina is approving but I thought we were wanting to confirm performance impact first. |
I think we still want to move to libcudf in the medium term. In a private comment, Lawrence posted some results showing the perf impact that I'm copying here. It seems like a reasonable compromise for the short term:
@beckernick would you agree? |
Great. I'm good with merging this in the short term and then revisiting moving this logic to libcudf. Ideally that could land in 24.02. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice. I like the tests -- very thorough!
The additional cost of compat-mode is independent of the merge type (or number of merge keys, etc...). It is effectively the cost of
|
If we pass sort=True to merges we are on the hook to sort the result in order with respect to the key columns. If those key columns have repeated values there is still some space for ambiguity. Currently we get a result back whose order (for the repeated key values) is determined by the gather map that libcudf returns for the join. This does not come with any ordering guarantees. When sort=False, pandas has join-type dependent ordering guarantees which we also do not match. To fix this, in pandas-compatible mode only, reorder the gather maps according to the order of the input keys. When sort=False this means that our result matches pandas ordering. When sort=True, it ensures that (if we use a stable sort) the tie-break for equal sort keys is the input dataframe order. While we're here, switch from argsort + gather to sort_by_key when sorting results. - Closes rapidsai#14001
f93d5c3
to
1e9841c
Compare
/merge |
Description
If we pass sort=True to merges we are on the hook to sort the result in order with respect to the key columns. If those key columns have repeated values there is still some space for ambiguity. Currently we get a result back whose order (for the repeated key values) is determined by the gather map that libcudf returns for the join. This does not come with any ordering guarantees.
When sort=False, pandas has join-type dependent ordering guarantees which we also do not match. To fix this, in pandas-compatible mode only, reorder the gather maps according to the order of the input keys. When sort=False this means that our result matches pandas ordering. When sort=True, it ensures that (if we use a stable sort) the tie-break for equal sort keys is the input dataframe order.
While we're here, switch from argsort + gather to sort_by_key when sorting results.
Checklist