Fix quantile division / partition handling for dask-cudf sort on null dataframes #9259
Conversation
Codecov Report
@@            Coverage Diff             @@
##           branch-21.12   #9259      +/-   ##
================================================
+ Coverage        10.79%   10.80%    +0.01%
================================================
  Files              116      117        +1
  Lines            18869    19425      +556
================================================
+ Hits              2036     2098       +62
- Misses           16833    17327      +494
================================================
Continue to review full report at Codecov.
rerun tests

I can't seem to replicate these failures locally; bumping my branch to see if that resolves them. If not, it should be fine to just revert 5e58ca8, since that seems to be what caused the failures.

rerun tests
LGTM. Just a comment suggestion and minor questions.
- divisions = divisions.drop_duplicates()
+ divisions = divisions.drop_duplicates().sort_index()
Curious why this is necessary?
`drop_duplicates` pushes null values to the front of the series/dataframe, which causes trouble later on in `_set_partitions_pre`, since `searchsorted` expects null values to be placed last; this is meant to be equivalent to a similar sort we do for the single-column case:
cudf/python/dask_cudf/dask_cudf/sorting.py, lines 189 to 192 in dda5210:

```python
divisions = sorted(
    divisions.drop_duplicates().astype(dtype).to_arrow().tolist(),
    key=lambda x: (x is None, x),
)
```
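As a toy illustration (plain Python, not dask-cudf code), the tuple sort key above orders any `None` entry after all concrete values, because `False < True` and the `is None` check is compared first:

```python
# Toy sketch of the sort key used above: the tuple key sorts on
# "is this None?" first (False < True), so None lands at the end.
divisions = [3, None, 1, 2]
ordered = sorted(divisions, key=lambda x: (x is None, x))
print(ordered)  # [1, 2, 3, None]
```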
That being said, it looks like removing both of those sorts doesn't actually break any of dask-cudf's sorting tests right now. It feels like something should be breaking here, since without the sorts we'll sometimes end up assigning the rows of our input dataframe to fewer unique partitions than the intended number of output partitions; essentially, for this step:
cudf/python/dask_cudf/dask_cudf/sorting.py, lines 239 to 246 in dda5210:

```python
df3 = rearrange_by_column(
    df2,
    "_partitions",
    max_branch=max_branch,
    npartitions=len(divisions) - 1,
    shuffle="tasks",
    ignore_index=ignore_index,
).drop(columns=["_partitions"])
```
We would sometimes have `npartitions == 3` but `df["_partitions"].nunique() == 2`, which in my head should cause an erroneous sort, but doesn't in any of the test cases. I'm going to play around with this more and see if I can get a good test case for this situation.
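A minimal numpy sketch of how such a mismatch can show up numerically; the values and divisions here are hypothetical (in this sketch it arises from a repeated division boundary rather than the unsorted null divisions discussed above, but the effect on partition counts is the same):

```python
import numpy as np

# Hypothetical divisions with a repeated boundary value; the real
# divisions come from quantile_divisions.
divisions = np.array([0, 5, 5, 10])
values = np.array([1, 2, 7, 8])

# Assign each row to the partition whose division range contains it.
partitions = np.searchsorted(divisions, values, side="right") - 1
npartitions = len(divisions) - 1

print(npartitions)                 # 3
print(len(np.unique(partitions)))  # 2 -- one partition ends up empty
```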
After digging into `rearrange_by_column`, I now understand that this shouldn't cause the sort to fail as long as all rows are assigned to properly ordered partitions (which they are with the additional check in `_set_partitions_pre`). However, we would still end up with a dataframe that has empty partitions - is that something we would want to avoid here by sorting `divisions`?
```python
# assert that quantile divisions of dataframe contains nulls
divisions = quantile_divisions(ddf, by, ddf.npartitions)
if isinstance(divisions, list):
    assert None in divisions
else:
    assert all([divisions[col].has_nulls for col in by])
```
I'm trying to remember why this check was here... Do we still want to check that the divisions includes null values, or was this check flawed?
Originally this check was here because I wanted to verify that dataframes with nulls are sorted properly even when `divisions` contains nulls. I ended up removing it because it isn't always necessarily true - we may end up adding a test case to this function that consists of a dataframe with nulls with a `sort_values` operation that doesn't lead to nulls in `divisions`.
> I ended up removing this because it isn't always necessarily true

Okay, I see - this makes sense.
Thanks @charlesbluca for the great work here and @rjzamora for the reviews
@gpucibot merge
rerun tests |
rerun tests |
rerun tests |
```python
partitions[(partitions < 0) | (partitions >= len(divisions) - 1)] = (
    0 if ascending else (len(divisions) - 2)
)
partitions[s._columns[0].isna().values] = len(divisions) - 2
```
This seems a little cumbersome, but it was the best way I could think of to replicate the null handling in Dask's `set_partitions_pre`. Really the only difference here is that `s` is a dataframe instead of a series, so we need to grab the first column from it (the first sort-by column) before checking `isna`.
```python
expect = df.sort_values(by=by, ascending=ascending)

# check that sorted indices are identical
dd.assert_eq(got.reset_index(), expect.reset_index(), check_index=False)
```
Maybe sorting is non-deterministic for nulls here? In that case we may want to remove the `reset_index` calls and just test for equality on the values of the dataframe. Currently testing this out in #9264.
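A pandas-only sketch of what a value-based (index-free) comparison could look like; `assert_frame_equal` here is just a stand-in for `dd.assert_eq`, and the data is hypothetical:

```python
import pandas as pd

# Nulls sort last by default; the original row indices of null rows may
# come back in any order, so compare values only after dropping the index.
got = pd.DataFrame({"a": [None, 1.0, 2.0]}).sort_values("a")
expect = pd.DataFrame({"a": [1.0, 2.0, None]})

pd.testing.assert_frame_equal(
    got.reset_index(drop=True), expect.reset_index(drop=True)
)
print("values match")
```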
Sounds good to me. Looks like CI passed now too. Feel free to have this taken care of in this PR or in #9264; this should be good to go.
Yikes, my approval triggered a merge since there was already a merge comment. Can you take care of it in #9264?
Revert "Fix quantile division / partition handling for dask-cudf sort on null dataframes (#9259)" (#9438)

This reverts commit 5bcb3e8. Tests were intermittently failing on #9259 but it was erroneously merged.

cc @galipremsagar @quasiben

Authors:
- Charles Blackmon-Luca (https://github.com/charlesbluca)

Approvers:
- GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #9438
Closes #9255

Sorts the output of `quantile_divisions` for the multi-column case, as leaving it unsorted causes `sort_values` to output the incorrect order. Also fixes dask-cudf's null sorting test to actually check that the ordering is correct, which is surfacing another failure I'm currently resolving.