Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix quantile division / partition handling for dask-cudf sort on null dataframes #9259

Merged
merged 20 commits into from
Oct 14, 2021

Conversation

charlesbluca
Copy link
Member

Closes #9255

Sorts the output of quantile_divisions for the multi-column case, as leaving it unsorted causes sort_values to output the incorrect order.

Also fixes dask-cudf's null sorting test to actually check that the ordering is correct, which is resulting in another failure I'm currently resolving.

@charlesbluca charlesbluca added bug Something isn't working 2 - In Progress Currently a work in progress dask-cudf non-breaking Non-breaking change labels Sep 20, 2021
@github-actions github-actions bot added the Python Affects Python cuDF API. label Sep 20, 2021
@charlesbluca charlesbluca changed the title [WIP] Sort quantile divisions for dask-cudf multi-column sorting Fix quantile division / partition handling for dask-cudf sort on null dataframes Sep 21, 2021
@charlesbluca charlesbluca marked this pull request as ready for review September 21, 2021 02:23
@charlesbluca charlesbluca requested a review from a team as a code owner September 21, 2021 02:23
@charlesbluca charlesbluca added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Sep 21, 2021
@codecov
Copy link

codecov bot commented Sep 21, 2021

Codecov Report

Merging #9259 (77df746) into branch-21.12 (ab4bfaa) will increase coverage by 0.01%.
The diff coverage is 0.00%.

❗ Current head 77df746 differs from pull request most recent head 0952280. Consider uploading reports for the commit 0952280 to get more accurate results
Impacted file tree graph

@@               Coverage Diff                @@
##           branch-21.12    #9259      +/-   ##
================================================
+ Coverage         10.79%   10.80%   +0.01%     
================================================
  Files               116      117       +1     
  Lines             18869    19425     +556     
================================================
+ Hits               2036     2098      +62     
- Misses            16833    17327     +494     
Impacted Files Coverage Δ
python/cudf/cudf/__init__.py 0.00% <0.00%> (ø)
python/cudf/cudf/_lib/__init__.py 0.00% <ø> (ø)
python/cudf/cudf/_lib/strings/__init__.py 0.00% <0.00%> (ø)
python/dask_cudf/dask_cudf/sorting.py 93.52% <0.00%> (-0.60%) ⬇️
python/cudf/cudf/io/csv.py 0.00% <0.00%> (ø)
python/cudf/cudf/io/hdf.py 0.00% <0.00%> (ø)
python/cudf/cudf/io/orc.py 0.00% <0.00%> (ø)
python/cudf/cudf/_version.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/abc.py 0.00% <0.00%> (ø)
python/cudf/cudf/api/types.py 0.00% <0.00%> (ø)
... and 62 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cff71ff...0952280. Read the comment docs.

@charlesbluca
Copy link
Member Author

rerun tests

@charlesbluca
Copy link
Member Author

Don't seem to be able to replicate these failures locally; bumping my branch to see if that resolves the failures. If not, it should be fine to just revert 5e58ca8 since that's what seems to have caused the failures.

@charlesbluca
Copy link
Member Author

rerun tests

Copy link
Member

@rjzamora rjzamora left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Just a comment suggestion and minor questions.

python/dask_cudf/dask_cudf/tests/test_sort.py Outdated Show resolved Hide resolved
Comment on lines -204 to +203
divisions = divisions.drop_duplicates()
divisions = divisions.drop_duplicates().sort_index()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious why this is necessary?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

drop_duplicates pushes null values first in the series/dataframe, which cause trouble later on in _set_partitions_pre since searchsorted is expecting null values to be place last; this is meant to be equivalent to a similar sorting we do for the single-column case:

divisions = sorted(
divisions.drop_duplicates().astype(dtype).to_arrow().tolist(),
key=lambda x: (x is None, x),
)

That being said, it looks like removing both of those sorts doesn't actually break any of dask-cudf's sorting tests now - it feels like something should be breaking here, as without the sorts we'll sometimes end up assigning the rows of our input dataframe to less unique partitions than the intended number of output partitions; essentially, for this step:

df3 = rearrange_by_column(
df2,
"_partitions",
max_branch=max_branch,
npartitions=len(divisions) - 1,
shuffle="tasks",
ignore_index=ignore_index,
).drop(columns=["_partitions"])

We would sometimes have npartitions == 3 but df["_partitions"].nunique() == 2, which in my head should cause an erroneous sort but isn't in any of the test cases.

I'm going to play around with this more and see if I can get a good test case for this situation.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After digging into rearrange_by_column, I now understand that this shouldn't cause the sort to fail as long as all rows are assigned to properly ordered partitions (which they are with the additional check in _set_partitions_pre). However, we would still end up with a dataframe that has empty partitions - is that something we would want to avoid here by sorting divisions?

Comment on lines -65 to -73
# assert that quantile divisions of dataframe contains nulls
divisions = quantile_divisions(ddf, by, ddf.npartitions)
if isinstance(divisions, list):
assert None in divisions
else:
assert all([divisions[col].has_nulls for col in by])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm trying to remember why this check was here... Do we still want to check that the divisions includes null values, or was this check flawed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Originally this check was here because I wanted to verify that dataframes with nulls are sorted properly even when divisions contains nulls.

I ended up removing this because it isn't always necessarily true - we may end up adding a test case to this function that consists of a dataframe with nulls with a sort_values operation that doesn't lead to nulls in divisions.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ended up removing this because it isn't always necessarily true

Okay, I see - This makes sense

Copy link
Member

@quasiben quasiben left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @charlesbluca for the great work here and @rjzamora for the reviews

@quasiben
Copy link
Member

@gpucibot merge

@charlesbluca
Copy link
Member Author

rerun tests

@quasiben
Copy link
Member

rerun tests

@galipremsagar
Copy link
Contributor

rerun tests

partitions[(partitions < 0) | (partitions >= len(divisions) - 1)] = (
0 if ascending else (len(divisions) - 2)
)
partitions[s._columns[0].isna().values] = len(divisions) - 2
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems a little cumbersome, but was the best way I could think of to replicate the null handling in Dask's set_partitions_pre:

https://github.com/dask/dask/blob/e98d4f2cf8884d142b93c9fb2405d47e4ad02a54/dask/dataframe/shuffle.py#L811

Really the only difference here is that s is a dataframe instead of a series, so we need to grab the first column from it (the first sort-by column) before checking isna.

expect = df.sort_values(by=by, ascending=ascending)

# check that sorted indices are identical
dd.assert_eq(got.reset_index(), expect.reset_index(), check_index=False)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe sorting is non-deterministic for nulls here? In that case we may want to remove the reset_index and just test for equality on the values of the dataframe. Currently testing this out in #9264.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good to me. Looks like the CI passed now too. Feel free to have this taken care of in this PR or in #9264, this should be good to go.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yikes, my approval triggered a merge since there was already a merge comment. Probably can you take care of it in #9264?

@rapids-bot rapids-bot bot merged commit 5bcb3e8 into rapidsai:branch-21.12 Oct 14, 2021
@galipremsagar galipremsagar added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team labels Oct 14, 2021
charlesbluca added a commit to charlesbluca/cudf that referenced this pull request Oct 14, 2021
rapids-bot bot pushed a commit that referenced this pull request Oct 14, 2021
#9438)

… on null dataframes (#9259)"

This reverts commit 5bcb3e8.

<!--

Thank you for contributing to cuDF :)

Here are some guidelines to help the review process go smoothly.

1. Please write a description in this text box of the changes that are being
   made.

2. Please ensure that you have written units tests for the changes made/features
   added.

3. There are CI checks in place to enforce that committed code follows our style
   and syntax standards. Please see our contribution guide in `CONTRIBUTING.MD`
   in the project root for more information about the checks we perform and how
   you can run them locally.

4. If you are closing an issue please use one of the automatic closing words as
   noted here: https://help.github.com/articles/closing-issues-using-keywords/

5. If your pull request is not ready for review but you want to make use of the
   continuous integration testing facilities please mark your pull request as Draft.
   https://docs.github.com/en/github/collaborating-with-issues-and-pull-requests/changing-the-stage-of-a-pull-request#converting-a-pull-request-to-a-draft

6. If your pull request is ready to be reviewed without requiring additional
   work on top of it, then remove it from "Draft" and make it "Ready for Review".
   https://docs.github.com/en/github/collaborating-with-issues-and-pull-requests/changing-the-stage-of-a-pull-request#marking-a-pull-request-as-ready-for-review

   If assistance is required to complete the functionality, for example when the
   C/C++ code of a feature is complete but Python bindings are still required,
   then add the label `help wanted` so that others can triage and assist.
   The additional changes then can be implemented on top of the same PR.
   If the assistance is done by members of the rapidsAI team, then no
   additional actions are required by the creator of the original PR for this,
   otherwise the original author of the PR needs to give permission to the
   person(s) assisting to commit to their personal fork of the project. If that
   doesn't happen then a new PR based on the code of the original PR can be
   opened by the person assisting, which then will be the PR that will be
   merged.

7. Once all work has been done and review has taken place please do not add
   features or make changes out of the scope of those requested by the reviewer
   (doing this just add delays as already reviewed code ends up having to be
   re-reviewed/it is hard to tell what is new etc!). Further, please do not
   rebase your branch on the target branch, force push, or rewrite history.
   Doing any of these causes the context of any comments made by reviewers to be lost.
   If conflicts occur against the target branch they should be resolved by
   merging the target branch into the branch used for making the pull request.

Many thanks in advance for your cooperation!

-->

Tests were intermittently failing on #9259 but it was erroneously merged in

cc @galipremsagar @quasiben

Authors:
  - Charles Blackmon-Luca (https://github.com/charlesbluca)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #9438
@charlesbluca charlesbluca deleted the fix-9255 branch July 19, 2022 14:26
@vyasr vyasr added dask Dask issue and removed dask-cudf labels Feb 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge bug Something isn't working dask Dask issue non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Multi-column sorting on dask-cudf dataframe with nulls gives incorrect ordering
5 participants