Fix quantile division / partition handling for dask-cudf sort on null dataframes #9259
Conversation
Codecov Report
@@            Coverage Diff             @@
##           branch-21.12   #9259      +/-   ##
================================================
+ Coverage        10.79%   10.80%    +0.01%
================================================
  Files              116      117        +1
  Lines            18869    19425      +556
================================================
+ Hits              2036     2098       +62
- Misses           16833    17327      +494
================================================
Continue to review full report at Codecov.
rerun tests

I can't seem to replicate these failures locally; bumping my branch to see if that resolves them. If not, it should be fine to just revert 5e58ca8, since that seems to be what caused the failures.

rerun tests
LGTM. Just a comment suggestion and minor questions.
- divisions = divisions.drop_duplicates()
+ divisions = divisions.drop_duplicates().sort_index()
Curious why this is necessary?
`drop_duplicates` pushes null values to the front of the series/dataframe, which causes trouble later on in `_set_partitions_pre`, since `searchsorted` expects null values to be placed last; this is meant to be equivalent to a similar sort we do for the single-column case:
cudf/python/dask_cudf/dask_cudf/sorting.py, lines 189 to 192 in dda5210:

```python
divisions = sorted(
    divisions.drop_duplicates().astype(dtype).to_arrow().tolist(),
    key=lambda x: (x is None, x),
)
```
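As a toy illustration (plain Python, not dask-cudf code), the tuple sort key above orders any `None` entry after all concrete values, because `False < True` and the `is None` check is compared first:

```python
# Toy sketch of the sort key used above: the tuple key sorts on
# "is this None?" first (False < True), so None lands at the end.
divisions = [3, None, 1, 2]
ordered = sorted(divisions, key=lambda x: (x is None, x))
print(ordered)  # [1, 2, 3, None]
```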
That being said, it looks like removing both of those sorts doesn't actually break any of dask-cudf's sorting tests right now. It feels like something should be breaking here, since without the sorts we'll sometimes end up assigning the rows of our input dataframe to fewer unique partitions than the intended number of output partitions; essentially, for this step:
cudf/python/dask_cudf/dask_cudf/sorting.py, lines 239 to 246 in dda5210:

```python
df3 = rearrange_by_column(
    df2,
    "_partitions",
    max_branch=max_branch,
    npartitions=len(divisions) - 1,
    shuffle="tasks",
    ignore_index=ignore_index,
).drop(columns=["_partitions"])
```
We would sometimes have `npartitions == 3` but `df["_partitions"].nunique() == 2`, which in my head should cause an erroneous sort, but doesn't in any of the test cases. I'm going to play around with this more and see if I can get a good test case for this situation.
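A minimal numpy sketch of how such a mismatch can show up numerically; the values and divisions here are hypothetical (in this sketch it arises from a repeated division boundary rather than the unsorted null divisions discussed above, but the effect on partition counts is the same):

```python
import numpy as np

# Hypothetical divisions with a repeated boundary value; the real
# divisions come from quantile_divisions.
divisions = np.array([0, 5, 5, 10])
values = np.array([1, 2, 7, 8])

# Assign each row to the partition whose division range contains it.
partitions = np.searchsorted(divisions, values, side="right") - 1
npartitions = len(divisions) - 1

print(npartitions)                 # 3
print(len(np.unique(partitions)))  # 2 -- one partition ends up empty
```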
After digging into `rearrange_by_column`, I now understand that this shouldn't cause the sort to fail as long as all rows are assigned to properly ordered partitions (which they are with the additional check in `_set_partitions_pre`). However, we would still end up with a dataframe that has empty partitions - is that something we would want to avoid here by sorting `divisions`?
```python
# assert that quantile divisions of dataframe contains nulls
divisions = quantile_divisions(ddf, by, ddf.npartitions)
if isinstance(divisions, list):
    assert None in divisions
else:
    assert all([divisions[col].has_nulls for col in by])
```
I'm trying to remember why this check was here... Do we still want to check that the divisions includes null values, or was this check flawed?
Originally this check was here because I wanted to verify that dataframes with nulls are sorted properly even when `divisions` contains nulls. I ended up removing it because it isn't always necessarily true - we may end up adding a test case to this function that consists of a dataframe with nulls with a `sort_values` operation that doesn't lead to nulls in `divisions`.
> I ended up removing this because it isn't always necessarily true

Okay, I see - this makes sense.
Thanks @charlesbluca for the great work here and @rjzamora for the reviews
@gpucibot merge
rerun tests |
rerun tests |
rerun tests |
```python
partitions[(partitions < 0) | (partitions >= len(divisions) - 1)] = (
    0 if ascending else (len(divisions) - 2)
)
partitions[s._columns[0].isna().values] = len(divisions) - 2
```
This seems a little cumbersome, but it was the best way I could think of to replicate the null handling in Dask's `set_partitions_pre`. Really the only difference here is that `s` is a dataframe instead of a series, so we need to grab the first column from it (the first sort-by column) before checking `isna`.
```python
expect = df.sort_values(by=by, ascending=ascending)

# check that sorted indices are identical
dd.assert_eq(got.reset_index(), expect.reset_index(), check_index=False)
```
Maybe sorting is non-deterministic for nulls here? In that case we may want to remove the `reset_index` calls and just test for equality on the values of the dataframe. Currently testing this out in #9264.
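A pandas-only sketch of what a value-based (index-free) comparison could look like; `assert_frame_equal` here is just a stand-in for `dd.assert_eq`, and the data is hypothetical:

```python
import pandas as pd

# Nulls sort last by default; the original row indices of null rows may
# come back in any order, so compare values only after dropping the index.
got = pd.DataFrame({"a": [None, 1.0, 2.0]}).sort_values("a")
expect = pd.DataFrame({"a": [1.0, 2.0, None]})

pd.testing.assert_frame_equal(
    got.reset_index(drop=True), expect.reset_index(drop=True)
)
print("values match")
```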
Sounds good to me. Looks like CI passed now too. Feel free to have this taken care of in this PR or in #9264; this should be good to go.
Yikes, my approval triggered a merge since there was already a merge comment. Can you take care of it in #9264?
Revert "Fix quantile division / partition handling for dask-cudf sort on null dataframes (#9259)" (#9438)

This reverts commit 5bcb3e8. Tests were intermittently failing on #9259 but it was erroneously merged.

cc @galipremsagar @quasiben

Authors:
- Charles Blackmon-Luca (https://github.com/charlesbluca)

Approvers:
- GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #9438
Closes #9255

Sorts the output of `quantile_divisions` for the multi-column case, as leaving it unsorted causes `sort_values` to output the incorrect order. Also fixes dask-cudf's null sorting test to actually check that the ordering is correct, which is surfacing another failure I'm currently resolving.