Dask-cuDF cumulative groupby ops #10889
Should we also be testing on series groupbys here?
So I added `Series` tests here and encountered what I think might be a bug in upstream dask. Here's a reproducer with no cuDF, modeled off of these tests:

It's a little hard for me to reason about what the result "should" be here (we're aggregating one column of a dataframe groupby and taking... the cumulative sum of that?) but the above nets me different results for the last two lines. What do you think the best thing to do here is? I could file an issue and solve it before merging this, I could xfail this, etc.
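For orientation, here is a minimal pandas-only sketch of what a cumulative sum on a series groupby computes. It is not the reproducer from the comment above and does not show the Dask behaviour under discussion; the frame and the column names `a` and `b` are made up for illustration.

```python
import pandas as pd

# Hypothetical data; the column names are illustrative only.
df = pd.DataFrame({"a": [1, 1, 2, 1, 2], "b": [10, 20, 30, 40, 50]})

# Series groupby: select one column of the DataFrame groupby, then take the
# running sum within each group. Rows keep their original order.
series_result = df.groupby("a")["b"].cumsum()

# DataFrame groupby: take the running sum of every non-key column, then
# select the column afterwards.
frame_result = df.groupby("a").cumsum()["b"]

print(series_result.tolist())  # [10, 30, 30, 70, 80]
print(frame_result.tolist())   # [10, 30, 30, 70, 80]
```

In pandas both spellings give the same values, which is one plausible baseline for what a distributed implementation would be expected to match.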
Thanks for catching this! IMO I would add the tests and xfail them; this is what I've done for other tests that would otherwise fail here due to upstream Dask issues, for example:
cudf/python/dask_cudf/dask_cudf/tests/test_groupby.py
Lines 288 to 294 in 97adac5
raised dask/dask#9313
Nitpicky, but we might want to keep this as `SUPPORTED_AGGS` to make sure we don't eventually mess something up with support for new aggregations down the line:
Actually, disregard this; I am forgetting that `_aggs_supported` really only needs to be tested for different groupby agg structures 😅 I think that a reasonable way to check that all aggregations are actually "supported" (i.e. use dask-cudf's groupby codepath) is to add the layer check I proposed in #10853.