-
Notifications
You must be signed in to change notification settings - Fork 891
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Migrate dask-cudf CudfEngine to leverage ArrowDatasetEngine #8871
Migrate dask-cudf CudfEngine to leverage ArrowDatasetEngine #8871
Conversation
Codecov Report
@@ Coverage Diff @@
## branch-21.10 #8871 +/- ##
===============================================
Coverage ? 10.62%
===============================================
Files ? 116
Lines ? 18683
Branches ? 0
===============================================
Hits ? 1986
Misses ? 16697
Partials ? 0 Continue to review full report at Codecov.
|
size=col.size, | ||
offset=col.offset, | ||
ordered=False, | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A lot nicer with supported categorical column creation!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just realized that the new change here was a bit inefficent. We were creating a column with the partition-based category repeated in every element, and then converting it to a categorical column. It makes a bit more sense to repeat the index of the partition-based category in every element, and build the categorical column directly.
@rjzamora left one small comment (small typo) . I'll approve now though as i might not be available tomorrow morning |
Co-authored-by: Benjamin Zaitlen <quasiben@users.noreply.github.com>
Thanks for the review @quasiben ! Note that I did revise a few small things since you approved the PR yesterday. |
rerun tests |
@gpucibot merge |
Closes #8656
Addresses the impending deprecation of the ArrowLegacyEngine (which dask-cudf currently depends on), by migrating the
CudfEngine
backend to the newerArrowDatasetEngine
.TODO:
Benchmark/check for any (significant) performance regressions(EDIT: pyarrow-deprecations in pyarrow-5 make this migration necessary IMO)