Migrate dask-cudf CudfEngine to leverage ArrowDatasetEngine #8871

Merged: 22 commits, Aug 13, 2021

Conversation

@rjzamora (Member) commented Jul 27, 2021

Closes #8656

Addresses the impending deprecation of the ArrowLegacyEngine (which dask-cudf currently depends on), by migrating the CudfEngine backend to the newer ArrowDatasetEngine.

TODO:

@rjzamora added labels Jul 27, 2021: 2 - In Progress (Currently a work in progress), dask (Dask issue), improvement (Improvement / enhancement to an existing function), non-breaking (Non-breaking change)
@rjzamora self-assigned this Jul 27, 2021
@github-actions bot added the Python (Affects Python cuDF API.) label Jul 27, 2021
@rjzamora marked this pull request as ready for review July 30, 2021 14:56
@rjzamora requested a review from a team as a code owner July 30, 2021 14:56
codecov bot commented Jul 30, 2021

Codecov Report

❗ No coverage uploaded for pull request base (branch-21.10@8e9f0aa).
The diff coverage is n/a.

❗ Current head b1226f6 differs from pull request most recent head d672ab5. Consider uploading reports for the commit d672ab5 to get more accurate results

@@               Coverage Diff               @@
##             branch-21.10    #8871   +/-   ##
===============================================
  Coverage                ?   10.62%           
===============================================
  Files                   ?      116           
  Lines                   ?    18683           
  Branches                ?        0           
===============================================
  Hits                    ?     1986           
  Misses                  ?    16697           
  Partials                ?        0           

size=col.size,
offset=col.offset,
ordered=False,

Member

A lot nicer with supported categorical column creation!

Member Author

I just realized that the new change here was a bit inefficient. We were creating a column with the partition-based category repeated in every element, and then converting it to a categorical column. It makes more sense to repeat the index of the partition-based category in every element and build the categorical column directly.
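
The difference between the two approaches can be sketched in plain Python. This is only an illustration of the idea, not the cuDF code from this PR: the function names and the (codes, categories) representation are hypothetical stand-ins for a dictionary-encoded categorical column.

```python
from array import array


def partition_column_via_values(categories, partition_value, n_rows):
    """Less direct route: repeat the category value in every row, then encode.

    Hypothetical helper for illustration only; not part of cuDF.
    """
    values = [partition_value] * n_rows  # one copy of the value per row
    lookup = {cat: i for i, cat in enumerate(categories)}
    codes = array("i", (lookup[v] for v in values))  # per-row lookup
    return codes, list(categories)


def partition_column_via_codes(categories, partition_value, n_rows):
    """Direct route: find the category's index once and repeat that integer.

    Hypothetical helper for illustration only; not part of cuDF.
    """
    code = categories.index(partition_value)  # single lookup
    codes = array("i", [code]) * n_rows  # repeat the integer code
    return codes, list(categories)


categories = ["2019", "2020", "2021"]  # assumed partition values
via_values = partition_column_via_values(categories, "2020", 4)
via_codes = partition_column_via_codes(categories, "2020", 4)
assert via_values == via_codes  # same column, less intermediate work
```

Both routes yield the same codes and categories; the second skips materializing the repeated values and the conversion step.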

@quasiben (Member)

@rjzamora I left one small comment (a small typo). I'll approve now, though, as I might not be available tomorrow morning.

@rjzamora (Member Author)

Thanks for the review, @quasiben!

Note that I did revise a few small things since you approved the PR yesterday.

@galipremsagar (Contributor)

rerun tests

@galipremsagar added the 5 - Ready to Merge (Testing and reviews complete, ready to merge) label and removed the 2 - In Progress (Currently a work in progress) label Aug 13, 2021
@galipremsagar (Contributor)

@gpucibot merge

@rapids-bot bot merged commit 3be1c4c into rapidsai:branch-21.10 Aug 13, 2021
Successfully merging this pull request may close these issues.

[FEA] Adopt ArrowDatasetEngine for Dask-cuDF read_parquet
3 participants