Migrate dask-cudf CudfEngine to leverage ArrowDatasetEngine #8871

rjzamora · 2021-07-27T20:29:18Z

Addresses the impending deprecation of the ArrowLegacyEngine (which dask-cudf currently depends on) by migrating the CudfEngine backend to the newer ArrowDatasetEngine.

TODO:

~~Benchmark/check for any (significant) performance regressions~~ (EDIT: pyarrow-deprecations in pyarrow-5 make this migration necessary IMO)

codecov · 2021-07-30T22:09:39Z

Codecov Report

❗ No coverage uploaded for pull request base (branch-21.10@8e9f0aa).
The diff coverage is n/a.

❗ Current head b1226f6 differs from pull request most recent head d672ab5. Consider uploading reports for the commit d672ab5 to get more accurate results.

@@               Coverage Diff               @@
##             branch-21.10    #8871   +/-   ##
===============================================
  Coverage                ?   10.62%           
===============================================
  Files                   ?      116           
  Lines                   ?    18683           
  Branches                ?        0           
===============================================
  Hits                    ?     1986           
  Misses                  ?    16697           
  Partials                ?        0

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 8e9f0aa...d672ab5.

quasiben · 2021-08-10T01:39:22Z

python/dask_cudf/dask_cudf/io/parquet.py

-                    size=col.size,
-                    offset=col.offset,
-                    ordered=False,
+


A lot nicer with supported categorical column creation!

I just realized that the new change here was a bit inefficient. We were creating a column with the partition-based category repeated in every element and then converting it to a categorical column. It makes a bit more sense to repeat the index of the partition-based category in every element and build the categorical column directly.
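
For readers following along, here is a rough CPU-side sketch of the two approaches described above. It uses pandas and NumPy as a stand-in, not the cuDF column API that this PR actually touches, and the function names, category list, and row count are all hypothetical. It only illustrates the idea of repeating the integer code instead of the category value.

```python
import numpy as np
import pandas as pd


def partition_column_from_values(value, categories, nrows):
    """Slower route: repeat the category value, then let pandas encode it."""
    repeated = np.full(nrows, value, dtype=object)
    return pd.Categorical(repeated, categories=categories)


def partition_column_from_codes(value, categories, nrows):
    """Cheaper route: repeat the integer code and build the categorical directly."""
    code = list(categories).index(value)  # position of this partition's category
    codes = np.full(nrows, code, dtype="int32")
    return pd.Categorical.from_codes(codes, categories=categories)


# Hypothetical partition values for a directory-partitioned dataset.
categories = ["2019", "2020", "2021"]
slow = partition_column_from_values("2020", categories, nrows=4)
fast = partition_column_from_codes("2020", categories, nrows=4)
```

Both routes should produce the same categorical column; the second skips the value-to-code encoding pass, which is the saving the comment above refers to.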

python/dask_cudf/dask_cudf/io/tests/test_parquet.py

quasiben · 2021-08-10T01:40:59Z

@rjzamora left one small comment (small typo). I'll approve now though, as I might not be available tomorrow morning.

Co-authored-by: Benjamin Zaitlen <quasiben@users.noreply.github.com>

rjzamora · 2021-08-10T15:16:54Z

Thanks for the review, @quasiben!

Note that I did revise a few small things since you approved the PR yesterday.

galipremsagar · 2021-08-13T08:13:30Z

rerun tests

galipremsagar · 2021-08-13T13:38:59Z

@gpucibot merge

rjzamora added 13 commits April 21, 2021 11:25

save possible changes to enable multi-file parquet

52a643c

Merge remote-tracking branch 'upstream/branch-21.06' into multi-file-parquet

3b3dadb

align with latest upstream PR

54644e1

trigger format check

c5e660f

save some possible work aimed at uisng ParquetDatasetEngine

e8cbc26

Merge remote-tracking branch 'upstream/branch-21.08' into multi-file-parquet

2ab6838

Merge remote-tracking branch 'origin/multi-file-parquet' into migrate-parquet-backend

3079fff

add test coverage

d67f711

trigger format

a175daf

Merge branch 'multi-file-parquet' into migrate-parquet-backend

14038e9

Merge remote-tracking branch 'upstream/branch-21.08' into migrate-parquet-backend

2407b89

remove commented code

648c3ce

remove commented code

494d585

rjzamora added the 2 - In Progress, dask, improvement, and non-breaking labels Jul 27, 2021

rjzamora self-assigned this Jul 27, 2021

github-actions bot added the Python label Jul 27, 2021

rjzamora added 4 commits July 27, 2021 15:31

Merge branch 'branch-21.10' into migrate-parquet-backend

2ccc046

trigger formatting

07ed3c9

try reformatting

5a918d8

Merge remote-tracking branch 'upstream/branch-21.10' into migrate-parquet-backend

f6a4ad6

rjzamora mentioned this pull request Jul 30, 2021

pyarrow=5.0.0 compatiblity dask/dask#7961

Closed

rjzamora marked this pull request as ready for review July 30, 2021 14:56

rjzamora requested a review from a team as a code owner July 30, 2021 14:56

rjzamora mentioned this pull request Jul 30, 2021

[REVIEW] Upgrade arrow & pyarrow to 5.0.0 #8908

Merged

2 tasks

rjzamora added 2 commits July 30, 2021 10:57

Merge remote-tracking branch 'upstream/branch-21.10' into migrate-parquet-backend

1473f36

test tweak

3602394

quasiben reviewed Aug 10, 2021

python/dask_cudf/dask_cudf/io/tests/test_parquet.py (outdated, resolved)

quasiben approved these changes Aug 10, 2021

rjzamora and others added 3 commits August 9, 2021 21:45

Update python/dask_cudf/dask_cudf/io/tests/test_parquet.py

1db936a

Co-authored-by: Benjamin Zaitlen <quasiben@users.noreply.github.com>

build cat column in a more efficient way

675c314

make cat column creation more efficient and fix schema-missmatch test

d672ab5

galipremsagar added the 5 - Ready to Merge label and removed the 2 - In Progress label Aug 13, 2021

galipremsagar approved these changes Aug 13, 2021

rapids-bot bot merged commit 3be1c4c into rapidsai:branch-21.10 Aug 13, 2021
