Enable backend dispatching for Dask-DataFrame creation #11920

Merged (27 commits) on Oct 20, 2022

Conversation

rjzamora
Member

@rjzamora rjzamora commented Oct 14, 2022

Description

This PR depends on dask/dask#9475 (Now Merged)

After dask#9475, external libraries are now able to implement (and expose) their own DataFrameBackendEntrypoint definitions to specify custom creation functions for DataFrame collections. This PR introduces the CudfBackendEntrypoint class to create dask_cudf.DataFrame collections using the dask.dataframe API. By installing dask_cudf with this entrypoint definition in place, you get the following behavior in dask.dataframe:

import dask.dataframe as dd
import dask

# Tell Dask that you want to create DataFrame collections
# with the "cudf" backend (for supported creation functions).
# This can also be used in a context, or set in a yaml file
dask.config.set({"dataframe.backend": "cudf"})

ddf = dd.from_dict({"a": range(10)}, npartitions=2)
type(ddf)  # dask_cudf.core.DataFrame

Note that the code snippet above does not require an explicit import of cudf or dask_cudf. The following creation functions will support backend dispatching after dask#9475:

  • from_dict
  • read_parquet
  • read_json
  • read_orc
  • read_csv
  • read_hdf

See also: dask/design-docs#1
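
For readers unfamiliar with the dispatching idea, here is a minimal, library-free sketch of the mechanism described above. It is not dask's actual implementation: `_config`, `_backends`, `register_backend`, `set_backend`, and both entrypoint classes are hypothetical stand-ins, and the returned tuples only mimic "a collection built by backend X".

```python
from contextlib import contextmanager

# Hypothetical stand-ins for a config store and a backend registry.
_config = {"dataframe.backend": "pandas"}
_backends = {}


def register_backend(name, entrypoint):
    """Record an entrypoint object under a backend name."""
    _backends[name] = entrypoint


@contextmanager
def set_backend(name):
    """Temporarily switch the active backend, restoring it on exit."""
    previous = _config["dataframe.backend"]
    _config["dataframe.backend"] = name
    try:
        yield
    finally:
        _config["dataframe.backend"] = previous


class PandasLikeEntrypoint:
    @staticmethod
    def from_dict(data, npartitions):
        return ("pandas-like", dict(data), npartitions)


class CudfLikeEntrypoint:
    @staticmethod
    def from_dict(data, npartitions):
        return ("cudf-like", dict(data), npartitions)


def from_dict(data, npartitions):
    """Public creation function: look up the active backend, then delegate."""
    backend = _backends[_config["dataframe.backend"]]
    return backend.from_dict(data, npartitions)


register_backend("pandas", PandasLikeEntrypoint)
register_backend("cudf", CudfLikeEntrypoint)
```

The caller only ever touches `from_dict` and the config value, which is why the snippet in the description needs no explicit `cudf` or `dask_cudf` import.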

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@rjzamora rjzamora added 2 - In Progress Currently a work in progress dask Dask issue improvement Improvement / enhancement to an existing function non-breaking Non-breaking change helps: Dask labels Oct 14, 2022
@github-actions github-actions bot added the Python Affects Python cuDF API. label Oct 14, 2022
@codecov

codecov bot commented Oct 14, 2022

Codecov Report

Base: 87.40% // Head: 88.13% // Increases project coverage by +0.72% 🎉

Coverage data is based on head (720e702) compared to base (f72c4ce).
Patch coverage: 89.92% of modified lines in pull request are covered.

❗ Current head 720e702 differs from pull request most recent head 3719558. Consider uploading reports for the commit 3719558 to get more accurate results

Additional details and impacted files
@@               Coverage Diff                @@
##           branch-22.12   #11920      +/-   ##
================================================
+ Coverage         87.40%   88.13%   +0.72%     
================================================
  Files               133      133              
  Lines             21833    21987     +154     
================================================
+ Hits              19084    19379     +295     
+ Misses             2749     2608     -141     
Impacted Files                                           Coverage Δ
python/cudf/cudf/core/dataframe.py                       93.77% <ø> (ø)
python/cudf/cudf/core/indexed_frame.py                   92.03% <ø> (ø)
python/cudf/cudf/core/udf/__init__.py                    97.05% <ø> (+47.05%) ⬆️
python/cudf/cudf/io/orc.py                               92.94% <ø> (-0.09%) ⬇️
python/cudf/cudf/testing/dataset_generator.py            72.83% <ø> (-0.42%) ⬇️
...thon/dask_cudf/dask_cudf/tests/test_distributed.py    18.86% <ø> (+4.94%) ⬆️
python/cudf/cudf/core/_base_index.py                     82.20% <43.75%> (-3.35%) ⬇️
python/cudf/cudf/io/text.py                              91.66% <66.66%> (-8.34%) ⬇️
python/strings_udf/strings_udf/__init__.py               84.31% <76.00%> (-12.57%) ⬇️
python/dask_cudf/dask_cudf/backends.py                   84.90% <82.92%> (-0.37%) ⬇️
... and 27 more

@rjzamora rjzamora marked this pull request as ready for review October 17, 2022 15:34
@rjzamora rjzamora requested a review from a team as a code owner October 17, 2022 15:34
Contributor

@galipremsagar galipremsagar left a comment

Overall this looks good, Rick. A few comments before this is ready to merge.

def from_dict(data, npartitions, orient="columns", **kwargs):
    if orient != "columns":
        raise ValueError(f"orient={orient} is not supported")
    return dd.from_pandas(
Contributor

Is from_pandas required here because we don't support cudf.DataFrame.from_dict API yet? If so, can we add a todo here to change this after #11934 is resolved?

Member Author

Is from_pandas required here because we don't support cudf.DataFrame.from_dict API yet?

Yes and no - We should certainly use cudf.from_dict when it is supported (I'll add a TODO). However, I'll change the dd.from_pandas code to dask_cudf.from_cudf for clarity (from_cudf is just a cudf-friendly alias for from_pandas).
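
To make the point about the alias concrete, here is a small sketch of the pattern under discussion: the `orient` guard stays the same, and only the delegate that builds the collection changes. Everything below is a hypothetical stand-in (`make_from_dict`, `fake_from_pandas`, `fake_from_cudf`); none of it is the dask or dask_cudf API.

```python
def make_from_dict(build_collection):
    """Return a from_dict that validates `orient`, then delegates."""

    def from_dict(data, npartitions, orient="columns", **kwargs):
        # Same guard as in the diff: only column orientation is handled.
        if orient != "columns":
            raise ValueError(f"orient={orient} is not supported")
        return build_collection(dict(data), npartitions=npartitions, **kwargs)

    return from_dict


def fake_from_pandas(frame, npartitions):
    """Stand-in for a function that builds a partitioned collection."""
    return {"frame": frame, "npartitions": npartitions}


# An alias is the same callable under a backend-friendly name,
# so swapping one for the other does not change behavior.
fake_from_cudf = fake_from_pandas

from_dict = make_from_dict(fake_from_cudf)
```

Calling `from_dict({"a": [1, 2]}, npartitions=2)` delegates to the stand-in builder, while `orient="index"` raises `ValueError` before any delegation happens.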

python/dask_cudf/dask_cudf/core.py (outdated, resolved)
python/dask_cudf/dask_cudf/core.py (outdated, resolved)
Contributor

@galipremsagar galipremsagar left a comment

Thanks @rjzamora !

Contributor

@wence- wence- left a comment

Does the entrypoint need to be added to setup.py/cfg or similar?

python/dask_cudf/dask_cudf/backends.py (outdated, resolved)
python/dask_cudf/dask_cudf/backends.py (resolved)
python/dask_cudf/dask_cudf/backends.py (resolved)
@rjzamora
Member Author

Does the entrypoint need to be added to setup.py/cfg or similar?

Correct - The entrypoint is defined in dask_cudf's setup.cfg file.

@wence-
Contributor

wence- commented Oct 18, 2022

Does the entrypoint need to be added to setup.py/cfg or similar?

Correct - The entrypoint is defined in dask_cudf's setup.cfg file.

Ah, somehow I missed that this had already been done back in March
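
For anyone who wants to check which backends an environment actually advertises, the standard library can list installed entry points. The sketch below assumes the entry-point group is named `dask.dataframe.backends`; that name is not stated in this thread, so treat it as an assumption and confirm it against dask_cudf's setup.cfg. `find_backends` is a hypothetical helper name.

```python
from importlib.metadata import entry_points


def find_backends(group="dask.dataframe.backends"):
    """Map entry-point names in `group` to their 'module:object' targets.

    The default group name is an assumption, not taken from this PR.
    """
    eps = entry_points()
    if hasattr(eps, "select"):
        # Newer Python versions expose a selectable collection.
        selected = eps.select(group=group)
    else:
        # Older versions return a plain dict keyed by group name.
        selected = eps.get(group, [])
    return {ep.name: ep.value for ep in selected}


if __name__ == "__main__":
    for name, target in sorted(find_backends().items()):
        print(f"{name} -> {target}")
```

If dask_cudf is installed with the entrypoint registered, a `cudf` key should appear in the result; an empty dict means nothing is registered under that group in the current environment.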

@rjzamora rjzamora marked this pull request as draft October 19, 2022 13:23
@rjzamora rjzamora marked this pull request as ready for review October 19, 2022 17:09
@rjzamora rjzamora added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 2 - In Progress Currently a work in progress labels Oct 20, 2022
@rjzamora
Member Author

@gpucibot merge

Labels
5 - Ready to Merge Testing and reviews complete, ready to merge dask Dask issue improvement Improvement / enhancement to an existing function non-breaking Non-breaking change Python Affects Python cuDF API.
3 participants