Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Remove "All columns required to have same data type" requirement in .to_dlpack() method. #7123

Closed
miguelusque opened this issue Jan 12, 2021 · 13 comments · Fixed by #9585
Assignees
Labels
feature request New feature or request Python Affects Python cuDF API.

Comments

@miguelusque
Copy link
Member

Is your feature request related to a problem? Please describe.
Hi!

While trying to convert a cudf dataframe to a cupy array, I have noticed a different behaviour between two cudf methods:

  • _df_.as_gpu_matrix()
  • _df_.to_dlpack()

While the first method works well when the dataframe contains different numeric types (int64 and float64), the second method returns the following error:

RuntimeError: cuDF failure at: /opt/conda/envs/rapids/conda-bld/libcudf_1607613794356/work/cpp/src/interop/dlpack.cpp:199: All columns required to have same data type

The following code works:

import cudf
import cupy

df = cudf.DataFrame({'a': 3, 'b': 3.0})
arr = cupy.asarray(df.as_gpu_matrix())

The following code fails:

import cudf
import cupy

df = cudf.DataFrame({'a': 3, 'b': 3.0})
arr = cupy.fromDlpack(df.to_dlpack())

Hope it helps!
Miguel

Describe the solution you'd like
It would be great if _df_.to_dlpack() method could behave similarly to _df_.as_gpu_matrix() method.

Describe alternatives you've considered
Using _df_.as_gpu_matrix() method.

Additional context
Using RAPIDS 0.17, the latest stable version on Ubuntu 20.04.

@miguelusque miguelusque added Needs Triage Need team to review and classify feature request New feature or request labels Jan 12, 2021
@harrism
Copy link
Member

harrism commented Jan 13, 2021

This would require changing the C++ implementation, so I'd like to understand... what is the type of the data that as_gpu_matrix() returns in your example? Does it promote everything to float64? Does the user have control over the resulting data type?

This requires casts, so at the C++ level we would likely require the user to explicitly specify the output type, and reserve the right to error. We would still error if they try to do something impossible like convert a table with one column of ints and one column of strings (or lists of lists of strings!) into a tensor of floats.

@miguelusque
Copy link
Member Author

miguelusque commented Jan 13, 2021

Hi @harrism ,

In the example above, as_gpu_matrix() method casts everything to float64.

I understand casting is needed, but maybe it is well-worth to double-check how as_gpu_matrix() is implemented because I didn´t need to specify any output type when invoking it.

An additional output type parameter would be nice. Nevertheless, I would make it optional. When not present, I would cast to the bigger datatype which is able to store the data in the dataframe.

About the strings, I´d say there is no problem because, IIRC, dataframes with only numeric series are allowed so far.

@jrhemstad
Copy link
Contributor

This would require changing the C++ implementation, so I'd like to understand... what is the type of the data that as_gpu_matrix() returns in your example? Does it promote everything to float64? Does the user have control over the resulting data type?

This requires casts, so at the C++ level we would likely require the user to explicitly specify the output type, and reserve the right to error. We would still error if they try to do something impossible like convert a table with one column of ints and one column of strings (or lists of lists of strings!) into a tensor of floats.

We should maintain the current C++ behavior of requiring everything to be the same type. It's the callers responsibility (i.e., the cuDF Python wrappers) to cast all the columns to the desired type beforehand.

@miguelusque
Copy link
Member Author

miguelusque commented Jan 13, 2021

Hi @jrhemstad ,

Can I ask why we should maintain current implementation? I mean, is there any technical reason why it couldn’t be changed?

@kkraus14
Copy link
Collaborator

@miguelusque I think the point thus far has been that the C++ function in libcudf shouldn't do any automatic casting as that's unexpected behavior from the perspective of C++.

On the Python side it could be argued that we should handle automatic casting for dlpack similar to what's done for as_gpu_matrix / .values.

@kkraus14 kkraus14 added Python Affects Python cuDF API. and removed Needs Triage Need team to review and classify labels Jan 13, 2021
@miguelusque
Copy link
Member Author

miguelusque commented Jan 13, 2021

HI @kkraus14 ,

Sure! My only concern is the performance penalty by performing the casting in the Python side instead of the c++ side (I do not have any data to support my concern).

I mean, in the worst scenario, the user will need to cast his data, so the fastest it could be done, the better.

In the Python side, it can always be invoked df.astype(cp.float64).to_dlpack() as a workaround, but I do not know the performance implications.

Nevertheless, thank you for considering this feature request!

@harrism
Copy link
Member

harrism commented Jan 13, 2021

The actual casting would be done on the C++ side, just explicitly invoked from the Python implementation on an as-needed basis.

@github-actions
Copy link

This issue has been marked stale due to no recent activity in the past 30d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be marked rotten if there is no activity in the next 60d.

@github-actions github-actions bot added the stale label Feb 16, 2021
@miguelusque
Copy link
Member Author

I think this feature request is still relevant.

@github-actions
Copy link

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@miguelusque
Copy link
Member Author

I think this issue is still relevant.

@github-actions
Copy link

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@miguelusque
Copy link
Member Author

I think this feature request is still relevant.

rapids-bot bot pushed a commit that referenced this issue Nov 5, 2021
Resolves: #7123

This PR adds a common dtype casting as requested here: #7123 (comment)

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - https://github.com/brandon-b-miller
  - Michael Wang (https://github.com/isVoid)

URL: #9585
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants