[FEA] Remove "All columns required to have same data type" requirement in .to_dlpack() method. #7123

miguelusque · 2021-01-12T12:25:09Z

Is your feature request related to a problem? Please describe.
Hi!

While trying to convert a cudf dataframe to a cupy array, I have noticed a different behaviour between two cudf methods:

_df_.as_gpu_matrix()
_df_.to_dlpack()

While the first method works well when the dataframe contains different numeric types (int64 and float64), the second method returns the following error:

RuntimeError: cuDF failure at: /opt/conda/envs/rapids/conda-bld/libcudf_1607613794356/work/cpp/src/interop/dlpack.cpp:199: All columns required to have same data type

The following code works:

import cudf
import cupy

df = cudf.DataFrame({'a': 3, 'b': 3.0})
arr = cupy.asarray(df.as_gpu_matrix())

The following code fails:

import cudf
import cupy

df = cudf.DataFrame({'a': 3, 'b': 3.0})
arr = cupy.fromDlpack(df.to_dlpack())

Hope it helps!
Miguel

Describe the solution you'd like
It would be great if _df_.to_dlpack() method could behave similarly to _df_.as_gpu_matrix() method.

Describe alternatives you've considered
Using _df_.as_gpu_matrix() method.

Additional context
Using RAPIDS 0.17 (the latest stable version) on Ubuntu 20.04.


harrism · 2021-01-13T01:48:39Z

This would require changing the C++ implementation, so I'd like to understand... what is the type of the data that as_gpu_matrix() returns in your example? Does it promote everything to float64? Does the user have control over the resulting data type?

This requires casts, so at the C++ level we would likely require the user to explicitly specify the output type, and reserve the right to error. We would still error if they try to do something impossible like convert a table with one column of ints and one column of strings (or lists of lists of strings!) into a tensor of floats.

miguelusque · 2021-01-13T13:07:25Z

Hi @harrism ,

In the example above, as_gpu_matrix() method casts everything to float64.

I understand that casting is needed, but it may be worth double-checking how as_gpu_matrix() is implemented, because I didn't need to specify an output type when invoking it.

An additional output type parameter would be nice, but I would make it optional. When it is not given, I would cast to the widest data type needed to hold all the data in the dataframe.

As for strings, I'd say there is no problem because, IIRC, only dataframes made up entirely of numeric series are allowed so far.
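To make the proposed default concrete, here is a minimal sketch of "optional output type, otherwise promote to a common type". It uses NumPy's promotion rules as a stand-in; `resolve_output_dtype` is a hypothetical helper for illustration, not part of cuDF, and cuDF's own promotion rules may differ in detail.

```python
import numpy as np


def resolve_output_dtype(column_dtypes, dtype=None):
    """Pick the dtype for a dense export of several numeric columns.

    Hypothetical helper for illustration only; not a cuDF API.
    """
    if dtype is not None:
        # The caller chose the output type explicitly.
        return np.dtype(dtype)
    # Otherwise fall back to NumPy's type promotion rules.
    return np.result_type(*column_dtypes)


# Mixed int64/float64 columns, as in the example at the top of this issue.
print(resolve_output_dtype([np.int64, np.float64]))
# An explicit request overrides the promoted type.
print(resolve_output_dtype([np.int64, np.float64], dtype="float32"))
```

With int64 and float64 columns the promoted type is float64, which matches the as_gpu_matrix() behaviour described above.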

jrhemstad · 2021-01-13T16:51:22Z

> This would require changing the C++ implementation, so I'd like to understand... what is the type of the data that as_gpu_matrix() returns in your example? Does it promote everything to float64? Does the user have control over the resulting data type?
>
> This requires casts, so at the C++ level we would likely require the user to explicitly specify the output type, and reserve the right to error. We would still error if they try to do something impossible like convert a table with one column of ints and one column of strings (or lists of lists of strings!) into a tensor of floats.

We should maintain the current C++ behavior of requiring everything to be the same type. It's the caller's responsibility (i.e., the cuDF Python wrappers) to cast all the columns to the desired type beforehand.

miguelusque · 2021-01-13T16:58:16Z

Hi @jrhemstad ,

Can I ask why we should maintain the current implementation? I mean, is there any technical reason why it couldn't be changed?

kkraus14 · 2021-01-13T17:02:38Z

@miguelusque I think the point thus far has been that the C++ function in libcudf shouldn't do any automatic casting as that's unexpected behavior from the perspective of C++.

On the Python side it could be argued that we should handle automatic casting for dlpack similar to what's done for as_gpu_matrix / .values.
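The split described here (a strict low-level export, with casting handled by the wrapper) can be sketched roughly as follows. NumPy arrays stand in for GPU columns; `strict_export` and `export_with_casting` are hypothetical names invented for this sketch and are not the libcudf or cuDF implementation.

```python
import numpy as np


def strict_export(columns):
    """Stand-in for the strict low-level routine: no implicit casting."""
    if len({col.dtype for col in columns}) > 1:
        raise RuntimeError("All columns required to have same data type")
    # Place the 1-D columns side by side in a 2-D array.
    return np.column_stack(columns)


def export_with_casting(columns):
    """Hypothetical wrapper: cast only where needed, then export strictly."""
    for col in columns:
        if not np.issubdtype(col.dtype, np.number):
            # Strings and other non-numeric columns cannot become a tensor.
            raise TypeError(f"cannot export non-numeric dtype {col.dtype}")
    common = np.result_type(*(col.dtype for col in columns))
    casted = [col if col.dtype == common else col.astype(common) for col in columns]
    return strict_export(casted)


a = np.array([3], dtype=np.int64)
b = np.array([3.0], dtype=np.float64)
matrix = export_with_casting([a, b])
print(matrix.dtype, matrix.shape)
```

The strict routine keeps its "same data type" error, and only the wrapper decides when a cast is acceptable, so a column of strings still fails instead of being converted silently.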

miguelusque · 2021-01-13T17:43:26Z

Hi @kkraus14,

Sure! My only concern is the performance penalty of performing the casting on the Python side instead of the C++ side (I do not have any data to support my concern).

I mean, in the worst case the user will need to cast their data anyway, so the faster it can be done, the better.

On the Python side, df.astype(cp.float64).to_dlpack() can always be used as a workaround, but I do not know the performance implications.

Nevertheless, thank you for considering this feature request!

harrism · 2021-01-13T22:13:24Z

The actual casting would be done on the C++ side, just explicitly invoked from the Python implementation on an as-needed basis.

github-actions · 2021-02-16T20:19:31Z

This issue has been marked stale due to no recent activity in the past 30d. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be marked rotten if there is no activity in the next 60d.

miguelusque · 2021-02-20T11:16:25Z

I think this feature request is still relevant.

github-actions · 2021-03-22T12:23:37Z

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

miguelusque · 2021-03-22T13:06:54Z

I think this issue is still relevant.

github-actions · 2021-04-21T14:06:00Z

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

miguelusque · 2021-05-01T10:24:49Z

I think this feature request is still relevant.

Resolves: #7123

This PR adds a common dtype casting as requested here: #7123 (comment)

Authors:
- GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
- https://github.com/brandon-b-miller
- Michael Wang (https://github.com/isVoid)

URL: #9585

miguelusque added the Needs Triage and feature request labels Jan 12, 2021

kkraus14 added the Python label and removed the Needs Triage label Jan 13, 2021

github-actions bot added the stale label Feb 16, 2021

github-actions bot removed the inactive-30d label Feb 20, 2021

github-actions bot added the inactive-30d label Mar 22, 2021

github-actions bot removed the inactive-30d label Mar 22, 2021

github-actions bot added the inactive-30d label Apr 21, 2021

github-actions bot removed the inactive-30d label May 6, 2021

beckernick added this to the Tabular Data for Deep Learning milestone Jul 29, 2021

galipremsagar self-assigned this Nov 3, 2021

galipremsagar mentioned this issue Nov 3, 2021

[REVIEW] Add handling of mixed numeric types in to_dlpack #9585

Merged

rapids-bot bot closed this as completed in #9585 Nov 5, 2021
