[BUG] Unstack function changes column names to numeric format #7365

Closed
BenikaHall opened this issue Feb 10, 2021 · 7 comments · Fixed by #8560
Labels: bug (Something isn't working), good first issue (Good for newcomers), Python (Affects Python cuDF API.)

Comments

@BenikaHall

BenikaHall commented Feb 10, 2021

Describe the bug
After applying the unstack function, the variable names change to numeric format.

Steps/Code to reproduce bug

import cudf
import cupy

def get_df(length, num_cols, num_months, acc_offset):
    cols = [ 'var_{}'.format(i) for i in range(num_cols)]
    df = cudf.DataFrame({col: cupy.random.rand(length * num_months) for col in cols})
    df['acc_id'] = cupy.repeat(cupy.arange(length), num_months) + acc_offset
    months = cupy.repeat(cupy.arange(length), num_months, axis=0).reshape(length, num_months)
    cupy.random.shuffle(months)
    df['month_id'] = months.T.flatten()
    return df

num_cols = 10
acc_len = 20
num_partitions = 4
num_months = 24

df = get_df(acc_len, num_cols, num_months, 0)

cols = [ 'var_{}'.format(i) for i in range(num_cols)]
unpivot = cudf.melt(df, id_vars=['acc_id','month_id'], value_vars=cols, var_name='name')

sorted_df = unpivot.sort_values(['acc_id', 'month_id'])

sorted_df.set_index(['acc_id', 'month_id', 'name']).unstack('name')

Expected behavior
After applying the unstack function, the column names should be preserved and not changed to 0,1,2,3,4,....
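
For reference, the expected behavior can be sketched with pandas on a tiny long-format frame. This is only an illustration of the target output, not the reporter's data: the frame `long_df` and its values are made up here, and the `name` column is cast to a categorical explicitly to mirror the dtype that triggers the problem.

```python
import pandas as pd

# Small long-format frame standing in for the melted data in the report
# (hypothetical values; "name" is categorical on purpose).
long_df = pd.DataFrame(
    {
        "acc_id": [0, 0, 1, 1],
        "month_id": [0, 0, 0, 0],
        "name": pd.Categorical(["var_0", "var_1", "var_0", "var_1"]),
        "value": [0.1, 0.2, 0.3, 0.4],
    }
)

wide = long_df.set_index(["acc_id", "month_id", "name"]).unstack("name")

# The innermost column level carries the category labels, not integer codes.
labels = list(wide.columns.get_level_values("name"))
print(labels)  # ['var_0', 'var_1']
```

With pandas the unstacked column level keeps `var_0`, `var_1`, which is the outcome the report expects from cuDF as well.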

Environment overview (please complete the following information)

  • Environment location: Docker
  • Method of cuDF install: Rapids 0.17 container from NGC
    • If method of install is [Docker], provide docker pull & docker run commands used


@BenikaHall BenikaHall added the Needs Triage and bug labels Feb 10, 2021
@BenikaHall
Author

It appears that the dtype of the column we're trying to unstack is "category". It may be worth investigating why unstack changes the variable names for "category" dtypes. As a workaround, I cast the column we wanted to unstack to "object" with astype, and the names were then preserved.

Adjusting the code shown in the initial comment, it now works as follows.

import cudf
import cupy

def get_df(length, num_cols, num_months, acc_offset):
    cols = [ 'var_{}'.format(i) for i in range(num_cols)]
    df = cudf.DataFrame({col: cupy.random.rand(length * num_months) for col in cols})
    df['acc_id'] = cupy.repeat(cupy.arange(length), num_months) + acc_offset
    months = cupy.repeat(cupy.arange(length), num_months, axis=0).reshape(length, num_months)
    cupy.random.shuffle(months)
    df['month_id'] = months.T.flatten()
    return df

num_cols = 10
acc_len = 20
num_partitions = 4
num_months = 24

df = get_df(acc_len, num_cols, num_months, 0)

cols = [ 'var_{}'.format(i) for i in range(num_cols)]
unpivot = cudf.melt(df, id_vars=['acc_id','month_id'], value_vars=cols, var_name='name')

sorted_df = unpivot.sort_values(['acc_id', 'month_id'])

sorted_df['name'] = sorted_df['name'].astype('object')
sorted_df.set_index(['acc_id', 'month_id', 'name']).unstack('name')
sorted_df

@BenikaHall BenikaHall changed the title from "[BUG]" to "[BUG] Unstack function changes column names to numeric format" Feb 10, 2021
@shwina
Contributor

shwina commented Feb 11, 2021

Thanks for reporting. I hope you don't mind that I reopened this issue, because this certainly sounds like there's still a bug somewhere. One shouldn't have to convert categorical data to object (i.e. strings in cuDF) for unstack to work correctly.

@shwina shwina reopened this Feb 11, 2021
@shwina
Contributor

shwina commented Feb 11, 2021

I was able to track this down to a bug in the Frame._encode function:

def _encode(self):
    keys, indices = libcudf.transform.table_encode(self)
    keys = self.__class__._from_table(keys)
    return keys, indices

Before returning keys, we should call keys._copy_type_metadata(self) to copy the category information from self onto keys.
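
The effect of the missing metadata copy can be illustrated without a GPU. The sketch below is a toy model, not cuDF internals: `ToyColumn`, `encode_without_metadata` and `encode_with_metadata` are hypothetical names invented for this example. It assumes a categorical column is stored as integer codes plus a list of categories, so keys built from the codes alone display as bare integers until the categories are reattached.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToyColumn:
    """Hypothetical stand-in for a column: integer codes plus optional categories."""
    codes: List[int]
    categories: Optional[List[str]] = None  # None means a plain integer column

    def labels(self):
        # Without categories, the raw codes are all a caller can see.
        if self.categories is None:
            return list(self.codes)
        return [self.categories[c] for c in self.codes]


def encode_without_metadata(col):
    """Return (keys, indices); the keys lose the categories, mimicking the bug."""
    unique = sorted(set(col.codes))
    position = {code: i for i, code in enumerate(unique)}
    indices = [position[c] for c in col.codes]
    return ToyColumn(codes=unique), indices


def encode_with_metadata(col):
    """Same encoding, then copy the category information onto the keys."""
    keys, indices = encode_without_metadata(col)
    keys.categories = col.categories  # plays the role of the metadata copy
    return keys, indices


name = ToyColumn(codes=[0, 1, 2, 0, 1, 2], categories=["var_0", "var_1", "var_2"])
buggy_keys, _ = encode_without_metadata(name)
fixed_keys, indices = encode_with_metadata(name)

print(buggy_keys.labels())  # [0, 1, 2]
print(fixed_keys.labels())  # ['var_0', 'var_1', 'var_2']
```

In this model the only difference between the two paths is the final assignment, which is the same shape as the one-line change proposed above.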

@shwina shwina added the Python label and removed the Needs Triage label Feb 11, 2021
@BenikaHall
Author

That's very true. Thank you for reopening @shwina. I'll follow up on this as well.

@shwina
Contributor

shwina commented Feb 11, 2021

@BenikaHall - if you (or anyone else) would like to contribute a fix here, it should be relatively easy 🙂

In addition to the fix above, we'll need a couple of tests for this case in test_reshape.py.

@shwina shwina added the good first issue label Feb 11, 2021
@BenikaHall
Author

BenikaHall commented Feb 11, 2021

I'll see if I have some time next week to get a few tests done and a PR submitted. Thanks again @shwina.

@esnvidia

esnvidia commented Apr 20, 2021

Slightly related - calling unstack on a Series:

import cudf
import numpy as np
gdf = cudf.DataFrame(data=[[i for i in range(1990,2021) for j in range(500)],
                     np.random.randint(0,3, size=(2021-1990)*500)]).T
gdf.columns=['year', 'use chip']
gdf[['year', 'use chip']].groupby(['year', 'use chip']).size().unstack(fill_value=0) # Fails
# AttributeError: 'Series' object has no attribute 'unstack'
gdf[['year', 'use chip']].groupby(['year', 'use chip']).size().to_pandas().unstack(fill_value=0)  # success
# cudf.__version__ --> 0.18.1

Working around this by calling to_frame() instead of to_pandas() produces MultiIndex columns:

gdf[['year', 'use chip']].groupby(['year', 'use chip']).size().to_frame().unstack().columns
MultiIndex([(0, 0),
            (0, 1),
            (0, 2)],
           names=[None, 'use chip'])
gdf[['year', 'use chip']].groupby(['year', 'use chip']).size().to_pandas().unstack().columns
Int64Index([0, 1, 2], dtype='int64', name='use chip')
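
The extra column level comes from to_frame(), which wraps the unnamed Series in a single column called 0. A minimal sketch of flattening it again is shown below using pandas only, with small deterministic data made up for the example. Whether the same `droplevel` call is available on a cuDF MultiIndex in 0.18.1 is an assumption that this thread does not confirm.

```python
import numpy as np
import pandas as pd

# Deterministic stand-in for the random data above (hypothetical values).
years = np.repeat(np.arange(1990, 1993), 50)
pdf = pd.DataFrame({"year": years, "use chip": np.arange(years.size) % 3})

counts = pdf.groupby(["year", "use chip"]).size()

# to_frame() adds an outer column level (the unnamed column 0) above "use chip".
wide = counts.to_frame().unstack(fill_value=0)

# Dropping that outer level leaves single-level columns named "use chip".
wide.columns = wide.columns.droplevel(0)

print(wide.columns.nlevels)  # 1
```

After the drop, the columns match what calling unstack directly on the pandas Series produces.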

@charlesbluca charlesbluca self-assigned this Jun 18, 2021
rapids-bot bot pushed a commit that referenced this issue Jul 16, 2021
Fixes #7365 

Applies column metadata to the output columns of `keys` in `Frame._encode`; skipping this step meant that the output of `DataFrame.unstack` would not have the expected metadata for index columns:

```python
import pandas as pd
import cudf

pdf = pd.DataFrame(
    {
        "foo": ["one", "one", "one", "two", "two", "two"],
        "bar": pd.Categorical(["A", "B", "C", "A", "B", "C"]),
        "baz": [1, 2, 3, 4, 5, 6],
        "zoo": ["x", "y", "z", "q", "w", "t"],
    }).set_index(["foo", "bar", "baz"])
gdf = cudf.from_pandas(pdf)

pdf.unstack("baz")
         zoo                         
baz        1    2    3    4    5    6
foo bar                              
one A      x  NaN  NaN  NaN  NaN  NaN
    B    NaN    y  NaN  NaN  NaN  NaN
    C    NaN  NaN    z  NaN  NaN  NaN
two A    NaN  NaN  NaN    q  NaN  NaN
    B    NaN  NaN  NaN  NaN    w  NaN
    C    NaN  NaN  NaN  NaN  NaN    t

gdf.unstack("baz")
          zoo                              
baz         1     2     3     4     5     6
foo bar                                    
one 0       x  <NA>  <NA>  <NA>  <NA>  <NA>
    1    <NA>     y  <NA>  <NA>  <NA>  <NA>
    2    <NA>  <NA>     z  <NA>  <NA>  <NA>
two 0    <NA>  <NA>  <NA>     q  <NA>  <NA>
    1    <NA>  <NA>  <NA>  <NA>     w  <NA>
    2    <NA>  <NA>  <NA>  <NA>  <NA>     t
```
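
The integers 0, 1, 2 in the `bar` level of the cuDF output line up with the integer codes that back the categorical, which can be checked with pandas alone. This is an illustrative aside and assumes cuDF uses the same code ordering as pandas for this input.

```python
import pandas as pd

bar = pd.Categorical(["A", "B", "C", "A", "B", "C"])

# Each value is stored as an integer code pointing into `categories`.
codes = bar.codes.tolist()
categories = bar.categories.tolist()

print(codes)       # [0, 1, 2, 0, 1, 2]
print(categories)  # ['A', 'B', 'C']
```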

Authors:
  - Charles Blackmon-Luca (https://github.com/charlesbluca)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #8560