[BUG] Unstack function changes column names to numeric format #7365

BenikaHall · 2021-02-10T18:36:37Z

Describe the bug
After applying the unstack function, the column labels taken from the unstacked level are replaced by integers (0, 1, 2, ...) instead of the original variable names.

Steps/Code to reproduce bug

import cudf
import cupy

def get_df(length, num_cols, num_months, acc_offset):
    cols = [ 'var_{}'.format(i) for i in range(num_cols)]
    df = cudf.DataFrame({col: cupy.random.rand(length * num_months) for col in cols})
    df['acc_id'] = cupy.repeat(cupy.arange(length), num_months) + acc_offset
    months = cupy.repeat(cupy.arange(length), num_months, axis=0).reshape(length, num_months)
    cupy.random.shuffle(months)
    df['month_id'] = months.T.flatten()
    return df

num_cols = 10
acc_len = 20
num_partitions = 4
num_months = 24

df = get_df(acc_len, num_cols, num_months, 0)

cols = [ 'var_{}'.format(i) for i in range(num_cols)]
unpivot = cudf.melt(df, id_vars=['acc_id','month_id'], value_vars=cols, var_name='name')

sorted_df = unpivot.sort_values(['acc_id', 'month_id'])

sorted_df.set_index(['acc_id', 'month_id', 'name']).unstack('name')

Expected behavior
After applying the unstack function, the column names should be preserved and not changed to 0,1,2,3,4,....
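
For reference, here is a minimal sketch of the expected behavior using pandas rather than cuDF, on the assumption that pandas semantics are the target. The tiny frame and its `value` column are made up for illustration. It only shows that a categorical index level keeps its labels after unstack.

```python
import pandas as pd

# Small long-format frame with a categorical "name" column (illustrative data).
long_df = pd.DataFrame(
    {
        "acc_id": [0, 0, 1, 1],
        "name": pd.Categorical(["var_0", "var_1", "var_0", "var_1"]),
        "value": [0.1, 0.2, 0.3, 0.4],
    }
)

wide = long_df.set_index(["acc_id", "name"]).unstack("name")

# The unstacked level keeps the category labels, not their integer codes.
labels = list(wide.columns.get_level_values("name"))
print(labels)
```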

Environment overview

Environment location: Docker
Method of cuDF install: Rapids 0.17 container from NGC

BenikaHall · 2021-02-10T20:50:56Z

It appears that the dtype of the column we're trying to unstack is "category". It may be worth investigating why the "unstack" function changes the variable names for "category" dtypes. As a workaround, I cast the column to "object" with astype before unstacking, and the names were preserved.

Adjusting the code shown in the initial comment, it now works as follows.

import cudf
import cupy

def get_df(length, num_cols, num_months, acc_offset):
    cols = [ 'var_{}'.format(i) for i in range(num_cols)]
    df = cudf.DataFrame({col: cupy.random.rand(length * num_months) for col in cols})
    df['acc_id'] = cupy.repeat(cupy.arange(length), num_months) + acc_offset
    months = cupy.repeat(cupy.arange(length), num_months, axis=0).reshape(length, num_months)
    cupy.random.shuffle(months)
    df['month_id'] = months.T.flatten()
    return df

num_cols = 10
acc_len = 20
num_partitions = 4
num_months = 24

df = get_df(acc_len, num_cols, num_months, 0)

cols = [ 'var_{}'.format(i) for i in range(num_cols)]
unpivot = cudf.melt(df, id_vars=['acc_id','month_id'], value_vars=cols, var_name='name')

sorted_df = unpivot.sort_values(['acc_id', 'month_id'])

sorted_df['name'] = sorted_df['name'].astype('object')
sorted_df.set_index(['acc_id', 'month_id', 'name']).unstack('name')
sorted_df

shwina · 2021-02-11T10:01:30Z

Thanks for reporting. I hope you don't mind that I reopened this issue, because this certainly sounds like there's still a bug somewhere. One shouldn't have to convert categorical data to object (i.e. strings in cuDF) for unstack to work correctly.

shwina · 2021-02-11T11:46:50Z

I was able to track this down to a bug in the Frame._encode function:

cudf/python/cudf/cudf/core/frame.py

Lines 3493 to 3496 in aa72df7

    
           def _encode(self): 
        
               keys, indices = libcudf.transform.table_encode(self) 
        
               keys = self.__class__._from_table(keys) 
        
               return keys, indices

Before returning keys, we should call keys._copy_type_metadata(self) to copy the category information from self onto keys.
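
To illustrate why the missing metadata copy produces integer labels, here is a toy pure-Python model. Everything in it (`ToyCategorical`, `encode_without_metadata`, `encode_with_metadata`) is hypothetical and is not cuDF's implementation. It only assumes that a categorical column is stored as integer codes plus a table of category labels.

```python
# Toy model only: these names are hypothetical and are not cuDF internals.
from dataclasses import dataclass


@dataclass
class ToyCategorical:
    """A categorical column stored as integer codes plus a label table."""

    codes: list
    categories: list

    def labels(self):
        # Map each stored code back to its category label.
        return [self.categories[c] for c in self.codes]


def encode_without_metadata(col):
    """Return (unique keys, indices) computed on the raw codes only."""
    keys = sorted(set(col.codes))
    indices = [keys.index(c) for c in col.codes]
    return keys, indices


def encode_with_metadata(col):
    """Same encoding, but re-attach the category labels to the keys."""
    keys, indices = encode_without_metadata(col)
    keys = ToyCategorical(codes=keys, categories=col.categories)
    return keys, indices


name = ToyCategorical(
    codes=[0, 1, 2, 0, 1, 2],
    categories=["var_0", "var_1", "var_2"],
)

bare_keys, _ = encode_without_metadata(name)
typed_keys, indices = encode_with_metadata(name)

print(bare_keys)            # [0, 1, 2]
print(typed_keys.labels())  # ['var_0', 'var_1', 'var_2']
```

In this model the bare keys are what end up as column labels when the category information is dropped, which matches the 0, 1, 2, ... names reported above.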

BenikaHall · 2021-02-11T15:02:49Z

That's very true. Thank you for reopening @shwina. I'll follow up on this as well.

shwina · 2021-02-11T16:32:05Z

@BenikaHall - if you (or anyone else) would like to contribute a fix here, it should be relatively easy 🙂

In addition to the fix above, we'll need a couple of tests for this case in test_reshape.py.

BenikaHall · 2021-02-11T17:08:17Z

I'll see if I have some time next week to get a few tests done and a PR submitted. Thanks again @shwina.

esnvidia · 2021-04-20T20:23:42Z

Slightly related - calling unstack on a Series:

import cudf
import numpy as np
gdf = cudf.DataFrame(data=[[i for i in range(1990,2021) for j in range(500)],\
                     np.random.randint(0,3, size=(2021-1990)*500)]).T
gdf.columns=['year', 'use chip']
gdf[['year', 'use chip']].groupby(['year', 'use chip']).size().unstack(fill_value=0)  # fails
# AttributeError: 'Series' object has no attribute 'unstack'
gdf[['year', 'use chip']].groupby(['year', 'use chip']).size().to_pandas().unstack(fill_value=0)  # succeeds
# cudf.__version__ is 0.18.1

Working around this by calling to_frame() instead of to_pandas() produces a MultiIndex for the columns:

gdf[['year', 'use chip']].groupby(['year', 'use chip']).size().to_frame().unstack().columns
MultiIndex([(0, 0),
            (0, 1),
            (0, 2)],
           names=[None, 'use chip'])
gdf[['year', 'use chip']].groupby(['year', 'use chip']).size().to_pandas().unstack().columns
Int64Index([0, 1, 2], dtype='int64', name='use chip')
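
The difference between the two column layouts can be reproduced in plain pandas. This sketch uses made-up data and pandas only; whether the same `droplevel` step behaves identically in cuDF 0.18 is an assumption that has not been checked here.

```python
import pandas as pd

# Small illustrative frame; the values are invented for this example.
df = pd.DataFrame(
    {
        "year": [1990, 1990, 1990, 1991, 1991],
        "use chip": [0, 1, 1, 0, 2],
    }
)
counts = df.groupby(["year", "use chip"]).size()

# Series.unstack yields a flat column index named after the unstacked level.
flat = counts.unstack(fill_value=0)

# Going through to_frame() adds an outer column level (the frame's single column, 0).
nested = counts.to_frame().unstack(fill_value=0)
nested_levels = nested.columns.nlevels

# Dropping that outer level recovers the flat layout.
nested.columns = nested.columns.droplevel(0)
```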

Fixes #7365

Applies column metadata to the output columns of `keys` in `Frame._encode`; skipping this step meant that the output of `DataFrame.unstack` would not have the expected metadata for index columns:

```python
import pandas as pd
import cudf

pdf = pd.DataFrame(
    {
        "foo": ["one", "one", "one", "two", "two", "two"],
        "bar": pd.Categorical(["A", "B", "C", "A", "B", "C"]),
        "baz": [1, 2, 3, 4, 5, 6],
        "zoo": ["x", "y", "z", "q", "w", "t"],
    }).set_index(["foo", "bar", "baz"])

gdf = cudf.from_pandas(pdf)

pdf.unstack("baz")
         zoo
baz        1    2    3    4    5    6
foo bar
one A      x  NaN  NaN  NaN  NaN  NaN
    B    NaN    y  NaN  NaN  NaN  NaN
    C    NaN  NaN    z  NaN  NaN  NaN
two A    NaN  NaN  NaN    q  NaN  NaN
    B    NaN  NaN  NaN  NaN    w  NaN
    C    NaN  NaN  NaN  NaN  NaN    t

gdf.unstack("baz")
          zoo
baz         1     2     3     4     5     6
foo bar
one 0       x  <NA>  <NA>  <NA>  <NA>  <NA>
    1    <NA>     y  <NA>  <NA>  <NA>  <NA>
    2    <NA>  <NA>     z  <NA>  <NA>  <NA>
two 0    <NA>  <NA>  <NA>     q  <NA>  <NA>
    1    <NA>  <NA>  <NA>  <NA>     w  <NA>
    2    <NA>  <NA>  <NA>  <NA>  <NA>     t
```

Authors:
- Charles Blackmon-Luca (https://github.com/charlesbluca)

Approvers:
- GALI PREM SAGAR (https://github.com/galipremsagar)

URL: #8560

BenikaHall added Needs Triage Need team to review and classify bug Something isn't working labels Feb 10, 2021

BenikaHall closed this as completed Feb 10, 2021

BenikaHall changed the title ~~[BUG]~~ [BUG] Unstack function changes column names to numeric format Feb 10, 2021

shwina reopened this Feb 11, 2021

shwina added Python Affects Python cuDF API. and removed Needs Triage Need team to review and classify labels Feb 11, 2021

shwina added the good first issue Good for newcomers label Feb 11, 2021

charlesbluca mentioned this issue Jun 18, 2021

Apply metadata to keys before returning in Frame._encode #8560

Merged

charlesbluca self-assigned this Jun 18, 2021

charlesbluca mentioned this issue Jul 14, 2021

[FEA] Adding support for categorical column indexes #8743

Open

rapids-bot bot closed this as completed in #8560 Jul 16, 2021
