Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Apply metadata to keys before returning in Frame._encode #8560

Merged
merged 4 commits into from
Jul 16, 2021

Conversation

charlesbluca
Copy link
Member

Fixes #7365

Applies column metadata to the output columns of keys in Frame._encode; skipping this step meant that the output of DataFrame.unstack would not have the expected metadata for index columns:

import pandas as pd
import cudf

pdf = pd.DataFrame(
    {
        "foo": ["one", "one", "one", "two", "two", "two"],
        "bar": pd.Categorical(["A", "B", "C", "A", "B", "C"]),
        "baz": [1, 2, 3, 4, 5, 6],
        "zoo": ["x", "y", "z", "q", "w", "t"],
    }).set_index(["foo", "bar", "baz"])
gdf = cudf.from_pandas(pdf)

pdf.unstack("baz")
         zoo                         
baz        1    2    3    4    5    6
foo bar                              
one A      x  NaN  NaN  NaN  NaN  NaN
    B    NaN    y  NaN  NaN  NaN  NaN
    C    NaN  NaN    z  NaN  NaN  NaN
two A    NaN  NaN  NaN    q  NaN  NaN
    B    NaN  NaN  NaN  NaN    w  NaN
    C    NaN  NaN  NaN  NaN  NaN    t

gdf.unstack("baz")
          zoo                              
baz         1     2     3     4     5     6
foo bar                                    
one 0       x  <NA>  <NA>  <NA>  <NA>  <NA>
    1    <NA>     y  <NA>  <NA>  <NA>  <NA>
    2    <NA>  <NA>     z  <NA>  <NA>  <NA>
two 0    <NA>  <NA>  <NA>     q  <NA>  <NA>
    1    <NA>  <NA>  <NA>  <NA>     w  <NA>
    2    <NA>  <NA>  <NA>  <NA>  <NA>     t

@charlesbluca charlesbluca added bug Something isn't working 3 - Ready for Review Ready for review by team Python Affects Python cuDF API. non-breaking Non-breaking change labels Jun 18, 2021
@charlesbluca charlesbluca requested a review from a team as a code owner June 18, 2021 21:25
@charlesbluca
Copy link
Member Author

charlesbluca commented Jun 18, 2021

The tests are failing when comparing pdf.columns to gdf.columns:

>       assert_eq(
            pdf.unstack(level=level), gdf.unstack(level=level), check_dtype=False,
        )

cudf/tests/test_reshape.py:425: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
cudf/tests/utils.py:99: in assert_eq
    tm.assert_frame_equal(left, right, **kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

left = CategoricalIndex(['A', 'B', 'C', 'A', 'B', 'C'], categories=['A', 'B', 'C'], ordered=False, name='bar', dtype='category')
right = Index(['A', 'B', 'C', 'A', 'B', 'C'], dtype='object', name='bar'), obj = 'MultiIndex level [2]'

    def _check_types(left, right, obj="Index"):
        if exact:
>           assert_class_equal(left, right, exact=exact, obj=obj)
E           AssertionError: MultiIndex level [2] are different
E           
E           MultiIndex level [2] classes are not equivalent
E           [left]:  CategoricalIndex(['A', 'B', 'C', 'A', 'B', 'C'], categories=['A', 'B', 'C'], ordered=False, name='bar', dtype='category')
E           [right]: Index(['A', 'B', 'C', 'A', 'B', 'C'], dtype='object', name='bar')

../../../compose/etc/conda/cuda_11.2/envs/rapids/lib/python3.8/site-packages/pandas/_testing.py:740: AssertionError

It looks like this is because when we are constructing a Pandas MultiIndex from the cuDF dataframe's column accessor, we set a dtype of object:

def to_pandas_index(self) -> pd.Index:
""""
Convert the keys of the ColumnAccessor to a Pandas Index object.
"""
if self.multiindex and len(self.level_names) > 0:
# Using `from_frame()` instead of `from_tuples`
# prevents coercion of values to a different type
# (e.g., ''->NaT)
result = pd.MultiIndex.from_frame(
pd.DataFrame(
self.names, columns=self.level_names, dtype="object"
),
)
else:
result = pd.Index(self.names, name=self.name, tupleize_cols=False)
return result

Is there anything we could do here to recreate the exact Pandas MultiIndex, or should we just create a different test case for this change that doesn't check equality with Pandas?

@charlesbluca
Copy link
Member Author

Ended up xfailing the tests, as it seems like the underlying issue here is that we don't fully support categorical column indexes. Opened up an issue discussing this #8743

@codecov
Copy link

codecov bot commented Jul 14, 2021

Codecov Report

Merging #8560 (22d8f3b) into branch-21.08 (d05de97) will increase coverage by 0.01%.
The diff coverage is 0.00%.

❗ Current head 22d8f3b differs from pull request most recent head 636fdb8. Consider uploading reports for the commit 636fdb8 to get more accurate results
Impacted file tree graph

@@               Coverage Diff                @@
##           branch-21.08    #8560      +/-   ##
================================================
+ Coverage         10.66%   10.67%   +0.01%     
================================================
  Files               109      109              
  Lines             18302    18669     +367     
================================================
+ Hits               1951     1993      +42     
- Misses            16351    16676     +325     
Impacted Files Coverage Δ
python/cudf/cudf/core/frame.py 0.00% <0.00%> (ø)
python/cudf/cudf/io/hdf.py 0.00% <0.00%> (ø)
python/cudf/cudf/io/orc.py 0.00% <0.00%> (ø)
python/cudf/cudf/_version.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/abc.py 0.00% <0.00%> (ø)
python/cudf/cudf/api/types.py 0.00% <0.00%> (ø)
python/cudf/cudf/io/dlpack.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/index.py 0.00% <0.00%> (ø)
python/cudf/cudf/io/feather.py 0.00% <0.00%> (ø)
... and 45 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d05de97...636fdb8. Read the comment docs.

@galipremsagar
Copy link
Contributor

@gpucibot merge

@rapids-bot rapids-bot bot merged commit cd1866e into rapidsai:branch-21.08 Jul 16, 2021
@galipremsagar galipremsagar added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team labels Jul 16, 2021
@charlesbluca charlesbluca deleted the fix-7365 branch August 3, 2021 17:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge bug Something isn't working non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Unstack function changes column names to numeric format
2 participants