Convert Column Name to String Before Using Struct Column Factory (NVIDIA#10156)

Closes NVIDIA#10155 

`build_struct_column` requires field names to be strings, but DataFrame column names can be any hashable type, so passing column names directly as field names in `to_struct` is unsafe. This PR casts the column names to strings before building the struct column and raises a warning when any of them is not already a string.
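The check described above can be sketched in plain Python, without cuDF. This is only an illustration: `stringify_field_names` is a hypothetical helper name and the warning text is shortened. Neither is part of the cuDF API.

```python
import warnings


def stringify_field_names(names):
    """Hypothetical helper mirroring the check described above; not cuDF code."""
    names = list(names)
    # Warn only when at least one name is not already a string.
    if not all(isinstance(name, str) for name in names):
        warnings.warn(
            "Non-string column names will be cast to string as the field name."
        )
    # Always return strings; str() leaves names that are already strings unchanged.
    return [str(name) for name in names]


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # make sure the warning is recorded
    field_names = stringify_field_names([(1, "b"), 0])

print(field_names)  # ["(1, 'b')", '0']
print(len(caught))  # 1
```

The cast is applied unconditionally and only the warning is conditional, which keeps the return type uniform for the caller.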

Authors:
  - Michael Wang (https://github.com/isVoid)
  - Bradley Dice (https://github.com/bdice)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Sheilah Kirui (https://github.com/skirui-source)
  - Bradley Dice (https://github.com/bdice)

URL: rapidsai/cudf#10156
isVoid authored Feb 14, 2022
1 parent 374b387 commit a443dd1
Showing 3 changed files with 19 additions and 3 deletions.
4 changes: 2 additions & 2 deletions python/cudf/cudf/core/column/column.py
@@ -1602,8 +1602,8 @@ def build_struct_column(
     Parameters
     ----------
-    names : list-like
-        Field names to map to children dtypes
+    names : sequence of strings
+        Field names to map to children dtypes, must be strings.
     children : tuple
     mask: Buffer
10 changes: 9 additions & 1 deletion python/cudf/cudf/core/dataframe.py
@@ -5864,8 +5864,16 @@ def to_struct(self, name=None):
         -----
         Note that a copy of the columns is made.
         """
+        if not all(isinstance(name, str) for name in self._data.names):
+            warnings.warn(
+                "DataFrame contains non-string column name(s). Struct column "
+                "requires field name to be string. Non-string column names "
+                "will be casted to string as the field name."
+            )
+        field_names = [str(name) for name in self._data.names]
+
         col = cudf.core.column.build_struct_column(
-            names=self._data.names, children=self._data.columns, size=len(self)
+            names=field_names, children=self._data.columns, size=len(self)
         )
         return cudf.Series._from_data(
             cudf.core.column_accessor.ColumnAccessor(
8 changes: 8 additions & 0 deletions python/cudf/cudf/tests/test_struct.py
@@ -205,6 +205,14 @@ def test_dataframe_to_struct():
     df["a"][0] = 5
     assert_eq(got, expect)
 
+    # check that a non-string (but convertible to string) named column can be
+    # converted to struct
+    df = cudf.DataFrame([[1, 2], [3, 4]], columns=[(1, "b"), 0])
+    expect = cudf.Series([{"(1, 'b')": 1, "0": 2}, {"(1, 'b')": 3, "0": 4}])
+    with pytest.warns(UserWarning, match="will be casted"):
+        got = df.to_struct()
+    assert_eq(got, expect)
+
 
 @pytest.mark.parametrize(
     "series, slce",
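The new test expects a `UserWarning` even though the `warnings.warn` call in `to_struct` passes no category. A minimal standard-library sketch of why that works: `warnings.warn` defaults to `UserWarning`, and the `match` argument of `pytest.warns` is applied as a regular-expression search over the message. The `MESSAGE` constant here is illustrative and copies only the tail of the warning text from the diff.

```python
import re
import warnings

# Tail of the warning text from the diff above.
MESSAGE = (
    "Non-string column names "
    "will be casted to string as the field name."
)

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # record even if filters would suppress it
    warnings.warn(MESSAGE)           # no category given, so UserWarning is used

record = caught[0]
is_user_warning = issubclass(record.category, UserWarning)
pattern_found = re.search("will be casted", str(record.message)) is not None
print(is_user_warning, pattern_found)  # True True
```

Because the test matches on the literal fragment "will be casted", rewording the warning message would require updating the test pattern as well.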
