[BUG]cudf.get_dummies fails if symbols ( $,( ) are present in data #8832

Closed
VibhuJawa opened this issue Jul 22, 2021 · 1 comment · Fixed by #8834
Labels: bug (Something isn't working), Python (Affects Python cuDF API)

Comments

@VibhuJawa
Member

Describe the bug
cudf.get_dummies fails if dollar symbol is present in data

Steps/Code to reproduce bug

import cudf
df = cudf.DataFrame({"a":["$ 1", "$ 2"]})
df['a']=df['a'].astype('category')
cudf.get_dummies(df,columns=['a'])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_26534/211579206.py in <module>
      2 df = cudf.DataFrame({"a":["$ 1", "$ 2"]})
      3 df['a']=df['a'].astype('category')
----> 4 cudf.get_dummies(df,columns=['a'])

/nvme/0/vjawa/conda/envs/rapids-21.08/lib/python3.8/site-packages/cudf/core/reshape.py in get_dummies(df, prefix, prefix_sep, dummy_na, columns, cats, sparse, drop_first, dtype)
    700                 unique = _get_unique(column=df._data[name], dummy_na=dummy_na)
    701 
--> 702                 col_enc_df = df.one_hot_encoding(
    703                     name,
    704                     prefix=prefix_map.get(name, prefix),

/nvme/0/vjawa/conda/envs/rapids-21.08/lib/python3.8/site-packages/cudf/core/dataframe.py in one_hot_encoding(self, column, prefix, cats, prefix_sep, dtype)
   3655             for cat in cats
   3656         ]
-> 3657         newcols = self[column].one_hot_encoding(cats=cats, dtype=dtype)
   3658         outdf = self.copy()
   3659         for name, col in zip(newnames, newcols):

/nvme/0/vjawa/conda/envs/rapids-21.08/lib/python3.8/site-packages/cudf/core/series.py in one_hot_encoding(self, cats, dtype)
   3792                 return (self == cat).fillna(False)
   3793 
-> 3794         return [encode(cat).astype(dtype) for cat in cats]
   3795 
   3796     def label_encoding(self, cats, dtype=None, na_sentinel=-1):

/nvme/0/vjawa/conda/envs/rapids-21.08/lib/python3.8/site-packages/cudf/core/series.py in <listcomp>(.0)
   3792                 return (self == cat).fillna(False)
   3793 
-> 3794         return [encode(cat).astype(dtype) for cat in cats]
   3795 
   3796     def label_encoding(self, cats, dtype=None, na_sentinel=-1):

/nvme/0/vjawa/conda/envs/rapids-21.08/lib/python3.8/site-packages/cudf/core/series.py in encode(cat)
   3790                 return self.__class__(libcudf.unary.is_nan(self._column))
   3791             else:
-> 3792                 return (self == cat).fillna(False)
   3793 
   3794         return [encode(cat).astype(dtype) for cat in cats]

/nvme/0/vjawa/conda/envs/rapids-21.08/lib/python3.8/site-packages/cudf/core/frame.py in __eq__(self, other)
   3580     # Binary rich comparison operations.
   3581     def __eq__(self, other):
-> 3582         return self._binaryop(other, "eq")
   3583 
   3584     def __ne__(self, other):

/nvme/0/vjawa/conda/envs/rapids-21.08/lib/python3.8/site-packages/cudf/core/series.py in _binaryop(self, other, fn, fill_value, reflect, can_reindex, *args, **kwargs)
   1364 
   1365         # Note that we call the super on lhs, not self.
-> 1366         return super(Series, lhs)._binaryop(other, fn, fill_value, reflect)
   1367 
   1368     def add(self, other, fill_value=None, axis=0):

/nvme/0/vjawa/conda/envs/rapids-21.08/lib/python3.8/site-packages/cudf/core/frame.py in _binaryop(self, other, fn, fill_value, reflect, *args, **kwargs)
   3958 
   3959         return self._copy_construct(
-> 3960             data=type(self)._colwise_binop(operands, fn)[result_name],
   3961             name=result_name,
   3962         )

/nvme/0/vjawa/conda/envs/rapids-21.08/lib/python3.8/site-packages/cudf/core/frame.py in _colwise_binop(cls, operands, fn)
   3431                 )
   3432             elif not isinstance(right_column, ColumnBase):
-> 3433                 right_column = left_column.normalize_binop_value(right_column)
   3434 
   3435             fn_apply = fn

/nvme/0/vjawa/conda/envs/rapids-21.08/lib/python3.8/site-packages/cudf/core/column/categorical.py in normalize_binop_value(self, other)
    931 
    932         ary = cudf.utils.utils.scalar_broadcast_to(
--> 933             self._encode(other), size=len(self), dtype=self.codes.dtype
    934         )
    935         col = column.build_categorical_column(

/nvme/0/vjawa/conda/envs/rapids-21.08/lib/python3.8/site-packages/cudf/core/column/categorical.py in _encode(self, value)
   1046 
   1047     def _encode(self, value) -> ScalarLike:
-> 1048         return self.categories.find_first_value(value)
   1049 
   1050     def _decode(self, value: int) -> ScalarLike:

/nvme/0/vjawa/conda/envs/rapids-21.08/lib/python3.8/site-packages/cudf/core/column/string.py in find_first_value(self, value, closest)
   5313         self, value: ScalarLike, closest: bool = False
   5314     ) -> int:
-> 5315         return self._find_first_and_last(value)[0]
   5316 
   5317     def find_last_value(self, value: ScalarLike, closest: bool = False) -> int:

/nvme/0/vjawa/conda/envs/rapids-21.08/lib/python3.8/site-packages/cudf/core/column/string.py in _find_first_and_last(self, value)
   5306         found_indices = libstrings.contains_re(self, f"^{value}$")
   5307         found_indices = libcudf.unary.cast(found_indices, dtype=np.int32)
-> 5308         first = column.as_column(found_indices).find_first_value(np.int32(1))
   5309         last = column.as_column(found_indices).find_last_value(np.int32(1))
   5310         return first, last

/nvme/0/vjawa/conda/envs/rapids-21.08/lib/python3.8/site-packages/cudf/core/column/numerical.py in find_first_value(self, value, closest)
    421                     raise ValueError("value not found")
    422         elif found == -1:
--> 423             raise ValueError("value not found")
    424         return found
    425 

ValueError: value not found

Expected behavior

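I would expect cudf.get_dummies to succeed and return one indicator column per category, regardless of which symbols appear in the values.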

Additional context
Looking at the trace, it seems we are not handling regex special characters correctly on this line:

 found_indices = libstrings.contains_re(self, f"^{value}$")
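
To illustrate why interpolating the raw value into a regex fails, here is a minimal sketch that uses Python's standard `re` module as a stand-in for `libstrings.contains_re`. This is an assumption for illustration only: libcudf's regex engine is not Python's, but both treat `$` as an end anchor and `(` as the start of a group. The helper names `naive_exact_match` and `escaped_exact_match` are hypothetical and do not exist in cuDF.

```python
import re


def naive_exact_match(value, candidates):
    """Mimic the f"^{value}$" approach: the value is used as a raw regex."""
    pattern = re.compile(f"^{value}$")
    return [bool(pattern.search(c)) for c in candidates]


def escaped_exact_match(value, candidates):
    """Escape regex metacharacters first so the value is matched literally."""
    pattern = re.compile(f"^{re.escape(value)}$")
    return [bool(pattern.search(c)) for c in candidates]


categories = ["$ 1", "$ 2"]

# "^$ 1$" asks for end-of-string right after the start, so nothing matches.
naive = naive_exact_match("$ 1", categories)

# With escaping, "$" is a literal dollar sign and the first category matches.
escaped = escaped_exact_match("$ 1", categories)

# An unbalanced "(" is not even a valid pattern in Python's engine.
try:
    naive_exact_match("( 1", ["( 1"])
    paren_error = None
except re.error as exc:
    paren_error = exc
```

In this sketch the naive version finds no match for `"$ 1"`, which mirrors the "value not found" error in the traceback, while the escaped version matches only the first category.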
VibhuJawa added the bug and Needs Triage labels on Jul 22, 2021
VibhuJawa added the Python label and removed the Needs Triage label on Jul 22, 2021
galipremsagar self-assigned this on Jul 22, 2021
VibhuJawa changed the title from "[BUG]cudf.get_dummies fails if dollar symbol is present in data" to "[BUG]cudf.get_dummies fails if symbols ( $,( ) are present in data" on Jul 22, 2021
@VibhuJawa
Member Author

It also breaks for the ( symbol.

rapids-bot bot pushed a commit that referenced this issue Jul 23, 2021
Fixes: #8832 

This PR fixes the `contains` check in `StringColumn`. We were using `f"^{item}$"` to generate a regex and calling `contains_re` to check for an exact match of `item` in the `StringColumn`, but this approach breaks if `item` itself contains regex special characters. These checks are now replaced with `libcudf.search.contains`, which performs an exact check for `item` in the `StringColumn`.

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Ram (Ramakrishna Prabhu) (https://github.com/rgsl888prabhu)
  - Charles Blackmon-Luca (https://github.com/charlesbluca)

URL: #8834
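
The idea behind the fix in the commit above, comparing values exactly instead of building a regex, can be sketched in plain Python. This is only an illustration of the technique: the function `find_first_and_last` below is a hypothetical stand-in operating on a Python list, not the cuDF implementation and not the `libcudf.search.contains` API.

```python
def find_first_and_last(values, target):
    """Return the first and last index where `values` equals `target` exactly.

    Plain equality is used, so characters such as "$" or "(" carry no
    special meaning. Raises ValueError when the target is absent, matching
    the error message seen in the traceback.
    """
    matches = [i for i, v in enumerate(values) if v == target]
    if not matches:
        raise ValueError("value not found")
    return matches[0], matches[-1]


categories = ["$ 1", "$ 2", "( 3", "$ 1"]

dollar_span = find_first_and_last(categories, "$ 1")
paren_span = find_first_and_last(categories, "( 3")
```

Because no pattern is compiled, the lookup behaves the same for any string content, which is the property the regex-based check lacked.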
rapids-bot bot pushed a commit that referenced this issue Oct 29, 2021
Closes #9463 
Closes #9434 

This adds a small section to the [Regex Features](https://docs.rapids.ai/api/libcudf/stable/md_regex.html) page describing how invalid regex patterns may result in undefined behavior. The list here includes current issues as well as ones opened in the past:
#3732, #8832, #5234, #4746, #3725

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Karthikeyan (https://github.com/karthikeyann)
  - Paul Taylor (https://github.com/trxcllnt)
  - MithunR (https://github.com/mythrocks)

URL: #9473