[BUG] RunTimeError in extract function for a valid regex #5229

Closed
galipremsagar opened this issue May 19, 2020 · 2 comments
Labels
bug Something isn't working invalid This doesn't seem right strings strings issues (C++ and Python)

Comments

@galipremsagar
Contributor

galipremsagar commented May 19, 2020

Describe the bug
`Series.str.extract` raises a `RuntimeError` on a regex that both pandas and Python's `re` module accept as valid.

Steps/Code to reproduce bug

>>> import cudf
>>> s = cudf.Series(['a1', 'b2', 'c3'])
>>> s.str.extract(r'(?P<letter>[ab])(?P<digit>\d)')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/conda/envs/cudf/lib/python3.7/site-packages/cudf/core/column/string.py", line 452, in extract
    out = cpp_extract(self._column, pat)
  File "cudf/_lib/strings/extract.pyx", line 32, in cudf._lib.strings.extract.extract
RuntimeError: cuDF failure at: /cudf/cpp/src/strings/regex/regcomp.cpp:398: invalid regex pattern: nothing to repeat at position 1
>>> s.to_pandas().str.extract(r'(?P<letter>[ab])(?P<digit>\d)')
  letter digit
0      a     1
1      b     2
2    NaN   NaN

# Python's re module compiles the same regex without error, so it appears to be valid.
>>> import re
>>> re.compile(r'(?P<letter>[ab])(?P<digit>\d)')
re.compile('(?P<letter>[ab])(?P<digit>\\d)')

Expected behavior
I think we shouldn't raise an error for this regex. As a follow-up, we may also need an API that extracts the column names from the regex, since in this case the group names (`letter`, `digit`) become the column names of the result.
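
To illustrate the follow-up idea, here is a minimal sketch of how column names could be derived from a pattern using only Python's standard `re` module. The helper name `group_names` is hypothetical and not part of cuDF or pandas. Labelling unnamed groups with their 0-based position is an assumption, chosen to resemble the pandas output shown above.

```python
import re


def group_names(pattern):
    """Return one label per capture group, in group order.

    Named groups use their name; unnamed groups fall back to their
    0-based position (an assumption made for this sketch).
    """
    compiled = re.compile(pattern)
    # groupindex maps group name -> 1-based group number; invert it.
    number_to_name = {num: name for name, num in compiled.groupindex.items()}
    return [number_to_name.get(i, i - 1) for i in range(1, compiled.groups + 1)]


names = group_names(r'(?P<letter>[ab])(?P<digit>\d)')
print(names)
```

Because the names are read from the compiled pattern rather than parsed by hand, this approach relies on Python's own regex parser and would only apply to patterns that `re` can compile.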

Environment overview (please complete the following information)

  • Environment location: Docker
  • Method of cuDF install: from source (branch-0.14)

Additional context
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.extract.html

@galipremsagar galipremsagar added bug Something isn't working Needs Triage Need team to review and classify strings strings issues (C++ and Python) labels May 19, 2020
@davidwendt
Contributor

I'm not sure why two separate issues were opened in one report.

For the first issue, the named-group syntax `(?P<name>...)` is not supported. The supported regex features are documented here: https://docs.rapids.ai/api/libcudf/nightly/md_regex.html
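
Since the named-group syntax is what gets rejected, one possible workaround is to rewrite the pattern with plain capture groups before handing it to the extract call. The sketch below uses only Python's `re` module. The helper `strip_group_names` is hypothetical, and the thread does not confirm that cuDF accepts the rewritten pattern. The substitution is deliberately naive: it does not handle named backreferences such as `(?P=name)` or an escaped literal parenthesis that happens to precede `?P<`.

```python
import re

# Matches the opening of a Python-style named group, e.g. "(?P<letter>".
_NAMED_GROUP_OPEN = re.compile(r'\(\?P<[A-Za-z_]\w*>')


def strip_group_names(pattern):
    """Replace every '(?P<name>' opener with a plain '(' capture group."""
    return _NAMED_GROUP_OPEN.sub('(', pattern)


plain = strip_group_names(r'(?P<letter>[ab])(?P<digit>\d)')
print(plain)

# The rewritten pattern still captures the same groups, by position.
print(re.match(plain, 'a1').groups())
```

The original group names are lost in the rewrite, so they would have to be recorded separately (for example from `re.compile(pattern).groupindex`) and reapplied as column names afterwards.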

@galipremsagar
Contributor Author

Thanks for the documentation, David. I'm closing this issue since I've split the second issue out into #5234.

@galipremsagar galipremsagar added invalid This doesn't seem right bug Something isn't working and removed Needs Triage Need team to review and classify bug Something isn't working labels May 20, 2020