Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] String match returns incorrect result in some cases #9434

Closed
ayushdg opened this issue Oct 14, 2021 · 6 comments · Fixed by #9473
Closed

[BUG] String match returns incorrect result in some cases #9434

ayushdg opened this issue Oct 14, 2021 · 6 comments · Fixed by #9473
Labels
bug Something isn't working doc Documentation strings strings issues (C++ and Python)

Comments

@ayushdg
Copy link
Member

ayushdg commented Oct 14, 2021

Describe the bug
series.str.match returns incorrect results in the case of some regular expressions.

Steps/Code to reproduce bug

import cudf

regex = "^\\^|()\\-*\\[\\]\\$$"
pdf = pd.DataFrame({"a": ["a normal string", "%_%", "^|()-*[]$"]})
df = cudf.from_pandas(pdf)

print(df.a.str.match(regex))
0    True
1    True
2    True
Name: a, dtype: bool

Expected behavior
In the above example

print(pdf.a.str.match(regex))
0    False
1    False
2     True
Name: a, dtype: bool

Environment overview (please complete the following information)

  • Environment location: Bare-metal
  • Method of cuDF install: conda
    • If method of install is [Docker], provide docker pull & docker run commands used

Environment details
Please run and paste the output of the cudf/print_env.sh script here, to gather any other relevant environment details

Additional context
Came up while adding gpu tests for string functionality in dask-sql: dask-contrib/dask-sql#256

@ayushdg ayushdg added bug Something isn't working Needs Triage Need team to review and classify labels Oct 14, 2021
@davidwendt
Copy link
Contributor

It looks like there are a lot of unescaped regex characters in the pattern that may be confusing the code. Here is the documentation that includes the meaning for the special regex characters. Looks like if you escape them all, you will get the correct result.

import cudf
regex = "^\\^\\|\\(\\)\\-\\*\\[\\]\\$$"
s = cudf.Series(["a normal string", "%_%", "^|()-*[]$"])
print(s.str.match(regex))

0    False
1    False
2     True
dtype: bool

@ayushdg
Copy link
Member Author

ayushdg commented Oct 14, 2021

Thanks for the context and link.
Are the regex rules for cudf string pattern matching slightly different than the standard python regex rules? (In terms of requiring to escape certain chars)

Trying to evaluate if dask-sql will need a special case to handle regexes differently for gpu dataframes vs pandas df's.

@davidwendt
Copy link
Contributor

I would expect that all regex characters would need to be escaped if they are to be matched literally in a target string. Perhaps python has done some work to guess when the escape was intended? I don't have a clue how they would know to do that.

@andygrove
Copy link
Contributor

I am also running into this issue. Here is one example:

Python

>>> pattern = re.compile('a[-+]')
>>> values = ['a-', 'a+', 'a']
>>> [pattern.search(x) for x in values]
[<re.Match object; span=(0, 2), match='a-'>, <re.Match object; span=(0, 2), match='a+'>, None]
>>> 

cuDF with escaping (correct results)

>>> cudf.Series(['a+', 'a-', 'a']).str.contains('a[\-\+]', regex=True)
0     True
1     True
2    False
dtype: bool

cuDF without escaping

>>> cudf.Series(['a+', 'a-', 'a']).str.contains('a[-+]', regex=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/andy/miniconda3/envs/rapids-21.10/lib/python3.7/site-packages/cudf/core/column/string.py", line 743, in contains
    result_col = libstrings.contains_re(self._column, pat)
  File "cudf/_lib/strings/contains.pyx", line 29, in cudf._lib.strings.contains.contains_re
RuntimeError: cuDF failure at: ../src/strings/regex/regcomp.cpp:397: invalid regex pattern: nothing to repeat at position 3

@davidwendt
Copy link
Contributor

davidwendt commented Oct 14, 2021

The - inside [] is expecting a character range like [a-z] or [1-9] but is missing the first range character before the -.
So the - will need to be escaped here. Looks like + is a special character too.

@davidwendt
Copy link
Contributor

@beckernick beckernick added strings strings issues (C++ and Python) doc Documentation and removed Needs Triage Need team to review and classify labels Oct 25, 2021
rapids-bot bot pushed a commit that referenced this issue Oct 29, 2021
Closes #9463 
Closes #9434 

This adds a small section to the [Regex Features](https://docs.rapids.ai/api/libcudf/stable/md_regex.html) page describing invalid regex patterns may result in undefined behavior. The list here includes current issues as well as ones opened in the past:
#3732, #8832, #5234, #4746, #3725

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Karthikeyan (https://github.com/karthikeyann)
  - Paul Taylor (https://github.com/trxcllnt)
  - MithunR (https://github.com/mythrocks)

URL: #9473
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working doc Documentation strings strings issues (C++ and Python)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants