[BUG] String match returns incorrect result in some cases #9434
Comments
It looks like there are a lot of unescaped regex characters in the pattern that may be confusing the code. Here is the documentation that includes the meaning for the special regex characters. Looks like if you escape them all, you will get the correct result.
Thanks for the context and link. I'm trying to evaluate whether dask-sql will need a special case to handle regexes differently for GPU dataframes vs. pandas dataframes.
I would expect that all regex characters would need to be escaped if they are to be matched literally in a target string. Perhaps Python has done some work to guess when the escape was intended? I don't have a clue how it would know to do that.
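As a minimal sketch of the escaping discussed above: when a pattern is meant to be matched literally, Python's standard `re.escape` backslash-escapes the regex metacharacters for you. The sample text and the name `literal` are illustrative assumptions, not taken from the original report, and whether the escaped pattern behaves the same way in cuDF has not been verified here.

```python
import re

# Hypothetical literal text that happens to contain regex metacharacters.
literal = "a+b(c)"

# re.escape backslash-escapes the regex metacharacters in the string.
escaped = re.escape(literal)

# The escaped pattern finds the literal text inside a longer string.
found = re.search(escaped, "xx a+b(c) yy")

# Unescaped, '+' and '(...)' act as operators, so the literal text is not found.
not_found = re.search(literal, "xx a+b(c) yy")
```

The escaped string could then be passed wherever a regex pattern is expected, which avoids relying on how a particular engine treats unescaped metacharacters.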
I am also running into this issue. Here is one example:
Python
>>> pattern = re.compile('a[-+]')
>>> values = ['a-', 'a+', 'a']
>>> [pattern.search(x) for x in values]
[<re.Match object; span=(0, 2), match='a-'>, <re.Match object; span=(0, 2), match='a+'>, None]
cuDF with escaping (correct results)
>>> cudf.Series(['a+', 'a-', 'a']).str.contains('a[\-\+]', regex=True)
0 True
1 True
2 False
dtype: bool
cuDF without escaping
>>> cudf.Series(['a+', 'a-', 'a']).str.contains('a[-+]', regex=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/andy/miniconda3/envs/rapids-21.10/lib/python3.7/site-packages/cudf/core/column/string.py", line 743, in contains
    result_col = libstrings.contains_re(self._column, pat)
  File "cudf/_lib/strings/contains.pyx", line 29, in cudf._lib.strings.contains.contains_re
RuntimeError: cuDF failure at: ../src/strings/regex/regcomp.cpp:397: invalid regex pattern: nothing to repeat at position 3
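A small sketch of why the escaped form is a reasonable workaround: in Python's `re`, the character class with escaped `-` and `+` is equivalent to the unescaped one, so writing the escaped form costs nothing on the Python side. This only exercises the standard library; the cuDF behavior is as reported in the comment above and is not re-run here. The variable names are illustrative.

```python
import re

values = ['a+', 'a-', 'a']

# Unescaped character class, as in the original Python example.
unescaped = re.compile('a[-+]')

# Escaped character class, the form reported to work in cuDF.
escaped = re.compile(r'a[\-\+]')

# Both patterns give the same per-value match results in Python.
results_unescaped = [bool(unescaped.search(v)) for v in values]
results_escaped = [bool(escaped.search(v)) for v in values]
```

Both lists come out as `[True, True, False]`, matching the cuDF output shown for the escaped pattern.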
The |
Documenting this as undefined behavior here: https://github.com/rapidsai/cudf/pull/9473/files#diff-5c59d57ecc67d4a26c406b9f0a4976a9c49c74c7c9ac8e9e4d3ba947fa065cc6R17-R23
Closes #9463
Closes #9434
This adds a small section to the [Regex Features](https://docs.rapids.ai/api/libcudf/stable/md_regex.html) page describing that invalid regex patterns may result in undefined behavior. The list here includes current issues as well as ones opened in the past: #3732, #8832, #5234, #4746, #3725
Authors:
- David Wendt (https://github.com/davidwendt)
Approvers:
- Karthikeyan (https://github.com/karthikeyann)
- Paul Taylor (https://github.com/trxcllnt)
- MithunR (https://github.com/mythrocks)
URL: #9473
Describe the bug
series.str.match returns incorrect results in the case of some regular expressions.
Steps/Code to reproduce bug
Expected behavior
In the above example
Environment overview (please complete the following information)
docker pull & docker run commands used
Environment details
Please run and paste the output of the cudf/print_env.sh script here, to gather any other relevant environment details
Additional context
Came up while adding gpu tests for string functionality in dask-sql: dask-contrib/dask-sql#256