[BUG] Certain quantifier regex patterns erroneously fail to compile #4746

Closed
kuhushukla opened this issue Mar 30, 2020 · 1 comment · Fixed by #4756
Labels
bug Something isn't working Spark Functionality that helps Spark RAPIDS strings strings issues (C++ and Python)

Comments

@kuhushukla

Describe the bug
A recent change to the cudf regex parser (commit 575e944f777df0b904be0a8f1a9722ded0c045a6) makes it return an error for invalid quantifiers. That change now rejects some patterns that previously compiled and that look valid.

Steps/Code to reproduce bug
The following tests fail:

TEST_F(StringsContainsTests, custom_anynl)
{
    std::string medium_regex = "(.|\\n)*";

    std::vector<const char*> h_strings{
        "hello"
    };
    cudf::test::strings_column_wrapper strings( h_strings.begin(), h_strings.end(),
        thrust::make_transform_iterator( h_strings.begin(), [] (auto str) { return str!=nullptr; }));

    auto strings_view = cudf::strings_column_view(strings);
    {
        auto results = cudf::strings::matches_re(strings_view, medium_regex);
        cudf::experimental::bool8 h_expected[] = {true};
        cudf::test::fixed_width_column_wrapper<cudf::experimental::bool8> expected( h_expected, h_expected+h_strings.size(),
            thrust::make_transform_iterator( h_strings.begin(), [] (auto str) { return str!=nullptr; }));
        cudf::test::expect_columns_equal(*results,expected);
    }
}

TEST_F(StringsContainsTests, dotStarWithBraces)
{
    std::string medium_regex = "(.)*";

    std::vector<const char*> h_strings{
        "hello"
    };
    cudf::test::strings_column_wrapper strings( h_strings.begin(), h_strings.end(),
        thrust::make_transform_iterator( h_strings.begin(), [] (auto str) { return str!=nullptr; }));

    auto strings_view = cudf::strings_column_view(strings);
    {
        auto results = cudf::strings::matches_re(strings_view, medium_regex);
        cudf::experimental::bool8 h_expected[] = {true};
        cudf::test::fixed_width_column_wrapper<cudf::experimental::bool8> expected( h_expected, h_expected+h_strings.size(),
            thrust::make_transform_iterator( h_strings.begin(), [] (auto str) { return str!=nullptr; }));
        cudf::test::expect_columns_equal(*results,expected);
    }
}

They throw

C++ exception with description "cuDF failure at: /home/kuhus/Reps/cudf/cpp/src/strings/regex/regcomp.cpp:504: invalid regex pattern: nothing to repeat at position 6" thrown in the test body.

and,

C++ exception with description "cuDF failure at: /home/kuhus/Reps/cudf/cpp/src/strings/regex/regcomp.cpp:504: invalid regex pattern: nothing to repeat at position 3" thrown in the test body.

respectively.
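
As a quick cross-check of the two messages, the sketch below looks up the character at each reported position and confirms that a reference engine accepts both patterns. It assumes the reported positions are 0-based offsets into the pattern string, and it uses Python's standard `re` module purely as a stand-in: `re` is not cuDF's regex engine and its syntax is not guaranteed to be identical. The helper names are hypothetical.

```python
import re

# The two patterns from the failing tests, paired with the position
# reported in each error message. The C++ literal "(.|\\n)*" denotes
# the regex (.|\n)*, written here as a raw string.
REPORTED = {r"(.|\n)*": 6, r"(.)*": 3}


def char_at_reported_position(pattern, position):
    """Return the pattern character at a 0-based offset (assumed indexing)."""
    return pattern[position]


def reference_match(pattern, text):
    """Match `text` from its start with Python's re (reference engine only)."""
    return re.match(pattern, text) is not None


for pattern, position in REPORTED.items():
    print(repr(pattern), position,
          repr(char_at_reported_position(pattern, position)),
          reference_match(pattern, "hello"))
```

Under that indexing assumption, both positions land on the `*` that directly follows the closing parenthesis.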
Expected behavior
The tests above passed before the change.

Environment overview (please complete the following information)
cudf 0.14 on Ubuntu 18.04

@kuhushukla kuhushukla added bug Something isn't working Needs Triage Need team to review and classify Spark Functionality that helps Spark RAPIDS labels Mar 30, 2020
@davidwendt davidwendt self-assigned this Mar 30, 2020
@davidwendt davidwendt added the strings strings issues (C++ and Python) label Mar 30, 2020
@davidwendt

Another example with the same error and conditions:

>>> s = cudf.Series(["0.0.0.0", "5.79.97.178"])
>>> reg=r"^(100\.(6[4-9]|[7-9][0-9]|1([0-1][0-9]|2[0-7]))\.([0-9]|[1-9][0-9]|1([0-9][0-9])|2([0-4][0-9]|5[0-5]))\.([0-9]|[1-9][0-9]|1([0-9][0-9])|2([0-4][0-9]|5[0-5]))$)"
>>> s.str.match(reg)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/conda/envs/cudf/lib/python3.7/site-packages/cudf/core/column/string.py", line 1736, in match
    cpp_match_re(self._column, pat), **kwargs
  File "cudf/_libxx/strings/contains.pyx", line 67, in cudf._libxx.strings.contains.match_re
RuntimeError: cuDF failure at: /cudf/cpp/src/strings/regex/regcomp.cpp:504: invalid regex pattern: nothing to repeat at position 80

Also, I was able to reproduce with the minimal pattern of `r"(9)|2"` so it looks like the ")" plus quantifier is causing the problem.
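
The reported position can be checked against the pattern string directly. The sketch below assumes position 80 is a 0-based offset, and again uses Python's standard `re` module only as a reference engine, not as a model of cuDF's parser. The names `REG`, `MINIMAL`, and `context_at` are hypothetical; `REG` is the pattern from the session above, split across lines without changing its characters.

```python
import re

# Pattern from the session above, concatenated from three raw-string pieces.
REG = (r"^(100\.(6[4-9]|[7-9][0-9]|1([0-1][0-9]|2[0-7]))"
       r"\.([0-9]|[1-9][0-9]|1([0-9][0-9])|2([0-4][0-9]|5[0-5]))"
       r"\.([0-9]|[1-9][0-9]|1([0-9][0-9])|2([0-4][0-9]|5[0-5]))$)")
MINIMAL = r"(9)|2"
POSITION = 80  # taken from the error message; assumed to be 0-based


def context_at(pattern, position, width=1):
    """Return the characters around a 0-based offset in the pattern."""
    return pattern[position - width:position + width + 1]


# Characters surrounding the reported position.
print(repr(context_at(REG, POSITION)))

# The reference engine compiles both patterns without complaint.
full = re.compile(REG)
minimal = re.compile(MINIMAL)

# Neither input starts with "100.", so neither is expected to match.
print([full.match(s) is not None for s in ["0.0.0.0", "5.79.97.178"]])
print([minimal.match(s) is not None for s in ["9", "2"]])
```

Under the 0-based assumption, the character at position 80 is the alternation operator `|` immediately after a `)`, the same `)|` sequence that appears in the minimal pattern.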

@harrism harrism removed the Needs Triage Need team to review and classify label Mar 31, 2020
rapids-bot bot pushed a commit that referenced this issue Oct 29, 2021
Closes #9463 
Closes #9434 

This adds a small section to the [Regex Features](https://docs.rapids.ai/api/libcudf/stable/md_regex.html) page describing how invalid regex patterns may result in undefined behavior. The list here includes current issues as well as ones opened in the past:
#3732, #8832, #5234, #4746, #3725

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Karthikeyan (https://github.com/karthikeyann)
  - Paul Taylor (https://github.com/trxcllnt)
  - MithunR (https://github.com/mythrocks)

URL: #9473