Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] RegExp parse fails to parse character ranges containing escaped characters #4505

Closed
andygrove opened this issue Jan 11, 2022 · 2 comments · Fixed by #5532
Closed

[BUG] RegExp parse fails to parse character ranges containing escaped characters #4505

andygrove opened this issue Jan 11, 2022 · 2 comments · Fixed by #5532
Assignees
Labels
bug Something isn't working

Comments

@andygrove
Copy link
Contributor

Describe the bug
The regular expression [\a-z] does not get parsed correctly because we currently only support character ranges with non-escaped characters. The transpiler then converts this to [\a\-z] when passing it to cuDF, resulting in incorrect results.

Steps/Code to reproduce bug
This was found by the new test code in #4504

Expected behavior
This expression should produce the same results on CPU and GPU.

Additional context
None

@andygrove andygrove added bug Something isn't working ? - Needs Triage Need team to review and classify labels Jan 11, 2022
@andygrove
Copy link
Contributor Author

Note that I don't think this is a very common use case so not the highest priority but needs fixing to unblock the new AST-based fuzzing approach.

Some notes on how I would implement this, in case someone else picks this issue up:

  • RegexChar and RegexEscapedChar should be combined as RegChar(ch: Char, escaped: Boolean)
  • RegexCharacterRange should be updated to accept RegexChar instead of Char
  • Add unit tests to RegularExpressionParserSuite for parsing [\a-z] and assert that this produces a character range
  • Update RegexParser as required

@sameerz sameerz removed the ? - Needs Triage Need team to review and classify label Jan 18, 2022
@anthony-chang anthony-chang self-assigned this May 12, 2022
@anthony-chang
Copy link
Contributor

Related to this issue: with the newly added support of predefined character classes (eg. \s, \h, \v etc), we need to handle the cases where these are nested in another character class (eg. [abc\s]).

cuDF doesn't support nested character classes but we should be able to expand and flatten the nested predefined classes, ie. [abc\s] becomes [abc \t\n\x0B\f\r]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
3 participants