-
Notifications
You must be signed in to change notification settings - Fork 232
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reimplement check for non-regexp strings using RegexParser #4681
Conversation
tests/src/test/scala/com/nvidia/spark/rapids/RegularExpressionParserSuite.scala
Outdated
Show resolved
Hide resolved
…ParserSuite.scala Co-authored-by: Nghia Truong <ttnghia@users.noreply.github.com>
tests/src/test/scala/com/nvidia/spark/rapids/RegularExpressionParserSuite.scala
Show resolved
Hide resolved
Signed-off-by: Andy Grove <andygrove@nvidia.com>
} | ||
|
||
test("detect non-regexp strings") { | ||
val strings = Seq("\\.", "A", ",", "\t", ":", "") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks wrong. For example if I did a split with "\\."
as the separator on "a.b.c"
I would expect an output of ["a", "b", "c"]
. But I cannot take "\\."
and pass it into a non-regexp string split, because it will try to look for a "\\"
character in front of the '.'
. Am I wrong/missing something? I think we could still support it, but we would have to transpile the regular expression into a constant string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I agree, this isn't what we want here. Moving this to draft for now and will rethink this approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated to treat any string containing escaped characters as a regular expression.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we have a follow on issue to do the transpile that I suggested. From a performance and memory standpoint NOT doing a regular expression is going to be much faster than trying to do it.
I've filed #4685 for this |
build |
This PR reimplements the check used in GpuStringSplit to determine if a string is a regular expression or not. The previous approach had limitations and now that we have a regular expression parser we can improve the check so that there are fewer false positives.
This change is needed to support the work for #3542