Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fall back to CPU for unsupported regular expression edge cases with end of line/string anchors and newlines #5610

Merged
merged 34 commits into from
Jun 3, 2022
Merged
Show file tree
Hide file tree
Changes from 21 commits
Commits
Show all changes
34 commits
Select commit Hold shift + click to select a range
79883d4
detect unsupported regexp edge cases
andygrove May 24, 2022
1e6f126
bug fix
andygrove May 24, 2022
43011cd
Update sql-plugin/src/main/scala/com/nvidia/spark/rapids/RegexParser.…
andygrove May 24, 2022
46471ef
simplify code and remove duplicate checks
andygrove May 24, 2022
224bfb2
address feedback
andygrove May 24, 2022
74dd695
save
andygrove May 24, 2022
d98ff09
save
andygrove May 24, 2022
969a045
Use only distinct components in transpiled negated character classes
anthony-chang May 24, 2022
3a10384
Merge remote-tracking branch 'anthony-chang/handle-negated-class-edge…
andygrove May 24, 2022
f9ec1fb
Merge remote-tracking branch 'nvidia/branch-22.06' into handle-regexp…
andygrove May 25, 2022
d856a15
improved checks
andygrove May 25, 2022
4d4d250
revert
andygrove May 25, 2022
a87fb5d
enable another test
andygrove May 25, 2022
78be837
revert another test change
andygrove May 25, 2022
df337b0
revert fuzzer char change
andygrove May 25, 2022
0e128f3
Revert to broad checks for unsupported patterns
andygrove May 26, 2022
2e9f4e0
update expected error
andygrove May 26, 2022
ef9fc5d
add more characters to pattern generator
andygrove May 26, 2022
c6c010d
fail on error
andygrove May 26, 2022
4aa0be8
rename methods from isMaybe to contains
andygrove May 26, 2022
04b2619
update compatibility guide
andygrove May 26, 2022
53330ef
address feedback
andygrove May 26, 2022
a641fad
address feedback
andygrove May 26, 2022
a364dba
update tests for \f with end of line anchor
andygrove May 26, 2022
8ea6c4b
add extra cases to containsNewline
andygrove May 26, 2022
d9a414f
ignore two tests
andygrove May 26, 2022
12607b7
upmerge
andygrove May 27, 2022
2fd9251
update expected error message
andygrove May 27, 2022
7a81dba
revert change to number of generated patterns
andygrove May 31, 2022
7cc007a
Merge remote-tracking branch 'nvidia/branch-22.06' into handle-regexp…
andygrove May 31, 2022
33cf006
remove \u2028 and \u2028 from fuzz data
andygrove May 31, 2022
d83ac04
scalastyle
andygrove May 31, 2022
ef22adc
try removing more chars
andygrove May 31, 2022
9765ab8
remove \u0085 from fuzz test
andygrove May 31, 2022
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions docs/compatibility.md
Original file line number Diff line number Diff line change
Expand Up @@ -587,6 +587,8 @@ Here are some examples of regular expression patterns that are not supported on
- Line anchor `$` is not supported by `regexp_replace`, and in some rare contexts.
- String anchor `\Z` is not supported by `regexp_replace`, and in some rare contexts.
- String anchor `\z` is not supported by `regexp_replace`
- Patterns containing an end of line or string anchor immediately next to a newline or repetition that produces zero
or more results
- Line and string anchors are not supported by `string_split` and `str_to_map`
- Non-digit character class `\D`
- Non-word character class `\W`
Expand Down
118 changes: 117 additions & 1 deletion sql-plugin/src/main/scala/com/nvidia/spark/rapids/RegexParser.scala
Original file line number Diff line number Diff line change
Expand Up @@ -19,6 +19,8 @@ import java.sql.SQLException

import scala.collection.mutable.ListBuffer

import com.nvidia.spark.rapids.RegexParser.toReadableString

/**
* Regular expression parser based on a Pratt Parser design.
*
Expand Down Expand Up @@ -509,6 +511,21 @@ object RegexParser {
true
}
}

def toReadableString(x: String): String = {
x.map {
case '\r' => "\\r"
case '\n' => "\\n"
case '\t' => "\\t"
case '\f' => "\\f"
case '\u000b' => "\\u000b"
case '\u0085' => "\\u0085"
case '\u2028' => "\\u2028"
case '\u2029' => "\\u2029"
case other => other
}.mkString
}

}

sealed trait RegexMode
Expand Down Expand Up @@ -559,7 +576,7 @@ class CudfRegexTranspiler(mode: RegexMode) {
val replacement = repl.map(s => new RegexParser(s).parseReplacement(countCaptureGroups(regex)))

// validate that the regex is supported by cuDF
val cudfRegex = rewrite(regex, replacement, None)
val cudfRegex = transpile(regex, replacement, None)
// write out to regex string, performing minor transformations
// such as adding additional escaping
(cudfRegex.toRegexString, replacement.map(_.toRegexString))
Expand Down Expand Up @@ -696,6 +713,90 @@ class CudfRegexTranspiler(mode: RegexMode) {
}
}

private def transpile(regex: RegexAST, replacement: Option[RegexReplacement],
previous: Option[RegexAST]): RegexAST = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: replacement and previous is never used here, just passed on to rewrite. It may be more descriptive to remove these, rename this method to checkUnsupported (take the code out of the closure), and then call this and rewrite inside transpile


def containsBeginAnchor(regex: RegexAST): Boolean = {
contains(regex, {
case RegexChar('^') | RegexEscaped('A') => true
case _ => false
})
}

def containsEndAnchor(regex: RegexAST): Boolean = {
contains(regex, {
case RegexChar('$') | RegexEscaped('z') | RegexEscaped('Z') => true
Copy link
Contributor

@anthony-chang anthony-chang May 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing we need to do (probably not in this PR) is to not check inside character classes to fix false positives such as 0*[D$3]. Same thing with the begin anchor

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another case would be something like \na$ will work but (\na)$ will not be supported. We should only be checking the node closest to the $ in groups

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opened #5659 to track this

case _ => false
})
}

def containsNewline(regex: RegexAST): Boolean = {
contains(regex, {
case RegexChar('\r') | RegexEscaped('r') => true
case RegexChar('\n') | RegexEscaped('n') => true
case RegexChar('\f') | RegexEscaped('f') => true
andygrove marked this conversation as resolved.
Show resolved Hide resolved
case RegexChar('\u0085') | RegexChar('\u2028') | RegexChar('\u2029') => true
case RegexEscaped('s') | RegexEscaped('v') | RegexEscaped('R') => true
case RegexEscaped('W') | RegexEscaped('D') =>
// these would get transpiled to negated character classes
// that include newlines
true
case RegexCharacterClass(true, _) => true
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: there is an edge case here that can prevent this from being a newline, and that being a negated character class that includes all new line characters. I see no need to handle this at the moment, but should be noted in case this potentially comes up in some testing scenarios.

case _ => false
})
}

def containsEmpty(regex: RegexAST): Boolean = {
contains(regex, {
case RegexRepetition(_, term) => term match {
case SimpleQuantifier('*') | SimpleQuantifier('?') => true
case QuantifierFixedLength(0) => true
case QuantifierVariableLength(0, _) => true
case _ => false
}
case _ => false
})
}

def checkPair(r1: RegexAST, r2: RegexAST): Unit = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: would be nice for this to have a more descriptive name, like "checkEndAnchorContext", so it's more obvious to the reader what this is checking.

if ((containsEndAnchor(r1) &&
(containsNewline(r2) || containsEmpty(r2) || containsBeginAnchor(r2))) ||
(containsEndAnchor(r2) &&
(containsNewline(r1) || containsEmpty(r1) || containsBeginAnchor(r1)))) {
throw new RegexUnsupportedException(
s"End of line/string anchor is not supported in this context: " +
s"${toReadableString(r1.toRegexString)}" +
s"${toReadableString(r2.toRegexString)}")
}
}

def checkUnsupported(regex: RegexAST): Unit = {
regex match {
andygrove marked this conversation as resolved.
Show resolved Hide resolved
case RegexSequence(parts) =>
// check each pair of regex ast nodes for unsupported combinations
// of end string/line anchors and newlines or optional items
for (i <- 1 until parts.length) {
checkPair(parts(i - 1), parts(i))
}
case RegexChoice(l, r) =>
checkUnsupported(l)
checkUnsupported(r)
case RegexGroup(_, term) => checkUnsupported(term)
case RegexRepetition(ast, _) => checkUnsupported(ast)
case RegexCharacterClass(_, components) =>
for (i <- 1 until components.length) {
checkPair(components(i - 1), components(i))
}
case _ =>
// ignore
}
}

checkUnsupported(regex)

rewrite(regex, replacement, previous)
}

private def rewrite(regex: RegexAST, replacement: Option[RegexReplacement],
previous: Option[RegexAST]): RegexAST = {
regex match {
Expand Down Expand Up @@ -1162,6 +1263,21 @@ class CudfRegexTranspiler(mode: RegexMode) {
}
}

private def contains(regex: RegexAST, f: RegexAST => Boolean): Boolean = {
if (f(regex)) {
true
} else {
regex match {
case RegexSequence(parts) => parts.exists(x => contains(x, f))
case RegexGroup(_, term) => contains(term, f)
case RegexChoice(l, r) => contains(l, f) || contains(r, f)
case RegexRepetition(term, _) => contains(term, f)
case RegexCharacterClass(_, chars) => chars.exists(ch => contains(ch, f))
case leaf => f(leaf)
}
}
}

private def isBeginOrEndLineAnchor(regex: RegexAST): Boolean = regex match {
case RegexSequence(parts) => parts.nonEmpty && parts.forall(isBeginOrEndLineAnchor)
case RegexGroup(_, term) => isBeginOrEndLineAnchor(term)
Expand Down
Loading