[FEA] Make RLike support consistent with Apache Spark #3797

Closed
andygrove opened this issue Oct 12, 2021 · 6 comments · Fixed by #4044
Labels: cudf_dependency (An issue or PR with this label depends on a new feature in cudf), feature request (New feature or request)

Comments

andygrove commented Oct 12, 2021

Is your feature request related to a problem? Please describe.
PR #3796 added an initial RLike implementation but there are a number of differences with the CPU implementation as noted in the documentation:

## RLike

The GPU implementation of RLike has a number of known issues where behavior is not consistent with Apache Spark, and
this expression is therefore disabled by default. It can be enabled by setting `spark.rapids.sql.expression.RLike=true`.

A summary of known issues is shown below but this is not intended to be a comprehensive list. We recommend that you
do your own testing to verify whether the GPU implementation of `RLike` is suitable for your use case.
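For reference, a minimal sketch of turning the expression on for an existing session; the config key is the one documented above, and using `spark.conf.set` (or `--conf` at submit time) is just the standard Spark configuration mechanism:

```scala
// Enable the GPU RLike expression for the current session.
// The config key comes from the compatibility docs above.
spark.conf.set("spark.rapids.sql.expression.RLike", "true")

// Equivalent at submit time:
//   spark-submit --conf spark.rapids.sql.expression.RLike=true ...
```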

### Multi-line handling

The GPU implementation of RLike supports `^` and `$` to represent the start and end of lines within a string but
Spark uses `^` and `$` to refer to the start and end of the entire string (equivalent to `\A` and `\Z`).

| Pattern | Input  | Spark on CPU | Spark on GPU |
|---------|--------|--------------|--------------|
| `^A`    | `A\nB` | Match        | Match        |
| `A$`    | `A\nB` | No Match     | Match        |
| `^B`    | `A\nB` | No Match     | Match        |
| `B$`    | `A\nB` | Match        | Match        |
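The CPU column can be reproduced directly with `java.util.regex`, which is what Spark uses under the hood; a minimal sketch (the `Pattern.MULTILINE` variant is shown only to illustrate the line-oriented behavior the GPU currently exhibits):

```scala
import java.util.regex.Pattern

// Default JVM semantics (what Spark's RLike sees): ^ and $ anchor the whole string.
Pattern.compile("A$").matcher("A\nB").find()                    // false
Pattern.compile("^B").matcher("A\nB").find()                    // false

// Line-oriented anchors require an explicit flag on the JVM; this mirrors
// the behavior the GPU implementation currently has by default.
Pattern.compile("A$", Pattern.MULTILINE).matcher("A\nB").find() // true
Pattern.compile("^B", Pattern.MULTILINE).matcher("A\nB").find() // true
```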

### Null character in input

The GPU implementation of RLike will not match anything after a null character within a string.

| Pattern   | Input     | Spark on CPU | Spark on GPU |
|-----------|-----------|--------------|--------------|
| `A`       | `\u0000A` | Match        | No Match     |
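On the JVM a NUL character is just another character in the input, which is why the CPU match succeeds; a minimal check:

```scala
import java.util.regex.Pattern

// The NUL character has no special meaning to java.util.regex, so the
// pattern still finds the "A" that follows it.
Pattern.compile("A").matcher("\u0000A").find()   // true
```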

### Quantifiers with nothing to repeat

Spark accepts quantifiers in cases where the GPU regex engine reports that there is nothing to repeat. For example,
Spark supports `a*+` and this will match all inputs. The GPU implementation of RLike does not support this syntax
and will throw an exception with the message `nothing to repeat at position 0`.
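A minimal check of the JVM behavior described above; `a*+` compiles as a possessive quantifier and, because it can always match the empty string, `find` succeeds on every input:

```scala
import java.util.regex.Pattern

// a*+ is a possessive quantifier on the JVM; it matches zero or more 'a's,
// so it trivially matches (an empty substring of) any input.
Pattern.compile("a*+").matcher("bar").find()   // true
Pattern.compile("a*+").matcher("").find()      // true
```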

### Stricter escaping requirements

The GPU implementation of RLike has stricter requirements around escaping special characters in some cases.

| Pattern   | Input  | Spark on CPU | Spark on GPU |
|-----------|--------|--------------|--------------|
| `a[-+]`   | `a-`   | Match        | No Match     |
| `a[\-\+]` | `a-`   | Match        | Match        |
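Both forms compile and match on the JVM, since escaping `-` and `+` inside a character class is legal but redundant there; a minimal check of the CPU column:

```scala
import java.util.regex.Pattern

// Inside a JVM character class, - and + may be written with or without the
// backslash escape; both patterns match "a-".
Pattern.compile("a[-+]").matcher("a-").find()       // true
Pattern.compile("a[\\-\\+]").matcher("a-").find()   // true
```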

### Empty Groups

The GPU implementation of RLike does not support empty groups correctly.

| Pattern   | Input  | Spark on CPU | Spark on GPU |
|-----------|--------|--------------|--------------|
| `z()?`    | `a`    | No Match     | Match        |
| `z()*`    | `a`    | No Match     | Match        |
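On the JVM the empty group is accepted but contributes nothing, so the pattern still requires a literal `z` and does not match `a`; a minimal check of the CPU column:

```scala
import java.util.regex.Pattern

// The empty group () matches the empty string, so the overall pattern still
// needs a literal 'z' somewhere in the input.
Pattern.compile("z()?").matcher("a").find()   // false
Pattern.compile("z()*").matcher("a").find()   // false
```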

Describe the solution you'd like

We should make the GPU behavior consistent with CPU. We could also fall back to CPU in some cases.

Describe alternatives you've considered
None

Additional context
N/A

andygrove added the `feature request` (New feature or request) and `? - Needs Triage` (Need team to review and classify) labels on Oct 12, 2021
Salonijain27 removed the `? - Needs Triage` (Need team to review and classify) label on Oct 12, 2021
sameerz added the `cudf_dependency` (An issue or PR with this label depends on a new feature in cudf) label on Oct 12, 2021
andygrove commented Oct 13, 2021

Here are some additional notes, comparing regex support in Spark, Python, and cuDF.

Comparing Regex support between Python, cuDF, and Java/Spark

Methodology

A custom fuzzing tool was used to generate a Parquet file with a single column containing random strings.

Regular expressions were also randomly generated and verified by calling Pattern.compile. Expressions that were invalid in Java were rejected.

Each expression was evaluated against the Parquet file with and without the RAPIDS Accelerator enabled and the results
were compared.

Expressions that either caused exceptions in cuDF or produced different results between GPU and CPU were then
analyzed, reproduced in Spark and Python REPLs, and simplified where possible.
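A rough sketch of the pattern-validation step described above; the `randomPattern` generator here is a hypothetical stand-in for the actual fuzzing tool, which is not shown:

```scala
import java.util.regex.{Pattern, PatternSyntaxException}
import scala.util.Random

// Hypothetical stand-in for the fuzzer's pattern generator.
def randomPattern(rng: Random): String = {
  val alphabet = "abAB01^$.*+?()[]{}|\\"
  (0 until (1 + rng.nextInt(8))).map(_ => alphabet(rng.nextInt(alphabet.length))).mkString
}

// Keep only candidates that Java's Pattern.compile accepts, mirroring the
// "reject expressions that are invalid in Java" step.
def validPatterns(n: Int, rng: Random = new Random(42)): Seq[String] =
  Iterator.continually(randomPattern(rng))
    .filter { p =>
      try { Pattern.compile(p); true }
      catch { case _: PatternSyntaxException => false }
    }
    .take(n)
    .toSeq
```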

Scala / Java

This Java code mirrors how Spark evaluates regular expressions in the `rlike` expression.

scala> import org.apache.commons.text.StringEscapeUtils
import org.apache.commons.text.StringEscapeUtils

scala> import java.util.regex.Pattern
import java.util.regex.Pattern

scala> Pattern.compile(StringEscapeUtils.escapeJava("o{2}")).matcher("foo").find(0)
res12: Boolean = true

scala> Pattern.compile(StringEscapeUtils.escapeJava("o{2}")).matcher("bar").find(0)
res11: Boolean = false

Spark

scala> Seq("foo", "bar").toDF("c0").withColumn("rlike", expr("c0 RLIKE 'o{2}'")).show
+---+-----+
| c0|rlike|
+---+-----+
|foo| true|
|bar|false|
+---+-----+

Python

>>> import re
>>> print(re.compile('o{2}').search("foo"))
<re.Match object; span=(1, 3), match='oo'>
>>> print(re.compile('o{2}').search("bar"))
None

cuDF

>>> import cudf
>>> s1 = cudf.Series(['foo', 'bar'])
>>> s1.str.contains('o{2}', regex=True)
0     True
1    False
dtype: bool

Categories of Issue

Spark supports stacked quantifiers but Python and cuDF do not

| Pattern | Spark              | Python                          | cuDF                              |
|---------|--------------------|---------------------------------|-----------------------------------|
| `a*+`   | Matches all inputs | `multiple repeat at position 2` | `nothing to repeat at position 2` |
| `\|a`   | Matches all inputs | `multiple repeat at position 0` | `nothing to repeat at position 0` |
scala> Seq("", "a", "b", "bar").toDF("c0").withColumn("rlike", expr("c0 RLIKE 'a*+'")).show
+---+-----+
| c0|rlike|
+---+-----+
|   | true|
|  a| true|
|  b| true|
|bar| true|
+---+-----+

cuDF supports multi-line inputs

| Pattern | Input  | Spark | Python | cuDF |
|---------|--------|-------|--------|------|
| `A$`    | `A\nB` | false | false  | true |
| `^B`    | `A\nB` | false | false  | true |

Null character in input

| Pattern | Input      | Spark | Python | cuDF  |
|---------|------------|-------|--------|-------|
| `]B`    | `\u0000]B` | true  | true   | false |

TBD - still to be analyzed

| Pattern   | Input  | Spark | Python | cuDF |
|-----------|--------|-------|--------|------|
| `8b()?1+` | `a12b` | false | false  | true |

andygrove commented:
@revans2 Here are the results of the audit of regex behavior between Spark/Java, Python, and cuDF. These are the issues that I keep running into; once they are resolved, there may be others that I have not yet seen.

revans2 commented Oct 13, 2021

Thanks. It looks like we should start with the multi-line and null support simply because they are the same in all CPU environments.

andygrove commented Oct 14, 2021

There is an existing cuDF issue related to cuDF's stricter escaping requirements compared to Python and Java.

andygrove commented Oct 14, 2021

andygrove commented Oct 18, 2021
