[FEA] Make RLike support consistent with Apache Spark #3797

Closed
andygrove opened this issue Oct 12, 2021 · 6 comments · Fixed by #4044
Labels: cudf_dependency (An issue or PR with this label depends on a new feature in cudf), feature request (New feature or request)

Comments

andygrove commented Oct 12, 2021

Is your feature request related to a problem? Please describe.
PR #3796 added an initial RLike implementation but there are a number of differences with the CPU implementation as noted in the documentation:

## RLike

The GPU implementation of RLike has a number of known issues where behavior is not consistent with Apache Spark, and
this expression is therefore disabled by default. It can be enabled by setting `spark.rapids.sql.expression.RLike=true`.

A summary of known issues is shown below but this is not intended to be a comprehensive list. We recommend that you
do your own testing to verify whether the GPU implementation of `RLike` is suitable for your use case.
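For reference, a minimal sketch of turning the expression on for an existing session; the config key is the one documented above, and using `spark.conf.set` (or `--conf` at submit time) is just the standard Spark configuration mechanism:

```scala
// Enable the GPU RLike expression for the current session.
// The config key comes from the compatibility docs above.
spark.conf.set("spark.rapids.sql.expression.RLike", "true")

// Equivalent at submit time:
//   spark-submit --conf spark.rapids.sql.expression.RLike=true ...
```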

### Multi-line handling

The GPU implementation of RLike supports `^` and `$` to represent the start and end of lines within a string but
Spark uses `^` and `$` to refer to the start and end of the entire string (equivalent to `\A` and `\Z`).

| Pattern | Input  | Spark on CPU | Spark on GPU |
|---------|--------|--------------|--------------|
| `^A`    | `A\nB` | Match        | Match        |
| `A$`    | `A\nB` | No Match     | Match        |
| `^B`    | `A\nB` | No Match     | Match        |
| `B$`    | `A\nB` | Match        | Match        |
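The CPU column can be reproduced directly with `java.util.regex`, which is what Spark uses under the hood; a minimal sketch (the `Pattern.MULTILINE` variant is shown only to illustrate the line-oriented behavior the GPU currently exhibits):

```scala
import java.util.regex.Pattern

// Default JVM semantics (what Spark's RLike sees): ^ and $ anchor the whole string.
Pattern.compile("A$").matcher("A\nB").find()                    // false
Pattern.compile("^B").matcher("A\nB").find()                    // false

// Line-oriented anchors require an explicit flag on the JVM; this mirrors
// the behavior the GPU implementation currently has by default.
Pattern.compile("A$", Pattern.MULTILINE).matcher("A\nB").find() // true
Pattern.compile("^B", Pattern.MULTILINE).matcher("A\nB").find() // true
```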

### Null character in input

The GPU implementation of RLike will not match anything after a null character within a string.

| Pattern   | Input     | Spark on CPU | Spark on GPU |
|-----------|-----------|--------------|--------------|
| `A`       | `\u0000A` | Match        | No Match     |
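On the JVM a NUL character is just another character in the input, which is why the CPU match succeeds; a minimal check:

```scala
import java.util.regex.Pattern

// The NUL character has no special meaning to java.util.regex, so the
// pattern still finds the "A" that follows it.
Pattern.compile("A").matcher("\u0000A").find()   // true
```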

### Quantifiers with nothing to repeat

Spark accepts quantifiers in cases where the GPU regex engine reports that there is nothing to repeat. For example,
Spark supports `a*+` and this will match all inputs. The GPU implementation of RLike does not support this syntax
and will throw an exception with the message `nothing to repeat at position 0`.
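A minimal check of the JVM behavior described above; `a*+` compiles as a possessive quantifier and, because it can always match the empty string, `find` succeeds on every input:

```scala
import java.util.regex.Pattern

// a*+ is a possessive quantifier on the JVM; it matches zero or more 'a's,
// so it trivially matches (an empty substring of) any input.
Pattern.compile("a*+").matcher("bar").find()   // true
Pattern.compile("a*+").matcher("").find()      // true
```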

### Stricter escaping requirements

The GPU implementation of RLike has stricter requirements around escaping special characters in some cases.

| Pattern   | Input  | Spark on CPU | Spark on GPU |
|-----------|--------|--------------|--------------|
| `a[-+]`   | `a-`   | Match        | No Match     |
| `a[\-\+]` | `a-`   | Match        | Match        |
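Both forms compile and match on the JVM, since escaping `-` and `+` inside a character class is legal but redundant there; a minimal check of the CPU column:

```scala
import java.util.regex.Pattern

// Inside a JVM character class, - and + may be written with or without the
// backslash escape; both patterns match "a-".
Pattern.compile("a[-+]").matcher("a-").find()       // true
Pattern.compile("a[\\-\\+]").matcher("a-").find()   // true
```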

### Empty Groups

The GPU implementation of RLike does not support empty groups correctly.

| Pattern   | Input  | Spark on CPU | Spark on GPU |
|-----------|--------|--------------|--------------|
| `z()?`    | `a`    | No Match     | Match        |
| `z()*`    | `a`    | No Match     | Match        |
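On the JVM the empty group is accepted but contributes nothing, so the pattern still requires a literal `z` and does not match `a`; a minimal check of the CPU column:

```scala
import java.util.regex.Pattern

// The empty group () matches the empty string, so the overall pattern still
// needs a literal 'z' somewhere in the input.
Pattern.compile("z()?").matcher("a").find()   // false
Pattern.compile("z()*").matcher("a").find()   // false
```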

Describe the solution you'd like

We should make the GPU behavior consistent with CPU. We could also fall back to CPU in some cases.

Describe alternatives you've considered
None

Additional context
N/A

andygrove added the `feature request` (New feature or request) and `? - Needs Triage` (Need team to review and classify) labels on Oct 12, 2021
Salonijain27 removed the `? - Needs Triage` (Need team to review and classify) label on Oct 12, 2021
sameerz added the `cudf_dependency` (An issue or PR with this label depends on a new feature in cudf) label on Oct 12, 2021
andygrove commented Oct 13, 2021

Here are some additional notes, comparing regex support in Spark, Python, and cuDF.

Comparing Regex support between Python, cuDF, and Java/Spark

Methodology

A custom fuzzing tool was used to generate a Parquet file with a single column containing random strings.

Regular expressions were also randomly generated and verified by calling Pattern.compile. Expressions that were invalid in Java were rejected.

Each expression was evaluated against the Parquet file with and without the RAPIDS Accelerator enabled and the results
were compared.

Expressions that either caused exceptions in cuDF or produced different results between GPU and CPU were then
analyzed, reproduced in Spark and Python REPLs, and simplified where possible.
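A rough sketch of the pattern-validation step described above; the `randomPattern` generator here is a hypothetical stand-in for the actual fuzzing tool, which is not shown:

```scala
import java.util.regex.{Pattern, PatternSyntaxException}
import scala.util.Random

// Hypothetical stand-in for the fuzzer's pattern generator.
def randomPattern(rng: Random): String = {
  val alphabet = "abAB01^$.*+?()[]{}|\\"
  (0 until (1 + rng.nextInt(8))).map(_ => alphabet(rng.nextInt(alphabet.length))).mkString
}

// Keep only candidates that Java's Pattern.compile accepts, mirroring the
// "reject expressions that are invalid in Java" step.
def validPatterns(n: Int, rng: Random = new Random(42)): Seq[String] =
  Iterator.continually(randomPattern(rng))
    .filter { p =>
      try { Pattern.compile(p); true }
      catch { case _: PatternSyntaxException => false }
    }
    .take(n)
    .toSeq
```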

Scala / Java

This Java code mirrors how Spark evaluates regular expressions in the `rlike` expression.

scala> import org.apache.commons.text.StringEscapeUtils
import org.apache.commons.text.StringEscapeUtils

scala> import java.util.regex.Pattern
import java.util.regex.Pattern

scala> Pattern.compile(StringEscapeUtils.escapeJava("o{2}")).matcher("foo").find(0)
res12: Boolean = true

scala> Pattern.compile(StringEscapeUtils.escapeJava("o{2}")).matcher("bar").find(0)
res11: Boolean = false

Spark

scala> Seq("foo", "bar").toDF("c0").withColumn("rlike", expr("c0 RLIKE 'o{2}'")).show
+---+-----+
| c0|rlike|
+---+-----+
|foo| true|
|bar|false|
+---+-----+

Python

>>> import re
>>> print(re.compile('o{2}').search("foo"))
<re.Match object; span=(1, 3), match='oo'>
>>> print(re.compile('o{2}').search("bar"))
None

cuDF

>>> import cudf
>>> s1 = cudf.Series(['foo', 'bar'])
>>> s1.str.contains('o{2}', regex=True)
0     True
1    False
dtype: bool

Categories of Issue

Spark supports stacked quantifiers but Python and cuDF do not

| Pattern | Spark              | Python                          | cuDF                              |
|---------|--------------------|---------------------------------|-----------------------------------|
| `a*+`   | Matches all inputs | `multiple repeat at position 2` | `nothing to repeat at position 2` |
| `\|a`   | Matches all inputs | `multiple repeat at position 0` | `nothing to repeat at position 0` |
scala> Seq("", "a", "b", "bar").toDF("c0").withColumn("rlike", expr("c0 RLIKE 'a*+'")).show
+---+-----+
| c0|rlike|
+---+-----+
|   | true|
|  a| true|
|  b| true|
|bar| true|
+---+-----+

cuDF supports multi-line inputs

| Pattern | Input  | Spark | Python | cuDF |
|---------|--------|-------|--------|------|
| `A$`    | `A\nB` | false | false  | true |
| `^B`    | `A\nB` | false | false  | true |

Null character in input

| Pattern | Input      | Spark | Python | cuDF  |
|---------|------------|-------|--------|-------|
| `]B`    | `\u0000]B` | true  | true   | false |

TBD - still to be analyzed

| Pattern   | Input  | Spark | Python | cuDF |
|-----------|--------|-------|--------|------|
| `8b()?1+` | `a12b` | false | false  | true |

andygrove commented:
@revans2 Here are the results of the audit of regex behavior between Spark/Java, Python, and cuDF. These are the issues that I keep running into; once they are resolved, there may be others that I have not yet seen.

revans2 commented Oct 13, 2021

Thanks. It looks like we should start with the multi-line and null support simply because they are the same in all CPU environments.

andygrove commented Oct 14, 2021

There is an existing cuDF issue related to cuDF's stricter escaping requirements compared to Python and Java.

andygrove commented Oct 14, 2021

andygrove commented Oct 18, 2021
