Implement regexp_extract
Signed-off-by: Andy Grove <andygrove@nvidia.com>
andygrove committed Dec 6, 2021
1 parent d3c5847 commit d8c9c0e
Showing 7 changed files with 583 additions and 287 deletions.
4 changes: 2 additions & 2 deletions docs/compatibility.md
@@ -481,6 +481,7 @@ The following Apache Spark regular expression functions and expressions are supp
- `regexp`
- `regexp_like`
- `regexp_replace`
- `regexp_extract`
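
For readers unfamiliar with `regexp_extract`, the following is a minimal sketch of the semantics Spark documents for it: return the requested capture group of the first match, or an empty string when the pattern or the group does not match. It uses Python's `re` module as a stand-in, which is an assumption made for illustration only; Spark uses the Java regular expression engine and the plugin uses cuDF, and neither is exercised here.

```python
import re


def regexp_extract(text, pattern, idx=1):
    """Return capture group `idx` of the first match, or "" if there is none.

    Illustrative only: this mirrors the documented Spark behaviour using
    Python's `re`, not the Java or cuDF engines discussed in this commit.
    """
    match = re.search(pattern, text)
    if match is None:
        # The pattern did not match anywhere in the input.
        return ""
    group = match.group(idx)
    # An optional group that did not participate in the match yields None.
    return group if group is not None else ""


print(regexp_extract("100-200", r"(\d+)-(\d+)", 1))
print(regexp_extract("100-200", r"(\d+)-(\d+)", 2))
```

The default of `idx=1` follows the Spark SQL function, where the group index is optional.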

These operations are disabled by default because of known incompatibilities between the Java regular expression
engine that Spark uses and the cuDF regular expression engine on the GPU, and also because the regular expression
@@ -490,6 +491,7 @@ These operations can be enabled on the GPU with the following configuration sett

- `spark.rapids.sql.expression.RLike=true` (for `RLIKE`, `regexp`, and `regexp_like`)
- `spark.rapids.sql.expression.RegExpReplace=true` for `regexp_replace`
- `spark.rapids.sql.expression.RegExpExtract=true` for `regexp_extract`
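
As a rough illustration of how the three settings above might be passed to a job, the sketch below turns them into `--conf key=value` arguments of the kind `spark-submit` accepts. The configuration keys come from the list above; the helper name `to_submit_args` and the dictionary name `REGEX_CONFS` are hypothetical and exist only for this example.

```python
# Configuration keys taken from the list above; all three default to false.
REGEX_CONFS = {
    "spark.rapids.sql.expression.RLike": "true",
    "spark.rapids.sql.expression.RegExpReplace": "true",
    "spark.rapids.sql.expression.RegExpExtract": "true",
}


def to_submit_args(confs):
    """Turn a config mapping into `--conf key=value` arguments (hypothetical helper)."""
    args = []
    for key, value in confs.items():
        args.extend(["--conf", f"{key}={value}"])
    return args


# Print the arguments as they would appear on a spark-submit command line.
print(" ".join(to_submit_args(REGEX_CONFS)))
```

Enabling a setting only makes the expression eligible for the GPU; as noted below, individual patterns can still fall back to the CPU.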

Even when these expressions are enabled, there are instances where regular expression operations will fall back to
CPU when the RAPIDS Accelerator determines that a pattern is either unsupported or would produce incorrect results on the GPU.
@@ -508,8 +510,6 @@ Here are some examples of regular expression patterns that are not supported on

In addition to these cases that can be detected, there are also known issues that can cause incorrect results:

- `$` does not match the end of a string if the string ends with a line-terminator
([cuDF issue #9620](https://github.com/rapidsai/cudf/issues/9620))
- Character classes for negative matches have different behavior between CPU and GPU for multiline
strings. The pattern `[^a]` will match line-terminators on CPU but not on GPU.
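
The negated character class difference can be pictured with a small sketch. It uses Python's `re` module as a stand-in for the CPU-side behaviour described above, on the assumption that it treats `[^a]` the same way for a `\n` line terminator. The second pattern, `[^a\n]`, is only an approximation of the GPU behaviour as described in the text; it is not cuDF code.

```python
import re

text = "a\nb"

# CPU-like behaviour: "any character except a" includes the newline.
cpu_like = re.findall(r"[^a]", text)

# Approximation of the described GPU behaviour: line terminators are
# excluded, which is emulated here by adding \n to the negated class.
gpu_like = re.findall(r"[^a\n]", text)

print(cpu_like)
print(gpu_like)
```

On multiline input the two lists differ by exactly the newline character, which is the discrepancy the bullet above warns about.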

1 change: 1 addition & 0 deletions docs/configs.md
@@ -262,6 +262,7 @@ Name | SQL Function(s) | Description | Default Value | Notes
<a name="sql.expression.RLike"></a>spark.rapids.sql.expression.RLike|`rlike`|RLike|false|This is disabled by default because the implementation is not 100% compatible. See the compatibility guide for more information.|
<a name="sql.expression.Rand"></a>spark.rapids.sql.expression.Rand|`random`, `rand`|Generate a random column with i.i.d. uniformly distributed values in [0, 1)|true|None|
<a name="sql.expression.Rank"></a>spark.rapids.sql.expression.Rank|`rank`|Window function that returns the rank value within the aggregation window|true|None|
<a name="sql.expression.RegExpExtract"></a>spark.rapids.sql.expression.RegExpExtract|`regexp_extract`|RegExpExtract|false|This is disabled by default because the implementation is not 100% compatible. See the compatibility guide for more information.|
<a name="sql.expression.RegExpReplace"></a>spark.rapids.sql.expression.RegExpReplace|`regexp_replace`|RegExpReplace support for string literal input patterns|false|This is disabled by default because the implementation is not 100% compatible. See the compatibility guide for more information.|
<a name="sql.expression.Remainder"></a>spark.rapids.sql.expression.Remainder|`%`, `mod`|Remainder or modulo|true|None|
<a name="sql.expression.Rint"></a>spark.rapids.sql.expression.Rint|`rint`|Rounds up a double value to the nearest double equal to an integer|true|None|