Add support for regexp_extract on the GPU (#4285)
* Implement regexp_extract

Signed-off-by: Andy Grove <andygrove@nvidia.com>

* Add support for idx = 0 and add idx bounds checks

Signed-off-by: Andy Grove <andygrove@nvidia.com>

* scalastyle and update docs

Signed-off-by: Andy Grove <andygrove@nvidia.com>

* fix resource leak

* rework error handling for idx out of range

* move dataType definition up to GpuRegExpTernaryBase

* address feedback

* update no_match test to test other values for idx
andygrove authored Dec 13, 2021
1 parent dd6b2b1 commit 49c36ea
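
The commit message mentions support for `idx = 0` and bounds checks on `idx`. As a rough illustration of those semantics, here is a minimal CPU-side model in Python. It is not the plugin's GPU code: it uses Python's `re` engine rather than Java's or cuDF's, and validating `idx` before attempting the match is an assumption of this sketch. The error message text is also made up for illustration.

```python
import re
from typing import Optional


def regexp_extract(s: Optional[str], pattern: str, idx: int = 1) -> Optional[str]:
    """Illustrative model of regexp_extract; not the RAPIDS implementation."""
    if s is None:
        # Null input yields null output.
        return None
    compiled = re.compile(pattern)
    # Bounds check: idx must be between 0 and the number of capture groups.
    # (Checking before matching is an assumption made by this sketch.)
    if idx < 0 or idx > compiled.groups:
        raise ValueError(
            f"group count is {compiled.groups}, but the group index is {idx}"
        )
    match = compiled.search(s)
    if match is None:
        # No match produces an empty string rather than null.
        return ""
    # idx = 0 is the whole match; a group that did not participate gives "".
    return match.group(idx) or ""


print(regexp_extract("100-200", r"(\d+)-(\d+)", 0))  # 100-200
print(regexp_extract("100-200", r"(\d+)-(\d+)", 2))  # 200
print(repr(regexp_extract("foo", r"(\d+)-(\d+)", 1)))  # ''
```

An out-of-range index such as `regexp_extract("100-200", r"(\d+)-(\d+)", 3)` raises `ValueError` in this model.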
Showing 7 changed files with 640 additions and 287 deletions.
4 changes: 2 additions & 2 deletions docs/compatibility.md
@@ -446,6 +446,7 @@ The following Apache Spark regular expression functions and expressions are supp

- `RLIKE`
- `regexp`
- `regexp_extract`
- `regexp_like`
- `regexp_replace`

@@ -457,6 +458,7 @@ These operations can be enabled on the GPU with the following configuration sett

- `spark.rapids.sql.expression.RLike=true` (for `RLIKE`, `regexp`, and `regexp_like`)
- `spark.rapids.sql.expression.RegExpReplace=true` for `regexp_replace`
- `spark.rapids.sql.expression.RegExpExtract=true` for `regexp_extract`

Even when these expressions are enabled, there are instances where regular expression operations will fall back to
CPU when the RAPIDS Accelerator determines that a pattern is either unsupported or would produce incorrect results on the GPU.
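
As a small sketch of how the three settings listed in this hunk could be passed to a Spark launcher, the Python snippet below builds `--conf` arguments from them. The helper name `to_conf_args` and the script name `app.py` are hypothetical. Only the three configuration keys come from the documentation above, and a real job would also need the RAPIDS Accelerator itself to be installed and enabled.

```python
# Settings named in docs/compatibility.md; all three default to false.
REGEX_GPU_CONFS = {
    "spark.rapids.sql.expression.RLike": "true",
    "spark.rapids.sql.expression.RegExpReplace": "true",
    "spark.rapids.sql.expression.RegExpExtract": "true",
}


def to_conf_args(confs):
    """Turn a dict of Spark settings into a flat list of --conf arguments."""
    args = []
    for key, value in confs.items():
        args += ["--conf", f"{key}={value}"]
    return args


# "app.py" is a placeholder application name.
print(" ".join(["spark-submit", *to_conf_args(REGEX_GPU_CONFS), "app.py"]))
```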
@@ -475,8 +477,6 @@ Here are some examples of regular expression patterns that are not supported on

In addition to these cases that can be detected, there are also known issues that can cause incorrect results:

- `$` does not match the end of a string if the string ends with a line-terminator
([cuDF issue #9620](https://github.com/rapidsai/cudf/issues/9620))
- Character classes for negative matches have different behavior between CPU and GPU for multiline
strings. The pattern `[^a]` will match line-terminators on CPU but not on GPU.
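
The negated character class difference described in this hunk can be illustrated with Python's `re` module, used here only as a stand-in: it is neither Java's regex engine nor cuDF. Like the CPU behavior described above, `[^a]` in Python matches a newline. The second pattern, `[^a\r\n]`, is an assumed approximation of the GPU behavior, and the actual set of line terminators involved may be larger.

```python
import re

text = "a\nb"

# CPU-like behavior: the negated class also matches the line terminator.
cpu_like = re.findall(r"[^a]", text)

# Assumed approximation of the GPU behavior: line terminators are excluded.
gpu_like = re.findall(r"[^a\r\n]", text)

print(cpu_like)  # ['\n', 'b']
print(gpu_like)  # ['b']
```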

1 change: 1 addition & 0 deletions docs/configs.md
@@ -262,6 +262,7 @@ Name | SQL Function(s) | Description | Default Value | Notes
<a name="sql.expression.RLike"></a>spark.rapids.sql.expression.RLike|`rlike`|RLike|false|This is disabled by default because the implementation is not 100% compatible. See the compatibility guide for more information.|
<a name="sql.expression.Rand"></a>spark.rapids.sql.expression.Rand|`random`, `rand`|Generate a random column with i.i.d. uniformly distributed values in [0, 1)|true|None|
<a name="sql.expression.Rank"></a>spark.rapids.sql.expression.Rank|`rank`|Window function that returns the rank value within the aggregation window|true|None|
<a name="sql.expression.RegExpExtract"></a>spark.rapids.sql.expression.RegExpExtract|`regexp_extract`|RegExpExtract|false|This is disabled by default because the implementation is not 100% compatible. See the compatibility guide for more information.|
<a name="sql.expression.RegExpReplace"></a>spark.rapids.sql.expression.RegExpReplace|`regexp_replace`|RegExpReplace support for string literal input patterns|false|This is disabled by default because the implementation is not 100% compatible. See the compatibility guide for more information.|
<a name="sql.expression.Remainder"></a>spark.rapids.sql.expression.Remainder|`%`, `mod`|Remainder or modulo|true|None|
<a name="sql.expression.Rint"></a>spark.rapids.sql.expression.Rint|`rint`|Rounds up a double value to the nearest double equal to an integer|true|None|