Handle regexp_replace inconsistency with empty strings and zero-repetition patterns [databricks] #5740

Merged
merged 18 commits into from
Jun 8, 2022

Conversation

@anthony-chang (Contributor) commented Jun 3, 2022

Fixes #5456

When regexp_replace is applied to an empty string with a regex pattern that contains only zero-repetition constructs (e.g. *, ?, {0,}), the result differs depending on the Spark version.

Given the input

df = spark.sparkContext.parallelize([[""],[""],["AAA"]]).toDF(["a"])
df.selectExpr("regexp_replace(a,'A*','_REPLACED_')").show()

Spark versions 3.1.3, 3.1.4, 3.2.2, and 3.3.0+ give

+-------------------------------------+
|regexp_replace(a, A*, _REPLACED_, 1) |
+-------------------------------------+
|                          _REPLACED_ |
|                          _REPLACED_ |
|                _REPLACED__REPLACED_ |
+-------------------------------------+

but all other versions give

+-------------------------------------+
|regexp_replace(a, A*, _REPLACED_, 1) |
+-------------------------------------+
|                                     |
|                                     |
|                _REPLACED__REPLACED_ |
+-------------------------------------+

This PR adds shims for the second case that short-circuit the plugin's regex handling.
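The two behaviours shown above can be modelled outside Spark with a small sketch. This is only an illustration in plain Python (3.7 or later, whose `re.sub` happens to produce the same results as the first table for this pattern); it is not the plugin's shim code or Spark's implementation. The function name `regexp_replace_sketch` and the flag `empty_input_matches` are hypothetical names chosen for this example.

```python
import re


def regexp_replace_sketch(value, pattern, replacement, empty_input_matches=True):
    """Model the two observed behaviours; not Spark's actual implementation."""
    # Hypothetical short circuit: in the second case an empty input is
    # returned unchanged, even though the pattern can match the empty string.
    if value == "" and not empty_input_matches:
        return value
    # Otherwise apply the replacement to every match, including empty matches.
    return re.sub(pattern, replacement, value)


rows = ["", "", "AAA"]

# First case: the empty string is matched and replaced.
first_case = [regexp_replace_sketch(r, "A*", "_REPLACED_") for r in rows]

# Second case: the empty string passes through untouched.
second_case = [
    regexp_replace_sketch(r, "A*", "_REPLACED_", empty_input_matches=False)
    for r in rows
]

print(first_case)
print(second_case)
```

In both cases the "AAA" row is replaced twice, once for the match "AAA" and once for the empty match at the end of the string, so only the empty-string rows distinguish the two behaviours.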

Signed-off-by: Anthony Chang <antchang@nvidia.com>

andygrove and others added 4 commits June 3, 2022 11:35
Signed-off-by: Anthony Chang <antchang@nvidia.com>
Signed-off-by: Anthony Chang <antchang@nvidia.com>
@anthony-chang anthony-chang self-assigned this Jun 3, 2022
@sameerz sameerz added the audit_3.3.0 Audit related tasks for 3.3.0 label Jun 4, 2022
Signed-off-by: Anthony Chang <antchang@nvidia.com>
Signed-off-by: Anthony Chang <antchang@nvidia.com>
@andygrove
Copy link
Contributor

@anthony-chang I know this is still WIP, but could you update the PR description to explain what is changed in this PR?

Signed-off-by: Anthony Chang <antchang@nvidia.com>
@anthony-chang
Copy link
Contributor Author

build

Signed-off-by: Anthony Chang <antchang@nvidia.com>
@anthony-chang
Copy link
Contributor Author

build

Signed-off-by: Anthony Chang <antchang@nvidia.com>
@anthony-chang
Copy link
Contributor Author

build

Signed-off-by: Anthony Chang <antchang@nvidia.com>
Signed-off-by: Anthony Chang <antchang@nvidia.com>
Signed-off-by: Anthony Chang <antchang@nvidia.com>
@anthony-chang
Copy link
Contributor Author

build

@anthony-chang anthony-chang changed the title from "[WIP] Handle regexp_replace inconsistency with empty strings and zero-repetition patterns" to "[WIP] Handle regexp_replace inconsistency with empty strings and zero-repetition patterns [databricks]" Jun 7, 2022
Signed-off-by: Anthony Chang <antchang@nvidia.com>
Signed-off-by: Anthony Chang <antchang@nvidia.com>
@anthony-chang
Copy link
Contributor Author

build

@anthony-chang anthony-chang marked this pull request as ready for review June 8, 2022 16:30
@anthony-chang anthony-chang changed the title from "[WIP] Handle regexp_replace inconsistency with empty strings and zero-repetition patterns [databricks]" to "Handle regexp_replace inconsistency with empty strings and zero-repetition patterns [databricks]" Jun 8, 2022
Co-authored-by: Andy Grove <andygrove73@gmail.com>
@anthony-chang
Copy link
Contributor Author

build

Signed-off-by: Anthony Chang <antchang@nvidia.com>
Signed-off-by: Anthony Chang <antchang@nvidia.com>
@anthony-chang
Copy link
Contributor Author

build

Labels
audit_3.3.0 Audit related tasks for 3.3.0
Development

Successfully merging this pull request may close these issues.

[BUG] Handle regexp_replace inconsistency from https://issues.apache.org/jira/browse/SPARK-39107
4 participants