Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add regular expression support to string_split #4714

Merged
merged 16 commits into from
Feb 14, 2022

Conversation

andygrove
Copy link
Contributor

@andygrove andygrove commented Feb 7, 2022

Signed-off-by: Andy Grove andygrove@nvidia.com

Closes #4003

Depends on rapidsai/cudf#10139

Follow-on issues:

Status:

  • Draft implementation
  • Move some logic from GpuStringSplit to GpuStringSplitMeta
  • Update compatibility guide
  • File follow-on issues for limit = 0 or 1, and supporting line and string anchors

Signed-off-by: Andy Grove <andygrove@nvidia.com>
@andygrove andygrove added this to the Jan 31 - Feb 11 milestone Feb 7, 2022
@andygrove andygrove self-assigned this Feb 7, 2022
Comment on lines +1337 to +1338
case class GpuStringSplit(str: Expression, regex: Expression, limit: Expression,
isRegExp: Boolean, pattern: String)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we get rid of the regex expression completely? It is now useless since we use pattern instead.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this can be achieved if you override columnarEval instead of doColumnar similar to https://github.com/NVIDIA/spark-rapids/pull/4636/files#diff-a12810882b81a4eb395c03a80951f96ec080db793ffed6755739eeb2122840ccR1432

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be switching this from a GpuTernaryExpression to a GpuBinaryExpression. I personally don't see it as a big deal either way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer to stick with GpuTernaryExpression to match Spark

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But we don't have to, right? I don't see any benefit from keeping it a TernaryExpr instead of just UnaryExpr/GpuExpr. I tried to implement GpuStringToMap to inherit GpuExpression and the evaluation function is super short: https://github.com/NVIDIA/spark-rapids/pull/4636/files#diff-a12810882b81a4eb395c03a80951f96ec080db793ffed6755739eeb2122840ccR1507-R1518

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically we have evaluated the literal delimiter pattern before calling to the Gpu override, thus we only pass in ONE input string expression.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't I still need to pass in all of the expressions though so that I can implement children() correctly?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the original delimiter expression still needs to be passed in to initialize children, but it is not used anywhere in the evaluation later on.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The delimiter expression already isn't used in the evaluation. It is only referenced in override def second: Expression = regex which is just used to construct children in final def children: Seq[Expression] = IndexedSeq(first, second, third).

I'm not against making the change and am curious to see what the benefits are but I would rather do this as a follow-on issue and review how similar regexp expressions are implemented since they all follow this same pattern.

Copy link
Collaborator

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing big, this is looking good.

integration_tests/src/main/python/string_test.py Outdated Show resolved Hide resolved
Comment on lines +1337 to +1338
case class GpuStringSplit(str: Expression, regex: Expression, limit: Expression,
isRegExp: Boolean, pattern: String)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be switching this from a GpuTernaryExpression to a GpuBinaryExpression. I personally don't see it as a big deal either way.

@andygrove andygrove changed the title WIP: Draft implementation of string_split with regexp support WIP: Draft implementation of string_split with regexp support [databricks] Feb 9, 2022
@andygrove
Copy link
Contributor Author

build

@andygrove andygrove changed the title WIP: Draft implementation of string_split with regexp support [databricks] WIP: Add regular expression support to string_split [databricks] Feb 11, 2022
Copy link
Collaborator

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I stopped reviewing after a bit because it looks like there might be some other code in here by accident.

Copy link
Collaborator

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me

@sameerz sameerz added the feature request New feature or request label Feb 11, 2022
@andygrove andygrove changed the title WIP: Add regular expression support to string_split [databricks] WIP: Add regular expression support to string_split Feb 11, 2022
rapids-bot bot pushed a commit to rapidsai/cudf that referenced this pull request Feb 14, 2022
This PR adds Java binding for the new strings API `strings::split_re` and `strings::split_record_re`, which allows splitting strings by regular expression delimiters.

In addition, the Java string split overloads with default split pattern (an empty string) are removed in this PR. That is because with default empty pattern the Java's split API produces different results than cudf.

Finally, some cleanup has been perform automatically thanks to IntelliJ IDE.

Depends on #10128.

This is breaking change which is fixed by NVIDIA/spark-rapids#4714. Thus, it should be merged at the same time with NVIDIA/spark-rapids#4714.

Authors:
  - Nghia Truong (https://github.com/ttnghia)
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Robert (Bobby) Evans (https://github.com/revans2)
  - Andy Grove (https://github.com/andygrove)

URL: #10139
@andygrove andygrove changed the title WIP: Add regular expression support to string_split Add regular expression support to string_split Feb 14, 2022
@andygrove andygrove marked this pull request as ready for review February 14, 2022 16:12
@revans2
Copy link
Collaborator

revans2 commented Feb 14, 2022

build

@revans2 revans2 merged commit 3c48c96 into NVIDIA:branch-22.04 Feb 14, 2022
@andygrove andygrove deleted the string-split-regexp branch February 14, 2022 20:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Add regular expression support to GPU implementation of StringSplit
4 participants