Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for word boundaries \b and \B #5479

Merged

Conversation

anthony-chang
Copy link
Contributor

@anthony-chang anthony-chang commented May 12, 2022

Closes #4517, closes #4289

cuDF now supports word boundaries so we no longer need to fallback to CPU for them. However, there are consistencies from cuDF (see #5478) so word boundaries are still disabled for GpuStringSplit.

Signed-off-by: Anthony Chang antchang@nvidia.com

Signed-off-by: Anthony Chang <antchang@nvidia.com>
Copy link
Collaborator

@NVnavkumar NVnavkumar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A different situation here; can you add a unit test for the fallback to CPU when in split mode?

Signed-off-by: Anthony Chang <antchang@nvidia.com>
@anthony-chang
Copy link
Contributor Author

build

@sameerz sameerz added the feature request New feature or request label May 13, 2022
@sameerz sameerz added this to the May 2 - May 20 milestone May 13, 2022
NVnavkumar
NVnavkumar previously approved these changes May 13, 2022
@NVnavkumar NVnavkumar self-requested a review May 13, 2022 19:09
@NVnavkumar NVnavkumar dismissed their stale review May 13, 2022 19:10

fat-fingered this approval by mistake

Signed-off-by: Anthony Chang <antchang@nvidia.com>
revans2
revans2 previously approved these changes May 17, 2022
Signed-off-by: Anthony Chang <antchang@nvidia.com>
@anthony-chang
Copy link
Contributor Author

build

…pport-word-boundaries

Signed-off-by: Anthony Chang <antchang@nvidia.com>
@anthony-chang
Copy link
Contributor Author

build

@anthony-chang anthony-chang marked this pull request as draft May 26, 2022 16:33
@anthony-chang
Copy link
Contributor Author

The tests fail with this case

javaPattern=(?:w)\Z\B, cudfPattern=(?:w)([\n\r\u0085\u2028\u2029]|\r\n)?$\B, input='z11\u000b^+ww\r', cpu=z11\u000b^+ww\r, gpu=z11\u000b^+w_RE\PLACE_\r

I will need to look into whether we can support word boundaries surrounding string anchors. If not, we will need to fallback to CPU similar to in #5610

…pport-word-boundaries

Signed-off-by: Anthony Chang <antchang@nvidia.com>
Signed-off-by: Anthony Chang <antchang@nvidia.com>
Signed-off-by: Anthony Chang <antchang@nvidia.com>
Signed-off-by: Anthony Chang <antchang@nvidia.com>
@anthony-chang
Copy link
Contributor Author

build

Signed-off-by: Anthony Chang <antchang@nvidia.com>
@anthony-chang
Copy link
Contributor Author

build

@anthony-chang anthony-chang marked this pull request as ready for review May 30, 2022 20:46
Copy link
Contributor

@andygrove andygrove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I see that we already cover \b and \B in the scala fuzz tests.

Copy link
Contributor

@andygrove andygrove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be for 22.08 now

@anthony-chang anthony-chang changed the base branch from branch-22.06 to branch-22.08 May 31, 2022 17:59
@andygrove
Copy link
Contributor

build

1 similar comment
@sameerz
Copy link
Collaborator

sameerz commented Jun 6, 2022

build

@anthony-chang anthony-chang merged commit 9334d01 into NVIDIA:branch-22.08 Jun 6, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Add support for word boundaries \b and \B in regular expressions
5 participants