Add support for \h, \H, \v, \V, and \R character classes #5477

Merged

@anthony-chang anthony-chang commented May 12, 2022

Closes #4605

Java 8 added support for the character classes \h and \H (horizontal whitespace and its negation), \v and \V (vertical whitespace and its negation), and \R (Unicode linebreak sequence). This PR adds GPU support for these classes by transpiling each one to the equivalent character class listed in the Java documentation: https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html.
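
For reference, here is a minimal standalone sketch of the equivalences, not the plugin's actual transpiler code. The class name `JavaCharClassExpansion`, the `EXPANSIONS` map, and the sample inputs are hypothetical and exist only for this illustration; the expansions themselves are copied from the Java 8 `Pattern` Javadoc. It checks that each shorthand class and its expanded form agree on a handful of inputs when both are run through `java.util.regex`.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

class JavaCharClassExpansion {
    // Shorthand class -> explicit equivalent, as listed in the Java 8 Pattern Javadoc.
    // The map and its name are illustrative only, not part of the plugin.
    static final Map<String, String> EXPANSIONS = new LinkedHashMap<>();
    static {
        String horizontal = " \\t\\xA0\\u1680\\u180e\\u2000-\\u200a\\u202f\\u205f\\u3000";
        String vertical = "\\n\\x0B\\f\\r\\x85\\u2028\\u2029";
        EXPANSIONS.put("\\h", "[" + horizontal + "]");
        EXPANSIONS.put("\\H", "[^" + horizontal + "]");
        EXPANSIONS.put("\\v", "[" + vertical + "]");
        EXPANSIONS.put("\\V", "[^" + vertical + "]");
        // The linebreak class is an alternation, so wrap it in a non-capturing group.
        EXPANSIONS.put("\\R", "(?:\\r\\n|[" + vertical + "])");
    }

    // A few whitespace, linebreak, and ordinary inputs to compare on.
    static List<String> samples() {
        int[] codePoints = {
            ' ', '\t', 0xA0, 0x1680, 0x180E, 0x2000, 0x200A, 0x202F, 0x205F, 0x3000,
            '\n', 0x0B, '\f', '\r', 0x85, 0x2028, 0x2029, 'a', '0'
        };
        List<String> result = new ArrayList<>();
        for (int cp : codePoints) {
            result.add(new String(Character.toChars(cp)));
        }
        result.add("\r\n"); // carriage return plus line feed counts as one linebreak
        return result;
    }

    // True when every shorthand class and its expansion give the same
    // whole-string match result on every sample.
    static boolean allAgree() {
        for (Map.Entry<String, String> e : EXPANSIONS.entrySet()) {
            Pattern shorthand = Pattern.compile(e.getKey());
            Pattern expanded = Pattern.compile(e.getValue());
            for (String s : samples()) {
                if (shorthand.matcher(s).matches() != expanded.matcher(s).matches()) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("shorthand and expanded forms agree: " + allAgree());
    }
}
```

This only compares single-unit matches; it says nothing about how the classes behave under repetition or inside negated character classes on the GPU.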

EDIT:
A couple of updates to this PR:

  • I've moved the transpiling of these character classes from rewrite to parse. cuDF and Java treat newlines differently in negated character classes, and that difference is handled in the RegexCharacterClass case in rewrite. Expanding the classes earlier, in parse, means rewrite can still apply AST modifications such as this one to the expanded classes.
  • I've disabled support for repetitions of \R because of an inconsistency within Java. The Java docs say \R is equivalent to \r\n|[\n\u000B\u000C\r\u0085\u2028\u2029], but if we run the pattern \R{2} against the input a\r\nb, Java finds no matches, whereas cuDF finds 1 match (which is correct according to the documented equivalence).
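
The \R{2} inconsistency can be reproduced on the CPU with a small sketch like the one below. The class and method names (`LinebreakRepetitionCheck`, `countMatches`) are hypothetical, and the expanded pattern is the documented equivalence wrapped in a non-capturing group so the quantifier applies to the whole alternation. It prints the number of matches for the shorthand and for the expansion; the shorthand count may depend on the JDK version, so no expected value is stated for it here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinebreakRepetitionCheck {
    // Documented equivalent of the linebreak class, wrapped in a
    // non-capturing group so a quantifier applies to the whole alternation.
    static final String EXPANDED_R =
        "(?:\\r\\n|[\\n\\u000B\\u000C\\r\\u0085\\u2028\\u2029])";

    // Counts non-overlapping matches of regex in input.
    static int countMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String input = "a\r\nb";
        // Shorthand form: the PR description reports that Java finds no match.
        System.out.println("shorthand {2}: " + countMatches("\\R{2}", input));
        // Expanded form: backtracking lets CR and LF each match the
        // character class separately, giving two linebreaks in a row.
        System.out.println("expanded {2}:  " + countMatches(EXPANDED_R + "{2}", input));
    }
}
```

With the expanded form, the engine first tries to consume \r\n as a single unit, fails to find a second linebreak at b, then backtracks and matches \r and \n separately, which is why the documented equivalence yields one match.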

Signed-off-by: Anthony Chang <antchang@nvidia.com>

Signed-off-by: Anthony Chang <antchang@nvidia.com>
@anthony-chang anthony-chang changed the title from "Add support for \h, \H, \v, \V, and \R" to "Add support for \h, \H, \v, \V, and \R character classes" May 12, 2022
@anthony-chang anthony-chang self-assigned this May 12, 2022
Signed-off-by: Anthony Chang <antchang@nvidia.com>
…into support-new-java8-char-classes

Signed-off-by: Anthony Chang <antchang@nvidia.com>
revans2
revans2 previously approved these changes May 12, 2022
Signed-off-by: Anthony Chang <antchang@nvidia.com>
@anthony-chang
Contributor Author

build

@sameerz sameerz added the "feature request" label May 13, 2022
@sameerz sameerz added this to the May 2 - May 20 milestone May 13, 2022
… and add support for reptitions

Signed-off-by: Anthony Chang <antchang@nvidia.com>
@anthony-chang
Contributor Author

build

@revans2 revans2 merged commit 61ada07 into NVIDIA:branch-22.06 May 17, 2022
anthony-chang added a commit to anthony-chang/spark-rapids that referenced this pull request May 17, 2022
…VIDIA#5477)

Signed-off-by: Anthony Chang <antchang@nvidia.com>
Successfully merging this pull request may close these issues.

[FEA] Add regular expression support for new character classes introduced in Java 8