Add support for null characters in regular expressions #5834

Merged

Contributor

@anthony-chang anthony-chang commented Jun 15, 2022

Closes #5846

Adds support for the null character \u0000 in regular expressions.

cuDF requires null characters to be written as \0 rather than \u0000. We can't simply transpile \u0000 to \0, though: a pattern such as \u00002 (i.e. a null character followed by a 2) would become \02, which is read as the octal escape for code point 2. Instead, we transpile \u0000 to (?:\0).
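A minimal sketch of the idea, not the plugin's actual transpiler: `transpileNullEscapes` is a hypothetical string-level helper, and the `Pattern` calls only show how Java itself reads the two forms. That cuDF treats `\02` the same way is taken from the description above, not verified here.

```scala
import java.util.regex.Pattern

object NullCharTranspileSketch {
  // Hypothetical helper (an assumption for illustration): rewrites the
  // six-character Java escape for code point 0 into a non-capturing group,
  // so a digit that follows cannot be absorbed into an octal escape.
  // Limitation: it does not handle an escaped backslash in front of "u0000".
  def transpileNullEscapes(javaPattern: String): String =
    javaPattern.replace("\\u0000", "(?:\\0)")

  def main(args: Array[String]): Unit = {
    val nul = 0.toChar.toString
    val javaPattern = "\\u00002" // null character followed by the digit 2

    // Java reads the pattern as NUL followed by '2'.
    println(Pattern.matches(javaPattern, nul + "2")) // true

    // In Java's regex syntax, \02 is an octal escape for code point 2,
    // so a bare \0 replacement would change the meaning of the pattern.
    println(Pattern.matches("\\02", 2.toChar.toString)) // true

    // The grouped form keeps the null character and the digit separate.
    println(transpileNullEscapes(javaPattern)) // (?:\0)2
  }
}
```

The non-capturing group is used so the rewrite does not shift the numbering of any capture groups in the original pattern.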

I've also left null characters disabled in character classes because of a cuDF bug (rapidsai/cudf#11109).
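To illustrate what "disabled in character classes" could look like, here is a rough, string-level sketch. It is an assumption for illustration only: the plugin works on a parsed regex AST and throws `RegexUnsupportedException`, whereas `rejectNullInCharClass` below is a hypothetical helper that scans the raw pattern and throws `IllegalArgumentException`. It ignores `\Q...\E` quoting and other edge cases.

```scala
object NullInCharClassSketch {
  // Hypothetical guard: fails if a null character (as the Java escape for
  // code point 0, or as a raw NUL) appears inside a [...] character class.
  def rejectNullInCharClass(pattern: String): Unit = {
    val nullEscape = "\\u0000"
    var depth = 0 // nesting depth of character classes
    var i = 0
    while (i < pattern.length) {
      val c = pattern.charAt(i)
      if (c == '\\') {
        if (depth > 0 && pattern.startsWith(nullEscape, i)) {
          throw new IllegalArgumentException(
            "null characters are not supported in character classes")
        }
        i += 2 // skip the backslash and the character it escapes
      } else {
        if (depth > 0 && c == 0.toChar) {
          throw new IllegalArgumentException(
            "null characters are not supported in character classes")
        }
        if (c == '[') depth += 1
        else if (c == ']' && depth > 0) depth -= 1
        i += 1
      }
    }
  }
}
```

With this sketch, a pattern containing the null escape inside `[...]` is rejected, while the same escape outside a character class passes through.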

Signed-off-by: Anthony Chang antchang@nvidia.com

Signed-off-by: Anthony Chang <antchang@nvidia.com>
@anthony-chang anthony-chang self-assigned this Jun 15, 2022
if (endsWithLineAnchor(ll) || endsWithLineAnchor(rr)) {
  throw new RegexUnsupportedException(
    "cuDF does not support terms ending with line anchors on one side of a choice")
}

// cuDF does not support terms ending with word boundaries on one side
// of a choice, such as "\\b|a"
if (endsWithWordBoundary(ll) || endsWithWordBoundary(rr)) {
Contributor Author

This check was added in this PR because the updated fuzz tests caught this case.

@sameerz sameerz added the task Work required that improves the product but is not user facing label Jun 15, 2022
@sameerz sameerz added this to the Jun 6 - Jun 17 milestone Jun 15, 2022
andygrove previously approved these changes Jun 15, 2022
@anthony-chang
Contributor Author

build

@anthony-chang anthony-chang changed the title [WIP] Add support for null characters in regular expressions Add support for null characters in regular expressions Jun 15, 2022
@anthony-chang anthony-chang marked this pull request as ready for review June 15, 2022 20:51
@anthony-chang
Contributor Author

build

Collaborator

@NVnavkumar NVnavkumar left a comment

We should also update the predefined character classes that are supposed to include the null character (such as \p{ASCII}, etc.).

@anthony-chang
Contributor Author

> Should update the predefined character classes that are supposed to include the null character as well (such as \p{ASCII}, etc.)

Null characters still don't work in character classes. We can do this after rapidsai/cudf#11112 is merged.

@NVnavkumar
Collaborator

NVnavkumar commented Jun 24, 2022

> > Should update the predefined character classes that are supposed to include the null character as well (such as \p{ASCII}, etc.)
>
> Null characters still don't work in character classes. We can do this after rapidsai/cudf#11112 is merged

Can you file a tracking issue for the plugin on this bug and note it in the code around those character classes?

@anthony-chang
Contributor Author

build

@anthony-chang anthony-chang merged commit 8b88023 into NVIDIA:branch-22.08 Jun 28, 2022
Development

Successfully merging this pull request may close these issues.

[FEA] Support null characters in regular expressions
4 participants