Add support for null characters in regular expressions #5834
Signed-off-by: Anthony Chang <antchang@nvidia.com>
if (endsWithLineAnchor(ll) || endsWithLineAnchor(rr)) {
  throw new RegexUnsupportedException(
    "cuDF does not support terms ending with line anchors on one side of a choice")
}

// cuDF does not support terms ending with word boundaries on one side
// of a choice, such as "\\b|a"
if (endsWithWordBoundary(ll) || endsWithWordBoundary(rr)) {
Added in this PR because the updated fuzz tests caught this case
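For readers following the diff, here is a minimal sketch of how a guard like the one above might work. The AST types (`Node`, `Lit`, `WordBoundary`, `Sequence`, `Choice`) and `checkChoice` are hypothetical names made up for this illustration and are not the plugin's actual classes; only the idea of rejecting a choice whose side ends with `\b` comes from the diff.

```scala
object ChoiceGuardSketch {
  // Hypothetical, simplified regex AST (assumption: not the plugin's real types)
  sealed trait Node
  case class Lit(c: Char) extends Node
  case object WordBoundary extends Node                  // stands for \b
  case class Sequence(parts: List[Node]) extends Node
  case class Choice(l: Node, r: Node) extends Node

  // True if the last term reachable at the end of the node is a word boundary
  def endsWithWordBoundary(n: Node): Boolean = n match {
    case WordBoundary   => true
    case Sequence(ps)   => ps.lastOption.exists(endsWithWordBoundary)
    case Choice(l, r)   => endsWithWordBoundary(l) || endsWithWordBoundary(r)
    case _              => false
  }

  // Reject a choice such as \b|a, where one side ends with a word boundary
  def checkChoice(c: Choice): Either[String, Choice] =
    if (endsWithWordBoundary(c.l) || endsWithWordBoundary(c.r)) {
      Left("terms ending with word boundaries on one side of a choice are unsupported")
    } else {
      Right(c)
    }
}
```

In this sketch the check returns an `Either` rather than throwing, purely to keep the example self-contained; the diff itself throws `RegexUnsupportedException` for the line-anchor case.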
build
Should update the predefined character classes that are supposed to include the null character as well (such as `\p{ASCII}`, etc.).
Null characters still don't work in character classes. We can do this after rapidsai/cudf#11112 is merged.
Can you file a tracking issue for the plugin on this bug and note it in the code around those character classes?
Closes #5846
Adds support for the null character `\u0000` in regular expressions. cuDF requires null characters to be represented as `\0` rather than the literal `\u0000`. But we can't simply transpile `\u0000` to `\0`, since patterns such as `\u00002` (i.e. a null character followed by a 2) would transpile to `\02` (i.e. the octal escape for the character with code point 2), so we instead transpile `\u0000` to `(?:\0)`.
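The ambiguity described above can be illustrated with a small string-level sketch. The helper names `naiveTranspileNull` and `transpileNull` are hypothetical and exist only for this example; the plugin's actual transpiler is not shown here, and this sketch only handles a literal null character that already appears in the pattern string.

```scala
object NullCharSketch {
  private val Nul: String = 0.toChar.toString

  // Naive rewrite: a following digit gets absorbed into the escape
  def naiveTranspileNull(pattern: String): String =
    pattern.replace(Nul, "\\0")

  // Rewrite used in this sketch: the non-capturing group keeps \0 separate
  // from whatever comes next in the pattern
  def transpileNull(pattern: String): String =
    pattern.replace(Nul, "(?:\\0)")

  def main(args: Array[String]): Unit = {
    val pattern = Nul + "2"                 // null character followed by '2'
    println(naiveTranspileNull(pattern))    // prints \02
    println(transpileNull(pattern))         // prints (?:\0)2
  }
}
```

Wrapping the escape in `(?:...)` adds no capture group, so group numbering in the rest of the pattern is unaffected.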
I've also left null characters disabled in character classes due to a cuDF bug, rapidsai/cudf#11109.
Signed-off-by: Anthony Chang <antchang@nvidia.com>