Add support for single-line regex anchors ^/$ in contains_re (#9482)
Closes #9439

The `^` (begin anchor) and `$` (end anchor) apply to the beginning of line (BOL) and end of line (EOL) respectively. This means they cannot be used on strings containing embedded new-line ('\n') characters when the anchors should match only the beginning and end of the string as a whole.

Many regex engines support a flag for overriding the behavior of the BOL/EOL anchors: [Python](https://docs.python.org/3/library/re.html#re.MULTILINE), [Java](https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#MULTILINE), [C++](https://en.cppreference.com/w/cpp/regex/basic_regex/constants). This PR introduces a similar flag parameter to the `cudf::strings::contains_re`, `cudf::strings::matches_re` and `cudf::strings::count_re` APIs to tell the regex engine how to interpret the anchor characters in the given regex pattern. Additional information about these anchors can be found here: https://www.regular-expressions.info/anchors.html

The current default behavior of the libcudf regex is to interpret BOL/EOL as if the `MULTILINE` flag were set. This does not match the default of the engines/languages listed above, so for consistency the default is reversed. Both the reversed default and the new `flags` parameter added to the above APIs make this PR a breaking change.

An additional flag (`DOTALL`) is included in this PR: the internal regex code already supports it and only needed a path for the caller to specify the behavior. The `DOTALL` flag is also a feature of the above languages. When specified, the dot '.' pattern includes embedded new-line characters in its matching character set.

Authors:
- David Wendt (https://github.com/davidwendt)

Approvers:
- AJ Schmidt (https://github.com/ajschmidt8)
- Bradley Dice (https://github.com/bdice)
- Vyas Ramasubramani (https://github.com/vyasr)

URL: #9482
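The anchor semantics described in the commit message can be illustrated with a minimal sketch. The helpers `bol_matches` and `eol_matches` and the namespace `anchor_sketch` are hypothetical names invented for this illustration; they are not part of libcudf and do not reflect how its regex engine is implemented. The sketch also ignores the special case, present in some engines, where `$` matches before a trailing new-line at the very end of the string.

```cpp
#include <cstddef>
#include <string_view>

namespace anchor_sketch {  // hypothetical namespace, not libcudf

// Can '^' match at position pos of s?
// Without MULTILINE only the start of the string qualifies;
// with MULTILINE the position right after any '\n' qualifies too.
constexpr bool bol_matches(std::string_view s, std::size_t pos, bool multiline)
{
  if (pos == 0) { return true; }
  return multiline && s[pos - 1] == '\n';
}

// Can '$' match at position pos of s?
// Without MULTILINE only the end of the string qualifies;
// with MULTILINE the position right before any '\n' qualifies too.
constexpr bool eol_matches(std::string_view s, std::size_t pos, bool multiline)
{
  if (pos == s.size()) { return true; }
  return multiline && s[pos] == '\n';
}

// "ab\ncd": 'c' sits at index 3, the embedded new-line at index 2.
static_assert(bol_matches("ab\ncd", 3, true), "MULTILINE: '^' matches after the new-line");
static_assert(!bol_matches("ab\ncd", 3, false), "default: '^' matches only at index 0");
static_assert(eol_matches("ab\ncd", 2, true), "MULTILINE: '$' matches before the new-line");
static_assert(!eol_matches("ab\ncd", 2, false), "default: '$' matches only at the end");

}  // namespace anchor_sketch
```

Under these assumptions, a pattern such as `^cd$` would find a match in "ab\ncd" only when the multiline behavior is requested, which is the distinction the new `flags` parameter exposes.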
1 parent 237b0ce, commit d073ecb
Showing 13 changed files with 321 additions and 60 deletions.
@@ -0,0 +1,65 @@
/*
 * Copyright (c) 2021, NVIDIA CORPORATION.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
#pragma once

#include <cstdint>

namespace cudf {
namespace strings {

/**
 * @addtogroup strings_contains
 * @{
 */

/**
 * @brief Regex flags.
 *
 * These types can be or'd to combine them.
 * The values are chosen to leave room for future flags
 * and to match the Python flag values.
 */
enum regex_flags : uint32_t {
  DEFAULT   = 0,  ///< default
  MULTILINE = 8,  ///< the '^' and '$' honor new-line characters
  DOTALL    = 16  ///< the '.' matching includes new-line characters
};

/**
 * @brief Returns true if the given flags contain MULTILINE.
 *
 * @param f Regex flags to check
 * @return true if `f` includes MULTILINE
 */
constexpr bool is_multiline(regex_flags const f)
{
  return (f & regex_flags::MULTILINE) == regex_flags::MULTILINE;
}

/**
 * @brief Returns true if the given flags contain DOTALL.
 *
 * @param f Regex flags to check
 * @return true if `f` includes DOTALL
 */
constexpr bool is_dotall(regex_flags const f)
{
  return (f & regex_flags::DOTALL) == regex_flags::DOTALL;
}

/** @} */  // end of doxygen group
}  // namespace strings
}  // namespace cudf
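The header says the flags "can be or'd to combine them". A minimal sketch of what that looks like for a caller follows. The enum and the two helpers are copied from the diff above into a hypothetical `flags_sketch` namespace so the example compiles without libcudf; the constant `both` is an illustrative name, not part of the API. Because `regex_flags` is an unscoped enum, OR-ing two enumerators yields an integer, so the sketch assumes the caller casts the result back to `regex_flags`.

```cpp
#include <cstdint>

namespace flags_sketch {  // hypothetical namespace; mirrors the header in the diff

enum regex_flags : std::uint32_t {
  DEFAULT   = 0,
  MULTILINE = 8,
  DOTALL    = 16
};

constexpr bool is_multiline(regex_flags const f)
{
  return (f & regex_flags::MULTILINE) == regex_flags::MULTILINE;
}

constexpr bool is_dotall(regex_flags const f)
{
  return (f & regex_flags::DOTALL) == regex_flags::DOTALL;
}

// MULTILINE | DOTALL has integer type after promotion, so an explicit
// cast is needed to get a regex_flags value again.
constexpr regex_flags both = static_cast<regex_flags>(MULTILINE | DOTALL);

static_assert(is_multiline(both) && is_dotall(both), "both bits are set");
static_assert(!is_multiline(DEFAULT) && !is_dotall(DEFAULT), "DEFAULT sets neither bit");
static_assert(is_multiline(MULTILINE) && !is_dotall(MULTILINE), "bits are independent");

}  // namespace flags_sketch
```

The distinct power-of-two values (8 and 16) are what make the bit tests independent of each other, and they coincide with the values of Python's `re.MULTILINE` and `re.DOTALL`.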