Add support for regexp_extract_all on GPU #5968

Merged

@anthony-chang anthony-chang commented Jul 7, 2022

Closes #4283, closes #4353

Depends on rapidsai/cudf#11196 and rapidsai/cudf#11215 being merged first.

This PR adds support for Spark's regexp_extract_all on the GPU.

  • For idx = 0, this maps cleanly to cuDF's findall_record.
  • For idx > 0, cuDF's extract_all_record does not accept a group index, so more work is needed:
  1. cudf::strings::extract_all_record returns a list column containing all the matches (for every capture group), but we only want those for the group specified by idx.
  2. So we pull the strings we want out of each list by taking every nth element (n = number of capture groups in the regex pattern), starting from idx. We call ColumnVector.extractListElement once for each index we want.
  3. ColumnVector.extractListElement returns null when the index is out of bounds for a list, so we then filter the null values out of the lists.
  4. Lastly, if the input is null, the steps above would produce an empty list, but we want to output null instead.
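
The steps above can be sketched on the CPU in plain Python. This is only an illustration of the described approach, not the code in this PR: `extract_all_record` and `extract_list_element` here are hypothetical stand-ins written with Python's `re` module, not the cuDF or `ColumnVector` APIs. Two things are assumptions on my part: that the flat list holds the groups of each match in order (so group idx sits at 0-based offset idx - 1), and that a null row yields an empty list before step 4.

```python
import re
from typing import List, Optional


def extract_all_record(rows: List[Optional[str]], pattern: str) -> List[List[str]]:
    """Hypothetical stand-in: per row, one flat list of every group of every match."""
    regex = re.compile(pattern)
    out = []
    for s in rows:
        flat: List[str] = []
        if s is not None:  # assumption: a null row gives an empty list at this stage
            for m in regex.finditer(s):
                flat.extend(m.groups())
        out.append(flat)
    return out


def extract_list_element(lists: List[List[str]], index: int) -> List[Optional[str]]:
    """Hypothetical stand-in: element at `index` of each list, None if out of bounds."""
    return [lst[index] if index < len(lst) else None for lst in lists]


def regexp_extract_all(rows: List[Optional[str]], pattern: str, idx: int):
    regex = re.compile(pattern)
    num_groups = regex.groups
    if not 0 <= idx <= num_groups:
        raise ValueError(f"group index {idx} out of range for {num_groups} group(s)")
    if idx == 0:
        # idx = 0: the whole match, analogous to a "find all" call.
        return [None if s is None else [m.group(0) for m in regex.finditer(s)]
                for s in rows]

    # Step 1: flat list of all groups of all matches, per row.
    flat = extract_all_record(rows, pattern)
    longest = max((len(lst) for lst in flat), default=0)

    # Step 2: every n-th element (n = num_groups), one extraction per position.
    # Offset idx - 1 assumes 1-based group numbers and 0-based list positions.
    columns = [extract_list_element(flat, i)
               for i in range(idx - 1, longest, num_groups)]

    result = []
    for row, s in enumerate(rows):
        if s is None:
            result.append(None)  # Step 4: null input gives null, not an empty list.
        else:
            # Step 3: drop the nulls produced by out-of-bounds extractions.
            result.append([col[row] for col in columns if col[row] is not None])
    return result


rows = ["a1b2c3", "x9", None, "none"]
print(regexp_extract_all(rows, r"([a-z])(\d)", 1))
print(regexp_extract_all(rows, r"([a-z])(\d)", 2))
```

Note that the null filter in step 3 cannot tell an out-of-bounds null apart from a group that did not participate in a match, so the sketch is only meaningful for patterns where every group always matches.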

Signed-off-by: Anthony Chang <antchang@nvidia.com>
anthony-chang added the "feature request" label Jul 7, 2022
anthony-chang self-assigned this Jul 7, 2022
anthony-chang marked this pull request as ready for review July 11, 2022 15:10
anthony-chang commented

build

pxLi commented Jul 14, 2022

build

anthony-chang merged commit 0875b04 into NVIDIA:branch-22.08 Jul 14, 2022
Successfully merging this pull request may close these issues.

[FEA] Implement regexp_extract_all on GPU for idx = 0
[FEA] Implement regexp_extract_all on GPU for idx > 0