[BUG] regexp_extract doesn't work correctly with concat #5088

Closed
sperlingxx opened this issue Mar 30, 2022 · 2 comments
Labels
bug Something isn't working duplicate This issue or pull request already exists

Comments

@sperlingxx
Collaborator

sperlingxx commented Mar 30, 2022

Describe the bug
regexp_extract(concat(....), ) produces incorrect results on the GPU when the children of concat include at least one column vector.

Steps/Code to reproduce bug

val df = (1 to 10).toDF("a")
spark.conf.set("spark.rapids.sql.regexp.enabled", "true")
df.coalesce(1).select(regexp_extract(concat(col("a"), lit("a")), "(a)", 1)).collect()

GPU results: Array([], [], [], [], [], [], [], [], [], [])
CPU results: Array([a], [a], [a], [a], [a], [a], [a], [a], [a], [a])

For the above query, the GPU produces the correct result only when column a yields an empty string.

@sperlingxx sperlingxx added bug Something isn't working ? - Needs Triage Need team to review and classify labels Mar 30, 2022
@revans2
Collaborator

revans2 commented Mar 30, 2022

This has nothing to do with concat. It appears that regexp_extract on the CPU is doing a find, whereas on the GPU it is doing a full match.

val df = Seq("1a", "2a", "3a", "4a", "5a", "6a", "7a", "8a", "9a", "10a").toDF("c")
df.coalesce(1).select(regexp_extract(col("c"), "(a)", 1)).collect()

shows the same CPU/GPU mismatch, but changing the regular expression to ".*(a).*" produces the same result on both the CPU and the GPU.
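
The difference between "find" and "full match" can be reproduced without Spark, using plain java.util.regex. The sketch below is only an illustration of the hypothesis above, not the actual CPU or GPU implementation; extractFind and extractFullMatch are hypothetical helper names, and it assumes a Scala REPL or spark-shell session where top-level definitions are allowed.

```scala
import java.util.regex.Pattern

// Hypothetical helpers for illustration; not part of Spark or the RAPIDS plugin.

// "Find" semantics: the pattern may match anywhere inside the input.
def extractFind(input: String, regex: String, group: Int): String = {
  val m = Pattern.compile(regex).matcher(input)
  // Return an empty string when nothing matches or the group did not participate.
  if (m.find()) Option(m.group(group)).getOrElse("") else ""
}

// "Full match" semantics: the pattern must cover the entire input.
def extractFullMatch(input: String, regex: String, group: Int): String = {
  val m = Pattern.compile(regex).matcher(input)
  if (m.matches()) Option(m.group(group)).getOrElse("") else ""
}

// Compare both semantics for the two patterns discussed in this issue.
for (s <- Seq("1a", "10a", "a"); p <- Seq("(a)", ".*(a).*")) {
  val found = extractFind(s, p, 1)
  val full = extractFullMatch(s, p, 1)
  println(s"input=$s pattern=$p find=[$found] fullMatch=[$full]")
}
```

With "(a)", the two helpers disagree on "1a" and "10a" but agree on "a", which is consistent with the GPU being correct only when the column contributes an empty string. With ".*(a).*", the pattern spans the whole input, so both helpers return "a" for every input.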

@sperlingxx
Collaborator Author

Closing this issue since it is covered by #5135.

@sameerz sameerz added duplicate This issue or pull request already exists and removed ? - Needs Triage Need team to review and classify labels Apr 5, 2022