
GpuStringSplit now honors the spark.rapids.sql.regexp.enabled configuration option #5297

Merged

Conversation

NVnavkumar
Collaborator

Fixes #5130.

This allows GpuStringSplit to honor the spark.rapids.sql.regexp.enabled configuration flag. The desired behavior is to fall back to the CPU when the pattern is a genuine regular expression, but still use the GPU when the pattern is a simple string or can be transpiled into one. This is done with the existing checkRegExp method, which already handles both the case where transpilation to a simple string succeeds and the case where the parameter is not actually a regular expression.

Here's some PySpark code to test this branch:

from pyspark.sql import SparkSession
from pyspark.sql.functions import expr
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder \
      .master("local[*]") \
      .appName("regex_bracket") \
      .config("spark.rapids.sql.explain", "ALL") \
      .config("spark.rapids.sql.regexp.enabled", "false") \
      .config("spark.rapids.sql.castStringToTimestamp.enabled", "true") \
      .config("spark.rapids.sql.exec.CollectLimitExec", "true") \
      .config("spark.rapids.sql.castFloatToString.enabled", "true") \
      .config("spark.rapids.sql.hasExtendedYearValues", "false") \
      .getOrCreate()


data2 = [("abc|123abc|123", ""),
         ("xyz|123xyz|123", "Rose")]

schema = StructType([
    StructField("firstname", StringType(), True),
    StructField("middlename", StringType(), True)
])

df = spark.createDataFrame(data=data2, schema=schema)
df.printSchema()
df.show(truncate=False)

regex = r"[a-z]\\|[0-9]"

# This call falls back to the CPU because the pattern is a real regular expression
df2 = df.withColumn('newcol1', expr(f"""split(firstname, "{regex}", 2)"""))
df2.explain()
df2.show(truncate=False)

transpilable = r"\\|"

# This code will continue to run on the GPU: | is an escaped regular expression
# meta character, but here it transpiles to a simple delimiter
df3 = df.withColumn('newcol2', expr(f"""split(firstname, "{transpilable}", 3)"""))
df3.explain()
df3.show(truncate=False)


simple = "1"

# This code will continue to run on the GPU because the pattern is a simple string
df4 = df.withColumn('newcol3', expr(f"""split(firstname, "{simple}", 3)"""))
df4.explain()
df4.show(truncate=False)
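The fall-back decision exercised by the three cases above can be sketched in plain Python. This is illustrative only: the real check is the checkRegExp method in the plugin's Scala code, and `can_stay_on_gpu` is a hypothetical name for this sketch.

```python
# Illustrative sketch (NOT the plugin's actual logic) of the decision
# GpuStringSplit makes when spark.rapids.sql.regexp.enabled is false:
# a pattern that is, or transpiles to, a simple string can stay on the
# GPU; a genuine regular expression forces a CPU fallback.

REGEX_META = set(".$^[]()?*+{}|\\")

def can_stay_on_gpu(pattern: str) -> bool:
    """Return True if the split pattern is (or transpiles to) a simple string."""
    i = 0
    while i < len(pattern):
        c = pattern[i]
        if c == "\\" and i + 1 < len(pattern):
            # An escaped meta character like \| acts as a plain delimiter.
            i += 2
            continue
        if c in REGEX_META:
            return False  # a real regex: fall back to CPU
        i += 1
    return True

assert can_stay_on_gpu("1")                   # simple string -> GPU
assert can_stay_on_gpu(r"\|")                 # escaped meta char -> GPU
assert not can_stay_on_gpu(r"[a-z]\|[0-9]")   # real regex -> CPU fallback
```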


Signed-off-by: Navin Kumar <navink@nvidia.com>
@NVnavkumar NVnavkumar self-assigned this Apr 21, 2022
@NVnavkumar
Collaborator Author

build

@sameerz sameerz added the bug Something isn't working label Apr 21, 2022
@sameerz sameerz added this to the Apr 18 - Apr 29 milestone Apr 21, 2022
@andygrove
Contributor

Could we add an integration test to confirm we are falling back to CPU when regex is disabled?

…exp on the GPU

Signed-off-by: Navin Kumar <navink@nvidia.com>
andygrove
andygrove previously approved these changes Apr 22, 2022
@NVnavkumar
Collaborator Author

build

@NVnavkumar
Collaborator Author

build

@NVnavkumar NVnavkumar merged commit 5e1f0e6 into NVIDIA:branch-22.06 Apr 25, 2022

Successfully merging this pull request may close these issues.

[BUG] string_split does not respect spark.rapids.sql.regexp.enabled config
3 participants