
GpuStringSplit now honors the spark.rapids.sql.regexp.enabled configuration option #5297

Merged

Conversation

NVnavkumar
Collaborator

Fixes #5130.

This allows GpuStringSplit to honor the spark.rapids.sql.regexp.enabled configuration flag. The desired behavior is to fall back to the CPU when the pattern is a genuine regular expression, but still use the GPU when the pattern is a simple string or can be transpiled into one. This is done with the existing checkRegExp method, which already handles both the case where transpilation to a simple string succeeds and the case where the parameter is not actually a regular expression.

Here's some PySpark code to test this branch:

from pyspark.sql import SparkSession
from pyspark.sql.functions import expr
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder \
      .master("local[*]") \
      .appName("regex_bracket") \
      .config("spark.rapids.sql.explain", "ALL") \
      .config("spark.rapids.sql.regexp.enabled", "false") \
      .config("spark.rapids.sql.castStringToTimestamp.enabled", "true") \
      .config("spark.rapids.sql.exec.CollectLimitExec", "true") \
      .config("spark.rapids.sql.castFloatToString.enabled", "true") \
      .config("spark.rapids.sql.hasExtendedYearValues", "false") \
      .getOrCreate()


data2 = [("abc|123abc|123", ""),
         ("xyz|123xyz|123", "Rose")]

schema = StructType([
    StructField("firstname", StringType(), True),
    StructField("middlename", StringType(), True)
])

df = spark.createDataFrame(data=data2, schema=schema)
df.printSchema()
df.show(truncate=False)

regex = r"[a-z]\\|[0-9]"

# This call falls back to the CPU because the pattern is a real regular expression
df2 = df.withColumn('newcol1', expr(f"""split(firstname, "{regex}", 2)"""))
df2.explain()
df2.show(truncate=False)

transpilable = r"\\|"

# This code will continue to run on the GPU: | is an escaped regular expression
# meta character, but here it transpiles to a simple delimiter
df3 = df.withColumn('newcol2', expr(f"""split(firstname, "{transpilable}", 3)"""))
df3.explain()
df3.show(truncate=False)


simple = "1"

# This code will continue to run on the GPU because the pattern is a simple string
df4 = df.withColumn('newcol3', expr(f"""split(firstname, "{simple}", 3)"""))
df4.explain()
df4.show(truncate=False)
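The fall-back decision exercised by the three cases above can be sketched in plain Python. This is illustrative only: the real check is the checkRegExp method in the plugin's Scala code, and `can_stay_on_gpu` is a hypothetical name for this sketch.

```python
# Illustrative sketch (NOT the plugin's actual logic) of the decision
# GpuStringSplit makes when spark.rapids.sql.regexp.enabled is false:
# a pattern that is, or transpiles to, a simple string can stay on the
# GPU; a genuine regular expression forces a CPU fallback.

REGEX_META = set(".$^[]()?*+{}|\\")

def can_stay_on_gpu(pattern: str) -> bool:
    """Return True if the split pattern is (or transpiles to) a simple string."""
    i = 0
    while i < len(pattern):
        c = pattern[i]
        if c == "\\" and i + 1 < len(pattern):
            # An escaped meta character like \| acts as a plain delimiter.
            i += 2
            continue
        if c in REGEX_META:
            return False  # a real regex: fall back to CPU
        i += 1
    return True

assert can_stay_on_gpu("1")                   # simple string -> GPU
assert can_stay_on_gpu(r"\|")                 # escaped meta char -> GPU
assert not can_stay_on_gpu(r"[a-z]\|[0-9]")   # real regex -> CPU fallback
```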


Signed-off-by: Navin Kumar <navink@nvidia.com>
@NVnavkumar NVnavkumar self-assigned this Apr 21, 2022
@NVnavkumar
Collaborator Author

build

@sameerz sameerz added the bug Something isn't working label Apr 21, 2022
@sameerz sameerz added this to the Apr 18 - Apr 29 milestone Apr 21, 2022
@andygrove
Contributor

Could we add an integration test to confirm we are falling back to CPU when regex is disabled?

…exp on the GPU

Signed-off-by: Navin Kumar <navink@nvidia.com>
andygrove
andygrove previously approved these changes Apr 22, 2022
@NVnavkumar
Collaborator Author

build

@NVnavkumar
Collaborator Author

build

@NVnavkumar NVnavkumar merged commit 5e1f0e6 into NVIDIA:branch-22.06 Apr 25, 2022

Successfully merging this pull request may close these issues.

[BUG] string_split does not respect spark.rapids.sql.regexp.enabled config
3 participants