Auto-register UDF extension when main plugin is set #1102

Merged 3 commits on Nov 16, 2020
4 changes: 2 additions & 2 deletions docs/compatibility.md
@@ -290,10 +290,10 @@ Casting from string to timestamp currently has the following limitations.
 Only timezone 'Z' (UTC) is supported. Casting unsupported formats will result in null values.
 
 ## UDF to Catalyst Expressions
-To speedup the process of UDF, spark-rapids introduces a udf-compiler extension to translate UDFs to Catalyst expressions.
+To speedup the process of UDF, spark-rapids introduces a udf-compiler extension to translate UDFs to Catalyst expressions.
 
 To enable this operation on the GPU, set
-[`spark.rapids.sql.udfCompiler.enabled`](configs.md#sql.udfCompiler.enabled) to `true`, and `spark.sql.extensions=com.nvidia.spark.udf.Plugin`.
+[`spark.rapids.sql.udfCompiler.enabled`](configs.md#sql.udfCompiler.enabled) to `true`.
 
 However, Spark may produce different results for a compiled udf and the non-compiled. For example: a udf of `x/y` where `y` happens to be `0`, the compiled catalyst expressions will return `NULL` while the original udf would fail the entire job with a `java.lang.ArithmeticException: / by zero`

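To make the behavioral difference described in `compatibility.md` concrete, here is a minimal sketch (assuming a running `SparkSession` named `spark` with the RAPIDS jars on the classpath; the names `div` and `df` are illustrative):

```scala
import org.apache.spark.sql.functions.udf

// An integer-division UDF, as in the `x/y` example above.
val div = udf((x: Int, y: Int) => x / y)

val df = spark.range(1).selectExpr("1 AS x", "0 AS y")

// With spark.rapids.sql.udfCompiler.enabled=true the UDF body is translated
// to a Catalyst divide expression, so x/y with y = 0 evaluates to NULL.
// Without the compiler, the raw JVM lambda runs and the job fails with
// java.lang.ArithmeticException: / by zero.
df.select(div(df("x"), df("y")).alias("q")).show()
```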
15 changes: 9 additions & 6 deletions sql-plugin/src/main/scala/com/nvidia/spark/rapids/Plugin.scala
@@ -60,6 +60,7 @@ class SQLExecPlugin extends (SparkSessionExtensions => Unit) with Logging {

 object RapidsPluginUtils extends Logging {
   private val SQL_PLUGIN_NAME = classOf[SQLExecPlugin].getName
+  private val UDF_PLUGIN_NAME = "com.nvidia.spark.udf.Plugin"
   private val SQL_PLUGIN_CONF_KEY = StaticSQLConf.SPARK_SESSION_EXTENSIONS.key
   private val SERIALIZER_CONF_KEY = "spark.serializer"
   private val JAVA_SERIALIZER_NAME = classOf[JavaSerializer].getName
@@ -70,14 +71,16 @@ object RapidsPluginUtils extends Logging {
   def fixupConfigs(conf: SparkConf): Unit = {
     // First add in the SQL executor plugin because that is what we need at a minimum
     if (conf.contains(SQL_PLUGIN_CONF_KEY)) {
-      val previousValue = conf.get(SQL_PLUGIN_CONF_KEY).split(",").map(_.trim)
-      if (!previousValue.contains(SQL_PLUGIN_NAME)) {
-        conf.set(SQL_PLUGIN_CONF_KEY, (previousValue :+ SQL_PLUGIN_NAME).mkString(","))
-      } else {
-        conf.set(SQL_PLUGIN_CONF_KEY, previousValue.mkString(","))
+      for (pluginName <- Array(SQL_PLUGIN_NAME, UDF_PLUGIN_NAME)) {
+        val previousValue = conf.get(SQL_PLUGIN_CONF_KEY).split(",").map(_.trim)
+        if (!previousValue.contains(pluginName)) {
+          conf.set(SQL_PLUGIN_CONF_KEY, (previousValue :+ pluginName).mkString(","))
+        } else {
+          conf.set(SQL_PLUGIN_CONF_KEY, previousValue.mkString(","))
+        }
       }
     } else {
-      conf.set(SQL_PLUGIN_CONF_KEY, SQL_PLUGIN_NAME)
+      conf.set(SQL_PLUGIN_CONF_KEY, Array(SQL_PLUGIN_NAME, UDF_PLUGIN_NAME).mkString(","))
     }
 
     val serializer = conf.get(SERIALIZER_CONF_KEY, JAVA_SERIALIZER_NAME)
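As a sanity check on the new logic, here is a hedged sketch of what `fixupConfigs` now does to `spark.sql.extensions` (assuming `RapidsPluginUtils` is reachable from the call site and that `SQLExecPlugin` lives in `com.nvidia.spark.rapids`, as the file path suggests; `my.custom.Extension` is a made-up placeholder):

```scala
import org.apache.spark.SparkConf

// Case 1: no extensions configured yet, so both plugins get registered.
val conf = new SparkConf()
RapidsPluginUtils.fixupConfigs(conf)
assert(conf.get("spark.sql.extensions") ==
  "com.nvidia.spark.rapids.SQLExecPlugin,com.nvidia.spark.udf.Plugin")

// Case 2: a user-provided extension is kept and the plugins are appended.
val conf2 = new SparkConf().set("spark.sql.extensions", "my.custom.Extension")
RapidsPluginUtils.fixupConfigs(conf2)
assert(conf2.get("spark.sql.extensions") ==
  "my.custom.Extension,com.nvidia.spark.rapids.SQLExecPlugin,com.nvidia.spark.udf.Plugin")
```

Note that the loop re-reads the key on each iteration, so a name that is already present is never duplicated.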
2 changes: 1 addition & 1 deletion udf-compiler/README.md
@@ -14,6 +14,6 @@ How to run
 ----------
 
 The UDF compiler is included in the rapids-4-spark jar that is produced by the `dist` maven project. Set up your cluster to run the RAPIDS Accelerator for Apache Spark
-and set the spark config `spark.sql.extensions` to include `com.nvidia.spark.udf.Plugin`.
+and this udf plugin will be automatically injected to spark extensions when `com.nvidia.spark.SQLPlugin` is set.
 
 The plugin is still disabled by default and you will need to set `spark.rapids.sql.udfCompiler.enabled` to `true` to enable it.
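For example, a minimal sketch of a session that picks up both pieces (`spark.plugins=com.nvidia.spark.SQLPlugin` is the documented way to load the main plugin; the app name is arbitrary):

```scala
import org.apache.spark.sql.SparkSession

// With the rapids-4-spark jar on the classpath, setting the main plugin is
// enough: com.nvidia.spark.udf.Plugin is injected into spark.sql.extensions
// automatically, but the compiler itself must still be switched on.
val spark = SparkSession.builder()
  .appName("udf-compiler-example")
  .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
  .config("spark.rapids.sql.udfCompiler.enabled", "true") // disabled by default
  .getOrCreate()
```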