
SPARK SQL does not work #628

Closed
gourav-sg opened this issue Aug 28, 2020 · 16 comments
Labels
bug Something isn't working documentation Improvements or additions to documentation

Comments

@gourav-sg

gourav-sg commented Aug 28, 2020

Describe the bug
On AWS EMR, Spark SQL does not work. The error thrown is: `cannot run on GPU because no GPU enabled version of operator class org.apache.spark.sql.execution.datasources.v2.SetCatalogAndNamespaceExec could be found`

Steps/Code to reproduce bug
NOTE that https://github.com/apache/spark/blob/master/examples/src/main/scripts/getGpusResources.sh was downloaded onto all the nodes as part of the bootstrap actions.

```python
from pyspark.sql import SparkSession
from pyspark import SparkConf

conf = SparkConf().setAppName("MortgageETL")
conf.set("spark.jars", "s3://gourav-bucket/gourav/gpu/cudf-0.9.2.jar,s3://gourav-bucket/gourav/gpu/rapids-4-spark_2.12-0.1.0.jar,s3://gourav-bucket/gourav/gpu/cudf-0.14-cuda10-1.jar")
conf.set("spark.rapids.sql.explain", "ALL")
conf.set("spark.executor.instances", "20")
conf.set("spark.executor.cores", "2")
conf.set("spark.task.cpus", "1")
conf.set("spark.rapids.sql.concurrentGpuTasks", "1")
conf.set("spark.executor.memory", "4g")
conf.set("spark.rapids.memory.pinnedPool.size", "1G")
conf.set("spark.executor.memoryOverhead", "2G")
conf.set("spark.executor.extraJavaOptions", "-Dai.rapids.cudf.prefer-pinned=true")
conf.set("spark.locality.wait", "0s")
conf.set("spark.sql.files.maxPartitionBytes", "512m")
conf.set("spark.executor.resource.gpu.amount", "1")
conf.set("spark.task.resource.gpu.amount", "0.25")
conf.set("spark.plugins", "com.nvidia.spark.SQLPlugin")
conf.set("spark.rapids.sql.hasNans", "false")
conf.set("spark.rapids.sql.batchSizeBytes", "512M")
conf.set("spark.rapids.sql.reader.batchSizeBytes", "768M")
conf.set("spark.rapids.sql.variableFloatAgg.enabled", "true")
conf.set("spark.sql.adaptive.enabled", False)
conf.set("spark.executor.resource.gpu.discoveryScript", "/mnt/mapred/getGpusResources.sh")

spark = (SparkSession.builder
         .enableHiveSupport()
         .config(conf=conf)
         .master("yarn")
         .getOrCreate())
```
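For context, the discovery script configured above must print a single JSON document naming the resource and its addresses when Spark invokes it on an executor node. A minimal sketch of that output format in Python (the indices here are illustrative; a real script such as getGpusResources.sh derives them from `nvidia-smi --query-gpu=index`):

```python
import json

def format_gpu_resource(indices):
    """Build the JSON payload a Spark GPU discovery script must print:
    the resource name plus the list of GPU addresses as strings."""
    return json.dumps({"name": "gpu", "addresses": [str(i) for i in indices]})

if __name__ == "__main__":
    # On a real node the indices would come from querying nvidia-smi.
    print(format_gpu_resource([0, 1]))
```

If the script prints anything other than this shape, Spark cannot assign GPU addresses to executors.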

Expected behavior
The code should run without errors.

Environment details (please complete the following information)

  • Environment location: YARN, Cloud(AWS EMR 6.1.0)
  • Spark configuration settings related to the issue

Additional context
The code fails both with and without Hive support enabled.

@gourav-sg gourav-sg added ? - Needs Triage Need team to review and classify bug Something isn't working labels Aug 28, 2020
@krajendrannv
Contributor

Is AWS EMR 6.1.0 using Apache Spark 3.0? If not, Spark SQL on GPU won't work.

@gourav-sg
Author

Is AWS EMR 6.1.0 using Apache Spark 3.0? If not, Spark SQL on GPU won't work.

Yes, it is indeed.

@revans2
Collaborator

revans2 commented Aug 28, 2020

SetCatalogAndNamespaceExec is a metadata operation that we will never put on the GPU. We have plans to stop reporting it as a warning (#499). You can ignore it for now. Are there other warnings that you are seeing?

@gourav-sg
Author

SetCatalogAndNamespaceExec is a metadata operation that we will never put on the GPU. We have plans to stop reporting it as a warning (#499). You can ignore it for now. Are there other warnings that you are seeing?

The cluster does not even start without Hive enabled; the containers do not get allocated at all. The following is from the YARN logs:

```
20/08/28 18:03:06 ERROR RapidsExecutorPlugin: Exception in the executor plugin
java.lang.NoSuchMethodError: ai.rapids.cudf.Cuda.setDevice(I)V
	at com.nvidia.spark.rapids.GpuDeviceManager$.setGpuDeviceAndAcquire(GpuDeviceManager.scala:90)
	at com.nvidia.spark.rapids.GpuDeviceManager$.$anonfun$initializeGpu$1(GpuDeviceManager.scala:117)
	at scala.runtime.java8.JFunction1$mcII$sp.apply(JFunction1$mcII$sp.java:23)
	at scala.Option.map(Option.scala:230)
	at com.nvidia.spark.rapids.GpuDeviceManager$.initializeGpu(GpuDeviceManager.scala:117)
	at com.nvidia.spark.rapids.GpuDeviceManager$.initializeGpuAndMemory(GpuDeviceManager.scala:125)
	at com.nvidia.spark.rapids.RapidsExecutorPlugin.init(Plugin.scala:230)
	at org.apache.spark.internal.plugin.ExecutorPluginContainer.$anonfun$executorPlugins$1(PluginContainer.scala:111)
	at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:245)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at scala.collection.TraversableLike.flatMap(TraversableLike.scala:245)
	at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:242)
	at scala.collection.AbstractTraversable.flatMap(Traversable.scala:108)
	at org.apache.spark.internal.plugin.ExecutorPluginContainer.<init>(PluginContainer.scala:99)
	at org.apache.spark.internal.plugin.PluginContainer$.apply(PluginContainer.scala:164)
	at org.apache.spark.internal.plugin.PluginContainer$.apply(PluginContainer.scala:152)
	at org.apache.spark.executor.Executor.$anonfun$plugins$1(Executor.scala:158)
	at org.apache.spark.util.Utils$.withContextClassLoader(Utils.scala:221)
	at org.apache.spark.executor.Executor.<init>(Executor.scala:158)
	at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:168)
	at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
	at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:203)
	at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
	at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
	at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
```

@revans2
Collaborator

revans2 commented Aug 28, 2020

There is something odd happening. The API that is throwing the NoSuchMethodError has been in the cudf jar since 0.12, so unless you have a 0.9.x release of the cudf jar on your classpath you should not get that error.
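To illustrate why a stale jar produces this error, here is a sketch of classpath shadowing (this is a simplification, not the actual JVM class loader, and the jar contents shown are hypothetical): the first jar on the classpath that contains a class wins, so an old cudf jar listed before the new one supplies the old `Cuda` class, which lacks the newer `setDevice` method.

```python
def resolve_class(jars, class_name):
    """Mimic classpath resolution order: scan jars in order and return
    the name of the first jar that provides the requested class."""
    for jar_name, classes in jars:
        if class_name in classes:
            return jar_name
    return None

# Hypothetical jar contents: both cudf jars provide ai.rapids.cudf.Cuda,
# so the older jar listed first shadows the newer one.
jars = [
    ("cudf-0.9.2.jar", {"ai.rapids.cudf.Cuda"}),
    ("rapids-4-spark_2.12-0.1.0.jar", {"com.nvidia.spark.SQLPlugin"}),
    ("cudf-0.14-cuda10-1.jar", {"ai.rapids.cudf.Cuda"}),
]
print(resolve_class(jars, "ai.rapids.cudf.Cuda"))  # the stale 0.9.2 jar wins
```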

@jlowe
Member

jlowe commented Aug 28, 2020

The first thing in your list of jars is indeed cudf-0.9.2 which explains the issue.

conf.set("spark.jars", "s3://gourav-bucket/gourav/gpu/cudf-0.9.2.jar,s3://gourav-bucket/gourav/gpu/rapids-4-spark_2.12-0.1.0.jar,s3://gourav-bucket/gourav/gpu/cudf-0.14-cuda10-1.jar")
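A quick way to catch this class of mistake is to scan the spark.jars value for conflicting cudf versions before submitting. A small sketch (the version-extraction regex is an illustration, keyed to the jar naming used in this thread):

```python
import re

def cudf_versions(spark_jars):
    """Extract the cudf version number from each jar path in a
    comma-separated spark.jars value."""
    return [m.group(1) for p in spark_jars.split(",")
            if (m := re.search(r"cudf-(\d+\.\d+(?:\.\d+)?)", p))]

jars = ("s3://gourav-bucket/gourav/gpu/cudf-0.9.2.jar,"
        "s3://gourav-bucket/gourav/gpu/rapids-4-spark_2.12-0.1.0.jar,"
        "s3://gourav-bucket/gourav/gpu/cudf-0.14-cuda10-1.jar")
versions = cudf_versions(jars)
if len(set(versions)) > 1:
    # Two different cudf releases on one classpath: exactly this bug.
    print("conflicting cudf jars:", versions)
```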

@gourav-sg
Author

gourav-sg commented Aug 28, 2020

The first thing in your list of jars is indeed cudf-0.9.2 which explains the issue.

conf.set("spark.jars", "s3://gourav-bucket/gourav/gpu/cudf-0.9.2.jar,s3://gourav-bucket/gourav/gpu/rapids-4-spark_2.12-0.1.0.jar,s3://gourav-bucket/gourav/gpu/cudf-0.14-cuda10-1.jar")

Yeah, I tried it using the line below, and there is still an issue:
conf.set("spark.jars", "s3://gourav-bucket/gourav/gpu/cudf-0.14-cuda10-1.jar,s3://gourav-bucket/gourav/gpu/rapids-4-spark_2.12-0.1.0.jar")

The issue reported now is:

```
20/08/28 20:25:21 WARN ResourceRequestHelper: YARN doesn't know about resource yarn.io/gpu, your resource discovery has to handle properly discovering and isolating the resource! Error: The resource manager encountered a problem that should not occur under normal circumstances. Please report this error to the Hadoop community by opening a JIRA ticket at http://issues.apache.org/jira and including the following information:
  * Resource type requested: yarn.io/gpu
```

@jlowe
Member

jlowe commented Aug 28, 2020

That implies the YARN cluster has not been configured to schedule GPUs. Please check the YARN configuration files and verify GPU scheduling is enabled: https://hadoop.apache.org/docs/r3.2.1/hadoop-yarn/hadoop-yarn-site/UsingGpus.html
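For reference, enabling GPU scheduling in YARN starts with declaring the resource type. A minimal resource-types.xml fragment might look like the following (a sketch; the complete setup, including node-manager resource plugins, cgroups isolation, and scheduler settings, is described in the Hadoop GPU guide linked above):

```xml
<!-- resource-types.xml (sketch; see the Hadoop "Using GPU On YARN" docs) -->
<configuration>
  <property>
    <name>yarn.resource-types</name>
    <value>yarn.io/gpu</value>
  </property>
</configuration>
```

The node managers additionally need `yarn.nodemanager.resource-plugins` set to `yarn.io/gpu` in yarn-site.xml so they can discover and isolate the GPUs.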

@gourav-sg
Author

https://hadoop.apache.org/docs/r3.2.1/hadoop-yarn/hadoop-yarn-site/UsingGpus.html

Done that. Now I am getting the error "Could not load cudf jni library..", which was mentioned in #149. That issue refers to cudf-0.15, but I cannot see it here: https://repo1.maven.org/maven2/ai/rapids/cudf/

@jlowe
Member

jlowe commented Aug 28, 2020

If this message is occurring only on the driver node then it should be a benign message. The driver does not require a GPU or the cudf code to load in order to function.

which is referring to cudf-0.15 but I cannot see it here: https://repo1.maven.org/maven2/ai/rapids/cudf/

cudf-0.15 has not been released yet. Once it is, the jar will be posted there. Note that cudf-0.15 is likely not compatible with version 0.1.0 of the plugin jar, so you should stick with cudf-0.14 as long as you are using plugin version 0.1.0.
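The pairing constraint above can be expressed as a small lookup, useful as a pre-flight check in a launcher script. The compatibility table here is hypothetical beyond the one pairing this thread confirms (plugin 0.1.0 with cudf 0.14):

```python
# Hypothetical compatibility table: which cudf releases each plugin
# version was built against. Only the 0.1.0 -> 0.14 pairing is
# confirmed by this thread; extend from the release notes.
COMPATIBLE_CUDF = {
    "0.1.0": {"0.14"},
}

def check_compat(plugin_version, cudf_version):
    """Return True when the cudf jar is known to match the plugin jar."""
    return cudf_version in COMPATIBLE_CUDF.get(plugin_version, set())

print(check_compat("0.1.0", "0.14"))  # True
print(check_compat("0.1.0", "0.15"))  # False: 0.15 likely breaks 0.1.0
```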

@gourav-sg
Author

gourav-sg commented Aug 28, 2020

If this message is occurring only on the driver node then it should be a benign message. The driver does not require a GPU or the cudf code to load in order to function.

which is referring to cudf-0.15 but I cannot see it here: https://repo1.maven.org/maven2/ai/rapids/cudf/

cudf-0.15 has not yet released. Once it has the jar will be posted there. Note that cudf-0.15 is likely not compatible with version 0.1.0 of the plugin jar, so you should stick with cudf-0.14 as long as you are using plugin version 0.1.0.

The current error:

We have the following file in EMR:

```
0 lrwxrwxrwx 1 root root 16 Aug 28 20:11 /usr/local/cuda/lib64/libcudart.so -> libcudart.so.9.2
```

```
20/08/28 20:46:43 ERROR NativeDepsLoader: Could not load cudf jni library...
java.lang.UnsatisfiedLinkError: /mnt3/yarn/usercache/hadoop/appcache/application_1598647482677_0001/container_1598647482677_0001_01_000005/tmp/rmm3831790456313469993.so: libcudart.so.10.1: cannot open shared object file: No such file or directory
	at java.lang.ClassLoader$NativeLibrary.load(Native Method)
	at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1934)
	at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1817)
	at java.lang.Runtime.load0(Runtime.java:809)
	at java.lang.System.load(System.java:1088)
	at ai.rapids.cudf.NativeDepsLoader.loadDep(NativeDepsLoader.java:81)
	at ai.rapids.cudf.NativeDepsLoader.loadNativeDeps(NativeDepsLoader.java:49)
	at ai.rapids.cudf.Cuda.<clinit>(Cuda.java:28)
	at com.nvidia.spark.rapids.GpuDeviceManager$.setGpuDeviceAndAcquire(GpuDeviceManager.scala:90)
	at com.nvidia.spark.rapids.GpuDeviceManager$.$anonfun$initializeGpu$1(GpuDeviceManager.scala:117)
	at scala.runtime.java8.JFunction1$mcII$sp.apply(JFunction1$mcII$sp.java:23)
	at scala.Option.map(Option.scala:230)
	at com.nvidia.spark.rapids.GpuDeviceManager$.initializeGpu(GpuDeviceManager.scala:117)
	at com.nvidia.spark.rapids.GpuDeviceManager$.initializeGpuAndMemory(GpuDeviceManager.scala:125)
	at com.nvidia.spark.rapids.RapidsExecutorPlugin.init(Plugin.scala:230)
	at org.apache.spark.internal.plugin.ExecutorPluginContainer.$anonfun$executorPlugins$1(PluginContainer.scala:111)
	at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:245)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at scala.collection.TraversableLike.flatMap(TraversableLike.scala:245)
	at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:242)
	at scala.collection.AbstractTraversable.flatMap(Traversable.scala:108)
	at org.apache.spark.internal.plugin.ExecutorPluginContainer.<init>(PluginContainer.scala:99)
	at org.apache.spark.internal.plugin.PluginContainer$.apply(PluginContainer.scala:164)
	at org.apache.spark.internal.plugin.PluginContainer$.apply(PluginContainer.scala:152)
	at org.apache.spark.executor.Executor.$anonfun$plugins$1(Executor.scala:158)
	at org.apache.spark.util.Utils$.withContextClassLoader(Utils.scala:221)
	at org.apache.spark.executor.Executor.<init>(Executor.scala:158)
	at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:168)
	at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
	at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:203)
	at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
	at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
	at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
20/08/28 20:46:43 ERROR RapidsExecutorPlugin: Exception in the executor plugin
java.lang.UnsatisfiedLinkError: ai.rapids.cudf.Cuda.setDevice(I)V
	at ai.rapids.cudf.Cuda.setDevice(Native Method)
	at com.nvidia.spark.rapids.GpuDeviceManager$.setGpuDeviceAndAcquire(GpuDeviceManager.scala:90)
	at com.nvidia.spark.rapids.GpuDeviceManager$.$anonfun$initializeGpu$1(GpuDeviceManager.scala:117)
	at scala.runtime.java8.JFunction1$mcII$sp.apply(JFunction1$mcII$sp.java:23)
	at scala.Option.map(Option.scala:230)
	at com.nvidia.spark.rapids.GpuDeviceManager$.initializeGpu(GpuDeviceManager.scala:117)
	at com.nvidia.spark.rapids.GpuDeviceManager$.initializeGpuAndMemory(GpuDeviceManager.scala:125)
	at com.nvidia.spark.rapids.RapidsExecutorPlugin.init(Plugin.scala:230)
	at org.apache.spark.internal.plugin.ExecutorPluginContainer.$anonfun$executorPlugins$1(PluginContainer.scala:111)
	at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:245)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at scala.collection.TraversableLike.flatMap(TraversableLike.scala:245)
	at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:242)
	at scala.collection.AbstractTraversable.flatMap(Traversable.scala:108)
	at org.apache.spark.internal.plugin.ExecutorPluginContainer.<init>(PluginContainer.scala:99)
	at org.apache.spark.internal.plugin.PluginContainer$.apply(PluginContainer.scala:164)
	at org.apache.spark.internal.plugin.PluginContainer$.apply(PluginContainer.scala:152)
	at org.apache.spark.executor.Executor.$anonfun$plugins$1(Executor.scala:158)
	at org.apache.spark.util.Utils$.withContextClassLoader(Utils.scala:221)
	at org.apache.spark.executor.Executor.<init>(Executor.scala:158)
	at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:168)
	at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
	at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:203)
	at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
	at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
	at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
```

@jlowe
Member

jlowe commented Aug 28, 2020

libcudart.so.10.1: cannot open shared object file: No such file or directory

This is the relevant portion of the error. The cudf jar is built for a specific CUDA runtime, 10.1 in this case. There is a version built for the CUDA 10.2 runtime at https://repo1.maven.org/maven2/ai/rapids/cudf/0.14/cudf-0.14-cuda10-2.jar. CUDA runtimes are typically installed under /usr/local/cuda; check which version you have. The nvidia-smi command will show which version of the driver is installed (which is NOT necessarily the same as the CUDA runtime version!), but the driver version will at least tell you whether it is capable of supporting the CUDA 10.1 or CUDA 10.2 runtime, one of which is required to run the RAPIDS Accelerator plugin.

See https://docs.nvidia.com/deploy/cuda-compatibility/index.html for a description of the CUDA environment and how the driver and runtime versions interact. That page also has a compatibility table showing the minimum driver versions required to support the various CUDA runtimes. If your driver is recent enough for CUDA 10.1 or CUDA 10.2, you should be able to install the corresponding CUDA runtime package if it is missing from your system.
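The runtime-to-jar matching described above can be sketched as a small helper that inspects the libcudart files present on a node (for example, a listing of /usr/local/cuda*/lib64) and picks the matching cudf jar classifier. This is a hypothetical illustration, not part of the plugin; the classifier names mirror the cudf-0.14 artifacts mentioned in this thread:

```python
import re

# cudf-0.14 artifacts exist for these CUDA runtimes; older runtimes like
# 9.2 (as found on this EMR node) have no matching jar.
SUPPORTED = {"10.1": "cuda10-1", "10.2": "cuda10-2"}

def pick_cudf_classifier(libcudart_files):
    """Return the cudf jar classifier matching an installed CUDA runtime,
    or None when no supported libcudart version is present."""
    for f in libcudart_files:
        m = re.search(r"libcudart\.so\.(\d+\.\d+)", f)
        if m and m.group(1) in SUPPORTED:
            return SUPPORTED[m.group(1)]
    return None

print(pick_cudf_classifier(["libcudart.so.9.2"]))   # None: runtime too old
print(pick_cudf_classifier(["libcudart.so.10.2"]))  # cuda10-2
```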

@gourav-sg
Author

gourav-sg commented Aug 28, 2020

Your team is absolutely brilliant; I am in fact surprised by your cooperation and support.

We have the following file in EMR:

```
0 lrwxrwxrwx 1 root root 16 Aug 28 20:11 /usr/local/cuda/lib64/libcudart.so -> libcudart.so.9.2
```

Can the person from NVIDIA who is causing all this confusion by not doing proper diligence please be asked to update this article: https://aws.amazon.com/blogs/big-data/improving-rapids-xgboost-performance-and-reducing-costs-with-amazon-emr-running-amazon-ec2-g4-instances/?nc1=b_rp?

The article is clearly misleading; people using something like xgboost will try running SQL first to prepare their data, and the article is frustratingly incomplete.

@jlowe
Member

jlowe commented Aug 28, 2020

I am infact surprised by your cooperation and support.

Just trying to help, glad you are trying out the software and are willing to work through issues!

I took a quick look at the article, and it appears not to be using the RAPIDS Accelerator for Apache Spark (this project) but rather a custom solution for xgboost that was built earlier. The notebook referred to in that article uses an old GpuDatasetReader class and is based on cudf-0.9.2, which would work with the CUDA 9.2 runtime. I suspect it works if you follow the exact directions in the article, but it won't be using the RAPIDS Accelerator.

It appears the updated getting started guide is now at https://github.com/NVIDIA/spark-xgboost-examples/blob/spark-3/getting-started-guides/csp/aws/ec2.md which shows running with the RAPIDS Accelerator plugin and xgboost, but it does so under EC2 rather than EMR, running Spark in standalone mode rather than with Spark-on-YARN.

There probably needs to be a getting started guide for EMR for those who would rather work in that environment. Would you be willing to file an issue in the https://github.com/NVIDIA/spark-xgboost-examples repo requesting an AWS EMR getting started guide?

@gourav-sg
Author

Hi @jlowe, I am currently working on it. Is there a way I could contribute by writing the notebook and checking it in?

@jlowe
Member

jlowe commented Aug 28, 2020

is there a way I could contribute and write the notebook and check it in?

You can definitely write up a notebook and submit a pull request against https://github.com/NVIDIA/spark-xgboost-examples; that would be great! I can't guarantee it will be accepted verbatim or at all, as it's up to the committers in that repository to review the contribution and ultimately decide whether to accept it. However, in general, contributions of all types (issues reported, features requested, pull requests posted, etc.) are welcome!

I would recommend filing the issue first and posting a follow-up comment stating that you are interested in working on it and plan to post a pull request. Then you can fork the repo on GitHub, put your notebook changes on a branch off the spark-3 branch in your fork, and post the pull request after you push your changes to that branch.

I'm going to close this issue since it's a documentation issue in the spark-xgboost-examples repo rather than a bug in the RAPIDS Accelerator plugin.

@jlowe jlowe closed this as completed Aug 28, 2020
@sameerz sameerz added documentation Improvements or additions to documentation and removed ? - Needs Triage Need team to review and classify labels Aug 29, 2020
tgravescs pushed a commit to tgravescs/spark-rapids that referenced this issue Nov 30, 2023
…IDIA#628)

Signed-off-by: spark-rapids automation <70000568+nvauto@users.noreply.github.com>
