
SPARK SQL does not work #628

Closed
gourav-sg opened this issue Aug 28, 2020 · 16 comments
Labels
bug Something isn't working documentation Improvements or additions to documentation

Comments

@gourav-sg

gourav-sg commented Aug 28, 2020

Describe the bug
On AWS EMR, Spark SQL does not work. The error thrown is: `cannot run on GPU because no GPU enabled version of operator class org.apache.spark.sql.execution.datasources.v2.SetCatalogAndNamespaceExec could be found`

Steps/Code to reproduce bug
NOTE that https://github.com/apache/spark/blob/master/examples/src/main/scripts/getGpusResources.sh was downloaded onto all the nodes as part of the bootstrap actions.

```python
from pyspark.sql import SparkSession
from pyspark import SparkConf

conf = SparkConf().setAppName("MortgageETL")
conf.set("spark.jars", "s3://gourav-bucket/gourav/gpu/cudf-0.9.2.jar,s3://gourav-bucket/gourav/gpu/rapids-4-spark_2.12-0.1.0.jar,s3://gourav-bucket/gourav/gpu/cudf-0.14-cuda10-1.jar")
conf.set("spark.rapids.sql.explain", "ALL")
conf.set("spark.executor.instances", "20")
conf.set("spark.executor.cores", "2")
conf.set("spark.task.cpus", "1")
conf.set("spark.rapids.sql.concurrentGpuTasks", "1")
conf.set("spark.executor.memory", "4g")
conf.set("spark.rapids.memory.pinnedPool.size", "1G")
conf.set("spark.executor.memoryOverhead", "2G")
conf.set("spark.executor.extraJavaOptions", "-Dai.rapids.cudf.prefer-pinned=true")
conf.set("spark.locality.wait", "0s")
conf.set("spark.sql.files.maxPartitionBytes", "512m")
conf.set("spark.executor.resource.gpu.amount", "1")
conf.set("spark.task.resource.gpu.amount", "0.25")
conf.set("spark.plugins", "com.nvidia.spark.SQLPlugin")
conf.set("spark.rapids.sql.hasNans", "false")
conf.set("spark.rapids.sql.batchSizeBytes", "512M")
conf.set("spark.rapids.sql.reader.batchSizeBytes", "768M")
conf.set("spark.rapids.sql.variableFloatAgg.enabled", "true")
conf.set("spark.sql.adaptive.enabled", False)
conf.set("spark.executor.resource.gpu.discoveryScript", "/mnt/mapred/getGpusResources.sh")

spark = (SparkSession.builder
         .enableHiveSupport()
         .config(conf=conf)
         .master("yarn")
         .getOrCreate())
```
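For context, the discovery script configured above must print a single JSON document naming the resource and its addresses when Spark invokes it on an executor node. A minimal sketch of that output format in Python (the indices here are illustrative; a real script such as getGpusResources.sh derives them from `nvidia-smi --query-gpu=index`):

```python
import json

def format_gpu_resource(indices):
    """Build the JSON payload a Spark GPU discovery script must print:
    the resource name plus the list of GPU addresses as strings."""
    return json.dumps({"name": "gpu", "addresses": [str(i) for i in indices]})

if __name__ == "__main__":
    # On a real node the indices would come from querying nvidia-smi.
    print(format_gpu_resource([0, 1]))
```

If the script prints anything other than this shape, Spark cannot assign GPU addresses to executors.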

Expected behavior
The code should run without errors.

Environment details (please complete the following information)

  • Environment location: YARN, Cloud(AWS EMR 6.1.0)
  • Spark configuration settings related to the issue

Additional context
The code fails both with and without Hive support enabled.

@gourav-sg gourav-sg added ? - Needs Triage Need team to review and classify bug Something isn't working labels Aug 28, 2020
@krajendrannv
Contributor

Is AWS EMR 6.1.0 using Apache Spark 3.0? If not, Spark SQL on GPU won't work.

@gourav-sg
Author

Is AWS EMR 6.1.0 using Apache Spark 3.0? If not, Spark SQL on GPU won't work.

Yes, it is indeed.

@revans2
Collaborator

revans2 commented Aug 28, 2020

SetCatalogAndNamespaceExec is a metadata operation that we will never put on the GPU. We have plans to stop reporting it as a warning (#499). You can ignore it for now. Are there other warnings that you are seeing?

@gourav-sg
Author

SetCatalogAndNamespaceExec is a metadata operation that we will never put on the GPU. We have plans to stop reporting it as a warning (#499). You can ignore it for now. Are there other warnings that you are seeing?

The cluster does not even start without Hive enabled; the containers do not get allocated at all. The following is from the YARN logs:

```
20/08/28 18:03:06 ERROR RapidsExecutorPlugin: Exception in the executor plugin
java.lang.NoSuchMethodError: ai.rapids.cudf.Cuda.setDevice(I)V
	at com.nvidia.spark.rapids.GpuDeviceManager$.setGpuDeviceAndAcquire(GpuDeviceManager.scala:90)
	at com.nvidia.spark.rapids.GpuDeviceManager$.$anonfun$initializeGpu$1(GpuDeviceManager.scala:117)
	at scala.runtime.java8.JFunction1$mcII$sp.apply(JFunction1$mcII$sp.java:23)
	at scala.Option.map(Option.scala:230)
	at com.nvidia.spark.rapids.GpuDeviceManager$.initializeGpu(GpuDeviceManager.scala:117)
	at com.nvidia.spark.rapids.GpuDeviceManager$.initializeGpuAndMemory(GpuDeviceManager.scala:125)
	at com.nvidia.spark.rapids.RapidsExecutorPlugin.init(Plugin.scala:230)
	at org.apache.spark.internal.plugin.ExecutorPluginContainer.$anonfun$executorPlugins$1(PluginContainer.scala:111)
	at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:245)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at scala.collection.TraversableLike.flatMap(TraversableLike.scala:245)
	at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:242)
	at scala.collection.AbstractTraversable.flatMap(Traversable.scala:108)
	at org.apache.spark.internal.plugin.ExecutorPluginContainer.<init>(PluginContainer.scala:99)
	at org.apache.spark.internal.plugin.PluginContainer$.apply(PluginContainer.scala:164)
	at org.apache.spark.internal.plugin.PluginContainer$.apply(PluginContainer.scala:152)
	at org.apache.spark.executor.Executor.$anonfun$plugins$1(Executor.scala:158)
	at org.apache.spark.util.Utils$.withContextClassLoader(Utils.scala:221)
	at org.apache.spark.executor.Executor.<init>(Executor.scala:158)
	at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:168)
	at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
	at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:203)
	at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
	at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
	at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
```

@revans2
Collaborator

revans2 commented Aug 28, 2020

There is something odd happening. The API that is throwing the NoSuchMethodError has been in the cudf jar since 0.12, so unless you have a 0.9.x release of the cudf jar on your classpath you should not get that error.
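To illustrate why a stale jar produces this error, here is a sketch of classpath shadowing (this is a simplification, not the actual JVM class loader, and the jar contents shown are hypothetical): the first jar on the classpath that contains a class wins, so an old cudf jar listed before the new one supplies the old `Cuda` class, which lacks the newer `setDevice` method.

```python
def resolve_class(jars, class_name):
    """Mimic classpath resolution order: scan jars in order and return
    the name of the first jar that provides the requested class."""
    for jar_name, classes in jars:
        if class_name in classes:
            return jar_name
    return None

# Hypothetical jar contents: both cudf jars provide ai.rapids.cudf.Cuda,
# so the older jar listed first shadows the newer one.
jars = [
    ("cudf-0.9.2.jar", {"ai.rapids.cudf.Cuda"}),
    ("rapids-4-spark_2.12-0.1.0.jar", {"com.nvidia.spark.SQLPlugin"}),
    ("cudf-0.14-cuda10-1.jar", {"ai.rapids.cudf.Cuda"}),
]
print(resolve_class(jars, "ai.rapids.cudf.Cuda"))  # the stale 0.9.2 jar wins
```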

@jlowe
Member

jlowe commented Aug 28, 2020

The first thing in your list of jars is indeed cudf-0.9.2 which explains the issue.

conf.set("spark.jars", "s3://gourav-bucket/gourav/gpu/cudf-0.9.2.jar,s3://gourav-bucket/gourav/gpu/rapids-4-spark_2.12-0.1.0.jar,s3://gourav-bucket/gourav/gpu/cudf-0.14-cuda10-1.jar")
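A quick way to catch this class of mistake is to scan the spark.jars value for conflicting cudf versions before submitting. A small sketch (the version-extraction regex is an illustration, keyed to the jar naming used in this thread):

```python
import re

def cudf_versions(spark_jars):
    """Extract the cudf version number from each jar path in a
    comma-separated spark.jars value."""
    return [m.group(1) for p in spark_jars.split(",")
            if (m := re.search(r"cudf-(\d+\.\d+(?:\.\d+)?)", p))]

jars = ("s3://gourav-bucket/gourav/gpu/cudf-0.9.2.jar,"
        "s3://gourav-bucket/gourav/gpu/rapids-4-spark_2.12-0.1.0.jar,"
        "s3://gourav-bucket/gourav/gpu/cudf-0.14-cuda10-1.jar")
versions = cudf_versions(jars)
if len(set(versions)) > 1:
    # Two different cudf releases on one classpath: exactly this bug.
    print("conflicting cudf jars:", versions)
```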

@gourav-sg
Author

gourav-sg commented Aug 28, 2020

The first thing in your list of jars is indeed cudf-0.9.2 which explains the issue.

conf.set("spark.jars", "s3://gourav-bucket/gourav/gpu/cudf-0.9.2.jar,s3://gourav-bucket/gourav/gpu/rapids-4-spark_2.12-0.1.0.jar,s3://gourav-bucket/gourav/gpu/cudf-0.14-cuda10-1.jar")

Yeah, I tried it using the line below, and there is still an issue:
conf.set("spark.jars", "s3://gourav-bucket/gourav/gpu/cudf-0.14-cuda10-1.jar,s3://gourav-bucket/gourav/gpu/rapids-4-spark_2.12-0.1.0.jar")

The issue reported now is:

```
20/08/28 20:25:21 WARN ResourceRequestHelper: YARN doesn't know about resource yarn.io/gpu, your resource discovery has to handle properly discovering and isolating the resource! Error: The resource manager encountered a problem that should not occur under normal circumstances. Please report this error to the Hadoop community by opening a JIRA ticket at http://issues.apache.org/jira and including the following information:
  * Resource type requested: yarn.io/gpu
```

@jlowe
Member

jlowe commented Aug 28, 2020

That implies the YARN cluster has not been configured to schedule GPUs. Please check the YARN configuration files and verify GPU scheduling is enabled: https://hadoop.apache.org/docs/r3.2.1/hadoop-yarn/hadoop-yarn-site/UsingGpus.html
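For reference, enabling GPU scheduling in YARN starts with declaring the resource type. A minimal resource-types.xml fragment might look like the following (a sketch; the complete setup, including node-manager resource plugins, cgroups isolation, and scheduler settings, is described in the Hadoop GPU guide linked above):

```xml
<!-- resource-types.xml (sketch; see the Hadoop "Using GPU On YARN" docs) -->
<configuration>
  <property>
    <name>yarn.resource-types</name>
    <value>yarn.io/gpu</value>
  </property>
</configuration>
```

The node managers additionally need `yarn.nodemanager.resource-plugins` set to `yarn.io/gpu` in yarn-site.xml so they can discover and isolate the GPUs.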

@gourav-sg
Author

https://hadoop.apache.org/docs/r3.2.1/hadoop-yarn/hadoop-yarn-site/UsingGpus.html

Done that. Now I am getting the error "Could not load cudf jni library..", which was mentioned in #149. That issue refers to cudf-0.15, but I cannot see it here: https://repo1.maven.org/maven2/ai/rapids/cudf/

@jlowe
Member

jlowe commented Aug 28, 2020

If this message is occurring only on the driver node then it should be a benign message. The driver does not require a GPU or the cudf code to load in order to function.

which is referring to cudf-0.15 but I cannot see it here: https://repo1.maven.org/maven2/ai/rapids/cudf/

cudf-0.15 has not been released yet. Once it is, the jar will be posted there. Note that cudf-0.15 is likely not compatible with version 0.1.0 of the plugin jar, so you should stick with cudf-0.14 as long as you are using plugin version 0.1.0.
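The pairing constraint above can be expressed as a small lookup, useful as a pre-flight check in a launcher script. The compatibility table here is hypothetical beyond the one pairing this thread confirms (plugin 0.1.0 with cudf 0.14):

```python
# Hypothetical compatibility table: which cudf releases each plugin
# version was built against. Only the 0.1.0 -> 0.14 pairing is
# confirmed by this thread; extend from the release notes.
COMPATIBLE_CUDF = {
    "0.1.0": {"0.14"},
}

def check_compat(plugin_version, cudf_version):
    """Return True when the cudf jar is known to match the plugin jar."""
    return cudf_version in COMPATIBLE_CUDF.get(plugin_version, set())

print(check_compat("0.1.0", "0.14"))  # True
print(check_compat("0.1.0", "0.15"))  # False: 0.15 likely breaks 0.1.0
```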

@gourav-sg
Author

gourav-sg commented Aug 28, 2020

If this message is occurring only on the driver node then it should be a benign message. The driver does not require a GPU or the cudf code to load in order to function.

which is referring to cudf-0.15 but I cannot see it here: https://repo1.maven.org/maven2/ai/rapids/cudf/

cudf-0.15 has not yet released. Once it has the jar will be posted there. Note that cudf-0.15 is likely not compatible with version 0.1.0 of the plugin jar, so you should stick with cudf-0.14 as long as you are using plugin version 0.1.0.

The current error:

We have the following file in EMR:

```
0 lrwxrwxrwx 1 root root 16 Aug 28 20:11 /usr/local/cuda/lib64/libcudart.so -> libcudart.so.9.2
```

```
20/08/28 20:46:43 ERROR NativeDepsLoader: Could not load cudf jni library...
java.lang.UnsatisfiedLinkError: /mnt3/yarn/usercache/hadoop/appcache/application_1598647482677_0001/container_1598647482677_0001_01_000005/tmp/rmm3831790456313469993.so: libcudart.so.10.1: cannot open shared object file: No such file or directory
	at java.lang.ClassLoader$NativeLibrary.load(Native Method)
	at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1934)
	at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1817)
	at java.lang.Runtime.load0(Runtime.java:809)
	at java.lang.System.load(System.java:1088)
	at ai.rapids.cudf.NativeDepsLoader.loadDep(NativeDepsLoader.java:81)
	at ai.rapids.cudf.NativeDepsLoader.loadNativeDeps(NativeDepsLoader.java:49)
	at ai.rapids.cudf.Cuda.<clinit>(Cuda.java:28)
	at com.nvidia.spark.rapids.GpuDeviceManager$.setGpuDeviceAndAcquire(GpuDeviceManager.scala:90)
	at com.nvidia.spark.rapids.GpuDeviceManager$.$anonfun$initializeGpu$1(GpuDeviceManager.scala:117)
	at scala.runtime.java8.JFunction1$mcII$sp.apply(JFunction1$mcII$sp.java:23)
	at scala.Option.map(Option.scala:230)
	at com.nvidia.spark.rapids.GpuDeviceManager$.initializeGpu(GpuDeviceManager.scala:117)
	at com.nvidia.spark.rapids.GpuDeviceManager$.initializeGpuAndMemory(GpuDeviceManager.scala:125)
	at com.nvidia.spark.rapids.RapidsExecutorPlugin.init(Plugin.scala:230)
	at org.apache.spark.internal.plugin.ExecutorPluginContainer.$anonfun$executorPlugins$1(PluginContainer.scala:111)
	at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:245)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at scala.collection.TraversableLike.flatMap(TraversableLike.scala:245)
	at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:242)
	at scala.collection.AbstractTraversable.flatMap(Traversable.scala:108)
	at org.apache.spark.internal.plugin.ExecutorPluginContainer.<init>(PluginContainer.scala:99)
	at org.apache.spark.internal.plugin.PluginContainer$.apply(PluginContainer.scala:164)
	at org.apache.spark.internal.plugin.PluginContainer$.apply(PluginContainer.scala:152)
	at org.apache.spark.executor.Executor.$anonfun$plugins$1(Executor.scala:158)
	at org.apache.spark.util.Utils$.withContextClassLoader(Utils.scala:221)
	at org.apache.spark.executor.Executor.<init>(Executor.scala:158)
	at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:168)
	at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
	at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:203)
	at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
	at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
	at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
20/08/28 20:46:43 ERROR RapidsExecutorPlugin: Exception in the executor plugin
java.lang.UnsatisfiedLinkError: ai.rapids.cudf.Cuda.setDevice(I)V
	at ai.rapids.cudf.Cuda.setDevice(Native Method)
	at com.nvidia.spark.rapids.GpuDeviceManager$.setGpuDeviceAndAcquire(GpuDeviceManager.scala:90)
	at com.nvidia.spark.rapids.GpuDeviceManager$.$anonfun$initializeGpu$1(GpuDeviceManager.scala:117)
	at scala.runtime.java8.JFunction1$mcII$sp.apply(JFunction1$mcII$sp.java:23)
	at scala.Option.map(Option.scala:230)
	at com.nvidia.spark.rapids.GpuDeviceManager$.initializeGpu(GpuDeviceManager.scala:117)
	at com.nvidia.spark.rapids.GpuDeviceManager$.initializeGpuAndMemory(GpuDeviceManager.scala:125)
	at com.nvidia.spark.rapids.RapidsExecutorPlugin.init(Plugin.scala:230)
	at org.apache.spark.internal.plugin.ExecutorPluginContainer.$anonfun$executorPlugins$1(PluginContainer.scala:111)
	at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:245)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at scala.collection.TraversableLike.flatMap(TraversableLike.scala:245)
	at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:242)
	at scala.collection.AbstractTraversable.flatMap(Traversable.scala:108)
	at org.apache.spark.internal.plugin.ExecutorPluginContainer.<init>(PluginContainer.scala:99)
	at org.apache.spark.internal.plugin.PluginContainer$.apply(PluginContainer.scala:164)
	at org.apache.spark.internal.plugin.PluginContainer$.apply(PluginContainer.scala:152)
	at org.apache.spark.executor.Executor.$anonfun$plugins$1(Executor.scala:158)
	at org.apache.spark.util.Utils$.withContextClassLoader(Utils.scala:221)
	at org.apache.spark.executor.Executor.<init>(Executor.scala:158)
	at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:168)
	at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
	at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:203)
	at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
	at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
	at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
```

@jlowe
Member

jlowe commented Aug 28, 2020

libcudart.so.10.1: cannot open shared object file: No such file or directory

This is the relevant portion of the error. The cudf jar is built for a specific CUDA runtime, 10.1 in this case. There is a version built for the CUDA 10.2 runtime at https://repo1.maven.org/maven2/ai/rapids/cudf/0.14/cudf-0.14-cuda10-2.jar. CUDA runtimes are typically installed under /usr/local/cuda; check which version you have. The nvidia-smi command will show which version of the driver is installed (which is NOT necessarily the same as the CUDA runtime version!), but the driver version will at least tell you whether it is capable of supporting the CUDA 10.1 or CUDA 10.2 runtime, one of which is required to run the RAPIDS Accelerator plugin.

See https://docs.nvidia.com/deploy/cuda-compatibility/index.html for a description of the CUDA environment and how the driver and runtime versions interact. That page also has a compatibility table showing the minimum driver versions required to support the various CUDA runtimes. If your driver is recent enough for CUDA 10.1 or CUDA 10.2, you should be able to install the corresponding CUDA runtime package if it is missing from your system.
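The runtime-to-jar matching described above can be sketched as a small helper that inspects the libcudart files present on a node (for example, a listing of /usr/local/cuda*/lib64) and picks the matching cudf jar classifier. This is a hypothetical illustration, not part of the plugin; the classifier names mirror the cudf-0.14 artifacts mentioned in this thread:

```python
import re

# cudf-0.14 artifacts exist for these CUDA runtimes; older runtimes like
# 9.2 (as found on this EMR node) have no matching jar.
SUPPORTED = {"10.1": "cuda10-1", "10.2": "cuda10-2"}

def pick_cudf_classifier(libcudart_files):
    """Return the cudf jar classifier matching an installed CUDA runtime,
    or None when no supported libcudart version is present."""
    for f in libcudart_files:
        m = re.search(r"libcudart\.so\.(\d+\.\d+)", f)
        if m and m.group(1) in SUPPORTED:
            return SUPPORTED[m.group(1)]
    return None

print(pick_cudf_classifier(["libcudart.so.9.2"]))   # None: runtime too old
print(pick_cudf_classifier(["libcudart.so.10.2"]))  # cuda10-2
```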

@gourav-sg
Author

gourav-sg commented Aug 28, 2020

Your team is absolutely brilliant; I am in fact surprised by your cooperation and support.

We have the following file in EMR:

```
0 lrwxrwxrwx 1 root root 16 Aug 28 20:11 /usr/local/cuda/lib64/libcudart.so -> libcudart.so.9.2
```

Can the person from NVIDIA who is causing all this confusion by not doing proper diligence please be asked to update this article: https://aws.amazon.com/blogs/big-data/improving-rapids-xgboost-performance-and-reducing-costs-with-amazon-emr-running-amazon-ec2-g4-instances/?nc1=b_rp?

The article is clearly misleading; people using something like xgboost will try running SQL first to prepare their data, and the article is frustratingly incomplete.

@jlowe
Member

jlowe commented Aug 28, 2020

I am infact surprised by your cooperation and support.

Just trying to help, glad you are trying out the software and are willing to work through issues!

I took a quick look at the article, and it appears not to be using the RAPIDS Accelerator for Apache Spark (this project) but rather a custom solution for xgboost that was built earlier. The notebook referred to in that article uses an old GpuDatasetReader class and is based on cudf-0.9.2, which would work with the CUDA 9.2 runtime. I suspect it works if you follow the exact directions in the article, but it won't be using the RAPIDS Accelerator.

It appears the updated getting started guide is now at https://github.com/NVIDIA/spark-xgboost-examples/blob/spark-3/getting-started-guides/csp/aws/ec2.md which shows running with the RAPIDS Accelerator plugin and xgboost, but it does so under EC2 rather than EMR, running Spark in standalone mode rather than with Spark-on-YARN.

There probably needs to be a getting started guide for EMR for those who would rather work in that environment. Would you be willing to file an issue in the https://github.com/NVIDIA/spark-xgboost-examples repo requesting an AWS EMR getting started guide?

@gourav-sg
Author

Hi @jlowe, I am currently working on it. Is there a way I could contribute by writing the notebook and checking it in?

@jlowe
Member

jlowe commented Aug 28, 2020

is there a way I could contribute and write the notebook and check it in?

You can definitely write up a notebook and submit a pull request against https://github.com/NVIDIA/spark-xgboost-examples; that would be great! I can't guarantee it will be accepted verbatim or at all, as it's up to the committers in that repository to review the contribution and ultimately decide whether to accept it. However, in general, contributions of all types (issues reported, features requested, pull requests posted, etc.) are welcome!

I would recommend filing the issue first and posting a follow-up comment stating that you are interested in working on it and plan to post a pull request. Then you can fork the repo on GitHub, put your notebook changes on a branch off the spark-3 branch in your fork, and post the pull request after you push your changes to that branch.

I'm going to close this issue since it's a documentation issue in the spark-xgboost-examples repo rather than a bug in the RAPIDS Accelerator plugin.

@jlowe jlowe closed this as completed Aug 28, 2020
@sameerz sameerz added documentation Improvements or additions to documentation and removed ? - Needs Triage Need team to review and classify labels Aug 29, 2020
tgravescs pushed a commit to tgravescs/spark-rapids that referenced this issue Nov 30, 2023
…IDIA#628)

Signed-off-by: spark-rapids automation <70000568+nvauto@users.noreply.github.com>
