SPARK SQL will fail on AWS EMR with incompatible data types issues #730
@gourav-sg thanks for the bug report! I believe this is already fixed in the upcoming 0.2 release, as I was unable to reproduce it with rapids-4-spark_2.12-0.2.0-SNAPSHOT.jar and cudf-0.15-cuda10-1.jar. If you don't mind, it would be good if you could also verify this issue does not reproduce in 0.2.0-SNAPSHOT. You can build your own version of 0.2.0-SNAPSHOT by checking out the latest on branch-0.2 and building; it's a straightforward Maven build. If you are able to reproduce it on 0.2.0, it would be good to get the full stack trace of the error.
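The build steps can be sketched roughly as follows; the repository URL and Maven invocation are typical for this project, but the `-DskipTests` flag is an assumption for a quick local build:

```shell
# Sketch: check out branch-0.2 and build the plugin jar with Maven.
# -DskipTests is an assumption to speed up a local build.
git clone https://github.com/NVIDIA/spark-rapids.git
cd spark-rapids
git checkout branch-0.2
mvn clean package -DskipTests
```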
hi @jlowe
This shows there was an error initializing the plugin. Do you see anything earlier in the log indicating what may have triggered the error, either in the driver or executor logs? (e.g. an UnsatisfiedLinkError or something similar)
Hi, this is the first error visible, thanks a ton for coming back :)
The NoSuchMethodError makes me wonder if we're getting into a situation similar to what happened initially in #628. Are there multiple versions of cudf in the classpath somehow?

Hi,
Let me check on this and get back to you, I have uninstalled and then installed the libraries, therefore that should not be happening.
Regards,
Gourav
Hi, this is the error that I see in the Python console; the error in the YARN logs is already mentioned above:
Regards,
Hi, I remember there is a configuration in Spark 3.x to fall back from the GPU to the normal CPU in case of errors; can you please let me know the configuration? Also, if there is any workaround for this issue it would be a lot of help. Regards,
OK, so this node has a CUDA 10.2 runtime environment. That's useful to know. However, I was asking about the cudf jar. Given this is a CUDA 10.2 runtime environment, are you using the cudf-0.15-cuda10-2.jar or cudf-0.15-cuda10-1.jar? The cudf jar being used needs to have a classifier (the "cuda10-1" or "cuda10-2" part of the jar name) that matches the CUDA runtime environment available on the system. You're getting a NoSuchMethodError when cudf initializes, which indicates there may be a mismatch between the cudf Java code and the cudf JNI code.

Looking closer at your startup configs, it looks like you're placing two different cudf jars in the classpath, cudf-0.9.2 and cudf-0.14-cuda10-1. There should only be one cudf jar in the classpath, and it needs to match the version expected by the rapids plugin. If the cudf-0.9.2 jar is also in the classpath when trying to run with plugin 0.2.0-SNAPSHOT and cudf-0.15, then that could explain the problem there. Also, cudf-0.14-cuda10-1.jar is built for CUDA runtime 10.1, but you mentioned a CUDA 10.2 runtime environment above, which is a bit odd. Maybe the CUDA 10.1 environment is also available on that node.
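The matching rule described above can be sketched with a small, purely hypothetical helper (`cudf_jar_for` is not a real API; it only illustrates how the jar name is composed from the CUDA classifier):

```python
# Hypothetical helper illustrating the rule: the cudf jar's classifier
# (e.g. "cuda10-2") must match the CUDA runtime on the node.
def cudf_jar_for(cudf_version: str, cuda_runtime: str) -> str:
    """Map a CUDA runtime version like '10.2' to the matching cudf jar name."""
    classifier = "cuda" + cuda_runtime.replace(".", "-")
    return f"cudf-{cudf_version}-{classifier}.jar"

print(cudf_jar_for("0.15", "10.2"))  # cudf-0.15-cuda10-2.jar
```

So a CUDA 10.2 node needs the `cuda10-2` jar, not the `cuda10-1` one.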
The Python error is caused by cudf failing to initialize; solving the cudf initialization problem should resolve it as well.
The plugin should never throw an exception like ClassCastException. The plugin does not support dynamically switching back to the CPU when exceptions occur during runtime. It only falls back to the CPU during the query planning process when a query operation is known to not have a GPU-equivalent (or that equivalent has been configured to be disallowed). As I mentioned above, I believe the error you are getting with the plugin 0.1.0 jars is fixed in the 0.2.0 release because I followed your problem reproduction steps and did not see the error.
You could try disabling Parquet reads in the plugin via its Parquet configuration setting.
Hi @jlowe, the details of the code execution are mentioned below; you can see that I am using the correct CUDA jar file. Can you please share your code?
Regards,
This is a different error, so I assume you were able to get past the earlier one. We're going to have to dig deeper into this on our end. In the meantime, it would be helpful if you could provide the full logs along with the query's explain output.
Hi, the entire code is mentioned below and the entire YARN log is attached.
Regards,
Hi @jlowe
Regards,
Sorry for the delayed reply. We did a bit of investigation on our end, and I believe we found a workaround, at least for the one test query that is failing for you. Credit goes to @tgravescs for finding some settings that should help. We recommend setting the following two configs when running with the 0.1.0 plugin in your AWS EMR setup:
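The two settings themselves did not survive in the text here, but they can be reconstructed from the explanation that follows and from later comments in this thread; as a sketch:

```python
# Reconstructed from the surrounding discussion; pass these to SparkConf.set
# or as --conf options on spark-submit.
workaround_confs = {
    # Disable Adaptive Query Execution, which plugin 0.1.0 does not support.
    "spark.sql.adaptive.enabled": "false",
    # An empty list forces all data sources through the DataSourceV2 path.
    "spark.sql.sources.useV1SourceList": "",
}

for key, value in workaround_confs.items():
    print(f'--conf "{key}={value}"')
```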
The first config setting will disable Adaptive Query Execution (AQE), which is not supported by the 0.1.0 version of the plugin. The second config setting forces Spark to load the data via DataSourceV2 interfaces, which allows the test query to work. We believe something specific to the AWS EMR version of Spark is interfering with the plugin's processing of DataSourceV1 loads from S3, but we don't yet know the root cause of the issue. I hope this allows you to proceed with your testing. Note that we will not be able to officially support an AWS EMR setup until the 0.3.0 release at the earliest. It appears there will need to be plugin changes to support whatever AWS EMR's Spark is doing with V1 data sources, and the 0.2.0 release is in its final stages.
Hi @tgravescs and @jlowe,
Hi @jlowe and @tgravescs, I have been able to finally test with the changes but now nothing works via the GPU anymore. Do you have any sample data and queries that I can use to test whether DataSourceV1 can use GPUs? Regards,
Can you elaborate on what you mean by "now nothing works"? Are operations simply not running on the GPU at all, is it crashing, or something else?

A perfect example is the one you provided in the description of this ticket:

spark.sql("SELECT 'a' FLD1, id FROM range(100)").write.parquet("s3://gourav-bucket/gourav/testdata3/")
spark.read.parquet("s3://gourav-bucket/gourav/testdata3/").createOrReplaceTempView("test")
spark.sql("SELECT * FROM test").show()

That's what I used to replicate the issue in AWS EMR. Setting spark.sql.sources.useV1SourceList to an empty string, forcing all data sources to use V2 instead of V1, and re-executing that query setup allowed it to read Parquet data from S3 using the GPU, as seen in the query explanation (e.g. replace .show() with .explain()).

Hi,
sorry, what I mean by "now nothing works" is that "now nothing works on the GPU". But let me check on this once again and come back.
Thank you so much for getting back to me.
Regards,
Gourav Sengupta
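One way to check which operators ran on the GPU is to scan the plan text from .explain() for nodes the plugin renames with a "Gpu" prefix. The helper below is purely illustrative (`gpu_operators` is not a plugin API), and the sample plan text is a made-up fragment:

```python
import re

# Illustrative helper (not a plugin API): scan an explain() string for
# operators the RAPIDS plugin prefixes with "Gpu".
def gpu_operators(plan_text: str) -> list:
    """Return the sorted, de-duplicated Gpu* operator names in a plan dump."""
    return sorted(set(re.findall(r"\bGpu\w+", plan_text)))

# Hypothetical fragment of a physical plan, for illustration only.
sample_plan = """
== Physical Plan ==
GpuColumnarToRow false
+- GpuFileGpuScan parquet ...
"""
print(gpu_operators(sample_plan))  # ['GpuColumnarToRow', 'GpuFileGpuScan']
```

If that list is empty for your query, nothing in the plan was placed on the GPU.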
You can take a look at a straightforward join operation at https://nvidia.github.io/spark-rapids/docs/get-started/getting-started.html#example-join-operation or at the mortgage ETL notebook demo at https://nvidia.github.io/spark-rapids/docs/examples.html.
Also, if your operators are not running on the GPU, can you please let us know what the output of the explain is in the driver log? It looks like you have the configuration conf.set('spark.rapids.sql.explain', 'ALL') enabled, so the driver log will detail which operations and data types are allowed on the GPU and which ones block something from being on the GPU.
Hi @gourav-sg, since this was filed we've spent time integrating more closely with AWS EMR. There are instructions on using the 0.2 version of the plugin with EMR at https://nvidia.github.io/spark-rapids/docs/get-started/getting-started-aws-emr.html. Let us know if you have any more questions on this issue, or if we can close it.
Dearest Sameer,
EMR 6.2.0 works wonderfully and seamlessly with the instructions given in
EMR documentation.
https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-rapids.html
We should definitely be able to close this issue, thank you so very much
for your kind help with this.
Thanks and Regards,
Gourav Sengupta
Describe the bug
java.lang.ClassCastException: org.apache.spark.sql.execution.vectorized.OnHeapColumnVector cannot be cast to com.nvidia.spark.rapids.GpuColumnVector
Steps/Code to reproduce bug
from pyspark.sql import SparkSession
from pyspark import SparkConf
conf = SparkConf().setAppName("MortgageETL")
conf.set("spark.jars", "s3://gourav-bucket/gourav/gpu/cudf-0.9.2.jar,s3://gourav-bucket/gourav/gpu/rapids-4-spark_2.12-0.1.0.jar,s3://gourav-bucket/gourav/gpu/cudf-0.14-cuda10-1.jar")
conf.set('spark.rapids.sql.explain', 'ALL')
conf.set("spark.executor.extraJavaOptions", "-Dai.rapids.cudf.prefer-pinned=true")
conf.set("spark.rapids.sql.concurrentGpuTasks", "1")
conf.set("spark.plugins", "com.nvidia.spark.SQLPlugin")
conf.set("spark.sql.adaptive.enabled", False)
conf.set("spark.executor.resource.gpu.discoveryScript","/usr/lib/spark/examples/src/main/scripts/getGpusResources.sh")
conf.set("spark.executor.resource.gpu.amount", "1")
conf.set("spark.task.resource.gpu.amount", "0.25")
conf.set("spark.executor.cores", "2")
conf.set("spark.task.cpus", "1")
spark = SparkSession.builder.enableHiveSupport().config(conf=conf).master("yarn").getOrCreate()
spark.conf.set('spark.rapids.sql.incompatibleOps.enabled', False)
spark.sql("SELECT 'a' FLD1, id FROM range(100)").show()
# THIS WORKS
spark.sql("SELECT 'a' FLD1, id FROM range(100)").write.parquet("s3://gourav-bucket/gourav/testdata3/")
spark.read.parquet("s3://gourav-bucket/gourav/testdata3/").createOrReplaceTempView("test")
spark.sql("SELECT * FROM test").show()
# FAILED HERE
spark.conf.set('spark.rapids.sql.incompatibleOps.enabled', True)
spark.sql("SELECT * FROM test").show()
# FAILED AGAIN
Expected behavior
should show the table
Environment details (please complete the following information)
Environment location: YARN, Cloud(AWS EMR 6.1.0)
Thu Sep 10 22:14:18 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01 Driver Version: 440.33.01 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla T4 Off | 00000000:00:1E.0 Off | 0 |
| N/A 43C P0 27W / 70W | 14108MiB / 15109MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 16812 C /etc/alternatives/jre/bin/java 14097MiB |
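As an aside, the CUDA runtime version that the cudf jar classifier must match can be pulled out of an nvidia-smi header like the one above; a small sketch:

```python
import re

# Parse the CUDA runtime version out of an nvidia-smi header line,
# such as the one shown in the environment details above.
def cuda_version(smi_header: str) -> str:
    match = re.search(r"CUDA Version:\s*([\d.]+)", smi_header)
    return match.group(1) if match else ""

header = "| NVIDIA-SMI 440.33.01  Driver Version: 440.33.01  CUDA Version: 10.2 |"
print(cuda_version(header))  # 10.2
```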