[BUG] Mortgage ETL sample failed with spark.sql.adaptive enabled on AWS EMR 6.2 #1423
Comments
It looks like EMR's definition of org.apache.spark.sql.execution.exchange.BroadcastExchangeLike extends another class, so we can try to shim this.
This happens in places other than EMR.
I think I tracked down the problem. In fixUpExchangeOverhead we check to make sure everything can be replaced, but there is one weird corner case where we remove the Sort when we replace a SortMergeJoin with a ShuffleHashJoin. This ends up showing:

20/12/19 00:40:02 WARN GpuShuffleMeta: parent can be replaced is not empty Some(*Exec could run on GPU but is going to be removed because removing SortExec as part replacing sortMergeJoin with shuffleHashJoin

This causes us not to mark the GpuColumnarExchange as "will not work", even though the child wouldn't run on the GPU in the initial planning and thus is not tagged. With AQE on, that subquery ends up being re-evaluated, and at that point we do properly change the GpuColumnarExchange to a CPU one. But since it wasn't tagged, the other side of the join subquery doesn't know about it and runs on the GPU; then we have a mismatch and get the error that one side is CPU and one is GPU. In this case it is the parent node, not the child; the converted plan would look like:

*Exec could run on GPU

So I think we just need to add a check to handle this special case. I did try a quick hack change to see if it passed; it got by that error but failed later with: So I'm not sure if there are further things broken as well.
Correction to the above comment: fixUpExchangeOverhead checks to see if both:
In this case the child is, but not the parent, so it doesn't mark it as "will not work". But then when AQE kicks in and looks only at the subquery, I believe it only sees the child, so it gets marked as "won't work"; the other side does go on the GPU, and we end up with the mismatch.
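The tagging corner case described above can be modeled with a small sketch. This is NOT the plugin's actual Scala code; it is a hypothetical Python restatement, assuming the elided condition is that both the child and the parent must be unreplaceable before the exchange is tagged as "will not work":

```python
# Hypothetical model of the pre-fix fixUpExchangeOverhead tagging rule.
# Assumption (the exact condition is elided in the comment above): the
# exchange is only tagged "will not work on GPU" when BOTH sides fail.
def tag_exchange_will_not_work(child_replaceable: bool,
                               parent_replaceable: bool) -> bool:
    return not child_replaceable and not parent_replaceable

# The corner case: the child cannot be replaced (its SortExec is being
# removed while turning a SortMergeJoin into a ShuffleHashJoin), but the
# parent can -> the exchange is left untagged.
untagged = not tag_exchange_will_not_work(child_replaceable=False,
                                          parent_replaceable=True)
print(untagged)  # the exchange stays untagged in the initial plan
```

Because the exchange stays untagged, the AQE re-plan of the subquery later flips it to a CPU exchange while the other side of the join has already committed to the GPU, producing the CPU/GPU mismatch error.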
So just to clarify, there are 2 issues here:
Note that with the AQE fix, issue 1 doesn't show up on the mortgage runs; I believe it will only show up on failures, at least in this scenario.
For the second issue, the incompatibility with EMR, we are going to assume it won't be a problem with the official EMR-released jar, because they build against their own jars. So for now the fix for the AQE issue will resolve this.
Describe the bug
The Mortgage ETL sample failed with spark.sql.adaptive enabled on AWS EMR 6.2 with the 0.3 release.
With "spark.sql.adaptive.enabled":"false", the same ETL sample on an EMR PySpark Notebook will run and give us an end-to-end number.
With AQE on, we get the following error message from the notebook:
Steps/Code to reproduce bug
Notebook Code: (please adjust the S3 location for dataset)
Expected behavior
The Mortgage ETL example should complete without issue and report around 620s.
Environment details (please complete the following information)
Environment location: AWS EMR, 1x M5.xlarge for Master, 2x g4dn.12xlarge for Core nodes. EMR 6.2 release, jar file manually replaced with v0.3 (/usr/share/aws/emr/spark-rapids/lib)
Spark configuration settings related to the issue
You can manually turn off AQE by editing the Spark defaults on the master node:
sudo vi /etc/spark/conf/spark-defaults.conf
spark.sql.adaptive.enabled false
# restart on the master
sudo systemctl restart hadoop-yarn-resourcemanager.service
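As an alternative to editing spark-defaults.conf cluster-wide, AQE can be disabled per application from the PySpark session itself. A minimal sketch (the app name is hypothetical; the config key is the one from this issue):

```python
from pyspark.sql import SparkSession

# Sketch: disable AQE for this application only, as a workaround for
# the mismatch error, instead of restarting YARN on the master node.
spark = (
    SparkSession.builder
    .appName("mortgage-etl")  # hypothetical application name
    .config("spark.sql.adaptive.enabled", "false")
    .getOrCreate()
)
```

Note that spark.sql.adaptive.enabled is read at session startup, so it should be set before the session is created rather than changed mid-notebook.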
EMR cluster configuration
Additional context
I believe I ran into the same issue in v0.2 on EMR, so we turn AQE off when bringing up the cluster.