
Add support for Spark 2.x Explain Api #4529

Merged: 96 commits merged into NVIDIA:branch-22.02 on Jan 20, 2022

Conversation

@tgravescs (Collaborator) commented on Jan 13, 2022:

contributes to #4360

This PR adds a new module that copies much of the code from sql-plugin but keeps only the tagging functionality; all convert functions and dependencies on cudf were removed. This allows running a single jar (rapids-4-spark-sql-meta_2.11-22.02.0-SNAPSHOT-spark24.jar) against Spark 2.x and calling the explain API, val output = com.nvidia.spark.rapids.ExplainPlan.explainPotentialGpuPlan(mydf), to get an idea of how the user's application may run on the GPU. It also tells us if the plugin doesn't support some functions the user's application is using.
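
For illustration, a minimal spark-shell sketch of how the explain API might be invoked (the input path and query are hypothetical placeholders; only ExplainPlan.explainPotentialGpuPlan is the API added by this PR):

    // Start a Spark 2.4 shell with only the meta jar on the classpath, e.g.:
    //   spark-shell --jars rapids-4-spark-sql-meta_2.11-22.02.0-SNAPSHOT-spark24.jar
    import com.nvidia.spark.rapids.ExplainPlan

    // Hypothetical input and query; any existing DataFrame works here
    val mydf = spark.read.parquet("/data/events.parquet")
      .groupBy("category")
      .count()

    // Produces a human-readable report of which operators/expressions could run on the GPU
    val output = ExplainPlan.explainPotentialGpuPlan(mydf)
    println(output)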

Unfortunately I was unable to find a good way to split the main sql-plugin code into tagging and convert pieces before 22.02, which is why this is a separate module with a lot of copied code. In the next release we will look at commonizing the code.

Note I named the jar rapids-4-spark-sql-meta_2.11-22.02.0-SNAPSHOT-spark24.jar, thinking that if we are able to commonize the code the rapids-4-spark-sql-meta name could apply there as well. If people have other ideas please let me know.
In testing, this one jar worked for all the Spark 2.4.x versions I tried. It even works against CDH, which ships its own 2.4.x jar version, though I only tested one CDH version. Note that management said we only have to support Spark 2.4 and newer.
Also note that Spark 2.4 builds with Scala 2.11 by default, so that is what I use here. Spark 2.4 optionally supports 2.12, but I haven't supported that here. Scala 2.11 doesn't support -Ywarn-unused:imports, so I had to copy some build code into the module pom file.

Since it's completely separate code, it won't break if someone changes the sql-plugin in an incompatible way, so I don't have it built by default. It can be built in two ways:

  • top level run: mvn clean -Dbuildver=24X package -DskipTests
  • or cd spark2-sql-plugin and run: mvn clean package -DskipTests

I tried to add comments about what 2.x supports in various places to make diffing the code easier. It's also possible I missed some unused imports, as IntelliJ doesn't seem to import that module properly.

All testing is manual right now. Testing involved running all the NDS queries with Spark 2.4.x, getting the explain output, and comparing it to the explain output from a Spark 3.x run. There were very few differences, and the diffs were all because Spark 2.x doesn't handle the exact same functions the same way in Catalyst. I also manually hacked the integration tests to run the explain and check for exceptions. That only partly worked: lots of tests failed because they rely on features not present in 2.x. I manually went through a bunch of the output, but want to make another pass and get rid of things that aren't supported.

Some things Spark 2 doesn't support or handles differently (I'm sure there are things I'm missing):

  1. MapInPandasExec, Acosh, Asinh, Atanh, GetTimestamp, IntegralDivide, MapEntries, TransformValues, NormalizeNaNAndZero, KnownFloatingPointNormalized, DateAddInterval, TransformKeys
  2. Adaptive Execution is not supported in 2.x
  3. 2.x doesn't support the ANSI mode features
  4. SortMergeJoinExec optimizes differently, which causes some diffs in plans
  5. Aggregates don't support TypedImperativeAggregate, which means GpuObjectHashAggregateExecMeta is not supported
  6. There are some class differences, such as BaseAggregateExec being missing
  7. DataSource V2 is completely different
  8. 3.x switched to the Proleptic Gregorian calendar for file reading and 2.x did not (https://issues.apache.org/jira/browse/SPARK-31404)

Once this is merged, people will have to update the meta information in two places until we can get it commonized. I'm sure I will have to make another pass to get things up to date after this PR goes in.

If all this goes in, I will have to modify the deploy scripts to release the jar.

I added a script that attempts to diff all the classes and functions added here. It's in the scripts directory, along with a scripts/spark2diffs directory that holds the diff files used by that script. I used the script while upmerging to the latest, and it worked in the sense that it found a diff, which I manually resolved. I may try to run the script in premerge, but I can do that in a separate PR.

@tgravescs (Collaborator, Author) commented:

Upmerged to the latest again to pull in the 3.x explain changes, and also picked up another config change. This is ready; nothing else needs updating at the moment.

@tgravescs (Collaborator, Author) commented:

build

@jlowe (Member) left a comment:

I'm through almost all the diffs but skipped the huge GpuOverrides changes for now; posting what I have so far. I'm wondering if this would be a smaller delta from the baseline if we had the spark2 jar pull in the few cudf classes it needs, or even encapsulate/require the cudf jar. We already do not require CUDA or a GPU on the driver, so even if we pull in cudf classes during planning it shouldn't try to load native code or find CUDA or a GPU. That might simplify things quite a bit, since we wouldn't have to hack out anything that might touch a cudf class (even one as benign as DType).

Comment on lines +38 to +41
# Extract the GpuBroadcastNestedLoopJoinMeta class body (up to, but not including, convertToGpu)
# from the spark2 copy and from the original sql-plugin source, diff the two, and compare that
# diff against the expected diff checked in under spark2diffs
sed -n '/class GpuBroadcastNestedLoopJoinMeta/,/override def convertToGpu/{/override def convertToGpu/!p}' ../spark2-sql-plugin/src/main/scala/org/apache/spark/sql/rapids/execution/GpuBroadcastNestedLoopJoinMeta.scala > $tmp_dir/GpuBroadcastNestedLoopJoinMeta_new.out
sed -n '/class GpuBroadcastNestedLoopJoinMeta/,/override def convertToGpu/{/override def convertToGpu/!p}' ../sql-plugin/src/main/scala/org/apache/spark/sql/rapids/execution/GpuBroadcastNestedLoopJoinExec.scala > $tmp_dir/GpuBroadcastNestedLoopJoinMeta_old.out
diff $tmp_dir/GpuBroadcastNestedLoopJoinMeta_new.out $tmp_dir/GpuBroadcastNestedLoopJoinMeta_old.out > $tmp_dir/GpuBroadcastNestedLoopJoinMeta.newdiff
diff -c spark2diffs/GpuBroadcastNestedLoopJoinMeta.diff $tmp_dir/GpuBroadcastNestedLoopJoinMeta.newdiff
Review comment (Member) on the lines above:

Nit: Seems like we could factor out this code into a function parameterized by the class name, filename, and the sed pattern, and then simply build an array or string of inputs to a for loop, which would probably reduce the boilerplate and thus the file size quite a bit. It would also be nice to have the list ordered in some manner so it's easy to scan for a particular file/class. Not must-fix, IMHO.

@tgravescs (Collaborator, Author) commented:

Pulling in just DType might reduce the diffs quite a bit. Now that I know the diffs, I have some other ideas for integrating this with the sql-plugin code for the next release.

@tgravescs (Collaborator, Author) commented:

build

@sameerz linked an issue on Jan 19, 2022 that may be closed by this pull request

@tgravescs (Collaborator, Author) commented:

build

@tgravescs (Collaborator, Author) commented:

build

@tgravescs (Collaborator, Author) commented:

build

@tgravescs (Collaborator, Author) commented:

build

<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-jar-plugin</artifactId>
<configuration>
<classifier>${spark.version.classifier}</classifier>
Review comment (Member) on the maven-jar-plugin configuration above:

Do we want/need the classifier on this jar?

Reply (Collaborator, Author):

Yes, at least my intention was to leave it so it's clear this jar is for Spark 2.4. Also, the plan was that if we needed to support 2.3 for this release, it would just be a separate jar rather than doing an entire parallel-worlds-like setup again. If you have other thoughts, let me know.

@tgravescs merged commit 2f0ce0f into NVIDIA:branch-22.02 on Jan 20, 2022
@NvTimLiu linked an issue on Jan 26, 2022 that may be closed by this pull request
Labels: feature request (New feature or request)
Projects: None yet
Development: Successfully merging this pull request may close these issues:
  • [BUG] Release build with mvn option -P source-javadoc FAILED
  • [FEA] Add explain api for Spark 2.X
2 participants