
Add support for Spark 2.x Explain Api #4529

Merged: 96 commits merged into NVIDIA:branch-22.02 on Jan 20, 2022

Conversation

@tgravescs (Collaborator) commented on Jan 13, 2022:

contributes to #4360

This PR adds a new module that copies much of the code from sql-plugin but keeps only the tagging functionality; all convert functions and dependencies on cudf were removed. This allows running a single jar (rapids-4-spark-sql-meta_2.11-22.02.0-SNAPSHOT-spark24.jar) against Spark 2.x and calling the explain API, val output = com.nvidia.spark.rapids.ExplainPlan.explainPotentialGpuPlan(mydf), to get an idea of how the user's application may run on the GPU. It also tells us if the plugin doesn't support some functions the user's application is using.
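
For illustration, a minimal spark-shell sketch of how the explain API might be invoked (the input path and query are hypothetical placeholders; only ExplainPlan.explainPotentialGpuPlan is the API added by this PR):

    // Start a Spark 2.4 shell with only the meta jar on the classpath, e.g.:
    //   spark-shell --jars rapids-4-spark-sql-meta_2.11-22.02.0-SNAPSHOT-spark24.jar
    import com.nvidia.spark.rapids.ExplainPlan

    // Hypothetical input and query; any existing DataFrame works here
    val mydf = spark.read.parquet("/data/events.parquet")
      .groupBy("category")
      .count()

    // Produces a human-readable report of which operators/expressions could run on the GPU
    val output = ExplainPlan.explainPotentialGpuPlan(mydf)
    println(output)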

Unfortunately I was unable to find a good way to split the main sql-plugin code into tagging and convert pieces before 22.02, which is why this is a separate module with a lot of copied code. In the next release we will look at commonizing the code.

Note I named the jar rapids-4-spark-sql-meta_2.11-22.02.0-SNAPSHOT-spark24.jar, thinking that if we are able to commonize the code the rapids-4-spark-sql-meta name could apply there as well. If people have other ideas please let me know.
In testing, this one jar worked for all the Spark 2.4.x versions I tried. It even works against CDH, which ships its own 2.4.x jar version, though I only tested one CDH version. Note that management said we only have to support Spark 2.4 and newer.
Also note that Spark 2.4 builds with Scala 2.11 by default, so that is what I use here. Spark 2.4 optionally supports 2.12, but I haven't supported that here. Scala 2.11 doesn't support -Ywarn-unused:imports, so I had to copy some build code into the module pom file.

Since it's completely separate code, it won't break if someone changes the sql-plugin in an incompatible way, so I don't have it built by default. It can be built in two ways:

  • top level run: mvn clean -Dbuildver=24X package -DskipTests
  • or cd spark2-sql-plugin and run: mvn clean package -DskipTests

I tried to add comments about what 2.x supports in various places to make diffing the code easier. It's also possible I missed some unused imports, as IntelliJ doesn't seem to import that module properly.

All testing is manual right now. Testing involved running all the NDS queries with Spark 2.4.x, getting the explain output, and comparing it to the explain output from a Spark 3.x run. There were very few differences, and the diffs were all because Spark 2.x doesn't handle the exact same functions the same way in Catalyst. I also manually hacked the integration tests to run the explain and check for exceptions. That only partly worked: lots of tests failed because they rely on features not present in 2.x. I manually went through a bunch of the output, but want to make another pass and get rid of things that aren't supported.

Some things Spark 2 doesn't support or handles differently (I'm sure there are things I'm missing):

  1. MapInPandasExec, Acosh, Asinh, Atanh, GetTimestamp, IntegralDivide, MapEntries, TransformValues, NormalizeNaNAndZero, KnownFloatingPointNormalized, DateAddInterval, TransformKeys
  2. Adaptive Execution is not supported in 2.x
  3. 2.x doesn't support the ANSI mode features
  4. SortMergeJoinExec optimizes differently, which causes some diffs in plans
  5. Aggregates don't support TypedImperativeAggregate, which means GpuObjectHashAggregateExecMeta is not supported
  6. There are some class differences, such as BaseAggregateExec being missing
  7. DataSource V2 is completely different
  8. 3.x switched to the Proleptic Gregorian calendar for file reading and 2.x did not (https://issues.apache.org/jira/browse/SPARK-31404)

Once this is merged, people will have to update the meta information in two places until we can get it commonized. I'm sure I will have to make another pass to get things up to date after this PR goes in.

If all this goes in, I will have to modify the deploy scripts to release the jar.

I added a script that attempts to diff all the classes and functions added here. It's in the scripts directory, along with a scripts/spark2diffs directory that holds the diff files used by that script. I used the script while upmerging to the latest, and it worked in the sense that it found a diff, which I manually resolved. I may try to run the script in premerge, but I can do that in a separate PR.

@tgravescs (Collaborator, Author) commented:

Upmerged to the latest again to pull in the 3.x explain changes, and also picked up another config change. This is ready; nothing else needs updating at the moment.

@tgravescs (Collaborator, Author) commented:

build

@jlowe (Member) left a comment:

I'm through almost all the diffs but skipped the huge GpuOverrides changes for now; posting what I have so far. I'm wondering if this would be a smaller delta from the baseline if we had the spark2 jar pull in the few cudf classes it needs, or even encapsulate/require the cudf jar. We already do not require CUDA or a GPU on the driver, so even if we pull in cudf classes during planning it shouldn't try to load native code or find CUDA or a GPU. That might simplify things quite a bit, since we wouldn't have to hack out anything that might touch a cudf class (even one as benign as DType).

Comment on lines +38 to +41
# Extract the GpuBroadcastNestedLoopJoinMeta class body (up to, but not including, convertToGpu)
# from the spark2 copy and from the original sql-plugin source, diff the two, and compare that
# diff against the expected diff checked in under spark2diffs
sed -n '/class GpuBroadcastNestedLoopJoinMeta/,/override def convertToGpu/{/override def convertToGpu/!p}' ../spark2-sql-plugin/src/main/scala/org/apache/spark/sql/rapids/execution/GpuBroadcastNestedLoopJoinMeta.scala > $tmp_dir/GpuBroadcastNestedLoopJoinMeta_new.out
sed -n '/class GpuBroadcastNestedLoopJoinMeta/,/override def convertToGpu/{/override def convertToGpu/!p}' ../sql-plugin/src/main/scala/org/apache/spark/sql/rapids/execution/GpuBroadcastNestedLoopJoinExec.scala > $tmp_dir/GpuBroadcastNestedLoopJoinMeta_old.out
diff $tmp_dir/GpuBroadcastNestedLoopJoinMeta_new.out $tmp_dir/GpuBroadcastNestedLoopJoinMeta_old.out > $tmp_dir/GpuBroadcastNestedLoopJoinMeta.newdiff
diff -c spark2diffs/GpuBroadcastNestedLoopJoinMeta.diff $tmp_dir/GpuBroadcastNestedLoopJoinMeta.newdiff
Review comment (Member) on the lines above:

Nit: Seems like we could factor out this code into a function parameterized by the class name, filename, and the sed pattern, and then simply build an array or string of inputs to a for loop, which would probably reduce the boilerplate and thus the file size quite a bit. It would also be nice to have the list ordered in some manner so it's easy to scan for a particular file/class. Not must-fix, IMHO.

@tgravescs (Collaborator, Author) commented:

Pulling in just DType might reduce the diffs quite a bit. Now that I know the diffs, I have some other ideas for integrating this with the sql-plugin code for the next release.

@tgravescs (Collaborator, Author) commented:

build

@sameerz linked an issue on Jan 19, 2022 that may be closed by this pull request

@tgravescs (Collaborator, Author) commented:

build

@tgravescs (Collaborator, Author) commented:

build

@tgravescs (Collaborator, Author) commented:

build

@tgravescs (Collaborator, Author) commented:

build

<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-jar-plugin</artifactId>
<configuration>
<classifier>${spark.version.classifier}</classifier>
Review comment (Member) on the maven-jar-plugin configuration above:

Do we want/need the classifier on this jar?

Reply (Collaborator, Author):

Yes, at least my intention was to leave it so it's clear this jar is for Spark 2.4. Also, the plan was that if we needed to support 2.3 for this release, it would just be a separate jar rather than doing an entire parallel-worlds-like setup again. If you have other thoughts, let me know.

@tgravescs merged commit 2f0ce0f into NVIDIA:branch-22.02 on Jan 20, 2022
@NvTimLiu linked an issue on Jan 26, 2022 that may be closed by this pull request
Labels: feature request (New feature or request)
Projects: None yet
Development: Successfully merging this pull request may close these issues:
  • [BUG] Release build with mvn option -P source-javadoc FAILED
  • [FEA] Add explain api for Spark 2.X
2 participants