Add support for Spark 2.x Explain Api #4529

Merged Jan 20, 2022 · 96 commits

Commits (changes shown from 95 commits)
8eb81f6
Add an explain only mode configuration to run plugin on CPU to get wh…
tgravescs Dec 1, 2021
1fca5ef
Updates
tgravescs Dec 1, 2021
f043f86
Don't allocate gpu or enable shuffle for explain only mode
tgravescs Dec 2, 2021
431426c
explain only mode check for rapids shuffle internal manager
tgravescs Dec 2, 2021
3541832
update doc
tgravescs Dec 2, 2021
6ec51d0
Change how we check explain with sql enabled
tgravescs Dec 3, 2021
21a79ca
Merge branch 'explainonlymode' of github.com:tgravescs/spark-rapids i…
tgravescs Dec 3, 2021
b8e9e26
update not work message:
tgravescs Dec 3, 2021
c15b786
Merge remote-tracking branch 'origin/branch-22.02' into explainonlymode
tgravescs Dec 3, 2021
ba3b355
fix spacing
tgravescs Dec 6, 2021
42aa80f
Update doc adding explain option
tgravescs Dec 6, 2021
278d96c
update docs
tgravescs Dec 6, 2021
2a3b228
Add explain only mode test to make sure runs on cpu
tgravescs Dec 6, 2021
3bdf4f2
add note about adaptive
tgravescs Dec 7, 2021
4b88718
get rest of broadcast shims
tgravescs Dec 7, 2021
8f9301a
Update logging of enabled and explain only mode
tgravescs Dec 7, 2021
d79c9a4
update config docs
tgravescs Dec 7, 2021
ca880f1
Add spark2 module that supports the Explain only api
tgravescs Jan 10, 2022
c961f12
copyrights
tgravescs Jan 10, 2022
1e5b1ab
imports
tgravescs Jan 10, 2022
b0f815c
Revert "imports"
tgravescs Jan 10, 2022
25b1a0a
imports
tgravescs Jan 10, 2022
7f5e4bb
cleanup imports
tgravescs Jan 10, 2022
5cc57dc
cleanup imports datasource
tgravescs Jan 10, 2022
d981107
remove cudf as dependency
tgravescs Jan 10, 2022
645c39b
upmerge joins to latest
tgravescs Jan 10, 2022
879c005
more upmerge and copyrights
tgravescs Jan 10, 2022
aed2d13
upmerge and diffed overrides
tgravescs Jan 10, 2022
352c090
shim Sequence not in 2.3:
tgravescs Jan 10, 2022
6e5ea9a
update copyrights and update RapidsConf
tgravescs Jan 10, 2022
bc4eede
finish copyright updates
tgravescs Jan 10, 2022
eda55f2
building 248 and 232
tgravescs Jan 10, 2022
ddddf35
add pom file
tgravescs Jan 10, 2022
c0346e8
fix style issues
tgravescs Jan 10, 2022
36396f9
fix style
tgravescs Jan 10, 2022
7282c0c
remove shim layers
tgravescs Jan 11, 2022
03610be
update pom description
tgravescs Jan 11, 2022
84fd72f
Add CheckOverflow
tgravescs Jan 11, 2022
70ff0bf
fix line length
tgravescs Jan 11, 2022
233b259
fix copyright
tgravescs Jan 11, 2022
6402631
Merge remote-tracking branch 'origin/branch-22.02' into explainonlymode
tgravescs Jan 12, 2022
13d7ee1
cleanup
tgravescs Jan 12, 2022
ab49521
Update to use the spark.rapids.sql.mode config
tgravescs Jan 12, 2022
ce38223
update config doc and formatting
tgravescs Jan 12, 2022
b1dd70e
update docs
tgravescs Jan 12, 2022
c94d732
fix typo
tgravescs Jan 12, 2022
0003e4e
update configs.md
tgravescs Jan 12, 2022
241f62c
change pom to 24x
tgravescs Jan 12, 2022
6ee3c0c
Merge remote-tracking branch 'origin/branch-22.02' into spark2shim-se…
tgravescs Jan 13, 2022
bb87652
Upmerge some files
tgravescs Jan 13, 2022
1340d9a
change name of jar
tgravescs Jan 13, 2022
77c82a2
Change to check for isSqlEnabled and the mode separately because we may
tgravescs Jan 13, 2022
c02dae5
fix spacing
tgravescs Jan 13, 2022
2afeb2b
update auto generated configs doc
tgravescs Jan 13, 2022
4c98c6e
Add ExplainPlan
tgravescs Jan 13, 2022
f5dc619
Merge remote-tracking branch 'tgravescs/explainonlymode' into spark2s…
tgravescs Jan 13, 2022
1eca823
Update docs for 2.x explain api
tgravescs Jan 13, 2022
ac9049a
cleanup
tgravescs Jan 13, 2022
6c06b31
update build scripts to build 24x
tgravescs Jan 13, 2022
1dc83df
remove unneeded tag
tgravescs Jan 13, 2022
48ae9fd
start diff scripts
tgravescs Jan 13, 2022
85088fa
checkpoint
tgravescs Jan 13, 2022
935c25e
checkpoint
tgravescs Jan 13, 2022
e7c7a0b
checkpoint
tgravescs Jan 13, 2022
cf55dc6
checkpoint
tgravescs Jan 14, 2022
359c306
checkpoint
tgravescs Jan 14, 2022
5b1d40a
checkpoint
tgravescs Jan 14, 2022
cbf62a5
add more
tgravescs Jan 14, 2022
126919a
checkpoint
tgravescs Jan 14, 2022
75edb2d
more diffs done
tgravescs Jan 14, 2022
0db3a3d
more diffs
tgravescs Jan 14, 2022
8bb8291
diff files
tgravescs Jan 14, 2022
b7aa8c7
split apart some functions for easier diff
tgravescs Jan 14, 2022
2dc5251
Overrides diffs and split apart for easier diffing
tgravescs Jan 14, 2022
8b5d720
more diffs and move
tgravescs Jan 14, 2022
8907fac
update script to tmp dir
tgravescs Jan 14, 2022
abd12b0
builds
tgravescs Jan 14, 2022
7437f30
Merge remote-tracking branch 'origin/branch-22.02' into spark2shim-se…
tgravescs Jan 14, 2022
8b40694
Upmerge pull changes
tgravescs Jan 14, 2022
de98f8b
update diff file
tgravescs Jan 14, 2022
9933cdd
Finish diffing the rest of execs, split one out to another file for
tgravescs Jan 14, 2022
82d5818
Merge remote-tracking branch 'origin/branch-22.02' into spark2shim-se…
tgravescs Jan 14, 2022
b912fc6
Update spark2 RapidsConf with changes in sql-plugin
tgravescs Jan 14, 2022
9116768
Update RapidsConf to deal with toBytes different 2.x
tgravescs Jan 14, 2022
52447c8
cleanup documentation
tgravescs Jan 14, 2022
f3c2636
Update docs/get-started/getting-started-workload-qualification.md
tgravescs Jan 18, 2022
f1c55d2
copyright, remove 2.3 from DateUtils, fix up some comments and remove
tgravescs Jan 18, 2022
3bfde90
Merge branch 'spark2shim-sep-module-upmerge' of github.com:tgravescs/…
tgravescs Jan 18, 2022
6f190a6
Update RapidsMeta comment and diff
tgravescs Jan 18, 2022
08b2490
Update 2.x to sql-plugin diffs
tgravescs Jan 18, 2022
b724245
add deploy of new jar for nightly
tgravescs Jan 19, 2022
33da75b
Remove unneeded dependencies from pom file, update NOTICE pulled, remove
tgravescs Jan 19, 2022
0bad0cc
remove python files
tgravescs Jan 19, 2022
f532d81
Merge remote-tracking branch 'origin/branch-22.02' into spark2shim-se…
tgravescs Jan 19, 2022
008d88b
Upmerged to latest sql-plugin code
tgravescs Jan 19, 2022
cc97619
update NOTICE copy and remove transient
tgravescs Jan 19, 2022
53 changes: 38 additions & 15 deletions docs/get-started/getting-started-workload-qualification.md
@@ -63,13 +63,8 @@ the other is to modify your existing Spark application code to call a function d

Please note that if you are using adaptive query execution in Spark, the explain output may not be exact,
as the plan can change at runtime in ways that we would not see by looking at just
the CPU plan.

### Requirements

- A Spark 3.x CPU cluster
- The `rapids-4-spark` and `cudf` [jars](../download.md)
- Ability to modify the existing Spark application code if using the function call directly
the CPU plan. The same applies if you are using an older version of Spark: planning may be
slightly different when you move up to a newer version of Spark.

### Using the Configuration Flag for Explain Only Mode

@@ -78,6 +73,13 @@ This mode allows you to run on a CPU cluster and can help you understand the potential GPU plan
and whether there are any unsupported features. It logs the same output as the driver logs with
`spark.rapids.sql.explain=all`.

#### Requirements

- A Spark 3.x CPU cluster
- The `rapids-4-spark` and `cudf` [jars](../download.md)

#### Usage

1. In `spark-shell`, add the `rapids-4-spark` and `cudf` jars to the `--jars` option or put them on the
Spark classpath, and enable the configs `spark.rapids.sql.mode=explainOnly` and
`spark.plugins=com.nvidia.spark.SQLPlugin`.
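
For example (a minimal sketch; jar paths and versions are placeholders):

```bash
spark-shell \
  --jars /PathTo/cudf-<version>.jar,/PathTo/rapids-4-spark_<version>.jar \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.rapids.sql.mode=explainOnly
```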
@@ -125,20 +127,41 @@ pretty accurate.

### How to use the Function Call

Starting with version 21.12 of the RAPIDS Accelerator, a new function named
`explainPotentialGpuPlan` is added which can help us understand the potential GPU plan and if there
are any unsupported features on a CPU cluster. Basically it can return output which is the same as
the driver logs with `spark.rapids.sql.explain=all`.
A function named `explainPotentialGpuPlan` is available that can help you understand the potential
GPU plan and whether there are any unsupported features on a CPU cluster. It returns the same output
as the driver logs with `spark.rapids.sql.explain=all`.
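
As a sketch of what the call looks like in `spark-shell` (the `ExplainPlan` object and the exact
signature are assumed from the "Add ExplainPlan" commit in this PR; verify against the docs for
your version):

```scala
import com.nvidia.spark.rapids.ExplainPlan

// Build any DataFrame; `spark` is the SparkSession that spark-shell provides.
val df = spark.range(1000).selectExpr("id", "id % 3 AS key").groupBy("key").count()

// Returns a string describing which operators would or would not run on the GPU,
// the same information the driver logs with spark.rapids.sql.explain=all.
val explanation: String = ExplainPlan.explainPotentialGpuPlan(df)
println(explanation)
```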

1. In `spark-shell`, add the `rapids-4-spark` and `cudf` jars into --jars option or put them in the
#### Requirements with Spark 3.x

- A Spark 3.x CPU cluster
- The `rapids-4-spark` and `cudf` [jars](../download.md)
- Ability to modify the existing Spark application code
- RAPIDS Accelerator for Apache Spark version 21.12 or newer

#### Requirements with Spark 2.4.x

- A Spark 2.4.x CPU cluster
- The `rapids-4-spark-sql-meta` [jar](../download.md)
- Ability to modify the existing Spark application code
- RAPIDS Accelerator for Apache Spark version 22.02 or newer

#### Usage

1. In `spark-shell`, add the necessary jars to the `--jars` option or put them on the
Spark classpath.

For example:
For example, on Spark 3.x:

```bash
spark-shell --jars /PathTo/cudf-<version>.jar,/PathTo/rapids-4-spark_<version>.jar
```

For example, on Spark 2.4.x:

```bash
spark-shell --jars /PathTo/rapids-4-spark-sql-meta-<version and classifier>.jar
```

2. Test whether the class can be loaded successfully.

```scala
// The block is collapsed in this diff view; the class name below is assumed from the
// "Add ExplainPlan" commit in this PR.
import com.nvidia.spark.rapids.ExplainPlan
```
3. Enable optional RAPIDS Accelerator related parameters based on your setup.

Enabling optional parameters may allow more operations to run on the GPU but please understand
the meaning and risk of above parameters before enabling it. Please refer to [configs
doc](../configs.md) for details of RAPIDS Accelerator parameters.
the meaning and risk of the above parameters before enabling them. Please refer to the
[configuration documentation](../configs.md) for details of the RAPIDS Accelerator parameters.

For example, if your jobs have `double`, `float` and `decimal` operators together with some Scala
UDFs, you can set the following parameters:
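
The concrete flags are collapsed in this diff view. A plausible set for the `double`/`float`/`decimal`
plus Scala UDF case (config names are an assumption based on RAPIDS Accelerator configs of this era;
verify each one against the [configs doc](../configs.md) before use):

```scala
// Hypothetical flag names for illustration; confirm each one in ../configs.md.
spark.conf.set("spark.rapids.sql.variableFloatAgg.enabled", "true")
spark.conf.set("spark.rapids.sql.decimalType.enabled", "true")
spark.conf.set("spark.rapids.sql.udfCompiler.enabled", "true")
```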
10 changes: 9 additions & 1 deletion jenkins/spark-nightly-build.sh
@@ -1,6 +1,6 @@
#!/bin/bash
#
# Copyright (c) 2020-2021, NVIDIA CORPORATION. All rights reserved.
# Copyright (c) 2020-2022, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -74,6 +74,14 @@ function distWithReducedPom {
$mvnExtaFlags
}

# build the Spark 2.x explain jar
mvn -B $MVN_URM_MIRROR -Dmaven.repo.local=$M2DIR -Dbuildver=24X clean install -DskipTests
[[ $SKIP_DEPLOY != 'true' ]] && \
mvn -B deploy $MVN_URM_MIRROR \
-Dmaven.repo.local=$M2DIR \
-DskipTests \
-Dbuildver=24X

# build, install, and deploy all the versions we support, but skip deploy of individual dist module since we
# only want the combined jar to be pushed.
# Note this does not run any integration tests
3 changes: 3 additions & 0 deletions jenkins/spark-premerge-build.sh
@@ -35,6 +35,9 @@ mvn_verify() {
# file size check for pull request. The size of a committed file should be less than 1.5MiB
pre-commit run check-added-large-files --from-ref $BASE_REF --to-ref HEAD

# build the Spark 2.x explain jar
env -u SPARK_HOME mvn -B $MVN_URM_MIRROR -Dbuildver=24X clean install -DskipTests

# build all the versions but only run unit tests on one 3.0.X version (base version covers this), one 3.1.X version, and one 3.2.X version.
# All other shims tests should be covered in nightly pipelines
env -u SPARK_HOME mvn -U -B $MVN_URM_MIRROR -Dbuildver=302 clean install -Drat.skip=true -DskipTests -Dmaven.javadoc.skip=true -Dskip -Dmaven.scalastyle.skip=true -Dcuda.version=$CUDA_CLASSIFIER -pl aggregator -am