Create a getting started on K8s page #1932

Merged 52 commits on Apr 7, 2021
Commits
f80cb6a
doing some test
viadea Mar 6, 2021
47fdca4
Revert "doing some test"
viadea Mar 6, 2021
ede56bd
Update download.md
viadea Mar 8, 2021
744bb42
Update download.md
viadea Mar 8, 2021
521921e
Merge branch 'branch-0.5' of github.com:viadea/spark-rapids into bran…
viadea Mar 8, 2021
4a82093
Revert "Update download.md"
viadea Mar 8, 2021
ca4d040
Merge remote-tracking branch 'upstream/branch-0.5' into branch-0.5
viadea Mar 8, 2021
e0ed7ad
Merge remote-tracking branch 'upstream/branch-0.5' into branch-0.5
viadea Mar 14, 2021
5287650
Create getting-started-kubernetes.md
viadea Mar 14, 2021
2f366f9
Fixed one typo
viadea Mar 14, 2021
5a474cb
Update docs/get-started/getting-started-kubernetes.md
viadea Mar 15, 2021
fb2f303
Update docs/get-started/getting-started-kubernetes.md
viadea Mar 15, 2021
a4dfc1a
Update docs/get-started/getting-started-kubernetes.md
viadea Mar 15, 2021
1cf99e4
Update docs/get-started/getting-started-kubernetes.md
viadea Mar 15, 2021
7de0029
Update docs/get-started/getting-started-kubernetes.md
viadea Mar 15, 2021
8fd70fd
changed nav_order to 6
viadea Mar 15, 2021
3604a79
Update docs/get-started/getting-started-kubernetes.md
viadea Mar 15, 2021
0c05ee1
Update docs/get-started/getting-started-kubernetes.md
viadea Mar 15, 2021
232b37b
Update docs/get-started/getting-started-kubernetes.md
viadea Mar 15, 2021
db67081
Update docs/get-started/getting-started-kubernetes.md
viadea Mar 15, 2021
464171c
Update docs/get-started/getting-started-kubernetes.md
viadea Mar 15, 2021
0b021bc
Update docs/get-started/getting-started-kubernetes.md
viadea Mar 15, 2021
46d3c95
Update docs/get-started/getting-started-kubernetes.md
viadea Mar 15, 2021
bb77cbd
Update docs/get-started/getting-started-kubernetes.md
viadea Mar 15, 2021
781c2d9
Update docs/get-started/getting-started-kubernetes.md
viadea Mar 15, 2021
3370353
Update docs/get-started/getting-started-kubernetes.md
viadea Mar 15, 2021
2d1d3e9
Update docs/get-started/getting-started-kubernetes.md
viadea Mar 15, 2021
6a8af96
Update docs/get-started/getting-started-kubernetes.md
viadea Mar 15, 2021
67b705e
Update docs/get-started/getting-started-kubernetes.md
viadea Mar 15, 2021
aa648d7
Update docs/get-started/getting-started-kubernetes.md
viadea Mar 15, 2021
7d268d6
create "To delete the Driver POD" section
viadea Mar 15, 2021
be4079c
Add a note.
viadea Mar 15, 2021
03a4bef
add spark.kubernetes.memoryOverheadFactor=0.6
viadea Mar 15, 2021
4b727d1
Changed to spark.executor.memoryOverhead=3G
viadea Mar 15, 2021
1140a76
Added a note to explain the jar location
viadea Mar 16, 2021
cda4933
Update docs/get-started/getting-started-kubernetes.md
viadea Mar 17, 2021
8934851
Update docs/get-started/getting-started-kubernetes.md
viadea Mar 17, 2021
dfbb390
Update docs/get-started/getting-started-kubernetes.md
viadea Mar 17, 2021
3b5df05
Update docs/get-started/getting-started-kubernetes.md
viadea Mar 17, 2021
978dcc5
Update docs/get-started/getting-started-kubernetes.md
viadea Mar 17, 2021
3602272
Update docs/get-started/getting-started-kubernetes.md
viadea Mar 17, 2021
bfd484a
Update docs/get-started/getting-started-kubernetes.md
viadea Mar 17, 2021
d687b4c
reword
viadea Mar 17, 2021
58e5833
Update docs/get-started/getting-started-kubernetes.md
viadea Mar 17, 2021
a071279
Update docs/get-started/getting-started-kubernetes.md
viadea Mar 17, 2021
9cebbac
Update docs/get-started/getting-started-kubernetes.md
viadea Mar 17, 2021
0ca16ce
Update docs/get-started/getting-started-kubernetes.md
viadea Mar 17, 2021
a8f4024
Update docs/get-started/getting-started-kubernetes.md
viadea Mar 17, 2021
09cacce
Update docs/get-started/getting-started-kubernetes.md
viadea Mar 17, 2021
7551175
Update docs/get-started/getting-started-kubernetes.md
viadea Mar 17, 2021
fc17731
Update docs/get-started/getting-started-kubernetes.md
viadea Mar 17, 2021
7c424a7
Update docs/get-started/getting-started-kubernetes.md
viadea Mar 17, 2021
267 changes: 267 additions & 0 deletions docs/get-started/getting-started-kubernetes.md
---
layout: page
title: Kubernetes
nav_order: 6
parent: Getting-Started
---

# Getting Started with RAPIDS and Kubernetes

This guide will run through how to set up the RAPIDS Accelerator for Apache Spark in a Kubernetes cluster.
At the end of this guide, the reader will be able to run a sample Apache Spark application that runs
on NVIDIA GPUs in a Kubernetes cluster.

This is a quick start guide which uses default settings that may be different from your cluster's configuration.

Kubernetes requires a Docker image to run Spark. Generally everything needed is in the Docker
image - Spark, the RAPIDS Accelerator for Spark jars, and the discovery script. See this
[Dockerfile.cuda](Dockerfile.cuda) example.


## Prerequisites
* A Kubernetes cluster that is up and running with NVIDIA GPU support
* Docker installed on a client machine
* A Docker repository that is accessible from the Kubernetes cluster

These instructions do not cover how to set up a Kubernetes cluster.

Please refer to [Install Kubernetes](https://docs.nvidia.com/datacenter/cloud-native/kubernetes/install-k8s.html) for
instructions on installing a Kubernetes cluster with NVIDIA GPU support.
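
As a quick sanity check (a sketch only; it assumes `kubectl` is already configured for your cluster, the NVIDIA
device plugin is installed, and `<gpu-node>` is a placeholder for one of your GPU node names), you can confirm
that the client tools work and that the cluster advertises GPU resources:
```shell
# Verify the client tools and that GPU resources are visible to Kubernetes.
docker --version
kubectl get nodes
kubectl describe node <gpu-node> | grep -i "nvidia.com/gpu"
```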

## Docker Image Preparation

On a client machine which has access to the Kubernetes cluster:

1. [Download Apache Spark](https://spark.apache.org/downloads.html).
Supported versions of Spark are listed on the [RAPIDS Accelerator download page](../download.md). Please note that only
Scala version 2.12 is currently supported by the accelerator.

Note that you can download these into a local directory and untar the Spark `.tar.gz` as a directory named `spark`.
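
For example, assuming you chose the Spark 3.0.2 release built for Hadoop 3.2 (the exact file name depends on the
version and package you selected on the download page), the download and untar steps might look like this:
```shell
# Download a supported Spark release and untar it into a directory named "spark".
wget https://archive.apache.org/dist/spark/spark-3.0.2/spark-3.0.2-bin-hadoop3.2.tgz
tar -xzf spark-3.0.2-bin-hadoop3.2.tgz
mv spark-3.0.2-bin-hadoop3.2 spark
```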

2. Download the [RAPIDS Accelerator for Spark jars](getting-started-on-prem.md#download-the-rapids-jars) and the
[GPU discovery script](getting-started-on-prem.md#install-the-gpu-discovery-script).

Put the two jars (`rapids-4-spark_<version>.jar` and `cudf-<version>.jar`) and `getGpusResources.sh` in the same directory as `spark`.
tgravescs marked this conversation as resolved.
Show resolved Hide resolved

Note: If you decide to put the above two jars in the `spark/jars` directory instead, they will be copied into the
`/opt/spark/jars` directory in the Docker image, and you will not need to specify
`spark.driver.extraClassPath` or `spark.executor.extraClassPath` when using `cluster` mode.
This example just shows one way to include customized or third-party jars.
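
For example (a sketch with placeholder download paths and versions; adjust to wherever you saved the files):
```shell
# Place the RAPIDS Accelerator jars and the discovery script next to the "spark" directory.
cp ~/Downloads/rapids-4-spark_<version>.jar ~/Downloads/cudf-<version>.jar ~/Downloads/getGpusResources.sh .

# Optional alternative: copy the jars into spark/jars instead, so they end up in /opt/spark/jars
# inside the image and no extraClassPath settings are needed in cluster mode.
# cp rapids-4-spark_<version>.jar cudf-<version>.jar spark/jars/
```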

3. Download the sample [Dockerfile.cuda](Dockerfile.cuda) in the same directory as `spark`.

The sample Dockerfile.cuda copies several sub-directories of the `spark` directory into `/opt/spark/`,
and copies the RAPIDS Accelerator jars and `getGpusResources.sh` into `/opt/sparkRapidsPlugin`
inside the Docker image.

Examine the Dockerfile.cuda file to ensure the file names are correct, and modify it if needed.

At this point the directory on the local machine should look like this:
```shell
$ ls
Dockerfile.cuda cudf-<version>.jar getGpusResources.sh rapids-4-spark_<version>.jar spark
```

4. Build the Docker image with a proper repository name and tag, and push it to the repository:
```shell
export IMAGE_NAME=xxx/yyy:tag
docker build . -f Dockerfile.cuda -t $IMAGE_NAME
docker push $IMAGE_NAME
```
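
Optionally, you can verify that the plugin files ended up where expected inside the built image (a sketch;
it simply lists the plugin directory baked into the image):
```shell
# List the RAPIDS Accelerator files inside the image.
docker run --rm --entrypoint ls $IMAGE_NAME /opt/sparkRapidsPlugin
```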

## Running Spark Applications in the Kubernetes Cluster

### Submitting a Simple Test Job

This simple job tests whether the RAPIDS plugin can be found.
`ClassNotFoundException` is a common error if the Spark driver cannot
find the RAPIDS Accelerator jar, resulting in an exception like this:
```
Exception in thread "main" java.lang.ClassNotFoundException: com.nvidia.spark.SQLPlugin
```

Here is an example job:

```shell
export SPARK_HOME=~/spark
export IMAGE_NAME=xxx/yyy:tag
export K8SMASTER=k8s://https://<k8s-apiserver-host>:<k8s-apiserver-port>
export SPARK_NAMESPACE=default
export SPARK_DRIVER_NAME=exampledriver
$SPARK_HOME/bin/spark-submit \
--master $K8SMASTER \
--deploy-mode cluster \
--name examplejob \
--class org.apache.spark.examples.SparkPi \
--conf spark.executor.instances=1 \
--conf spark.executor.resource.gpu.amount=1 \
--conf spark.executor.memory=4G \
--conf spark.executor.cores=1 \
--conf spark.task.cpus=1 \
--conf spark.task.resource.gpu.amount=1 \
--conf spark.rapids.memory.pinnedPool.size=2G \
--conf spark.executor.memoryOverhead=3G \
--conf spark.locality.wait=0s \
--conf spark.sql.files.maxPartitionBytes=512m \
--conf spark.sql.shuffle.partitions=10 \
--conf spark.plugins=com.nvidia.spark.SQLPlugin \
--conf spark.kubernetes.namespace=$SPARK_NAMESPACE \
--conf spark.kubernetes.driver.pod.name=$SPARK_DRIVER_NAME \
--conf spark.executor.resource.gpu.discoveryScript=/opt/sparkRapidsPlugin/getGpusResources.sh \
--conf spark.executor.resource.gpu.vendor=nvidia.com \
--conf spark.kubernetes.container.image=$IMAGE_NAME \
--conf spark.executor.extraClassPath=/opt/sparkRapidsPlugin/rapids-4-spark_<version>.jar:/opt/sparkRapidsPlugin/cudf-<version>.jar \
--conf spark.driver.extraClassPath=/opt/sparkRapidsPlugin/rapids-4-spark_<version>.jar:/opt/sparkRapidsPlugin/cudf-<version>.jar \
--driver-memory 2G \
local:///opt/spark/examples/jars/spark-examples_2.12-3.0.2.jar
```

Note: `local://` means the jar file location is inside the Docker image.
Since this is `cluster` mode, the Spark driver is running inside a pod in Kubernetes.
The driver and executor pods can be seen when the job is running:
```shell
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
spark-pi-d11075782f399fd7-exec-1 1/1 Running 0 9s
exampledriver 1/1 Running 0 15s
```

To view the Spark driver log, use the following command:
```shell
kubectl logs $SPARK_DRIVER_NAME
```

To view the Spark driver UI while the job is running, first expose the driver UI port:
```shell
kubectl port-forward $SPARK_DRIVER_NAME 4040:4040
```
Then open a web browser to the Spark driver UI page on the exposed port:
```shell
http://localhost:4040
```

To kill the Spark job:
```shell
$SPARK_HOME/bin/spark-submit --kill spark:$SPARK_DRIVER_NAME
```

To delete the driver pod:
```shell
kubectl delete pod $SPARK_DRIVER_NAME
```

### Running an Interactive Spark Shell

If you need an interactive Spark shell with executor pods running inside the Kubernetes cluster:
```shell
$SPARK_HOME/bin/spark-shell \
--master $K8SMASTER \
--name mysparkshell \
--deploy-mode client \
--conf spark.executor.instances=1 \
--conf spark.executor.resource.gpu.amount=1 \
--conf spark.executor.memory=4G \
--conf spark.executor.cores=1 \
--conf spark.task.cpus=1 \
--conf spark.task.resource.gpu.amount=1 \
--conf spark.rapids.memory.pinnedPool.size=2G \
--conf spark.executor.memoryOverhead=3G \
--conf spark.locality.wait=0s \
--conf spark.sql.files.maxPartitionBytes=512m \
--conf spark.sql.shuffle.partitions=10 \
--conf spark.plugins=com.nvidia.spark.SQLPlugin \
--conf spark.kubernetes.namespace=$SPARK_NAMESPACE \
--conf spark.executor.resource.gpu.discoveryScript=/opt/sparkRapidsPlugin/getGpusResources.sh \
--conf spark.executor.resource.gpu.vendor=nvidia.com \
--conf spark.kubernetes.container.image=$IMAGE_NAME \
--conf spark.executor.extraClassPath=/opt/sparkRapidsPlugin/rapids-4-spark_<version>.jar:/opt/sparkRapidsPlugin/cudf-<version>.jar \
--driver-class-path=./cudf-<version>.jar:./rapids-4-spark_<version>.jar \
--driver-memory 2G
```

Only the `client` deploy mode should be used. If you specify the `cluster` deploy mode, you will see the following error:
```shell
Cluster deploy mode is not applicable to Spark shells.
```
Also notice that `--conf spark.driver.extraClassPath` was removed and `--driver-class-path` was added instead.
This is because the driver now runs on the client machine, so the jar paths must be local filesystem paths.

When running the shell, you can see that only the executor pods are running inside Kubernetes:
```
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
mysparkshell-bfe52e782f44841c-exec-1 1/1 Running 0 11s
```

The following Scala code can be run in the Spark shell to test if the RAPIDS Accelerator is enabled.
```scala
val df = spark.sparkContext.parallelize(Seq(1)).toDF()
df.createOrReplaceTempView("df")
spark.sql("SELECT value FROM df WHERE value <>1").show
spark.sql("SELECT value FROM df WHERE value <>1").explain
:quit
```
The expected `explain` plan should contain the GPU-related operators:
```shell
scala> spark.sql("SELECT value FROM df WHERE value <>1").explain
== Physical Plan ==
GpuColumnarToRow false
+- GpuFilter NOT (value#2 = 1)
+- GpuRowToColumnar TargetSize(2147483647)
+- *(1) SerializeFromObject [input[0, int, false] AS value#2]
+- Scan[obj#1]
```

### Running PySpark in Client Mode

Of course, you can `COPY` the Python code into the Docker image when building it
and submit the job using the `cluster` deploy mode, as shown in the previous SparkPi example.
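
For that in-image approach, a minimal sketch might look like the following. It assumes a hypothetical
`COPY test.py /opt/sparkRapidsPlugin/` line was added to Dockerfile.cuda before building the image; the GPU,
memory, and classpath settings are the same as in the SparkPi example above and are omitted here for brevity:
```shell
$SPARK_HOME/bin/spark-submit \
--master $K8SMASTER \
--deploy-mode cluster \
--name mypythonjob \
--conf spark.plugins=com.nvidia.spark.SQLPlugin \
--conf spark.kubernetes.namespace=$SPARK_NAMESPACE \
--conf spark.kubernetes.container.image=$IMAGE_NAME \
local:///opt/sparkRapidsPlugin/test.py
```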

However, if you do not want to rebuild the Docker image each time and just want to submit the Python code
from the client machine, you can use the `client` deploy mode.

```shell
$SPARK_HOME/bin/spark-submit \
--master $K8SMASTER \
--deploy-mode client \
--name mypythonjob \
--conf spark.executor.instances=1 \
--conf spark.executor.resource.gpu.amount=1 \
--conf spark.executor.memory=4G \
--conf spark.executor.cores=1 \
--conf spark.task.cpus=1 \
--conf spark.task.resource.gpu.amount=1 \
--conf spark.rapids.memory.pinnedPool.size=2G \
--conf spark.executor.memoryOverhead=3G \
--conf spark.locality.wait=0s \
--conf spark.sql.files.maxPartitionBytes=512m \
--conf spark.sql.shuffle.partitions=10 \
--conf spark.plugins=com.nvidia.spark.SQLPlugin \
--conf spark.kubernetes.namespace=$SPARK_NAMESPACE \
--conf spark.executor.resource.gpu.discoveryScript=/opt/sparkRapidsPlugin/getGpusResources.sh \
--conf spark.executor.resource.gpu.vendor=nvidia.com \
--conf spark.kubernetes.container.image=$IMAGE_NAME \
--conf spark.executor.extraClassPath=/opt/sparkRapidsPlugin/rapids-4-spark_<version>.jar:/opt/sparkRapidsPlugin/cudf-<version>.jar \
--driver-memory 2G \
--driver-class-path=./cudf-<version>.jar:./rapids-4-spark_<version>.jar \
test.py
```

A sample `test.py` is shown below:
```python
from pyspark.sql import SQLContext
from pyspark import SparkConf
from pyspark import SparkContext
conf = SparkConf()
sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)
df=sqlContext.createDataFrame([1,2,3], "int").toDF("value")
df.createOrReplaceTempView("df")
sqlContext.sql("SELECT * FROM df WHERE value<>1").explain()
sqlContext.sql("SELECT * FROM df WHERE value<>1").show()
sc.stop()
```
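
As with the Scala example, the `explain()` output should contain GPU operators such as `GpuFilter` when the
RAPIDS Accelerator is active.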


Please refer to [Running Spark on Kubernetes](https://spark.apache.org/docs/latest/running-on-kubernetes.html) for more information.