Documentation changes for 0.2 release #637

Merged
merged 10 commits on Sep 11, 2020
14 changes: 7 additions & 7 deletions docs/FAQ.md
@@ -19,10 +19,10 @@ shows stale results.

### What versions of Apache Spark does the RAPIDS Accelerator for Apache Spark support?

The RAPIDS Accelerator for Apache Spark requires version 3.0.0 of Apache Spark. Because the plugin
replaces parts of the physical plan that Apache Spark considers to be internal the code for those
plans can change even between bug fix releases. As a part of our process, we try to stay on top of
these changes and release updates as quickly as possible.
The RAPIDS Accelerator for Apache Spark requires version 3.0.0 or 3.0.1 of Apache Spark. Because the
plugin replaces parts of the physical plan that Apache Spark considers to be internal, the code for
those plans can change even between bug-fix releases. As a part of our process, we try to stay on
top of these changes and release updates as quickly as possible.
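
For example, one quick, non-authoritative way to confirm which Spark version a cluster is running
(assuming `SPARK_HOME` points at the installation) is:

```shell
# Prints the Spark version banner; the reported version should be 3.0.0 or 3.0.1
${SPARK_HOME}/bin/spark-submit --version
```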

### Which distributions are supported?

@@ -41,9 +41,9 @@ Reference architectures should be available around Q4 2020.

### What CUDA versions are supported?

CUDA 10.1 and 10.2 are currently supported, but you need to download the cudf jar that corresponds
to the version you are using. Please look [here][version/stable-release.md] for download links
for the stable release.
CUDA 10.1, 10.2 and 11.0 are currently supported, but you need to download the cudf jar that
corresponds to the version you are using. Please look [here](version/stable-release.md) for download
links for the stable release.
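
As an illustration, one way to check which CUDA version a node has before picking a cudf jar
(output details vary by driver and toolkit version) is:

```shell
# The driver-reported CUDA version appears in the nvidia-smi banner
nvidia-smi
# If the CUDA toolkit is installed, nvcc reports the toolkit version
/usr/local/cuda/bin/nvcc --version
```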

### What parts of Apache Spark are accelerated?

2 changes: 1 addition & 1 deletion docs/configs.md
@@ -10,7 +10,7 @@ The following is the list of options that `rapids-plugin-4-spark` supports.
On startup use: `--conf [conf key]=[conf value]`. For example:

```
${SPARK_HOME}/bin/spark --jars 'rapids-4-spark_2.12-0.2.0-SNAPSHOT.jar,cudf-0.15-cuda10-1.jar' \
${SPARK_HOME}/bin/spark --jars 'rapids-4-spark_2.12-0.2.0.jar,cudf-0.15-cuda10-1.jar' \
--conf spark.plugins=com.nvidia.spark.SQLPlugin \
--conf spark.rapids.sql.incompatibleOps.enabled=true
```
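
The same settings can instead be placed once in `spark-defaults.conf`; a minimal sketch, assuming
the jars live in `/opt/sparkRapidsPlugin`:

```shell
cat >> ${SPARK_HOME}/conf/spark-defaults.conf <<'EOF'
spark.plugins                             com.nvidia.spark.SQLPlugin
spark.rapids.sql.incompatibleOps.enabled  true
spark.jars                                /opt/sparkRapidsPlugin/rapids-4-spark_2.12-0.2.0.jar,/opt/sparkRapidsPlugin/cudf-0.15-cuda10-1.jar
EOF
```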
2 changes: 1 addition & 1 deletion docs/demo/Databricks/generate-init-script.ipynb
@@ -1 +1 @@
{"cells":[{"cell_type":"code","source":["dbutils.fs.mkdirs(\"dbfs:/databricks/init_scripts/\")\n \ndbutils.fs.put(\"/databricks/init_scripts/init.sh\",\"\"\"\n#!/bin/bash\nsudo wget -O /databricks/jars/rapids-4-spark_2.12-0.1.0-databricks.jar https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/0.1.0-databricks/rapids-4-spark_2.12-0.1.0-databricks.jar\nsudo wget -O /databricks/jars/cudf-0.14-cuda10-1.jar https://repo1.maven.org/maven2/ai/rapids/cudf/0.14/cudf-0.14-cuda10-1.jar\"\"\", True)"],"metadata":{},"outputs":[],"execution_count":1},{"cell_type":"code","source":["%sh\ncd ../../dbfs/databricks/init_scripts\npwd\nls -ltr\ncat init.sh"],"metadata":{},"outputs":[],"execution_count":2},{"cell_type":"code","source":[""],"metadata":{},"outputs":[],"execution_count":3}],"metadata":{"name":"generate-init-script","notebookId":2645746662301564},"nbformat":4,"nbformat_minor":0}
{"cells":[{"cell_type":"code","source":["dbutils.fs.mkdirs(\"dbfs:/databricks/init_scripts/\")\n \ndbutils.fs.put(\"/databricks/init_scripts/init.sh\",\"\"\"\n#!/bin/bash\nsudo wget -O /databricks/jars/rapids-4-spark_2.12-0.2.0-databricks.jar https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/0.2.0-databricks/rapids-4-spark_2.12-0.2.0-databricks.jar\nsudo wget -O /databricks/jars/cudf-0.15-cuda10-1.jar https://repo1.maven.org/maven2/ai/rapids/cudf/0.15/cudf-0.15-cuda10-1.jar\"\"\", True)"],"metadata":{},"outputs":[],"execution_count":1},{"cell_type":"code","source":["%sh\ncd ../../dbfs/databricks/init_scripts\npwd\nls -ltr\ncat init.sh"],"metadata":{},"outputs":[],"execution_count":2},{"cell_type":"code","source":[""],"metadata":{},"outputs":[],"execution_count":3}],"metadata":{"name":"generate-init-script","notebookId":2645746662301564},"nbformat":4,"nbformat_minor":0}
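
For readability, the init script that the updated notebook cell writes to
`/databricks/init_scripts/init.sh` is equivalent to the following shell (same jar versions and
Maven URLs as in the cell above):

```shell
#!/bin/bash
# Download the RAPIDS Accelerator and cudf jars into the Databricks jars directory
sudo wget -O /databricks/jars/rapids-4-spark_2.12-0.2.0-databricks.jar \
  https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/0.2.0-databricks/rapids-4-spark_2.12-0.2.0-databricks.jar
sudo wget -O /databricks/jars/cudf-0.15-cuda10-1.jar \
  https://repo1.maven.org/maven2/ai/rapids/cudf/0.15/cudf-0.15-cuda10-1.jar
```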
88 changes: 88 additions & 0 deletions docs/get-started/Dockerfile.cuda
@@ -0,0 +1,88 @@
#
# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

FROM nvidia/cuda:10.1-devel-ubuntu18.04
ARG spark_uid=185

# Install java dependencies
RUN apt-get update && apt-get install -y --no-install-recommends openjdk-8-jdk openjdk-8-jre
ENV JAVA_HOME /usr/lib/jvm/java-1.8.0-openjdk-amd64
ENV PATH $PATH:/usr/lib/jvm/java-1.8.0-openjdk-amd64/jre/bin:/usr/lib/jvm/java-1.8.0-openjdk-amd64/bin

# Before building the docker image, first either download Apache Spark 3.0+ from
# http://spark.apache.org/downloads.html or build and make a Spark distribution following
# the instructions in http://spark.apache.org/docs/3.0.1/building-spark.html (3.0.0 can
# be used as well).
# If this docker file is being used in the context of building your images from a Spark
# distribution, the docker build command should be invoked from the top level directory
# of the Spark distribution. E.g.:
# docker build -t spark:3.0.1 -f kubernetes/dockerfiles/spark/Dockerfile .

RUN set -ex && \
ln -s /lib /lib64 && \
mkdir -p /opt/spark && \
mkdir -p /opt/spark/jars && \
mkdir -p /opt/tpch && \
mkdir -p /opt/spark/examples && \
mkdir -p /opt/spark/work-dir && \
mkdir -p /opt/sparkRapidsPlugin && \
touch /opt/spark/RELEASE && \
rm /bin/sh && \
ln -sv /bin/bash /bin/sh && \
echo "auth required pam_wheel.so use_uid" >> /etc/pam.d/su && \
chgrp root /etc/passwd && chmod ug+rw /etc/passwd

COPY spark-3.0.1-bin-hadoop3.2/jars /opt/spark/jars
COPY spark-3.0.1-bin-hadoop3.2/bin /opt/spark/bin
COPY spark-3.0.1-bin-hadoop3.2/sbin /opt/spark/sbin
COPY spark-3.0.1-bin-hadoop3.2/kubernetes/dockerfiles/spark/entrypoint.sh /opt/
COPY spark-3.0.1-bin-hadoop3.2/examples /opt/spark/examples
COPY spark-3.0.1-bin-hadoop3.2/kubernetes/tests /opt/spark/tests
COPY spark-3.0.1-bin-hadoop3.2/data /opt/spark/data

COPY cudf-0.15-cuda10-1.jar /opt/sparkRapidsPlugin
COPY rapids-4-spark_2.12-0.2.0.jar /opt/sparkRapidsPlugin
COPY getGpusResources.sh /opt/sparkRapidsPlugin

RUN mkdir /opt/spark/python
# TODO: Investigate running both pip and pip3 via virtualenvs
RUN apt-get update && \
apt install -y python python-pip && \
apt install -y python3 python3-pip && \
# Remove ensurepip since pip is already installed on the image and
# ensurepip just takes up about 1.6MB of space
rm -r /usr/lib/python*/ensurepip && \
pip install --upgrade pip setuptools && \
# python3 packages can be installed using pip3
# Remove the pip .cache to save space
rm -r /root/.cache && rm -rf /var/cache/apt/*

COPY spark-3.0.1-bin-hadoop3.2/python/pyspark /opt/spark/python/pyspark
COPY spark-3.0.1-bin-hadoop3.2/python/lib /opt/spark/python/lib

ENV SPARK_HOME /opt/spark

WORKDIR /opt/spark/work-dir
RUN chmod g+w /opt/spark/work-dir

ENV TINI_VERSION v0.18.0
ADD https://github.com/krallin/tini/releases/download/${TINI_VERSION}/tini /usr/bin/tini
RUN chmod +rx /usr/bin/tini

ENTRYPOINT [ "/opt/entrypoint.sh" ]

# Specify the User that the actual main process will run as
USER ${spark_uid}
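
As a sketch, this Dockerfile might be built and published as follows; the image name and registry
are placeholders, and the command assumes the Spark distribution, the plugin and cudf jars, and
`getGpusResources.sh` sit next to the Dockerfile:

```shell
# Build from the directory containing the Spark distribution and the jars copied above
docker build -t <registry>/spark-rapids:0.2.0-cuda10-1 -f Dockerfile.cuda .
# Push so that Kubernetes nodes can pull the image
docker push <registry>/spark-rapids:0.2.0-cuda10-1
```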
2 changes: 1 addition & 1 deletion docs/get-started/getting-started-gcp.md
@@ -66,7 +66,7 @@ To use notebooks with a Dataproc cluster, click on the cluster name under the Da

![Dataproc Web Interfaces](../img/dataproc-service.png)

The notebook will first transcode CSV files into Parquet files and then run an ETL query to prepare the dataset for training. In the sample notebook, we use 2016 data as the evaluation set and the rest as a training set, saving to respective GCS locations. Using the default notebook configuration the first stage should take ~110 seconds (1/3 of CPU execution time with same config) and the second stage takes ~170 seconds (1/7 of CPU execution time with same config). The notebook depends on the pre-compiled [Spark RAPIDS SQL plugin](https://mvnrepository.com/artifact/com.nvidia/rapids-4-spark-parent) and [cuDF](https://mvnrepository.com/artifact/ai.rapids/cudf/0.14), which are pre-downloaded by the GCP Dataproc [RAPIDS init script]().
The notebook will first transcode CSV files into Parquet files and then run an ETL query to prepare the dataset for training. In the sample notebook, we use 2016 data as the evaluation set and the rest as a training set, saving to respective GCS locations. Using the default notebook configuration the first stage should take ~110 seconds (1/3 of CPU execution time with same config) and the second stage takes ~170 seconds (1/7 of CPU execution time with same config). The notebook depends on the pre-compiled [Spark RAPIDS SQL plugin](https://mvnrepository.com/artifact/com.nvidia/rapids-4-spark) and [cuDF](https://mvnrepository.com/artifact/ai.rapids/cudf/0.15), which are pre-downloaded by the GCP Dataproc [RAPIDS init script](https://github.com/GoogleCloudDataproc/initialization-actions/tree/master/rapids).

Once data is prepared, we use the [Mortgage XGBoost4j Scala Notebook](../demo/GCP/mortgage-xgboost4j-gpu-scala.zpln) in Dataproc's Zeppelin service to execute the training job on the GPU. NVIDIA also ships [Spark XGBoost4j](https://github.com/NVIDIA/spark-xgboost) which is based on [DMLC xgboost](https://github.com/dmlc/xgboost). Precompiled [XGBoost4j](https://repo1.maven.org/maven2/com/nvidia/xgboost4j_3.0/) and [XGBoost4j Spark](https://repo1.maven.org/maven2/com/nvidia/xgboost4j-spark_3.0/1.0.0-0.1.0/) libraries can be downloaded from Maven. They are pre-downloaded by the GCP [RAPIDS init action](https://github.com/GoogleCloudDataproc/initialization-actions/tree/master/rapids). Since GitHub cannot render a Zeppelin notebook, we prepared a [Jupyter Notebook with Scala code](../demo/GCP/mortgage-xgboost4j-gpu-scala.ipynb) for you to view the code content.

24 changes: 15 additions & 9 deletions docs/get-started/getting-started-on-prem.md
@@ -37,7 +37,8 @@ to read the deployment method sections before doing any installations.

## Install Spark
To install Apache Spark please follow the official
[instructions](https://spark.apache.org/docs/latest/#launching-on-a-cluster). Please note that only
[instructions](https://spark.apache.org/docs/latest/#launching-on-a-cluster). Supported versions of
Spark are listed on the [stable release](stable-release.md) page. Please note that only
scala version 2.12 is currently supported by the accelerator.
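
As an illustration (not part of the official instructions), a Spark 3.0.1 distribution can be
fetched and unpacked roughly as follows; the URL follows the Apache archive layout, so substitute a
mirror or a different Hadoop build as needed:

```shell
wget https://archive.apache.org/dist/spark/spark-3.0.1/spark-3.0.1-bin-hadoop3.2.tgz
tar -xzf spark-3.0.1-bin-hadoop3.2.tgz
export SPARK_HOME="$(pwd)/spark-3.0.1-bin-hadoop3.2"
```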

## Download the RAPIDS jars
@@ -51,18 +52,19 @@ CUDA and will not run on other versions. The jars use a maven classifier to keep

- CUDA 10.1 => classifier cuda10-1
- CUDA 10.2 => classifier cuda10-2
- CUDA 11.0 => classifier cuda11-0
- CUDA 11.0 => classifier cuda11

For example, here is a sample version of the jars and cudf with CUDA 10.1 support:
- cudf-0.15-cuda10-1.jar
- rapids-4-spark_2.12-0.1.0.jar
- rapids-4-spark_2.12-0.2.0.jar
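
These can be pulled straight from Maven Central, for example (URLs shown for the 0.2.0 release and
the CUDA 10.1 classifier; swap the classifier for CUDA 10.2 or 11.0):

```shell
wget https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/0.2.0/rapids-4-spark_2.12-0.2.0.jar
wget https://repo1.maven.org/maven2/ai/rapids/cudf/0.15/cudf-0.15-cuda10-1.jar
```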


For simplicity, export the location of these jars. This example assumes the sample jars above have
been placed in the `/opt/sparkRapidsPlugin` directory:
```shell
export SPARK_RAPIDS_DIR=/opt/sparkRapidsPlugin
export SPARK_CUDF_JAR=${SPARK_RAPIDS_DIR}/cudf-0.15-cuda10-1.jar
export SPARK_RAPIDS_PLUGIN_JAR=${SPARK_RAPIDS_DIR}/rapids-4-spark_2.12-0.2.0-SNAPSHOT.jar
export SPARK_RAPIDS_PLUGIN_JAR=${SPARK_RAPIDS_DIR}/rapids-4-spark_2.12-0.2.0.jar
```

## Install the GPU Discovery Script
@@ -289,10 +291,13 @@ $SPARK_HOME/bin/spark-shell \
```

## Running on Kubernetes
Kubernetes requires a Docker image to run Spark. Generally you put everything you need in
that Docker image - Spark, the RAPIDS Accelerator for Spark jars, and the discovery script.
Alternatively they would need to be on a drive that is mounted when your Spark application runs.
Here we will assume you have created a Docker image that contains all of them.
Kubernetes requires a Docker image to run Spark. Generally everything needed is in the Docker
image - Spark, the RAPIDS Accelerator for Spark jars, and the discovery script. See this
[Dockerfile.cuda](Dockerfile.cuda) example.

Alternatively the jars and discovery script would need to be on a drive that is mounted when your
Spark application runs. Here we will assume you have created a Docker image that contains the
RAPIDS jars, cudf jars and discovery script.

This assumes you have Kubernetes already installed and set up. These instructions do not cover how
to set up a Kubernetes cluster.
[GPU discovery script](#install-the-gpu-discovery-script) on the node from which you are
going to build your Docker image. Note that you can download these into a local directory and
untar the Spark `.tar.gz` rather than installing into a location on the machine.
- Include the RAPIDS Accelerator for Spark jars in the Spark /jars directory
- Download the sample
[Dockerfile.cuda](https://drive.google.com/open?id=1ah7I1DQEB4Wqz5t2KK2UsctGrxDwWpeJ) or create
[Dockerfile.cuda](Dockerfile.cuda) or create
your own.
- Update the Dockerfile with the filenames for Spark and the RAPIDS Accelerator for Spark jars
that you downloaded. Include anything else application-specific that you need.
4 changes: 2 additions & 2 deletions docs/testing.md
@@ -20,7 +20,7 @@ we typically run with the default options and only increase the scale factor dep
dbgen -b dists.dss -s 10
```

You can include the test jar `rapids-4-spark-integration-tests_2.12-0.2.0-SNAPSHOT.jar` with the
You can include the test jar `rapids-4-spark-integration-tests_2.12-0.2.0.jar` with the
Spark --jars option to get the TPCH tests. To setup for the queries you can run
`TpchLikeSpark.setupAllCSV` for CSV formatted data or `TpchLikeSpark.setupAllParquet`
for parquet formatted data. Both of those take the Spark session, and a path to the dbgen
@@ -83,7 +83,7 @@ individually, so you don't risk running unit tests along with the integration te
http://www.scalatest.org/user_guide/using_the_scalatest_shell

```shell
spark-shell --jars rapids-4-spark-tests_2.12-0.2.0-SNAPSHOT-tests.jar,rapids-4-spark-integration-tests_2.12-0.2.0-SNAPSHOT-tests.jar,scalatest_2.12-3.0.5.jar,scalactic_2.12-3.0.5.jar
spark-shell --jars rapids-4-spark-tests_2.12-0.2.0-tests.jar,rapids-4-spark-integration-tests_2.12-0.2.0-tests.jar,scalatest_2.12-3.0.5.jar,scalactic_2.12-3.0.5.jar
```

First you import the `scalatest_shell` and tell the tests where they can find the test files you
30 changes: 28 additions & 2 deletions docs/version/stable-release.md
@@ -5,9 +5,35 @@ nav_order: 1
parent: Version
---

## Stable Release - v0.2.0
This is the second public release of the RAPIDS Accelerator for Apache Spark.
The list of supported operations is provided [here](../configs.md#supported-gpu-operators-and-fine-tuning).

Hardware Requirements:

GPU Architecture: NVIDIA Pascal™ or better (Tested on V100, T4 and A100 GPU)

Software Requirements:

OS: Ubuntu 16.04 & gcc 5.4 OR Ubuntu 18.04/CentOS 7 & gcc 7.3

CUDA & Nvidia Drivers: 10.1.2 & v418.87+, 10.2 & v440.33+ or 11.0 & v450.36+

Apache Spark 3.0.0 or 3.0.1

Apache Hadoop 2.10+ or 3.1.1+ (3.1.1 for nvidia-docker version 2)

Python 3.x, Scala 2.12, Java 8
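
A quick way to sanity-check that a node's driver meets these requirements (a sketch; exact output
varies by driver version):

```shell
# Reports the GPU name and installed driver version for each device
nvidia-smi --query-gpu=name,driver_version --format=csv
```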

## Download - v0.2.0
* [RAPIDS Spark Package](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/0.2.0/rapids-4-spark_2.12-0.2.0.jar)
* [cuDF 11.0 Package](https://repo1.maven.org/maven2/ai/rapids/cudf/0.15/cudf-0.15-cuda11.jar)
* [cuDF 10.2 Package](https://repo1.maven.org/maven2/ai/rapids/cudf/0.15/cudf-0.15-cuda10-2.jar)
* [cuDF 10.1 Package](https://repo1.maven.org/maven2/ai/rapids/cudf/0.15/cudf-0.15-cuda10-1.jar)

## Stable Release - v0.1.0
This is the first public release of the RAPIDS Accelerator for Apache Spark.
The list of supported operations is provided [here](../configs.html#supported-gpu-operators-and-fine-tuning)
The list of supported operations is provided [here](../configs.md#supported-gpu-operators-and-fine-tuning).

Hardware Requirements:

@@ -27,7 +53,7 @@ Software Requirements:
Python 3.x, Scala 2.12, Java 8


## Download
## Download - v0.1.0
* [RAPIDS Spark Package](https://repo1.maven.org/maven2/com/nvidia/rapids-4-spark_2.12/0.1.0/rapids-4-spark_2.12-0.1.0.jar)
* [cuDF 10.2 Package](https://repo1.maven.org/maven2/ai/rapids/cudf/0.14/cudf-0.14-cuda10-2.jar)
* [cuDF 10.1 Package](https://repo1.maven.org/maven2/ai/rapids/cudf/0.14/cudf-0.14-cuda10-1.jar)
@@ -719,7 +719,7 @@ object RapidsConf {
|On startup use: `--conf [conf key]=[conf value]`. For example:
|
|```
|${SPARK_HOME}/bin/spark --jars 'rapids-4-spark_2.12-0.2.0-SNAPSHOT.jar,cudf-0.15-cuda10-1.jar' \
|${SPARK_HOME}/bin/spark --jars 'rapids-4-spark_2.12-0.2.0.jar,cudf-0.15-cuda10-1.jar' \
|--conf spark.plugins=com.nvidia.spark.SQLPlugin \
|--conf spark.rapids.sql.incompatibleOps.enabled=true
|```