
Branch 0.4 doc cleanup #1570

Merged: 3 commits, merged on Jan 22, 2021
10 changes: 5 additions & 5 deletions docs/FAQ.md
@@ -11,11 +11,11 @@ Apache Spark caches what is used to build the output of the `explain()` function
knowledge about configs, so it may return results that are not up to date with the current config
settings. This is true of all configs in Spark. If you changed
`spark.sql.autoBroadcastJoinThreshold` after running `explain()` on a `DataFrame`, the resulting
query would not change to reflect that config and would still show a `SortMergeJoin`, even though
with the new config it might have become a `BroadcastHashJoin` instead. When actually running
something like `collect`, `show` or `write`, a new `DataFrame` is constructed, causing Spark to
re-plan the query. This is why `spark.rapids.sql.enabled` is still respected when running, even if
`explain` shows stale results.
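
As a minimal illustration of this behavior, here is a hedged spark-shell sketch. The table sizes and
threshold value are assumptions for the example, and the join strategy actually chosen depends on
your data and settings:

```scala
// Run in spark-shell, where `spark` is the active SparkSession.
val big = spark.range(10000000L).withColumnRenamed("id", "k")
val small = spark.range(100L).withColumnRenamed("id", "k")
val joined = big.join(small, "k")

joined.explain()  // the plan is built and cached here; it may show a SortMergeJoin

// Changing the config afterwards does not refresh the cached explain() output ...
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10L * 1024 * 1024)
joined.explain()  // may still show the stale plan

// ... but an action re-plans the query, so the new setting is honored at run time.
joined.count()
```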

### What versions of Apache Spark does the RAPIDS Accelerator for Apache Spark support?

224 changes: 118 additions & 106 deletions docs/compatibility.md

Large diffs are not rendered by default.

4 changes: 2 additions & 2 deletions docs/dev/README.md
@@ -201,8 +201,8 @@ i.e.: nodes that are transitioning the data to or from the GPU. Most nodes
expect their input to already be on the GPU and produce output on the GPU, so
those nodes do not need to worry about using `GpuSemaphore`. The general
rules for using the semaphore are:
* If the plan node has inputs not on the GPU but produces outputs on the GPU then the node must
acquire the semaphore by calling `GpuSemaphore.acquireIfNecessary`.
* If the plan node has inputs on the GPU but produces outputs not on the GPU
then the node must release the semaphore by calling
`GpuSemaphore.releaseIfNecessary`.
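
As an illustration of these two rules, here is a minimal, hypothetical Scala sketch. The node
classes are invented for the example, and the exact `GpuSemaphore` method signatures are an
assumption (shown here as taking the current `TaskContext`); this is not code from the plugin.

```scala
import org.apache.spark.TaskContext
import com.nvidia.spark.rapids.GpuSemaphore

// Hypothetical node: reads CPU-side input and produces GPU output, so it must
// acquire the semaphore before putting data on the GPU.
class HypotheticalHostToGpuNode {
  def produceGpuOutput(): Unit = {
    GpuSemaphore.acquireIfNecessary(TaskContext.get())  // assumed signature
    // ... decode host data and build GPU columnar batches ...
  }
}

// Hypothetical node: consumes GPU input and produces CPU output, so it must
// release the semaphore once the data has left the GPU.
class HypotheticalGpuToHostNode {
  def produceHostOutput(): Unit = {
    // ... copy results from the GPU back to host memory ...
    GpuSemaphore.releaseIfNecessary(TaskContext.get())  // assumed signature
  }
}
```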
12 changes: 9 additions & 3 deletions docs/examples.md
@@ -5,11 +5,11 @@ nav_order: 11
---
# Demos

Example notebooks allow users to test drive "RAPIDS Accelerator for Apache Spark" with public
datasets.

##### [Mortgage ETL Notebook](demo/gpu-mortgage_accelerated.ipynb) [(Dataset)](https://docs.rapids.ai/datasets/mortgage-data)

##### About the Mortgage Dataset:
The dataset is derived from [Fannie Mae’s Single-Family Loan Performance
Data](http://www.fanniemae.com/portal/funding-the-market/data/loan-performance-data.html) with all
rights reserved by Fannie Mae. This processed dataset is redistributed with permission and consent
from Fannie Mae.

For the full raw dataset, visit [Fannie
Mae](http://www.fanniemae.com/portal/funding-the-market/data/loan-performance-data.html) to register
for an account and download the data.
555 changes: 292 additions & 263 deletions docs/get-started/getting-started-aws-emr.md

Large diffs are not rendered by default.

215 changes: 123 additions & 92 deletions docs/get-started/getting-started-databricks.md
@@ -1,92 +1,123 @@

---
layout: page
title: Databricks
nav_order: 3
parent: Getting-Started
---

# Getting started with RAPIDS Accelerator on Databricks
This guide will run through how to set up the RAPIDS Accelerator for Apache Spark 3.0 on Databricks.
At the end of this guide, the reader will be able to run a sample Apache Spark application that runs
on NVIDIA GPUs on Databricks.

## Prerequisites
* Apache Spark 3.0 running in Databricks Runtime 7.3 ML with GPU
* AWS: 7.3 LTS ML (GPU, Scala 2.12, Spark 3.0.1)
* Azure: 7.3 LTS ML (GPU, Scala 2.12, Spark 3.0.1)

The number of GPUs per node dictates the number of Spark executors that can run in that node.

## Start a Databricks Cluster
Create a Databricks cluster by going to Clusters, then clicking “+ Create Cluster”. Ensure the
cluster meets the prerequisites above by configuring it as follows:
1. Select the Databricks Runtime Version from one of the supported runtimes specified in the
Prerequisites section.
2. Under Autopilot Options, disable autoscaling.
3. Choose the number of workers that matches the number of GPUs you want to use.
4. Select a worker type. On AWS, use nodes with 1 GPU each such as `p3.2xlarge` or `g4dn.xlarge`.
p2 nodes do not meet the architecture requirements (Pascal or higher) for the Spark worker
(although they can be used for the driver node). For Azure, choose GPU nodes such as
Standard_NC6s_v3.
5. Select the driver type. Generally this can be set to be the same as the worker.
6. Start the cluster.

## Advanced Cluster Configuration

We will need to create an initialization script that installs the RAPIDS jars on the cluster.

1. To create the initialization script, import the initialization script notebook from the repo
[generate-init-script.ipynb](../demo/Databricks/generate-init-script.ipynb) to your
workspace. See [Managing
Notebooks](https://docs.databricks.com/notebooks/notebooks-manage.html#id2) on how to import a
notebook, then open the notebook.
2. Once you are in the notebook, click the “Run All” button.
3. Ensure that the newly created init.sh script is present in the output from cell 2 and that the
contents of the script are correct.
4. Go back and edit your cluster to configure it to use the init script. To do this, click the
“Clusters” button on the left panel, then select your cluster.
5. Click the “Edit” button, then navigate down to the “Advanced Options” section. Select the “Init
Scripts” tab in the advanced options section, and paste the initialization script:
`dbfs:/databricks/init_scripts/init.sh`, then click “Add”.

![Init Script](../img/Databricks/initscript.png)

6. Now select the “Spark” tab, and paste the following config options into the Spark Config section.
Change the config values based on the workers you choose. See Apache Spark
[configuration](https://spark.apache.org/docs/latest/configuration.html) and RAPIDS Accelerator
for Apache Spark [descriptions](../configs.md) for each config.

The
[`spark.task.resource.gpu.amount`](https://spark.apache.org/docs/latest/configuration.html#scheduling)
configuration defaults to 1 on Databricks. That means only 1 task can run on an executor with
1 GPU, which is limiting, especially for reads from and writes to Parquet. Set this to
1/(number of cores per executor), which allows multiple tasks to run in parallel just as they do
on the CPU side. Setting the value even smaller is fine as well.
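
For example, a quick sizing sketch (the executor core count below is an assumed value for
illustration, not a recommendation):

```scala
// Assumed example: an executor with 8 cores sharing a single GPU.
val coresPerExecutor = 8
val taskGpuAmount = 1.0 / coresPerExecutor  // 0.125 allows 8 tasks to share the GPU concurrently
println(s"spark.task.resource.gpu.amount $taskGpuAmount")
```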

There is an incompatibility between the Databricks-specific implementation of adaptive query
execution (AQE) and the spark-rapids plugin. To mitigate this, `spark.sql.adaptive.enabled`
should be set to false. In addition, the plugin does not work with the Databricks
`spark.databricks.delta.optimizeWrite` option.

```bash
spark.plugins com.nvidia.spark.SQLPlugin
spark.task.resource.gpu.amount 0.1
spark.rapids.memory.pinnedPool.size 2G
spark.locality.wait 0s
spark.databricks.delta.optimizeWrite.enabled false
spark.sql.adaptive.enabled false
spark.rapids.sql.concurrentGpuTasks 2
```

![Spark Config](../img/Databricks/sparkconfig.png)

7. Once you’ve added the Spark config, click “Confirm and Restart”.
8. Once the cluster comes back up, it is now enabled for GPU-accelerated Spark with RAPIDS and cuDF.

## Import the GPU Mortgage Example Notebook
Import the example [notebook](../demo/gpu-mortgage_accelerated.ipynb) from the repo into your
workspace, then open the notebook. Modify the first cell to point to your workspace, and download a
larger dataset if needed. You can find the links to the datasets at
[docs.rapids.ai](https://docs.rapids.ai/datasets/mortgage-data).

```bash
%sh

wget http://rapidsai-data.s3-website.us-east-2.amazonaws.com/notebook-mortgage-data/mortgage_2000.tgz -P /Users/<your user id>/

mkdir -p /dbfs/FileStore/tables/mortgage
mkdir -p /dbfs/FileStore/tables/mortgage_parquet_gpu/perf
mkdir /dbfs/FileStore/tables/mortgage_parquet_gpu/acq
mkdir /dbfs/FileStore/tables/mortgage_parquet_gpu/output

tar xfvz /Users/<your user id>/mortgage_2000.tgz --directory /dbfs/FileStore/tables/mortgage
```

In Cell 3, update the data paths if necessary. The example notebook merges the columns and prepares
the data for XGBoost training. The temporary and final output results are written back to DBFS.

```bash
orig_perf_path='dbfs:///FileStore/tables/mortgage/perf/*'
orig_acq_path='dbfs:///FileStore/tables/mortgage/acq/*'
tmp_perf_path='dbfs:///FileStore/tables/mortgage_parquet_gpu/perf/'
tmp_acq_path='dbfs:///FileStore/tables/mortgage_parquet_gpu/acq/'
output_path='dbfs:///FileStore/tables/mortgage_parquet_gpu/output/'
```
Run the notebook by clicking “Run All”.

## Hints
Spark logs in Databricks are removed upon cluster shutdown. It is possible to save logs in a cloud
storage location using Databricks [cluster log
delivery](https://docs.databricks.com/clusters/configure.html#cluster-log-delivery-1). Enable this
option before starting the cluster to capture the logs.
