Documentation updates (NVIDIA#422)
sameerz authored Jul 27, 2020
1 parent 3428cc0 commit a709912
Showing 5 changed files with 51 additions and 50 deletions.
48 changes: 23 additions & 25 deletions docs/get-started/getting-started-gcp.md
@@ -6,29 +6,27 @@ parent: Getting-Started
---

# Getting started with RAPIDS Accelerator on GCP Dataproc
[Google Cloud Dataproc](https://cloud.google.com/dataproc) is Google Cloud's fully managed Apache Spark and Hadoop service. This guide will walk through the steps to:

* [Spin up a Dataproc Cluster Accelerated by GPUs](getting-started-gcp#spin-up-a-dataproc-cluster-accelerated-by-gpus)
* [Run a sample Pyspark or Scala ETL and XGBoost training notebook on a Dataproc Cluster Accelerated by GPUs](getting-started-gcp#run-pyspark-and-scala-notebooks-on-a-dataproc-cluster-accelerated-by-gpus)
* [Submit the same sample ETL application as a Spark job to a Dataproc Cluster Accelerated by GPUs](getting-started-gcp#submit-spark-jobs-to-a-dataproc-cluster-accelerated-by-gpus)



## Spin up a Dataproc Cluster Accelerated by GPUs

You can use [Cloud Shell](https://cloud.google.com/shell) to execute shell commands that will create a Dataproc cluster. Cloud Shell contains command line tools for interacting with Google Cloud Platform, including gcloud and gsutil. Alternatively, you can install the [GCloud SDK](https://cloud.google.com/sdk/install) on your laptop. From the Cloud Shell, you will need to enable services within your project. Enable the Compute and Dataproc APIs in order to access Dataproc, and enable the Storage API as you’ll need a Google Cloud Storage bucket to house your data. This may take several minutes.
```bash
gcloud services enable compute.googleapis.com
gcloud services enable dataproc.googleapis.com
gcloud services enable storage-api.googleapis.com
```
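
If you do not already have a GCS bucket for the demo data and job artifacts, a minimal sketch for creating one is shown below. The bucket name and region are placeholders rather than values from this guide; any existing bucket works just as well.

```bash
# Placeholder values -- substitute your own bucket name and region
export REGION=us-central1
export GCS_BUCKET=my-rapids-spark-demo   # hypothetical bucket name

# Create a regional bucket to hold datasets and job artifacts
gsutil mb -l $REGION gs://$GCS_BUCKET/
```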

After the command line environment is set up, log in to your GCP account. You can now create a Dataproc cluster with the configuration shown below.
The configuration will allow users to run any of the [notebook demos](../demo/GCP) on GCP. Alternatively, users can also start a smaller 2*2T4 cluster (two worker nodes with two NVIDIA T4 GPUs each).
* [GPU Driver](https://github.com/GoogleCloudDataproc/initialization-actions/tree/master/gpu) and [RAPIDS Accelerator for Apache Spark](https://github.com/GoogleCloudDataproc/initialization-actions/tree/master/rapids) through initialization actions (the init action is only available in US region public buckets as of 2020-07-16)
* One 8-core master node and 5 32-core worker nodes
* Four NVIDIA T4 GPUs for each worker node
* [Local SSD](https://cloud.google.com/dataproc/docs/concepts/compute/dataproc-local-ssds) is recommended for Spark scratch space to improve IO
* Component gateway enabled for accessing Web UIs hosted on the cluster
* Configuration for [GPU scheduling and isolation](/get-started/yarn-gpu.html)

@@ -56,21 +54,20 @@ gcloud dataproc clusters create $CLUSTER_NAME \
--enable-component-gateway \
--properties="^#^spark:spark.yarn.unmanagedAM.enabled=false"
```
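
The middle of the create command above is collapsed in this diff view. As a reference, a minimal sketch of a comparable command matching the configuration described above is shown below; the image version, machine types, init-action paths, and metadata key are assumptions, so check the current Dataproc and RAPIDS init-action documentation before running it.

```bash
export REGION=us-central1                # assumed region (init actions are in US public buckets)
export GCS_BUCKET=my-rapids-spark-demo   # hypothetical bucket name
export CLUSTER_NAME=my-gpu-cluster       # hypothetical cluster name
export NUM_GPUS=4
export NUM_WORKERS=5

# Sketch only: one 8-core master, five 32-core workers with four T4 GPUs and a
# local SSD each, GPU driver + RAPIDS init actions, Jupyter/Zeppelin components,
# and the component gateway for the cluster web UIs.
gcloud dataproc clusters create $CLUSTER_NAME \
  --region $REGION \
  --image-version 2.0-ubuntu18 \
  --master-machine-type n1-standard-8 \
  --num-workers $NUM_WORKERS \
  --worker-machine-type n1-highmem-32 \
  --worker-accelerator type=nvidia-tesla-t4,count=$NUM_GPUS \
  --num-worker-local-ssds 1 \
  --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/gpu/install_gpu_driver.sh,gs://goog-dataproc-initialization-actions-${REGION}/rapids/rapids.sh \
  --metadata rapids-runtime=SPARK \
  --optional-components=JUPYTER,ZEPPELIN \
  --enable-component-gateway \
  --properties="^#^spark:spark.yarn.unmanagedAM.enabled=false"
```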
This may take around 5-15 minutes to complete. You can navigate to the Dataproc clusters tab in the Google Cloud Console to see the progress.

![Dataproc Cluster](../img/dataproc-cluster.png)

## Run Pyspark and Scala Notebooks on a Dataproc Cluster Accelerated by GPUs
To use notebooks with a Dataproc cluster, click on the cluster name under the Dataproc cluster tab and navigate to the "Web Interfaces" tab. Under "Web Interfaces", click on the JupyterLab or Jupyter link to start using the sample [Mortgage ETL on GPU Jupyter Notebook](../demo/GCP/Mortgage-ETL-GPU.ipynb) to process the full 17 years of [Mortgage data](https://rapidsai.github.io/demos/datasets/mortgage-data).

![Dataproc Web Interfaces](../img/dataproc-service.png)

The notebook will first transcode CSV files into Parquet files and then run an ETL query to prepare the dataset for training. In the sample notebook, we use the 2016 data as the evaluation set and the rest as the training set, saving each to its respective GCS location. Using the default notebook configuration the first stage should take ~110 seconds (1/3 of the CPU execution time with the same config) and the second stage takes ~170 seconds (1/7 of the CPU execution time with the same config). The notebook depends on the pre-compiled [Spark RAPIDS SQL plugin](https://mvnrepository.com/artifact/com.nvidia/rapids-4-spark-parent) and [cuDF](https://mvnrepository.com/artifact/ai.rapids/cudf/0.14), which are pre-downloaded by the GCP Dataproc [RAPIDS init script](https://github.com/GoogleCloudDataproc/initialization-actions/tree/master/rapids).

Once the data is prepared, we use the [Mortgage XGBoost4j Scala Notebook](../demo/GCP/mortgage-xgboost4j-gpu-scala.zpln) in Dataproc's Zeppelin service to execute the training job on the GPU. NVIDIA also ships [Spark XGBoost4j](https://github.com/NVIDIA/spark-xgboost), which is based on [DMLC xgboost](https://github.com/dmlc/xgboost). The precompiled [XGBoost4j]() and [XGBoost4j Spark](https://repo1.maven.org/maven2/com/nvidia/xgboost4j-spark_3.0/1.0.0-0.1.0/) libraries can be downloaded from Maven; they are pre-downloaded by the GCP [RAPIDS init action](https://github.com/GoogleCloudDataproc/initialization-actions/tree/master/rapids). Since GitHub cannot render a Zeppelin notebook, we have prepared a [Jupyter Notebook with Scala code](../demo/GCP/mortgage-xgboost4j-gpu-scala.ipynb) so you can view the code content.

The training time should be around 480 seconds (1/10 of the CPU execution time with the same config), as shown in the following cell:
```scala
// Start training
println("\n------ Training ------")
@@ -79,10 +76,11 @@ val (xgbClassificationModel, _) = benchmark("train") {
}
```

## Submit Spark jobs to a Dataproc Cluster Accelerated by GPUs
Similar to spark-submit for on-prem clusters, Dataproc supports submitting a Spark application as a Dataproc job. The mortgage examples we use above are also available as a [spark application](https://github.com/NVIDIA/spark-xgboost-examples/tree/spark-3/examples/apps/scala). First, [build the jar files](https://github.com/NVIDIA/spark-xgboost-examples/blob/spark-3/getting-started-guides/building-sample-apps/scala.md) with maven: `mvn package -Dcuda.classifier=cuda10-2`.
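
As a rough sketch, the build steps look like the following; the repository URL and branch come from the links above, while the module directory is assumed from the example application link.

```bash
# Clone the spark-3 branch of the XGBoost examples repository
git clone -b spark-3 https://github.com/NVIDIA/spark-xgboost-examples.git
cd spark-xgboost-examples/examples/apps/scala   # assumed module directory

# Build the sample apps jar for CUDA 10.2 (produces target/sample_xgboost_apps-0.2.2.jar)
mvn package -Dcuda.classifier=cuda10-2
```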

Place the jar file `sample_xgboost_apps-0.2.2.jar` under the `gs://$GCS_BUCKET/scala/` folder by running `gsutil cp target/sample_xgboost_apps-0.2.2.jar gs://$GCS_BUCKET/scala/`. Alternatively, you can drag and drop the file from your local machine into the GCP storage browser. Then submit the jar as a Dataproc job:
```bash
export GCS_BUCKET=<bucket_name>
export CLUSTER_NAME=<cluster_name>
@@ -110,6 +108,6 @@ gcloud dataproc jobs submit spark \
```
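
Most of the submit command above is collapsed in this diff view. A minimal sketch of submitting the jar as a Dataproc Spark job is shown below; the main class name, region, and Spark properties are illustrative assumptions, and the mortgage application's own arguments (data paths, tree method, and so on) would follow a `--` separator as described in the sample application's documentation.

```bash
# Assumes GCS_BUCKET and CLUSTER_NAME are exported as in the snippet above
export REGION=us-central1            # assumed region

# Sketch only: the main class name and Spark properties are illustrative
gcloud dataproc jobs submit spark \
  --cluster=$CLUSTER_NAME \
  --region=$REGION \
  --class=com.nvidia.spark.examples.mortgage.GPUMain \
  --jars=gs://$GCS_BUCKET/scala/sample_xgboost_apps-0.2.2.jar \
  --properties=spark.executor.cores=8,spark.task.resource.gpu.amount=0.125
```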

## Dataproc Hub in AI Platform Notebook to Dataproc cluster
With the integration between AI Platform Notebooks and Dataproc, users can create a [Dataproc Hub notebook](https://cloud.google.com/blog/products/data-analytics/administering-jupyter-notebooks-for-spark-workloads-on-dataproc) from AI Platform that connects to a Dataproc cluster through a YAML configuration.

In the future, users will be able to provision a Dataproc cluster through a Dataproc Hub notebook. You can use the example [pyspark notebooks](../demo/GCP/Mortgage-ETL-GPU.ipynb) to experiment.
4 changes: 2 additions & 2 deletions docs/get-started/getting-started-menu.md
@@ -34,7 +34,7 @@ To enable GPU processing acceleration you will need:
## Spark GPU Scheduling Overview
Apache Spark 3.0 now supports GPU scheduling as long as you are using a cluster manager that
supports it. You can have Spark request GPUs and assign them to tasks. The exact configs you use
will vary depending on your cluster manager. Here are some example configs (a combined spark-submit sketch follows the documentation links below):
- Request your executor to have GPUs:
- `--conf spark.executor.resource.gpu.amount=1`
- Specify the number of GPUs per task:
@@ -54,4 +54,4 @@ You can also refer to the official Apache Spark documentation.
- [Overview](https://github.com/apache/spark/blob/master/docs/configuration.md#custom-resource-scheduling-and-configuration-overview)
- [Kubernetes specific documentation](https://github.com/apache/spark/blob/master/docs/running-on-kubernetes.md#resource-allocation-and-configuration-overview)
- [Yarn specific documentation](https://github.com/apache/spark/blob/master/docs/running-on-yarn.md#resource-allocation-and-configuration-overview)
- [Standalone specific documentation](https://github.com/apache/spark/blob/master/docs/spark-standalone.md#resource-allocation-and-configuration-overview)
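
Putting the GPU scheduling configs above together, a hedged spark-submit sketch for a YARN cluster might look like the following; the resource amounts are illustrative, `getGpusResources.sh` refers to the example GPU discovery script shipped in the Apache Spark repository (`examples/src/main/scripts/getGpusResources.sh`), and `your-application.jar` is a placeholder.

```bash
# Illustrative values: one GPU per executor, four concurrent tasks sharing it
spark-submit \
  --master yarn \
  --conf spark.executor.resource.gpu.amount=1 \
  --conf spark.task.resource.gpu.amount=0.25 \
  --conf spark.executor.resource.gpu.discoveryScript=./getGpusResources.sh \
  --files ./getGpusResources.sh \
  your-application.jar
```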
@@ -59,7 +59,7 @@ We will need to create an initialization script for the cluster that installs th

## Import the GPU Mortgage Example Notebook
Import the example [notebook](../demo/gpu-mortgage_accelerated.ipynb) from the repo into your workspace, then open the notebook.
Modify the first cell to point to your workspace, and download a larger dataset if needed. You can find the links to the datasets at [docs.rapids.ai](https://docs.rapids.ai/datasets/mortgage-data).

```bash
%sh
@@ -74,15 +74,15 @@ mkdir /dbfs/FileStore/tables/mortgage_parquet_gpu/output
tar xfvz /Users/<your user id>/mortgage_2000.tgz --directory /dbfs/FileStore/tables/mortgage
```

In Cell 3, update the data paths if necessary. The example notebook merges the columns and prepares the data for XGBoost training. The temp and final output results are written back to DBFS.
```bash
orig_perf_path='dbfs:///FileStore/tables/mortgage/perf/*'
orig_acq_path='dbfs:///FileStore/tables/mortgage/acq/*'
tmp_perf_path='dbfs:///FileStore/tables/mortgage_parquet_gpu/perf/'
tmp_acq_path='dbfs:///FileStore/tables/mortgage_parquet_gpu/acq/'
output_path='dbfs:///FileStore/tables/mortgage_parquet_gpu/output/'
```
Run the notebook by clicking “Run All”.

## Hints
Spark logs in Databricks are removed upon cluster shutdown. It is possible to save logs in a cloud storage location using Databricks [cluster log delivery](https://docs.databricks.com/clusters/configure.html#cluster-log-delivery-1). Enable this option before starting the cluster to capture the logs.
