From c89bf97158d023923d8f5fc38f8bb4dbf3ce6a25 Mon Sep 17 00:00:00 2001 From: Sameer Raheja Date: Thu, 21 Jan 2021 16:09:47 -0800 Subject: [PATCH 1/3] Formatting cleanup Signed-off-by: Sameer Raheja --- .../get-started/getting-started-databricks.md | 215 ++++++++++-------- docs/get-started/getting-started-gcp.md | 93 ++++++-- docs/supported_ops.md | 3 +- .../com/nvidia/spark/rapids/TypeChecks.scala | 5 +- 4 files changed, 202 insertions(+), 114 deletions(-) diff --git a/docs/get-started/getting-started-databricks.md b/docs/get-started/getting-started-databricks.md index 63e169ba3e1..3355c8926f2 100644 --- a/docs/get-started/getting-started-databricks.md +++ b/docs/get-started/getting-started-databricks.md @@ -1,92 +1,123 @@ ---- -layout: page -title: Databricks -nav_order: 3 -parent: Getting-Started ---- - -# Getting started with RAPIDS Accelerator on Databricks -This guide will run through how to set up the RAPIDS Accelerator for Apache Spark 3.0 on Databricks. At the end of this guide, the reader will be able to run a sample Apache Spark application that runs on NVIDIA GPUs on Databricks. - -## Prerequisites - * Apache Spark 3.0 running in DataBricks Runtime 7.3 ML with GPU - * AWS: 7.3 LTS ML (GPU, Scala 2.12, Spark 3.0.1) - * Azure: 7.3 LTS ML (GPU, Scala 2.12, Spark 3.0.1) - -The number of GPUs per node dictates the number of Spark executors that can run in that node. - -## Start a Databricks Cluster -Create a Databricks cluster by going to Clusters, then clicking “+ Create Cluster”. Ensure the cluster meets the prerequisites above by configuring it as follows: -1. Select the Databricks Runtime Version from one of the supported runtimes specified in the Prerequisites section. -2. Under Autopilot Options, disable autoscaling. -3. Choose the number of workers that matches the number of GPUs you want to use. -4. Select a worker type. On AWS, use nodes with 1 GPU each such as `p3.2xlarge` or `g4dn.xlarge`. p2 nodes do not meet the architecture requirements (Pascal or higher) for the Spark worker (although they can be used for the driver node). For Azure, choose GPU nodes such as Standard_NC6s_v3. -5. Select the driver type. Generally this can be set to be the same as the worker. -6. Start the cluster. - -## Advanced Cluster Configuration - -We will need to create an initialization script for the cluster that installs the RAPIDS jars to the cluster. - -1. To create the initialization script, import the initialization script notebook from the repo [generate-init-script.ipynb](../demo/Databricks/generate-init-script.ipynb) to your workspace. See [Managing Notebooks](https://docs.databricks.com/notebooks/notebooks-manage.html#id2) on how to import a notebook, then open the notebook. -2. Once you are in the notebook, click the “Run All” button. -3. Ensure that the newly created init.sh script is present in the output from cell 2 and that the contents of the script are correct. -4. Go back and edit your cluster to configure it to use the init script. To do this, click the “Clusters” button on the left panel, then select your cluster. -5. Click the “Edit” button, then navigate down to the “Advanced Options” section. Select the “Init Scripts” tab in the advanced options section, and paste the initialization script: `dbfs:/databricks/init_scripts/init.sh`, then click “Add”. - - ![Init Script](../img/Databricks/initscript.png) - -6. Now select the “Spark” tab, and paste the following config options into the Spark Config section. Change the config values based on the workers you choose. 
See Apache Spark [configuration](https://spark.apache.org/docs/latest/configuration.html) and RAPIDS Accelerator for Apache Spark [descriptions](../configs.md) for each config. - - The [`spark.task.resource.gpu.amount`](https://spark.apache.org/docs/latest/configuration.html#scheduling) configuration is defaulted to 1 by Databricks. That means that only 1 task can run on an executor with 1 GPU, which is limiting, especially on the reads and writes from Parquet. Set this to 1/(number of cores per executor) which will allow multiple tasks to run in parallel just like the CPU side. Having the value smaller is fine as well. - - There is an incompatibility between the Databricks specific implementation of adaptive query - execution (AQE) and the spark-rapids plugin. In order to mitigate this, - `spark.sql.adaptive.enabled` should be set to false. In addition, the plugin does not work with - the Databricks `spark.databricks.delta.optimizeWrite` option. - - ```bash - spark.plugins com.nvidia.spark.SQLPlugin - spark.task.resource.gpu.amount 0.1 - spark.rapids.memory.pinnedPool.size 2G - spark.locality.wait 0s - spark.databricks.delta.optimizeWrite.enabled false - spark.sql.adaptive.enabled false - spark.rapids.sql.concurrentGpuTasks 2 - ``` - - ![Spark Config](../img/Databricks/sparkconfig.png) - -7. Once you’ve added the Spark config, click “Confirm and Restart”. -8. Once the cluster comes back up, it is now enabled for GPU-accelerated Spark with RAPIDS and cuDF. - -## Import the GPU Mortgage Example Notebook -Import the example [notebook](../demo/gpu-mortgage_accelerated.ipynb) from the repo into your workspace, then open the notebook. -Modify the first cell to point to your workspace, and download a larger dataset if needed. You can find the links to the datasets at [docs.rapids.ai](https://docs.rapids.ai/datasets/mortgage-data). - -```bash -%sh - -wget http://rapidsai-data.s3-website.us-east-2.amazonaws.com/notebook-mortgage-data/mortgage_2000.tgz -P /Users// - -mkdir -p /dbfs/FileStore/tables/mortgage -mkdir -p /dbfs/FileStore/tables/mortgage_parquet_gpu/perf -mkdir /dbfs/FileStore/tables/mortgage_parquet_gpu/acq -mkdir /dbfs/FileStore/tables/mortgage_parquet_gpu/output - -tar xfvz /Users//mortgage_2000.tgz --directory /dbfs/FileStore/tables/mortgage -``` - -In Cell 3, update the data paths if necessary. The example notebook merges the columns and prepares the data for XGBoost training. The temp and final output results are written back to the dbfs. -```bash -orig_perf_path='dbfs:///FileStore/tables/mortgage/perf/*' -orig_acq_path='dbfs:///FileStore/tables/mortgage/acq/*' -tmp_perf_path='dbfs:///FileStore/tables/mortgage_parquet_gpu/perf/' -tmp_acq_path='dbfs:///FileStore/tables/mortgage_parquet_gpu/acq/' -output_path='dbfs:///FileStore/tables/mortgage_parquet_gpu/output/' -``` -Run the notebook by clicking “Run All”. - -## Hints -Spark logs in Databricks are removed upon cluster shutdown. It is possible to save logs in a cloud storage location using Databricks [cluster log delivery](https://docs.databricks.com/clusters/configure.html#cluster-log-delivery-1). Enable this option before starting the cluster to capture the logs. - +--- +layout: page +title: Databricks +nav_order: 3 +parent: Getting-Started +--- + +# Getting started with RAPIDS Accelerator on Databricks +This guide will run through how to set up the RAPIDS Accelerator for Apache Spark 3.0 on Databricks. 
+At the end of this guide, the reader will be able to run a sample Apache Spark application that runs +on NVIDIA GPUs on Databricks. + +## Prerequisites + * Apache Spark 3.0 running in DataBricks Runtime 7.3 ML with GPU + * AWS: 7.3 LTS ML (GPU, Scala 2.12, Spark 3.0.1) + * Azure: 7.3 LTS ML (GPU, Scala 2.12, Spark 3.0.1) + +The number of GPUs per node dictates the number of Spark executors that can run in that node. + +## Start a Databricks Cluster +Create a Databricks cluster by going to Clusters, then clicking “+ Create Cluster”. Ensure the +cluster meets the prerequisites above by configuring it as follows: +1. Select the Databricks Runtime Version from one of the supported runtimes specified in the + Prerequisites section. +2. Under Autopilot Options, disable autoscaling. +3. Choose the number of workers that matches the number of GPUs you want to use. +4. Select a worker type. On AWS, use nodes with 1 GPU each such as `p3.2xlarge` or `g4dn.xlarge`. + p2 nodes do not meet the architecture requirements (Pascal or higher) for the Spark worker + (although they can be used for the driver node). For Azure, choose GPU nodes such as + Standard_NC6s_v3. +5. Select the driver type. Generally this can be set to be the same as the worker. +6. Start the cluster. + +## Advanced Cluster Configuration + +We will need to create an initialization script for the cluster that installs the RAPIDS jars to the +cluster. + +1. To create the initialization script, import the initialization script notebook from the repo + [generate-init-script.ipynb](../demo/Databricks/generate-init-script.ipynb) to your + workspace. See [Managing + Notebooks](https://docs.databricks.com/notebooks/notebooks-manage.html#id2) on how to import a + notebook, then open the notebook. +2. Once you are in the notebook, click the “Run All” button. +3. Ensure that the newly created init.sh script is present in the output from cell 2 and that the + contents of the script are correct. +4. Go back and edit your cluster to configure it to use the init script. To do this, click the + “Clusters” button on the left panel, then select your cluster. +5. Click the “Edit” button, then navigate down to the “Advanced Options” section. Select the “Init + Scripts” tab in the advanced options section, and paste the initialization script: + `dbfs:/databricks/init_scripts/init.sh`, then click “Add”. + + ![Init Script](../img/Databricks/initscript.png) + +6. Now select the “Spark” tab, and paste the following config options into the Spark Config section. + Change the config values based on the workers you choose. See Apache Spark + [configuration](https://spark.apache.org/docs/latest/configuration.html) and RAPIDS Accelerator + for Apache Spark [descriptions](../configs.md) for each config. + + The + [`spark.task.resource.gpu.amount`](https://spark.apache.org/docs/latest/configuration.html#scheduling) + configuration is defaulted to 1 by Databricks. That means that only 1 task can run on an + executor with 1 GPU, which is limiting, especially on the reads and writes from Parquet. Set + this to 1/(number of cores per executor) which will allow multiple tasks to run in parallel just + like the CPU side. Having the value smaller is fine as well. + + There is an incompatibility between the Databricks specific implementation of adaptive query + execution (AQE) and the spark-rapids plugin. In order to mitigate this, + `spark.sql.adaptive.enabled` should be set to false. 
In addition, the plugin does not work with + the Databricks `spark.databricks.delta.optimizeWrite` option. + + ```bash + spark.plugins com.nvidia.spark.SQLPlugin + spark.task.resource.gpu.amount 0.1 + spark.rapids.memory.pinnedPool.size 2G + spark.locality.wait 0s + spark.databricks.delta.optimizeWrite.enabled false + spark.sql.adaptive.enabled false + spark.rapids.sql.concurrentGpuTasks 2 + ``` + + ![Spark Config](../img/Databricks/sparkconfig.png) + +7. Once you’ve added the Spark config, click “Confirm and Restart”. +8. Once the cluster comes back up, it is now enabled for GPU-accelerated Spark with RAPIDS and cuDF. + +## Import the GPU Mortgage Example Notebook +Import the example [notebook](../demo/gpu-mortgage_accelerated.ipynb) from the repo into your +workspace, then open the notebook. Modify the first cell to point to your workspace, and download a +larger dataset if needed. You can find the links to the datasets at +[docs.rapids.ai](https://docs.rapids.ai/datasets/mortgage-data). + +```bash +%sh + +wget http://rapidsai-data.s3-website.us-east-2.amazonaws.com/notebook-mortgage-data/mortgage_2000.tgz -P /Users// + +mkdir -p /dbfs/FileStore/tables/mortgage +mkdir -p /dbfs/FileStore/tables/mortgage_parquet_gpu/perf +mkdir /dbfs/FileStore/tables/mortgage_parquet_gpu/acq +mkdir /dbfs/FileStore/tables/mortgage_parquet_gpu/output + +tar xfvz /Users//mortgage_2000.tgz --directory /dbfs/FileStore/tables/mortgage +``` + +In Cell 3, update the data paths if necessary. The example notebook merges the columns and prepares +the data for XGBoost training. The temp and final output results are written back to the dbfs. + +```bash +orig_perf_path='dbfs:///FileStore/tables/mortgage/perf/*' +orig_acq_path='dbfs:///FileStore/tables/mortgage/acq/*' +tmp_perf_path='dbfs:///FileStore/tables/mortgage_parquet_gpu/perf/' +tmp_acq_path='dbfs:///FileStore/tables/mortgage_parquet_gpu/acq/' +output_path='dbfs:///FileStore/tables/mortgage_parquet_gpu/output/' +``` +Run the notebook by clicking “Run All”. + +## Hints +Spark logs in Databricks are removed upon cluster shutdown. It is possible to save logs in a cloud +storage location using Databricks [cluster log +delivery](https://docs.databricks.com/clusters/configure.html#cluster-log-delivery-1). Enable this +option before starting the cluster to capture the logs. + diff --git a/docs/get-started/getting-started-gcp.md b/docs/get-started/getting-started-gcp.md index 823227c964b..2f07d9dfdec 100644 --- a/docs/get-started/getting-started-gcp.md +++ b/docs/get-started/getting-started-gcp.md @@ -6,30 +6,47 @@ parent: Getting-Started --- # Getting started with RAPIDS Accelerator on GCP Dataproc - [Google Cloud Dataproc](https://cloud.google.com/dataproc) is Google Cloud's fully managed Apache Spark and Hadoop service. This guide will walk through the steps to: + [Google Cloud Dataproc](https://cloud.google.com/dataproc) is Google Cloud's fully managed Apache + Spark and Hadoop service. 
This guide will walk through the steps to: * [Spin up a Dataproc Cluster Accelerated by GPUs](#spin-up-a-dataproc-cluster-accelerated-by-gpus) -* [Run Pyspark or Scala ETL and XGBoost training Notebook on a Dataproc Cluster Accelerated by GPUs](#run-pyspark-or-scala-notebook-on-a-dataproc-cluster-accelerated-by-gpus) -* [Submit the same sample ETL application as a Spark job to a Dataproc Cluster Accelerated by GPUs](#submit-spark-jobs-to-a-dataproc-cluster-accelerated-by-gpus) +* [Run Pyspark or Scala ETL and XGBoost training Notebook on a Dataproc Cluster Accelerated by + GPUs](#run-pyspark-or-scala-notebook-on-a-dataproc-cluster-accelerated-by-gpus) +* [Submit the same sample ETL application as a Spark job to a Dataproc Cluster Accelerated by + GPUs](#submit-spark-jobs-to-a-dataproc-cluster-accelerated-by-gpus) ## Spin up a Dataproc Cluster Accelerated by GPUs - You can use [Cloud Shell](https://cloud.google.com/shell) to execute shell commands that will create a Dataproc cluster. Cloud Shell contains command line tools for interacting with Google Cloud Platform, including gcloud and gsutil. Alternatively, you can install [GCloud SDK](https://cloud.google.com/sdk/install) on your laptop. From the Cloud Shell, users will need to enable services within your project. Enable the Compute and Dataproc APIs in order to access Dataproc, and enable the Storage API as you’ll need a Google Cloud Storage bucket to house your data. This may take several minutes. + You can use [Cloud Shell](https://cloud.google.com/shell) to execute shell commands that will + create a Dataproc cluster. Cloud Shell contains command line tools for interacting with Google + Cloud Platform, including gcloud and gsutil. Alternatively, you can install [GCloud + SDK](https://cloud.google.com/sdk/install) on your machine. From the Cloud Shell, users will need + to enable services within your project. Enable the Compute and Dataproc APIs in order to access + Dataproc, and enable the Storage API as you’ll need a Google Cloud Storage bucket to house your + data. This may take several minutes. + ```bash gcloud services enable compute.googleapis.com gcloud services enable dataproc.googleapis.com gcloud services enable storage-api.googleapis.com ``` -After the command line environment is setup, log in to your GCP account. You can now create a Dataproc cluster with the configuration shown below. -The configuration will allow users to run any of the [notebook demos](https://github.com/NVIDIA/spark-rapids/tree/branch-0.2/docs/demo/GCP) on GCP. Alternatively, users can also start 2*2T4 worker nodes. +After the command line environment is setup, log in to your GCP account. You can now create a +Dataproc cluster with the configuration shown below. The configuration will allow users to run any +of the [notebook demos](https://github.com/NVIDIA/spark-rapids/tree/branch-0.2/docs/demo/GCP) on +GCP. Alternatively, users can also start 2*2T4 worker nodes. 
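If you choose the smaller 2-worker, 2x-T4-per-worker layout mentioned above, only the worker count
and the per-worker accelerator count change relative to the full cluster-creation script that
follows. A minimal sketch of just those flags, assuming every other flag (region, machine types,
image version, init actions) stays as in the script below:

```bash
# Sketch: flags that differ for a 2-worker cluster with 2 T4 GPUs per worker.
# All remaining flags are assumed to match the full script shown below.
gcloud dataproc clusters create $CLUSTER_NAME \
  --num-workers=2 \
  --worker-accelerator type=nvidia-tesla-t4,count=2
```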
The script below will initialize with the following: -* [GPU Driver](https://github.com/GoogleCloudDataproc/initialization-actions/tree/master/gpu) and [RAPIDS Acclerator for Apache Spark](https://github.com/GoogleCloudDataproc/initialization-actions/tree/master/rapids) through initialization actions (the init action is only available in US region public buckets as of 2020-07-16) +* [GPU Driver](https://github.com/GoogleCloudDataproc/initialization-actions/tree/master/gpu) and + [RAPIDS Acclerator for Apache + Spark](https://github.com/GoogleCloudDataproc/initialization-actions/tree/master/rapids) through + initialization actions (the init action is only available in US region public buckets as of + 2020-07-16) * One 8-core master node and 5 32-core worker nodes * Four NVIDIA T4 for each worker node -* [Local SSD](https://cloud.google.com/dataproc/docs/concepts/compute/dataproc-local-ssds) is recommended for Spark scratch space to improve IO +* [Local SSD](https://cloud.google.com/dataproc/docs/concepts/compute/dataproc-local-ssds) is + recommended for Spark scratch space to improve IO * Component gateway enabled for accessing Web UIs hosted on the cluster * Configuration for [GPU scheduling and isolation](yarn-gpu.md) @@ -57,20 +74,47 @@ gcloud dataproc clusters create $CLUSTER_NAME \ --enable-component-gateway \ --properties="^#^spark:spark.yarn.unmanagedAM.enabled=false" ``` -This may take around 5-15 minutes to complete. You can navigate to the Dataproc clusters tab in the Google Cloud Console to see the progress. + +This may take around 5-15 minutes to complete. You can navigate to the Dataproc clusters tab in the +Google Cloud Console to see the progress. ![Dataproc Cluster](../img/GCP/dataproc-cluster.png) ## Run PySpark or Scala Notebook on a Dataproc Cluster Accelerated by GPUs -To use notebooks with a Dataproc cluster, click on the cluster name under the Dataproc cluster tab and navigate to the "Web Interfaces" tab. Under "Web Interfaces", click on the JupyterLab or Jupyter link to start to use sample [Mortgage ETL on GPU Jupyter Notebook](../demo/GCP/Mortgage-ETL-GPU.ipynb) to process full 17 years [Mortgage data](https://rapidsai.github.io/demos/datasets/mortgage-data). +To use notebooks with a Dataproc cluster, click on the cluster name under the Dataproc cluster tab +and navigate to the "Web Interfaces" tab. Under "Web Interfaces", click on the JupyterLab or +Jupyter link to start to use sample [Mortgage ETL on GPU Jupyter +Notebook](../demo/GCP/Mortgage-ETL-GPU.ipynb) to process full 17 years [Mortgage +data](https://rapidsai.github.io/demos/datasets/mortgage-data). ![Dataproc Web Interfaces](../img/GCP/dataproc-service.png) -The notebook will first transcode CSV files into Parquet files and then run an ETL query to prepare the dataset for training. In the sample notebook, we use 2016 data as the evaluation set and the rest as a training set, saving to respective GCS locations. Using the default notebook configuration the first stage should take ~110 seconds (1/3 of CPU execution time with same config) and the second stage takes ~170 seconds (1/7 of CPU execution time with same config). The notebook depends on the pre-compiled [Spark RAPIDS SQL plugin](https://mvnrepository.com/artifact/com.nvidia/rapids-4-spark) and [cuDF](https://mvnrepository.com/artifact/ai.rapids/cudf/0.15), which are pre-downloaded by the GCP Dataproc [RAPIDS init script](https://github.com/GoogleCloudDataproc/initialization-actions/tree/master/rapids). 
+The notebook will first transcode CSV files into Parquet files and then run an ETL query to prepare +the dataset for training. In the sample notebook, we use 2016 data as the evaluation set and the +rest as a training set, saving to respective GCS locations. Using the default notebook +configuration the first stage should take ~110 seconds (1/3 of CPU execution time with same config) +and the second stage takes ~170 seconds (1/7 of CPU execution time with same config). The notebook +depends on the pre-compiled [Spark RAPIDS SQL +plugin](https://mvnrepository.com/artifact/com.nvidia/rapids-4-spark) and +[cuDF](https://mvnrepository.com/artifact/ai.rapids/cudf/0.15), which are pre-downloaded by the GCP +Dataproc [RAPIDS init +script](https://github.com/GoogleCloudDataproc/initialization-actions/tree/master/rapids). + +Once data is prepared, we use the [Mortgage XGBoost4j Scala +Notebook](../demo/GCP/mortgage-xgboost4j-gpu-scala.zpln) in Dataproc's Zeppelin service to execute +the training job on the GPU. NVIDIA also ships [Spark +XGBoost4j](https://github.com/NVIDIA/spark-xgboost) which is based on [DMLC +xgboost](https://github.com/dmlc/xgboost). Precompiled +[XGBoost4j](https://repo1.maven.org/maven2/com/nvidia/xgboost4j_3.0/) and [XGBoost4j +Spark](https://repo1.maven.org/maven2/com/nvidia/xgboost4j-spark_3.0/1.0.0-0.1.0/) libraries can be +downloaded from maven. They are pre-downloaded by the GCP [RAPIDS init +action](https://github.com/GoogleCloudDataproc/initialization-actions/tree/master/rapids). Since +github cannot render a Zeppelin notebook, we prepared a [Jupyter Notebook with Scala +code](../demo/GCP/mortgage-xgboost4j-gpu-scala.ipynb) for you to view the code content. + +The training time should be around 480 seconds (1/10 of CPU execution time with same config). This +is shown under cell: -Once data is prepared, we use the [Mortgage XGBoost4j Scala Notebook](../demo/GCP/mortgage-xgboost4j-gpu-scala.zpln) in Dataproc's Zeppelin service to execute the training job on the GPU. NVIDIA also ships [Spark XGBoost4j](https://github.com/NVIDIA/spark-xgboost) which is based on [DMLC xgboost](https://github.com/dmlc/xgboost). Precompiled [XGBoost4j](https://repo1.maven.org/maven2/com/nvidia/xgboost4j_3.0/) and [XGBoost4j Spark](https://repo1.maven.org/maven2/com/nvidia/xgboost4j-spark_3.0/1.0.0-0.1.0/) libraries can be downloaded from maven. They are pre-downloaded by the GCP [RAPIDS init action](https://github.com/GoogleCloudDataproc/initialization-actions/tree/master/rapids). Since github cannot render a Zeppelin notebook, we prepared a [Jupyter Notebook with Scala code](../demo/GCP/mortgage-xgboost4j-gpu-scala.ipynb) for you to view the code content. - -The training time should be around 480 seconds (1/10 of CPU execution time with same config). This is shown under cell: ```scala // Start training println("\n------ Training ------") @@ -80,9 +124,17 @@ val (xgbClassificationModel, _) = benchmark("train") { ``` ## Submit Spark jobs to a Dataproc Cluster Accelerated by GPUs -Similar to spark-submit for on-prem clusters, Dataproc supports a Spark applicaton job to be submitted as a Dataproc job. The mortgage examples we use above are also available as a [spark application](https://github.com/NVIDIA/spark-xgboost-examples/tree/spark-3/examples/apps/scala). After [building the jar files](https://github.com/NVIDIA/spark-xgboost-examples/blob/spark-3/getting-started-guides/building-sample-apps/scala.md) they are available through maven `mvn package -Dcuda.classifier=cuda10-2`. 
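For reference, the end-to-end build might look like the following. This is a sketch based on the
repository layout linked above; the `cuda10-2` classifier comes from this guide and should match
the CUDA version installed on your cluster.

```bash
# Clone the spark-3 branch of the sample apps and build the Scala examples.
git clone -b spark-3 https://github.com/NVIDIA/spark-xgboost-examples.git
cd spark-xgboost-examples/examples/apps/scala
mvn package -Dcuda.classifier=cuda10-2
```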
- -Place the jar file `sample_xgboost_apps-0.2.2.jar` under the `gs://$GCS_BUCKET/scala/` folder by running `gsutil cp target/sample_xgboost_apps-0.2.2.jar gs://$GCS_BUCKET/scala/`. To do this you can either drag and drop files from your local machine into the GCP storage browser, or use the gsutil cp as shown before to do this from a command line. We can thereby submit the jar by: +Similar to spark-submit for on-prem clusters, Dataproc supports a Spark applicaton job to be +submitted as a Dataproc job. The mortgage examples we use above are also available as a [spark +application](https://github.com/NVIDIA/spark-xgboost-examples/tree/spark-3/examples/apps/scala). +After [building the jar +files](https://github.com/NVIDIA/spark-xgboost-examples/blob/spark-3/getting-started-guides/building-sample-apps/scala.md) +they are available through maven `mvn package -Dcuda.classifier=cuda10-2`. + +Place the jar file `sample_xgboost_apps-0.2.2.jar` under the `gs://$GCS_BUCKET/scala/` folder by +running `gsutil cp target/sample_xgboost_apps-0.2.2.jar gs://$GCS_BUCKET/scala/`. To do this you +can either drag and drop files from your local machine into the GCP storage browser, or use the +gsutil cp as shown before to do this from a command line. We can thereby submit the jar by: ```bash export GCS_BUCKET= @@ -111,6 +163,9 @@ gcloud dataproc jobs submit spark \ ``` ## Dataproc Hub in AI Platform Notebook to Dataproc cluster -With the integration between AI Platform Notebooks and Dataproc, users can create a [Dataproc Hub notebook](https://cloud.google.com/blog/products/data-analytics/administering-jupyter-notebooks-for-spark-workloads-on-dataproc). The AI platform will connect to a Dataproc cluster through a yaml configuration. +With the integration between AI Platform Notebooks and Dataproc, users can create a [Dataproc Hub +notebook](https://cloud.google.com/blog/products/data-analytics/administering-jupyter-notebooks-for-spark-workloads-on-dataproc). +The AI platform will connect to a Dataproc cluster through a yaml configuration. -In the future, users will be able to provision a Dataproc cluster through DataprocHub notebook. You can use example [pyspark notebooks](../demo/GCP/Mortgage-ETL-GPU.ipynb) to experiment. +In the future, users will be able to provision a Dataproc cluster through DataprocHub notebook. You +can use example [pyspark notebooks](../demo/GCP/Mortgage-ETL-GPU.ipynb) to experiment. diff --git a/docs/supported_ops.md b/docs/supported_ops.md index 28ee94e9720..e0a4bf73758 100644 --- a/docs/supported_ops.md +++ b/docs/supported_ops.md @@ -67,6 +67,7 @@ the reasons why this particular operator or expression is on the CPU or GPU. |UDT|User defined types and java Objects. These are not standard SQL types.| ## Support + |Value|Description| |---------|----------------| |S| (Supported) Both Apache Spark and the RAPIDS Accelerator support this type.| @@ -727,7 +728,7 @@ Accelerator supports are described below. NS -* as was stated previously Decimal is only supported up to a precision of +* As was stated previously Decimal is only supported up to a precision of 18 and Timestamp is only supported in the UTC time zone. Decimals are off by default due to performance impact in some cases. 
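Because timestamps are only supported in the UTC time zone, it can help to pin the time zone
explicitly when starting a session so that timestamp expressions stay eligible for the GPU. A
minimal sketch using standard Apache Spark settings (the launch command and values are
illustrative):

```bash
# Pin both the SQL session and the JVMs to UTC so timestamp work can stay on the GPU.
spark-shell \
  --conf spark.sql.session.timeZone=UTC \
  --conf spark.driver.extraJavaOptions=-Duser.timezone=UTC \
  --conf spark.executor.extraJavaOptions=-Duser.timezone=UTC
```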
diff --git a/sql-plugin/src/main/scala/com/nvidia/spark/rapids/TypeChecks.scala b/sql-plugin/src/main/scala/com/nvidia/spark/rapids/TypeChecks.scala index 4fe4cd5a486..714a16779b4 100644 --- a/sql-plugin/src/main/scala/com/nvidia/spark/rapids/TypeChecks.scala +++ b/sql-plugin/src/main/scala/com/nvidia/spark/rapids/TypeChecks.scala @@ -1073,6 +1073,7 @@ object SupportedOpsDocs { println("|UDT|User defined types and java Objects. These are not standard SQL types.|") println() println("## Support") + println() println("|Value|Description|") println("|---------|----------------|") println("|S| (Supported) Both Apache Spark and the RAPIDS Accelerator support this type.|") @@ -1117,7 +1118,7 @@ object SupportedOpsDocs { } } println("") - println("* as was stated previously Decimal is only supported up to a precision of") + println("* As was stated previously Decimal is only supported up to a precision of") println(s"${DType.DECIMAL64_MAX_PRECISION} and Timestamp is only supported in the") println("UTC time zone. Decimals are off by default due to performance impact in") println("some cases.") @@ -1394,4 +1395,4 @@ object SupportedOpsDocs { } } } -} \ No newline at end of file +} From 3d545b616ae3dc846dbfffb6bd48b7eadcf8f22b Mon Sep 17 00:00:00 2001 From: Sameer Raheja Date: Thu, 21 Jan 2021 16:43:31 -0800 Subject: [PATCH 2/3] Further file formatting changes Signed-off-by: Sameer Raheja --- docs/FAQ.md | 10 +- docs/compatibility.md | 224 ++++---- docs/dev/README.md | 4 +- docs/examples.md | 12 +- docs/get-started/getting-started-aws-emr.md | 555 ++++++++++---------- docs/get-started/getting-started-on-prem.md | 41 +- 6 files changed, 450 insertions(+), 396 deletions(-) diff --git a/docs/FAQ.md b/docs/FAQ.md index 3f434384ffc..474b3ee7b50 100644 --- a/docs/FAQ.md +++ b/docs/FAQ.md @@ -11,11 +11,11 @@ Apache Spark caches what is used to build the output of the `explain()` function knowledge about configs, so it may return results that are not up to date with the current config settings. This is true of all configs in Spark. If you changed `spark.sql.autoBroadcastJoinThreshold` after running `explain()` on a `DataFrame`, the resulting -query would not change to reflect that config and still show a `SortMergeJoin` even though the -new config might have changed to be a `BroadcastHashJoin` instead. When actually running something -like with `collect`, `show` or `write` a new `DataFrame` is constructed causing spark to re-plan the -query. This is why `spark.rapids.sql.enabled` is still respected when running, even if explain -shows stale results. +query would not change to reflect that config and still show a `SortMergeJoin` even though the new +config might have changed to be a `BroadcastHashJoin` instead. When actually running something like +with `collect`, `show` or `write` a new `DataFrame` is constructed causing spark to re-plan the +query. This is why `spark.rapids.sql.enabled` is still respected when running, even if explain shows +stale results. ### What versions of Apache Spark does the RAPIDS Accelerator for Apache Spark support? diff --git a/docs/compatibility.md b/docs/compatibility.md index 460cfbd8576..ddc0c6507ac 100644 --- a/docs/compatibility.md +++ b/docs/compatibility.md @@ -6,15 +6,14 @@ nav_order: 5 # RAPIDS Accelerator for Apache Spark Compatibility with Apache Spark -The SQL plugin tries to produce results that are bit for bit identical with Apache Spark. -There are a number of cases where there are some differences. 
In most cases operators -that produce different results are off by default, and you can look at the -[configs](configs.md) for more information on how to enable them. In some cases -we felt that enabling the incompatibility by default was worth the performance gain. All -of those operators can be disabled through configs if it becomes a problem. Please also look -at the current list of -[bugs](https://github.com/NVIDIA/spark-rapids/issues?q=is%3Aopen+is%3Aissue+label%3Abug) -which are typically incompatibilities that we have not yet addressed. +The SQL plugin tries to produce results that are bit for bit identical with Apache Spark. There are +a number of cases where there are some differences. In most cases operators that produce different +results are off by default, and you can look at the [configs](configs.md) for more information on +how to enable them. In some cases we felt that enabling the incompatibility by default was worth +the performance gain. All of those operators can be disabled through configs if it becomes a +problem. Please also look at the current list of +[bugs](https://github.com/NVIDIA/spark-rapids/issues?q=is%3Aopen+is%3Aissue+label%3Abug) which are +typically incompatibilities that we have not yet addressed. ## Ordering of Output @@ -31,62 +30,61 @@ not true for sorting. For all versions of the plugin `-0.0` == `0.0` for sorting ## Floating Point For most basic floating point operations like addition, subtraction, multiplication, and division -the plugin will produce a bit for bit identical result as Spark does. For other functions like `sin`, -`cos`, etc. the output may be different, but within the rounding error inherent in floating point -calculations. The ordering of operations to calculate the value may differ between -the underlying JVM implementation used by the CPU and the C++ standard library implementation -used by the GPU. - -For aggregations the underlying implementation is doing the aggregations in parallel and due to -race conditions within the computation itself the result may not be the same each time the query is -run. This is inherent in how the plugin speeds up the calculations and cannot be "fixed." If -a query joins on a floating point value, which is not wise to do anyways, -and the value is the result of a floating point aggregation then the join may fail to work -properly with the plugin but would have worked with plain Spark. Because of this most -floating point aggregations are off by default but can be enabled with the config +the plugin will produce a bit for bit identical result as Spark does. For other functions like +`sin`, `cos`, etc. the output may be different, but within the rounding error inherent in floating +point calculations. The ordering of operations to calculate the value may differ between the +underlying JVM implementation used by the CPU and the C++ standard library implementation used by +the GPU. + +For aggregations the underlying implementation is doing the aggregations in parallel and due to race +conditions within the computation itself the result may not be the same each time the query is +run. This is inherent in how the plugin speeds up the calculations and cannot be "fixed." If a query +joins on a floating point value, which is not wise to do anyways, and the value is the result of a +floating point aggregation then the join may fail to work properly with the plugin but would have +worked with plain Spark. 
Because of this most floating point aggregations are off by default but can +be enabled with the config [`spark.rapids.sql.variableFloatAgg.enabled`](configs.md#sql.variableFloatAgg.enabled). Additionally, some aggregations on floating point columns that contain `NaN` can produce results -different from Spark in versions prior to Spark 3.1.0. -If it is known with certainty that the floating point columns do not contain `NaN`, -set [`spark.rapids.sql.hasNans`](configs.md#sql.hasNans) to `false` to run GPU enabled -aggregations on them. - -In the case of a distinct count on `NaN` values, prior to Spark 3.1.0, the issue - only shows up if you have different -`NaN` values. There are several different binary values that are all considered to be `NaN` by -floating point. The plugin treats all of these as the same value, where as Spark treats them -all as different values. Because this is considered to be rare we do not disable distinct count -for floating point values even if [`spark.rapids.sql.hasNans`](configs.md#sql.hasNans) is `true`. +different from Spark in versions prior to Spark 3.1.0. If it is known with certainty that the +floating point columns do not contain `NaN`, set +[`spark.rapids.sql.hasNans`](configs.md#sql.hasNans) to `false` to run GPU enabled aggregations on +them. + +In the case of a distinct count on `NaN` values, prior to Spark 3.1.0, the issue only shows up if + you have different `NaN` values. There are several different binary values that are all considered + to be `NaN` by floating point. The plugin treats all of these as the same value, where as Spark + treats them all as different values. Because this is considered to be rare we do not disable + distinct count for floating point values even if + [`spark.rapids.sql.hasNans`](configs.md#sql.hasNans) is `true`. ### `0.0` vs `-0.0` -Floating point allows zero to be encoded as `0.0` and `-0.0`, but the IEEE standard says that -they should be interpreted as the same. Most databases normalize these values to always -be `0.0`. Spark does this in some cases but not all as is documented -[here](https://issues.apache.org/jira/browse/SPARK-32110). The underlying implementation of -this plugin treats them as the same for essentially all processing. This can result in some -differences with Spark for operations, prior to Spark 3.1.0, like sorting, and distinct count. -There are still differences with -[joins, and comparisons](https://github.com/NVIDIA/spark-rapids/issues/294) even after Spark -3.1.0. +Floating point allows zero to be encoded as `0.0` and `-0.0`, but the IEEE standard says that they +should be interpreted as the same. Most databases normalize these values to always be `0.0`. Spark +does this in some cases but not all as is documented +[here](https://issues.apache.org/jira/browse/SPARK-32110). The underlying implementation of this +plugin treats them as the same for essentially all processing. This can result in some differences +with Spark for operations, prior to Spark 3.1.0, like sorting, and distinct count. There are still +differences with [joins, and comparisons](https://github.com/NVIDIA/spark-rapids/issues/294) even +after Spark 3.1.0. -We do not disable operations that produce different results due to `-0.0` in the data because -it is considered to be a rare occurrence. +We do not disable operations that produce different results due to `-0.0` in the data because it is +considered to be a rare occurrence. ## Unicode Spark delegates Unicode operations to the underlying JVM. 
Each version of Java complies with a specific version of the Unicode standard. The SQL plugin does not use the JVM for Unicode support -and is compatible with Unicode version 12.1. Because of this there may be corner cases where -Spark will produce a different result compared to the plugin. +and is compatible with Unicode version 12.1. Because of this there may be corner cases where Spark +will produce a different result compared to the plugin. ## CSV Reading Spark is very strict when reading CSV and if the data does not conform with the expected format exactly it will result in a `null` value. The underlying parser that the SQL plugin uses is much -more lenient. If you have badly formatted CSV data you may get data back instead of nulls. -If this is a problem you can disable the CSV reader by setting the config +more lenient. If you have badly formatted CSV data you may get data back instead of nulls. If this +is a problem you can disable the CSV reader by setting the config [`spark.rapids.sql.format.csv.read.enabled`](configs.md#sql.format.csv.read.enabled) to `false`. Because the speed up is so large and the issues typically only show up in error conditions we felt it was worth having the CSV reader enabled by default. @@ -94,10 +92,10 @@ it was worth having the CSV reader enabled by default. There are also discrepancies/issues with specific types that are detailed below. ### CSV Strings -Writing strings to a CSV file in general for Spark can be problematic unless you can ensure -that your data does not have any line deliminators in it. The GPU accelerated CSV parser -handles quoted line deliminators similar to `multiLine` mode. But there are still a number -of issues surrounding it and they should be avoided. +Writing strings to a CSV file in general for Spark can be problematic unless you can ensure that +your data does not have any line deliminators in it. The GPU accelerated CSV parser handles quoted +line deliminators similar to `multiLine` mode. But there are still a number of issues surrounding +it and they should be avoided. Escaped quote characters `'\"'` are not supported well as described by this [issue](https://github.com/NVIDIA/spark-rapids/issues/129). @@ -117,20 +115,20 @@ Only a limited set of formats are supported when parsing dates. * `"MM-dd-yyyy"` * `"MM/dd/yyyy"` -The reality is that all of these formats are supported at the same time. The plugin -will only disable itself if you set a format that it does not support. +The reality is that all of these formats are supported at the same time. The plugin will only +disable itself if you set a format that it does not support. As a work around you can parse the column as a timestamp and then cast it to a date. ### CSV Timestamps -The CSV parser does not support time zones. It will ignore any trailing time zone -information, despite the format asking for a `XXX` or `[XXX]`. As such it is off by -default and you can enable it by setting -[`spark.rapids.sql.csvTimestamps.enabled`](configs.md#sql.csvTimestamps.enabled) to `true`. +The CSV parser does not support time zones. It will ignore any trailing time zone information, +despite the format asking for a `XXX` or `[XXX]`. As such it is off by default and you can enable it +by setting [`spark.rapids.sql.csvTimestamps.enabled`](configs.md#sql.csvTimestamps.enabled) to +`true`. -The formats supported for timestamps are limited similar to dates. The first part of -the format must be a supported date format. 
The second part must start with a `'T'` -to separate the time portion followed by one of the following formats: +The formats supported for timestamps are limited similar to dates. The first part of the format +must be a supported date format. The second part must start with a `'T'` to separate the time +portion followed by one of the following formats: * `HH:mm:ss.SSSXXX` * `HH:mm:ss[.SSS][XXX]` @@ -140,17 +138,17 @@ to separate the time portion followed by one of the following formats: * `HH:mm:ss.SSS` * `HH:mm:ss[.SSS]` -Just like with dates all timestamp formats are actually supported at the same time. -The plugin will disable itself if it sees a format it cannot support. +Just like with dates all timestamp formats are actually supported at the same time. The plugin will +disable itself if it sees a format it cannot support. ### CSV Floating Point -The CSV parser is not able to parse `Infinity`, `-Infinity`, or `NaN` values. All of -these are likely to be turned into null values, as described in this +The CSV parser is not able to parse `Infinity`, `-Infinity`, or `NaN` values. All of these are +likely to be turned into null values, as described in this [issue](https://github.com/NVIDIA/spark-rapids/issues/125). -Some floating-point values also appear to overflow but do not for the CPU as described -in this [issue](https://github.com/NVIDIA/spark-rapids/issues/124). +Some floating-point values also appear to overflow but do not for the CPU as described in this +[issue](https://github.com/NVIDIA/spark-rapids/issues/124). Any number that overflows will not be turned into a null value. @@ -164,21 +162,21 @@ The ORC format has fairly complete support for both reads and writes. There are issues. The first is for reading timestamps and dates around the transition between Julian and Gregorian calendars as described [here](https://github.com/NVIDIA/spark-rapids/issues/131). A similar issue exists for writing dates as described -[here](https://github.com/NVIDIA/spark-rapids/issues/139). Writing timestamps, however only -appears to work for dates after the epoch as described -[here](https://github.com/NVIDIA/spark-rapids/issues/140). +[here](https://github.com/NVIDIA/spark-rapids/issues/139). Writing timestamps, however only appears +to work for dates after the epoch as described +[here](https://github.com/NVIDIA/spark-rapids/issues/140). The plugin supports reading `uncompressed`, `snappy` and `zlib` ORC files and writing `uncompressed` - and `snappy` ORC files. At this point, the plugin does not have the ability to fall back to the - CPU when reading an unsupported compression format, and will error out in that case. + and `snappy` ORC files. At this point, the plugin does not have the ability to fall back to the + CPU when reading an unsupported compression format, and will error out in that case. ## Parquet The Parquet format has more configs because there are multiple versions with some compatibility -issues between them. Dates and timestamps are where the known issues exist. -For reads when `spark.sql.legacy.parquet.datetimeRebaseModeInWrite` is set to `CORRECTED` -[timestamps](https://github.com/NVIDIA/spark-rapids/issues/132) before the transition -between the Julian and Gregorian calendars are wrong, but dates are fine. When +issues between them. Dates and timestamps are where the known issues exist. 
For reads when +`spark.sql.legacy.parquet.datetimeRebaseModeInWrite` is set to `CORRECTED` +[timestamps](https://github.com/NVIDIA/spark-rapids/issues/132) before the transition between the +Julian and Gregorian calendars are wrong, but dates are fine. When `spark.sql.legacy.parquet.datetimeRebaseModeInWrite` is set to `LEGACY`, however both dates and timestamps are read incorrectly before the Gregorian calendar transition as described [here](https://github.com/NVIDIA/spark-rapids/issues/133). @@ -186,30 +184,31 @@ timestamps are read incorrectly before the Gregorian calendar transition as desc When writing `spark.sql.legacy.parquet.datetimeRebaseModeInWrite` is currently ignored as described [here](https://github.com/NVIDIA/spark-rapids/issues/144). -The plugin supports reading `uncompressed`, `snappy` and `gzip` Parquet files and writing -`uncompressed` and `snappy` Parquet files. At this point, the plugin does not have the ability to -fall back to the CPU when reading an unsupported compression format, and will error out -in that case. +The plugin supports reading `uncompressed`, `snappy` and `gzip` Parquet files and writing +`uncompressed` and `snappy` Parquet files. At this point, the plugin does not have the ability to +fall back to the CPU when reading an unsupported compression format, and will error out in that +case. ## Regular Expressions -The RAPIDS Accelerator for Apache Spark currently supports string literal matches, not wildcard -matches. +The RAPIDS Accelerator for Apache Spark currently supports string literal matches, not wildcard +matches. -If a null char '\0' is in a string that is being matched by a regular expression, `LIKE` sees it as -the end of the string. This will be fixed in a future release. The issue is [here](https://github.com/NVIDIA/spark-rapids/issues/119). +If a null char '\0' is in a string that is being matched by a regular expression, `LIKE` sees it as +the end of the string. This will be fixed in a future release. The issue is +[here](https://github.com/NVIDIA/spark-rapids/issues/119). ## Timestamps -Spark stores timestamps internally relative to the JVM time zone. Converting an -arbitrary timestamp between time zones is not currently supported on the GPU. Therefore operations -involving timestamps will only be GPU-accelerated if the time zone used by the JVM is UTC. +Spark stores timestamps internally relative to the JVM time zone. Converting an arbitrary timestamp +between time zones is not currently supported on the GPU. Therefore operations involving timestamps +will only be GPU-accelerated if the time zone used by the JVM is UTC. ## Window Functions Because of ordering differences between the CPU and the GPU window functions especially row based window functions like `row_number`, `lead`, and `lag` can produce different results if the ordering -includes both `-0.0` and `0.0`, or if the ordering is ambiguous. Spark can produce -different results from one run to another if the ordering is ambiguous on a window function too. +includes both `-0.0` and `0.0`, or if the ordering is ambiguous. Spark can produce different results +from one run to another if the ordering is ambiguous on a window function too. ## Parsing strings as dates or timestamps @@ -232,30 +231,34 @@ specific issues with other formats are: some cases To enable all formats on GPU, set -[`spark.rapids.sql.incompatibleDateFormats.enabled`](configs.md#sql.incompatibleDateFormats.enabled) to `true`. 
+[`spark.rapids.sql.incompatibleDateFormats.enabled`](configs.md#sql.incompatibleDateFormats.enabled) +to `true`. ## Casting between types -In general, performing `cast` and `ansi_cast` operations on the GPU is compatible with the same operations on the CPU. However, there are some exceptions. For this reason, certain casts are disabled on the GPU by default and require configuration options to be specified to enable them. +In general, performing `cast` and `ansi_cast` operations on the GPU is compatible with the same +operations on the CPU. However, there are some exceptions. For this reason, certain casts are +disabled on the GPU by default and require configuration options to be specified to enable them. ### Float to Decimal -The GPU will use a different strategy from Java's BigDecimal to handle/store decimal values, which leads to restrictions: +The GPU will use a different strategy from Java's BigDecimal to handle/store decimal values, which +leads to restrictions: * It is only available when `ansiMode` is on. * Float values cannot be larger than `1e18` or smaller than `-1e18` after conversion. * The results produced by GPU slightly differ from the default results of Spark. To enable this operation on the GPU, set -[`spark.rapids.sql.castFloatToDecimal.enabled`](configs.md#sql.castFloatToDecimal.enabled) to `true` and set `spark.sql.ansi.enabled` to `true`. +[`spark.rapids.sql.castFloatToDecimal.enabled`](configs.md#sql.castFloatToDecimal.enabled) to `true` +and set `spark.sql.ansi.enabled` to `true`. ### Float to Integral Types -With both `cast` and `ansi_cast`, Spark uses the expression -`Math.floor(x) <= MAX && Math.ceil(x) >= MIN` to determine whether a floating-point value can be -converted to an integral type. Prior to Spark 3.1.0 the MIN and MAX values were floating-point -values such as `Int.MaxValue.toFloat` but starting with 3.1.0 these are now integral types such as -`Int.MaxValue` so this has slightly affected the valid range of values and now differs slightly -from the behavior on GPU in some cases. +With both `cast` and `ansi_cast`, Spark uses the expression `Math.floor(x) <= MAX && Math.ceil(x) >= +MIN` to determine whether a floating-point value can be converted to an integral type. Prior to +Spark 3.1.0 the MIN and MAX values were floating-point values such as `Int.MaxValue.toFloat` but +starting with 3.1.0 these are now integral types such as `Int.MaxValue` so this has slightly +affected the valid range of values and now differs slightly from the behavior on GPU in some cases. To enable this operation on the GPU when using Spark 3.1.0 or later, set [`spark.rapids.sql.castFloatToIntegralTypes.enabled`](configs.md#sql.castFloatToIntegralTypes.enabled) @@ -265,26 +268,31 @@ This configuration setting is ignored when using Spark versions prior to 3.1.0. ### Float to String -The GPU will use different precision than Java's toString method when converting floating-point data types to strings and this can produce results that differ from the default behavior in Spark. +The GPU will use different precision than Java's toString method when converting floating-point data +types to strings and this can produce results that differ from the default behavior in Spark. To enable this operation on the GPU, set [`spark.rapids.sql.castFloatToString.enabled`](configs.md#sql.castFloatToString.enabled) to `true`. ### String to Float -Casting from string to floating-point types on the GPU returns incorrect results when the string represents any number in the following ranges. 
In both cases the GPU returns `Double.MaxValue`. The default behavior in Apache Spark is to return `+Infinity` and `-Infinity`, respectively. +Casting from string to floating-point types on the GPU returns incorrect results when the string +represents any number in the following ranges. In both cases the GPU returns `Double.MaxValue`. The +default behavior in Apache Spark is to return `+Infinity` and `-Infinity`, respectively. - `1.7976931348623158E308 <= x < 1.7976931348623159E308` - `-1.7976931348623159E308 < x <= -1.7976931348623158E308` Also, the GPU does not support casting from strings containing hex values. -To enable this operation on the GPU, set +To enable this operation on the GPU, set [`spark.rapids.sql.castStringToFloat.enabled`](configs.md#sql.castStringToFloat.enabled) to `true`. ### String to Integral Types -The GPU will return incorrect results for strings representing values greater than Long.MaxValue or less than Long.MinValue. The correct behavior would be to return null for these values, but the GPU currently overflows and returns an incorrect integer value. +The GPU will return incorrect results for strings representing values greater than Long.MaxValue or +less than Long.MinValue. The correct behavior would be to return null for these values, but the GPU +currently overflows and returns an incorrect integer value. To enable this operation on the GPU, set [`spark.rapids.sql.castStringToInteger.enabled`](configs.md#sql.castStringToInteger.enabled) to `true`. @@ -337,12 +345,16 @@ Casting from string to timestamp currently has the following limitations. Only timezone 'Z' (UTC) is supported. Casting unsupported formats will result in null values. ## UDF to Catalyst Expressions -To speedup the process of UDF, spark-rapids introduces a udf-compiler extension to translate UDFs to Catalyst expressions. + +To speedup the process of UDF, spark-rapids introduces a udf-compiler extension to translate UDFs to +Catalyst expressions. To enable this operation on the GPU, set [`spark.rapids.sql.udfCompiler.enabled`](configs.md#sql.udfCompiler.enabled) to `true`. -However, Spark may produce different results for a compiled udf and the non-compiled. For example: a udf of `x/y` where `y` happens to be `0`, the compiled catalyst expressions will return `NULL` while the original udf would fail the entire job with a `java.lang.ArithmeticException: / by zero` +However, Spark may produce different results for a compiled udf and the non-compiled. For example: a +udf of `x/y` where `y` happens to be `0`, the compiled catalyst expressions will return `NULL` while +the original udf would fail the entire job with a `java.lang.ArithmeticException: / by zero` When translating UDFs to Catalyst expressions, the supported UDF functions are limited: @@ -436,5 +448,5 @@ When translating UDFs to Catalyst expressions, the supported UDF functions are l | | x.toArray | | | lhs += rhs | | | lhs :+ rhs | -| Method call | Only if the method being called
  1. consists of operations supported by the UDF compiler, and
  2. is one of the following:<br/>
    • a final method, or
    • a method in a final class, or
    • a method in a final object
| +| Method call | Only if the method being called 1. Consists of operations supported by the UDF compiler, and 2. is one of the folllowing: a final method, a method in a final class, or a method in a final object | diff --git a/docs/dev/README.md b/docs/dev/README.md index de774d59238..894a55220fb 100644 --- a/docs/dev/README.md +++ b/docs/dev/README.md @@ -201,8 +201,8 @@ i.e.: nodes that are transitioning the data to or from the GPU. Most nodes expect their input to already be on the GPU and produce output on the GPU, so those nodes do not need to worry about using `GpuSemaphore`. The general rules for using the semaphore are: -* If the plan node has inputs not on the GPU but produces outputs on the GPU then the node must acquire the semaphore by calling -`GpuSemaphore.acquireIfNecessary`. +* If the plan node has inputs not on the GPU but produces outputs on the GPU then the node must +acquire the semaphore by calling `GpuSemaphore.acquireIfNecessary`. * If the plan node has inputs on the GPU but produces outputs not on the GPU then the node must release the semaphore by calling `GpuSemaphore.releaseIfNecessary`. diff --git a/docs/examples.md b/docs/examples.md index c11022e78c5..91e1c483e05 100644 --- a/docs/examples.md +++ b/docs/examples.md @@ -5,11 +5,17 @@ nav_order: 11 --- # Demos -Example notebooks allow users to test drive "RAPIDS Accelerator for Apache Spark" with public datasets. +Example notebooks allow users to test drive "RAPIDS Accelerator for Apache Spark" with public +datasets. ##### [Mortgage ETL Notebook](demo/gpu-mortgage_accelerated.ipynb) [(Dataset)](https://docs.rapids.ai/datasets/mortgage-data) ##### About the Mortgage Dataset: -Dataset is derived from [Fannie Mae’s Single-Family Loan Performance Data](http://www.fanniemae.com/portal/funding-the-market/data/loan-performance-data.html) with all rights reserved by Fannie Mae. This processed dataset is redistributed with permission and consent from Fannie Mae. +Dataset is derived from [Fannie Mae’s Single-Family Loan Performance +Data](http://www.fanniemae.com/portal/funding-the-market/data/loan-performance-data.html) with all +rights reserved by Fannie Mae. This processed dataset is redistributed with permission and consent +from Fannie Mae. -For the full raw dataset visit [Fannie Mae](http://www.fanniemae.com/portal/funding-the-market/data/loan-performance-data.html) to register for an account and to download. +For the full raw dataset visit [Fannie +Mae](http://www.fanniemae.com/portal/funding-the-market/data/loan-performance-data.html) to register +for an account and to download. diff --git a/docs/get-started/getting-started-aws-emr.md b/docs/get-started/getting-started-aws-emr.md index cdd96a16476..bbb0bcf240f 100644 --- a/docs/get-started/getting-started-aws-emr.md +++ b/docs/get-started/getting-started-aws-emr.md @@ -1,263 +1,292 @@ ---- -layout: page -title: AWS-EMR -nav_order: 2 -parent: Getting-Started ---- -# Get Started with RAPIDS on AWS EMR - -This is a getting started guide for the RAPIDS Accelerator for Apache Spark on AWS EMR. At the end -of this guide, the user will be able to run a sample Apache Spark application that runs on NVIDIA -GPUs on AWS EMR. - -The current EMR 6.2.0 release supports Spark version 3.0.1 and RAPIDS Accelerator version 0.2.0. For -more details of supported applications, please see the [EMR release -notes](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-release-6x.html). 
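To confirm which plugin and cuDF builds a particular cluster actually ships, one option is to list
the bundled jars on a running node. A small sketch; the jar directory follows the `/usr/lib/spark`
paths used in the Spark settings later in this guide, and exact file names vary by EMR release:

```bash
# Run on the EMR master node to see the bundled RAPIDS Accelerator and cuDF jars.
ls /usr/lib/spark/jars/ | grep -iE 'rapids|cudf'
```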
- -For more information on AWS EMR, please see the [AWS -documentation](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-what-is-emr.html). - -## Configure and Launch AWS EMR with GPU Nodes - -The following steps are based on the AWS EMR document ["Using the Nvidia Spark-RAPIDS Accelerator -for Spark"](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-rapids.html) - -### Launch an EMR Cluster using AWS CLI - -You can use the AWS CLI to launch a cluster with one Master node (m5.xlarge) and two -g4dn.2xlarge nodes: - -``` -aws emr create-cluster \ ---release-label emr-6.2.0 \ ---applications Name=Hadoop Name=Spark Name=Livy Name=JupyterEnterpriseGateway \ ---service-role EMR_DefaultRole \ ---ec2-attributes KeyName=my-key-pair,InstanceProfile=EMR_EC2_DefaultRole \ ---instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.4xlarge \ - InstanceGroupType=CORE,InstanceCount=1,InstanceType=g4dn.2xlarge \ - InstanceGroupType=TASK,InstanceCount=1,InstanceType=g4dn.2xlarge \ ---configurations file:///my-configurations.json \ ---bootstrap-actions Name='My Spark Rapids Bootstrap action',Path=s3://my-bucket/my-bootstrap-action.sh -``` - -Please fill with actual value for `KeyName` and file paths. You can further customize SubnetId, -EmrManagedSlaveSecurityGroup, EmrManagedMasterSecurityGroup, name and region etc. - -The `my-configurations.json` installs the spark-rapids plugin on your cluster, configures YARN to use - -GPUs, configures Spark to use RAPIDS, and configures the YARN capacity scheduler. An example JSON - -configuration can be found in the section on launching in the GUI below. - -The `my-boostrap-action.sh` script referenced in the above script opens cgroup permissions to YARN -on your cluster. This is required for YARN to use GPUs. An example script is as follows: -```bash -#!/bin/bash - -set -ex - -sudo chmod a+rwx -R /sys/fs/cgroup/cpu,cpuacct -sudo chmod a+rwx -R /sys/fs/cgroup/devices -``` - -### Launch an EMR Cluster using AWS Console (GUI) - -Go to the AWS Management Console and select the `EMR` service from the "Analytics" section. Choose -the region you want to launch your cluster in, e.g. US West (Oregon), using the dropdown menu in the -top right corner. Click `Create cluster` and select `Go to advanced options`, which will bring up a -detailed cluster configuration page. - -#### Step 1: Software Configuration and Steps - -Select **emr-6.2.0** for the release, uncheck all the software options, and then check **Hadoop -3.2.1**, **Spark 3.0.1**, **Livy 0.7.0** and **JupyterEnterpriseGateway 2.1.0**. - -In the "Edit software settings" field, copy and paste the configuration from the [EMR -document](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-rapids.html). You can also -create a JSON file on you own S3 bucket. 
- -For clusters with 2x g4dn.2xlarge GPU instances as worker nodes, we recommend the following -default settings: -```json -[ - { - "Classification":"spark", - "Properties":{ - "enableSparkRapids":"true" - } - }, - { - "Classification":"yarn-site", - "Properties":{ - "yarn.nodemanager.resource-plugins":"yarn.io/gpu", - "yarn.resource-types":"yarn.io/gpu", - "yarn.nodemanager.resource-plugins.gpu.allowed-gpu-devices":"auto", - "yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables":"/usr/bin", - "yarn.nodemanager.linux-container-executor.cgroups.mount":"true", - "yarn.nodemanager.linux-container-executor.cgroups.mount-path":"/sys/fs/cgroup", - "yarn.nodemanager.linux-container-executor.cgroups.hierarchy":"yarn", - "yarn.nodemanager.container-executor.class":"org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor" - } - }, - { - "Classification":"container-executor", - "Properties":{ - - }, - "Configurations":[ - { - "Classification":"gpu", - "Properties":{ - "module.enabled":"true" - } - }, - { - "Classification":"cgroups", - "Properties":{ - "root":"/sys/fs/cgroup", - "yarn-hierarchy":"yarn" - } - } - ] - }, - { - "Classification":"spark-defaults", - "Properties":{ - "spark.plugins":"com.nvidia.spark.SQLPlugin", - "spark.sql.sources.useV1SourceList":"", - "spark.executor.resource.gpu.discoveryScript":"/usr/lib/spark/scripts/gpu/getGpusResources.sh", - "spark.submit.pyFiles":"/usr/lib/spark/jars/xgboost4j-spark_3.0-1.0.0-0.2.0.jar", - "spark.executor.extraLibraryPath":"/usr/local/cuda/targets/x86_64-linux/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/compat/lib:/usr/local/cuda/lib:/usr/local/cuda/lib64:/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/docker/usr/lib/hadoop/lib/native:/docker/usr/lib/hadoop-lzo/lib/native", - "spark.rapids.sql.concurrentGpuTasks":"2", - "spark.executor.resource.gpu.amount":"1", - "spark.executor.cores":"8", - "spark.task.cpus ":"1", - "spark.task.resource.gpu.amount":"0.125", - "spark.rapids.memory.pinnedPool.size":"2G", - "spark.executor.memoryOverhead":"2G", - "spark.locality.wait":"0s", - "spark.sql.shuffle.partitions":"200", - "spark.sql.files.maxPartitionBytes":"256m", - "spark.sql.adaptive.enabled":"false" - } - }, - { - "Classification":"capacity-scheduler", - "Properties":{ - "yarn.scheduler.capacity.resource-calculator":"org.apache.hadoop.yarn.util.resource.DominantResourceCalculator" - } - } -] - -``` -Adjust the settings as appropriate for your cluster. For example, setting the appropriate -number of cores based on the node type. The `spark.task.resource.gpu.amount` should be set to -1/(number of cores per executor) which will allow multiple tasks to run in parallel on the GPU. - -For example, for clusters with 2x g4dn.12xlarge as core nodes, use the following: - -```json - "spark.executor.cores":"12", - "spark.task.resource.gpu.amount":"0.0833", -``` - -More configuration details can be found in the [configuration](../configs.md) documentation. - -![Step 1: Step 1: Software, Configuration and Steps](../img/AWS-EMR/RAPIDS_EMR_GUI_1.png) - -#### Step 2: Hardware - -Select the desired VPC and availability zone in the "Network" and "EC2 Subnet" fields respectively. (Default network and subnet are ok) - -In the "Core" node row, change the "Instance type" to **g4dn.xlarge**, **g4dn.2xlarge**, or **p3.2xlarge** and ensure "Instance count" is set to **1** or any higher number. Keep the default "Master" node instance type of **m5.xlarge**. 
- -![Step 2: Hardware](../img/AWS-EMR/RAPIDS_EMR_GUI_2.png) - -#### Step 3: General Cluster Settings - -Enter a custom "Cluster name" and make a note of the s3 folder that cluster logs will be written to. - -Add a custom "Bootstrap Actions" to allow cgroup permissions to YARN on your cluster. An example -bootstrap script is as follows: -```bash -#!/bin/bash - -set -ex - -sudo chmod a+rwx -R /sys/fs/cgroup/cpu,cpuacct -sudo chmod a+rwx -R /sys/fs/cgroup/devices -``` - -*Optionally* add key-value "Tags", configure a "Custom AMI" for the EMR cluster on this page. - -![Step 3: General Cluster Settings](../img/AWS-EMR/RAPIDS_EMR_GUI_3.png) - -#### Step 4: Security - -Select an existing "EC2 key pair" that will be used to authenticate SSH access to the cluster's nodes. If you do not have access to an EC2 key pair, follow these instructions to [create an EC2 key pair](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-key-pairs.html#having-ec2-create-your-key-pair). - -*Optionally* set custom security groups in the "EC2 security groups" tab. - -In the "EC2 security groups" tab, confirm that the security group chosen for the "Master" node allows for SSH access. Follow these instructions to [allow inbound SSH traffic](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/authorizing-access-to-an-instance.html) if the security group does not allow it yet. - -![Step 4: Security](../img/AWS-EMR/RAPIDS_EMR_GUI_4.png) - -#### Finish Cluster Configuration - -The EMR cluster management page displays the status of multiple clusters or detailed information about a chosen cluster. In the detailed cluster view, the "Summary" and "Hardware" tabs can be used to monitor the status of master and core nodes as they provision and initialize. - -When the cluster is ready, a green-dot will appear next to the cluster name and the "Status" column will display **Waiting, cluster ready**. - -In the cluster's "Summary" tab, find the "Master public DNS" field and click the `SSH` button. Follow the instructions to SSH to the new cluster's master node. - -![Finish Cluster Configuration](../img/AWS-EMR/RAPIDS_EMR_GUI_5.png) - - -### Running an example joint operation using Spark Shell - -SSH to the EMR cluster's master node, get into sparks shell and run the sql join example to verify GPU operation. - -```bash -spark-shell -``` - -Running following Scala code in Spark Shell - -```scala -val data = 1 to 10000 -val df1 = sc.parallelize(data).toDF() -val df2 = sc.parallelize(data).toDF() -val out = df1.as("df1").join(df2.as("df2"), $"df1.value" === $"df2.value") -out.count() -out.explain() -``` - -### Submit Spark jobs to a EMR Cluster Accelerated by GPUs - -Similar to spark-submit for on-prem clusters, AWS EMR supports a Spark application job to be submitted. The mortgage examples we use are also available as a spark application. You can also use **spark shell** to run the scala code or **pyspark** to run the python code on master node through CLI. - -### Running GPU Accelerated Mortgage ETL and XGBoost Example using EMR Notebook - -An EMR Notebook is a "serverless" Jupyter notebook. Unlike a traditional notebook, the contents of an EMR Notebook itself—the equations, visualizations, queries, models, code, and narrative text—are saved in Amazon S3 separately from the cluster that runs the code. This provides an EMR Notebook with durable storage, efficient access, and flexibility. - -You can use the following step-by-step guide to run the example mortgage dataset using RAPIDS on Amazon EMR GPU clusters. 
For more examples, please refer to [NVIDIA/spark-rapids for ETL](https://github.com/NVIDIA/spark-rapids/tree/main/docs/demo) and [NVIDIA/spark-rapids for XGBoost](https://github.com/NVIDIA/spark-xgboost-examples/tree/spark-3/examples) - -![Create EMR Notebook](../img/AWS-EMR/EMR_notebook_2.png) - -#### Create EMR Notebook and Connect to EMR GPU Cluster - -Go to the AWS Management Console and select Notebooks on the left column. Click the Create notebook button. You can then click "Choose an existing cluster" and pick the right cluster after click Choose button. Once the instance is ready, launch the Jupyter from EMR Notebook instance. - -![Create EMR Notebook](../img/AWS-EMR/EMR_notebook_1.png) - -#### Run Mortgage ETL PySpark Notebook on EMR GPU Cluster - -Download [the Mortgate ETL PySpark Notebook](../demo/AWS-EMR/Mortgage-ETL-GPU-EMR.ipynb). Make sure to use PySpark as kernel. This example use 1 year (year 2000) data for a two node g4dn GPU cluster. You can adjust settings in the notebook for full mortgage dataset ETL. - -When executing the ETL code, you can also saw the Spark Job Progress within the notebook and the code will also display how long it takes to run the query - -![Create EMR Notebook](../img/AWS-EMR/EMR_notebook_3.png) - -#### Run Mortgage XGBoost Scala Notebook on EMR GPU Cluster - -Please refer to this [quick start guide](https://github.com/NVIDIA/spark-xgboost-examples/blob/spark-2/getting-started-guides/csp/aws/Using_EMR_Notebook.md) to running GPU accelerated XGBoost on EMR Spark Cluster. +--- +layout: page +title: AWS-EMR +nav_order: 2 +parent: Getting-Started +--- +# Get Started with RAPIDS on AWS EMR + +This is a getting started guide for the RAPIDS Accelerator for Apache Spark on AWS EMR. At the end +of this guide, the user will be able to run a sample Apache Spark application that runs on NVIDIA +GPUs on AWS EMR. + +The current EMR 6.2.0 release supports Spark version 3.0.1 and RAPIDS Accelerator version 0.2.0. For +more details of supported applications, please see the [EMR release +notes](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-release-6x.html). + +For more information on AWS EMR, please see the [AWS +documentation](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-what-is-emr.html). + +## Configure and Launch AWS EMR with GPU Nodes + +The following steps are based on the AWS EMR document ["Using the Nvidia Spark-RAPIDS Accelerator +for Spark"](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-rapids.html) + +### Launch an EMR Cluster using AWS CLI + +You can use the AWS CLI to launch a cluster with one Master node (m5.xlarge) and two +g4dn.2xlarge nodes: + +``` +aws emr create-cluster \ +--release-label emr-6.2.0 \ +--applications Name=Hadoop Name=Spark Name=Livy Name=JupyterEnterpriseGateway \ +--service-role EMR_DefaultRole \ +--ec2-attributes KeyName=my-key-pair,InstanceProfile=EMR_EC2_DefaultRole \ +--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.4xlarge \ + InstanceGroupType=CORE,InstanceCount=1,InstanceType=g4dn.2xlarge \ + InstanceGroupType=TASK,InstanceCount=1,InstanceType=g4dn.2xlarge \ +--configurations file:///my-configurations.json \ +--bootstrap-actions Name='My Spark Rapids Bootstrap action',Path=s3://my-bucket/my-bootstrap-action.sh +``` + +Please fill with actual value for `KeyName` and file paths. You can further customize SubnetId, +EmrManagedSlaveSecurityGroup, EmrManagedMasterSecurityGroup, name and region etc. 
+ +The `my-configurations.json` installs the spark-rapids plugin on your cluster, configures YARN to use + +GPUs, configures Spark to use RAPIDS, and configures the YARN capacity scheduler. An example JSON + +configuration can be found in the section on launching in the GUI below. + +The `my-boostrap-action.sh` script referenced in the above script opens cgroup permissions to YARN +on your cluster. This is required for YARN to use GPUs. An example script is as follows: +```bash +#!/bin/bash + +set -ex + +sudo chmod a+rwx -R /sys/fs/cgroup/cpu,cpuacct +sudo chmod a+rwx -R /sys/fs/cgroup/devices +``` + +### Launch an EMR Cluster using AWS Console (GUI) + +Go to the AWS Management Console and select the `EMR` service from the "Analytics" section. Choose +the region you want to launch your cluster in, e.g. US West (Oregon), using the dropdown menu in the +top right corner. Click `Create cluster` and select `Go to advanced options`, which will bring up a +detailed cluster configuration page. + +#### Step 1: Software Configuration and Steps + +Select **emr-6.2.0** for the release, uncheck all the software options, and then check **Hadoop +3.2.1**, **Spark 3.0.1**, **Livy 0.7.0** and **JupyterEnterpriseGateway 2.1.0**. + +In the "Edit software settings" field, copy and paste the configuration from the [EMR +document](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-rapids.html). You can also +create a JSON file on you own S3 bucket. + +For clusters with 2x g4dn.2xlarge GPU instances as worker nodes, we recommend the following +default settings: +```json +[ + { + "Classification":"spark", + "Properties":{ + "enableSparkRapids":"true" + } + }, + { + "Classification":"yarn-site", + "Properties":{ + "yarn.nodemanager.resource-plugins":"yarn.io/gpu", + "yarn.resource-types":"yarn.io/gpu", + "yarn.nodemanager.resource-plugins.gpu.allowed-gpu-devices":"auto", + "yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables":"/usr/bin", + "yarn.nodemanager.linux-container-executor.cgroups.mount":"true", + "yarn.nodemanager.linux-container-executor.cgroups.mount-path":"/sys/fs/cgroup", + "yarn.nodemanager.linux-container-executor.cgroups.hierarchy":"yarn", + "yarn.nodemanager.container-executor.class":"org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor" + } + }, + { + "Classification":"container-executor", + "Properties":{ + + }, + "Configurations":[ + { + "Classification":"gpu", + "Properties":{ + "module.enabled":"true" + } + }, + { + "Classification":"cgroups", + "Properties":{ + "root":"/sys/fs/cgroup", + "yarn-hierarchy":"yarn" + } + } + ] + }, + { + "Classification":"spark-defaults", + "Properties":{ + "spark.plugins":"com.nvidia.spark.SQLPlugin", + "spark.sql.sources.useV1SourceList":"", + "spark.executor.resource.gpu.discoveryScript":"/usr/lib/spark/scripts/gpu/getGpusResources.sh", + "spark.submit.pyFiles":"/usr/lib/spark/jars/xgboost4j-spark_3.0-1.0.0-0.2.0.jar", + "spark.executor.extraLibraryPath":"/usr/local/cuda/targets/x86_64-linux/lib:/usr/local/cuda/extras/CUPTI/lib64:/usr/local/cuda/compat/lib:/usr/local/cuda/lib:/usr/local/cuda/lib64:/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/docker/usr/lib/hadoop/lib/native:/docker/usr/lib/hadoop-lzo/lib/native", + "spark.rapids.sql.concurrentGpuTasks":"2", + "spark.executor.resource.gpu.amount":"1", + "spark.executor.cores":"8", + "spark.task.cpus ":"1", + "spark.task.resource.gpu.amount":"0.125", + "spark.rapids.memory.pinnedPool.size":"2G", + "spark.executor.memoryOverhead":"2G", + 
"spark.locality.wait":"0s", + "spark.sql.shuffle.partitions":"200", + "spark.sql.files.maxPartitionBytes":"256m", + "spark.sql.adaptive.enabled":"false" + } + }, + { + "Classification":"capacity-scheduler", + "Properties":{ + "yarn.scheduler.capacity.resource-calculator":"org.apache.hadoop.yarn.util.resource.DominantResourceCalculator" + } + } +] + +``` +Adjust the settings as appropriate for your cluster. For example, setting the appropriate +number of cores based on the node type. The `spark.task.resource.gpu.amount` should be set to +1/(number of cores per executor) which will allow multiple tasks to run in parallel on the GPU. + +For example, for clusters with 2x g4dn.12xlarge as core nodes, use the following: + +```json + "spark.executor.cores":"12", + "spark.task.resource.gpu.amount":"0.0833", +``` + +More configuration details can be found in the [configuration](../configs.md) documentation. + +![Step 1: Step 1: Software, Configuration and Steps](../img/AWS-EMR/RAPIDS_EMR_GUI_1.png) + +#### Step 2: Hardware + +Select the desired VPC and availability zone in the "Network" and "EC2 Subnet" fields +respectively. (Default network and subnet are ok) + +In the "Core" node row, change the "Instance type" to **g4dn.xlarge**, **g4dn.2xlarge**, or +**p3.2xlarge** and ensure "Instance count" is set to **1** or any higher number. Keep the default +"Master" node instance type of **m5.xlarge**. + +![Step 2: Hardware](../img/AWS-EMR/RAPIDS_EMR_GUI_2.png) + +#### Step 3: General Cluster Settings + +Enter a custom "Cluster name" and make a note of the s3 folder that cluster logs will be written to. + +Add a custom "Bootstrap Actions" to allow cgroup permissions to YARN on your cluster. An example +bootstrap script is as follows: +```bash +#!/bin/bash + +set -ex + +sudo chmod a+rwx -R /sys/fs/cgroup/cpu,cpuacct +sudo chmod a+rwx -R /sys/fs/cgroup/devices +``` + +*Optionally* add key-value "Tags", configure a "Custom AMI" for the EMR cluster on this page. + +![Step 3: General Cluster Settings](../img/AWS-EMR/RAPIDS_EMR_GUI_3.png) + +#### Step 4: Security + +Select an existing "EC2 key pair" that will be used to authenticate SSH access to the cluster's +nodes. If you do not have access to an EC2 key pair, follow these instructions to [create an EC2 key +pair](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-key-pairs.html#having-ec2-create-your-key-pair). + +*Optionally* set custom security groups in the "EC2 security groups" tab. + +In the "EC2 security groups" tab, confirm that the security group chosen for the "Master" node +allows for SSH access. Follow these instructions to [allow inbound SSH +traffic](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/authorizing-access-to-an-instance.html) +if the security group does not allow it yet. + +![Step 4: Security](../img/AWS-EMR/RAPIDS_EMR_GUI_4.png) + +#### Finish Cluster Configuration + +The EMR cluster management page displays the status of multiple clusters or detailed information +about a chosen cluster. In the detailed cluster view, the "Summary" and "Hardware" tabs can be used +to monitor the status of master and core nodes as they provision and initialize. + +When the cluster is ready, a green-dot will appear next to the cluster name and the "Status" column +will display **Waiting, cluster ready**. + +In the cluster's "Summary" tab, find the "Master public DNS" field and click the `SSH` +button. Follow the instructions to SSH to the new cluster's master node. 
+ +![Finish Cluster Configuration](../img/AWS-EMR/RAPIDS_EMR_GUI_5.png) + + +### Running an example joint operation using Spark Shell + +SSH to the EMR cluster's master node, get into sparks shell and run the sql join example to verify +GPU operation. + +```bash +spark-shell +``` + +Running following Scala code in Spark Shell + +```scala +val data = 1 to 10000 +val df1 = sc.parallelize(data).toDF() +val df2 = sc.parallelize(data).toDF() +val out = df1.as("df1").join(df2.as("df2"), $"df1.value" === $"df2.value") +out.count() +out.explain() +``` + +### Submit Spark jobs to a EMR Cluster Accelerated by GPUs + +Similar to spark-submit for on-prem clusters, AWS EMR supports a Spark application job to be +submitted. The mortgage examples we use are also available as a spark application. You can also use +**spark shell** to run the scala code or **pyspark** to run the python code on master node through +CLI. + +### Running GPU Accelerated Mortgage ETL and XGBoost Example using EMR Notebook + +An EMR Notebook is a "serverless" Jupyter notebook. Unlike a traditional notebook, the contents of +an EMR Notebook itself—the equations, visualizations, queries, models, code, and narrative text—are +saved in Amazon S3 separately from the cluster that runs the code. This provides an EMR Notebook +with durable storage, efficient access, and flexibility. + +You can use the following step-by-step guide to run the example mortgage dataset using RAPIDS on +Amazon EMR GPU clusters. For more examples, please refer to [NVIDIA/spark-rapids for +ETL](https://github.com/NVIDIA/spark-rapids/tree/main/docs/demo) and [NVIDIA/spark-rapids for +XGBoost](https://github.com/NVIDIA/spark-xgboost-examples/tree/spark-3/examples) + +![Create EMR Notebook](../img/AWS-EMR/EMR_notebook_2.png) + +#### Create EMR Notebook and Connect to EMR GPU Cluster + +Go to the AWS Management Console and select Notebooks on the left column. Click the Create notebook +button. You can then click "Choose an existing cluster" and pick the right cluster after click +Choose button. Once the instance is ready, launch the Jupyter from EMR Notebook instance. + +![Create EMR Notebook](../img/AWS-EMR/EMR_notebook_1.png) + +#### Run Mortgage ETL PySpark Notebook on EMR GPU Cluster + +Download [the Mortgate ETL PySpark Notebook](../demo/AWS-EMR/Mortgage-ETL-GPU-EMR.ipynb). Make sure +to use PySpark as kernel. This example use 1 year (year 2000) data for a two node g4dn GPU +cluster. You can adjust settings in the notebook for full mortgage dataset ETL. + +When executing the ETL code, you can also saw the Spark Job Progress within the notebook and the +code will also display how long it takes to run the query + +![Create EMR Notebook](../img/AWS-EMR/EMR_notebook_3.png) + +#### Run Mortgage XGBoost Scala Notebook on EMR GPU Cluster + +Please refer to this [quick start +guide](https://github.com/NVIDIA/spark-xgboost-examples/blob/spark-2/getting-started-guides/csp/aws/Using_EMR_Notebook.md) +to running GPU accelerated XGBoost on EMR Spark Cluster. diff --git a/docs/get-started/getting-started-on-prem.md b/docs/get-started/getting-started-on-prem.md index 364ee1aacb3..71c24f50985 100644 --- a/docs/get-started/getting-started-on-prem.md +++ b/docs/get-started/getting-started-on-prem.md @@ -394,10 +394,10 @@ The _RapidsShuffleManager_ is a beta feature! --- The _RapidsShuffleManager_ is an implementation of the `ShuffleManager` interface in Apache Spark -that allows custom mechanisms to exchange shuffle data. 
The _RapidsShuffleManager_ has two components: -a spillable cache, and a transport that can utilize _Remote Direct Memory Access (RDMA)_ and high-bandwidth -transfers within a node that has multiple GPUs. This is possible because the plugin utilizes -[Unified Communication X (UCX)](https://www.openucx.org/) as its transport. +that allows custom mechanisms to exchange shuffle data. The _RapidsShuffleManager_ has two +components: a spillable cache, and a transport that can utilize _Remote Direct Memory Access (RDMA)_ +and high-bandwidth transfers within a node that has multiple GPUs. This is possible because the +plugin utilizes [Unified Communication X (UCX)](https://www.openucx.org/) as its transport. - **Spillable cache**: This store keeps GPU data close by where it was produced in device memory, but can spill in the following cases: @@ -409,8 +409,9 @@ but can spill in the following cases: Tasks local to the producing executor will short-circuit read from the cache. -- **Transport**: Handles block transfers between executors using various means like: _NVLink_, _PCIe_, _Infiniband (IB)_, -_RDMA over Converged Ethernet (RoCE)_ or _TCP_, and as configured in UCX, in these scenarios: +- **Transport**: Handles block transfers between executors using various means like: _NVLink_, +_PCIe_, _Infiniband (IB)_, _RDMA over Converged Ethernet (RoCE)_ or _TCP_, and as configured in UCX, +in these scenarios: - _GPU-to-GPU_: Shuffle blocks that were able to fit in GPU memory. - _Host-to-GPU_ and _Disk-to-GPU_: Shuffle blocks that spilled to host (or disk) but will be manifested in the GPU in the downstream Spark task. @@ -418,9 +419,9 @@ _RDMA over Converged Ethernet (RoCE)_ or _TCP_, and as configured in UCX, in the In order to enable the _RapidsShuffleManager_, please follow these steps. If you don't have Mellanox hardware go to *step 2*: -1. If you have Mellanox NICs and an Infiniband(IB) or RoCE network, please ensure you have the -[MLNX_OFED driver](https://www.mellanox.com/products/infiniband-drivers/linux/mlnx_ofed), -and the [`nv_peer_mem` kernel module](https://www.mellanox.com/products/GPUDirect-RDMA) installed. +1. If you have Mellanox NICs and an Infiniband(IB) or RoCE network, please ensure you have the +[MLNX_OFED driver](https://www.mellanox.com/products/infiniband-drivers/linux/mlnx_ofed), and the +[`nv_peer_mem` kernel module](https://www.mellanox.com/products/GPUDirect-RDMA) installed. With `nv_peer_mem`, IB/RoCE-based transfers can perform zero-copy transfers directly from GPU memory. @@ -447,7 +448,7 @@ that matches your Spark version. Currently we support --conf spark.executor.extraClassPath=${SPARK_CUDF_JAR}:${SPARK_RAPIDS_PLUGIN_JAR} ``` -Please note `LD_LIBRARY_PATH` should optionally be set if the UCX library is installed in a +Please note `LD_LIBRARY_PATH` should optionally be set if the UCX library is installed in a non-standard location. ### UCX Environment Variables @@ -468,10 +469,11 @@ Here are some settings that could be utilized to fine tune the _RapidsShuffleMan #### Bounce Buffers The following configs control the number of bounce buffers, and the size. Please note that for device buffers, two pools are created (for sending and receiving). Take this into account when -sizing your pools. +sizing your pools. -The GPU buffers should be smaller than the [`PCI BAR Size`](https://docs.nvidia.com/cuda/gpudirect-rdma/index.html#bar-sizes) -for your GPU. Please verify the [defaults](../configs.md) work in your case. 
+The GPU buffers should be smaller than the [`PCI BAR +Size`](https://docs.nvidia.com/cuda/gpudirect-rdma/index.html#bar-sizes) for your GPU. Please verify +the [defaults](../configs.md) work in your case. - `spark.rapids.shuffle.ucx.bounceBuffers.device.count` - `spark.rapids.shuffle.ucx.bounceBuffers.host.count` @@ -490,9 +492,12 @@ The _GPU Scheduling for Pandas UDF_ is an experimental feature, and may change a --- -_GPU Scheduling for Pandas UDF_ is built on Apache Spark's [Pandas UDF(user defined function)](https://spark.apache.org/docs/3.0.0/sql-pyspark-pandas-with-arrow.html#pandas-udfs-aka-vectorized-udfs), and has two components: +_GPU Scheduling for Pandas UDF_ is built on Apache Spark's [Pandas UDF(user defined +function)](https://spark.apache.org/docs/3.0.0/sql-pyspark-pandas-with-arrow.html#pandas-udfs-aka-vectorized-udfs), +and has two components: -- **Share GPU with JVM**: Let the Python process share JVM GPU. The Python process could run on the same GPU with JVM. +- **Share GPU with JVM**: Let the Python process share JVM GPU. The Python process could run on the + same GPU with JVM. - **Increase Speed**: Make the data transport faster between JVM process and Python process. @@ -500,7 +505,8 @@ _GPU Scheduling for Pandas UDF_ is built on Apache Spark's [Pandas UDF(user defi To enable _GPU Scheduling for Pandas UDF_, you need to configure your spark job with extra settings. -1. Make sure GPU exclusive mode is disabled. Note that this will not work if you are using exclusive mode to assign GPUs under spark. +1. Make sure GPU exclusive mode is disabled. Note that this will not work if you are using exclusive + mode to assign GPUs under spark. 2. Currently the python files are packed into the spark rapids plugin jar. On Yarn, you need to add @@ -543,7 +549,8 @@ spark.rapids.python.memory.gpu.allocFraction spark.rapids.python.memory.gpu.maxAllocFraction ``` -To find details on the above Python configuration settings, please see the [RAPIDS Accelerator for Apache Spark Configuration Guide](../configs.md). +To find details on the above Python configuration settings, please see the [RAPIDS Accelerator for +Apache Spark Configuration Guide](../configs.md). From 08b44a9574dfe41574f502ad7f31862eebf4850e Mon Sep 17 00:00:00 2001 From: Sameer Raheja Date: Fri, 22 Jan 2021 09:03:56 -0800 Subject: [PATCH 3/3] Change spark to Spark in the text Signed-off-by: Sameer Raheja --- docs/FAQ.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/FAQ.md b/docs/FAQ.md index 474b3ee7b50..0feb19fdaf6 100644 --- a/docs/FAQ.md +++ b/docs/FAQ.md @@ -13,7 +13,7 @@ settings. This is true of all configs in Spark. If you changed `spark.sql.autoBroadcastJoinThreshold` after running `explain()` on a `DataFrame`, the resulting query would not change to reflect that config and still show a `SortMergeJoin` even though the new config might have changed to be a `BroadcastHashJoin` instead. When actually running something like -with `collect`, `show` or `write` a new `DataFrame` is constructed causing spark to re-plan the +with `collect`, `show` or `write` a new `DataFrame` is constructed causing Spark to re-plan the query. This is why `spark.rapids.sql.enabled` is still respected when running, even if explain shows stale results.
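To make the re-planning behavior above concrete, here is a minimal `spark-shell` sketch. The `big`
and `small` DataFrames are hypothetical stand-ins for real tables, and the exact join strategies
shown in the plans may differ in your environment.

```scala
// Any join where one side is small enough to broadcast will do.
val big   = spark.range(0, 1000000L).toDF("id")
val small = spark.range(0, 100L).toDF("id")

spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")  // disable broadcast joins
val joined = big.join(small, "id")
joined.explain()   // planned while broadcasting is disabled, e.g. a SortMergeJoin

spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "10485760")  // back to the 10MB default
joined.explain()   // can still show the stale plan captured for this DataFrame

// Re-building the query (as collect, show or write do when a new DataFrame is
// constructed) re-plans it with the current settings, so the strategy can change here.
big.join(small, "id").explain()
```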