clean up GCP guide, update NVIDIA driver/CUDA installation (NVIDIA#69)
* gcp guide

* refine

* add init script

* update driver install

* fix typos

* add rapids spark and cudf version

* remove .DS_Store

* fix typo
mengdong authored and krajendrannv committed Nov 27, 2019
1 parent 2620e60 commit da110fd
Showing 4 changed files with 197 additions and 144 deletions.
150 changes: 80 additions & 70 deletions getting-started-guides/csp/gcp/gcp.md
Prerequisites
* `spark.dynamicAllocation.enabled` must be set to `false` for Spark



Before you begin, please make sure you have installed [Google Cloud SDK](https://cloud.google.com/sdk/) and selected your project directory on your local machine. The following steps require a GCP project directory and Google Storage bucket associated with the project directory.
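As a quick reference, initializing the SDK and creating a bucket might look like the following (a minimal sketch; `my-project`, `my-bucket`, and the region are placeholders to replace with your own values):

```bash
# Authenticate and point the SDK at your project (placeholder: my-project)
gcloud auth login
gcloud config set project my-project

# Create a storage bucket for the files used in this guide (placeholder: my-bucket)
gsutil mb -l us-central1 gs://my-bucket
```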
There are three steps to run a sample PySpark XGBoost app using a Jupyter notebook on a GCP GPU cluster from your local machine, plus an optional fourth step:
1. Initialization steps to download required files for Spark RAPIDS XGBoost app
2. Create a GPU cluster with pre-installed GPU drivers, Spark RAPIDS libraries, Spark XGBoost libraries and Jupyter notebook
3. Upload and run a sample XGBoost PySpark app in the Jupyter notebook on your GCP cluster
4. Optional step: Submit sample PySpark or Scala App using the gcloud command from your local machine


### Step 1. Initialization steps to download required files for Spark RAPIDS XGBoost app

Before you create a cluster, please git clone the [spark-examples directory](https://github.com/rapidsai/spark-examples) to your local machine. `cd` into the `spark-examples/getting-started-guides/csp/gcp/spark-gpu` directory and open the `rapids.sh` script in a text editor. Modify the `GCS_BUCKET=my-bucket` and `INIT_ACTIONS_BUCKET=my-bucket` lines to specify your GCP bucket name (the `INIT_ACTIONS_BUCKET` parameter will no longer be required once the init action is merged into the official Dataproc repo).
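For example, assuming both variables are defined as simple `NAME=value` lines in the script, a one-liner like this would set them (a convenience sketch, not part of the original guide):

```bash
# Point both bucket variables in rapids.sh at your own bucket (placeholder: my-bucket)
# (GNU sed syntax; on macOS use `sed -i ''`)
sed -i 's/^GCS_BUCKET=.*/GCS_BUCKET=my-bucket/; s/^INIT_ACTIONS_BUCKET=.*/INIT_ACTIONS_BUCKET=my-bucket/' rapids.sh
```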

Execute the commands below while in the spark-examples folder. These commands will copy the following files into your GCP bucket:

1. Initialization scripts for the GPU drivers and RAPIDS Spark
2. PySpark app files
3. A sample dataset for an XGBoost PySpark app
4. The latest Spark RAPIDS XGBoost jar files from the public Maven repository

```bash
export GCS_BUCKET=my-bucket
export RAPIDS_SPARK_VERSION='2.x-1.0.0-Beta3'
export RAPIDS_CUDF_VERSION='0.9.2-cuda10'
pushd getting-started-guides/csp/gcp/spark-gpu
gsutil cp -r internal gs://$GCS_BUCKET/spark-gpu/
gsutil cp rapids.sh gs://$GCS_BUCKET/spark-gpu/rapids.sh
popd
pushd datasets/
tar -xvf mortgage-small.tar.gz
gsutil cp -r mortgage-small/ gs://$GCS_BUCKET/
popd
wget -O cudf-${RAPIDS_CUDF_VERSION}.jar https://repo1.maven.org/maven2/ai/rapids/cudf/${RAPIDS_CUDF_VERSION%-*}/cudf-${RAPIDS_CUDF_VERSION}.jar
wget -O xgboost4j_${RAPIDS_SPARK_VERSION}.jar https://repo1.maven.org/maven2/ai/rapids/xgboost4j_${RAPIDS_SPARK_VERSION/-/\/}/xgboost4j_${RAPIDS_SPARK_VERSION}.jar
wget -O xgboost4j-spark_${RAPIDS_SPARK_VERSION}.jar https://repo1.maven.org/maven2/ai/rapids/xgboost4j-spark_${RAPIDS_SPARK_VERSION/-/\/}/xgboost4j-spark_${RAPIDS_SPARK_VERSION}.jar
gsutil cp cudf-${RAPIDS_CUDF_VERSION}.jar xgboost4j-spark_${RAPIDS_SPARK_VERSION}.jar xgboost4j_${RAPIDS_SPARK_VERSION}.jar gs://$GCS_BUCKET/
```

After executing these commands, use your web browser to navigate to the Google Cloud Platform console and make sure your Google Storage bucket “my-bucket” contains the following files (you can also verify this from the command line, as shown after the list):
* gs://my-bucket/spark-gpu/rapids.sh
* gs://my-bucket/spark-gpu/internal/install-gpu-driver-debian.sh
* gs://my-bucket/spark-gpu/internal/install-gpu-driver-ubuntu.sh
* gs://my-bucket/cudf-${RAPIDS_CUDF_VERSION}.jar
* gs://my-bucket/xgboost4j-spark_${RAPIDS_SPARK_VERSION}.jar
* gs://my-bucket/xgboost4j_${RAPIDS_SPARK_VERSION}.jar
* gs://my-bucket/mortgage-small/train/mortgage-small.csv
* gs://my-bucket/mortgage-small/eval/mortgage-small.csv
* gs://my-bucket/mortgage-small/trainWithEval/test.csv
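
Alternatively, list the bucket contents from your terminal:

```bash
# Recursively list everything under the bucket to confirm the uploads
gsutil ls -r gs://$GCS_BUCKET/
```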


### Step 2. Create a GPU Cluster with pre-installed GPU drivers, Spark RAPIDS libraries, Spark XGBoost libraries and Jupyter Notebook

Use the `gcloud` command to create a new cluster with the RAPIDS Spark GPU initialization action. The following commands will create a new cluster named `<CLUSTER_NAME>` under your project directory. Here we use Ubuntu as our recommended OS for Spark-XGBoost on GCP. Modify the `GCS_BUCKET=my-bucket` line to specify your GCP bucket name, and update `--properties` to reference the up-to-date jar files released by the NVIDIA Spark XGBoost team.

```bash
export CLUSTER_NAME=my-gpu-cluster
export RAPIDS_SPARK_VERSION='2.x-1.0.0-Beta3'
export RAPIDS_CUDF_VERSION='0.9.2-cuda10'
export ZONE=us-central1-b
export REGION=us-central1
export GCS_BUCKET=my-bucket
export INIT_ACTIONS_BUCKET=my-bucket
export NUM_GPUS=2
export NUM_WORKERS=2
# ... (additional export lines and leading cluster flags are collapsed in the diff view) ...
gcloud beta dataproc clusters create $CLUSTER_NAME \
--num-worker-local-ssds 1 \
--num-workers $NUM_WORKERS \
--image-version 1.4-ubuntu18 \
--bucket $GCS_BUCKET \
--metadata JUPYTER_PORT=8123,INIT_ACTIONS_REPO="gs://$INIT_ACTIONS_BUCKET",linux-dist="ubuntu",GCS_BUCKET="gs://$GCS_BUCKET" \
--initialization-actions gs://$INIT_ACTIONS_BUCKET/spark-gpu/rapids.sh \
--optional-components=ANACONDA,JUPYTER \
--subnet=default \
--properties "^#^spark:spark.dynamicAllocation.enabled=false#spark:spark.shuffle.service.enabled=false#spark:spark.submit.pyFiles=/usr/lib/spark/python/lib/xgboost4j-spark_${RAPIDS_SPARK_VERSION}.jar#spark:spark.jars=/usr/lib/spark/jars/xgboost4j-spark_${RAPIDS_SPARK_VERSION}.jar,/usr/lib/spark/jars/xgboost4j_${RAPIDS_SPARK_VERSION}.jar,/usr/lib/spark/jars/cudf-${RAPIDS_CUDF_VERSION}.jar" \
--enable-component-gateway
```

After submitting the commands, go to the Google Cloud Platform console in your browser. Search for "Dataproc" and click on the "Dataproc" icon. This will navigate you to the Dataproc clusters page, which lists all Dataproc clusters created under your project directory. You should see “my-gpu-cluster” with status "Running". This cluster is now ready to host RAPIDS Spark XGBoost applications.
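
You can also check the cluster state from your terminal (a quick sketch using the exports from the previous step):

```bash
# Prints RUNNING once the cluster is ready
gcloud dataproc clusters describe $CLUSTER_NAME --region=$REGION --format='value(status.state)'
```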


### Step 3. Upload and run a sample XGBoost PySpark app in the Jupyter notebook on your GCP cluster

To open the Jupyter notebook, click on “my-gpu-cluster” on the Dataproc page and navigate to the "Web Interfaces" tab. Under "Web Interfaces", click the “Jupyter” link.
This will open the Jupyter notebook, which is running on the “my-gpu-cluster”.
Next, to upload the sample PySpark app into the Jupyter notebook, use the “Upload” button in the Jupyter notebook. The sample PySpark notebook is inside the `spark-examples/examples/notebooks/python/` directory. Once you upload the sample `mortgage-gpu.ipynb`, make sure to change the kernel to “PySpark” under the "Kernel" tab using the "Change Kernel" selection. The Spark XGBoost sample Jupyter notebook is now ready to run on the “my-gpu-cluster”.
To run the sample PySpark app in the Jupyter notebook, follow the instructions in the notebook and update the data paths for the sample datasets, replacing `$GCS_BUCKET` with your actual bucket name (the notebook will not expand shell variables):
```python
train_data = GpuDataReader(spark).schema(schema).option('header', True).csv('gs://$GCS_BUCKET/mortgage-small/train')
eval_data = GpuDataReader(spark).schema(schema).option('header', True).csv('gs://$GCS_BUCKET/mortgage-small/eval')
```
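
Before editing the notebook, you can confirm the exact dataset paths from your terminal:

```bash
# Verify the train and eval datasets exist at the paths the notebook will read
gsutil ls gs://$GCS_BUCKET/mortgage-small/train gs://$GCS_BUCKET/mortgage-small/eval
```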

### Step 4. [Optional] Submit Sample Apps
#### 4a) Submit Scala Spark App on GPUs

Please build the `sample_xgboost_apps` jar with dependencies as specified in the [guide](/getting-started-guides/building-sample-apps/scala.md) and place the jar file (`sample_xgboost_apps-0.1.4-jar-with-dependencies.jar`) under the `gs://$GCS_BUCKET/spark-gpu` folder. To do this you can either drag and drop files from your local machine into the GCP [storage browser](https://console.cloud.google.com/storage/browser/rapidsai-test-1/?project=nv-ai-infra&organizationId=210881545417), or use [gsutil cp](https://cloud.google.com/storage/docs/gsutil/commands/cp) as shown before to do this from a command line.
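
For example, assuming the Maven build placed the jar under `target/` as in the Scala guide:

```bash
# Upload the sample app jar next to the init scripts in the bucket
gsutil cp target/sample_xgboost_apps-0.1.4-jar-with-dependencies.jar gs://$GCS_BUCKET/spark-gpu/
```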

Use the following commands to submit the sample Scala app on this GPU cluster.

```bash
export MAIN_CLASS=ai.rapids.spark.examples.mortgage.GPUMain
export RAPIDS_JARS=gs://$GCS_BUCKET/spark-gpu/sample_xgboost_apps-0.1.4-jar-with-dependencies.jar
export DATA_PATH=gs://$GCS_BUCKET
export TREE_METHOD=gpu_hist
export SPARK_NUM_EXECUTORS=4
export CLUSTER_NAME=my-gpu-cluster
# ... (the gcloud job-submission command preceding these flags is collapsed in the diff view) ...
-numWorkers=${SPARK_NUM_EXECUTORS} \
-treeMethod=${TREE_METHOD} \
-trainDataPath=${DATA_PATH}/mortgage-small/train/mortgage_small.csv \
-evalDataPath=${DATA_PATH}/mortgage-small/eval/mortgage_small.csv \
-maxDepth=8
```
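
The diff collapses the `gcloud` invocation itself just above these flags. For reference, a submission following the same pattern as the PySpark example in 4b would look roughly like this (a hedged sketch assembled from the visible flags and exports, not the guide's exact command):

```bash
# Sketch only: flags mirror the PySpark example below; $REGION is assumed to be exported
gcloud beta dataproc jobs submit spark \
--cluster=$CLUSTER_NAME \
--region=$REGION \
--class=$MAIN_CLASS \
--jars=$RAPIDS_JARS \
--properties=spark.executor.cores=1,spark.executor.instances=${SPARK_NUM_EXECUTORS},spark.executor.memory=8G \
-- \
-format=csv \
-numRound=100 \
-numWorkers=${SPARK_NUM_EXECUTORS} \
-treeMethod=${TREE_METHOD} \
-trainDataPath=${DATA_PATH}/mortgage-small/train/mortgage_small.csv \
-evalDataPath=${DATA_PATH}/mortgage-small/eval/mortgage_small.csv \
-maxDepth=8
```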

#### 4b) Submit PySpark App on GPUs

Please build the sample_xgboost PySpark app as specified in the [guide](/getting-started-guides/building-sample-apps/python.md) and place the `sample.zip` file (and the `main.py` entry point referenced below) into your GCP storage bucket.
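
Assuming the build produced `sample.zip` and `main.py` in the current directory (paths may differ depending on how you followed the Python guide):

```bash
# Upload the PySpark app package and its entry point to the bucket root
gsutil cp sample.zip main.py gs://$GCS_BUCKET/
```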


Use the following commands to submit the sample PySpark app on this GPU cluster.

```bash
export DATA_PATH=gs://$GCS_BUCKET
export LIBS_PATH=gs://$GCS_BUCKET
export RAPIDS_SPARK_VERSION='2.x-1.0.0-Beta3'
export RAPIDS_CUDF_VERSION='0.9.2-cuda10'
export SPARK_DEPLOY_MODE=cluster
export SPARK_PYTHON_ENTRYPOINT=${LIBS_PATH}/main.py
export MAIN_CLASS=ai.rapids.spark.examples.mortgage.gpu_main
export RAPIDS_JARS=${LIBS_PATH}/cudf-${RAPIDS_CUDF_VERSION}.jar,${LIBS_PATH}/xgboost4j_${RAPIDS_SPARK_VERSION}.jar,${LIBS_PATH}/xgboost4j-spark_${RAPIDS_SPARK_VERSION}.jar
export SPARK_PY_FILES=${LIBS_PATH}/xgboost4j-spark_${RAPIDS_SPARK_VERSION}.jar,${LIBS_PATH}/sample.zip
export TREE_METHOD=gpu_hist
export SPARK_NUM_EXECUTORS=4
export CLUSTER_NAME=my-gpu-cluster
export REGION=us-central1

gcloud beta dataproc jobs submit pyspark \
--cluster=$CLUSTER_NAME \
--region=$REGION \
--jars=$RAPIDS_JARS \
--properties=spark.executor.cores=1,spark.executor.instances=${SPARK_NUM_EXECUTORS},spark.executor.memory=8G,spark.executorEnv.LD_LIBRARY_PATH=/usr/local/lib/x86_64-linux-gnu:/usr/local/cuda-10.0/lib64:${LD_LIBRARY_PATH} \
--py-files=${SPARK_PY_FILES} \
${SPARK_PYTHON_ENTRYPOINT} \
-- \
--format=csv \
--numRound=100 \
--numWorkers=${SPARK_NUM_EXECUTORS} \
--treeMethod=${TREE_METHOD} \
--trainDataPath=${DATA_PATH}/mortgage-small/train/mortgage_small.csv \
--evalDataPath=${DATA_PATH}/mortgage-small/eval/mortgage_small.csv \
--maxDepth=8 \
--mainClass=${MAIN_CLASS}
```
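
Once submitted, you can monitor the job from your terminal:

```bash
# List recent jobs on the cluster and their states
gcloud dataproc jobs list --cluster=$CLUSTER_NAME --region=$REGION
```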


#### 4c) Addendum: Submit a Spark Job on CPUs

Submitting a CPU job on this cluster is very similar. Below is an example command that runs the same Mortgage application on CPUs:

```bash
export GCS_BUCKET=my-bucket
export MAIN_CLASS=ai.rapids.spark.examples.mortgage.CPUMain
export RAPIDS_JARS=gs://$GCS_BUCKET/spark-gpu/sample_xgboost_apps-0.1.4-jar-with-dependencies.jar
export DATA_PATH=gs://$GCS_BUCKET
export TREE_METHOD=hist
export SPARK_NUM_EXECUTORS=4
export CLUSTER_NAME=my-gpu-cluster
# ... (the gcloud job-submission command preceding these flags is collapsed in the diff view) ...
-numWorkers=${SPARK_NUM_EXECUTORS} \
-treeMethod=${TREE_METHOD} \
-trainDataPath=${DATA_PATH}/mortgage-small/train/mortgage_small.csv \
-evalDataPath=${DATA_PATH}/mortgage-small/eval/mortgage_small.csv \
-maxDepth=8
```

### Step 5. Clean Up

When you're done working on this cluster, don't forget to delete it, using the following command (replace the cluster name with yours):

```bash
gcloud dataproc clusters delete my-gpu-cluster --region=$REGION
```
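
To confirm the deletion:

```bash
# The deleted cluster should no longer appear in the list
gcloud dataproc clusters list --region=$REGION
```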

<sup>*</sup> Please see our release [announcement](https://news.developer.nvidia.com/gpu-accelerated-spark-xgboost/) for official performance benchmarks.
