updated gcp docs with custom dataproc image instructions (#2254)
Signed-off-by: Akshit Arora <akshita@nvidia.com>
aroraakshit authored Apr 26, 2021
1 parent 13e199f commit 5de2f24
Showing 1 changed file with 272 additions and 5 deletions.
277 changes: 272 additions & 5 deletions docs/get-started/getting-started-gcp.md
@@ -14,6 +14,7 @@ parent: Getting-Started
GPUs](#run-pyspark-or-scala-notebook-on-a-dataproc-cluster-accelerated-by-gpus)
* [Submit the same sample ETL application as a Spark job to a Dataproc Cluster Accelerated by
GPUs](#submit-spark-jobs-to-a-dataproc-cluster-accelerated-by-gpus)
* [Build custom Dataproc image to accelerate cluster initialization time](#build-custom-dataproc-image-to-accelerate-cluster-init-time)

## Spin up a Dataproc Cluster Accelerated by GPUs

@@ -72,14 +73,16 @@ gcloud dataproc clusters create $CLUSTER_NAME \
--metadata rapids-runtime=SPARK \
--bucket $GCS_BUCKET \
--enable-component-gateway \
--properties="^#^spark:spark.yarn.unmanagedAM.enabled=false"
```

This may take around 10-15 minutes to complete. You can navigate to the Dataproc clusters tab in the
Google Cloud Console to see the progress.

![Dataproc Cluster](../img/GCP/dataproc-cluster.png)
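
If you prefer the command line, you can also poll the cluster state with the gcloud CLI; a minimal
sketch using the `$CLUSTER_NAME` and `$REGION` variables exported above:

```bash
# Prints PROVISIONING while the cluster is being created and RUNNING once it is ready.
gcloud dataproc clusters describe $CLUSTER_NAME --region $REGION \
  --format="value(status.state)"
```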

If you'd like to further accelerate init time to 4-5 minutes, create a custom Dataproc image by
following [this guide](#build-custom-dataproc-image-to-accelerate-cluster-init-time).

## Run PySpark or Scala Notebook on a Dataproc Cluster Accelerated by GPUs
To use notebooks with a Dataproc cluster, click on the cluster name under the Dataproc cluster tab
and navigate to the "Web Interfaces" tab. Under "Web Interfaces", click on the JupyterLab or
@@ -137,9 +140,9 @@ can either drag and drop files from your local machine into the GCP storage browser, or use
gsutil cp as shown before to do this from a command line. We can then submit the jar as follows:

```bash
export REGION=[Your Preferred GCP Region]
export GCS_BUCKET=[Your GCS Bucket]
export CLUSTER_NAME=[Your Cluster Name]
export SPARK_NUM_EXECUTORS=20
export SPARK_EXECUTOR_MEMORY=20G
export SPARK_EXECUTOR_MEMORYOVERHEAD=16G
@@ -169,3 +172,267 @@ The AI platform will connect to a Dataproc cluster through a yaml configuration.

In the future, users will be able to provision a Dataproc cluster through a DataprocHub notebook. You
can use the example [pyspark notebooks](../demo/GCP/Mortgage-ETL-GPU.ipynb) to experiment.

## Build custom Dataproc image to accelerate cluster init time
To accelerate cluster init time to 4-5 minutes, we need to build a custom Dataproc image that
already has the NVIDIA drivers and CUDA toolkit installed. In this section, we will use [these
instructions from GCP](https://cloud.google.com/dataproc/docs/guides/dataproc-images) to create a
custom image.

Currently, the [GPU Driver](https://github.com/GoogleCloudDataproc/initialization-actions/tree/master/gpu) initialization actions do the following:
1. Configure YARN, the YARN node manager, GPU isolation and GPU exclusive mode.
2. Install GPU drivers.

While step #1 is required at the time of cluster creation, step #2 can be done in advance. Let's
write a script to do that. `gpu_dataproc_packages.sh` will be used to create the Dataproc image:

```bash
#!/bin/bash

OS_NAME=$(lsb_release -is | tr '[:upper:]' '[:lower:]')
readonly OS_NAME
OS_DIST=$(lsb_release -cs)
readonly OS_DIST
CUDA_VERSION='10.2'
readonly CUDA_VERSION

readonly DEFAULT_NVIDIA_DEBIAN_GPU_DRIVER_VERSION='460.56'
readonly DEFAULT_NVIDIA_DEBIAN_GPU_DRIVER_URL="https://us.download.nvidia.com/XFree86/Linux-x86_64/${DEFAULT_NVIDIA_DEBIAN_GPU_DRIVER_VERSION}/NVIDIA-Linux-x86_64-${DEFAULT_NVIDIA_DEBIAN_GPU_DRIVER_VERSION}.run"

readonly NVIDIA_BASE_DL_URL='https://developer.download.nvidia.com/compute'

# Parameters for NVIDIA-provided Ubuntu GPU driver
readonly NVIDIA_UBUNTU_REPOSITORY_URL="${NVIDIA_BASE_DL_URL}/cuda/repos/ubuntu1804/x86_64"
readonly NVIDIA_UBUNTU_REPOSITORY_KEY="${NVIDIA_UBUNTU_REPOSITORY_URL}/7fa2af80.pub"
readonly NVIDIA_UBUNTU_REPOSITORY_CUDA_PIN="${NVIDIA_UBUNTU_REPOSITORY_URL}/cuda-ubuntu1804.pin"

function execute_with_retries() {
local -r cmd=$1
for ((i = 0; i < 10; i++)); do
if eval "$cmd"; then
return 0
fi
sleep 5
done
return 1
}

function install_nvidia_gpu_driver() {
curl -fsSL --retry-connrefused --retry 10 --retry-max-time 30 \
"${NVIDIA_UBUNTU_REPOSITORY_KEY}" | apt-key add -

curl -fsSL --retry-connrefused --retry 10 --retry-max-time 30 \
"${NVIDIA_UBUNTU_REPOSITORY_CUDA_PIN}" -o /etc/apt/preferences.d/cuda-repository-pin-600

add-apt-repository "deb ${NVIDIA_UBUNTU_REPOSITORY_URL} /"
execute_with_retries "apt-get update"

if [[ -n "${CUDA_VERSION}" ]]; then
local -r cuda_package=cuda-${CUDA_VERSION//./-}
else
local -r cuda_package=cuda
fi
# Without --no-install-recommends this takes a very long time.
execute_with_retries "apt-get install -y -q --no-install-recommends ${cuda_package}"

echo "NVIDIA GPU driver provided by NVIDIA was installed successfully"
}

function main() {

# updates
export DEBIAN_FRONTEND=noninteractive
execute_with_retries "apt-get update"
execute_with_retries "apt-get install -y -q pciutils"

execute_with_retries "apt-get install -y -q 'linux-headers-$(uname -r)'"

install_nvidia_gpu_driver
}

main
```

Google provides a `generate_custom_image.py` script that:
- Launches a temporary Compute Engine VM instance with the specified Dataproc base image.
- Then runs the customization script inside the VM instance to install custom packages and/or update
configurations.
- After the customization script finishes, it shuts down the VM instance and creates a Dataproc
custom image from the disk of the VM instance.
- The temporary VM is deleted after the custom image is created.
- The custom image is saved and can be used to create Dataproc clusters.

Copy the customization script above to a file called `gpu_dataproc_packages.sh`, then build the
image with Google's `generate_custom_image.py` script as shown below. This step may take 20-25
minutes to complete.

```bash
git clone https://github.com/GoogleCloudDataproc/custom-images
cd custom-images

export CUSTOMIZATION_SCRIPT=/path/to/gpu_dataproc_packages.sh
export ZONE=[Your Preferred GCP Zone]
export GCS_BUCKET=[Your GCS Bucket]
export IMAGE_NAME=sample-207-ubuntu18-gpu-t4
export DATAPROC_VERSION=2.0.7-ubuntu18
export GPU_NAME=nvidia-tesla-t4
export GPU_COUNT=2

python generate_custom_image.py \
--image-name $IMAGE_NAME \
--dataproc-version $DATAPROC_VERSION \
--customization-script $CUSTOMIZATION_SCRIPT \
--zone $ZONE \
--gcs-bucket $GCS_BUCKET \
--machine-type n1-highmem-32 \
--accelerator type=$GPU_NAME,count=$GPU_COUNT \
--disk-size 40
```

See [here](https://cloud.google.com/dataproc/docs/guides/dataproc-images#running_the_code) for more
details on the `generate_custom_image.py` script arguments and
[here](https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-versions) for a description of Dataproc versions.

The image `sample-207-ubuntu18-gpu-t4` is now ready and can be viewed in the GCP console under
`Compute Engine > Storage > Images`. The next step is to launch the cluster using this new image and
new initialization actions that skip the NVIDIA driver install, since the custom image already includes the drivers.
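
Before launching, you can confirm from the command line that the custom image exists; a quick check
with the gcloud CLI, assuming the image name used above:

```bash
# Lists the custom image if the build succeeded.
gcloud compute images list --filter="name=sample-207-ubuntu18-gpu-t4"
```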

Here is the new custom GPU initialization action that only configures YARN, the YARN node manager,
GPU isolation and GPU exclusive mode:

```bash
#!/bin/bash

# Dataproc configurations
readonly HADOOP_CONF_DIR='/etc/hadoop/conf'
readonly HIVE_CONF_DIR='/etc/hive/conf'
readonly SPARK_CONF_DIR='/etc/spark/conf'

function execute_with_retries() {
local -r cmd=$1
for ((i = 0; i < 10; i++)); do
if eval "$cmd"; then
return 0
fi
sleep 5
done
return 1
}

function set_hadoop_property() {
local -r config_file=$1
local -r property=$2
local -r value=$3
bdconfig set_property \
--configuration_file "${HADOOP_CONF_DIR}/${config_file}" \
--name "${property}" --value "${value}" \
--clobber
}

function configure_yarn() {
if [[ ! -f ${HADOOP_CONF_DIR}/resource-types.xml ]]; then
printf '<?xml version="1.0" ?>\n<configuration/>' >"${HADOOP_CONF_DIR}/resource-types.xml"
fi
set_hadoop_property 'resource-types.xml' 'yarn.resource-types' 'yarn.io/gpu'

set_hadoop_property 'capacity-scheduler.xml' \
'yarn.scheduler.capacity.resource-calculator' \
'org.apache.hadoop.yarn.util.resource.DominantResourceCalculator'

set_hadoop_property 'yarn-site.xml' 'yarn.resource-types' 'yarn.io/gpu'
}

function configure_yarn_nodemanager() {
set_hadoop_property 'yarn-site.xml' 'yarn.nodemanager.resource-plugins' 'yarn.io/gpu'
set_hadoop_property 'yarn-site.xml' \
'yarn.nodemanager.resource-plugins.gpu.allowed-gpu-devices' 'auto'
set_hadoop_property 'yarn-site.xml' \
'yarn.nodemanager.resource-plugins.gpu.path-to-discovery-executables' '/usr/bin'
set_hadoop_property 'yarn-site.xml' \
'yarn.nodemanager.linux-container-executor.cgroups.mount' 'true'
set_hadoop_property 'yarn-site.xml' \
'yarn.nodemanager.linux-container-executor.cgroups.mount-path' '/sys/fs/cgroup'
set_hadoop_property 'yarn-site.xml' \
'yarn.nodemanager.linux-container-executor.cgroups.hierarchy' 'yarn'
set_hadoop_property 'yarn-site.xml' \
'yarn.nodemanager.container-executor.class' \
'org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor'
set_hadoop_property 'yarn-site.xml' 'yarn.nodemanager.linux-container-executor.group' 'yarn'

# Fix local dirs access permissions
local yarn_local_dirs=()
readarray -d ',' yarn_local_dirs < <(bdconfig get_property_value \
--configuration_file "${HADOOP_CONF_DIR}/yarn-site.xml" \
--name "yarn.nodemanager.local-dirs" 2>/dev/null | tr -d '\n')
chown yarn:yarn -R "${yarn_local_dirs[@]/,/}"
}

function configure_gpu_exclusive_mode() {
# check if running spark 3, if not, enable GPU exclusive mode
local spark_version
spark_version=$(spark-submit --version 2>&1 | sed -n 's/.*version[[:blank:]]\+\([0-9]\+\.[0-9]\).*/\1/p' | head -n1)
if [[ ${spark_version} != 3.* ]]; then
# include exclusive mode on GPU
nvidia-smi -c EXCLUSIVE_PROCESS
fi
}

function configure_gpu_isolation() {
# Download GPU discovery script
local -r spark_gpu_script_dir='/usr/lib/spark/scripts/gpu'
mkdir -p ${spark_gpu_script_dir}
local -r gpu_resources_url=https://raw.githubusercontent.com/apache/spark/master/examples/src/main/scripts/getGpusResources.sh
curl -fsSL --retry-connrefused --retry 10 --retry-max-time 30 \
"${gpu_resources_url}" -o ${spark_gpu_script_dir}/getGpusResources.sh
chmod a+rwx -R ${spark_gpu_script_dir}

# enable GPU isolation
sed -i "s/yarn.nodemanager\.linux\-container\-executor\.group\=/yarn\.nodemanager\.linux\-container\-executor\.group\=yarn/g" "${HADOOP_CONF_DIR}/container-executor.cfg"
printf '\n[gpu]\nmodule.enabled=true\n[cgroups]\nroot=/sys/fs/cgroup\nyarn-hierarchy=yarn\n' >>"${HADOOP_CONF_DIR}/container-executor.cfg"

chmod a+rwx -R /sys/fs/cgroup/cpu,cpuacct
chmod a+rwx -R /sys/fs/cgroup/devices
}

function main() {

# This configuration should run on all nodes regardless of attached GPUs
configure_yarn

configure_yarn_nodemanager

configure_gpu_isolation

configure_gpu_exclusive_mode

}

main
```

Save this script as `custom_gpu_init_actions.sh` (the file name referenced by the
`--initialization-actions` flag below) and upload it to your own GCS bucket.
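
A minimal upload sketch with `gsutil`, assuming the script was saved locally under that name and
`$GCS_BUCKET` points at your bucket (as in the block below):

```bash
# Upload the init action so the cluster create command below can reference it.
gsutil cp custom_gpu_init_actions.sh gs://$GCS_BUCKET/
```

Now launch the cluster: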

```bash
export REGION=[Your Preferred GCP Region]
export GCS_BUCKET=[Your GCS Bucket]
export CLUSTER_NAME=[Your Cluster Name]
export NUM_GPUS=2
export NUM_WORKERS=2

gcloud dataproc clusters create $CLUSTER_NAME \
--region $REGION \
--image=sample-207-ubuntu18-gpu-t4 \
--master-machine-type n1-highmem-32 \
--master-accelerator type=nvidia-tesla-t4,count=$NUM_GPUS \
--num-workers $NUM_WORKERS \
--worker-accelerator type=nvidia-tesla-t4,count=$NUM_GPUS \
--worker-machine-type n1-highmem-32 \
--num-worker-local-ssds 4 \
--initialization-actions gs://$GCS_BUCKET/custom_gpu_init_actions.sh,gs://goog-dataproc-initialization-actions-${REGION}/rapids/rapids.sh \
--optional-components=JUPYTER,ZEPPELIN \
--metadata gpu-driver-provider="NVIDIA" \
--metadata rapids-runtime=SPARK \
--bucket $GCS_BUCKET \
--enable-component-gateway
```

The new cluster should be up and running within 4-5 minutes!
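
To sanity-check that the GPUs are visible, you can SSH into the master node (Dataproc names it
`${CLUSTER_NAME}-m`) and run `nvidia-smi`; a minimal sketch, assuming `$ZONE` is set to the zone
the cluster landed in:

```bash
# Should list the two T4 GPUs attached to the master node.
gcloud compute ssh ${CLUSTER_NAME}-m --zone=$ZONE --command="nvidia-smi"
```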
