Run the pandas udf using cudf on Databricks #2061

Merged (2 commits, Apr 5, 2021)
9 changes: 8 additions & 1 deletion jenkins/databricks/clusterutils.py
@@ -1,4 +1,4 @@
-# Copyright (c) 2020, NVIDIA CORPORATION.
+# Copyright (c) 2020-2021, NVIDIA CORPORATION.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -46,6 +46,13 @@ def generate_create_templ(sshKey, cluster_name, runtime, idle_timeout,
     templ['driver_node_type_id'] = driver_node_type
     templ['ssh_public_keys'] = [ sshKey ]
     templ['num_workers'] = num_workers
+    templ['init_scripts'] = [
+        {
+            "dbfs": {
+                "destination": "dbfs:/databricks/init_scripts/init_cudf_udf.sh"
+            }
+        }
+    ]
     return templ


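For context, the template returned by generate_create_templ is ultimately submitted as a cluster-create request, and the new init_scripts entry is what wires the DBFS script into every node's startup. A minimal sketch of what the relevant fragment could look like on the wire via the Databricks REST API; DB_HOST, DB_TOKEN, and the field values are illustrative assumptions, and required fields such as spark_version are omitted for brevity:

    # Hypothetical sketch only: the real request body is built by generate_create_templ.
    curl -s -X POST "https://${DB_HOST}/api/2.0/clusters/create" \
      -H "Authorization: Bearer ${DB_TOKEN}" \
      -d '{
            "cluster_name": "cudf-udf-ci",
            "num_workers": 1,
            "init_scripts": [
              { "dbfs": { "destination": "dbfs:/databricks/init_scripts/init_cudf_udf.sh" } }
            ]
          }'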
30 changes: 30 additions & 0 deletions jenkins/databricks/init_cudf_udf.sh
@@ -0,0 +1,30 @@
#!/bin/bash
#
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# The init script sets up the environment for the cudf_udf tests on Databricks.
# It will be automatically pushed to dbfs:/databricks/init_scripts once it is updated.

CUDF_VER=${CUDF_VER:-0.19}

# Use mamba to install cudf-udf packages to speed up conda resolve time
base=$(conda info --base)
conda create -y -n mamba -c conda-forge mamba
pip uninstall -y pyarrow
@tgravescs (Collaborator) commented on Apr 1, 2021:

Is the version of pyarrow too old, or why uninstall it here?

@NvTimLiu (Collaborator, Author) commented on Apr 2, 2021:

In fact the pyarrow version is the same (1.0.1). The init script is from Zhu Hao's Confluence page, and I was told that if we do not uninstall pyarrow first, there are dependency issues when running "{base}/envs/mamba/bin/mamba remove -y c-ares zstd libprotobuf pandas".

I tried removing 'pip uninstall -y pyarrow' here and got the Python-JVM socket connection error below; it looks like mismatched socket parameters.

Caused by: java.net.SocketException: Connection reset
at java.net.SocketInputStream.read(SocketInputStream.java:210)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at java.net.SocketInputStream.read(SocketInputStream.java:224)
at java.io.DataInputStream.readInt(DataInputStream.java:387)

${base}/envs/mamba/bin/mamba remove -y c-ares zstd libprotobuf pandas
${base}/envs/mamba/bin/mamba install -y pyarrow=1.0.1 -c conda-forge
${base}/envs/mamba/bin/mamba install -y -c rapidsai -c rapidsai-nightly -c nvidia -c conda-forge -c defaults cudf=$CUDF_VER cudatoolkit=10.1
conda env remove -n mamba
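As the header comment notes, the script must exist at dbfs:/databricks/init_scripts for the cluster template above to pick it up. A minimal sketch of pushing it there by hand, assuming a configured (legacy) Databricks CLI; the CLI invocation is an assumption, not part of this PR:

    # Assumes the Databricks CLI is installed and authenticated against the workspace.
    databricks fs cp --overwrite jenkins/databricks/init_cudf_udf.sh \
        dbfs:/databricks/init_scripts/init_cudf_udf.sh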
17 changes: 16 additions & 1 deletion jenkins/databricks/test.sh
@@ -33,10 +33,25 @@ sudo chmod 777 /databricks/data/logs/
sudo chmod 777 /databricks/data/logs/*
echo { \"port\":\"15002\" } > ~/.databricks-connect

CUDF_UDF_TEST_ARGS="--conf spark.python.daemon.module=rapids.daemon_databricks \
--conf spark.rapids.memory.gpu.allocFraction=0.1 \
Collaborator commented:

I assume this setting still works for the rest of the tests; did we see a change in runtime at all?

Collaborator commented:

Oh, never mind, I see below that we run them separately.

Collaborator commented:

It might be nice to rename these to CUDF_UDF_TEST_ARGS.

Collaborator (Author) commented:

Good suggestion, let me change it.

--conf spark.rapids.python.memory.gpu.allocFraction=0.1 \
--conf spark.rapids.python.concurrentPythonWorkers=2"

if [ -d "$LOCAL_JAR_PATH" ]; then
Member commented:

Seems like we could reduce the redundancy of the two code paths here with something like this:

export LOCAL_JAR_PATH=${LOCAL_JAR_PATH:-/home/ubuntu/spark-rapids}
bash $LOCAL_JAR_PATH/integration_tests/run_pyspark_from_build.sh  --runtime_env="databricks"

## Run cudf-udf tests
PYTHON_JARS=$(ls $LOCAL_JAR_PATH/rapids-4-spark_*.jar | grep -v tests.jar)
SPARK_SUBMIT_FLAGS="$SPARK_SUBMIT_FLAGS --conf spark.executorEnv.PYTHONPATH="$PYTHON_JARS"
SPARK_SUBMIT_FLAGS=$SPARK_SUBMIT_FLAGS TEST_PARALLEL=1 \
        bash $LOCAL_JAR_PATH/integration_tests/run_pyspark_from_build.sh --runtime_env="databricks" -m "cudf_udf" --cudf_udf

@NvTimLiu (Collaborator, Author) commented on Apr 2, 2021:
We now use the common script run_pyspark_from_build.sh to run both the DB nightly-build pipeline and the DB nightly-test pipeline.

For the nightly-build pipeline, the rapids jars are built into sub-directories (e.g. dist/target/, target/, udf-examples/target/) instead of the base directory /home/ubuntu/spark-rapids, so export LOCAL_JAR_PATH=/home/ubuntu/spark-rapids will NOT work for the nightly-build case. We depend on the lines below to set the test jar paths instead of exporting LOCAL_JAR_PATH:
https://github.com/NVIDIA/spark-rapids/blob/branch-0.5/integration_tests/run_pyspark_from_build.sh#L35--L38

For the nightly-IT pipeline, all the test jars are downloaded from the dependency repo into the base directory /home/ubuntu/, so export LOCAL_JAR_PATH=/home/ubuntu works in this case.

To reduce the redundancy here and make LOCAL_JAR_PATH common, we could copy the nightly-build jars into the base directory /home/ubuntu/. I'd prefer copying the jars for the nightly-build pipeline as below; it makes the test script common, too. @jlowe @tgravescs What's your suggestion?

LOCAL_JAR_PATH=/home/ubuntu

# Copy jars for the IT scripts' LOCAL_JAR_PATH
cp /home/ubuntu/spark-rapids/integration_tests/target/dependency/cudf-*.jar \
   /home/ubuntu/spark-rapids/dist/target/rapids-4-spark_*.jar \
   /home/ubuntu/spark-rapids/integration_tests/target/rapids-4-spark-integration-tests*.jar \
   /home/ubuntu/spark-rapids/udf-examples/target/rapids-4-spark-udf-examples*.jar \
   $LOCAL_JAR_PATH

Member commented:

If it's complicated to commonize this code, that's fine; let's leave that suggestion out of this PR. We can do a follow-up PR to tackle it.

## Run tests with jars in the LOCAL_JAR_PATH dir downloaded from the dependency repo
LOCAL_JAR_PATH=$LOCAL_JAR_PATH bash $LOCAL_JAR_PATH/integration_tests/run_pyspark_from_build.sh --runtime_env="databricks"

## Run cudf-udf tests
CUDF_UDF_TEST_ARGS="$CUDF_UDF_TEST_ARGS --conf spark.executorEnv.PYTHONPATH=`ls $LOCAL_JAR_PATH/rapids-4-spark_*.jar | grep -v 'tests.jar'`"
LOCAL_JAR_PATH=$LOCAL_JAR_PATH SPARK_SUBMIT_FLAGS=$CUDF_UDF_TEST_ARGS TEST_PARALLEL=1 \
bash $LOCAL_JAR_PATH/integration_tests/run_pyspark_from_build.sh --runtime_env="databricks" -m "cudf_udf" --cudf_udf
else
## Run tests with jars building from the spark-rapids source code
bash /home/ubuntu/spark-rapids/integration_tests/run_pyspark_from_build.sh --runtime_env="databricks"

## Run cudf-udf tests
CUDF_UDF_TEST_ARGS="$CUDF_UDF_TEST_ARGS --conf spark.executorEnv.PYTHONPATH=`ls /home/ubuntu/spark-rapids/dist/target/rapids-4-spark_*.jar | grep -v 'tests.jar'`"
SPARK_SUBMIT_FLAGS=$CUDF_UDF_TEST_ARGS TEST_PARALLEL=1 \
bash /home/ubuntu/spark-rapids/integration_tests/run_pyspark_from_build.sh --runtime_env="databricks" -m "cudf_udf" --cudf_udf
fi
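For reference, the two branches above correspond roughly to these invocations; the paths are the defaults the script assumes, and this sketch is illustrative rather than part of the PR:

    # Test jars pre-downloaded from the dependency repo into LOCAL_JAR_PATH:
    LOCAL_JAR_PATH=/home/ubuntu bash jenkins/databricks/test.sh

    # Test jars built from source under /home/ubuntu/spark-rapids (LOCAL_JAR_PATH unset):
    bash jenkins/databricks/test.sh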