Run the pandas udf using cudf on Databricks #2061
Conversation
Issue: #2026. Add the init script to set up the environment for the cudf_udf tests on Databricks; run cudf-udf test cases nightly. Signed-off-by: Tim Liu <timl@nvidia.com>
build
Test PASS on the Databricks nightly pipeline and integration pipelines
# Use mamba to install cudf-udf packages to speed up conda resolve time
base=$(conda info --base)
conda create -y -n mamba -c conda-forge mamba
pip uninstall -y pyarrow
Is the version of pyarrow too old, or why uninstall it here?
In fact the pyarrow version is the same (1.0.1). The init script is from Zhu Hao's Confluence page, and I was told that if we do not uninstall pyarrow first, there will be dependency issues when running "{base}/envs/mamba/bin/mamba" remove -y c-ares zstd libprotobuf pandas.
I tried removing 'pip uninstall -y pyarrow' here and got the Python-JVM socket connection error below; it looks like the socket parameters end up mismatched.
Caused by: java.net.SocketException: Connection reset
at java.net.SocketInputStream.read(SocketInputStream.java:210)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at java.net.SocketInputStream.read(SocketInputStream.java:224)
at java.io.DataInputStream.readInt(DataInputStream.java:387)
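For reference, the ordering described in the reply can be sketched as follows. This is only a sketch based on the init-script lines quoted above; it assumes the Databricks runtime ships a pip-installed pyarrow 1.0.1 that conflicts with the later conda package removal:

```
# Sketch of the setup order: the pip-installed pyarrow is removed
# BEFORE mamba removes the conda packages; skipping that step leads
# to dependency conflicts and the socket error shown above.
base=$(conda info --base)
conda create -y -n mamba -c conda-forge mamba
pip uninstall -y pyarrow                                   # must run first
"${base}/envs/mamba/bin/mamba" remove -y c-ares zstd libprotobuf pandas
```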
@@ -33,10 +33,25 @@ sudo chmod 777 /databricks/data/logs/
sudo chmod 777 /databricks/data/logs/*
echo { \"port\":\"15002\" } > ~/.databricks-connect

SPARK_SUBMIT_FLAGS="--conf spark.python.daemon.module=rapids.daemon_databricks \
--conf spark.rapids.memory.gpu.allocFraction=0.1 \
I assume this setting still works for the rest of the tests; did we see a change in runtime at all?
Oh, never mind, I see below that we run them separately.
It might be nice to rename these to CUDF_UDF_TEST_ARGS.
Good suggestion, let me change it
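A minimal sketch of the proposed rename. The variable name and flag values are taken from the diff in this thread; the surrounding script that consumes the variable is assumed:

```shell
# Hypothetical rename: give the cudf-udf-specific flags their own variable
# so they are not mixed up with the flags used by the rest of the tests.
CUDF_UDF_TEST_ARGS="--conf spark.rapids.memory.gpu.allocFraction=0.1 \
--conf spark.rapids.python.memory.gpu.allocFraction=0.1 \
--conf spark.rapids.python.concurrentPythonWorkers=2"

echo "$CUDF_UDF_TEST_ARGS"
```

Inside double quotes a backslash-newline is a line continuation, so the three `--conf` flags end up as a single string that can be appended to the spark-submit invocation.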
--conf spark.rapids.memory.gpu.allocFraction=0.1 \
--conf spark.rapids.python.memory.gpu.allocFraction=0.1 \
--conf spark.rapids.python.concurrentPythonWorkers=2"

if [ -d "$LOCAL_JAR_PATH" ]; then |
Seems like we could reduce the redundancy of the two code paths here with something like this:
export LOCAL_JAR_PATH=${LOCAL_JAR_PATH:-/home/ubuntu/spark-rapids}
bash $LOCAL_JAR_PATH/integration_tests/run_pyspark_from_build.sh --runtime_env="databricks"
## Run cudf-udf tests
PYTHON_JARS=$(ls $LOCAL_JAR_PATH/rapids-4-spark_*.jar | grep -v tests.jar)
SPARK_SUBMIT_FLAGS="$SPARK_SUBMIT_FLAGS --conf spark.executorEnv.PYTHONPATH=$PYTHON_JARS"
SPARK_SUBMIT_FLAGS=$SPARK_SUBMIT_FLAGS TEST_PARALLEL=1 \
bash $LOCAL_JAR_PATH/integration_tests/run_pyspark_from_build.sh --runtime_env="databricks" -m "cudf_udf" --cudf_udf
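The suggestion above leans on shell default expansion. A tiny standalone demo of the `${VAR:-default}` idiom, using the same paths as the snippet (the paths themselves are just illustrative here):

```shell
# If LOCAL_JAR_PATH is unset or empty, fall back to the build checkout dir.
unset LOCAL_JAR_PATH
export LOCAL_JAR_PATH=${LOCAL_JAR_PATH:-/home/ubuntu/spark-rapids}
echo "$LOCAL_JAR_PATH"    # prints /home/ubuntu/spark-rapids

# An already-set, non-empty value wins over the default.
LOCAL_JAR_PATH=/home/ubuntu
echo "${LOCAL_JAR_PATH:-/home/ubuntu/spark-rapids}"    # prints /home/ubuntu
```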
We now use the common script run_pyspark_from_build.sh to run both the DB nightly-build pipeline and the DB nightly-test pipeline.
For the nightly-build pipeline, the rapids jars are built into sub-dirs (e.g. dist/target/, target/, udf-examples/target) instead of the basedir /home/ubuntu/spark-rapids. So export LOCAL_JAR_PATH=/home/ubuntu/spark-rapids will NOT work for the nightly-build case; we depend on the script below to set the test jar paths instead of exporting LOCAL_JAR_PATH:
https://github.com/NVIDIA/spark-rapids/blob/branch-0.5/integration_tests/run_pyspark_from_build.sh#L35--L38
For the nightly-IT pipeline, all the test jars are downloaded from the dependency repo into the basedir /home/ubuntu/, so export LOCAL_JAR_PATH=/home/ubuntu works in this case.
To reduce the redundancy here and make LOCAL_JAR_PATH common, we can copy the nightly-build jars into the basedir /home/ubuntu/. I'd prefer copying the jars for the nightly-build pipeline as below; it makes the test script common, too. @jlowe @tgravescs What's your suggestion?
LOCAL_JAR_PATH=/home/ubuntu
# Copy jars into the IT scripts' LOCAL_JAR_PATH
cp /home/ubuntu/spark-rapids/integration_tests/target/dependency/cudf-*.jar \
   /home/ubuntu/spark-rapids/dist/target/rapids-4-spark_*.jar \
   /home/ubuntu/spark-rapids/integration_tests/target/rapids-4-spark-integration-tests*.jar \
   /home/ubuntu/spark-rapids/udf-examples/target/rapids-4-spark-udf-examples*.jar $LOCAL_JAR_PATH
If it's complicated to commonize this code, that's fine let's leave that suggestion out of this PR. We can do a followup PR to tackle it.
build
* Run the pandas udf using cudf on Databricks. Issue: #2026. Add the init script to set up the environment for the cudf_udf tests on Databricks; run cudf-udf test cases nightly. Signed-off-by: Tim Liu <timl@nvidia.com> * Update: use 'CUDF_UDF_TEST_ARGS'
Fixes #2026
Add the init script to set up the environment for the cudf_udf tests on Databricks; run cudf-udf tests nightly.
Signed-off-by: Tim Liu timl@nvidia.com