
Run the pandas udf using cudf on Databricks #2061

Merged · 2 commits into NVIDIA:branch-0.5 · Apr 5, 2021

Conversation

NvTimLiu
Collaborator

@NvTimLiu NvTimLiu commented Apr 1, 2021

Fixes #2026

Add the init script to set up the environment for the cudf_udf tests on Databricks, and run the cudf-udf tests nightly.

Signed-off-by: Tim Liu timl@nvidia.com

Issue: 2026

Add the init script to set up the environment for the cudf_udf tests on Databricks

Run cudf-udf test cases nightly

Signed-off-by: Tim Liu <timl@nvidia.com>
@NvTimLiu NvTimLiu added the build Related to CI / CD or cleanly building label Apr 1, 2021
@NvTimLiu NvTimLiu self-assigned this Apr 1, 2021
@NvTimLiu NvTimLiu requested a review from pxLi April 1, 2021 10:52
@NvTimLiu
Collaborator Author

NvTimLiu commented Apr 1, 2021

build

@NvTimLiu
Collaborator Author

NvTimLiu commented Apr 1, 2021

Tests PASS on the Databricks nightly pipeline and the integration pipelines.

# Use mamba to install cudf-udf packages to speed up conda resolve time
base=$(conda info --base)
conda create -y -n mamba -c conda-forge mamba
pip uninstall -y pyarrow
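The snippet above only creates the mamba env; the rest of the init script is not shown in this excerpt. As a rough sketch of how that env might then be used (the `install` channels and package list here are assumptions, not the actual script — only the `remove` command is quoted from the discussion below), with the commands built as strings so the sketch can be dry-run without conda installed:

```shell
#!/bin/bash
# Sketch, not the real init script: later steps would call the mamba binary
# inside the freshly created env to manage the cudf packages.
base=${base:-/databricks/conda}          # placeholder base path for this dry run
MAMBA="${base}/envs/mamba/bin/mamba"

# Command quoted from the review discussion in this thread:
remove_cmd="$MAMBA remove -y c-ares zstd libprotobuf pandas"
# Hypothetical install step (channels and package spec are assumptions):
install_cmd="$MAMBA install -y -c rapidsai -c nvidia -c conda-forge cudf"

# Dry-run: print the commands instead of executing them.
echo "$remove_cmd"
echo "$install_cmd"
```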
Collaborator

@tgravescs tgravescs Apr 1, 2021


Is the version of pyarrow too old, or why uninstall it here?

Collaborator Author

@NvTimLiu NvTimLiu Apr 2, 2021


In fact the pyarrow version is the same (1.0.1). The init script is from Zhu Hao's Confluence page, and I was told that if we do not uninstall pyarrow first, there are dependency issues when running `{base}/envs/mamba/bin/mamba remove -y c-ares zstd libprotobuf pandas`.

I tried removing `pip uninstall -y pyarrow` and got the Python-JVM socket connection error below; it looks like mismatched socket parameters between the two sides.

Caused by: java.net.SocketException: Connection reset
at java.net.SocketInputStream.read(SocketInputStream.java:210)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at java.net.SocketInputStream.read(SocketInputStream.java:224)
at java.io.DataInputStream.readInt(DataInputStream.java:387)
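Since the root cause surfaces only as an opaque "Connection reset", a guard in the init script could catch the drift earlier. This is a sketch of my own (an assumption, not part of the PR): a helper that compares the installed pyarrow version against the pinned one mentioned in this thread.

```shell
# Sketch: fail fast on pyarrow version drift instead of debugging a
# "Connection reset" between the Python workers and the JVM.
EXPECTED_PYARROW="1.0.1"   # version mentioned in this thread

check_pyarrow_version() {
    local installed="$1"
    if [ "$installed" != "$EXPECTED_PYARROW" ]; then
        echo "MISMATCH: pyarrow $installed != $EXPECTED_PYARROW"
        return 1
    fi
    echo "OK: pyarrow $installed"
}

# On a live cluster one would feed it the real version, e.g.:
#   check_pyarrow_version "$(python -c 'import pyarrow; print(pyarrow.__version__)')"
check_pyarrow_version "1.0.1"
```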

@@ -33,10 +33,25 @@ sudo chmod 777 /databricks/data/logs/
sudo chmod 777 /databricks/data/logs/*
echo { \"port\":\"15002\" } > ~/.databricks-connect

SPARK_SUBMIT_FLAGS="--conf spark.python.daemon.module=rapids.daemon_databricks \
--conf spark.rapids.memory.gpu.allocFraction=0.1 \
Collaborator

I assume this setting still works for the rest of the tests, did we see a change in runtime at all?

Collaborator

oh never mind, I see below we run them separate

Collaborator

it might be nice to rename these to be CUDF_UDF_TEST_ARGS

Collaborator Author

Good suggestion, let me change it
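A sketch of what the rename suggested above could look like (the exact variable contents are taken from the flags shown in this diff; the invocation shape in the comment is an assumption):

```shell
# Keep the cudf-udf-specific Spark confs in their own variable so they
# don't leak into the other test runs.
CUDF_UDF_TEST_ARGS="--conf spark.python.daemon.module=rapids.daemon_databricks \
  --conf spark.rapids.memory.gpu.allocFraction=0.1 \
  --conf spark.rapids.python.memory.gpu.allocFraction=0.1 \
  --conf spark.rapids.python.concurrentPythonWorkers=2"

# The cudf-udf run would then pass it explicitly, e.g. (shape assumed):
#   SPARK_SUBMIT_FLAGS="$CUDF_UDF_TEST_ARGS" TEST_PARALLEL=1 \
#       bash integration_tests/run_pyspark_from_build.sh --runtime_env="databricks" -m "cudf_udf" --cudf_udf
echo "$CUDF_UDF_TEST_ARGS"
```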

--conf spark.rapids.memory.gpu.allocFraction=0.1 \
--conf spark.rapids.python.memory.gpu.allocFraction=0.1 \
--conf spark.rapids.python.concurrentPythonWorkers=2"

if [ -d "$LOCAL_JAR_PATH" ]; then
Member

Seems like we could reduce the redundancy of the two code paths here with something like this:

export LOCAL_JAR_PATH=${LOCAL_JAR_PATH:-/home/ubuntu/spark-rapids}
bash $LOCAL_JAR_PATH/integration_tests/run_pyspark_from_build.sh  --runtime_env="databricks"

## Run cudf-udf tests
PYTHON_JARS=$(ls $LOCAL_JAR_PATH/rapids-4-spark_*.jar | grep -v tests.jar)
SPARK_SUBMIT_FLAGS="$SPARK_SUBMIT_FLAGS --conf spark.executorEnv.PYTHONPATH=$PYTHON_JARS"
SPARK_SUBMIT_FLAGS="$SPARK_SUBMIT_FLAGS" TEST_PARALLEL=1 \
        bash $LOCAL_JAR_PATH/integration_tests/run_pyspark_from_build.sh --runtime_env="databricks" -m "cudf_udf" --cudf_udf

Collaborator Author

@NvTimLiu NvTimLiu Apr 2, 2021

We now use the common script run_pyspark_from_build.sh to run both the DB nightly-build pipeline and the DB nightly-test pipeline.

For the nightly-build pipeline, the rapids jars are built into sub-directories (e.g. dist/target/, target/, udf-examples/target/) instead of the base dir /home/ubuntu/spark-rapids. So export LOCAL_JAR_PATH=/home/ubuntu/spark-rapids will NOT work for the nightly-build case; we depend on the scripts below to set the test jar paths instead of export LOCAL_JAR_PATH:
https://github.com/NVIDIA/spark-rapids/blob/branch-0.5/integration_tests/run_pyspark_from_build.sh#L35--L38

For the nightly-IT pipeline, all the test jars are downloaded from the dependency repo into the base dir /home/ubuntu/, so export LOCAL_JAR_PATH=/home/ubuntu works in this case.

To reduce the redundancy here and make LOCAL_JAR_PATH common, we could copy the nightly-build jars into the base dir /home/ubuntu/. I'd prefer copying the jars for the nightly-build pipeline as below; it makes the test script common, too. @jlowe @tgravescs What's your suggestion?

LOCAL_JAR_PATH=/home/ubuntu

# Copy jars into the IT script's LOCAL_JAR_PATH
cp /home/ubuntu/spark-rapids/integration_tests/target/dependency/cudf-*.jar \
   /home/ubuntu/spark-rapids/dist/target/rapids-4-spark_*.jar \
   /home/ubuntu/spark-rapids/integration_tests/target/rapids-4-spark-integration-tests*.jar \
   /home/ubuntu/spark-rapids/udf-examples/target/rapids-4-spark-udf-examples*.jar $LOCAL_JAR_PATH

Member

If it's complicated to commonize this code, that's fine let's leave that suggestion out of this PR. We can do a followup PR to tackle it.

@NvTimLiu
Copy link
Collaborator Author

NvTimLiu commented Apr 5, 2021

build

@NvTimLiu NvTimLiu merged commit 12d84c8 into NVIDIA:branch-0.5 Apr 5, 2021
@sameerz sameerz changed the title Run the pands udf using cudf on Databricks Run the pandas udf using cudf on Databricks Apr 10, 2021
nartal1 pushed a commit to nartal1/spark-rapids that referenced this pull request Jun 9, 2021
* Run the pands udf using cudf on Databricks

Issue: 2026

Add the init script to set up environment for the cudf_udf tests on Databrcks

Run cudf-udf test cases nightly

Signed-off-by: Tim Liu <timl@nvidia.com>

* Update, user 'CUDF_UDF_TEST_ARGS'
Successfully merging this pull request may close these issues.

[TEST] run the pandas udf using cudf on Databricks