Run the pandas UDF using cudf on Databricks #2061
Changes from all commits
@@ -0,0 +1,30 @@
#!/bin/bash
#
# Copyright (c) 2021, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# The init script to set up the environment for the cudf_udf tests on Databricks.
# It will be automatically pushed into dbfs:/databricks/init_scripts once it is updated.

CUDF_VER=${CUDF_VER:-0.19}

# Use mamba to install the cudf-udf packages to speed up the conda resolve time.
base=$(conda info --base)
conda create -y -n mamba -c conda-forge mamba
pip uninstall -y pyarrow
${base}/envs/mamba/bin/mamba remove -y c-ares zstd libprotobuf pandas
${base}/envs/mamba/bin/mamba install -y pyarrow=1.0.1 -c conda-forge
${base}/envs/mamba/bin/mamba install -y -c rapidsai -c rapidsai-nightly -c nvidia -c conda-forge -c defaults cudf=$CUDF_VER cudatoolkit=10.1
conda env remove -n mamba
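
(For reference: the comment above says the script is pushed to DBFS automatically once it is updated. Doing the same by hand can be done with the Databricks CLI; this is a minimal sketch assuming a configured CLI, and the local file name init_cudf_udf.sh is an assumption, not taken from this PR.)

# Hypothetical manual upload; the PR notes this normally happens automatically.
databricks fs cp --overwrite init_cudf_udf.sh dbfs:/databricks/init_scripts/init_cudf_udf.sh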
@@ -33,10 +33,25 @@ sudo chmod 777 /databricks/data/logs/
sudo chmod 777 /databricks/data/logs/*
echo { \"port\":\"15002\" } > ~/.databricks-connect

CUDF_UDF_TEST_ARGS="--conf spark.python.daemon.module=rapids.daemon_databricks \
    --conf spark.rapids.memory.gpu.allocFraction=0.1 \
    --conf spark.rapids.python.memory.gpu.allocFraction=0.1 \
    --conf spark.rapids.python.concurrentPythonWorkers=2"

if [ -d "$LOCAL_JAR_PATH" ]; then
    ## Run tests with the jars in the LOCAL_JAR_PATH dir downloaded from the dependency repo
    LOCAL_JAR_PATH=$LOCAL_JAR_PATH bash $LOCAL_JAR_PATH/integration_tests/run_pyspark_from_build.sh --runtime_env="databricks"

    ## Run cudf-udf tests
    CUDF_UDF_TEST_ARGS="$CUDF_UDF_TEST_ARGS --conf spark.executorEnv.PYTHONPATH=`ls $LOCAL_JAR_PATH/rapids-4-spark_*.jar | grep -v 'tests.jar'`"
    LOCAL_JAR_PATH=$LOCAL_JAR_PATH SPARK_SUBMIT_FLAGS=$CUDF_UDF_TEST_ARGS TEST_PARALLEL=1 \
        bash $LOCAL_JAR_PATH/integration_tests/run_pyspark_from_build.sh --runtime_env="databricks" -m "cudf_udf" --cudf_udf
else
    ## Run tests with the jars built from the spark-rapids source code
    bash /home/ubuntu/spark-rapids/integration_tests/run_pyspark_from_build.sh --runtime_env="databricks"

    ## Run cudf-udf tests
    CUDF_UDF_TEST_ARGS="$CUDF_UDF_TEST_ARGS --conf spark.executorEnv.PYTHONPATH=`ls /home/ubuntu/spark-rapids/dist/target/rapids-4-spark_*.jar | grep -v 'tests.jar'`"
    SPARK_SUBMIT_FLAGS=$CUDF_UDF_TEST_ARGS TEST_PARALLEL=1 \
        bash /home/ubuntu/spark-rapids/integration_tests/run_pyspark_from_build.sh --runtime_env="databricks" -m "cudf_udf" --cudf_udf
fi

Review comments

On the spark.rapids.memory.gpu.allocFraction setting:
"I assume this setting still works for the rest of the tests, did we see a change in runtime at all?"
"Oh, never mind, I see below we run them separately."
"It might be nice to rename these to be CUDF_UDF_TEST_ARGS."
Reply: "Good suggestion, let me change it."

On the if/else branches:
"Seems like we could reduce the redundancy of the two code paths here with something like this:

export LOCAL_JAR_PATH=${LOCAL_JAR_PATH:-/home/ubuntu/spark-rapids}
bash $LOCAL_JAR_PATH/integration_tests/run_pyspark_from_build.sh --runtime_env="databricks"

## Run cudf-udf tests
PYTHON_JARS=$(ls $LOCAL_JAR_PATH/rapids-4-spark_*.jar | grep -v tests.jar)
SPARK_SUBMIT_FLAGS="$SPARK_SUBMIT_FLAGS --conf spark.executorEnv.PYTHONPATH=$PYTHON_JARS"
SPARK_SUBMIT_FLAGS=$SPARK_SUBMIT_FLAGS TEST_PARALLEL=1 \
    bash $LOCAL_JAR_PATH/integration_tests/run_pyspark_from_build.sh --runtime_env="databricks" -m "cudf_udf" --cudf_udf"

Reply: "As we now use the common script run_pyspark_from_build.sh, the jar locations differ between the two pipelines. For the nightly-build pipeline, the rapids jars are built out into the sub-dirs (e.g. dist/target/, target/, udf-examples/target/) instead of the basedir. For the nightly-IT pipeline, all the test jars are downloaded from the dependency repo into the basedir. To reduce the redundancy here, we would need to make LOCAL_JAR_PATH=/home/ubuntu and copy the jars for the IT scripts into LOCAL_JAR_PATH, e.g. cp /home/ubuntu/spark-rapids/integration_tests/target/dependency/cudf-*.jar"
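
(To make the copy step in the reply above concrete, a sketch follows. The glob patterns, the set of jars, and the destination are assumptions, not spelled out in the thread.)

# Sketch of the jar-copy approach from the reply above; globs, jar list,
# and destination are assumptions, not taken verbatim from the thread.
export LOCAL_JAR_PATH=/home/ubuntu
cp /home/ubuntu/spark-rapids/integration_tests/target/dependency/cudf-*.jar "$LOCAL_JAR_PATH/"
cp /home/ubuntu/spark-rapids/dist/target/rapids-4-spark_*.jar "$LOCAL_JAR_PATH/"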
Reply: "If it's complicated to commonize this code, that's fine; let's leave that suggestion out of this PR. We can do a follow-up PR to tackle it."
On the pip uninstall -y pyarrow line in the init script:
"Is the version of pyarrow too old, or why uninstall it here?"
Reply: "In fact the pyarrow version is the same (1.0.1). The init script is from Zhu Hao's confluence page, and I was told that if we do not uninstall pyarrow first, there will be some dependency issues when running ${base}/envs/mamba/bin/mamba remove -y c-ares zstd libprotobuf pandas. I tried removing 'pip uninstall -y pyarrow' here and got a python-jvm socket connection error as below, presumably from mismatched socket parameters.

Caused by: java.net.SocketException: Connection reset
    at java.net.SocketInputStream.read(SocketInputStream.java:210)
    at java.net.SocketInputStream.read(SocketInputStream.java:141)
    at java.net.SocketInputStream.read(SocketInputStream.java:224)
    at java.io.DataInputStream.readInt(DataInputStream.java:387)"
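
(Given that discussion, a quick check after the init script runs can confirm the environment ended up with the versions the script pins. This is a hypothetical sanity check, not part of the PR.)

# Hypothetical sanity check, not part of the PR: confirm the reinstalled
# packages resolve to the versions the init script pins.
python -c "import pyarrow; print(pyarrow.__version__)"  # expected: 1.0.1
python -c "import cudf; print(cudf.__version__)"        # expected: $CUDF_VER (0.19 by default)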