Run the pandas udf using cudf on Databricks #2061
Conversation
Issue: #2026. Add the init script to set up the environment for the cudf_udf tests on Databricks; run cudf-udf test cases nightly. Signed-off-by: Tim Liu <timl@nvidia.com>
build
Test PASS on the Databricks nightly pipeline and integration pipelines
# Use mamba to install cudf-udf packages to speed up conda resolve time
base=$(conda info --base)
conda create -y -n mamba -c conda-forge mamba
pip uninstall -y pyarrow
Is the version of pyarrow too old, or why uninstall it here?
In fact the pyarrow version is the same (1.0.1). The init script is from Zhu Hao's Confluence page, and I was told that if we do not uninstall pyarrow first, there will be dependency issues when running "{base}/envs/mamba/bin/mamba" remove -y c-ares zstd libprotobuf pandas.
I tried removing 'pip uninstall -y pyarrow' here and got the Python-JVM socket connection error below; it looks like the socket parameters end up mismatched.
Caused by: java.net.SocketException: Connection reset
at java.net.SocketInputStream.read(SocketInputStream.java:210)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at java.net.SocketInputStream.read(SocketInputStream.java:224)
at java.io.DataInputStream.readInt(DataInputStream.java:387)
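For reference, the ordering described in the reply can be sketched as follows. This is only a sketch based on the init-script lines quoted above; it assumes the Databricks runtime ships a pip-installed pyarrow 1.0.1 that conflicts with the later conda package removal:

```
# Sketch of the setup order: the pip-installed pyarrow is removed
# BEFORE mamba removes the conda packages; skipping that step leads
# to dependency conflicts and the socket error shown above.
base=$(conda info --base)
conda create -y -n mamba -c conda-forge mamba
pip uninstall -y pyarrow                                   # must run first
"${base}/envs/mamba/bin/mamba" remove -y c-ares zstd libprotobuf pandas
```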
@@ -33,10 +33,25 @@ sudo chmod 777 /databricks/data/logs/
sudo chmod 777 /databricks/data/logs/*
echo { \"port\":\"15002\" } > ~/.databricks-connect

SPARK_SUBMIT_FLAGS="--conf spark.python.daemon.module=rapids.daemon_databricks \
--conf spark.rapids.memory.gpu.allocFraction=0.1 \
I assume this setting still works for the rest of the tests; did we see a change in runtime at all?
Oh, never mind, I see below that we run them separately.
It might be nice to rename these to CUDF_UDF_TEST_ARGS.
Good suggestion, let me change it
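A minimal sketch of the proposed rename. The variable name and flag values are taken from the diff in this thread; the surrounding script that consumes the variable is assumed:

```shell
# Hypothetical rename: give the cudf-udf-specific flags their own variable
# so they are not mixed up with the flags used by the rest of the tests.
CUDF_UDF_TEST_ARGS="--conf spark.rapids.memory.gpu.allocFraction=0.1 \
--conf spark.rapids.python.memory.gpu.allocFraction=0.1 \
--conf spark.rapids.python.concurrentPythonWorkers=2"

echo "$CUDF_UDF_TEST_ARGS"
```

Inside double quotes a backslash-newline is a line continuation, so the three `--conf` flags end up as a single string that can be appended to the spark-submit invocation.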
--conf spark.rapids.memory.gpu.allocFraction=0.1 \
--conf spark.rapids.python.memory.gpu.allocFraction=0.1 \
--conf spark.rapids.python.concurrentPythonWorkers=2"

if [ -d "$LOCAL_JAR_PATH" ]; then |
Seems like we could reduce the redundancy of the two code paths here with something like this:
export LOCAL_JAR_PATH=${LOCAL_JAR_PATH:-/home/ubuntu/spark-rapids}
bash $LOCAL_JAR_PATH/integration_tests/run_pyspark_from_build.sh --runtime_env="databricks"
## Run cudf-udf tests
PYTHON_JARS=$(ls $LOCAL_JAR_PATH/rapids-4-spark_*.jar | grep -v tests.jar)
SPARK_SUBMIT_FLAGS="$SPARK_SUBMIT_FLAGS --conf spark.executorEnv.PYTHONPATH=$PYTHON_JARS"
SPARK_SUBMIT_FLAGS=$SPARK_SUBMIT_FLAGS TEST_PARALLEL=1 \
bash $LOCAL_JAR_PATH/integration_tests/run_pyspark_from_build.sh --runtime_env="databricks" -m "cudf_udf" --cudf_udf
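The suggestion above leans on shell default expansion. A tiny standalone demo of the `${VAR:-default}` idiom, using the same paths as the snippet (the paths themselves are just illustrative here):

```shell
# If LOCAL_JAR_PATH is unset or empty, fall back to the build checkout dir.
unset LOCAL_JAR_PATH
export LOCAL_JAR_PATH=${LOCAL_JAR_PATH:-/home/ubuntu/spark-rapids}
echo "$LOCAL_JAR_PATH"    # prints /home/ubuntu/spark-rapids

# An already-set, non-empty value wins over the default.
LOCAL_JAR_PATH=/home/ubuntu
echo "${LOCAL_JAR_PATH:-/home/ubuntu/spark-rapids}"    # prints /home/ubuntu
```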
We now use the common script run_pyspark_from_build.sh to run both the DB nightly-build pipeline and the DB nightly-test pipeline.
For the nightly-build pipeline, the rapids jars are built into sub-dirs (e.g. dist/target/, target/, udf-examples/target) instead of the basedir /home/ubuntu/spark-rapids. So export LOCAL_JAR_PATH=/home/ubuntu/spark-rapids will NOT work for the nightly-build case; we depend on the script below to set the test jar paths instead of exporting LOCAL_JAR_PATH:
https://github.com/NVIDIA/spark-rapids/blob/branch-0.5/integration_tests/run_pyspark_from_build.sh#L35--L38
For the nightly-IT pipeline, all the test jars are downloaded from the dependency repo into the basedir /home/ubuntu/, so export LOCAL_JAR_PATH=/home/ubuntu works in this case.
To reduce the redundancy here and make LOCAL_JAR_PATH common, we can copy the nightly-build jars into the basedir /home/ubuntu/. I'd prefer copying the jars for the nightly-build pipeline as below; it makes the test script common, too. @jlowe @tgravescs What's your suggestion?
LOCAL_JAR_PATH=/home/ubuntu
# Copy jars into the IT scripts' LOCAL_JAR_PATH
cp /home/ubuntu/spark-rapids/integration_tests/target/dependency/cudf-*.jar \
   /home/ubuntu/spark-rapids/dist/target/rapids-4-spark_*.jar \
   /home/ubuntu/spark-rapids/integration_tests/target/rapids-4-spark-integration-tests*.jar \
   /home/ubuntu/spark-rapids/udf-examples/target/rapids-4-spark-udf-examples*.jar $LOCAL_JAR_PATH
If it's complicated to commonize this code, that's fine let's leave that suggestion out of this PR. We can do a followup PR to tackle it.
build
* Run the pandas udf using cudf on Databricks. Issue: #2026. Add the init script to set up the environment for the cudf_udf tests on Databricks; run cudf-udf test cases nightly. Signed-off-by: Tim Liu <timl@nvidia.com> * Update: use 'CUDF_UDF_TEST_ARGS'
Fixes #2026
Add the init script to set up the environment for the cudf_udf tests on Databricks; run cudf-udf tests nightly.
Signed-off-by: Tim Liu timl@nvidia.com