@abellina Since PR #1540 was merged into spark-rapids branch-0.4, should we close this issue and verify the UCX pytests against the latest rapids 0.4.0-SNAPSHOT / cudf-0.18-SNAPSHOT?
Describe the bug
Running the v0.3.0 spark-rapids integration tests (pytests) with UCX enabled on a YARN cluster always fails: the run hangs on src/main/python/cache_test.py.
Steps/Code to reproduce bug
• Spark-rapids v0.3.0: https://oss.sonatype.org/content/repositories/comnvidia-1036/com/nvidia/rapids-4-spark_2.12/0.3.0/rapids-4-spark_2.12-0.3.0.jar
• cuDF v0.17 : https://urm.nvidia.com/artifactory/sw-spark-maven/ai/rapids/cudf/0.17/cudf-0.17-cuda10-1.jar
• spark-submit scripts: spark-egx-03:/home/timl/yarn-IT/ucx-submit-yarn.sh
• DOCKER_IMAGE=quay.io/nvidia/spark:abellina_ubuntu18cuda10-1-yarn3-ucx-patch
• Yarn Log: http://spark-egx-03:8088/proxy/application_1603128018386_5631/
#!/bin/bash
set +ex  # ensure errexit and xtrace are off
export SPARK_CONF_DIR=/usr/hdp/current/spark2-client/conf/
export HADOOP_HOME=/usr/hdp/3.1.0.0-78/hadoop
export HADOOP_CONF_DIR=/usr/hdp/3.1.0.0-78/hadoop/conf
export SPARK_HOME=/home/timl/spark-3.0.1-bin-hadoop3.2
export PATH=$SPARK_HOME/bin:$SPARK_HOME/sbin:$PATH
export PYSPARK_PYTHON=/usr/bin/python3.6
export PYSPARK_DRIVER_PYTHON=/usr/bin/python3.6
OS_TYPE=ubuntu18
CUDA_NAME=cuda10.1
CUDA_DOCKER_TAG=${CUDA_NAME/./-}    # cuda10.1 -> cuda10-1
CUDA_CLASSIFIER=${CUDA_NAME/./-}
FINAL=${CUDA_CLASSIFIER: -2}        # last two characters, e.g. "-1"
if [ "$FINAL" == "-0" ]; then
CUDA_CLASSIFIER=${CUDA_CLASSIFIER%-0}  # drop a trailing "-0": cuda11-0 -> cuda11
fi
echo $CUDA_CLASSIFIER
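The classifier derivation above can be exercised on its own; a minimal sketch (the loop and echo format are illustrative, not part of the original script):

```shell
# Standalone check of the CUDA classifier derivation used above (requires bash).
for CUDA_NAME in cuda10.1 cuda11.0; do
  CUDA_CLASSIFIER=${CUDA_NAME/./-}          # replace the first "." with "-"
  FINAL=${CUDA_CLASSIFIER: -2}              # last two characters
  if [ "$FINAL" == "-0" ]; then
    CUDA_CLASSIFIER=${CUDA_CLASSIFIER%-0}   # strip a trailing "-0"
  fi
  echo "$CUDA_NAME -> $CUDA_CLASSIFIER"
done
```

With these inputs it prints `cuda10.1 -> cuda10-1` and `cuda11.0 -> cuda11`, which is why the jar paths below use the `cuda10-1` classifier.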
# Earlier image variants, overridden by the final assignment below:
# DOCKER_IMAGE=quay.io/nvidia/spark:${OS_TYPE}${CUDA_DOCKER_TAG}-yarn3
# DOCKER_IMAGE=quay.io/nvidia/spark:${OS_TYPE}cudf17-${CUDA_CLASSIFIER}-udf
# DOCKER_IMAGE=quay.io/nvidia/spark:abellina_ubuntu18cuda10-1-yarn3-ucx-patch
DOCKER_IMAGE=quay.io/nvidia/spark:abellina_ubuntu18cuda10-1-yarn3-ucx
SUBMIT_ARGS="
--conf spark.shuffle.manager=com.nvidia.spark.rapids.spark301.RapidsShuffleManager
--conf spark.shuffle.service.enabled=false
--conf spark.rapids.shuffle.maxMetadataSize=1MB
--conf spark.rapids.shuffle.transport.enabled=true
--conf spark.rapids.shuffle.compression.codec=none
--conf spark.executorEnv.UCX_TLS=cuda_copy,cuda_ipc,rc,tcp
--conf spark.executorEnv.UCX_ERROR_SIGNALS=
--conf spark.executorEnv.UCX_MAX_RNDV_RAILS=1
--conf spark.executorEnv.UCX_CUDA_IPC_CACHE=y
--conf spark.executorEnv.UCX_MEMTYPE_CACHE=n
--conf spark.rapids.shuffle.ucx.bounceBuffers.size=4MB
--conf spark.rapids.shuffle.ucx.bounceBuffers.device.count=32
--conf spark.rapids.shuffle.ucx.bounceBuffers.host.count=32
--conf spark.executorEnv.LD_LIBRARY_PATH=/usr/lib:/usr/lib/ucx
--conf spark.sql.shuffle.partitions=200
--conf spark.executor.extraClassPath=/usr/lib:/usr/lib/ucx:cudf-0.17-${CUDA_CLASSIFIER}.jar:rapids-4-spark_2.12-0.3.0.jar
--conf spark.sql.broadcastTimeout=7200
--conf spark.network.timeout=3600s
--conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_RUN_PRIVILEGED_CONTAINER=true
--conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_RUN_PRIVILEGED_CONTAINER=true
--conf spark.executorEnv.UCX_NET_DEVICES=mlx5_3:1
--master yarn --deploy-mode cluster
--num-executors 4
--driver-memory 40G --executor-memory 200G
--conf spark.executor.cores=40 --conf spark.task.cpus=1
--conf spark.sql.files.maxPartitionBytes=4294967296
--conf spark.yarn.maxAppAttempts=1
--conf spark.executor.extraJavaOptions=-Dai.rapids.cudf.prefer-pinned=true
--conf spark.rapids.memory.pinnedPool.size=8g
--conf spark.rapids.sql.concurrentGpuTasks=2
--conf spark.plugins=com.nvidia.spark.SQLPlugin
--conf spark.executorEnv.PYSPARK_PYTHON=/usr/bin/python3.6
--conf spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE=docker
--conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS=/etc/passwd:/etc/passwd:ro,/etc/group:/etc/group:ro,/etc/hadoop/conf:/etc/hadoop/conf:ro
--conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=$DOCKER_IMAGE
--conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_TYPE=docker
--conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS=/etc/passwd:/etc/passwd:ro,/etc/group:/etc/group:ro,/etc/hadoop/conf:/etc/hadoop/conf:ro
--conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=$DOCKER_IMAGE
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=/usr/bin/python3.6
--conf spark.driver.PYSPARK_DRIVER_PYTHON=/usr/bin/python3.6
--jars hdfs:/jars/tim-test/cudf-0.17-${CUDA_CLASSIFIER}.jar,hdfs:/jars/tim-test/rapids-4-spark_2.12-0.3.0.jar
"
cd /home/timl/yarn-IT/jars/integration_tests
spark-submit $SUBMIT_ARGS \
  --archives /home/timl/yarn-IT/jars/integration_tests.zip#sampletests \
  /home/timl/yarn-IT/jars/run-3.6.py src/main/python/cache_test.py -v -rfExXs
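Note that `$SUBMIT_ARGS` is deliberately left unquoted in the invocation above, so the shell word-splits it into individual `--conf key=value` arguments (this works here because none of the values contain spaces). A minimal illustration:

```shell
# Word-splitting of an unquoted variable turns one string into many argv entries.
ARGS="--conf spark.shuffle.service.enabled=false --conf spark.rapids.shuffle.transport.enabled=true"
set -- $ARGS   # unquoted on purpose: splits on whitespace into 4 arguments
echo "$#"      # number of resulting positional arguments
```

If `"$SUBMIT_ARGS"` were quoted instead, spark-submit would receive the whole configuration as a single argument and fail to parse it.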
Expected behavior
The cache_test.py suite passes instead of hanging.
Environment details
Yarn cluster, spark-submit scripts: spark-egx-03:/home/timl/yarn-IT/ucx-submit-yarn.sh
Additional context