Fix 4315 decrease concurrentGpuTasks to avoid sum test OOM (#4326)
* Nightly test concurrentGpuTasks to 1 for OOM

Signed-off-by: Peixin Li <pxli@nyu.edu>

* update warning filter for pytest-order

* reduce executor core num

* reorder tests
pxLi authored Dec 8, 2021
1 parent 367c5d7 commit 51a083c
Showing 2 changed files with 6 additions and 4 deletions.
2 changes: 2 additions & 0 deletions integration_tests/pytest.ini
@@ -26,3 +26,5 @@ markers =
     validate_execs_in_gpu_plan([execs]): Exec class names to validate they exist in the GPU plan.
     shuffle_test: Mark to include test in the RAPIDS Shuffle Manager
     premerge_ci_1: Mark test that will run in first k8s pod in case of parallel build premerge job
+filterwarnings =
+    ignore:.*pytest.mark.order.*:_pytest.warning_types.PytestUnknownMarkWarning
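Each `filterwarnings` entry follows pytest's `action:message_regex:category` form, so the added lines silence the `PytestUnknownMarkWarning` that `@pytest.mark.order` (from the `pytest-order` plugin) would otherwise raise as an unregistered mark. The same filter shape can be sketched with the stdlib `warnings` machinery; here `UserWarning` merely stands in for `_pytest.warning_types.PytestUnknownMarkWarning`:

```python
import warnings

# Mimic pytest's "ignore:.*pytest.mark.order.*:<category>" filter entry:
# action="ignore", message regex, warning category.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warnings.filterwarnings(
        "ignore", message=".*pytest.mark.order.*", category=UserWarning
    )
    warnings.warn("Unknown pytest.mark.order - is this a typo?", UserWarning)
    warnings.warn("an unrelated warning", UserWarning)

# Only the unrelated warning survives the filter.
print(len(caught))  # 1
```

The message field is a regex matched against the warning text, which is why the dots in `pytest.mark.order` also happen to match literally.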
8 changes: 4 additions & 4 deletions jenkins/spark-tests.sh
@@ -142,10 +142,10 @@ export SEQ_CONF="--executor-memory 16G \
 PARALLELISM=${PARALLELISM:-'4'}
 MEMORY_FRACTION=$(python -c "print(1/($PARALLELISM + 0.2))")
 export PARALLEL_CONF="--executor-memory 4G \
---total-executor-cores 2 \
---conf spark.executor.cores=2 \
+--total-executor-cores 1 \
+--conf spark.executor.cores=1 \
 --conf spark.task.cpus=1 \
---conf spark.rapids.sql.concurrentGpuTasks=2 \
+--conf spark.rapids.sql.concurrentGpuTasks=1 \
 --conf spark.rapids.memory.gpu.minAllocFraction=0 \
 --conf spark.rapids.memory.gpu.allocFraction=${MEMORY_FRACTION} \
 --conf spark.rapids.memory.gpu.maxAllocFraction=${MEMORY_FRACTION}"
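The `MEMORY_FRACTION` arithmetic above caps each of the parallel Spark apps below an even share of GPU memory: with the default `PARALLELISM=4`, each app gets 1/(4 + 0.2) ≈ 23.8%, and the extra 0.2 in the denominator keeps the four allocations from summing to 100%. A quick check of that arithmetic (using `awk` here instead of the script's `python -c` one-liner):

```shell
#!/bin/sh
# Reproduce the 1/(PARALLELISM + 0.2) per-app GPU memory fraction
# for a few PARALLELISM values (4 is the script's default).
for p in 2 4 8; do
    fraction=$(awk -v p="$p" 'BEGIN { printf "%.4f", 1 / (p + 0.2) }')
    total=$(awk -v p="$p" -v f="$fraction" 'BEGIN { printf "%.4f", p * f }')
    echo "PARALLELISM=$p: per-app fraction=$fraction, combined=$total"
done
```

For every value of `p`, the combined allocation stays strictly below 1.0, which is the headroom that `+ 0.2` buys.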
@@ -215,7 +215,7 @@ if [[ $TEST_MODE == "ALL" || $TEST_MODE == "IT_ONLY" ]]; then
     # integration tests
     if [[ $PARALLEL_TEST == "true" ]] && [ -x "$(command -v parallel)" ]; then
         # put most time-consuming tests at the head of queue
-        time_consuming_tests="join_test.py hash_aggregate_test.py parquet_write_test.py"
+        time_consuming_tests="hash_aggregate_test.py join_test.py generate_expr_test.py parquet_write_test.py"
         tests_list=$(find "$SCRIPT_PATH"/src/main/python/ -name "*_test.py" -printf "%f ")
         tests=$(echo "$time_consuming_tests $tests_list" | tr ' ' '\n' | awk '!x[$0]++' | xargs)
         # --halt "now,fail=1": exit when the first job fail, and kill running jobs.
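The `awk '!x[$0]++'` idiom in the hunk above keeps only the first occurrence of each line, so prepending `$time_consuming_tests` to the full `find` listing moves those tests to the head of the queue without duplicating them. A standalone sketch with made-up test names:

```shell
#!/bin/sh
# Prepend priority items to a list, then drop duplicates while keeping
# first-seen order -- the same pipeline spark-tests.sh uses to build $tests.
time_consuming_tests="hash_aggregate_test.py join_test.py"
tests_list="cast_test.py join_test.py hash_aggregate_test.py"
tests=$(echo "$time_consuming_tests $tests_list" | tr ' ' '\n' | awk '!x[$0]++' | xargs)
echo "$tests"  # hash_aggregate_test.py join_test.py cast_test.py
```

In the awk program, `x[$0]++` post-increments a per-line counter, so the expression is true (and the line printed) only on the first sighting; `xargs` then folds the deduplicated lines back into one space-separated string.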
