[BUG] Segfault with UCX and ASYNC allocator #4695

Closed
rongou opened this issue Feb 5, 2022 · 6 comments · Fixed by #5141
Labels: bug (Something isn't working), P1 (Nice to have for release)

Comments

rongou commented Feb 5, 2022

Describe the bug
With UCX enabled and the ASYNC allocator in use, we are seeing segfaults in q64 and q93.

@abellina

Steps/Code to reproduce bug
Run the queries with UCX enabled and the GPU memory pool set to ASYNC; a minimal config sketch follows.
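
A rough sketch of the relevant settings (the shuffle lines are copied from the full configuration below; spark.rapids.memory.gpu.pool=ASYNC is my assumption for the knob that selects the ASYNC allocator, not something stated in this issue):

spark.shuffle.manager=com.nvidia.spark.rapids.spark311.RapidsShuffleManager
spark.rapids.shuffle.transport.enabled=true
spark.rapids.memory.gpu.pool=ASYNC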

Expected behavior
Should not crash.

Environment details (please complete the following information)

  • Environment location: Standalone
  • Spark configuration settings related to the issue:
spark.driver.maxResultSize=2GB
spark.executor.cores=16
spark.executor.memory=240G
spark.driver.memory=50G
spark.locality.wait=0
spark.sql.adaptive.enabled=true
spark.sql.files.maxPartitionBytes=2g
spark.rapids.sql.concurrentGpuTasks=4
spark.executor.resource.gpu.amount=1
spark.task.resource.gpu.amount=0.0625
spark.driver.extraClassPath=$SPARK_RAPIDS_PLUGIN_JAR:$CUDF_JAR:$SPARK_RAPIDS_PLUGIN_INTEGRATION_TEST_JAR:$SCALLOP_JAR
spark.executor.extraClassPath=$SPARK_RAPIDS_PLUGIN_JAR:$CUDF_JAR
spark.rapids.memory.host.spillStorageSize=32G
spark.plugins=com.nvidia.spark.SQLPlugin
spark.rapids.memory.pinnedPool.size=8g
spark.rapids.shuffle.maxMetadataSize=1MB
spark.rapids.sql.incompatibleOps.enabled=true
spark.rapids.sql.variableFloatAgg.enabled=true
spark.rapids.sql.hasNans=false
spark.executor.instances=8
spark.shuffle.service.enabled=false
spark.rapids.shuffle.transport.enabled=true
spark.shuffle.manager=com.nvidia.spark.rapids.spark311.RapidsShuffleManager
spark.executorEnv.UCX_RNDV_SCHEME=put_zcopy
spark.executorEnv.UCX_ERROR_SIGNALS=
spark.executorEnv.UCX_IB_GPU_DIRECT_RDMA=yes
spark.executorEnv.UCX_IB_RX_QUEUE_LEN=1024
spark.rapids.shuffle.ucx.bounceBuffers.size=4MB
spark.rapids.shuffle.ucx.bounceBuffers.device.count=64
spark.rapids.shuffle.ucx.bounceBuffers.host.count=64
spark.rapids.sql.decimalType.enabled=true
spark.rapids.sql.castFloatToDecimal.enabled=true

Additional context
Stdout:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007fc15e2616a3, pid=123327, tid=0x00007f83ae3fe700
#
# JRE version: OpenJDK Runtime Environment (8.0_292-b10) (build 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10)
# Java VM: OpenJDK 64-Bit Server VM (25.292-b10 mixed mode linux-amd64 )
# Problematic frame:
# C  [libc.so.6+0x16d6a3]
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /opt/spark/spark-3.1.1-bin-hadoop3.2/work/app-20220204010648-14398/6/hs_err_pid123327.log
[thread 140203929093888 also had an error]
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#

Error log: hs_err_pid123327.log

rongou added the "bug" and "? - Needs Triage" labels Feb 5, 2022
abellina commented Feb 7, 2022

Early results indicate that UCX_MEMTYPE_CACHE=n seems to help in this case. We are checking with the UCX folks to confirm whether this is expected.
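
If it does turn out to be the recommended workaround, the variable could be passed the same way as the other UCX environment settings in the repro config above (a sketch only, not an official recommendation):

spark.executorEnv.UCX_MEMTYPE_CACHE=n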

sameerz removed the "? - Needs Triage" label Feb 8, 2022
rongou commented Feb 11, 2022

@abellina so is setting UCX_MEMTYPE_CACHE=n the official solution? Should we update the doc?

@abellina

No, this isn't the solution. It may be a workaround, but we haven't been able to explain why it helps yet.

The UCX memtype cache is not aware of CUDA async allocations or frees. I will need to spend more time debugging this.

sameerz added the "P1 (Nice to have for release)" label Feb 11, 2022
@abellina

When running with UCX 1.12.1-rc2, the issue appears to be gone for both q63 and q93. Others and I have been able to reproduce it consistently with 1.11.2 and 1.12.0. I'll find out more about why, but that's the news so far.

abellina commented Mar 21, 2022

I am not seeing this issue in UCX 1.12.1 as released. I have run both q63 and q93 ~20 times with ASYNC enabled and can't reproduce it anymore, whereas I certainly could with prior versions of UCX. This version includes fixes around binary instrumentation of memory hooks and CUDA 11.5. I do see that binary hooks are enabled, per the UCX log:

[1647890294.401576] [rl-r7525-d32-u09:129587:194] cudamem.c:247 UCX INFO cuda memory hooks mode bistro: installed 8 on driver API
[1647890294.401578] [rl-r7525-d32-u09:129587:194] cudamem.c:213 UCX DEBUG cuda memory hooks mode reloc is disabled for driver API
[1647890294.401581] [rl-r7525-d32-u09:129587:194] cudamem.c:213 UCX DEBUG cuda memory hooks mode reloc is disabled for runtime API

This is a good sign; this version of UCX would complain if the hook installation had failed.

This is not a high priority for 22.04, so we are moving it to 22.06: ARENA (not ASYNC) is the default in 22.04, and upgrading to UCX 1.12.1 works around the issue without any changes to the JUCX in the plugin.

abellina commented Apr 1, 2022

I still can't repro this segfault with 22.06, ASYNC, and UCX 1.12.1. We may close this and suggest that the minimum UCX version is 1.12.1.
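
For anyone checking their environment, the installed UCX version can be confirmed with the ucx_info utility that ships with UCX:

ucx_info -v    # prints the UCX version and build configuration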
