[BUG] Segfault with UCX and ASYNC allocator #4695

Closed
rongou opened this issue Feb 5, 2022 · 6 comments · Fixed by #5141
Labels: bug (Something isn't working), P1 (Nice to have for release)

Comments

rongou commented Feb 5, 2022

Describe the bug
With UCX enabled and the ASYNC allocator in use, we are seeing segfaults in q64 and q93.

@abellina

Steps/Code to reproduce bug
Run the queries with UCX enabled and the GPU memory pool set to ASYNC; a minimal config sketch follows.
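
A rough sketch of the relevant settings (the shuffle lines are copied from the full configuration below; spark.rapids.memory.gpu.pool=ASYNC is my assumption for the knob that selects the ASYNC allocator, not something stated in this issue):

spark.shuffle.manager=com.nvidia.spark.rapids.spark311.RapidsShuffleManager
spark.rapids.shuffle.transport.enabled=true
spark.rapids.memory.gpu.pool=ASYNC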

Expected behavior
Should not crash.

Environment details (please complete the following information)

  • Environment location: Standalone
  • Spark configuration settings related to the issue:
spark.driver.maxResultSize=2GB
spark.executor.cores=16
spark.executor.memory=240G
spark.driver.memory=50G
spark.locality.wait=0
spark.sql.adaptive.enabled=true
spark.sql.files.maxPartitionBytes=2g
spark.rapids.sql.concurrentGpuTasks=4
spark.executor.resource.gpu.amount=1
spark.task.resource.gpu.amount=0.0625
spark.driver.extraClassPath=$SPARK_RAPIDS_PLUGIN_JAR:$CUDF_JAR:$SPARK_RAPIDS_PLUGIN_INTEGRATION_TEST_JAR:$SCALLOP_JAR
spark.executor.extraClassPath=$SPARK_RAPIDS_PLUGIN_JAR:$CUDF_JAR
spark.rapids.memory.host.spillStorageSize=32G
spark.plugins=com.nvidia.spark.SQLPlugin
spark.rapids.memory.pinnedPool.size=8g
spark.rapids.shuffle.maxMetadataSize=1MB
spark.rapids.sql.incompatibleOps.enabled=true
spark.rapids.sql.variableFloatAgg.enabled=true
spark.rapids.sql.hasNans=false
spark.executor.instances=8
spark.shuffle.service.enabled=false
spark.rapids.shuffle.transport.enabled=true
spark.shuffle.manager=com.nvidia.spark.rapids.spark311.RapidsShuffleManager
spark.executorEnv.UCX_RNDV_SCHEME=put_zcopy
spark.executorEnv.UCX_ERROR_SIGNALS=
spark.executorEnv.UCX_IB_GPU_DIRECT_RDMA=yes
spark.executorEnv.UCX_IB_RX_QUEUE_LEN=1024
spark.rapids.shuffle.ucx.bounceBuffers.size=4MB
spark.rapids.shuffle.ucx.bounceBuffers.device.count=64
spark.rapids.shuffle.ucx.bounceBuffers.host.count=64
spark.rapids.sql.decimalType.enabled=true
spark.rapids.sql.castFloatToDecimal.enabled=true

Additional context
Stdout:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007fc15e2616a3, pid=123327, tid=0x00007f83ae3fe700
#
# JRE version: OpenJDK Runtime Environment (8.0_292-b10) (build 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10)
# Java VM: OpenJDK 64-Bit Server VM (25.292-b10 mixed mode linux-amd64 )
# Problematic frame:
# C  [libc.so.6+0x16d6a3]
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /opt/spark/spark-3.1.1-bin-hadoop3.2/work/app-20220204010648-14398/6/hs_err_pid123327.log
[thread 140203929093888 also had an error]
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#

Error log: hs_err_pid123327.log

rongou added the "bug" and "? - Needs Triage" labels Feb 5, 2022
abellina commented Feb 7, 2022

Early results indicate that UCX_MEMTYPE_CACHE=n seems to help in this case. We are checking with the UCX folks to confirm whether this is expected.
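
If it does turn out to be the recommended workaround, the variable could be passed the same way as the other UCX environment settings in the repro config above (a sketch only, not an official recommendation):

spark.executorEnv.UCX_MEMTYPE_CACHE=n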

sameerz removed the "? - Needs Triage" label Feb 8, 2022
rongou commented Feb 11, 2022

@abellina so is setting UCX_MEMTYPE_CACHE=n the official solution? Should we update the doc?

@abellina

No, this isn't the solution. It may be a workaround, but we haven't been able to explain why it helps yet.

The UCX memtype cache is not aware of CUDA async allocations or frees. I will need to spend more time debugging this.

sameerz added the "P1 (Nice to have for release)" label Feb 11, 2022
@abellina

When running with UCX 1.12.1-rc2, the issue appears to be gone for both q63 and q93. Others and I have been able to reproduce it consistently with 1.11.2 and 1.12.0. I'll find out more about why, but that's the news so far.

abellina commented Mar 21, 2022

I am not seeing this issue in UCX 1.12.1 as released. I have run both q63 and q93 ~20 times with ASYNC enabled and can't reproduce it anymore, whereas I certainly could with prior versions of UCX. This version includes fixes around binary instrumentation of memory hooks and CUDA 11.5. I do see that binary hooks are enabled, per the UCX log:

[1647890294.401576] [rl-r7525-d32-u09:129587:194] cudamem.c:247 UCX INFO cuda memory hooks mode bistro: installed 8 on driver API
[1647890294.401578] [rl-r7525-d32-u09:129587:194] cudamem.c:213 UCX DEBUG cuda memory hooks mode reloc is disabled for driver API
[1647890294.401581] [rl-r7525-d32-u09:129587:194] cudamem.c:213 UCX DEBUG cuda memory hooks mode reloc is disabled for runtime API

This is a good sign; this version of UCX would complain if the hook installation had failed.

This is not a high priority for 22.04, so we are moving it to 22.06: ARENA (not ASYNC) is the default in 22.04, and upgrading to UCX 1.12.1 works around the issue without any changes to the JUCX in the plugin.

abellina commented Apr 1, 2022

I still can't repro this segfault with 22.06, ASYNC, and UCX 1.12.1. We may close this and suggest that the minimum UCX version is 1.12.1.
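
For anyone checking their environment, the installed UCX version can be confirmed with the ucx_info utility that ships with UCX:

ucx_info -v    # prints the UCX version and build configuration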
