[BUG] Segfault with UCX and ASYNC allocator #4695
Comments
Early results indicate that |
@abellina so is setting |
No, this isn’t the solution. It may be a workaround, but we haven’t been able to explain why yet. The UCX mem type cache is not aware of CUDA async allocations or frees. I will need to spend more time debugging this.
When running with UCX 1.12.1-rc2, the issue appears to be gone for both q63 and q93. Others and I have been able to reproduce it consistently with 1.11.2 and 1.12.0. I'll find out more about why, but that's the news so far.
I am not seeing this issue in UCX 1.12.1 as released. I have run both q63 and q93 ~20 times with ASYNC enabled and I can't reproduce it anymore, whereas I certainly could with prior versions of UCX. There are fixes around binary instrumentation of memory hooks and CUDA 11.5 included in this version. I do see that binary hooks are enabled per UCX log:
Which is a good sign. This version of UCX would complain if it had failed. This is not a high priority for 22.04, so we are moving it to 22.06: ARENA (not ASYNC) is the default in 22.04, and upgrading to UCX 1.12.1 is a workaround that requires no changes to the JUCX in the plugin.
I still can't repro this segfault with 22.06, ASYNC, and UCX 1.12.1. We may close this and suggest that the minimum UCX version is 1.12.1.
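For anyone who wants to guard against the affected UCX versions, here is a minimal sketch of a version check. It is not part of the plugin or of UCX. The helper names `parse_ucx_version` and `ucx_meets_minimum` are hypothetical, and the sketch assumes you already have the UCX version as a string (however you obtain it in your environment). It treats a release candidate such as 1.12.1-rc2 as meeting the minimum, which matches what was observed in this thread.

```python
import re

# Minimum UCX version suggested in this thread (an assumption taken from the
# comments above, not an official requirement).
MIN_UCX = (1, 12, 1)


def parse_ucx_version(text):
    """Extract (major, minor, patch) from a string like '1.12.1' or '1.12.1-rc2'."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version found in {text!r}")
    return tuple(int(part) for part in match.groups())


def ucx_meets_minimum(text, minimum=MIN_UCX):
    """Return True if the version in `text` is at least `minimum`.

    Any suffix such as '-rc2' is ignored, so release candidates of the
    minimum version are accepted.
    """
    return parse_ucx_version(text) >= minimum
```

With this sketch, `ucx_meets_minimum("1.11.2")` and `ucx_meets_minimum("1.12.0")` are false, while `ucx_meets_minimum("1.12.1-rc2")` and `ucx_meets_minimum("1.12.1")` are true, in line with the versions reported above.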
Describe the bug
With UCX on and using the ASYNC allocator, seeing segfaults with q64 and q93. @abellina
Steps/Code to reproduce bug
Run queries with UCX on and the GPU memory pool set to ASYNC.
Expected behavior
Should not crash.
Environment details (please complete the following information)
Additional context
Stdout:
Error log: hs_err_pid123327.log