
[BUG] occasional crashes in CI #1797

Closed
revans2 opened this issue Feb 23, 2021 · 6 comments
Labels
bug Something isn't working P0 Must have for release

Comments

@revans2
Collaborator

revans2 commented Feb 23, 2021

Describe the bug
I have been seeing some crashes in CI that look like they are caused by some kind of memory corruption, but I am not sure. They are intermittent.

16:51:11  All tests passed.
16:51:12  #
16:51:12  # A fatal error has been detected by the Java Runtime Environment:
16:51:12  #
16:51:12  #  SIGSEGV (0xb) at pc=0x00007f2b75cad562, pid=30764, tid=0x00007f253dddb700
16:51:12  #
16:51:12  # JRE version: OpenJDK Runtime Environment (8.0_282-b08) (build 1.8.0_282-8u282-b08-0ubuntu1~16.04-b08)
16:51:12  # Java VM: OpenJDK 64-Bit Server VM (25.282-b08 mixed mode linux-amd64 compressed oops)
16:51:12  # Problematic frame:
16:51:12  # C  [libc.so.6+0x84562]  cfree+0x22
16:51:12  #
16:51:12  # Core dump written. Default location: /.../core or core.30764
16:51:12  #
16:51:12  # An error report file with more information is saved as:
16:51:12  # /.../hs_err_pid30764.log
16:51:12  Compiled method (nm)  557628 20278     n 0       sun.misc.Unsafe::freeMemory (native)
16:51:12   total in heap  [0x00007f2b60a1a690,0x00007f2b60a1a9c0] = 816
16:51:12   relocation     [0x00007f2b60a1a7b8,0x00007f2b60a1a800] = 72
16:51:12   main code      [0x00007f2b60a1a800,0x00007f2b60a1a9c0] = 448
16:51:12  #
16:51:12  # If you would like to submit a bug report, please visit:
16:51:12  #   http://bugreport.java.com/bugreport/crash.jsp
16:51:12  #
16:51:12  [INFO] ----------------------------------------------
@revans2 revans2 added bug Something isn't working ? - Needs Triage Need team to review and classify labels Feb 23, 2021
@pxLi
Collaborator

pxLi commented Mar 1, 2021

Copied from #1823

JVM crash in the nightly build during UT, intermittently.

OutOfMemory and StackOverflow Exception counts:
OutOfMemoryError java_heap_errors=1

hs_err_pid2506.log

I will keep monitoring this to determine the root cause of the crash.

@revans2
Collaborator Author

revans2 commented Mar 1, 2021

I think I found the issue. It is related to the concurrent modification exception that we occasionally see when the tests are shutting down. The stack trace for the bad free in hs_err_pid2506.log shows that we are freeing a host buffer that was "leaked". I cannot tell the size of the host buffer, but from the stack trace it is not a pinned buffer, so it is not the pinned memory pool, which we expect to leak. It is being cleaned by the MemoryCleaner on shutdown, which verifies any leaks we might have. This is something that no one would turn on in production, so I think we are OK with shipping 0.4 without any fix in place.

When I look at the cleaner code there is no locking. It was written with the assumption that GC would remove any need for locking, because a buffer only shows up in the cleanup queue once there are no references left to it. When we force the cleanup to happen at shutdown, there are now race conditions: the shutdown path and the GC-driven path can try to free the same buffer concurrently. I think we need to put some locking in the actual cleanup code for each buffer. It should be a simple change, and cheap, because there should be no lock contention in the common case.
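The per-buffer locking described above could look roughly like this. This is a minimal sketch only: `BufferCleaner`, `clean`, and `freeNative` are illustrative names, not the actual cudf MemoryCleaner API. The idea is that making the free-exactly-once check and the free itself one atomic step means it no longer matters whether the GC-driven cleanup thread or the forced shutdown cleanup gets there first.

```java
// Hypothetical sketch of the proposed fix: synchronize the per-buffer
// cleanup so the GC-driven path and the forced shutdown path cannot both
// free the same native allocation. Names are illustrative, not cudf's API.
public class BufferCleaner {
    private long address; // native address; 0 once freed

    public BufferCleaner(long address) {
        this.address = address;
    }

    // Called either by the GC-driven cleanup thread (no contention in the
    // common case, so the lock is cheap) or by a forced cleanup at shutdown.
    // Returns true only for the one caller that actually performed the free.
    public synchronized boolean clean() {
        if (address == 0) {
            return false; // already cleaned by the other path
        }
        freeNative(address);
        address = 0;
        return true;
    }

    private static void freeNative(long addr) {
        // stand-in for the real native free (e.g. Unsafe.freeMemory)
    }
}
```

Without the `synchronized`, two threads could both observe a nonzero address and both call the native free, which is exactly the double free that `cfree` crashes on in the log above.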

@revans2
Collaborator Author

revans2 commented Mar 1, 2021

I spoke with @abellina and he has a patch he was working on related to synchronization in the MemoryCleaner for UCX work. He is going to extend that patch to also cover what I suspect is the cause of this issue.

@revans2 revans2 added P0 Must have for release and removed ? - Needs Triage Need team to review and classify labels Mar 1, 2021
@revans2
Collaborator Author

revans2 commented Mar 1, 2021

Because the bug is in the Java cuDF code and not in this code, I am going to target this for the 0.5 release. Also, if my analysis is correct, the only way this can be triggered is when the debug leak detection is turned on, which no one should ever do in production.

@abellina abellina added this to the Mar 1 - Mar 12 milestone Mar 3, 2021
rapids-bot bot pushed a commit to rapidsai/cudf that referenced this issue Mar 4, 2021
Add synchronization in `cleanImpl` and `close` in various places where race conditions could exist, and also within the `MemoryCleaner` to address some concurrent modification issues we've seen in tests while shutting down (i.e. invoking the cleaner) (i.e. NVIDIA/spark-rapids#1797)

Authors:
  - Alessandro Bellina (@abellina)

Approvers:
  - Robert (Bobby) Evans (@revans2)
  - Jason Lowe (@jlowe)

URL: #7474
@abellina
Collaborator

My fix targeted the cleaner synchronization. I'd say we close this one and re-open if more CI crashes show up?

@abellina
Collaborator

abellina commented Mar 12, 2021

We are closing this one for now, as we haven't seen this since rapidsai/cudf#7474 went in. If you see a similar bug show up in CI, please reopen.

hyperbolic2346 pushed a commit to hyperbolic2346/cudf that referenced this issue Mar 25, 2021