Spurious segfault during report generation (?) #1267

ingomueller-net · 2020-06-19T20:17:12Z

I am running a CI target for a native Python module compiled with -fno-omit-frame-pointer -fsanitize=address,undefined -fno-sanitize=vptr. I am using stock Python and preloading ASan as described here.

For the past few days, the following has been happening sporadically:

...
=========== 2196 passed, 14 skipped, 20 warnings in 4249.02 seconds ============
Tracer caught signal 11: addr=0x0 pc=0x7f7860e5ac39 sp=0x7f783a61ed30
==413==LeakSanitizer has encountered a fatal error.
==413==HINT: For debugging, try setting environment variable LSAN_OPTIONS=verbosity=1:log_threads=1
==413==HINT: LeakSanitizer does not work under ptrace (strace, gdb, etc)

The line starting with ==== is usually the last line printed by the program. When I rerun the test target, it usually completes without problems.

How can I debug this? I can't run the program in gdb (or can I?), I can't make it produce a core dump when it segfaults (or can I?), I can't make it print a stack trace when it segfaults (or can I?), so how can I find out what is happening? The only information I seem to get is pc=0x7f7860e5ac39 sp=0x7f783a61ed30. What do those mean?
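On the last question: pc is the program counter and sp is the stack pointer at the moment the signal was caught. One way to make use of the pc value is to look up which memory mapping it falls into, which at least names the shared object involved. The following is a minimal sketch, assuming a copy of /proc/<pid>/maps can be captured from the failing process while it is still alive (addresses change between runs because of ASLR). The function name find_mapping and the sample mapping lines are hypothetical and only serve as illustration.

```python
def find_mapping(maps_text, addr):
    """Return (path, file_offset) of the mapping that contains addr, or None.

    maps_text is the content of /proc/<pid>/maps, whose lines look like:
    start-end perms offset dev inode [pathname]
    """
    for line in maps_text.splitlines():
        fields = line.split(None, 5)
        if len(fields) < 5:
            continue
        start_s, end_s = fields[0].split("-")
        start, end = int(start_s, 16), int(end_s, 16)
        if start <= addr < end:
            # Anonymous mappings have no pathname field.
            path = fields[5].strip() if len(fields) > 5 else ""
            # Offset of addr within the backing file.
            file_offset = addr - start + int(fields[2], 16)
            return path, file_offset
    return None


# Hypothetical sample data, not taken from the failing job.
sample_maps = (
    "7f7860e00000-7f7860f00000 r-xp 00010000 08:01 123 /usr/lib/libexample.so\n"
    "7f783a600000-7f783a620000 rw-p 00000000 00:00 0\n"
)
hit = find_mapping(sample_maps, 0x7F7860E5AC39)
if hit is not None:
    print(hit[0], hex(hit[1]))
```

With the path and file offset in hand, a symbolizer such as addr2line or llvm-symbolizer can usually turn the offset into a function name, provided the library has symbols.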

I have tried ASAN_OPTIONS=handle_segv=0 and similar settings, but none of them changed the behavior.

Note that our CI runs a single process (Python running pytest) that runs for about 72 minutes. I somehow suspect that it fails because "it runs for too long"; at least, that would explain why this has started to happen over time with no apparent change other than adding more tests...

ingomueller-net · 2020-11-10T17:09:48Z

I am now running into this problem again. This time, all commits after one specific commit fail with the above error every single time. Interestingly, that commit has (seemingly?) nothing to do with the C++ module I am debugging and just changes some imports of Python (!) modules.

However, the problem only occurs when the test is run by the GitLab CI runner. I have tried to reproduce it by running the same test in the same Docker image, but there it passes. I have even logged into the running Docker container and manually run the same test that CI runs, after copying all environment variables from the original, concurrently running CI job (as described here). My manual invocation works, but the CI job fails with the above error.

Also, I have tried the sanitizers in LLVM 11 with the same result.

I suspect that some random factor, such as memory layout, determines whether or not the problem occurs.

The important question is: how can I debug this further?

ingomueller-net · 2020-11-10T18:33:33Z

The workaround described in #1322, setting ASAN_OPTIONS=intercept_tls_get_addr=0, seems to work for me. Thanks, @InverseRE, for linking to my issue!
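For anyone whose CI job already sets other ASAN_OPTIONS, here is a minimal sketch of how the workaround could be merged into the existing value rather than overwriting it. The helper name with_asan_option is hypothetical. The sketch assumes options are separated by colons and that no option value itself contains a colon.

```python
import os


def with_asan_option(env, key, value):
    """Return a copy of env with key=value set in ASAN_OPTIONS.

    Other options already present in ASAN_OPTIONS are kept. An existing
    setting for the same key is replaced.
    """
    current = env.get("ASAN_OPTIONS", "")
    options = [opt for opt in current.split(":") if opt]
    options = [opt for opt in options if opt.split("=", 1)[0] != key]
    options.append(f"{key}={value}")
    new_env = dict(env)
    new_env["ASAN_OPTIONS"] = ":".join(options)
    return new_env


# Build the environment for the test run from the current one.
test_env = with_asan_option(os.environ, "intercept_tls_get_addr", "0")
print(test_env["ASAN_OPTIONS"])
```

The resulting dictionary can then be passed as the env argument of subprocess.run when launching pytest, alongside whatever LD_PRELOAD setting is already used to preload ASan.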

ingomueller-net · 2020-11-13T10:03:57Z

Another piece of information that may be useful to somebody: all previous attempts (which failed) were carried out with Docker images based on Ubuntu bionic, which uses glibc v2.27. I have just updated to Ubuntu focal, which uses glibc v2.31, and I get the same behaviour there.
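Since the glibc version seems relevant when comparing environments, a small sketch like the following could be run at the start of the CI job to log which glibc the Python interpreter is linked against. It relies on platform.libc_ver() from the standard library. The helper name glibc_version is hypothetical.

```python
import platform


def glibc_version():
    """Return the glibc version as a tuple of ints, or None if it is unknown."""
    lib, version = platform.libc_ver()
    if lib != "glibc" or not version:
        return None
    try:
        return tuple(int(part) for part in version.split(".")[:2])
    except ValueError:
        # The version string has an unexpected, non-numeric form.
        return None


print("glibc version:", glibc_version())
```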

InverseRE mentioned this issue Sep 28, 2020

Detecting GLIBC version (DTLS SIGSEGV). #1322

Open

JanJecmen mentioned this issue Aug 6, 2021

Try preventing leak sanitizer crashes reactorlabs/rir#1089

Merged
