Spurious segfault during report generation (?) #1267

Open
ingomueller-net opened this issue Jun 19, 2020 · 3 comments

Comments

@ingomueller-net

I am running a CI target for a native Python module compiled with -fno-omit-frame-pointer -fsanitize=address,undefined -fno-sanitize=vptr. I am using stock Python and preloading ASan as described here.

For the past few days, this has been happening sporadically:

...
=========== 2196 passed, 14 skipped, 20 warnings in 4249.02 seconds ============
Tracer caught signal 11: addr=0x0 pc=0x7f7860e5ac39 sp=0x7f783a61ed30
==413==LeakSanitizer has encountered a fatal error.
==413==HINT: For debugging, try setting environment variable LSAN_OPTIONS=verbosity=1:log_threads=1
==413==HINT: LeakSanitizer does not work under ptrace (strace, gdb, etc)

The line starting with ==== is usually the last line the program prints. When I rerun the test target, it usually completes without problems.

How can I debug this? I can't run the program in gdb (or can I?), I can't make it produce a core dump when it segfaults (or can I?), I can't make it print a stack trace when it segfaults (or can I?), so how can I find out what is happening? The only information I seem to get is pc=0x7f7860e5ac39 sp=0x7f783a61ed30. What do those mean?
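On the last question: pc is the program counter (the address of the instruction that faulted) and sp is the stack pointer at the time of the signal. One way to get more out of pc is to look up which mapped file contains it. The following is a minimal Python sketch, assuming the contents of /proc/<pid>/maps were captured from the same run (addresses change between runs under ASLR). The helper name find_mapping and the sample mapping line are hypothetical and are not taken from the failing job.

```python
def find_mapping(maps_text, addr):
    """Return (path, file_offset) for the mapping that contains addr, or None.

    maps_text is expected to be in the /proc/<pid>/maps format:
    start-end perms offset dev inode [pathname]
    """
    for line in maps_text.splitlines():
        fields = line.split(maxsplit=5)
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        start_s, end_s = fields[0].split("-")
        start, end = int(start_s, 16), int(end_s, 16)
        if start <= addr < end:
            mapping_offset = int(fields[2], 16)
            path = fields[5] if len(fields) > 5 else "[anonymous]"
            return path, addr - start + mapping_offset
    return None


# Hypothetical mapping line, for illustration only.
SAMPLE_MAPS = (
    "7f7860e00000-7f7860f00000 r-xp 00000000 08:01 1234 /usr/lib/libexample.so\n"
)

result = find_mapping(SAMPLE_MAPS, 0x7F7860E5AC39)
if result is not None:
    path, offset = result
    print(f"{path} +{offset:#x}")
```

The resulting path and offset can often be handed to a symbolizer such as addr2line -e <path> <offset>, although the file offset and the ELF virtual address are not guaranteed to coincide for every library.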

I have tried ASAN_OPTIONS=handle_segv=0 and similar, but none changed the behavior.

Note that our CI runs a single process (Python running pytest) that runs for about 72 minutes. I somehow suspect that it fails because "it runs for too long"; at least, that would explain why this has started to happen over time with no apparent change other than adding more tests...

@ingomueller-net
Author

I am now running into this problem again. This time, all commits after one specific commit fail with the above error every single time. Interestingly, that commit has (seemingly?) nothing to do with the C++ module I am debugging and just changes some imports of Python (!) modules.

However, the problem only occurs when the tests are run by the GitLab CI runner. I have tried to reproduce it by running the same test in the same Docker image, but that works. I have even logged into the running Docker container and run the same test manually, copying all environment variables from the original, concurrently running CI job (as described here). My manual invocation works, but the CI job fails with the above error.
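For readers who want to try the same comparison, here is one possible way to copy the environment of a running process on Linux. It is a sketch only and not necessarily the method from the linked description: it assumes /proc/<pid>/environ is readable, and the helper names parse_environ and read_environ are made up for illustration. Note that this file reflects the environment the process was started with, not later changes made inside the process.

```python
def parse_environ(raw):
    """Parse the NUL-separated bytes of /proc/<pid>/environ into a dict."""
    env = {}
    for entry in raw.split(b"\0"):
        if not entry:
            continue  # trailing NUL produces an empty entry
        key, sep, value = entry.partition(b"=")
        if sep:  # ignore entries without '='
            env[key.decode(errors="replace")] = value.decode(errors="replace")
    return env


def read_environ(pid):
    """Read the initial environment of a running process (Linux only)."""
    with open(f"/proc/{pid}/environ", "rb") as f:
        return parse_environ(f.read())


# Example with literal bytes instead of a real process:
print(parse_environ(b"ASAN_OPTIONS=handle_segv=0\0HOME=/root\0"))
```

The returned dict can be passed as env= to subprocess.run to start the manual test run with the same variables as the CI job.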

Also, I have tried the sanitizers in LLVM 11 with the same result.

I suspect that some random factor, such as memory layout, determines whether or not the problem occurs.

The important question is: how can I debug this further?

@ingomueller-net
Author

The workaround described in #1322, setting ASAN_OPTIONS=intercept_tls_get_addr=0, seems to be working for me. Thanks, @InverseRE, for linking to my issue!
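The sanitizer runtime reads ASAN_OPTIONS when the process starts, so the variable has to be set before the Python interpreter is launched rather than from inside the test session. A minimal sketch of a launcher that adds the option without discarding options that are already set might look as follows. The helper names and the pytest command line are assumptions, since the real CI command is not shown in this issue.

```python
import os
import subprocess
import sys


def with_sanitizer_option(env, var, option):
    """Return a copy of env with option set in the colon-separated variable var.

    An existing setting of the same option name is replaced; all other
    options are kept in their original order.
    """
    new_env = dict(env)
    name = option.split("=", 1)[0]
    parts = [p for p in new_env.get(var, "").split(":") if p]
    parts = [p for p in parts if p.split("=", 1)[0] != name]
    parts.append(option)
    new_env[var] = ":".join(parts)
    return new_env


def run_tests_with_workaround(cmd=None):
    """Run the test command with intercept_tls_get_addr=0 added to ASAN_OPTIONS."""
    # Hypothetical default command; replace it with the real CI invocation.
    cmd = cmd or [sys.executable, "-m", "pytest"]
    env = with_sanitizer_option(
        os.environ, "ASAN_OPTIONS", "intercept_tls_get_addr=0"
    )
    return subprocess.call(cmd, env=env)
```

The same helper could be used to set LSAN_OPTIONS to verbosity=1 and log_threads=1, as the hint in the error output suggests.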

@ingomueller-net
Author

Another piece of information that may be useful to somebody: All previous attempts (which failed) were carried out with Docker images based on Ubuntu bionic, which uses glibc v2.27. I just now updated to Ubuntu focal, which uses glibc v2.31, where I get the same behaviour.
