Hanging in ARCH_FORK with CPUPROFILE #701

GoogleCodeExporter · 2015-07-23T10:24:44Z

I have been able to reproduce the problem in two ways.
The first occurs at random and only intermittently (running in release 
mode).
The second seems to occur every time I run internal tools linked against 
libprofiler under gdb/cgdb.
I have been unable to produce a simplified reproducer that can be shared.

What steps will reproduce the problem?
1. compile code in debug mode, linked against libprofiler.so
2. run executable in cgdb
3. wait
4. interrupt execution and observe that:
    a. all but one thread are waiting in poll, epoll, pthread_cond_wait, etc.
    b. one thread is stuck in a fork system call, on the ARCH_FORK line
    c. CPU is at 100%

What is the expected output? What do you see instead?
The program is expected to finish normally.
The program hangs 'forever' in a call to fork(), on the ARCH_FORK() macro, with 
$rax = -ERESTARTNOINTR.

What version of the product are you using? On what operating system?
2.2.1 / 2.4
RHEL6

Please provide any additional information below.
I have a quick (incomplete) fix (attached) for this that uses pthread_atfork and 
pthread_sigmask to block SIGPROF before a fork and then unblock it 
afterwards. In my testing, this always prevents the hang.
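
For readers without access to the attachment, the approach described above can be sketched roughly as follows. This is not the attached cpu_profiler_nohang.cpp; the names BlockSigprofBeforeFork, RestoreMaskAfterFork and InstallForkGuard are hypothetical, and restoring the previous mask in the child as well as the parent is an assumption on my part.

```cpp
#include <pthread.h>
#include <signal.h>

namespace {

// Mask saved by the prepare handler. It is thread_local because signal masks
// are per thread and fork() may be called from any thread.
thread_local sigset_t saved_mask;

// Runs in the forking thread just before fork(): block SIGPROF so the
// profiling timer cannot interrupt the fork syscall.
void BlockSigprofBeforeFork() {
  sigset_t block;
  sigemptyset(&block);
  sigaddset(&block, SIGPROF);
  pthread_sigmask(SIG_BLOCK, &block, &saved_mask);
}

// Runs after fork() returns: put back whatever mask the thread had before.
void RestoreMaskAfterFork() {
  pthread_sigmask(SIG_SETMASK, &saved_mask, nullptr);
}

}  // namespace

// Hypothetical helper: registers the handlers. Returns 0 on success,
// otherwise the error number reported by pthread_atfork.
int InstallForkGuard() {
  return pthread_atfork(BlockSigprofBeforeFork,   // prepare (before fork)
                        RestoreMaskAfterFork,     // parent  (after fork)
                        RestoreMaskAfterFork);    // child   (after fork)
}
```

InstallForkGuard() would be called once, early in the program and before the first fork(). The handlers only cover code paths that go through fork() itself; whether that is enough for every caller is part of why this feels incomplete.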

I have shared my fix with Developer Services at my job and they have 
indicated that they would prefer this solution to be patched into the 
gperftools source code.

While this is probably sufficient for the use case at my job, it feels 
incomplete for the purposes of patching into the gperftools codebase.

Original issue reported on code.google.com by Sam.J.Ja...@gmail.com on 20 Jul 2015 at 5:18

Attachments:

[hang in ARCH_FORK.png](https://storage.googleapis.com/google-code-attachments/gperftools/issue-701/comment-0/hang in ARCH_FORK.png)
cpu_profiler_nohang.cpp


GoogleCodeExporter · 2015-07-23T10:24:44Z

Thanks for the bug report.

I would like to understand it a bit more. It's great that blocking SIGPROF 
during fork helps in your case, but I'm really curious why leaving it unblocked 
causes fork to spin. Is that because the signal always triggers during fork? But 
then how is that possible?

Can you please submit a test program that causes this behavior? Or maybe 
elaborate more on your findings?

Original comment by alkondratenko on 21 Jul 2015 at 2:44

GoogleCodeExporter · 2015-07-23T10:24:44Z

The signal does not always trigger during fork when run in release mode. 
However, as far as I can tell, it does always trigger with GDB/CGDB.

From my understanding, the kernel handles this return code by re-attempting the 
interrupted syscall (resetting $rax and moving the instruction pointer back). Why 
this gets trapped in a spin is beyond me, though.

As I mentioned, I have not yet been able to create a reproducer, but I 
will keep looking into it.

Original comment by Sam.J.Ja...@gmail.com on 21 Jul 2015 at 3:10

GoogleCodeExporter added Priority-Medium Type-Defect auto-migrated labels Jul 23, 2015
