
Hypnotoad / Prefork dump cores on small values of accepts and disabled keep-alive #1449

Closed
melhesedek opened this issue Dec 11, 2019 · 13 comments


@melhesedek

melhesedek commented Dec 11, 2019

  • Mojolicious version: 8.27
  • Perl version: 5.30.1
  • Operating system: Linux 5.3.13-arch1-1

Steps to reproduce the behavior

This is the TL;DR version; scroll down and see the attached archive for code and more details.
I run a simple synchronous server

get '/' => sub {
    usleep 1_000_000 * rand;
    shift->render(text => "i'm ok");
};

with a command

./server.pl prefork --listen 'http://*:3002' \
  --accepts 1 \
  --clients 1 \
  --requests 1 \
  --workers 10

and a client in a second terminal to generate some load.

./client.pl

Expected behavior

I expect each worker process to exit normally after it served one request.

Actual behavior

Eventually a core dump of a worker process is generated. gdb shows Program terminated with signal SIGQUIT, Quit.

Setting up a clean environment

I use a Linux box, plenv to compile a recent Perl version with debug symbols, and cpanm to install Mojolicious locally.

plenv install --as=5.30.1-debugging 5.30.1 -DDEBUGGING
plenv local 5.30.1-debugging
plenv install-cpanm
cpanm -L local Mojolicious@8.27 --quiet

Then make sure core dumps will end up in the current working directory (you may want to back up the previous value, but it won't persist across system boots). This step isn't strictly necessary, but the client.pl and gdb commands here assume it's done.

echo 'core' | sudo tee /proc/sys/kernel/core_pattern

Start the server

PERL5LIB=local/lib/perl5 ./server.pl prefork --listen 'http://*:3002' \
  --accepts 1 \
  --clients 1 \
  --requests 1 \
  --workers 10

and the client from another terminal.

PERL5LIB=local/lib/perl5 ./client.pl

It usually takes up to 15 minutes to reproduce on my Intel i5-2500K box. After the client terminates, there should be 1 to 3 core.<pid> files in the current working directory.

If a core dump was generated by e.g. a process with pid 33156, its backtrace can be inspected with the following command:

gdb -ex bt ~/.plenv/versions/5.30.1-debugging/bin/perl core.33156

To observe signal-related behaviour, I run the server via strace.
It generates a lot of traces/trace.<pid> files, which can take up some disk space.

mkdir traces
PERL5LIB=local/lib/perl5 strace -o traces/trace -ff -e signal,write -s 128 -tt -- ./server.pl prefork \
  --listen 'http://*:3002' \
  --accepts 1 \
  --clients 1 \
  --requests 1 \
  --workers 10

Dumps and traces

According to the backtraces, processes usually terminate inside the Perl_pp_exit or perl_destruct functions. In the syscall traces of dumped processes, SIGQUIT arrives some microseconds after the SIGQUIT handler is restored to SIG_DFL (the default).

I assume there's a race condition due to the time gap between Perl resetting the signal handlers and the process terminating.

Potential fix

See server-fixed.pl in the attached archive. This seems to do the trick, but a race condition still remains (AFAIK a signal may arrive between rt_sigaction setting the SIGQUIT handler to SIG_DFL and rt_sigprocmask blocking the process from receiving SIGQUIT).

app->hook(before_server_start => sub {
	my ($server, $app) = @_;

	my $sigset = POSIX::SigSet->new(SIGQUIT);
	$server->ioloop->on(finish => sub {
		my $loop = shift;

		sigprocmask(SIG_BLOCK, $sigset) // die "Could not block SIGQUIT\n";
	});
});
@jhthorsen
Member

When is this a problem in the real world?

@melhesedek
Author

When your app does something memory-intensive but not very often, so you lower the accepts value to release memory back to the OS. AFAIK this (or memory leakage) is the primary use case for the similar setting in other web servers, e.g. httpd's MaxConnectionsPerChild.

In my case, setting accepts to 10 resulted in several dozen core dumps a day at a modest 2 RPS.

@stale

stale bot commented Aug 30, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Aug 30, 2020
@kraih kraih removed the stale label Nov 5, 2020
@kraih
Member

kraih commented Feb 10, 2021

The problem seems too obscure for us to fix in Mojolicious.

@kraih kraih closed this as completed Feb 10, 2021
@oschwald

oschwald commented Sep 24, 2021

We see hundreds of similar core dumps a week in production with accepts set to 100 and requests set to 70. Although it isn't a significant issue, it does produce many spurious logs that must be filtered out when monitoring for real issues.

@ksmadsen

We see the same issue in our production. At this point, our conclusion is that when exit is called, Perl resets the signal handlers early in the process, before it actually exits. This appears to happen even before the END blocks are executed. So if a worker is executing an END block while shutting down and the parent process sends it the QUIT signal, the worker process will core dump.

That Perl resets the signal handlers before END is executed is illustrated by the following short script:

$SIG{QUIT} = sub {};
kill 'QUIT', $$;
print "After first quit\n";

END {
    kill 'QUIT', $$;
    print "After second quit\n";
}

Running this program will only cause the first print statement to be executed, as the program will core dump before reaching the second statement (tested on Perl 5.34.0).

It seems that only handlers are reset. If sub {} is swapped out for 'IGNORE', the print statement in the END block is also executed.

I'm not familiar with the internals of the Perl interpreter, but it seems plausible that it has to reset the signal handlers before exiting, leaving a small window open for a race between exiting and receiving a QUIT signal from the parent.

@kiwiroy
Contributor

kiwiroy commented Oct 29, 2021

Localizing the signal handler in the END block results in the second kill being handled.
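
To illustrate, here is @ksmadsen's earlier test script with the handler localized in the END block (a sketch based on this comment, not code posted in the thread):

$SIG{QUIT} = sub {};
kill 'QUIT', $$;
print "After first quit\n";

END {
    # Reinstall a handler for the duration of the END block
    local $SIG{QUIT} = sub {};
    kill 'QUIT', $$;
    print "After second quit\n";
}

With this change both print statements should run, since a handler is in place again by the time the second kill fires.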

@ksmadsen

I can't see that localizing the signal handler in the END block eliminates the race completely. We can see that Perl resets the signal handlers before the END blocks are executed, so the QUIT signal could still arrive after the handlers are reset and before the END block that reinstalls the handler is executed.

AFAICT the potential fix with the before_server_start hook in the original post eliminates the race. A "simpler" solution would be to add:

$SIG{QUIT} = 'IGNORE';

between $loop->start; and exit 0; in Mojo::Server::Prefork::_spawn (we are already exiting, so there's no need to know that we should do it gracefully; that wouldn't have any effect anyway).

I would submit a pull request for this, but I haven't yet wrapped my head around how to write a test that ensures it actually works.
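
For illustration, a sketch of where that line would sit; the surrounding code is paraphrased from the shape of Mojo::Server::Prefork::_spawn, not copied from the actual source:

# End of the worker branch in Mojo::Server::Prefork::_spawn (paraphrased)
$loop->start;

# Proposed addition: ignore SIGQUIT from here on; the worker is already
# exiting, and Perl will reset custom handlers before exit completes
$SIG{QUIT} = 'IGNORE';

exit 0;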

@jixam
Contributor

jixam commented Oct 30, 2021

@kraih Please consider reopening. We have accepts at 1000 but this issue actually caused an outage for us when Ubuntu started storing core dumps in an unexpected location after a routine apt upgrade.

Here is a perhaps less obscure repro, which dumps core for me about 50% of the time:

Server (using strace in case core dumps are disabled):

strace -e kill perl -Mojo -E 'get "/" => {inline => "%= sleep 5"}; app->start' prefork -a 1

Client:

while sleep 1; do curl http://127.0.0.1:3000/; done

The trick is doing enough work (sleep 5) that a heartbeat happens while the IO loop is shutting down after reaching its maximum number of accepted connections (i.e. max_accepts). The manager can then end up sending SIGQUIT after the worker has unregistered its handler during exit().

@jixam
Contributor

jixam commented Nov 1, 2021

It turns out the Prefork signal handlers are local, so an effective workaround is to globally ignore SIGQUIT before startup. As @ksmadsen has shown, this will also ignore the signals during exit().

My test server then becomes:

strace -e kill perl -Mojo -E '$SIG{QUIT} = "IGNORE"; get "/" => {inline => "%= sleep 5"}; app->start' prefork -a 1

which never core dumps for me.

I still think a change is warranted so that all users do not have to know about such a workaround.

@mschout

mschout commented Mar 1, 2022

This bit us hard, eating up huge amounts of disk space once we lowered accepts.

@kraih
Member

kraih commented Mar 1, 2022

There's a possible fix; unfortunately, nobody has reviewed it yet.

@jixam
Contributor

jixam commented Mar 2, 2022

FWIW, this simple workaround has completely solved the problem for us (this is also what I discussed above):

BEGIN {
    # Protect against stray signals from the process manager
    $SIG{QUIT} = 'IGNORE';
}

PR #1883 ensures that the stray signals never happen, making the workaround unnecessary.
