This repository has been archived by the owner on Nov 1, 2022. It is now read-only.

fluxd can overload and become unresponsive #714

Closed · squaremo opened this issue Aug 23, 2017 · 12 comments
@squaremo (Member)

In our own dev environment, we have noticed from time to time that fluxd has stopped automatically deploying images. When we go to look at the pod, it shows up as using gigabytes of RAM and having hundreds of goroutines -- normally we'd expect maybe 50MB and tens of goroutines.

Luckily, I managed to capture a stack dump from fluxd this most recent time. Notable bit:

goroutine 18627648 [semacquire]:
sync.runtime_Semacquire(0x1faef64)
	/usr/local/go/src/runtime/sema.go:47 +0x30
sync.(*Mutex).Lock(0x1faef60)
	/usr/local/go/src/sync/mutex.go:85 +0xd0
sync.(*RWMutex).Lock(0x1faef60)
	/usr/local/go/src/sync/rwmutex.go:86 +0x31
syscall.forkExec(0xc44f06df40, 0xc, 0xc457fc2000, 0x6, 0x6, 0xc45d781310, 0x0, 0xc41fee25b8, 0xc44d288ea0)
	/usr/local/go/src/syscall/exec_unix.go:185 +0x1fd
syscall.StartProcess(0xc44f06df40, 0xc, 0xc457fc2000, 0x6, 0x6, 0xc45d781310, 0x2, 0x4, 0x20, 0x20)
	/usr/local/go/src/syscall/exec_unix.go:240 +0x64
os.startProcess(0xc44f06df40, 0xc, 0xc457fc2000, 0x6, 0x6, 0xc45d7814b8, 0x0, 0x15e6fe0, 0xc420dafd50)
	/usr/local/go/src/os/exec_posix.go:45 +0x1a3
os.StartProcess(0xc44f06df40, 0xc, 0xc457fc2000, 0x6, 0x6, 0xc45d7814b8, 0x0, 0x0, 0x6)
	/usr/local/go/src/os/doc.go:28 +0x64
os/exec.(*Cmd).Start(0xc42716edc0, 0xc457fc2001, 0xc420dafd50)
	/usr/local/go/src/os/exec/exec.go:358 +0x3c9
os/exec.(*Cmd).Run(0xc42716edc0, 0xc420dafd50, 0xc44f06df60)
	/usr/local/go/src/os/exec/exec.go:276 +0x2b
github.com/weaveworks/flux/git.execGitCmd(0xc4204f36e0, 0x20, 0x0, 0x0, 0x1f1de20, 0xc420dafce0, 0xc45d781618, 0x5, 0x5, 0x1f1ee60, ...)
	/home/ubuntu/src/github.com/weaveworks/flux/git/operations.go:206 +0x193
github.com/weaveworks/flux/git.getNote(0xc4204f36e0, 0x20, 0xc420832410, 0xf, 0xc442af4ec4, 0x28, 0x0, 0x0, 0x0)
	/home/ubuntu/src/github.com/weaveworks/flux/git/operations.go:109 +0x12c
github.com/weaveworks/flux/git.(*Checkout).GetNote(0xc420124240, 0xc442af4ec4, 0x28, 0x0, 0x0, 0x0)
	/home/ubuntu/src/github.com/weaveworks/flux/git/repo.go:170 +0xcb
github.com/weaveworks/flux/daemon.(*Daemon).JobStatus(0xc42067c600, 0xc43b2af4a0, 0x24, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/home/ubuntu/src/github.com/weaveworks/flux/daemon/daemon.go:326 +0x177
github.com/weaveworks/flux/daemon.(*Ref).JobStatus(0xc4202ce870, 0xc43b2af4a0, 0x24, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/home/ubuntu/src/github.com/weaveworks/flux/daemon/ref.go:65 +0x8c
github.com/weaveworks/flux/remote.(*ErrorLoggingPlatform).JobStatus(0xc4204f3580, 0xc43b2af4a0, 0x24, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/home/ubuntu/src/github.com/weaveworks/flux/remote/logging.go:77 +0xde
github.com/weaveworks/flux/remote/rpc.(*RPCServer).JobStatus(0xc437ee0f80, 0xc43b2af4a0, 0x24, 0xc422cd9140, 0x0, 0x0)
	/home/ubuntu/src/github.com/weaveworks/flux/remote/rpc/server.go:77 +0x62
reflect.Value.call(0xc4207a6b40, 0xc437cab728, 0x13, 0x16254fb, 0x4, 0xc45d781ee0, 0x3, 0x3, 0x414a59, 0xc4208640b8, ...)
	/usr/local/go/src/reflect/value.go:434 +0x5c8
reflect.Value.Call(0xc4207a6b40, 0xc437cab728, 0x13, 0xc45d781ee0, 0x3, 0x3, 0x0, 0x0, 0x0)
	/usr/local/go/src/reflect/value.go:302 +0xa4
net/rpc.(*service).call(0xc42f866580, 0xc42f866540, 0xc437ee1030, 0xc437eec380, 0xc42cae4f20, 0x13fd740, 0xc4239f5020, 0x198, 0x145ece0, 0xc422cd9140, ...)
	/usr/local/go/src/net/rpc/server.go:383 +0x148
created by net/rpc.(*Server).ServeCodec
	/usr/local/go/src/net/rpc/server.go:477 +0x421

There are about 600 of those, all stuck in the same place.

fluxd-dev-stackdump.txt

@squaremo (Member Author)

Here's one that is (I think) in the critical section the others are trying to enter:

goroutine 18715803 [runnable]:
sync.runtime_Semrelease(0x1faef64)
	/usr/local/go/src/runtime/sema.go:57 +0x2b
sync.(*Mutex).Unlock(0x1faef60)
	/usr/local/go/src/sync/mutex.go:124 +0xaa
sync.(*RWMutex).Unlock(0x1faef60)
	/usr/local/go/src/sync/rwmutex.go:125 +0x95
syscall.forkExec(0xc453206c60, 0xc, 0xc460542000, 0x6, 0x6, 0xc469683310, 0x0, 0xc41dfd63ca, 0xc46053ab60)
	/usr/local/go/src/syscall/exec_unix.go:198 +0x306
syscall.StartProcess(0xc453206c60, 0xc, 0xc460542000, 0x6, 0x6, 0xc469683310, 0x2, 0x4, 0x20, 0x20)
	/usr/local/go/src/syscall/exec_unix.go:240 +0x64
os.startProcess(0xc453206c60, 0xc, 0xc460542000, 0x6, 0x6, 0xc4696834b8, 0x0, 0x15e6fe0, 0xc4531fa930)
	/usr/local/go/src/os/exec_posix.go:45 +0x1a3
os.StartProcess(0xc453206c60, 0xc, 0xc460542000, 0x6, 0x6, 0xc4696834b8, 0x0, 0x0, 0x6)
	/usr/local/go/src/os/doc.go:28 +0x64
os/exec.(*Cmd).Start(0xc462e33a20, 0xc460542001, 0xc4531fa930)
	/usr/local/go/src/os/exec/exec.go:358 +0x3c9
os/exec.(*Cmd).Run(0xc462e33a20, 0xc4531fa930, 0xc453206c80)
	/usr/local/go/src/os/exec/exec.go:276 +0x2b
github.com/weaveworks/flux/git.execGitCmd(0xc4204f36e0, 0x20, 0x0, 0x0, 0x1f1de20, 0xc4531fa8c0, 0xc469683618, 0x5, 0x5, 0x1f1ee19, ...)
	/home/ubuntu/src/github.com/weaveworks/flux/git/operations.go:206 +0x193
github.com/weaveworks/flux/git.getNote(0xc4204f36e0, 0x20, 0xc420832410, 0xf, 0xc455f3185a, 0x28, 0x0, 0x0, 0x0)
	/home/ubuntu/src/github.com/weaveworks/flux/git/operations.go:109 +0x12c
github.com/weaveworks/flux/git.(*Checkout).GetNote(0xc420124240, 0xc455f3185a, 0x28, 0x0, 0x0, 0x0)
	/home/ubuntu/src/github.com/weaveworks/flux/git/repo.go:170 +0xcb
github.com/weaveworks/flux/daemon.(*Daemon).JobStatus(0xc42067c600, 0xc444295f50, 0x24, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/home/ubuntu/src/github.com/weaveworks/flux/daemon/daemon.go:326 +0x177
github.com/weaveworks/flux/daemon.(*Ref).JobStatus(0xc4202ce870, 0xc444295f50, 0x24, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/home/ubuntu/src/github.com/weaveworks/flux/daemon/ref.go:65 +0x8c
github.com/weaveworks/flux/remote.(*ErrorLoggingPlatform).JobStatus(0xc4204f3580, 0xc444295f50, 0x24, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/home/ubuntu/src/github.com/weaveworks/flux/remote/logging.go:77 +0xde
github.com/weaveworks/flux/remote/rpc.(*RPCServer).JobStatus(0xc437ee0f80, 0xc444295f50, 0x24, 0xc44bf50180, 0x0, 0x0)
	/home/ubuntu/src/github.com/weaveworks/flux/remote/rpc/server.go:77 +0x62
reflect.Value.call(0xc4207a6b40, 0xc437cab728, 0x13, 0x16254fb, 0x4, 0xc469683ee0, 0x3, 0x3, 0xc420b2a340, 0x4, ...)
	/usr/local/go/src/reflect/value.go:434 +0x5c8
reflect.Value.Call(0xc4207a6b40, 0xc437cab728, 0x13, 0xc469683ee0, 0x3, 0x3, 0x0, 0x0, 0x0)
	/usr/local/go/src/reflect/value.go:302 +0xa4
net/rpc.(*service).call(0xc42f866580, 0xc42f866540, 0xc437ee1030, 0xc437eec380, 0xc43415eba0, 0x13fd740, 0xc436108a30, 0x198, 0x145ece0, 0xc44bf50180, ...)
	/usr/local/go/src/net/rpc/server.go:383 +0x148
created by net/rpc.(*Server).ServeCodec
	/usr/local/go/src/net/rpc/server.go:477 +0x421

@squaremo (Member Author)

Superficially, this looks like a bunch of git note ... execs (via os/exec) queued up waiting to run, all in the service of answering daemon.JobStatus().

The JobStatus method responds to things like fluxctl release and the web UI. It will first look in its cache of recent jobs and, failing that, will go through all the commits from HEAD backwards to try to find one with a note representing the job in question. That latter step could of course launch as many execs as there are commits, if the job in question doesn't exist.

One possible mitigation would be to put positive and negative results back in the cache, so the next call will find them (until they get evicted); this will at least de-amplify the execs.

It would be good to figure out why they are piling up, though -- since they are exec'd serially, I would take 600 of them to indicate 600 outstanding JobStatus requests, rather than one request and 600 execs.
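
To make the amplification concrete, here is a minimal sketch of that HEAD-backwards search (illustrative only, with made-up names -- this is not the actual flux code): every commit inspected costs one forked git process.

```go
// Illustrative sketch, not the real flux implementation. It shows why a
// missing note amplifies into one exec per commit: we walk back from HEAD
// and shell out to `git notes show` for every revision until a match.
package notes

import (
	"bufio"
	"bytes"
	"os/exec"
	"strings"
)

// findJobNote runs one `git rev-list` plus one `git notes show` per commit.
// With N commits and no matching note, that is N+1 forks for a single
// JobStatus call.
func findJobNote(repoDir, jobID string) (string, bool, error) {
	out, err := exec.Command("git", "-C", repoDir, "rev-list", "HEAD").Output()
	if err != nil {
		return "", false, err
	}
	scanner := bufio.NewScanner(bytes.NewReader(out))
	for scanner.Scan() {
		rev := scanner.Text()
		note, err := exec.Command("git", "-C", repoDir, "notes", "show", rev).Output()
		if err != nil {
			continue // most commits simply have no note attached
		}
		if strings.Contains(string(note), jobID) {
			return rev, true, nil
		}
	}
	return "", false, scanner.Err()
}
```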

@philwinder added the bug label Aug 24, 2017
@philwinder (Contributor) commented Aug 24, 2017

All of these traces are blocked inside exec_unix.go: they can't acquire the lock needed to fork the process (source: https://golang.org/src/syscall/exec_unix.go?s=5442:5460).

Why the lock can't be acquired there is a mystery; I can't find any other code that could be holding it.

So my guess is that something happened in the middle of one of the fork calls and for some reason it didn't release the lock.

After searching around I've seen various references to issues with Intel processors, BSD, and quite a few changes to the Go codebase, but I can't see a general theme or fix. This does look quite low level, though.

What is even more strange is that we have our own lock that forces us to only perform one operation on a checkout at a time.
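
The lock in question is syscall.ForkLock, a package-level sync.RWMutex that serialises every fork in the whole process. Here is a stand-alone sketch of how Go 1.7's forkExec uses it -- paraphrased with simplified types, not a verbatim copy of the stdlib; the line numbers in the comments are taken from the stack traces above.

```go
// Paraphrase of the locking structure in syscall.forkExec (Go 1.7).
package forksketch

import "sync"

// ForkLock serialises all fork/exec activity in the process: one stuck
// holder means every other exec queues up behind it in [semacquire].
var ForkLock sync.RWMutex

func forkExec(argv0 string, argv []string) (pid int, err error) {
	ForkLock.Lock() // exec_unix.go:185 -- the ~600 goroutines are parked here
	pid, err = forkAndExecInChild(argv0, argv)
	ForkLock.Unlock() // exec_unix.go:198 -- goroutine 18715803 is here

	// The parent then waits on the child's status pipe; by that point the
	// lock has already been released.
	return pid, err
}

// forkAndExecInChild stands in for the real low-level fork+exec.
func forkAndExecInChild(argv0 string, argv []string) (int, error) { return 0, nil }
```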

@philwinder (Contributor)

I've found a trace that is different from the rest of the ~600 in the stack dump.

If you look closely, it is still a JobStatus->GetNote->execGitCmd request, but it is a little further on than the others.

It diverges from the others at this point:
All others: https://github.com/golang/go/blob/go1.7.1/src/os/exec/exec.go#L276
This one: https://github.com/golang/go/blob/go1.7.1/src/os/exec/exec.go#L279

So it has managed to start the process, but is now stuck waiting for the underlying OS process.

I think it is this individual goroutine that has got stuck and all others are blocked behind it.

goroutine 18631036 [syscall]:
syscall.Syscall6(0xf7, 0x1, 0xe2e2, 0xc4320ff980, 0x1000004, 0x0, 0x0, 0x149da00, 0xc46898d3b0, 0x43991e)
	/usr/local/go/src/syscall/asm_linux_amd64.s:44 +0x5
os.(*Process).blockUntilWaitable(0xc43abf5a10, 0x16fa1c0, 0xc42becadc0, 0xc43390a6e0)
	/usr/local/go/src/os/wait_waitid.go:28 +0xbc
os.(*Process).wait(0xc43abf5a10, 0x0, 0x415f5e, 0xc4339086b0)
	/usr/local/go/src/os/exec_unix.go:22 +0xab
os.(*Process).Wait(0xc43abf5a10, 0xc46898d4b8, 0x0, 0x0)
	/usr/local/go/src/os/doc.go:49 +0x2b
os/exec.(*Cmd).Wait(0xc42becadc0, 0x0, 0x0)
	/usr/local/go/src/os/exec/exec.go:434 +0x6d
os/exec.(*Cmd).Run(0xc42becadc0, 0xc44e7af3b0, 0xc433908650)
	/usr/local/go/src/os/exec/exec.go:279 +0x48
github.com/weaveworks/flux/git.execGitCmd(0xc4204f36e0, 0x20, 0x0, 0x0, 0x1f1de20, 0xc44e7af340, 0xc46898d618, 0x5, 0x5, 0x1f1ee60, ...)
	/home/ubuntu/src/github.com/weaveworks/flux/git/operations.go:206 +0x193
github.com/weaveworks/flux/git.getNote(0xc4204f36e0, 0x20, 0xc420832410, 0xf, 0xc44222bd3c, 0x28, 0x0, 0x0, 0x0)
	/home/ubuntu/src/github.com/weaveworks/flux/git/operations.go:109 +0x12c
github.com/weaveworks/flux/git.(*Checkout).GetNote(0xc420124240, 0xc44222bd3c, 0x28, 0x0, 0x0, 0x0)
	/home/ubuntu/src/github.com/weaveworks/flux/git/repo.go:170 +0xcb
github.com/weaveworks/flux/daemon.(*Daemon).JobStatus(0xc42067c600, 0xc437d31950, 0x24, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/home/ubuntu/src/github.com/weaveworks/flux/daemon/daemon.go:326 +0x177
github.com/weaveworks/flux/daemon.(*Ref).JobStatus(0xc4202ce870, 0xc437d31950, 0x24, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/home/ubuntu/src/github.com/weaveworks/flux/daemon/ref.go:65 +0x8c
github.com/weaveworks/flux/remote.(*ErrorLoggingPlatform).JobStatus(0xc4204f3580, 0xc437d31950, 0x24, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/home/ubuntu/src/github.com/weaveworks/flux/remote/logging.go:77 +0xde
github.com/weaveworks/flux/remote/rpc.(*RPCServer).JobStatus(0xc437ee0f80, 0xc437d31950, 0x24, 0xc43ce49b80, 0x0, 0x0)
	/home/ubuntu/src/github.com/weaveworks/flux/remote/rpc/server.go:77 +0x62
reflect.Value.call(0xc4207a6b40, 0xc437cab728, 0x13, 0x16254fb, 0x4, 0xc46898dee0, 0x3, 0x3, 0xc420ab9d40, 0x4, ...)
	/usr/local/go/src/reflect/value.go:434 +0x5c8
reflect.Value.Call(0xc4207a6b40, 0xc437cab728, 0x13, 0xc46898dee0, 0x3, 0x3, 0x0, 0x0, 0x0)
	/usr/local/go/src/reflect/value.go:302 +0xa4
net/rpc.(*service).call(0xc42f866580, 0xc42f866540, 0xc437ee1030, 0xc437eec380, 0xc433902780, 0x13fd740, 0xc439d54a30, 0x198, 0x145ece0, 0xc43ce49b80, ...)
	/usr/local/go/src/net/rpc/server.go:383 +0x148
created by net/rpc.(*Server).ServeCodec
	/usr/local/go/src/net/rpc/server.go:477 +0x421
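
For orientation, os/exec.(*Cmd).Run in Go 1.7 is essentially just Start followed by Wait, which is why exec.go:276 versus exec.go:279 distinguishes "still trying to fork" from "forked, now waiting for the child to exit" (excerpted, comments added):

```go
// os/exec.(*Cmd).Run, Go 1.7 (comments added). Start forks the child (taking
// syscall.ForkLock on the way); Wait blocks until the child exits. A goroutine
// at exec.go:279 has therefore already forked and released ForkLock, and is
// only waiting on the git child process.
func (c *Cmd) Run() error {
	if err := c.Start(); err != nil { // exec.go:276 -- the ~600 blocked goroutines are inside this call
		return err
	}
	return c.Wait() // exec.go:279 -- goroutine 18631036 is here
}
```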

@philwinder (Contributor) commented Aug 24, 2017

So one potential solution here is to use exec.CommandContext instead of exec.Command and pass in a context with a timeout, so the command can be cancelled. This should avoid indefinite blocking caused by an underlying issue with a forked process.

The only reason I can give for bypassing our lock is that these are separate checkouts. I.e. this represents 600 daemon loops that are all blocked by a fundamental inability to fork a process.

It doesn't explain why it happened, but I'm fairly confident it wasn't caused by our code. We should still put a timeout in place to mitigate it.
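
A minimal sketch of that mitigation (illustrative; the helper name, parameters and one-minute timeout are assumptions, not the change that actually landed):

```go
// Sketch of a context-timeout wrapper around git commands. The helper name,
// parameters and timeout value are made up for illustration.
package git

import (
	"context"
	"os/exec"
	"time"
)

// execGitCmdWithTimeout runs a git command but gives up after `timeout`, so a
// child process that never exits can no longer wedge the caller (and, by
// extension, everything queued behind the process-wide fork lock).
func execGitCmdWithTimeout(dir string, timeout time.Duration, args ...string) ([]byte, error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	cmd := exec.CommandContext(ctx, "git", args...) // kills the child when ctx expires
	cmd.Dir = dir
	out, err := cmd.Output()
	if ctx.Err() == context.DeadlineExceeded {
		return nil, ctx.Err()
	}
	return out, err
}
```

For example, the GetNote path could then call execGitCmdWithTimeout(workingDir, time.Minute, "notes", "show", rev).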

@rade (Contributor) commented Aug 24, 2017

Next time this happens, it would be worth checking the process tree, since that should allow us to determine the exact command that got stuck (if that is indeed what happens).

@squaremo (Member Author)

> The only reason I can give for bypassing our lock is that these are separate checkouts.

It's an RWLock, and we take an RLock to do GetNote and CommitsBetween etc.
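
That detail matters for the pile-up: a read lock admits any number of readers at once, so hundreds of JobStatus calls can be inside GetNote simultaneously, each forking its own git process. A minimal sketch (type and method names are hypothetical, not the real flux API):

```go
// Sketch only: illustrates that a read lock does not serialise GetNote calls.
package checkout

import (
	"os/exec"
	"sync"
)

type Checkout struct {
	mu  sync.RWMutex // guards the working copy
	dir string
}

func (c *Checkout) GetNote(rev string) (string, error) {
	c.mu.RLock() // many readers may hold this at the same time...
	defer c.mu.RUnlock()
	// ...so many `git notes show` children can be in flight together, all of
	// them queueing behind the process-wide syscall.ForkLock when they fork.
	out, err := exec.Command("git", "-C", c.dir, "notes", "show", rev).Output()
	return string(out), err
}
```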

@squaremo (Member Author)

The ForkLock is taken and released when the process is started (inside os.StartProcess()), so it will have been released already by goroutine 18631036 [syscall], which is past that point.

philwinder pushed a commit that referenced this issue Aug 24, 2017
To mitigate against any low level execution problems, add a context timeout to git commands so that if they fail, it won't cause subsequent requests to back up.

Fixes #714
@philwinder self-assigned this Aug 24, 2017
@squaremo (Member Author)

Another mitigation: listing the notes first (one exec), and only calling git note if there's known to be one for a particular commit, would help reduce the number of execs.

In our own config repo, there are about 700 notes and 7000 commits (although we have only been using flux gitsync for a relatively short time).
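
A sketch of that single-exec approach (illustrative names, not the actual change -- that landed in the commits referenced below): git notes list prints one "<note object> <annotated commit>" pair per line, so a single exec is enough to build a set of commits that have notes.

```go
// Illustrative sketch: build a set of commit IDs that carry notes from a
// single `git notes list` invocation, then consult it before ever running
// `git notes show`. Function names are made up.
package gitnotes

import (
	"bufio"
	"bytes"
	"os/exec"
	"strings"
)

func noteRevisions(dir string) (map[string]struct{}, error) {
	out, err := exec.Command("git", "-C", dir, "notes", "list").Output()
	if err != nil {
		return nil, err
	}
	revs := make(map[string]struct{})
	scanner := bufio.NewScanner(bytes.NewReader(out))
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		if len(fields) == 2 {
			revs[fields[1]] = struct{}{} // second field is the annotated commit ID
		}
	}
	return revs, scanner.Err()
}
```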

philwinder added a commit that referenced this issue Aug 28, 2017
This is an optimisation for #714. We perform a single gitCommandExec to
get hashes for all commits with notes, place them into a map and use
them to query whether a commit has a note or not. This prevents
multiple calls to gitCommandExec just to see if there is a note
attached.
philwinder added further commits referencing this issue, with the same message, on Aug 28 and Aug 29, 2017.
@squaremo (Member Author)

I suspect this isn't over.

@squaremo reopened this Aug 29, 2017
philwinder added a commit that referenced this issue Aug 29, 2017
* Add method to list revisions with notes

This method uses git notes list to get all notes for a given note ref.
We then turn the list into an array and take the second field from the
results (which corresponds to the object reference - the commit id in
our case). Finally, the result is placed in a map to make it easier
to do "if note is in" type queries later.

* Check if notes exist before requesting them

This is an optimisation for #714. We perform a single gitCommandExec to
get hashes for all commits with notes, place them into a map and use
them to query whether a commit has a note or not. This prevents
multiple calls to gitCommandExec just to see if there is a note
attached.

* Add extra check for error seen on ubuntu systems.

* Add context.
@squaremo (Member Author)

We've not seen the explosion in goroutines and memory since #721 and #726, so I'm tentatively closing this again -- I think the git note pile-up has at least been mitigated.

@suprememoocow commented Dec 14, 2017

I stumbled across this issue as we're having a very similar problem on our project. We're building a git RPC service for GitLab and, in production, each daemon process is spawning ~20 git commands per second. Occasionally we've been experiencing lockups, including health-check ports not accepting connections. As in this issue, there appears to be contention around the ForkLock in exec_unix.go.

Seeing that you worked around this issue rather than completely solved it, I figured I would share our progress with you, in case you experience it again in future. 🙂

The issue is documented here: https://gitlab.com/gitlab-org/gitaly/issues/823.

We've investigated many different potential solutions, but we're focusing our debugging on a slowdown in fork performance related to process VMM size: see golang/go#5838 (comment)

Generally, right before the process becomes unresponsive we see an increase in VM size.

Go 1.9 switched to posix_spawn which does not copy memory to the child process and seems to have more stable performance over various VMM sizes.

We're planning on switching to a Go 1.9 binary to test if this solves the problem.

Hope this comes in handy at some point! 🙂
