stability: Goroutine leak when communication fails #8975
Comments
Well that didn't take long. Another node started leaking goroutines:
Looks like #8692; it's all synchronous intent resolution.
How are these leaked though?
See cockroachdb#8975. Open for better suggestions here, but this should give us some idea without being overly burdensome.
See cockroachdb#8975. Print either when there are many retries for the current chunk, or a high sequence number (which could be caused by retries higher up the stack).
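For illustration only, here is a minimal Go sketch of that kind of threshold-gated logging; the constants, function name, and wiring are hypothetical, not the actual change referenced above:

```go
package main

import "log"

// Hypothetical thresholds for illustration; the real change may use
// different values and plumbing.
const (
	retryWarnThreshold    = 10  // many retries for the current chunk
	sequenceWarnThreshold = 100 // high sequence number, i.e. retries higher up the stack
)

// maybeLogRetries stays quiet in the common case and only logs when retry
// activity looks abnormal.
func maybeLogRetries(chunkRetries, sequence int) {
	if chunkRetries > retryWarnThreshold || sequence > sequenceWarnThreshold {
		log.Printf("suspicious retry activity: %d retries for current chunk, sequence %d",
			chunkRetries, sequence)
	}
}

func main() {
	maybeLogRetries(2, 5)   // normal: no output
	maybeLogRetries(50, 5)  // many retries: logged
	maybeLogRetries(2, 500) // high sequence number: logged
}
```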
My guess is that certain pairs of nodes were having trouble talking to each other thanks to the GRPC bug, and the node ended up in a state where it couldn't talk to any member of some important range. But I don't have any evidence to back that up.
Ok. We have logging for that infinite DistSender loop now, so there's nothing actionable here at this point except seeing whether it happens again and then looking at the logs.
This was on 1a34ca3. Seems like that had the grpc fix already. Perhaps some other form of range unavailability, I'll keep looking at rho to see what state it's in.
Here's one more from just now. cc #9034
@mberhault pointed out that these spikes are always post-restart, so they likely occur on freshly started nodes during initialization. Still worth investigating, but likely not the cause of the crash in the picture above.
We fixed several bugs with cancellation propagation since this issue was filed. Closing for now. We should open a new bug if the problem recurs.
On beta, running 2a2fdd9, node 6 leaked goroutines over a period of several hours until it ran out of memory and died:
It's not clear what's going on (since memory and goroutine profiles are not saved for very long after they are collected). One thing worth mentioning is that node 6 was the one implicated in #8939, so it was having trouble talking to certain other nodes. My best guess is that some key ranges (maybe the first range) got rebalanced onto a set of nodes that node 6 could not talk to. This is probably part of our general need to have better backpressure instead of allowing goroutines to pile up indefinitely.
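As a rough illustration of that backpressure idea (a sketch under assumed semantics, not CockroachDB's actual mechanism), a bounded semaphore can make callers wait or fail fast instead of letting goroutines pile up without limit:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// boundedRunner caps the number of concurrent tasks. When the limit is
// reached, Run blocks until a slot frees up or the context is canceled,
// instead of spawning an unbounded number of goroutines.
type boundedRunner struct {
	sem chan struct{}
}

func newBoundedRunner(limit int) *boundedRunner {
	return &boundedRunner{sem: make(chan struct{}, limit)}
}

func (r *boundedRunner) Run(ctx context.Context, task func()) error {
	select {
	case r.sem <- struct{}{}: // acquire a slot
	case <-ctx.Done():
		return errors.New("backpressure: no slot available before context canceled")
	}
	go func() {
		defer func() { <-r.sem }() // release the slot when the task finishes
		task()
	}()
	return nil
}

func main() {
	r := newBoundedRunner(4)
	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()
	for i := 0; i < 10; i++ {
		if err := r.Run(ctx, func() { time.Sleep(time.Second) }); err != nil {
			fmt.Println("rejected:", err) // slots exhausted: caller sees the pressure
		}
	}
}
```

In a real system the limit would be tied to memory or per-node quotas, and rejected work would surface as an error to the caller rather than silently queueing.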
Next time this happens we should be sure to save the goroutine profile. We should probably also change the way old profiles are cleaned up so that we have more history available (e.g. instead of just keeping the last N, keep the most recent few, then one per minute for the last hour, one per hour for the last day, and so on); a rough sketch of that thinning scheme is below.
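A sketch of the tiered cleanup idea, assuming hypothetical names and tier boundaries: keep everything very recent, roughly one per minute for the last hour, one per hour for the last day, and drop the rest.

```go
package main

import (
	"fmt"
	"time"
)

type profile struct {
	takenAt time.Time
	path    string
}

// thin keeps all profiles from the last few minutes, roughly one per minute
// for the last hour, one per hour for the last day, and drops anything older.
// profiles must be sorted newest first. The tier boundaries are illustrative.
func thin(profiles []profile, now time.Time) []profile {
	var kept []profile
	var lastKept time.Time
	for _, p := range profiles {
		age := now.Sub(p.takenAt)
		var minGap time.Duration
		switch {
		case age < 10*time.Minute:
			minGap = 0 // keep everything recent
		case age < time.Hour:
			minGap = time.Minute
		case age < 24*time.Hour:
			minGap = time.Hour
		default:
			continue // older than a day: drop
		}
		if lastKept.IsZero() || lastKept.Sub(p.takenAt) >= minGap {
			kept = append(kept, p)
			lastKept = p.takenAt
		}
	}
	return kept
}

func main() {
	now := time.Now()
	var ps []profile
	for i := 0; i < 300; i++ { // one profile every 5 minutes, going back ~25h
		ps = append(ps, profile{takenAt: now.Add(-time.Duration(i*5) * time.Minute)})
	}
	fmt.Println("kept:", len(thin(ps, now)), "of", len(ps))
}
```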