-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
deadlock: Error in eventloop deadlock with peerController shutdown #82
Conversation
The following deadlock will occur when: - We are running process.eventLoop in a go group which means that it'll cancel context on error from any member of the group. - Inside of process.eventLoop, we spin up a set of peerControlers in a secondary go group that rely on the parent context (associated with the top level group) for cancellation. - There is an error in eventLoop due to an unknown channel, which prompts it to return. - The defer function waits for the peerController group to exit. - Since the top level context will only be cancelled when eventLoop successfully exits (because receiving the error will cancel ctx), we hit a deadlock: -> eventLoop is waiting for all peerControllers to exit on a canceled context to return an error. -> the group is waiting on eventLoop to return an error to cancel context.
One higher-level question this introduced for me is whether we want circuitbreaker to exit on As suggested here, introducing pending close channels would completely cover us (so that we never expect this error) but the short channel ID isn't currently included with pending channel information (lightningnetwork/lnd#5575). Given that there's a PR open (and people seemingly haven't hit this issue before) I think it's okay to error out and just wait on the upstream fix? But also happy to put something temporary in place. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is indeed fine to exit on channelNotFound
. It isn't a regression and as you say, we can wait for the upstream fix.
_ = group.Wait() | ||
}() | ||
|
||
func (p *process) eventLoop(ctx context.Context, group *errgroup.Group) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Originally I created a secondary group to keep process
maximally self-contained. I think the change that you propose is effective.
I am wondering though if there is an easy way to avoid the group parameter here. I suppose that would be to also run the event loop itself as a member of the secondary group? This would get rid of the defer too I think and just make group.Wait()
the last statement.
Curious to hear what you think of this option. If you feel it is over-engineered, it's fine to keep what you have.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tried to look up best practices for errGroup
, couldn't find much. Docs say: "A Group is a collection of goroutines working on subtasks that are part of the same overall task." which makes me think that a second group is more canonical?
Also nice for the sake of having eventLoop
completely terminate before returning (current impl lifting that wait up to the first group is a slight behavior change ).
Remove deadlock where: - Eventloop errors out, but waits on peerContollers to exit to return. - peerContollers require eventLoop to exit for their context to be be cancelled. This is achieved by sharing a go error group between the peer controllers and the event loop. By sharing a group, errors anywhere will shut everything down. This change introduces a change in how we wait for clean shutdown: - Previously: eventLoop would only exit once peerControllers shut down. - Now: eventLoop exits on error, and we wait on peerControllers at the call site of eventLoop.
e2e656a
to
66b6bcf
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Happy that you kept the sub-group. Thanks for the bugfix 🙏
Ran into a shutdown issue while testing #80, opening up in a separate PR to demonstrate pre-existing deadlock.
This PR takes the approach of sharing a go group, but the same could be achieved by adding a
ctx
withcancel
ineventLoop
and cancelling it before wegroup.Wait
but that seems like a manual way of achieving the functionality of groups.