deadlock: Error in eventloop deadlock with peerController shutdown #82

carlaKC · 2023-10-11T14:26:33Z

Ran into a shutdown issue while testing #80, opening up in a separate PR to demonstrate pre-existing deadlock.

This PR takes the approach of sharing a go group, but the same could be achieved by adding a ctx with cancel in eventLoop and cancelling it before we group.Wait but that seems like a manual way of achieving the functionality of groups.

The following deadlock will occur when: - We are running process.eventLoop in a go group which means that it'll cancel context on error from any member of the group. - Inside of process.eventLoop, we spin up a set of peerControlers in a secondary go group that rely on the parent context (associated with the top level group) for cancellation. - There is an error in eventLoop due to an unknown channel, which prompts it to return. - The defer function waits for the peerController group to exit. - Since the top level context will only be cancelled when eventLoop successfully exits (because receiving the error will cancel ctx), we hit a deadlock: -> eventLoop is waiting for all peerControllers to exit on a canceled context to return an error. -> the group is waiting on eventLoop to return an error to cancel context.

carlaKC · 2023-10-11T14:31:47Z

One higher-level question this introduced for me is whether we want circuitbreaker to exit on channelNotFound?

As suggested here, introducing pending close channels would completely cover us (so that we never expect this error) but the short channel ID isn't currently included with pending channel information (lightningnetwork/lnd#5575).

Given that there's a PR open (and people seemingly haven't hit this issue before) I think it's okay to error out and just wait on the upstream fix? But also happy to put something temporary in place.

joostjager

I think it is indeed fine to exit on channelNotFound. It isn't a regression and as you say, we can wait for the upstream fix.

joostjager · 2023-10-17T17:57:18Z

process.go

-		_ = group.Wait()
-	}()
-
+func (p *process) eventLoop(ctx context.Context, group *errgroup.Group) error {


Originally I created a secondary group to keep process maximally self-contained. I think the change that you propose is effective.

I am wondering though if there is an easy way to avoid the group parameter here. I suppose that would be to also run the event loop itself as a member of the secondary group? This would get rid of the defer too I think and just make group.Wait() the last statement.

Curious to hear what you think of this option. If you feel it is over-engineered, it's fine to keep what you have.

Tried to look up best practices for errGroup, couldn't find much. Docs say: "A Group is a collection of goroutines working on subtasks that are part of the same overall task." which makes me think that a second group is more canonical?

Also nice for the sake of having eventLoop completely terminate before returning (current impl lifting that wait up to the first group is a slight behavior change ).

Remove deadlock where: - Eventloop errors out, but waits on peerContollers to exit to return. - peerContollers require eventLoop to exit for their context to be be cancelled. This is achieved by sharing a go error group between the peer controllers and the event loop. By sharing a group, errors anywhere will shut everything down. This change introduces a change in how we wait for clean shutdown: - Previously: eventLoop would only exit once peerControllers shut down. - Now: eventLoop exits on error, and we wait on peerControllers at the call site of eventLoop.

joostjager

Happy that you kept the sub-group. Thanks for the bugfix 🙏

carlaKC requested a review from joostjager October 11, 2023 14:32

carlaKC mentioned this pull request Oct 11, 2023

followups: Configurable Forwarding Limit and Stub Improvements #80

Merged

joostjager reviewed Oct 17, 2023

View reviewed changes

carlaKC force-pushed the eventloop-err branch from e2e656a to 66b6bcf Compare October 18, 2023 14:51

carlaKC requested a review from joostjager October 18, 2023 15:25

joostjager approved these changes Oct 19, 2023

View reviewed changes

carlaKC merged commit cefc7ee into lightningequipment:master Oct 20, 2023
3 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

deadlock: Error in eventloop deadlock with peerController shutdown #82

deadlock: Error in eventloop deadlock with peerController shutdown #82

carlaKC commented Oct 11, 2023

carlaKC commented Oct 11, 2023

joostjager left a comment

joostjager Oct 17, 2023

carlaKC Oct 18, 2023

joostjager left a comment

deadlock: Error in eventloop deadlock with peerController shutdown #82

deadlock: Error in eventloop deadlock with peerController shutdown #82

Conversation

carlaKC commented Oct 11, 2023

carlaKC commented Oct 11, 2023

joostjager left a comment

Choose a reason for hiding this comment

joostjager Oct 17, 2023

Choose a reason for hiding this comment

carlaKC Oct 18, 2023

Choose a reason for hiding this comment

joostjager left a comment

Choose a reason for hiding this comment