Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

deadlock: Error in eventloop deadlock with peerController shutdown #82

Merged
merged 2 commits into from
Oct 20, 2023

Conversation

carlaKC
Copy link
Collaborator

@carlaKC carlaKC commented Oct 11, 2023

Ran into a shutdown issue while testing #80, opening up in a separate PR to demonstrate pre-existing deadlock.

This PR takes the approach of sharing a go group, but the same could be achieved by adding a ctx with cancel in eventLoop and cancelling it before we group.Wait but that seems like a manual way of achieving the functionality of groups.

The following deadlock will occur when:
- We are running process.eventLoop in a go group which means that
  it'll cancel context on error from any member of the group.
- Inside of process.eventLoop, we spin up a set of peerControlers in
  a secondary go group that rely on the parent context (associated
  with the top level group) for cancellation.
- There is an error in eventLoop due to an unknown channel, which
  prompts it to return.
- The defer function waits for the peerController group to exit.
- Since the top level context will only be cancelled when eventLoop
  successfully exits (because receiving the error will cancel ctx),
  we hit a deadlock:
  -> eventLoop is waiting for all peerControllers to exit on a
     canceled context to return an error.
  -> the group is waiting on eventLoop to return an error to cancel
     context.
@carlaKC
Copy link
Collaborator Author

carlaKC commented Oct 11, 2023

One higher-level question this introduced for me is whether we want circuitbreaker to exit on channelNotFound?

As suggested here, introducing pending close channels would completely cover us (so that we never expect this error) but the short channel ID isn't currently included with pending channel information (lightningnetwork/lnd#5575).

Given that there's a PR open (and people seemingly haven't hit this issue before) I think it's okay to error out and just wait on the upstream fix? But also happy to put something temporary in place.

Copy link
Contributor

@joostjager joostjager left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is indeed fine to exit on channelNotFound. It isn't a regression and as you say, we can wait for the upstream fix.

_ = group.Wait()
}()

func (p *process) eventLoop(ctx context.Context, group *errgroup.Group) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Originally I created a secondary group to keep process maximally self-contained. I think the change that you propose is effective.

I am wondering though if there is an easy way to avoid the group parameter here. I suppose that would be to also run the event loop itself as a member of the secondary group? This would get rid of the defer too I think and just make group.Wait() the last statement.

Curious to hear what you think of this option. If you feel it is over-engineered, it's fine to keep what you have.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tried to look up best practices for errGroup, couldn't find much. Docs say: "A Group is a collection of goroutines working on subtasks that are part of the same overall task." which makes me think that a second group is more canonical?

Also nice for the sake of having eventLoop completely terminate before returning (current impl lifting that wait up to the first group is a slight behavior change ).

Remove deadlock where:
- Eventloop errors out, but waits on peerContollers to exit to return.
- peerContollers require eventLoop to exit for their context to be
  be cancelled.

This is achieved by sharing a go error group between the peer
controllers and the event loop. By sharing a group, errors anywhere
will shut everything down.

This change introduces a change in how we wait for clean shutdown:
- Previously: eventLoop would only exit once peerControllers shut down.
- Now: eventLoop exits on error, and we wait on peerControllers at the
  call site of eventLoop.
Copy link
Contributor

@joostjager joostjager left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy that you kept the sub-group. Thanks for the bugfix 🙏

@carlaKC carlaKC merged commit cefc7ee into lightningequipment:master Oct 20, 2023
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants