need ability to force disconnect from broker peer #3805

Closed
garlick opened this issue Jul 29, 2021 · 0 comments

garlick commented Jul 29, 2021

As we debug broker resiliency for the system instance, it would be useful, both for testing and for triggering recovery in a hung instance, to be able to force a broker peer to "panic" and restart its subtree, as though its parent had crashed.

This may be useful in conjunction with #2797

garlick added a commit to garlick/flux-core that referenced this issue Aug 27, 2021
Problem: a misbehaving node may need to be administratively
detached from the flux instance.

Define a keepalive message type of KEEPALIVE_DISCONNECT that can
be sent by parent to child to force a disconnect.  Upon receiving
this message, the child disconnects the socket, purges the parent
RPC tracker, and marks the connection offline so future RPCs fail
with EHOSTUNREACH.

Add an RPC overlay.disconnect-subtree that takes a rank argument,
so that a system administrator could initiate teardown of a problem
node.

Fixes flux-framework#3805
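
The child-side behavior described in the commit message above can be sketched roughly as follows. This is a toy model, not the flux-core implementation: `struct parent_link`, `child_handle_keepalive()`, `child_rpc_to_parent()`, and the placeholder enum value are hypothetical names invented for illustration. Only `KEEPALIVE_DISCONNECT` and `EHOSTUNREACH` come from the commit message.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Only KEEPALIVE_DISCONNECT is named in the commit message;
 * the other value is a hypothetical placeholder. */
enum keepalive_type {
    KEEPALIVE_OTHER_HYPOTHETICAL,
    KEEPALIVE_DISCONNECT,
};

/* Hypothetical stand-in for the child's view of its parent connection. */
struct parent_link {
    bool connected;     /* socket is connected to the parent */
    bool offline;       /* connection administratively marked offline */
    int pending_rpcs;   /* stand-in for the parent RPC tracker */
};

/* Child side: react to a keepalive message received from the parent. */
void child_handle_keepalive (struct parent_link *link,
                             enum keepalive_type type)
{
    assert (link != NULL);
    if (type != KEEPALIVE_DISCONNECT)
        return;
    link->connected = false;   /* disconnect the socket */
    link->pending_rpcs = 0;    /* purge the RPC tracker (here: just reset) */
    link->offline = true;      /* future RPCs must now fail */
}

/* Child side: attempt to send an RPC upstream.
 * Returns 0 on success, -1 with errno set on failure. */
int child_rpc_to_parent (struct parent_link *link)
{
    assert (link != NULL);
    if (link->offline) {
        errno = EHOSTUNREACH;
        return -1;
    }
    link->pending_rpcs++;      /* track the outstanding request */
    return 0;
}
```

In this model, an RPC sent before the disconnect is tracked normally, while any RPC attempted after `child_handle_keepalive()` sees `KEEPALIVE_DISCONNECT` fails immediately with `EHOSTUNREACH` instead of hanging. A real tracker purge would presumably also respond to the outstanding requests rather than just dropping the count.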
@mergify mergify bot closed this as completed in 49ab11d Sep 1, 2021
chu11 pushed a commit to chu11/flux-core that referenced this issue Sep 28, 2021