broker: track RPC state and send error responses for lost peers #3800

Closed
garlick opened this issue Jul 27, 2021 · 0 comments

garlick commented Jul 27, 2021

When a broker peer is lost, pending RPCs routed through that peer can never receive a response. Brokers should track outstanding RPCs and send error responses when the next hop dies, so that callers do not hang.

garlick added a commit to garlick/flux-core that referenced this issue Aug 15, 2021
Problem: when TBON children become lost, any pending RPCs
passing through them may go unanswered, leading to hangs
in other parts of the system.

Track pending RPCs for each TBON child.  When a child's
state transitions from an online state to offline/lost,
responses are generated for these RPCs.

An RPC is considered terminated when the request:
- has the NORESPONSE flag set
- has the STREAMING flag set, and a matching error response is received
- has neither flag set, and any matching response is received
- has the same sending UUID as a disconnect request
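
The termination rules above can be sketched as a small predicate. This is an illustrative Python model, not flux-core code (the broker itself is written in C). The `Request` and `Message` types, their field names, and the functions `needs_tracking` and `terminates` are hypothetical names chosen for this sketch, and matching a response by sender UUID plus matchtag is an assumption about how "matching" is decided.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Request:
    """Hypothetical model of an RPC request seen by the tracker."""
    sender: str               # UUID of the requesting peer
    matchtag: int             # tag used to pair responses with the request
    noresponse: bool = False  # models the NORESPONSE flag
    streaming: bool = False   # models the STREAMING flag


@dataclass(frozen=True)
class Message:
    """Hypothetical model of a later message seen by the tracker."""
    kind: str                       # "response" or "disconnect"
    sender: str                     # UUID the message is associated with
    matchtag: Optional[int] = None  # unused for disconnects
    errnum: int = 0                 # nonzero marks an error response


def needs_tracking(req: Request) -> bool:
    # Rule 1: a NORESPONSE request never gets a reply, so it is
    # terminated as soon as it is sent and is never pending.
    return not req.noresponse


def terminates(req: Request, msg: Message) -> bool:
    """Return True if msg ends the pending RPC req."""
    if msg.kind == "disconnect":
        # Rule 4: a disconnect from the same sender ends its RPCs.
        return msg.sender == req.sender
    if msg.kind != "response":
        return False
    if msg.sender != req.sender or msg.matchtag != req.matchtag:
        return False  # response belongs to some other RPC
    if req.streaming:
        # Rule 2: a streaming RPC ends only on an error response.
        return msg.errnum != 0
    # Rule 3: an ordinary RPC ends on any matching response.
    return True
```

In this model a streaming request stays pending across any number of successful responses and is dropped only when an error response or a disconnect arrives.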

Note: this only affects RPCs where the next hop is in the
downstream/leaves direction.  Each broker along the path
of a multi-hop RPC tracks RPCs routed to its downstream peer,
but only the broker whose downstream peer transitions to
lost or offline sends an error response.
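
The multi-hop behavior described in this note can be modeled roughly as follows. This is a hedged Python sketch, not the flux-core implementation: `ChildTracker` and its method names are hypothetical, and the use of `EHOSTUNREACH` as the error number in generated responses is an assumption made for illustration.

```python
import errno
from collections import defaultdict


class ChildTracker:
    """Hypothetical per-broker table of RPCs routed to each TBON child."""

    def __init__(self):
        # child rank -> set of (sender uuid, matchtag) still pending
        self.pending = defaultdict(set)

    def request_sent(self, child, sender, matchtag):
        # Record an RPC whose next hop is the given downstream child.
        self.pending[child].add((sender, matchtag))

    def response_received(self, child, sender, matchtag):
        # A matching response terminates the RPC (non-streaming case).
        self.pending[child].discard((sender, matchtag))

    def child_lost(self, child):
        """Return error responses for every RPC still pending on child."""
        lost = self.pending.pop(child, set())
        return [
            {"sender": s, "matchtag": t, "errnum": errno.EHOSTUNREACH}
            for (s, t) in sorted(lost)
        ]


# Example path: rank 0 -> rank 1 -> rank 3. Both hops track the RPC.
rank0, rank1 = ChildTracker(), ChildTracker()
rank0.request_sent(child=1, sender="uuid-a", matchtag=7)
rank1.request_sent(child=3, sender="uuid-a", matchtag=7)

# Rank 3 is lost: only rank 1, whose downstream peer died, generates
# the error response.
errors = rank1.child_lost(3)

# The error travels upstream; rank 0 sees it as an ordinary matching
# response from its child (rank 1) and clears its own entry.
for e in errors:
    rank0.response_received(1, e["sender"], e["matchtag"])
```

Because rank 0's child (rank 1) never went offline, rank 0 generates nothing itself; it only forwards the error produced one hop further down.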

This PR does not address loss of the parent.

Fixes flux-framework#3800
garlick added a commit to garlick/flux-core that referenced this issue Aug 15, 2021
garlick added a commit to garlick/flux-core that referenced this issue Aug 16, 2021
garlick added a commit to garlick/flux-core that referenced this issue Aug 16, 2021
garlick added a commit to garlick/flux-core that referenced this issue Aug 17, 2021
@mergify mergify bot closed this as completed in 8cc6bee Aug 23, 2021
chu11 pushed a commit to chu11/flux-core that referenced this issue Sep 28, 2021