-
Notifications
You must be signed in to change notification settings - Fork 486
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
clustering: delay startup until after the HTTP server is up #3909
Conversation
Signed-off-by: Paschalis Tsilias <paschalis.tsilias@grafana.com>
pkg/cluster/cluster.go
Outdated
// Nodes initially join in the Viewer state. We can move to the | ||
// Participant state to signal that we wish to participate in reading | ||
// or writing data. | ||
err = node.ChangeState(context.Background(), peer.StateParticipant) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we use a context with a timeout here? Could it block for too long?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will block until the message has been broadcast to at least one peer; cancelling the context here will still try to broadcast the message but we'll not wait for the confirmation.
Since joining a new cluster will prompt other peers to Update their components and re-distribute the work based on what our State is, I think we should validate that we've successfully broadcast our intention to participate in splitting the workload, too.
WDYT? Otherwise if this message never reached other peers, we'll think we got it through and try to perform duplicate work since the cluster won't agree on our State.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was thinking more about the case where we'll wait forever for the state to change - e.g. when we try to broadcast to a single peer that became unresponsive.
But yeah, I gather if we had a timeout here, we could broadcast state change to become a Participant successfully and at the same time timeout, leaving with an incorrect cluster state. Would the cluster still recover because there would be no heartbeats after this instance quits?
Another question: after ChangeState
returns - did at least one peer successfully receive the broadcast, or did we just successfully send the broadcast, but don't know if it was received? I couldn't immediately see this from the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, so it looks like that ChangeState returns as soon as the message is about to be broadcast to its peers (we don't know if it was received).
Let's start with adding a hardcoded timeout of 5 seconds just to be safe; this indeed could allow us to avoid other deadlock situations in the future.
Signed-off-by: Paschalis Tsilias <paschalis.tsilias@grafana.com>
Signed-off-by: Paschalis Tsilias <paschalis.tsilias@grafana.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Signed-off-by: Paschalis Tsilias <paschalis.tsilias@grafana.com>
Signed-off-by: Paschalis Tsilias <paschalis.tsilias@grafana.com>
PR Description
We would like to delay the clusterer implementation's startup (which includes connecting to the configured list of peers) until after the Flow HTTP server is up.
In that case, when Start() is experiencing delays, or it deadlocks, we will still be able to grab pprof profiles to figure out what's wrong. This can also allow the clusterer implementation to always have at least one valid peer to connect to; itself.
Which issue(s) this PR fixes
No issue filed.
Notes to the Reviewer
Nothing in particular.
PR Checklist