clustering: delay startup until after the HTTP server is up #3909

Merged

tpaschalis merged 3 commits into grafana:main from delay-cluster-bootstrap on May 18, 2023

Conversation

tpaschalis
Member

PR Description

We would like to delay the clusterer implementation's startup (which includes connecting to the configured list of peers) until after the Flow HTTP server is up.

This way, if Start() experiences delays or deadlocks, we will still be able to grab pprof profiles to figure out what's wrong. It also allows the clusterer implementation to always have at least one valid peer to connect to: itself.
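
For context, the ordering change can be pictured roughly like this: bring the HTTP server (and its /debug/pprof endpoints) up first, then start the clusterer. This is only a minimal sketch with hypothetical names (startClusterer, the listen address), not the actual Flow code in cmd/internal/flowmode/cmd_run.go:

package main

import (
	"errors"
	"fmt"
	"log"
	"net"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof handlers on http.DefaultServeMux
)

// startClusterer stands in for the real clusterer Start(), which connects to
// the configured peers and may block or deadlock.
func startClusterer() error { return nil }

func run() error {
	ln, err := net.Listen("tcp", "127.0.0.1:12345")
	if err != nil {
		return err
	}

	// Serve HTTP (including /debug/pprof) first, in the background...
	go func() {
		if err := http.Serve(ln, nil); err != nil && !errors.Is(err, http.ErrServerClosed) {
			log.Printf("http server exited: %v", err)
		}
	}()

	// ...and only then let the clusterer join its peers. If this call hangs,
	// pprof profiles are still reachable over the already-running HTTP server.
	if err := startClusterer(); err != nil {
		return fmt.Errorf("starting clusterer: %w", err)
	}
	return nil
}

func main() {
	if err := run(); err != nil {
		log.Fatal(err)
	}
}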

Which issue(s) this PR fixes

No issue filed.

Notes to the Reviewer

Nothing in particular.

PR Checklist

  • CHANGELOG updated (N/A)
  • Documentation added (N/A)
  • Tests updated (N/A)

Signed-off-by: Paschalis Tsilias <paschalis.tsilias@grafana.com>
@tpaschalis tpaschalis changed the title from "Delay clusterer Start() call until after the HTTP is up" to "clustering: delay startup until after the HTTP server is up" on May 17, 2023
@tpaschalis tpaschalis requested a review from thampiotr May 17, 2023 08:19
cmd/internal/flowmode/cmd_run.go (outdated, resolved)
pkg/cluster/cluster.go (outdated, resolved)
// Nodes initially join in the Viewer state. We can move to the
// Participant state to signal that we wish to participate in reading
// or writing data.
err = node.ChangeState(context.Background(), peer.StateParticipant)
Contributor
Should we use a context with a timeout here? Could it block for too long?

Member Author
@tpaschalis tpaschalis May 17, 2023
This will block until the message has been broadcast to at least one peer; cancelling the context here will still broadcast the message, but we won't wait for confirmation.

Since joining a new cluster prompts other peers to Update their components and re-distribute work based on what our State is, I think we should also validate that we've successfully broadcast our intention to participate in splitting the workload.

WDYT? Otherwise, if this message never reaches other peers, we'll think it got through and end up performing duplicate work, since the cluster won't agree on our State.

Contributor
I was thinking more about the case where we'd wait forever for the state to change - e.g. when we try to broadcast to a single peer that became unresponsive.

But yeah, I gather that if we had a timeout here, we could successfully broadcast the state change to become a Participant and time out at the same time, leaving us with an inconsistent cluster state. Would the cluster still recover, because there would be no heartbeats after this instance quits?

Another question: after ChangeState returns, did at least one peer successfully receive the broadcast, or did we just successfully send the broadcast without knowing whether it was received? I couldn't immediately see this from the code.

Member Author

Ok, so it looks like ChangeState returns as soon as the message is about to be broadcast to its peers (we don't know whether it was received).

Let's start by adding a hardcoded timeout of 5 seconds just to be safe; this could indeed help us avoid other deadlock situations in the future.
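
For reference, a minimal sketch of what that could look like around the snippet above. The node and peer.StateParticipant names come from the diff; the 5-second constant, the error wrapping, and the surrounding structure are illustrative rather than the merged code, and the fragment assumes the context, time, and fmt imports:

// Nodes initially join in the Viewer state. We can move to the
// Participant state to signal that we wish to participate in reading
// or writing data. Bound the call so a slow or unresponsive peer
// can't block startup forever.
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()

if err := node.ChangeState(ctx, peer.StateParticipant); err != nil {
	return fmt.Errorf("failed to change state to Participant: %w", err)
}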

pkg/cluster/cluster.go (outdated, resolved)
pkg/cluster/cluster.go (resolved)
@tpaschalis tpaschalis marked this pull request as ready for review May 17, 2023 11:46
Signed-off-by: Paschalis Tsilias <paschalis.tsilias@grafana.com>
Signed-off-by: Paschalis Tsilias <paschalis.tsilias@grafana.com>
Contributor
@thampiotr thampiotr left a comment

LGTM!

@tpaschalis tpaschalis merged commit b70e118 into grafana:main May 18, 2023
@tpaschalis tpaschalis deleted the delay-cluster-bootstrap branch May 18, 2023 14:33
clayton-cornell pushed a commit that referenced this pull request Aug 14, 2023
Signed-off-by: Paschalis Tsilias <paschalis.tsilias@grafana.com>
clayton-cornell pushed a commit that referenced this pull request Aug 14, 2023
Signed-off-by: Paschalis Tsilias <paschalis.tsilias@grafana.com>
@github-actions github-actions bot added the frozen-due-to-age Locked due to a period of inactivity. Please open new issues or PRs if more discussion is needed. label Feb 25, 2024
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 25, 2024