clustering: delay startup until after the HTTP server is up #3909

tpaschalis · 2023-05-17T08:00:44Z

PR Description

We would like to delay the clusterer implementation's startup (which includes connecting to the configured list of peers) until after the Flow HTTP server is up.

That way, if Start() is delayed or deadlocks, we can still grab pprof profiles to figure out what's wrong. It also means the clusterer implementation always has at least one valid peer to connect to: itself.
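
The ordering described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `starter` interface, `stuckClusterer`, `serveThenStart`, and `probeWhileStartBlocks` are hypothetical names, and the real Flow HTTP server is replaced with a plain `net/http` server exposing the standard pprof handlers.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"net/http"
	_ "net/http/pprof" // side effect: registers /debug/pprof/ on http.DefaultServeMux
	"time"
)

// starter is a hypothetical stand-in for the clusterer implementation.
type starter interface {
	Start(ctx context.Context) error
}

// stuckClusterer simulates a Start() that hangs until its context is cancelled.
type stuckClusterer struct{}

func (stuckClusterer) Start(ctx context.Context) error {
	<-ctx.Done()
	return ctx.Err()
}

// serveThenStart begins serving HTTP first and only then calls Start in the
// background. It returns the bound address, a channel carrying Start's result,
// and a function that shuts the server down.
func serveThenStart(ctx context.Context, addr string, c starter) (string, <-chan error, func(), error) {
	lis, err := net.Listen("tcp", addr)
	if err != nil {
		return "", nil, nil, err
	}
	srv := &http.Server{Handler: http.DefaultServeMux}
	go srv.Serve(lis) // returns http.ErrServerClosed after Close; ignored in this sketch

	started := make(chan error, 1)
	go func() { started <- c.Start(ctx) }()

	return lis.Addr().String(), started, func() { srv.Close() }, nil
}

// probeWhileStartBlocks fetches the pprof index while Start is still hanging
// and returns the HTTP status code it observed.
func probeWhileStartBlocks() (int, error) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	addr, started, stop, err := serveThenStart(ctx, "127.0.0.1:0", stuckClusterer{})
	if err != nil {
		return 0, err
	}
	defer stop()

	select {
	case err := <-started:
		return 0, fmt.Errorf("start returned unexpectedly: %v", err)
	case <-time.After(50 * time.Millisecond):
		// Start is still blocked, which is the situation we want to debug.
	}

	client := &http.Client{Timeout: 2 * time.Second}
	resp, err := client.Get("http://" + addr + "/debug/pprof/")
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

func main() {
	code, err := probeWhileStartBlocks()
	if err != nil {
		fmt.Println("probe failed:", err)
		return
	}
	fmt.Println("pprof index status while Start is blocked:", code)
}
```

Because the listener is bound before Start() is called, a hung Start() does not prevent fetching profiles from the same process.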

Which issue(s) this PR fixes

No issue filed.

Notes to the Reviewer

Nothing in particular.

PR Checklist

CHANGELOG updated (N/A)
Documentation added (N/A)
Tests updated (N/A)

Signed-off-by: Paschalis Tsilias <paschalis.tsilias@grafana.com>

cmd/internal/flowmode/cmd_run.go

pkg/cluster/cluster.go

thampiotr · 2023-05-17T08:57:34Z

pkg/cluster/cluster.go

+		// Nodes initially join in the Viewer state. We can move to the
+		// Participant state to signal that we wish to participate in reading
+		// or writing data.
+		err = node.ChangeState(context.Background(), peer.StateParticipant)


Should we use a context with a timeout here? Could it block for too long?

This will block until the message has been broadcast to at least one peer; cancelling the context will not stop the broadcast attempt, but we will no longer wait for the confirmation.

Since joining a new cluster will prompt other peers to Update their components and redistribute the work based on what our State is, I think we should also validate that we've successfully broadcast our intention to participate in splitting the workload.

WDYT? Otherwise, if this message never reaches other peers, we'll think it got through and try to perform duplicate work, since the cluster won't agree on our State.

I was thinking more about the case where we wait forever for the state to change, e.g. when we try to broadcast to a single peer that has become unresponsive.

But yeah, I gather that if we had a timeout here, we could successfully broadcast the state change to become a Participant and time out at the same moment, leaving us with an incorrect cluster state. Would the cluster still recover, given that there would be no heartbeats after this instance quits?

Another question: after ChangeState returns, did at least one peer successfully receive the broadcast, or did we only send the broadcast without knowing whether it was received? I couldn't immediately see this from the code.

OK, so it looks like ChangeState returns as soon as the message is about to be broadcast to its peers (we don't know whether it was received).

Let's start by adding a hardcoded timeout of 5 seconds just to be safe; this could indeed help us avoid other deadlock situations in the future.

thampiotr

LGTM!

Delay clusterer Start() call until after the HTTP is up

ece28ce

Signed-off-by: Paschalis Tsilias <paschalis.tsilias@grafana.com>

tpaschalis changed the title ~~Delay clusterer Start() call until after the HTTP is up~~ clustering: delay startup until after the HTTP server is up May 17, 2023

tpaschalis requested a review from thampiotr May 17, 2023 08:19

thampiotr reviewed May 17, 2023

tpaschalis marked this pull request as ready for review May 17, 2023 11:46

Add TODO comment about backoff/retries during restart

2110c45

Signed-off-by: Paschalis Tsilias <paschalis.tsilias@grafana.com>

tpaschalis mentioned this pull request May 18, 2023

Clustering Mode: deadlock during initialization #3919

Closed

Add a timeout to the original ChangeState call

af9b20c

Signed-off-by: Paschalis Tsilias <paschalis.tsilias@grafana.com>

tpaschalis requested a review from thampiotr May 18, 2023 14:12

rfratto mentioned this pull request May 18, 2023

Fix deadlock with Flow clustering #3922

Merged

thampiotr approved these changes May 18, 2023

tpaschalis merged commit b70e118 into grafana:main May 18, 2023

tpaschalis deleted the delay-cluster-bootstrap branch May 18, 2023 14:33

clayton-cornell pushed a commit that referenced this pull request Aug 14, 2023

clustering: delay startup until after the HTTP server is up (#3909)

781ace8

Signed-off-by: Paschalis Tsilias <paschalis.tsilias@grafana.com>

clayton-cornell pushed a commit that referenced this pull request Aug 14, 2023

clustering: delay startup until after the HTTP server is up (#3909)

4d847d2

Signed-off-by: Paschalis Tsilias <paschalis.tsilias@grafana.com>

github-actions bot added the frozen-due-to-age label (Locked due to a period of inactivity. Please open new issues or PRs if more discussion is needed.) Feb 25, 2024

github-actions bot locked as resolved and limited conversation to collaborators Feb 25, 2024
