
Make frontend drain traffic time configurable #3934

Merged: 4 commits merged into temporalio:master on Feb 10, 2023

Conversation

@yux0 (Contributor) commented Feb 10, 2023

What changed?
Make frontend drain traffic time configurable

Why?
Make frontend drain traffic time configurable

How did you test it?

Potential risks

Is hotfix candidate?

@yux0 yux0 requested a review from a team as a code owner February 10, 2023 00:27
@yux0 yux0 requested a review from yycptt February 10, 2023 00:27

logger.Info("ShutdownHandler: Updating gRPC health status to ShuttingDown")
s.healthServer.Shutdown()

logger.Info("ShutdownHandler: Waiting for others to discover I am unhealthy")
time.Sleep(failureDetectionTime)
time.Sleep(10 * time.Second)
Member commented:

Should we have another dynamic config for this? This actually seems like the one that's more dependent on the environment (external health-check frequency). The requestDrainTime can be fixed to 5s or 10s, since we use 5s or 10s timeouts on RPCs.

Member commented:

+1 for having a separate knob for this.

The timeout comes from the client, which can be quite long, I think? Or maybe I misunderstood something?

yux0 (Contributor, Author) replied:

Sure, will add a separate config.

Member commented:

My understanding is that during this sleep, any RPCs that end up here will still be handled, but we expect some external system to do a health check, notice the "shutting down" response, and adjust its state to stop sending RPCs here. That timeout is controlled by the load balancer.

The second sleep is when we stop accepting RPCs but continue processing ones that have already come in. That one depends on how long we expect our operations to take, which we have more control over. For long-polls, we can just fail and let them get retried. For everything else, if it takes more than 10s something is probably going wrong, so it seems okay to fail. But there's no harm in making that configurable too.
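To make the two phases concrete, here is a minimal sketch of a shutdown handler along these lines. It is illustrative only, not the server's actual code: the drainConfig struct, field names, and the GracefulStop/Stop wiring are assumptions for this example.

```go
package frontend

import (
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/health"
)

// drainConfig is a hypothetical holder for the two durations discussed
// above; the real server reads them from dynamic config.
type drainConfig struct {
	failureDetectionTime time.Duration // until the load balancer notices NOT_SERVING
	drainTime            time.Duration // for in-flight RPCs to finish
}

type server struct {
	grpcServer   *grpc.Server
	healthServer *health.Server
	config       drainConfig
}

// shutdownHandler sketches the two-phase drain.
func (s *server) shutdownHandler() {
	// Phase 1: flip gRPC health checks to NOT_SERVING. RPCs arriving during
	// this sleep are still handled; we are only waiting for the external
	// load balancer's health-check cycle to route traffic away, which is
	// environment-dependent, hence the dedicated config knob.
	log.Println("ShutdownHandler: Updating gRPC health status to ShuttingDown")
	s.healthServer.Shutdown()

	log.Println("ShutdownHandler: Waiting for others to discover I am unhealthy")
	time.Sleep(s.config.failureDetectionTime)

	// Phase 2: stop accepting new RPCs but let in-flight ones finish. This
	// is bounded by our own RPC timeouts, so a fixed 5-10s default works;
	// long-polls that outlive the window are cut off and retried by clients.
	done := make(chan struct{})
	go func() {
		s.grpcServer.GracefulStop()
		close(done)
	}()
	select {
	case <-done:
	case <-time.After(s.config.drainTime):
		s.grpcServer.Stop() // hard-stop stragglers such as long-polls
	}
}
```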

```diff
@@ -202,6 +202,8 @@ const (
 	FrontendThrottledLogRPS = "frontend.throttledLogRPS"
 	// FrontendShutdownDrainDuration is the duration of traffic drain during shutdown
 	FrontendShutdownDrainDuration = "frontend.shutdownDrainDuration"
+	// FrontendMembershipFailureDetectionDuration is the duration of membership failure detection
+	FrontendMembershipFailureDetectionDuration = "frontend.membershipFailureDetectionDuration"
```
Member commented:

This is about gRPC health checks (as done by an external load balancer or similar component), not membership. I don't think ringpop uses gRPC health checks, does it?

Suggested change:

```diff
-	FrontendMembershipFailureDetectionDuration = "frontend.membershipFailureDetectionDuration"
+	FrontendShutdownFailHealthcheckDuration = "frontend.membershipShutdownFailHealthcheckDuration"
```

yux0 (Contributor, Author) replied:

No, it doesn't.
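For context on what the external component actually probes, a health check against the frontend might look like this minimal sketch using the standard gRPC health protocol. The address and timeout are assumptions; a real load balancer runs its own probe loop.

```go
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

func main() {
	// Hypothetical frontend address; a real load balancer probes each
	// backend it knows about on its own schedule.
	conn, err := grpc.Dial("frontend:7233",
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	// Once healthServer.Shutdown() has run on the frontend, this returns
	// NOT_SERVING and the balancer stops routing new RPCs to that node.
	resp, err := healthpb.NewHealthClient(conn).Check(ctx,
		&healthpb.HealthCheckRequest{})
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("health status: %s", resp.GetStatus())
}
```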

Member commented:

Although, maybe it also makes sense to add a call to membershipMonitor.EvictSelf() at the same time we start failing health checks? That will make workers stop sending RPCs to this frontend, I think.
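A minimal sketch of that idea follows. EvictSelf is part of the Temporal server's membership monitor; the narrow interface and the function wiring here are assumptions for illustration, not the actual call site.

```go
package frontend

import "log"

// membershipEvictor captures the one method used here; in the Temporal
// server this is part of the membership Monitor interface.
type membershipEvictor interface {
	EvictSelf() error
}

// startShutdown is an illustrative fragment: it pairs the health-check
// flip with eviction from the membership ring.
func startShutdown(healthShutdown func(), membershipMonitor membershipEvictor) {
	// Fail gRPC health checks so the external load balancer drains us...
	healthShutdown()

	// ...and leave the membership ring so ringpop-based callers (e.g.
	// internal clients that resolve frontends via membership) also stop
	// sending RPCs to this node.
	if err := membershipMonitor.EvictSelf(); err != nil {
		log.Printf("failed to evict self from membership ring: %v", err)
	}
}
```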

```diff
@@ -207,6 +208,7 @@ func NewConfig(dc *dynamicconfig.Collection, numHistoryShards int32, enableReadF
 	BlobSizeLimitWarn: dc.GetIntPropertyFilteredByNamespace(dynamicconfig.BlobSizeLimitWarn, 256*1024),
 	ThrottledLogRPS: dc.GetIntProperty(dynamicconfig.FrontendThrottledLogRPS, 20),
 	ShutdownDrainDuration: dc.GetDurationProperty(dynamicconfig.FrontendShutdownDrainDuration, 0*time.Second),
+	ShutdownFailureDetectionDuration: dc.GetDurationProperty(dynamicconfig.FrontendShutdownFailHealthcheckDuration, 10*time.Second),
```
Member commented:

Name this the same as the property too?

@yux0 yux0 enabled auto-merge (squash) February 10, 2023 19:44
@yux0 yux0 merged commit 1fb0697 into temporalio:master Feb 10, 2023
@yux0 yux0 deleted the configurable-drain branch February 10, 2023 20:39