
context deadline exceeded in worker log when starting a workflow after upgrading to 1.19.0 #3826

Closed
muralisrini opened this issue Jan 22, 2023 · 14 comments

muralisrini commented Jan 22, 2023

Expected Behavior

Successful workflow creation. We did not see this problem in 1.18.x.

Note that there was no data migration involved (this was a new deployment, including the database).

Actual Behavior

Errors like this appear in the worker log:

2023-01-19T13:46:39.099356181Z {"level":"error","ts":"2023-01-19T13:46:39.099Z","msg":"error starting temporal-sys-history-scanner-workflow workflow","service":"worker","error":"context deadline exceeded","logging-call-at":"scanner.go:232","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/home/builder/temporal/common/log/zap_logger.go:144\ngo.temporal.io/server/service/worker/scanner.(*Scanner).startWorkflow\n\t/home/builder/temporal/service/worker/scanner/scanner.go:232\ngo.temporal.io/server/service/worker/scanner.(*Scanner).startWorkflowWithRetry.func1\n\t/home/builder/temporal/service/worker/scanner/scanner.go:209\ngo.temporal.io/server/common/backoff.ThrottleRetry.func1\n\t/home/builder/temporal/common/backoff/retry.go:170\ngo.temporal.io/server/common/backoff.ThrottleRetryContext\n\t/home/builder/temporal/common/backoff/retry.go:194\ngo.temporal.io/server/common/backoff.ThrottleRetry\n\t/home/builder/temporal/common/backoff/retry.go:171\ngo.temporal.io/server/service/worker/scanner.(*Scanner).startWorkflowWithRetry\n\t/home/builder/temporal/service/worker/scanner/scanner.go:208"}

Steps to Reproduce the Problem

It does not happen every time, but when it does, it seems to persist.

We do not have a simple way to recreate this, but we are hoping it might be a known issue, given that we saw it immediately after going to 1.19.0. We can upgrade to 1.19.1 if this is a known issue fixed in that version.

Attaching Temporal logs:

kubectl-logs-temporal-frontend-c84645c8-wj8fr-temporal-frontend-current.txt
kubectl-logs-temporal-web-557c4795f9-46n76-temporal-web-current.txt
kubectl-logs-temporal-worker-77456bfc96-979hl-temporal-worker-current.txt

Specifications

  • Version: 1.19.0
  • Platform: Ubuntu
yiminc (Member) commented Jan 27, 2023

Is this a new namespace just created?
"wf-namespace":"manetu","error":"Namespace 4ad12a22-a459-4715-836b-b35aacb55d2e is not found."

Did the problem persist after a few minutes?

muralisrini (Author) commented Jan 27, 2023

Is this a new namespace just created? "wf-namespace":"manetu","error":"Namespace 4ad12a22-a459-4715-836b-b35aacb55d2e is not found."

Yes, this is a new namespace just created.

Did the problem persist after a few minutes?

It did.

dnr (Member) commented Jan 27, 2023

It does succeed eventually (after ~19 minutes) but that does seem weird.

It looks like something odd is going on with history and ringpop (service discovery): note that the worker locates frontend, matching, and itself right at startup, but doesn't locate history for another 12 seconds. (The worker doesn't contact history directly to start a workflow, but it suggests that the frontend might have the same problem.) Then at 2023-01-19T13:46:27.123Z it seems to lose history completely, finds it again 25 seconds later, and at the next retry the scanner workflow is created successfully.

Looking at the frontend log itself, the frontend either sees no history pods or has trouble reaching the one it can see, until it eventually succeeds.

So I'd suspect something about the connectivity of the history pod.
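
To rule that in or out, a rough connectivity check along these lines might help. This is only a sketch: the history pod name is a placeholder, the frontend pod name is taken from the attached log file names, 7233 is the gRPC port your configs currently use, and it assumes nc is available in the image.

    # Find the history pod's IP (pod name is a placeholder; check `kubectl get pods -o wide`).
    HISTORY_IP=$(kubectl get pod temporal-history-xxxxx -o jsonpath='{.status.podIP}')

    # Probe history's gRPC port from inside the frontend pod; -w 3 avoids hanging forever.
    kubectl exec -it temporal-frontend-c84645c8-wj8fr -- nc -vz -w 3 "$HISTORY_IP" 7233

    # A refused connection or a timeout here points at pod networking rather than Temporal itself.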

Gibstick commented:

We're seeing this issue too, with the same characteristics:

  • suspicious Current reachable members log messages that seem to indicate services appearing and disappearing for no reason
  • it doesn't always happen, but when it does, the temporal cluster just stays in a broken state for a long time (not sure if it ever recovers)
  • we did not see the issue in 1.18.x
  • after downgrading to 1.18.5, we aren't able to reproduce the issue

Some differences, which don't seem to be substantial:

  • it happens in 1.19.1
  • we're seeing it after a migration, so it's not limited to just fresh deploys

yux0 (Contributor) commented Jan 31, 2023

Can you provide the logs from the history role? Looking at the frontend and worker logs, the worker failed to start because of an error in the frontend, and the frontend errors are related to the history role.

Gibstick commented:

Here are all the logs for completeness. Going by the timestamps in the logs, I started everything at pretty much the same time, at 19:55, and then waited a few minutes for the Current reachable members messages to show up in the history component.

temporal-frontend-logs.txt
temporal-history-logs.txt
temporal-matching-logs.txt
temporal-worker-logs.txt

dnr (Member) commented Jan 31, 2023

I don't have any specific theories yet, just listing a few things I'm noticing in those logs:

  • The worker started about 4m after the others. Not sure if that was intentional.
  • The frontend says unable to bootstrap ringpop after 30s, and then succeeds 8s later. But not reliably: services seem to blink in and out of view after that.
  • All services are using port 7233 for their main grpc port. That's unusual but I don't immediately see why it would cause problems since of course the IPs are different. You could try giving them different ports like the standard configs.
  • They are using distinct membership ports.
  • There are some extra entries in the cluster membership table (e.g. 172.17.0.16:6935,172.17.0.23:6933,172.17.0.20:6939). I'm assuming those are old? I.e. they were running within 20s of starting this set of services but are no longer running? That shouldn't be a problem as long as those are unreachable. If they are still running, then you're pointing two temporal clusters at the same database, which won't work. Assuming they're not still running, can you check if trying to connect to those ports fails immediately or hangs for a long time? I would hope that ringpop can still bootstrap even if those connections just hang, as long as it can make others, but it might be something to look into.
  • Can you see if you can reproduce this if you wait at least 20s between all previous services being stopped and starting any new ones?

Gibstick commented Feb 1, 2023

worker started about 4m after the others

My bad, that's just the worker crash looping because the rest of the cluster is broken. But every run looks about the same.

All services are using port 7233

That was a bit of laziness from when I created this helm chart for deploying temporal on k8s. I've tried it again with unique ports but the issue still manifests in the same way.

extra entries in the cluster membership table

These are old, but this is a case I run into often because I'm running this locally and I frequently bring down all of my pods and bring them up quickly again for local development. However, those stale ones are unreachable with a connection refused error (I just poked those host:ports with netcat from a different pod).
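
In case it's useful, the check was roughly this (the IPs/ports are the stale example entries quoted above; the nc flags assume a BusyBox or GNU netcat):

    # Run from any pod on the cluster network; each stale entry fails fast with "connection refused".
    for endpoint in 172.17.0.16:6935 172.17.0.23:6933 172.17.0.20:6939; do
      host=${endpoint%:*}
      port=${endpoint#*:}
      nc -vz -w 3 "$host" "$port" || echo "$endpoint unreachable"
    done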

wait 20s

WORKAROUND: I can't reproduce this if I wait 20s. And it seems like in the rare cases where I don't wait 20s before restarting but things still work, all of the pods came back with the same IPs as before. So it seems like something changed about service discovery/ringpop that broke my use case.

Is there anything I can tweak to get the old behaviour back? Or is this something that can easily be tweaked from your end in the service discovery code? As far as I can tell, that cluster_membership table is still being written to in the same way, so it seems like something changed about how it's being used.
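
For reference, this is how I eyeball what service discovery has written (a sketch assuming a Postgres-backed persistence store; the pod, database, and user names are placeholders for my local setup):

    # Dump the membership rows Temporal's services register themselves into.
    kubectl exec -it temporal-postgres-0 -- \
      psql -U temporal -d temporal -c 'SELECT * FROM cluster_membership;'

The stale rows for old pod IPs do show up there, which is what makes me wonder whether the way they are read changed in 1.19.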

roded commented Feb 6, 2023

I think that we're experiencing the same issue after upgrading to 1.19.1:
https://temporalio.slack.com/archives/CTRCR8RBP/p1675433773709559

dnr (Member) commented Feb 6, 2023

I think that we're experiencing the same issue after upgrading to 1.19.1: temporalio.slack.com/archives/CTRCR8RBP/p1675433773709559

Interesting. Can you reproduce the problem if you wait at least 20s between all services being stopped and starting them again?

roded commented Feb 6, 2023

Interesting. Can you reproduce the problem if you wait at least 20s between all services being stopped and starting them again?

Yeah, I did manage to reproduce the issue after killing the temporal process (I have just the one) and restarting it in the same container.

Edit: I did wait 20 seconds before restarting.

yux0 (Contributor) commented Feb 7, 2023

I tried to repro this issue by:

  1. Start four service roles: frontend, history, matching, worker.
  2. Kill all processes.
  3. Start only the frontend service role.

The issue I found is that the startup timeout is too short. I added a fix in #3911. Please let me know if it fixes your issue.
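
For anyone who wants to try the same repro, it was roughly this (a sketch using the temporal-server binary; the exact flag spelling may vary between versions, so treat it as an approximation):

    # 1. Start all four roles (each in its own shell / background process).
    ./temporal-server start --service=frontend &
    ./temporal-server start --service=history &
    ./temporal-server start --service=matching &
    ./temporal-server start --service=worker &

    # 2. Kill all of the processes.
    pkill temporal-server

    # 3. Restart only the frontend and watch it try to bootstrap ringpop against
    #    the now-stale cluster_membership entries.
    ./temporal-server start --service=frontend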

yux0 self-assigned this Feb 7, 2023
Gibstick commented Feb 7, 2023

Some more info that was requested on Slack:

Just want to confirm: after you cleaned the cluster_membership table, the issue could not be reproduced? Was the issue reproduced on an upgrade path (1.18 -> 1.19), or is it reproducible on 1.19 with a brand-new cluster?

The first time I saw this issue was from upgrading. But I've since gone back and reproduced it with a brand-new cluster in 1.19.1.

My earlier messages about cleaning the cluster_membership table are just me trying to work around the issue. Basically, if I restart my services enough and hit this issue where they are all stuck, I can bring them all down, clear cluster_membership, and bring them up again; the immediate problem is then solved and my cluster comes back healthy. But it's not a permanent fix, because the issue comes back if I restart quickly again a few times.
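
Concretely, the stopgap looks something like this (the label selector, Postgres pod, and credentials are placeholders for my local chart):

    # Scale everything down (placeholder label selector).
    kubectl scale deployment -l app.kubernetes.io/name=temporal --replicas=0

    # Clear the stale membership rows (Postgres assumed; adjust for your persistence store).
    kubectl exec -it temporal-postgres-0 -- \
      psql -U temporal -d temporal -c 'DELETE FROM cluster_membership;'

    # Bring the services back; they re-register themselves on startup.
    kubectl scale deployment -l app.kubernetes.io/name=temporal --replicas=1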

Gibstick commented Feb 7, 2023

I built the server image from https://github.com/temporalio/docker-builds and tested it. The easiest way I have of reproducing the issue is running kubectl delete pod -l... with a label that selects all Temporal pods. Since they are in a deployment, they come back pretty much right away.

With the fix from @yux0, I haven't been able to reproduce the issue. In the 20 times I tried killing the pods and checking the cluster, it came back up successfully every time (verified with tctl wf list). Thanks!
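
The check loop was essentially the following (the label selector and namespace are placeholders for my setup, and the sleep is just long enough for the pods to come back):

    for i in $(seq 1 20); do
      # Delete every Temporal pod; the deployments recreate them almost immediately.
      kubectl delete pod -l app.kubernetes.io/name=temporal
      # Give the cluster time to re-form, then confirm the frontend answers queries.
      sleep 120
      tctl --namespace default wf list || { echo "run $i failed"; break; }
    done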

yiminc closed this as completed Feb 16, 2023