
context deadline exceeded in worker log when starting a workflow after upgrading to 1.19.0 #3826

Closed
muralisrini opened this issue Jan 22, 2023 · 14 comments

muralisrini commented Jan 22, 2023

Expected Behavior

Successful workflow creation. We did not see this problem in 1.18.x.

Note that there was no data migration involved (this was a new deployment, including the database).

Actual Behavior

Errors like this appear in the worker log:

2023-01-19T13:46:39.099356181Z {"level":"error","ts":"2023-01-19T13:46:39.099Z","msg":"error starting temporal-sys-history-scanner-workflow workflow","service":"worker","error":"context deadline exceeded","logging-call-at":"scanner.go:232","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/home/builder/temporal/common/log/zap_logger.go:144\ngo.temporal.io/server/service/worker/scanner.(*Scanner).startWorkflow\n\t/home/builder/temporal/service/worker/scanner/scanner.go:232\ngo.temporal.io/server/service/worker/scanner.(*Scanner).startWorkflowWithRetry.func1\n\t/home/builder/temporal/service/worker/scanner/scanner.go:209\ngo.temporal.io/server/common/backoff.ThrottleRetry.func1\n\t/home/builder/temporal/common/backoff/retry.go:170\ngo.temporal.io/server/common/backoff.ThrottleRetryContext\n\t/home/builder/temporal/common/backoff/retry.go:194\ngo.temporal.io/server/common/backoff.ThrottleRetry\n\t/home/builder/temporal/common/backoff/retry.go:171\ngo.temporal.io/server/service/worker/scanner.(*Scanner).startWorkflowWithRetry\n\t/home/builder/temporal/service/worker/scanner/scanner.go:208"}

Steps to Reproduce the Problem

It does not happen every time, but when it does, it seems to persist.

We do not have a simple way to recreate this, but we are hoping it might be a known issue, given that we saw it immediately after going to 1.19.0. We can upgrade to 1.19.1 if this is a known issue fixed in that version.

Attaching Temporal logs:

kubectl-logs-temporal-frontend-c84645c8-wj8fr-temporal-frontend-current.txt
kubectl-logs-temporal-web-557c4795f9-46n76-temporal-web-current.txt
kubectl-logs-temporal-worker-77456bfc96-979hl-temporal-worker-current.txt

Specifications

  • Version: 1.19.0
  • Platform: Ubuntu
yiminc (Member) commented Jan 27, 2023

Is this a new namespace just created?
"wf-namespace":"manetu","error":"Namespace 4ad12a22-a459-4715-836b-b35aacb55d2e is not found."

Did the problem persist after a few minutes?

muralisrini (Author) commented Jan 27, 2023

Is this a new namespace just created? "wf-namespace":"manetu","error":"Namespace 4ad12a22-a459-4715-836b-b35aacb55d2e is not found."

Yes, this is a new namespace just created.

Did the problem persist after a few minutes?

It did.

dnr (Member) commented Jan 27, 2023

It does succeed eventually (after ~19 minutes) but that does seem weird.

It looks like something odd is going on with history and ringpop (service discovery): note that the worker locates frontend, matching, and itself right at startup, but doesn't locate history for another 12 seconds. (The worker doesn't contact history directly to start a workflow, but it suggests that the frontend might have the same problem.) Then at 2023-01-19T13:46:27.123Z it seems to lose history completely, finds it again 25 seconds later, and at the next retry the scanner workflow is created successfully.

Looking at the frontend log itself, the frontend either sees no history pods or has trouble reaching the one it can see, until it eventually succeeds.

So I'd suspect something about the connectivity of the history pod.
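
To rule that in or out, a rough connectivity check along these lines might help. This is only a sketch: the history pod name is a placeholder, the frontend pod name is taken from the attached log file names, 7233 is the gRPC port your configs currently use, and it assumes nc is available in the image.

    # Find the history pod's IP (pod name is a placeholder; check `kubectl get pods -o wide`).
    HISTORY_IP=$(kubectl get pod temporal-history-xxxxx -o jsonpath='{.status.podIP}')

    # Probe history's gRPC port from inside the frontend pod; -w 3 avoids hanging forever.
    kubectl exec -it temporal-frontend-c84645c8-wj8fr -- nc -vz -w 3 "$HISTORY_IP" 7233

    # A refused connection or a timeout here points at pod networking rather than Temporal itself.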

Gibstick commented:

We're seeing this issue too, with the same characteristics:

  • suspicious Current reachable members log messages that seem to indicate services appearing and disappearing for no reason
  • it doesn't always happen, but when it does, the temporal cluster just stays in a broken state for a long time (not sure if it ever recovers)
  • we did not see the issue in 1.18.x
  • after downgrading to 1.18.5, we aren't able to reproduce the issue

Some differences, which don't seem to be substantial:

  • it happens in 1.19.1
  • we're seeing it after a migration, so it's not limited to just fresh deploys

yux0 (Contributor) commented Jan 31, 2023

Can you provide the logs from the history role? Looking at the frontend and worker logs, the worker failed to start because of an error in the frontend, and the frontend errors are related to the history role.

Gibstick commented:

Here are all the logs for completeness. Going by the timestamps in the logs, I started everything at pretty much the same time, at 19:55, and then waited a few minutes for the Current reachable members messages to show up in the history component.

temporal-frontend-logs.txt
temporal-history-logs.txt
temporal-matching-logs.txt
temporal-worker-logs.txt

dnr (Member) commented Jan 31, 2023

I don't have any specific theories yet, just listing a few things I'm noticing in those logs:

  • The worker started about 4m after the others. Not sure if that was intentional.
  • The frontend says unable to bootstrap ringpop after 30s, and then succeeds 8s later. But not reliably: services seem to blink in and out of view after that.
  • All services are using port 7233 for their main grpc port. That's unusual but I don't immediately see why it would cause problems since of course the IPs are different. You could try giving them different ports like the standard configs.
  • They are using distinct membership ports.
  • There are some extra entries in the cluster membership table (e.g. 172.17.0.16:6935,172.17.0.23:6933,172.17.0.20:6939). I'm assuming those are old? I.e. they were running within 20s of starting this set of services but are no longer running? That shouldn't be a problem as long as those are unreachable. If they are still running, then you're pointing two temporal clusters at the same database, which won't work. Assuming they're not still running, can you check if trying to connect to those ports fails immediately or hangs for a long time? I would hope that ringpop can still bootstrap even if those connections just hang, as long as it can make others, but it might be something to look into.
  • Can you see if you can reproduce this if you wait at least 20s between all previous services being stopped and starting any new ones?

Gibstick commented Feb 1, 2023

worker started about 4m after the others

My bad, that's just the worker crash looping because the rest of the cluster is broken. But every run looks about the same.

All services are using port 7233

That was a bit of laziness from when I created this helm chart for deploying temporal on k8s. I've tried it again with unique ports but the issue still manifests in the same way.

extra entries in the cluster membership table

These are old, but this is a case I run into often because I'm running this locally and I frequently bring down all of my pods and bring them up quickly again for local development. However, those stale ones are unreachable with a connection refused error (I just poked those host:ports with netcat from a different pod).
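
In case it's useful, the check was roughly this (the IPs/ports are the stale example entries quoted above; the nc flags assume a BusyBox or GNU netcat):

    # Run from any pod on the cluster network; each stale entry fails fast with "connection refused".
    for endpoint in 172.17.0.16:6935 172.17.0.23:6933 172.17.0.20:6939; do
      host=${endpoint%:*}
      port=${endpoint#*:}
      nc -vz -w 3 "$host" "$port" || echo "$endpoint unreachable"
    done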

wait 20s

WORKAROUND: I can't reproduce this if I wait 20s. And it seems like in the rare cases where I don't wait 20s before restarting but things still work, all of the pods came back with the same IPs as before. So it seems like something changed about service discovery/ringpop that broke my use case.

Is there anything I can tweak to get the old behaviour back? Or is this something that can easily be tweaked from your end in the service discovery code? As far as I can tell, that cluster_membership table is still being written to in the same way, so it seems like something changed about how it's being used.
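
For reference, this is how I eyeball what service discovery has written (a sketch assuming a Postgres-backed persistence store; the pod, database, and user names are placeholders for my local setup):

    # Dump the membership rows Temporal's services register themselves into.
    kubectl exec -it temporal-postgres-0 -- \
      psql -U temporal -d temporal -c 'SELECT * FROM cluster_membership;'

The stale rows for old pod IPs do show up there, which is what makes me wonder whether the way they are read changed in 1.19.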

roded commented Feb 6, 2023

I think that we're experiencing the same issue after upgrading to 1.19.1:
https://temporalio.slack.com/archives/CTRCR8RBP/p1675433773709559

dnr (Member) commented Feb 6, 2023

I think that we're experiencing the same issue after upgrading to 1.19.1: temporalio.slack.com/archives/CTRCR8RBP/p1675433773709559

Interesting. Can you reproduce the problem if you wait at least 20s between all services being stopped and starting them again?

roded commented Feb 6, 2023

Interesting. Can you reproduce the problem if you wait at least 20s between all services being stopped and starting them again?

Yeah, I did manage to reproduce the issue after killing the temporal process (I have just the one) and restarting it in the same container.

Edit: I did wait 20 seconds before restarting.

yux0 (Contributor) commented Feb 7, 2023

I tried to repro this issue by:

  1. Start four service roles: frontend, history, matching, worker.
  2. Kill all processes.
  3. Start only the frontend service role.

The issue I found is that the startup timeout is too short. I added a fix in #3911. Please let me know if it fixes your issue.
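
For anyone who wants to try the same repro, it was roughly this (a sketch using the temporal-server binary; the exact flag spelling may vary between versions, so treat it as an approximation):

    # 1. Start all four roles (each in its own shell / background process).
    ./temporal-server start --service=frontend &
    ./temporal-server start --service=history &
    ./temporal-server start --service=matching &
    ./temporal-server start --service=worker &

    # 2. Kill all of the processes.
    pkill temporal-server

    # 3. Restart only the frontend and watch it try to bootstrap ringpop against
    #    the now-stale cluster_membership entries.
    ./temporal-server start --service=frontend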

yux0 self-assigned this Feb 7, 2023
Gibstick commented Feb 7, 2023

Some more info that was requested on Slack:

Just want to confirm: after you cleaned the cluster_membership table, the issue could not be reproduced? Was the issue reproduced on an upgrade path (1.18 -> 1.19), or is it reproducible on 1.19 with a brand-new cluster?

The first time I saw this issue was from upgrading. But I've since gone back and reproduced it with a brand-new cluster in 1.19.1.

My earlier messages about cleaning the cluster_membership table are just me trying to work around the issue. Basically, if I restart my services enough and hit this issue where they are all stuck, I can bring them all down, clear cluster_membership, and bring them up again; the immediate problem is then solved and my cluster comes back healthy. But it's not a permanent fix, because the issue comes back if I restart quickly again a few times.
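
Concretely, the stopgap looks something like this (the label selector, Postgres pod, and credentials are placeholders for my local chart):

    # Scale everything down (placeholder label selector).
    kubectl scale deployment -l app.kubernetes.io/name=temporal --replicas=0

    # Clear the stale membership rows (Postgres assumed; adjust for your persistence store).
    kubectl exec -it temporal-postgres-0 -- \
      psql -U temporal -d temporal -c 'DELETE FROM cluster_membership;'

    # Bring the services back; they re-register themselves on startup.
    kubectl scale deployment -l app.kubernetes.io/name=temporal --replicas=1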

Gibstick commented Feb 7, 2023

I built the server image from https://github.com/temporalio/docker-builds and tested it. The easiest way I have of reproducing the issue is running kubectl delete pod -l... with a label that selects all Temporal pods. Since they are in a deployment, they come back pretty much right away.

With the fix from @yux0, I haven't been able to reproduce the issue. In the 20 times I tried killing the pods and checking the cluster, it came back up successfully every time (verified with tctl wf list). Thanks!
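
The check loop was essentially the following (the label selector and namespace are placeholders for my setup, and the sleep is just long enough for the pods to come back):

    for i in $(seq 1 20); do
      # Delete every Temporal pod; the deployments recreate them almost immediately.
      kubectl delete pod -l app.kubernetes.io/name=temporal
      # Give the cluster time to re-form, then confirm the frontend answers queries.
      sleep 120
      tctl --namespace default wf list || { echo "run $i failed"; break; }
    done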

yiminc closed this as completed Feb 16, 2023