stability: Goroutine leak when communication fails #8975

Closed
bdarnell opened this issue Aug 31, 2016 · 9 comments
Labels
S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting

Comments

@bdarnell
Contributor

On beta, running 2a2fdd9, node 6 leaked goroutines over a period of several hours until it ran out of memory and died:

[Screenshot, 2016-08-31 18:49:14]

It's not clear what's going on, because memory and goroutine profiles are not kept for very long after they are collected. One thing worth mentioning is that node 6 was the one implicated in #8939, so it was having trouble talking to certain other nodes. My best guess is that some key ranges (maybe the first range) got rebalanced onto a set of nodes that node 6 could not talk to. This is probably part of our general need for better backpressure, instead of allowing goroutines to pile up indefinitely.
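As an illustration of the backpressure idea mentioned above, here is a minimal sketch of a concurrency limiter that rejects work when it is full instead of letting goroutines accumulate. The `Limiter` type and its `TryRun` method are hypothetical names made up for this example; they are not CockroachDB APIs, and a real fix could look quite different.

```go
package main

import "fmt"

// Limiter caps the number of tasks running at once.
// Hypothetical type for illustration; not a CockroachDB API.
type Limiter struct {
	slots chan struct{} // buffered; one token per running task
}

// NewLimiter returns a Limiter that admits at most max concurrent tasks.
func NewLimiter(max int) *Limiter {
	return &Limiter{slots: make(chan struct{}, max)}
}

// TryRun runs fn synchronously if a slot is free and reports whether it ran.
// When the limiter is full it returns false immediately, so the caller can
// shed load or surface an error instead of queueing another goroutine.
func (l *Limiter) TryRun(fn func()) bool {
	select {
	case l.slots <- struct{}{}:
		defer func() { <-l.slots }()
		fn()
		return true
	default:
		return false
	}
}

func main() {
	lim := NewLimiter(1)
	started := make(chan struct{})
	release := make(chan struct{})
	finished := make(chan bool, 1)

	// Occupy the only slot with a task that blocks until released.
	go func() {
		finished <- lim.TryRun(func() {
			close(started)
			<-release
		})
	}()
	<-started

	// The limiter is full, so this call is rejected rather than piling up.
	fmt.Println("second task admitted:", lim.TryRun(func() {}))

	close(release)
	<-finished

	// With the slot free again, work is admitted.
	fmt.Println("third task admitted:", lim.TryRun(func() {}))
}
```

Rejecting outright is only one possible policy; blocking with a deadline is another. The point of the sketch is that the number of in-flight tasks has a hard upper bound.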

Next time this happens we should be sure to save the goroutine profile. We should probably also change the way that old profiles are cleaned up, so that we have more history available (e.g. instead of just keeping the last N, keep the most recent few and then one per minute for the last hour, one per hour for the last day, etc.).
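The tiered cleanup suggested above could be sketched roughly as follows. This is only an illustration: `Retain`, `bucketFor`, and the use of Unix-second timestamps are assumptions made for this example, not the existing profile cleanup code, and the thresholds simply mirror the "last hour" and "last day" wording.

```go
package main

import (
	"fmt"
	"sort"
)

// bucket identifies a time window (width in seconds) and which window it is.
type bucket struct {
	width int64
	index int64
}

// bucketFor maps a profile timestamp to its retention bucket:
// one-minute buckets for the last hour, one-hour buckets for the last day.
// It returns false for profiles older than a day.
func bucketFor(now, ts int64) (bucket, bool) {
	switch age := now - ts; {
	case age < 3600:
		return bucket{width: 60, index: ts / 60}, true
	case age < 86400:
		return bucket{width: 3600, index: ts / 3600}, true
	default:
		return bucket{}, false
	}
}

// Retain returns the profile timestamps (Unix seconds, hypothetical format)
// to keep, newest first: the keepRecent newest profiles unconditionally,
// plus the newest profile in each bucket not already covered.
func Retain(now int64, stamps []int64, keepRecent int) []int64 {
	sorted := append([]int64(nil), stamps...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] > sorted[j] })

	seen := make(map[bucket]bool)
	var kept []int64
	for i, ts := range sorted {
		b, ok := bucketFor(now, ts)
		if i < keepRecent || (ok && !seen[b]) {
			kept = append(kept, ts)
		}
		if ok {
			seen[b] = true
		}
	}
	return kept
}

func main() {
	now := int64(100000)
	stamps := []int64{10000, 90000, 90100, 99900, 99970, 99980, 99990}
	// Keeps the two newest, one more per minute bucket, one per hour bucket.
	fmt.Println(Retain(now, stamps, 2))
}
```

Everything not returned by `Retain` would be a candidate for deletion, so the amount of history grows with time span rather than with profile count.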

bdarnell added the S-1-stability label on Aug 31, 2016
@bdarnell
Contributor Author

Well that didn't take long. Another node started leaking goroutines:

goroutine profile: total 1788
569 @ 0x62ef7a 0x63e679 0x63d46c 0xe65867 0xe60cb6 0xe612ef 0xe737ae 0xe62ad7 0xe61826 0xe6e0d6 0xaa65b0 0xaac8b8 0xaa5c68 0xaa5f15 0x86ecfe 0x86de0e 0x8cce53 0x8de292 0x85847e 0x714980 0x841d45 0x75333d 0xb74d00 0xb769b0 0xb7b6eb 0x660251
#   0xe65866    github.com/cockroachdb/cockroach/kv.(*DistSender).sendToReplicas+0xa76                          /go/src/github.com/cockroachdb/cockroach/kv/dist_sender.go:1001
#   0xe60cb5    github.com/cockroachdb/cockroach/kv.(*DistSender).sendRPC+0x1c5                             /go/src/github.com/cockroachdb/cockroach/kv/dist_sender.go:364
#   0xe612ee    github.com/cockroachdb/cockroach/kv.(*DistSender).sendSingleRange+0x19e                         /go/src/github.com/cockroachdb/cockroach/kv/dist_sender.go:480
#   0xe737ad    github.com/cockroachdb/cockroach/kv.(*DistSender).sendChunk.func2+0x18d                         /go/src/github.com/cockroachdb/cockroach/kv/dist_sender.go:746
#   0xe62ad6    github.com/cockroachdb/cockroach/kv.(*DistSender).sendChunk+0x436                           /go/src/github.com/cockroachdb/cockroach/kv/dist_sender.go:747
#   0xe61825    github.com/cockroachdb/cockroach/kv.(*DistSender).Send+0x1d5                                /go/src/github.com/cockroachdb/cockroach/kv/dist_sender.go:591
#   0xe6e0d5    github.com/cockroachdb/cockroach/kv.(*TxnCoordSender).Send+0x545                            /go/src/github.com/cockroachdb/cockroach/kv/txn_coord_sender.go:374
#   0xaa65af    github.com/cockroachdb/cockroach/internal/client.(*DB).send+0x21f                           /go/src/github.com/cockroachdb/cockroach/internal/client/db.go:490
#   0xaac8b7    github.com/cockroachdb/cockroach/internal/client.(*DB).(github.com/cockroachdb/cockroach/internal/client.send)-fm+0x57  /go/src/github.com/cockroachdb/cockroach/internal/client/db.go:435
#   0xaa5c67    github.com/cockroachdb/cockroach/internal/client.sendAndFill+0x167                          /go/src/github.com/cockroachdb/cockroach/internal/client/db.go:417
#   0xaa5f14    github.com/cockroachdb/cockroach/internal/client.(*DB).Run+0x74                             /go/src/github.com/cockroachdb/cockroach/internal/client/db.go:435
#   0x86ecfd    github.com/cockroachdb/cockroach/storage.(*intentResolver).maybePushTransactions+0xa9d                  /go/src/github.com/cockroachdb/cockroach/storage/intent_resolver.go:230
#   0x86de0d    github.com/cockroachdb/cockroach/storage.(*intentResolver).processWriteIntentError+0x18d                /go/src/github.com/cockroachdb/cockroach/storage/intent_resolver.go:89
#   0x8cce52    github.com/cockroachdb/cockroach/storage.(*Store).Send+0xca2                                /go/src/github.com/cockroachdb/cockroach/storage/store.go:2073
#   0x8de291    github.com/cockroachdb/cockroach/storage.(*Stores).Send+0x201                               /go/src/github.com/cockroachdb/cockroach/storage/stores.go:182
#   0x85847d    github.com/cockroachdb/cockroach/server.(*Node).Batch.func3+0x2ed                           /go/src/github.com/cockroachdb/cockroach/server/node.go:818
#   0x71497f    github.com/cockroachdb/cockroach/util/stop.(*Stopper).RunTask+0xff                          /go/src/github.com/cockroachdb/cockroach/util/stop/stopper.go:215
#   0x841d44    github.com/cockroachdb/cockroach/server.(*Node).Batch+0x274                             /go/src/github.com/cockroachdb/cockroach/server/node.go:830
#   0x75333c    github.com/cockroachdb/cockroach/roachpb._Internal_Batch_Handler+0x27c                          /go/src/github.com/cockroachdb/cockroach/roachpb/api.pb.go:1499
#   0xb74cff    google.golang.org/grpc.(*Server).processUnaryRPC+0xc4f                                  /go/src/google.golang.org/grpc/server.go:608
#   0xb769af    google.golang.org/grpc.(*Server).handleStream+0x6af                                 /go/src/google.golang.org/grpc/server.go:766
#   0xb7b6ea    google.golang.org/grpc.(*Server).serveStreams.func1.1+0xaa                              /go/src/google.golang.org/grpc/server.go:419

569 @ 0x62ef7a 0x63e679 0x63d46c 0xf83f49 0xb6950d 0xb6a451 0x752fc2 0xe7476d 0x660251
#   0xf83f48    google.golang.org/grpc/transport.(*Stream).Header+0x298             /go/src/google.golang.org/grpc/transport/transport.go:239
#   0xb6950c    google.golang.org/grpc.recvResponse+0xac                    /go/src/google.golang.org/grpc/call.go:62
#   0xb6a450    google.golang.org/grpc.Invoke+0x8a0                     /go/src/google.golang.org/grpc/call.go:202
#   0x752fc1    github.com/cockroachdb/cockroach/roachpb.(*internalClient).Batch+0xd1       /go/src/github.com/cockroachdb/cockroach/roachpb/api.pb.go:1476
#   0xe7476c    github.com/cockroachdb/cockroach/kv.(*grpcTransport).SendNext.func1+0xfc    /go/src/github.com/cockroachdb/cockroach/kv/transport.go:180

493 @ 0x62ef7a 0x63e679 0x63d46c 0xd253d5 0xe6290e 0xe61826 0xe6e0d6 0xaa65b0 0xaac8b8 0xaa5c68 0xaa5f15 0x86ecfe 0x86de0e 0x8cce53 0x8de292 0x85847e 0x714980 0x841d45 0x75333d 0xb74d00 0xb769b0 0xb7b6eb 0x660251
#   0xd253d4    github.com/cockroachdb/cockroach/util/retry.(*Retry).Next+0x1a4                             /go/src/github.com/cockroachdb/cockroach/util/retry/retry.go:128
#   0xe6290d    github.com/cockroachdb/cockroach/kv.(*DistSender).sendChunk+0x26d                           /go/src/github.com/cockroachdb/cockroach/kv/dist_sender.go:666
#   0xe61825    github.com/cockroachdb/cockroach/kv.(*DistSender).Send+0x1d5                                /go/src/github.com/cockroachdb/cockroach/kv/dist_sender.go:591
#   0xe6e0d5    github.com/cockroachdb/cockroach/kv.(*TxnCoordSender).Send+0x545                            /go/src/github.com/cockroachdb/cockroach/kv/txn_coord_sender.go:374
#   0xaa65af    github.com/cockroachdb/cockroach/internal/client.(*DB).send+0x21f                           /go/src/github.com/cockroachdb/cockroach/internal/client/db.go:490
#   0xaac8b7    github.com/cockroachdb/cockroach/internal/client.(*DB).(github.com/cockroachdb/cockroach/internal/client.send)-fm+0x57  /go/src/github.com/cockroachdb/cockroach/internal/client/db.go:435
#   0xaa5c67    github.com/cockroachdb/cockroach/internal/client.sendAndFill+0x167                          /go/src/github.com/cockroachdb/cockroach/internal/client/db.go:417
#   0xaa5f14    github.com/cockroachdb/cockroach/internal/client.(*DB).Run+0x74                             /go/src/github.com/cockroachdb/cockroach/internal/client/db.go:435
#   0x86ecfd    github.com/cockroachdb/cockroach/storage.(*intentResolver).maybePushTransactions+0xa9d                  /go/src/github.com/cockroachdb/cockroach/storage/intent_resolver.go:230
#   0x86de0d    github.com/cockroachdb/cockroach/storage.(*intentResolver).processWriteIntentError+0x18d                /go/src/github.com/cockroachdb/cockroach/storage/intent_resolver.go:89
#   0x8cce52    github.com/cockroachdb/cockroach/storage.(*Store).Send+0xca2                                /go/src/github.com/cockroachdb/cockroach/storage/store.go:2073
#   0x8de291    github.com/cockroachdb/cockroach/storage.(*Stores).Send+0x201                               /go/src/github.com/cockroachdb/cockroach/storage/stores.go:182
#   0x85847d    github.com/cockroachdb/cockroach/server.(*Node).Batch.func3+0x2ed                           /go/src/github.com/cockroachdb/cockroach/server/node.go:818
#   0x71497f    github.com/cockroachdb/cockroach/util/stop.(*Stopper).RunTask+0xff                          /go/src/github.com/cockroachdb/cockroach/util/stop/stopper.go:215
#   0x841d44    github.com/cockroachdb/cockroach/server.(*Node).Batch+0x274                             /go/src/github.com/cockroachdb/cockroach/server/node.go:830
#   0x75333c    github.com/cockroachdb/cockroach/roachpb._Internal_Batch_Handler+0x27c                          /go/src/github.com/cockroachdb/cockroach/roachpb/api.pb.go:1499
#   0xb74cff    google.golang.org/grpc.(*Server).processUnaryRPC+0xc4f                                  /go/src/google.golang.org/grpc/server.go:608
#   0xb769af    google.golang.org/grpc.(*Server).handleStream+0x6af                                 /go/src/google.golang.org/grpc/server.go:766
#   0xb7b6ea    google.golang.org/grpc.(*Server).serveStreams.func1.1+0xaa                              /go/src/google.golang.org/grpc/server.go:419

Looks like #8692; it's all synchronous intent resolution.

@tbg
Member

tbg commented Aug 31, 2016

How are these leaked though? DistSender is stuck on Retry.Next, which means it's not getting those requests where they ought to go. Looks like the intent resolution is exacerbating another condition.

tbg added a commit to tbg/cockroach that referenced this issue Aug 31, 2016
See cockroachdb#8975.

Open for better suggestions here, but this should give us some idea without
being overly burdensome.
tbg added a commit to tbg/cockroach that referenced this issue Aug 31, 2016
See cockroachdb#8975. Print either when there are many retries for the current chunk,
or a high sequence number (which could be caused by retries higher up the
stack).
@bdarnell
Contributor Author

bdarnell commented Sep 1, 2016

> How are these leaked though?

My guess is that certain pairs of nodes were having trouble talking to each other thanks to the gRPC bug, and the node ended up in a state where it couldn't talk to any member of some important range. But I don't have any evidence to back that up.

tbg added a commit to tbg/cockroach that referenced this issue Sep 1, 2016
See cockroachdb#8975. Print either when there are many retries for the current chunk,
or a high sequence number (which could be caused by retries higher up the
stack).
@tbg
Member

tbg commented Sep 1, 2016

Ok. We have logging for that infinite DistSender loop now, so there's nothing actionable here at this point except seeing whether it happens again and then looking at the logs.

@tbg
Member

tbg commented Sep 1, 2016

For the record, here's one more from rho:

[Screenshot, 2016-09-01 10:44:43 AM]

@tbg
Member

tbg commented Sep 1, 2016

This was on 1a34ca3, which seems to have had the gRPC fix already. Perhaps it's some other form of range unavailability; I'll keep looking at rho to see what state it's in.

@tbg
Member

tbg commented Sep 1, 2016

Here's one more from just now on beta. It does seem that the goroutine explosion is the primary reason for restarts, at least right now.

cc #9034

[Screenshot, 2016-09-01 12:38:27 PM]

@tbg
Member

tbg commented Sep 1, 2016

@mberhault pointed out that these spikes are always post-restart, so they likely occur on freshly started nodes during initialization. Still worth investigating, but likely not the cause of the crash in the picture above.

@petermattis
Collaborator

We fixed several bugs with cancellation propagation since this issue was filed. Closing for now. We should open a new bug if the problem recurs.
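For readers unfamiliar with why cancellation propagation matters here, the following is a minimal sketch of a retry loop that gives up once its caller is cancelled, rather than retrying forever as the goroutines parked in `Retry.Next` appear to do in the profile above. The `Retry` function below is a hypothetical helper written for this example; it is not the `util/retry` package and does not reflect the actual fixes.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Retry calls attempt until it succeeds or done is closed, sleeping with
// exponential backoff (capped at max) between attempts. It reports whether
// attempt succeeded. Hypothetical helper for illustration only.
func Retry(done <-chan struct{}, initial, max time.Duration, attempt func() bool) bool {
	backoff := initial
	for {
		// Stop before starting new work if the caller has gone away.
		select {
		case <-done:
			return false
		default:
		}

		if attempt() {
			return true
		}

		// Wait for the backoff to elapse, but wake up early on cancellation
		// so the goroutine exits instead of lingering.
		timer := time.NewTimer(backoff)
		select {
		case <-done:
			timer.Stop()
			return false
		case <-timer.C:
		}

		if backoff *= 2; backoff > max {
			backoff = max
		}
	}
}

func main() {
	// The context stands in for an incoming RPC whose client gives up.
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()

	// An attempt that never succeeds, like a range that cannot be reached.
	ok := Retry(ctx.Done(), 5*time.Millisecond, 20*time.Millisecond, func() bool {
		return false
	})
	fmt.Println("succeeded:", ok)
}
```

Passing `ctx.Done()` from the incoming request down to every retry loop is what lets a stuck request unwind once its originator times out or disconnects.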

3 participants