stability: Goroutine leak when communication fails #8975
Comments
Well that didn't take long. Another node started leaking goroutines:
Looks like #8692; it's all synchronous intent resolution.
How are these leaked though?
See cockroachdb#8975. Open for better suggestions here, but this should give us some idea without being overly burdensome.
See cockroachdb#8975. Print either when there are many retries for the current chunk, or a high sequence number (which could be caused by retries higher up the stack).
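For illustration only, here is a minimal Go sketch of that kind of threshold-gated logging; the constants, function name, and wiring are hypothetical, not the actual change referenced above:

```go
package main

import "log"

// Hypothetical thresholds for illustration; the real change may use
// different values and plumbing.
const (
	retryWarnThreshold    = 10  // many retries for the current chunk
	sequenceWarnThreshold = 100 // high sequence number, i.e. retries higher up the stack
)

// maybeLogRetries stays quiet in the common case and only logs when retry
// activity looks abnormal.
func maybeLogRetries(chunkRetries, sequence int) {
	if chunkRetries > retryWarnThreshold || sequence > sequenceWarnThreshold {
		log.Printf("suspicious retry activity: %d retries for current chunk, sequence %d",
			chunkRetries, sequence)
	}
}

func main() {
	maybeLogRetries(2, 5)   // normal: no output
	maybeLogRetries(50, 5)  // many retries: logged
	maybeLogRetries(2, 500) // high sequence number: logged
}
```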
My guess is that certain pairs of nodes were having trouble talking to each other thanks to the GRPC bug, and the node ended up in a state where it couldn't talk to any member of some important range. But I don't have any evidence to back that up.
Ok. We have logging for that infinite DistSender loop now, so there's nothing actionable here at this point except seeing whether it happens again and then looking at the logs.
This was on 1a34ca3. Seems like that had the grpc fix already. Perhaps some other form of range unavailability, I'll keep looking at rho to see what state it's in.
Here's one more from just now. cc #9034
@mberhault pointed out that these spikes are always post-restart, so they likely occur on freshly started nodes during initialization. Still worth investigating, but likely not the cause of the crash in the picture above.
We fixed several bugs with cancellation propagation since this issue was filed. Closing for now. We should open a new bug if the problem recurs.
On beta, running 2a2fdd9, node 6 leaked goroutines over a period of several hours until it ran out of memory and died:
It's not clear what's going on (since memory and goroutine profiles are not saved for very long after they are collected). One thing worth mentioning is that node 6 was the one implicated in #8939, so it was having trouble talking to certain other nodes. My best guess is that some key ranges (maybe the first range) got rebalanced onto a set of nodes that node 6 could not talk to. This is probably part of our general need to have better backpressure instead of allowing goroutines to pile up indefinitely.
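As a rough illustration of that backpressure idea (a sketch under assumed semantics, not CockroachDB's actual mechanism), a bounded semaphore can make callers wait or fail fast instead of letting goroutines pile up without limit:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// boundedRunner caps the number of concurrent tasks. When the limit is
// reached, Run blocks until a slot frees up or the context is canceled,
// instead of spawning an unbounded number of goroutines.
type boundedRunner struct {
	sem chan struct{}
}

func newBoundedRunner(limit int) *boundedRunner {
	return &boundedRunner{sem: make(chan struct{}, limit)}
}

func (r *boundedRunner) Run(ctx context.Context, task func()) error {
	select {
	case r.sem <- struct{}{}: // acquire a slot
	case <-ctx.Done():
		return errors.New("backpressure: no slot available before context canceled")
	}
	go func() {
		defer func() { <-r.sem }() // release the slot when the task finishes
		task()
	}()
	return nil
}

func main() {
	r := newBoundedRunner(4)
	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()
	for i := 0; i < 10; i++ {
		if err := r.Run(ctx, func() { time.Sleep(time.Second) }); err != nil {
			fmt.Println("rejected:", err) // slots exhausted: caller sees the pressure
		}
	}
}
```

In a real system the limit would be tied to memory or per-node quotas, and rejected work would surface as an error to the caller rather than silently queueing.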
Next time this happens we should be sure to save the goroutine profile. We should probably also change the way old profiles are cleaned up so that we have more history available (e.g. instead of just keeping the last N, keep the most recent few, then one per minute for the last hour, one per hour for the last day, and so on); a rough sketch of that thinning scheme is below.
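A sketch of the tiered cleanup idea, assuming hypothetical names and tier boundaries: keep everything very recent, roughly one per minute for the last hour, one per hour for the last day, and drop the rest.

```go
package main

import (
	"fmt"
	"time"
)

type profile struct {
	takenAt time.Time
	path    string
}

// thin keeps all profiles from the last few minutes, roughly one per minute
// for the last hour, one per hour for the last day, and drops anything older.
// profiles must be sorted newest first. The tier boundaries are illustrative.
func thin(profiles []profile, now time.Time) []profile {
	var kept []profile
	var lastKept time.Time
	for _, p := range profiles {
		age := now.Sub(p.takenAt)
		var minGap time.Duration
		switch {
		case age < 10*time.Minute:
			minGap = 0 // keep everything recent
		case age < time.Hour:
			minGap = time.Minute
		case age < 24*time.Hour:
			minGap = time.Hour
		default:
			continue // older than a day: drop
		}
		if lastKept.IsZero() || lastKept.Sub(p.takenAt) >= minGap {
			kept = append(kept, p)
			lastKept = p.takenAt
		}
	}
	return kept
}

func main() {
	now := time.Now()
	var ps []profile
	for i := 0; i < 300; i++ { // one profile every 5 minutes, going back ~25h
		ps = append(ps, profile{takenAt: now.Add(-time.Duration(i*5) * time.Minute)})
	}
	fmt.Println("kept:", len(thin(ps, now)), "of", len(ps))
}
```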