
stability: delta experiencing regular hiccups #12713

Closed
mberhault opened this issue Jan 5, 2017 · 44 comments

Comments

@mberhault
Contributor

About 12 hours after upgrading delta to the beta candidate (sha e1ef4ec), and after running happily for that period, delta started seeing regular dips in qps every ~2 minutes:

[graph: delta_sql]

This matched spikes in elections:
[graph: delta_raft]

The TimeoutNow bumps came mainly from a single node: 104.196.163.204:
[graph: delta_timeoutnow]

There is nothing obviously wrong with the machine itself.
Looking through its logs, we can see periods of calm, followed by some slow mutex warnings on handleRaftReady (~500ms), then a bit more calm, and finally a flood of slow mutex warnings (many per second, each between ~500ms and 2s).

Sample log:
cockroach.stderr.txt

@mberhault
Contributor Author

The cluster is still up; I haven't tried taking down nodes. I did, however, stop block_writer against that single node, and then against all of them. The 2-minute spikes were attenuated, but resumed as soon as block_writer was started again.

@bdarnell, I'm fresh out of ideas about what to look for. I tried turning up verbosity (/debug/vmodule/store=3,replica=3,raft=3), but no trigger event was visible, just the expected election traffic.

@petermattis
Collaborator

I just poked around on the problematic node as well and didn't see anything that jumped out at me.

@bdarnell
Contributor

bdarnell commented Jan 7, 2017

Hmm, the 2-minute period doesn't ring a bell (it's used for gossip TTLs, but those are supposed to be refreshed every 1 minute without expiring).

raftpb.MsgTimeoutNow is part of the leadership transfer: the current leader sends a MsgTimeoutNow to the node that should become the new leader, and then that node sends MsgVote to the other nodes. Here, it looks like the same node is sending MsgTimeoutNow and MsgVote (looks simultaneous in the graph, but that just means it's within the same metrics collection window), suggesting that the attempted leadership transfer failed and a second election was required.
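
For illustration only, here's a minimal sketch of the kind of per-message-type tally the graphs above are based on, using the raftpb message types involved in a transfer (the vendored import path is an assumption from that era, not the actual metrics code):

	package main

	import (
		"fmt"

		"github.com/coreos/etcd/raft/raftpb" // vendored path at the time; an assumption
	)

	// classify tallies the message types involved in a leadership transfer:
	// the old leader sends MsgTimeoutNow to the transfer target, the target
	// campaigns and sends MsgVote to its peers, and peers answer with
	// MsgVoteResp.
	func classify(msgs []raftpb.Message) map[raftpb.MessageType]int {
		counts := make(map[raftpb.MessageType]int)
		for _, m := range msgs {
			switch m.Type {
			case raftpb.MsgTimeoutNow, raftpb.MsgVote, raftpb.MsgVoteResp:
				counts[m.Type]++
			}
		}
		return counts
	}

	func main() {
		fmt.Println(classify(nil)) // prints an empty map; feed it captured messages
	}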

Right now it looks like the performance is still varying with a 2-minute period, but the TimeoutNow and Vote/VoteResp messages are no longer present, so it doesn't look like this was the root cause. I don't have any other suggestions right now except to figure out why the mutex is blocking so long. We'd expect to see raft elections after three seconds of blocking; blocking for 500ms-2s is obviously not good but shouldn't cause any specific problems that I'm aware of.

@bdarnell
Contributor

One of the nodes in delta is in a bad state: 104.196.175.87 only has leases for 548 ranges, while all the other members of the cluster have ~3.5k. Furthermore, it appears to be sending a constant stream of MsgVotes (and getting very few MsgVoteResps, presumably due to CheckQuorum). Ranges are stuck in StateCandidate and are therefore getting sent to the replicaGCQueue often (which examines them but never GCs them). At least some of the time the replicaGCQueue's consistent RangeLookup times out after a minute. This node is serving zero SQL qps.

These symptoms suggest problems with the raft transport, but I don't see any obvious problems there. And nothing on this misbehaving node matches up with the two-minute cycle seen on the relatively healthy nodes in the cluster.

@bdarnell
Contributor

I noticed this message while loading the /#/raft status page: transport: write tcp 10.142.0.42:52968->10.142.0.40:26257: write: connection refused (10.142.0.40 is the internal address of the node discussed in my previous message). Grepping the logs on that node, I see exactly the same message (including the same ephemeral port number) repeating over a period of days. The first occurrence is:

logs/cockroach.stderr.13:I170106 17:45:40.688074 121 vendor/google.golang.org/grpc/clientconn.go:798  grpc: addrConn.transportMonitor exits due to: connection error: desc = "transport: write tcp 10.142.0.42:52968->10.142.0.40:26257: write: connection refused"

That is after this issue was filed, so this cannot be the root cause, but perhaps there were similar problems with other node pairs. It appears that something is hanging on to this failed connection and trying to reuse it instead of creating a new connection.

@petermattis
Collaborator

@tamird Didn't one of the gRPC bumps change the handling of connections in some way?

@bdarnell
Contributor

A lot of stuff changed when we updated grpc on 12/15; I don't think there were any changes closer to the start of this issue.

Looking closer at the logs, we see a bunch of "getsockopt: connection refused" which are retried after 1 second, followed by one "write: connection refused" which is never retried.

I170106 17:45:34.212243 121 vendor/google.golang.org/grpc/clientconn.go:698  grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 10.142.0.40:26257: getsockopt: connection refused"; Reconnecting to {cockroach-delta-04:26257 <nil>}
I170106 17:45:35.298321 121 vendor/google.golang.org/grpc/clientconn.go:698  grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 10.142.0.40:26257: getsockopt: connection refused"; Reconnecting to {cockroach-delta-04:26257 <nil>}
I170106 17:45:36.390652 121 vendor/google.golang.org/grpc/clientconn.go:698  grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 10.142.0.40:26257: getsockopt: connection refused"; Reconnecting to {cockroach-delta-04:26257 <nil>}
I170106 17:45:37.442627 121 vendor/google.golang.org/grpc/clientconn.go:698  grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 10.142.0.40:26257: getsockopt: connection refused"; Reconnecting to {cockroach-delta-04:26257 <nil>}
I170106 17:45:38.563113 121 vendor/google.golang.org/grpc/clientconn.go:698  grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 10.142.0.40:26257: getsockopt: connection refused"; Reconnecting to {cockroach-delta-04:26257 <nil>}
I170106 17:45:39.633187 121 vendor/google.golang.org/grpc/clientconn.go:698  grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 10.142.0.40:26257: getsockopt: connection refused"; Reconnecting to {cockroach-delta-04:26257 <nil>}
I170106 17:45:40.688074 121 vendor/google.golang.org/grpc/clientconn.go:798  grpc: addrConn.transportMonitor exits due to: connection error: desc = "transport: write tcp 10.142.0.42:52968->10.142.0.40:26257: write: connection refused"

net.Dial is always supposed to check getsockopt after connecting. In order to see write: connection refused, the connection must have made it past the getsockopt check, but I don't see how yet.
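
For comparison, this is the behavior one would normally expect from the Go dialer (a standalone sketch, not CockroachDB code): dialing a port with no listener fails at Dial time with "connection refused", so seeing that string on a write implies the connection was first established and only later torn down.

	package main

	import (
		"fmt"
		"net"
		"time"
	)

	func main() {
		// Dialing a port with no listener normally fails right here, with the
		// "connection refused" surfaced by the connect/getsockopt step.
		conn, err := net.DialTimeout("tcp", "127.0.0.1:26257", time.Second)
		if err != nil {
			fmt.Println("dial error (expected when nothing is listening):", err)
			return
		}
		defer conn.Close()

		// To see an error on write instead, the connection must have been
		// established first and gone away afterwards; a refused *write* on an
		// established TCP connection is the surprising part of the log above.
		if _, err := conn.Write([]byte("ping")); err != nil {
			fmt.Println("write error:", err)
		}
	}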

@tamird
Contributor

tamird commented Jan 11, 2017 via email

@petermattis
Collaborator

Lightstep is showing slow queries every 2m on delta. For example:

   0s [client=[::1]:47646,user=root,n7] executing 1/1: INSERT INTO blocks(block_id, writer_id, block_num, raw_bytes) VALUES (5416627661108716854, '51606a2e-8fa6-4af0-85b7-b1359856c3ac', 109914, $1)
   0s [client=[::1]:47646,user=root,n7] txn: requesting 10240 bytes from the pool
   0s [client=[::1]:47646,user=root,n7] querying next range at /Table/51/1/5416627661108716854/"51606a2e-8fa6-4af0-85b7-b1359856c3ac"/109914/0
   0s [client=[::1]:47646,user=root,n7] sending batch 1 CPut, 1 BeginTxn, 1 EndTxn to range 78
 0.5s [client=[::1]:47646,user=root,n7] timeout, trying next peer
 0.5s [client=[::1]:47646,user=root,n7] application error: [NotLeaseHolderError] range 78: replica {7 14 6} not lease holder; current lease is repl={4 8 2} start=1484670803.911548726,65 epo=1 pro=1484675551.989515565,0
 0.5s [client=[::1]:47646,user=root,n7] error, trying next peer
 0.5s [client=[::1]:47646,user=root,n7] application error: [NotLeaseHolderError] range 78: replica {8 15 5} not lease holder; current lease is repl={4 8 2} start=1484670803.911548726,65 epo=1 pro=1484675551.989515565,0
2.85s [client=[::1]:47646,user=root,n7] coordinator stops
2.85s [client=[::1]:47646,user=root,n7] txn: releasing 10240 bytes to the pool

I've looked at a half-dozen such queries and they all have this same pattern.

@petermattis
Collaborator

petermattis commented Jan 17, 2017

The lightstep trace above shows that we tried to send the request to the leaseholder, timed out after 0.5s, sent to the other two replicas, and then waited another ~2.3s for the leaseholder to respond. Looking at a similar trace, during the period when the leaseholder is unresponsive I see a whole bunch of log messages like:

W170117 20:28:19.336461 189 storage/replica.go:568  [n4,s8,r686/1:/Table/51/1/32{02423…-20317…},@c4216cc300] raftMu: mutex held by github.com/cockroachdb/cockroach/pkg/storage.(*Store).processRaftRequest for 614.3871ms (>500ms):

Those log messages are actually sprinkled throughout the file. And fascinatingly, they come in 2m bursts:

cockroach@cockroach-delta-04:~$ grep 'mutex held by' logs/cockroach.stderr | awk '{print $2}' | awk -F: '{printf "%s:%s\n", $1, $2}' | uniq -c | tail
     45 20:24
     61 20:26
     72 20:28
     43 20:30
     94 20:32
     61 20:34
     65 20:36
     61 20:38
     47 20:40
    147 20:42

The "mutex held by" messages are not happening in lockstep on every node, and some nodes were seeing the messages an hour ago but not now.

@petermattis
Collaborator

Here's a related curiosity:

W170117 20:42:13.917769 16879706 storage/replica.go:580 [n4,s7,r22/6:/Table/51{-/1/17740…},@c420a01b00] replicaMu: mutex held by github.com/cockroachdb/cockroach/pkg/storage.(*Replica).propose for 628.550232ms (>500ms):

There is almost nothing in Replica.propose that could account for it taking so long, as it is all straight-shot code once Replica.mu is acquired.

@bdarnell
Contributor

Replica.propose acquires Replica.mu in two separate places. The second one includes quite a lot of stuff (defaultSubmitProposalLocked, withRaftGroupLocked, marshaling the request protobuf...). withRaftGroupLocked can include calls to read from rocksdb if the raft group is being created for the first time.

@petermattis
Collaborator

This is the second acquisition of Replica.mu that covers defaultSubmitProposalLocked. I'm pretty sure the raft group is not being created. Marshaling the request protobuf might take a while, but hundreds of milliseconds?
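
One way to answer that question empirically is a throwaway micro-benchmark along these lines (hypothetical; the marshaler interface assumes a gogo/protobuf-generated type with a Marshal method, and the names are illustrative):

	package storage_test

	import "testing"

	// marshaler matches the Marshal method that gogo/protobuf-generated request
	// types expose (an assumption; adjust for the actual generated code).
	type marshaler interface {
		Marshal() ([]byte, error)
	}

	// benchmarkMarshal times marshaling of a representative request so that a
	// result in the hundreds of milliseconds (or not) settles the question above.
	func benchmarkMarshal(b *testing.B, req marshaler) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			if _, err := req.Marshal(); err != nil {
				b.Fatal(err)
			}
		}
	}

Wired up as, say, BenchmarkMarshalRequest(b *testing.B) { benchmarkMarshal(b, someCapturedRequest) } with a request taken from the block_writer workload.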

@petermattis
Collaborator

Inspecting the logs this morning, here's another instance of Replica.mu being held too long:

W170118 15:06:29.873193 181 storage/replica.go:580 [n4,s8,r1844/3:/Table/51/1/705{11089…-56306…},@c42513a600] replicaMu: mutex held by github.com/cockroachdb/cockroach/pkg/storage.(*Replica).handleRaftReadyRaftMuLocked for 511.958713ms (>500ms):

This corresponds to the following block of code:

	r.mu.Lock()

	lastIndex := r.mu.lastIndex // used for append below
	raftLogSize := r.mu.raftLogSize
	leaderID := r.mu.leaderID
	err := r.withRaftGroupLocked(false, func(raftGroup *raft.RawNode) (bool, error) {
		if hasReady = raftGroup.HasReady(); hasReady {
			rd = raftGroup.Ready()
		}
		return hasReady /* unquiesceAndWakeLeader */, nil
	})
	r.mu.Unlock()

There is no rebalance activity anywhere near this time frame, so I'm pretty confident the Raft group wasn't being initialized.

@petermattis
Collaborator

And another one:

W170118 15:10:29.769793 183 storage/replica.go:580 [n4,s8,r580/4:/Table/51/1/295{26843…-71418…},@c42e727500] replicaMu: mutex held by github.com/cockroachdb/cockroach/pkg/storage.(*Replica).withRaftGroup for 501.087395ms (>500ms):

The stacktrace (not shown above) shows this corresponds to the call below:

	if err := r.withRaftGroup(func(raftGroup *raft.RawNode) (bool, error) {
		// We're processing a message from another replica which means that the
		// other replica is not quiesced, so we don't need to wake the leader.
		r.unquiesceLocked()
		return false, /* !unquiesceAndWakeLeader */
			raftGroup.Step(req.Message)
	}); err != nil {
		return roachpb.NewError(err)
	}

This is all extremely curious. Three different sections of straight-shot code which should execute in much less than a millisecond are taking >500ms. That implies that something is starving the goroutine (or the entire process) of CPU, but I haven't seen any other evidence of that yet.
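
One cheap thing to rule out is GC-induced stalls (just one possible starvation source, not a diagnosis). A sampler like the sketch below would surface any GC pause approaching the 500ms warning threshold; note it has to run inside the process being observed, so in practice GODEBUG=gctrace=1 on the cockroach process is the easier route.

	package main

	import (
		"fmt"
		"runtime"
		"time"
	)

	func main() {
		const threshold = 100 * time.Millisecond
		var ms runtime.MemStats
		for range time.Tick(2 * time.Second) {
			runtime.ReadMemStats(&ms)
			// PauseNs is a circular buffer; the most recent pause sits at
			// index (NumGC+255)%256.
			recent := time.Duration(ms.PauseNs[(ms.NumGC+255)%256])
			if recent > threshold {
				fmt.Printf("long GC pause: %s (total %s over %d GCs)\n",
					recent, time.Duration(ms.PauseTotalNs), ms.NumGC)
			}
		}
	}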

@spencerkimball
Member

spencerkimball commented Jan 18, 2017 via email

@bdarnell
Contributor

We use raft.RawNode, not raft.Node. There are no channels involved in RawNode.Propose, just appending to a slice (the only lock involved is Replica.raftMu, which we acquire before Replica.mu).

@spencerkimball
Member

Good point, so it's not the channels in raft.Node. But I'm still suspicious of the call to raft.Propose. Perhaps we should instrument the call to r.mu.internalRaftGroup.Propose and more specifically, the raft log append invocation.

@petermattis
Collaborator

raft.RawNode.Propose is only performing in-memory operations. No locks. No channels. Nada. The actual appending to the raft log happens on the next call to Replica.handleRaftReady....

@bdarnell
Contributor

Ah, right. RawNode.Propose doesn't write to disk (that's all done via Ready), but it does (sometimes) read (specifically, it calls FirstIndex and LastIndex, which are in memory, and Term, which is cached but on a cache miss will go to disk).

@petermattis
Collaborator

I recall instrumenting Term not too long ago to see how often we had to go to RocksDB. In my testing the answer was never. Might be worth verifying that is the case on delta.

@petermattis
Collaborator

Here's another one which doesn't involve Raft:

W170118 17:14:25.636106 143 storage/replica.go:607 [n4,s7,r991/2:/Table/51/1/3{361884…-405877…},@c420aaa300] replicaMu: mutex held by github.com/cockroachdb/cockroach/pkg/storage.(*Replica).handleRaftReadyRaftMuLocked for 500.485611ms (>500ms):

The stack shows it corresponds to this code:

	r.mu.Lock()
	r.store.raftEntryCache.addEntries(r.RangeID, rd.Entries)
	r.mu.lastIndex = lastIndex
	r.mu.raftLogSize = raftLogSize
	r.mu.leaderID = leaderID
	r.mu.Unlock()

@spencerkimball Care to speculate what could have taken 500ms in that block of code?

@petermattis
Collaborator

Tried another experiment and reverted delta back to beta-20161215, the last beta before this problem started occurring. The periodic blips still occur.

@spencerkimball
Member

Well, you're locking in the raft entry cache. Perhaps something is going on there. Why not make that a timed mutex?
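
A minimal sketch of what such a timed mutex could look like (illustrative names only, not the existing storage/syncutil code, which already does something similar for raftMu/replicaMu):

	package main

	import (
		"log"
		"sync"
		"time"
	)

	// TimedMutex logs a warning when a critical section exceeds a threshold,
	// mirroring the existing raftMu/replicaMu "mutex held by" warnings.
	type TimedMutex struct {
		mu        sync.Mutex
		lockedAt  time.Time
		threshold time.Duration
	}

	func NewTimedMutex(threshold time.Duration) *TimedMutex {
		return &TimedMutex{threshold: threshold}
	}

	func (tm *TimedMutex) Lock() {
		tm.mu.Lock()
		tm.lockedAt = time.Now()
	}

	func (tm *TimedMutex) Unlock() {
		held := time.Since(tm.lockedAt) // read before releasing; safe under the lock
		tm.mu.Unlock()
		if held > tm.threshold {
			log.Printf("mutex held for %s (>%s)", held, tm.threshold)
		}
	}

	func main() {
		m := NewTimedMutex(500 * time.Millisecond)
		m.Lock()
		time.Sleep(600 * time.Millisecond) // simulate a slow critical section
		m.Unlock()
	}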

@spencerkimball
Member

That entry cache is the common link between all of the stack traces so far.

@petermattis
Collaborator

And yet I don't see anything in the entry cache that could be causing a problem. Do you?

@spencerkimball
Member

spencerkimball commented Jan 18, 2017 via email

@bdarnell
Contributor

We cache the entire entry, so one large entry (from GC?) could push out a bunch of smaller ones. Evicting those entries takes time and causes more cache misses later. Since we care most about term numbers (which are small), it might make sense to split the entry cache into two, one for full entries and one for terms.
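
To make the proposal concrete, here is a sketch of the "terms only" half of such a split (illustrative: no eviction policy or locking, which a real cache would need):

	package main

	import "fmt"

	// termKey identifies a raft log position within a range.
	type termKey struct {
		rangeID uint64
		index   uint64
	}

	// termCache holds only the term for each cached index, so a single large
	// entry can no longer evict the term information that raft asks for most.
	type termCache struct {
		terms map[termKey]uint64
	}

	func newTermCache() *termCache {
		return &termCache{terms: make(map[termKey]uint64)}
	}

	func (c *termCache) add(rangeID, index, term uint64) {
		c.terms[termKey{rangeID, index}] = term
	}

	func (c *termCache) term(rangeID, index uint64) (uint64, bool) {
		t, ok := c.terms[termKey{rangeID, index}]
		return t, ok
	}

	func main() {
		c := newTermCache()
		c.add(686, 2625, 7)
		fmt.Println(c.term(686, 2625))
	}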

@petermattis
Collaborator

Let me add some instrumentation to raftEntryCache to explore the hypothesis that it is causing problems.

@petermattis
Collaborator

What changed on Jan 4th that would have caused problems with raftEntryCache? Note my comment above where I tested with beta-20161215 which previously ran smoothly on delta but now sees the performance blips.

@petermattis
Collaborator

petermattis commented Jan 18, 2017

I170118 20:18:24.514912 162 storage/entry_cache.go:149 raftEntryCache.getEntries(r2705,2625,2626): count=1/693 123.5ms lock=0.0s

This indicates that raftEntryCache.getEntries took 123.5ms to retrieve one 693 byte entry. Lock acquisition took 0.0ms.

@bdarnell
Contributor

You're computing this delta from time.Now(), right? It would be interesting to compare this delta with one that uses a monotonic clock (such as the one linked in #10572). It's possible that what we're seeing is time jumping forward (either due to ntp or virtualization issues) instead of something taking time inside our VM.

Also, we should have checked the GCE logs earlier. Looks like we had two nodes migrated to new machines on Jan 4, and two more on Jan 5 (and a few more since then. Jan 4 appears to be as far back as the logs go):

$  gcloud compute operations list|grep delta
systemevent-1484284002961-545f2ccbe06d9-77ac7c4e-175a3c9b  compute.instances.migrateOnHostMaintenance  us-east1-b/instances/cockroach-delta-02                             200          DONE    2017-01-12T21:06:42.961-08:00
systemevent-1484285033748-545f30a2e9995-6dfe500d-7da82523  compute.instances.migrateOnHostMaintenance  us-east1-b/instances/cockroach-delta-03                             200          DONE    2017-01-12T21:23:53.748-08:00
systemevent-1484320185062-545fb395d132c-cb05fa54-30d0d3ba  compute.instances.migrateOnHostMaintenance  us-east1-b/instances/cockroach-delta-01                             200          DONE    2017-01-13T07:09:45.062-08:00
systemevent-1483572012765-5454d06d1d240-5b320508-3a989b12  compute.instances.migrateOnHostMaintenance  us-east1-d/instances/cockroach-delta-08                             200          DONE    2017-01-04T15:20:12.765-08:00
systemevent-1483590065193-545513ad40387-a82b5ede-d3377e20  compute.instances.migrateOnHostMaintenance  us-east1-d/instances/cockroach-delta-07                             200          DONE    2017-01-04T20:21:05.193-08:00
systemevent-1483617019444-54557816d3ba2-21d72648-48ccacdc  compute.instances.migrateOnHostMaintenance  us-east1-d/instances/cockroach-delta-10                             200          DONE    2017-01-05T03:50:19.444-08:00
systemevent-1483626212638-54559a5624026-b519ff66-ac605637  compute.instances.migrateOnHostMaintenance  us-east1-d/instances/cockroach-delta-09                             200          DONE    2017-01-05T06:23:32.638-08:00
systemevent-1483994056967-545af4a9d45c2-c9e1c305-e464ffa3  compute.instances.migrateOnHostMaintenance  us-east1-c/instances/cockroach-delta-06                             200          DONE    2017-01-09T12:34:16.967-08:00
systemevent-1484263680395-545ee116c48c0-f8e17d54-65682a1a  compute.instances.migrateOnHostMaintenance  us-east1-c/instances/cockroach-delta-05                             200          DONE    2017-01-12T15:28:00.395-08:00
systemevent-1484353632365-5460302fa6b56-c7bc5969-d6a1bbca  compute.instances.migrateOnHostMaintenance  us-east1-c/instances/cockroach-delta-04                             200          DONE    2017-01-13T16:27:12.365-08:00

Maybe we've just been sharing hardware with noisier neighbors since then?

@petermattis
Collaborator

You're computing this delta from time.Now(), right?

Yes.

Also, we should have checked the GCE logs earlier. Looks like we had two nodes migrated to new machines on Jan 4, and two more on Jan 5 (and a few more since then. Jan 4 appears to be as far back as the logs go):

Looks like every machine in delta was migrated since Jan 4th.

Maybe we've just been sharing hardware with noisier neighbors since then?

I ran vmstat 2 for a few minutes and didn't see any st (stolen) cpu time or anything else suspicious.

The 1-node delta cluster is exhibiting all of the same problems we've seen earlier. For example:

W170119 00:39:58.670960 127 storage/replica.go:576 [n1,s1,r93/1:/Table/51/1/81{13651…-30407…},@c421931500] replicaMu: mutex held by github.com/cockroachdb/cockroach/pkg/storage.(*Replica).handleRaftReadyRaftMuLocked for 541.158659ms (>500ms):

The above is from the lock around the raftEntryCache.addEntries call.

Also:

I170119 00:24:00.034669 178 storage/entry_cache.go:109 raftEntryCache.addEntries(r17): count=1/32 138.7ms lock=19.0s

@spencerkimball
Member

spencerkimball commented Jan 19, 2017 via email

@petermattis
Collaborator

Have you tried just the pause detector without other load?

I tried running a pause detector process while cockroach was running and experiencing blips. No pauses detected. That would seem to indicate that there is something about the cockroach process itself.
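
For anyone reproducing this, a minimal version of such a pause detector looks roughly like this (a sketch, not the exact process used here): sleep for a short, fixed interval and report whenever the observed gap is much larger, which would indicate the process (or the whole VM) was not being scheduled.

	package main

	import (
		"fmt"
		"time"
	)

	func main() {
		const interval = 10 * time.Millisecond
		const slop = 100 * time.Millisecond
		prev := time.Now()
		for {
			time.Sleep(interval)
			now := time.Now()
			if gap := now.Sub(prev); gap > interval+slop {
				fmt.Printf("%s: pause of %s detected\n", now.Format(time.RFC3339), gap)
			}
			prev = now
		}
	}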

@petermattis
Collaborator

I keep trimming down the scenario on delta, and the performance blips still reproduce. The graph below is for a 1-node cluster running on delta-01, configured to use only /mnt/data1, with a single block_writer --concurrency 1 running against it:

[graph: screen shot 2017-01-18 at 9 25 14 pm]

@petermattis
Collaborator

petermattis commented Jan 19, 2017

I'm suspecting something funky with the delta nodes. I created a new delta-11 node configured as a 32 vCPU / 32 GB RAM instance, which I believe wouldn't be susceptible to a noisy-neighbor problem since it consumes the entire machine. A 1-node (insecure) cluster on this machine showed perfectly stable performance over 5 minutes.

Next I'm going to try creating a new node with the same 8 vCPU / 16 GB RAM configuration as current delta nodes.

@bdarnell
Contributor

The current delta nodes are 8 vCPU (and 16GB RAM). Do we know anything about the correspondence between vCPUs and real CPUs? (unless it's 1:1, things can get weird).

@petermattis
Collaborator

Do we know anything about the correspondence between vCPUs and real CPUs?

I thought 1 vCPU == 1 core. Perhaps @a-robinson knows for sure.

OK, one last verification was to run the exact same config on delta-01 that I was experimenting with on (the now-defunct) delta-11. Specifically, I created a new insecure cluster using:

./cockroach start --insecure --store=/mnt/data1

And then ran:

./block_writer --splits 1000 --concurrency 1 --tolerate-errors --min-block-bytes=8 --max-block-bytes=128

Doing this on delta-01 exhibits the same performance blips we've been debugging in this issue, while the brand-new instances did not.

@spencerkimball
Member

Is it time to ping Google about this problem? Relatedly, have we searched GCE issues for any mention of this sort of problem?

@petermattis
Collaborator

Is it time to ping Google about this problem?

Yes.

Relatedly, have we searched GCE issues for any mention of this sort of problem?

I've done some brief searches on the interwebs and found nothing other than research papers about security concerns with VMs sharing the same machine.

@petermattis
Collaborator

FYI, a brand new 8 vCPU / 16 GB RAM instance has stable performance as well.

I bet if we destroyed all of the delta instances and recreated them the problem would go away.

@mberhault
Contributor Author

Following up:
I tried running individual in-memory cockroach nodes on each machine (independent single-node cluster for each) and running block_writer against each:

./cockroach start --store=type=mem,size=2GiB --ca-cert=certs/ca.crt --cert=certs/node.crt --key=certs/node.key
./block_writer --concurrency 8 --tolerate-errors --min-block-bytes=8 --max-block-bytes=128 'postgres://root@localhost:26257/?sslmode=verify-ca&sslrootcert=certs/ca.crt&sslcert=certs/root.client.crt&sslkey=certs/root.client.key'

Some of the nodes on delta are showing the clear 2-minute pattern; others are not, or the pattern is not as clear.

Rebooting the VM seemed to get it out of the pattern, e.g., 104.196.20.102 before and after reboot:
[graph: delta_restart]

That node's kernel went from 3.16.7-ckt25-2+deb8u3 to 3.16.36-1+deb8u2, but another node (104.196.175.87) picked up the very same 3.16.36 kernel three months ago and is still showing the sawtooth pattern.

Beyond that, the two-minute blips are still undiagnosed; nothing detectable happens on the system at that time.

@a-robinson
Contributor

Closing because we're quite confident that this isn't a cockroach problem.
