
stability: delta experiencing regular hiccups #12713

Closed
mberhault opened this issue Jan 5, 2017 · 44 comments

Comments

@mberhault
Contributor

About 12 hours after upgrading delta to the beta candidate (sha e1ef4ec), and after running happily for that period, delta started seeing regular dips in qps every ~2 minutes:

[graph: delta_sql]

This matched spikes in elections:
[graph: delta_raft]

The TimeoutNow bumps came mainly from a single node: 104.196.163.204:
[graph: delta_timeoutnow]

There is nothing obviously wrong with the machine itself.
Looking through its logs, we can see periods of calm, followed by some slow mutex warnings on handleRaftReady (~500ms), then a bit more calm, and finally a flood of slow mutex warnings (many per second, each between ~500ms and 2s).

Sample log:
cockroach.stderr.txt

@mberhault
Contributor Author

The cluster is still up; I haven't tried taking down nodes. I did, however, stop block_writer against that single node, and then against all of them. The 2-minute spikes were attenuated, but resumed as soon as block_writer was started again.

@bdarnell, I'm fresh out of ideas about what to look for. I tried turning up verbosity (/debug/vmodule/store=3,replica=3,raft=3), but no trigger event was visible, just the expected election traffic.

@petermattis
Collaborator

I just poked around on the problematic node as well and didn't see anything that jumped out at me.

@bdarnell
Contributor

bdarnell commented Jan 7, 2017

Hmm, the 2-minute period doesn't ring a bell (it's used for gossip TTLs, but those are supposed to be refreshed every 1 minute without expiring).

raftpb.MsgTimeoutNow is part of the leadership transfer: the current leader sends a MsgTimeoutNow to the node that should become the new leader, and then that node sends MsgVote to the other nodes. Here, it looks like the same node is sending MsgTimeoutNow and MsgVote (looks simultaneous in the graph, but that just means it's within the same metrics collection window), suggesting that the attempted leadership transfer failed and a second election was required.
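
For illustration only, here's a minimal sketch of the kind of per-message-type tally the graphs above are based on, using the raftpb message types involved in a transfer (the vendored import path is an assumption from that era, not the actual metrics code):

	package main

	import (
		"fmt"

		"github.com/coreos/etcd/raft/raftpb" // vendored path at the time; an assumption
	)

	// classify tallies the message types involved in a leadership transfer:
	// the old leader sends MsgTimeoutNow to the transfer target, the target
	// campaigns and sends MsgVote to its peers, and peers answer with
	// MsgVoteResp.
	func classify(msgs []raftpb.Message) map[raftpb.MessageType]int {
		counts := make(map[raftpb.MessageType]int)
		for _, m := range msgs {
			switch m.Type {
			case raftpb.MsgTimeoutNow, raftpb.MsgVote, raftpb.MsgVoteResp:
				counts[m.Type]++
			}
		}
		return counts
	}

	func main() {
		fmt.Println(classify(nil)) // prints an empty map; feed it captured messages
	}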

Right now it looks like the performance is still varying with a 2-minute period, but the TimeoutNow and Vote/VoteResp messages are no longer present, so it doesn't look like this was the root cause. I don't have any other suggestions right now except to figure out why the mutex is blocking so long. We'd expect to see raft elections after three seconds of blocking; blocking for 500ms-2s is obviously not good but shouldn't cause any specific problems that I'm aware of.

@bdarnell
Contributor

One of the nodes in delta is in a bad state: 104.196.175.87 only has leases for 548 ranges, while all the other members of the cluster have ~3.5k. Furthermore, it appears to be sending a constant stream of MsgVotes (and getting very few MsgVoteResps, presumably due to CheckQuorum). Ranges are stuck in StateCandidate and are therefore getting sent to the replicaGCQueue often (which examines them but never GCs them). At least some of the time the replicaGCQueue's consistent RangeLookup times out after a minute. This node is serving zero SQL qps.

These symptoms suggest problems with the raft transport, but I don't see any obvious problems there. And nothing on this misbehaving node matches up with the two-minute cycle seen on the relatively healthy nodes in the cluster.

@bdarnell
Contributor

I noticed this message while loading the /#/raft status page: transport: write tcp 10.142.0.42:52968->10.142.0.40:26257: write: connection refused (10.142.0.40 is the internal address of the node discussed in my previous message). Grepping the logs on that node, I see exactly the same message (including the same ephemeral port number) repeating over a period of days. The first occurrence is:

logs/cockroach.stderr.13:I170106 17:45:40.688074 121 vendor/google.golang.org/grpc/clientconn.go:798  grpc: addrConn.transportMonitor exits due to: connection error: desc = "transport: write tcp 10.142.0.42:52968->10.142.0.40:26257: write: connection refused"

That is after this issue was filed, so this cannot be the root cause, but perhaps there were similar problems with other node pairs. It appears that something is hanging on to this failed connection and trying to reuse it instead of creating a new connection.

@petermattis
Collaborator

@tamird Didn't one of the gRPC bumps change the handling of connections in some way?

@bdarnell
Contributor

A lot of stuff changed when we updated grpc on 12/15; I don't think there were any changes closer to the start of this issue.

Looking closer at the logs, we see a bunch of "getsockopt: connection refused" which are retried after 1 second, followed by one "write: connection refused" which is never retried.

I170106 17:45:34.212243 121 vendor/google.golang.org/grpc/clientconn.go:698  grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 10.142.0.40:26257: getsockopt: connection refused"; Reconnecting to {cockroach-delta-04:26257 <nil>}
I170106 17:45:35.298321 121 vendor/google.golang.org/grpc/clientconn.go:698  grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 10.142.0.40:26257: getsockopt: connection refused"; Reconnecting to {cockroach-delta-04:26257 <nil>}
I170106 17:45:36.390652 121 vendor/google.golang.org/grpc/clientconn.go:698  grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 10.142.0.40:26257: getsockopt: connection refused"; Reconnecting to {cockroach-delta-04:26257 <nil>}
I170106 17:45:37.442627 121 vendor/google.golang.org/grpc/clientconn.go:698  grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 10.142.0.40:26257: getsockopt: connection refused"; Reconnecting to {cockroach-delta-04:26257 <nil>}
I170106 17:45:38.563113 121 vendor/google.golang.org/grpc/clientconn.go:698  grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 10.142.0.40:26257: getsockopt: connection refused"; Reconnecting to {cockroach-delta-04:26257 <nil>}
I170106 17:45:39.633187 121 vendor/google.golang.org/grpc/clientconn.go:698  grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 10.142.0.40:26257: getsockopt: connection refused"; Reconnecting to {cockroach-delta-04:26257 <nil>}
I170106 17:45:40.688074 121 vendor/google.golang.org/grpc/clientconn.go:798  grpc: addrConn.transportMonitor exits due to: connection error: desc = "transport: write tcp 10.142.0.42:52968->10.142.0.40:26257: write: connection refused"

net.Dial is always supposed to check getsockopt after connecting. In order to see write: connection refused, the connection must have made it past the getsockopt check, but I don't see how yet.
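
For comparison, this is the behavior one would normally expect from the Go dialer (a standalone sketch, not CockroachDB code): dialing a port with no listener fails at Dial time with "connection refused", so seeing that string on a write implies the connection was first established and only later torn down.

	package main

	import (
		"fmt"
		"net"
		"time"
	)

	func main() {
		// Dialing a port with no listener normally fails right here, with the
		// "connection refused" surfaced by the connect/getsockopt step.
		conn, err := net.DialTimeout("tcp", "127.0.0.1:26257", time.Second)
		if err != nil {
			fmt.Println("dial error (expected when nothing is listening):", err)
			return
		}
		defer conn.Close()

		// To see an error on write instead, the connection must have been
		// established first and gone away afterwards; a refused *write* on an
		// established TCP connection is the surprising part of the log above.
		if _, err := conn.Write([]byte("ping")); err != nil {
			fmt.Println("write error:", err)
		}
	}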

@tamird
Contributor

tamird commented Jan 11, 2017 via email

@petermattis
Collaborator

Lightstep is showing slow queries every 2m on delta. For example:

   0s [client=[::1]:47646,user=root,n7] executing 1/1: INSERT INTO blocks(block_id, writer_id, block_num, raw_bytes) VALUES (5416627661108716854, '51606a2e-8fa6-4af0-85b7-b1359856c3ac', 109914, $1)
   0s [client=[::1]:47646,user=root,n7] txn: requesting 10240 bytes from the pool
   0s [client=[::1]:47646,user=root,n7] querying next range at /Table/51/1/5416627661108716854/"51606a2e-8fa6-4af0-85b7-b1359856c3ac"/109914/0
   0s [client=[::1]:47646,user=root,n7] sending batch 1 CPut, 1 BeginTxn, 1 EndTxn to range 78
 0.5s [client=[::1]:47646,user=root,n7] timeout, trying next peer
 0.5s [client=[::1]:47646,user=root,n7] application error: [NotLeaseHolderError] range 78: replica {7 14 6} not lease holder; current lease is repl={4 8 2} start=1484670803.911548726,65 epo=1 pro=1484675551.989515565,0
 0.5s [client=[::1]:47646,user=root,n7] error, trying next peer
 0.5s [client=[::1]:47646,user=root,n7] application error: [NotLeaseHolderError] range 78: replica {8 15 5} not lease holder; current lease is repl={4 8 2} start=1484670803.911548726,65 epo=1 pro=1484675551.989515565,0
2.85s [client=[::1]:47646,user=root,n7] coordinator stops
2.85s [client=[::1]:47646,user=root,n7] txn: releasing 10240 bytes to the pool

I've looked at a half-dozen such queries and they all have this same pattern.

@petermattis
Collaborator

petermattis commented Jan 17, 2017

The lightstep trace above shows that we tried to send the request to the leaseholder, timed out after 0.5s, sent to the other two replicas, and then waited another ~2.3s for the leaseholder to respond. Looking at a similar trace, during the period when the leaseholder is unresponsive I see a whole bunch of log messages like:

W170117 20:28:19.336461 189 storage/replica.go:568  [n4,s8,r686/1:/Table/51/1/32{02423…-20317…},@c4216cc300] raftMu: mutex held by github.com/cockroachdb/cockroach/pkg/storage.(*Store).processRaftRequest for 614.3871ms (>500ms):

Those log messages are actually sprinkled throughout the file. And fascinatingly, they come in 2m bursts:

cockroach@cockroach-delta-04:~$ grep 'mutex held by' logs/cockroach.stderr | awk '{print $2}' | awk -F: '{printf "%s:%s\n", $1, $2}' | uniq -c | tail
     45 20:24
     61 20:26
     72 20:28
     43 20:30
     94 20:32
     61 20:34
     65 20:36
     61 20:38
     47 20:40
    147 20:42

The "mutex held by" messages are not happening in lockstep on every node, and some nodes were seeing the messages an hour ago but not now.

@petermattis
Collaborator

Here's a related curiosity:

W170117 20:42:13.917769 16879706 storage/replica.go:580 [n4,s7,r22/6:/Table/51{-/1/17740…},@c420a01b00] replicaMu: mutex held by github.com/cockroachdb/cockroach/pkg/storage.(*Replica).propose for 628.550232ms (>500ms):

There is almost nothing in Replica.propose that could account for it taking so long, as it is all straight-shot code once Replica.mu is acquired.

@bdarnell
Contributor

Replica.propose acquires Replica.mu in two separate places. The second one includes quite a lot of stuff (defaultSubmitProposalLocked, withRaftGroupLocked, marshaling the request protobuf...). withRaftGroupLocked can include calls to read from rocksdb if the raft group is being created for the first time.

@petermattis
Collaborator

This is the second acquisition of Replica.mu that covers defaultSubmitProposalLocked. I'm pretty sure the raft group is not being created. Marshaling the request protobuf might take a while, but hundreds of milliseconds?
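
One way to answer that question empirically is a throwaway micro-benchmark along these lines (hypothetical; the marshaler interface assumes a gogo/protobuf-generated type with a Marshal method, and the names are illustrative):

	package storage_test

	import "testing"

	// marshaler matches the Marshal method that gogo/protobuf-generated request
	// types expose (an assumption; adjust for the actual generated code).
	type marshaler interface {
		Marshal() ([]byte, error)
	}

	// benchmarkMarshal times marshaling of a representative request so that a
	// result in the hundreds of milliseconds (or not) settles the question above.
	func benchmarkMarshal(b *testing.B, req marshaler) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			if _, err := req.Marshal(); err != nil {
				b.Fatal(err)
			}
		}
	}

Wired up as, say, BenchmarkMarshalRequest(b *testing.B) { benchmarkMarshal(b, someCapturedRequest) } with a request taken from the block_writer workload.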

@petermattis
Collaborator

Inspecting the logs this morning, here's another instance of Replica.mu being held too long:

W170118 15:06:29.873193 181 storage/replica.go:580 [n4,s8,r1844/3:/Table/51/1/705{11089…-56306…},@c42513a600] replicaMu: mutex held by github.com/cockroachdb/cockroach/pkg/storage.(*Replica).handleRaftReadyRaftMuLocked for 511.958713ms (>500ms):

This corresponds to the following block of code:

	r.mu.Lock()

	lastIndex := r.mu.lastIndex // used for append below
	raftLogSize := r.mu.raftLogSize
	leaderID := r.mu.leaderID
	err := r.withRaftGroupLocked(false, func(raftGroup *raft.RawNode) (bool, error) {
		if hasReady = raftGroup.HasReady(); hasReady {
			rd = raftGroup.Ready()
		}
		return hasReady /* unquiesceAndWakeLeader */, nil
	})
	r.mu.Unlock()

There is no rebalance activity anywhere near this time frame, so I'm pretty confident the Raft group wasn't being initialized.

@petermattis
Collaborator

And another one:

W170118 15:10:29.769793 183 storage/replica.go:580 [n4,s8,r580/4:/Table/51/1/295{26843…-71418…},@c42e727500] replicaMu: mutex held by github.com/cockroachdb/cockroach/pkg/storage.(*Replica).withRaftGroup for 501.087395ms (>500ms):

The stacktrace (not shown above) shows this corresponds to the call below:

	if err := r.withRaftGroup(func(raftGroup *raft.RawNode) (bool, error) {
		// We're processing a message from another replica which means that the
		// other replica is not quiesced, so we don't need to wake the leader.
		r.unquiesceLocked()
		return false, /* !unquiesceAndWakeLeader */
			raftGroup.Step(req.Message)
	}); err != nil {
		return roachpb.NewError(err)
	}

This is all extremely curious. Three different sections of straight-shot code which should execute in much less than a millisecond are taking >500ms. That implies that something is starving the goroutine (or the entire process) of CPU, but I haven't seen any other evidence of that yet.
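
One cheap thing to rule out is GC-induced stalls (just one possible starvation source, not a diagnosis). A sampler like the sketch below would surface any GC pause approaching the 500ms warning threshold; note it has to run inside the process being observed, so in practice GODEBUG=gctrace=1 on the cockroach process is the easier route.

	package main

	import (
		"fmt"
		"runtime"
		"time"
	)

	func main() {
		const threshold = 100 * time.Millisecond
		var ms runtime.MemStats
		for range time.Tick(2 * time.Second) {
			runtime.ReadMemStats(&ms)
			// PauseNs is a circular buffer; the most recent pause sits at
			// index (NumGC+255)%256.
			recent := time.Duration(ms.PauseNs[(ms.NumGC+255)%256])
			if recent > threshold {
				fmt.Printf("long GC pause: %s (total %s over %d GCs)\n",
					recent, time.Duration(ms.PauseTotalNs), ms.NumGC)
			}
		}
	}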

@spencerkimball
Member

spencerkimball commented Jan 18, 2017 via email

@bdarnell
Contributor

We use raft.RawNode, not raft.Node. There are no channels involved in RawNode.Propose, just appending to a slice (the only lock involved is Replica.raftMu, which we acquire before Replica.mu).

@spencerkimball
Member

Good point, so it's not the channels in raft.Node. But I'm still suspicious of the call to raft.Propose. Perhaps we should instrument the call to r.mu.internalRaftGroup.Propose and more specifically, the raft log append invocation.

@petermattis
Collaborator

raft.RawNode.Propose is only performing in-memory operations. No locks. No channels. Nada. The actual appending to the raft log happens on the next call to Replica.handleRaftReady....

@bdarnell
Contributor

Ah, right. RawNode.Propose doesn't write to disk (that's all done via Ready), but it does (sometimes) read (specifically, it calls FirstIndex and LastIndex, which are in memory, and Term, which is cached but on a cache miss will go to disk).

@petermattis
Collaborator

I recall instrumenting Term not too long ago to see how often we had to go to RocksDB. In my testing the answer was never. Might be worth verifying that is the case on delta.

@petermattis
Collaborator

Here's another one which doesn't involve Raft:

W170118 17:14:25.636106 143 storage/replica.go:607 [n4,s7,r991/2:/Table/51/1/3{361884…-405877…},@c420aaa300] replicaMu: mutex held by github.com/cockroachdb/cockroach/pkg/storage.(*Replica).handleRaftReadyRaftMuLocked for 500.485611ms (>500ms):

The stack shows it corresponds to this code:

	r.mu.Lock()
	r.store.raftEntryCache.addEntries(r.RangeID, rd.Entries)
	r.mu.lastIndex = lastIndex
	r.mu.raftLogSize = raftLogSize
	r.mu.leaderID = leaderID
	r.mu.Unlock()

@spencerkimball Care to speculate what could have taken 500ms in that block of code?

@petermattis
Collaborator

Tried another experiment and reverted delta back to beta-20161215, the last beta before this problem started occurring. The periodic blips still occur.

@spencerkimball
Member

Well, you're locking in the raft entry cache. Perhaps something is going on there. Why not make that a timed mutex?
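
A minimal sketch of what such a timed mutex could look like (illustrative names only, not the existing storage/syncutil code, which already does something similar for raftMu/replicaMu):

	package main

	import (
		"log"
		"sync"
		"time"
	)

	// TimedMutex logs a warning when a critical section exceeds a threshold,
	// mirroring the existing raftMu/replicaMu "mutex held by" warnings.
	type TimedMutex struct {
		mu        sync.Mutex
		lockedAt  time.Time
		threshold time.Duration
	}

	func NewTimedMutex(threshold time.Duration) *TimedMutex {
		return &TimedMutex{threshold: threshold}
	}

	func (tm *TimedMutex) Lock() {
		tm.mu.Lock()
		tm.lockedAt = time.Now()
	}

	func (tm *TimedMutex) Unlock() {
		held := time.Since(tm.lockedAt) // read before releasing; safe under the lock
		tm.mu.Unlock()
		if held > tm.threshold {
			log.Printf("mutex held for %s (>%s)", held, tm.threshold)
		}
	}

	func main() {
		m := NewTimedMutex(500 * time.Millisecond)
		m.Lock()
		time.Sleep(600 * time.Millisecond) // simulate a slow critical section
		m.Unlock()
	}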

@spencerkimball
Member

That entry cache is the common link between all of the stack traces so far.

@petermattis
Collaborator

And yet I don't see anything in the entry cache that could be causing a problem. Do you?

@spencerkimball
Member

spencerkimball commented Jan 18, 2017 via email

@bdarnell
Contributor

We cache the entire entry, so one large entry (from GC?) could push out a bunch of smaller ones. Evicting those entries takes time and causes more cache misses later. Since we care most about term numbers (which are small), it might make sense to split the entry cache into two, one for full entries and one for terms.
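
To make the proposal concrete, here is a sketch of the "terms only" half of such a split (illustrative: no eviction policy or locking, which a real cache would need):

	package main

	import "fmt"

	// termKey identifies a raft log position within a range.
	type termKey struct {
		rangeID uint64
		index   uint64
	}

	// termCache holds only the term for each cached index, so a single large
	// entry can no longer evict the term information that raft asks for most.
	type termCache struct {
		terms map[termKey]uint64
	}

	func newTermCache() *termCache {
		return &termCache{terms: make(map[termKey]uint64)}
	}

	func (c *termCache) add(rangeID, index, term uint64) {
		c.terms[termKey{rangeID, index}] = term
	}

	func (c *termCache) term(rangeID, index uint64) (uint64, bool) {
		t, ok := c.terms[termKey{rangeID, index}]
		return t, ok
	}

	func main() {
		c := newTermCache()
		c.add(686, 2625, 7)
		fmt.Println(c.term(686, 2625))
	}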

@petermattis
Collaborator

Let me add some instrumentation to raftEntryCache to explore the hypothesis that it is causing problems.

@petermattis
Collaborator

What changed on Jan 4th that would have caused problems with raftEntryCache? Note my comment above where I tested with beta-20161215 which previously ran smoothly on delta but now sees the performance blips.

@petermattis
Collaborator

petermattis commented Jan 18, 2017

I170118 20:18:24.514912 162 storage/entry_cache.go:149 raftEntryCache.getEntries(r2705,2625,2626): count=1/693 123.5ms lock=0.0s

This indicates that raftEntryCache.getEntries took 123.5ms to retrieve one 693 byte entry. Lock acquisition took 0.0ms.

@bdarnell
Contributor

You're computing this delta from time.Now(), right? It would be interesting to compare this delta with one that uses a monotonic clock (such as the one linked in #10572). It's possible that what we're seeing is time jumping forward (either due to ntp or virtualization issues) instead of something taking time inside our VM.

Also, we should have checked the GCE logs earlier. Looks like we had two nodes migrated to new machines on Jan 4, and two more on Jan 5 (and a few more since then. Jan 4 appears to be as far back as the logs go):

$  gcloud compute operations list|grep delta
systemevent-1484284002961-545f2ccbe06d9-77ac7c4e-175a3c9b  compute.instances.migrateOnHostMaintenance  us-east1-b/instances/cockroach-delta-02                             200          DONE    2017-01-12T21:06:42.961-08:00
systemevent-1484285033748-545f30a2e9995-6dfe500d-7da82523  compute.instances.migrateOnHostMaintenance  us-east1-b/instances/cockroach-delta-03                             200          DONE    2017-01-12T21:23:53.748-08:00
systemevent-1484320185062-545fb395d132c-cb05fa54-30d0d3ba  compute.instances.migrateOnHostMaintenance  us-east1-b/instances/cockroach-delta-01                             200          DONE    2017-01-13T07:09:45.062-08:00
systemevent-1483572012765-5454d06d1d240-5b320508-3a989b12  compute.instances.migrateOnHostMaintenance  us-east1-d/instances/cockroach-delta-08                             200          DONE    2017-01-04T15:20:12.765-08:00
systemevent-1483590065193-545513ad40387-a82b5ede-d3377e20  compute.instances.migrateOnHostMaintenance  us-east1-d/instances/cockroach-delta-07                             200          DONE    2017-01-04T20:21:05.193-08:00
systemevent-1483617019444-54557816d3ba2-21d72648-48ccacdc  compute.instances.migrateOnHostMaintenance  us-east1-d/instances/cockroach-delta-10                             200          DONE    2017-01-05T03:50:19.444-08:00
systemevent-1483626212638-54559a5624026-b519ff66-ac605637  compute.instances.migrateOnHostMaintenance  us-east1-d/instances/cockroach-delta-09                             200          DONE    2017-01-05T06:23:32.638-08:00
systemevent-1483994056967-545af4a9d45c2-c9e1c305-e464ffa3  compute.instances.migrateOnHostMaintenance  us-east1-c/instances/cockroach-delta-06                             200          DONE    2017-01-09T12:34:16.967-08:00
systemevent-1484263680395-545ee116c48c0-f8e17d54-65682a1a  compute.instances.migrateOnHostMaintenance  us-east1-c/instances/cockroach-delta-05                             200          DONE    2017-01-12T15:28:00.395-08:00
systemevent-1484353632365-5460302fa6b56-c7bc5969-d6a1bbca  compute.instances.migrateOnHostMaintenance  us-east1-c/instances/cockroach-delta-04                             200          DONE    2017-01-13T16:27:12.365-08:00

Maybe we've just been sharing hardware with noisier neighbors since then?

@petermattis
Collaborator

You're computing this delta from time.Now(), right?

Yes.

Also, we should have checked the GCE logs earlier. Looks like we had two nodes migrated to new machines on Jan 4, and two more on Jan 5 (and a few more since then. Jan 4 appears to be as far back as the logs go):

Looks like every machine in delta was migrated since Jan 4th.

Maybe we've just been sharing hardware with noisier neighbors since then?

I ran vmstat 2 for a few minutes and didn't see any st (stolen) cpu time or anything else suspicious.

The 1-node delta cluster is exhibiting all of the same problems we've seen earlier. For example:

W170119 00:39:58.670960 127 storage/replica.go:576 [n1,s1,r93/1:/Table/51/1/81{13651…-30407…},@c421931500] replicaMu: mutex held by github.com/cockroachdb/cockroach/pkg/storage.(*Replica).handleRaftReadyRaftMuLocked for 541.158659ms (>500ms):

The above is from the lock around the raftEntryCache.addEntries call.

Also:

I170119 00:24:00.034669 178 storage/entry_cache.go:109 raftEntryCache.addEntries(r17): count=1/32 138.7ms lock=19.0s

@spencerkimball
Member

spencerkimball commented Jan 19, 2017 via email

@petermattis
Collaborator

Have you tried just the pause detector without other load?

I tried running a pause detector process while cockroach was running and experiencing blips. No pauses detected. That would seem to indicate that there is something about the cockroach process itself.
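
For anyone reproducing this, a minimal version of such a pause detector looks roughly like this (a sketch, not the exact process used here): sleep for a short, fixed interval and report whenever the observed gap is much larger, which would indicate the process (or the whole VM) was not being scheduled.

	package main

	import (
		"fmt"
		"time"
	)

	func main() {
		const interval = 10 * time.Millisecond
		const slop = 100 * time.Millisecond
		prev := time.Now()
		for {
			time.Sleep(interval)
			now := time.Now()
			if gap := now.Sub(prev); gap > interval+slop {
				fmt.Printf("%s: pause of %s detected\n", now.Format(time.RFC3339), gap)
			}
			prev = now
		}
	}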

@petermattis
Collaborator

I keep trimming down the scenario on delta, and the performance blips still reproduce. The graph below is for a 1-node cluster running on delta-01, configured to use only /mnt/data1, with a single block_writer --concurrency 1 running against it:

[graph: screen shot 2017-01-18 at 9 25 14 pm]

@petermattis
Collaborator

petermattis commented Jan 19, 2017

I'm suspecting something funky with the delta nodes. I created a new delta-11 node configured as a 32 vCPU / 32 GB RAM instance, which I believe wouldn't be susceptible to a noisy-neighbor problem since it consumes the entire machine. A 1-node (insecure) cluster on this machine showed perfectly stable performance over 5 minutes.

Next I'm going to try creating a new node with the same 8 vCPU / 16 GB RAM configuration as current delta nodes.

@bdarnell
Contributor

The current delta nodes are 8 vCPU (and 16GB RAM). Do we know anything about the correspondence between vCPUs and real CPUs? (unless it's 1:1, things can get weird).

@petermattis
Collaborator

Do we know anything about the correspondence between vCPUs and real CPUs?

I thought 1 vCPU == 1 core. Perhaps @a-robinson knows for sure.

OK, one last verification was to run the exact same config on delta-01 that I was experimenting with on (the now-defunct) delta-11. Specifically, I created a new insecure cluster using:

./cockroach start --insecure --store=/mnt/data1

And then ran:

./block_writer --splits 1000 --concurrency 1 --tolerate-errors --min-block-bytes=8 --max-block-bytes=128

Doing this on delta-01 exhibits the same performance blips we've been debugging in this issue, while the brand-new instances did not.

@spencerkimball
Member

Is it time to ping Google about this problem? Relatedly, have we searched GCE issues for any mention of this sort of problem?

@petermattis
Collaborator

Is it time to ping Google about this problem?

Yes.

Relatedly, have we searched GCE issues for any mention of this sort of problem?

I've done some brief searches on the interwebs and found nothing other than research papers about security concerns with VMs sharing the same machine.

@petermattis
Collaborator

FYI, a brand new 8 vCPU / 16 GB RAM instance has stable performance as well.

I bet if we destroyed all of the delta instances and recreated them the problem would go away.

@mberhault
Contributor Author

Following up:
I tried running individual in-memory cockroach nodes on each machine (independent single-node cluster for each) and running block_writer against each:

./cockroach start --store=type=mem,size=2GiB --ca-cert=certs/ca.crt --cert=certs/node.crt --key=certs/node.key
./block_writer --concurrency 8 --tolerate-errors --min-block-bytes=8 --max-block-bytes=128 'postgres://root@localhost:26257/?sslmode=verify-ca&sslrootcert=certs/ca.crt&sslcert=certs/root.client.crt&sslkey=certs/root.client.key'

Some of the nodes on delta are showing the clear 2-minute pattern; others are not, or the pattern is not as clear.

Rebooting the VM seemed to get it out of the pattern, e.g., 104.196.20.102 before and after reboot:
[graph: delta_restart]

That node's kernel went from 3.16.7-ckt25-2+deb8u3 to 3.16.36-1+deb8u2, but another node (104.196.175.87) picked up the very same 3.16.36 kernel three months ago and is still showing the sawtooth pattern.

Beyond that, the two-minute blips are still undiagnosed; nothing detectable happens on the system at that time.

@a-robinson
Contributor

Closing because we're quite confident that this isn't a cockroach problem.
