stability: delta experiencing regular hiccups #12713
The cluster is still up. I haven't tried to take down nodes. I did, however, stop block_writer on that single node as well as on all of them. The 2min spikes were attenuated, but started afresh when block_writer was started up again. @bdarnell, I'm fresh out of ideas about what to look for. I tried turning up verbosity (/debug/vmodule/store=3,replica=3,raft=3), but no trigger event was visible, just the expected election traffic. |
I just poked around on the problematic node as well and didn't see anything that jumped out at me. |
Hmm, the 2-minute period doesn't ring a bell (it's used for gossip TTLs, but those are supposed to be refreshed every 1 minute without expiring).
Right now it looks like the performance is still varying with a 2-minute period, but the TimeoutNow and Vote/VoteResp messages are no longer present, so it doesn't look like this was the root cause. I don't have any other suggestions right now except to figure out why the mutex is blocking for so long. We'd only expect to see raft elections after three seconds of blocking; blocking for 500ms-2s is obviously not good, but it shouldn't cause any specific problems that I'm aware of. |
One of the nodes in delta is in a bad state: 104.196.175.87 only has leases for 548 ranges, while all the other members of the cluster have ~3.5k. Furthermore, it appears to be sending a constant stream of MsgVotes (and getting very few MsgVoteResps, presumably due to CheckQuorum). Ranges are stuck in StateCandidate and are therefore getting sent to the replicaGCQueue often (which examines them but never GCs them). At least some of the time the replicaGCQueue's consistent RangeLookup times out after a minute. This node is serving zero SQL qps. These symptoms suggest problems with the raft transport, but I don't see any obvious problems there. And nothing on this misbehaving node matches up with the two-minute cycle seen on the relatively healthy nodes in the cluster. |
I noticed this message while loading the
That is after this issue was filed, so this cannot be the root cause, but perhaps there were similar problems with other node pairs. It appears that something is hanging on to this failed connection and trying to reuse it instead of creating a new connection. |
@tamird Didn't one of the gRPC bumps change the handling of connections in some way? |
A lot of stuff changed when we updated grpc on 12/15; I don't think there were any changes closer to the start of this issue. Looking closer at the logs, we see a bunch of "getsockopt: connection refused" which are retried after 1 second, followed by one "write: connection refused" which is never retried.
    I170106 17:45:34.212243 121 vendor/google.golang.org/grpc/clientconn.go:698 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 10.142.0.40:26257: getsockopt: connection refused"; Reconnecting to {cockroach-delta-04:26257 <nil>}
    I170106 17:45:35.298321 121 vendor/google.golang.org/grpc/clientconn.go:698 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 10.142.0.40:26257: getsockopt: connection refused"; Reconnecting to {cockroach-delta-04:26257 <nil>}
    I170106 17:45:36.390652 121 vendor/google.golang.org/grpc/clientconn.go:698 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 10.142.0.40:26257: getsockopt: connection refused"; Reconnecting to {cockroach-delta-04:26257 <nil>}
    I170106 17:45:37.442627 121 vendor/google.golang.org/grpc/clientconn.go:698 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 10.142.0.40:26257: getsockopt: connection refused"; Reconnecting to {cockroach-delta-04:26257 <nil>}
    I170106 17:45:38.563113 121 vendor/google.golang.org/grpc/clientconn.go:698 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 10.142.0.40:26257: getsockopt: connection refused"; Reconnecting to {cockroach-delta-04:26257 <nil>}
    I170106 17:45:39.633187 121 vendor/google.golang.org/grpc/clientconn.go:698 grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 10.142.0.40:26257: getsockopt: connection refused"; Reconnecting to {cockroach-delta-04:26257 <nil>}
    I170106 17:45:40.688074 121 vendor/google.golang.org/grpc/clientconn.go:798 grpc: addrConn.transportMonitor exits due to: connection error: desc = "transport: write tcp 10.142.0.42:52968->10.142.0.40:26257: write: connection refused"
net.Dial is always supposed to check getsockopt <https://github.com/golang/go/blob/641ef2a73392b01382d7b2ffd34820fa36f5f9cf/src/net/fd_unix.go#L164> after connecting. In order to see write: connection refused, the connection must have made it past the getsockopt check, but I don't see how yet. |
This is beginning to smell like grpc/grpc-go#1026, isn't it?
|
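To make the dial-time vs. write-time distinction above concrete, here is a minimal Go sketch (not CockroachDB or gRPC code; the address and payload are invented) showing the two places a refused connection can surface:

    package main

    import (
    	"log"
    	"net"
    	"time"
    )

    func main() {
    	addr := "10.142.0.40:26257" // hypothetical peer address, borrowed from the logs above

    	// Case 1: the refusal surfaces at dial time. net.Dial does the
    	// non-blocking connect and then checks the socket error, so a dead
    	// peer normally shows up here as "connection refused"; the caller can
    	// retry with backoff (as gRPC does roughly once per second above).
    	conn, err := net.DialTimeout("tcp", addr, 3*time.Second)
    	if err != nil {
    		log.Printf("dial failed (the retried case): %v", err)
    		return
    	}
    	defer conn.Close()

    	// Case 2: the dial succeeded but a later write fails. This is the
    	// surprising "write: connection refused" path discussed above; a retry
    	// loop that only reacts to dial errors never sees it.
    	if _, err := conn.Write([]byte("ping")); err != nil {
    		log.Printf("write failed after a successful dial: %v", err)
    	}
    }

If that reading is right, the first six log lines above are case 1 and the final addrConn.transportMonitor line is case 2.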
Lightstep is showing slow queries every 2m on delta. For example:
I've looked at a half-dozen such queries and they all have this same pattern. |
What the above Lightstep trace shows is that we tried to send the request to the leaseholder, timed out after 0.5s, sent to the other 2 replicas, and then waited another 2.3s for the leaseholder to respond. Looking at a similar trace, during the period when the leaseholder is not responsive I see a whole bunch of log messages like:
Those log messages are actually sprinkled throughout the file. And fascinatingly, they come in 2m bursts:
The |
Here's a related curiosity: `W170117 20:42:13.917769 16879706 storage/replica.go:580 [n4,s7,r22/6:/Table/51{-/1/17740…},@c420a01b00] replicaMu: mutex held by github.com/cockroachdb/cockroach/pkg/storage.(*Replica).propose for 628.550232ms (>500ms)`: There is almost nothing in |
|
This is the second acquisition of |
Inspecting the logs this morning, here's another instance of
This corresponds to the following block of code:

    r.mu.Lock()
    lastIndex := r.mu.lastIndex // used for append below
    raftLogSize := r.mu.raftLogSize
    leaderID := r.mu.leaderID
    err := r.withRaftGroupLocked(false, func(raftGroup *raft.RawNode) (bool, error) {
    	if hasReady = raftGroup.HasReady(); hasReady {
    		rd = raftGroup.Ready()
    	}
    	return hasReady /* unquiesceAndWakeLeader */, nil
    })
    r.mu.Unlock()

There is no rebalance activity anywhere near this time frame, so I'm pretty confident the Raft group wasn't being initialized. |
And another one:

    W170118 15:10:29.769793 183 storage/replica.go:580 ***@***.*** replicaMu: mutex held by github.com/cockroachdb/cockroach/pkg/storage.(*Replica).withRaftGroup for 501.087395ms (>500ms):

The stacktrace (not shown above) shows this corresponds to the call below:

    if err := r.withRaftGroup(func(raftGroup *raft.RawNode) (bool, error) {
    	// We're processing a message from another replica which means that the
    	// other replica is not quiesced, so we don't need to wake the leader.
    	r.unquiesceLocked()
    	return false /* !unquiesceAndWakeLeader */, raftGroup.Step(req.Message)
    }); err != nil {
    	return roachpb.NewError(err)
    }

This is all extremely curious. Three different sections of straight-shot code which should execute in much less than a millisecond are taking >500ms. That implies that something is starving the goroutine (or the entire process) of CPU, but I haven't seen any other evidence of that yet. |
The idea of the entire process or goroutine being starved doesn't seem to agree with the CPU graphs, though maybe it's all getting smoothed out in some way that obscures a pegged core.
"Straight shot" here still includes a potential for the goroutine to block if it must unquiesce the Raft group. Unquiescing proposes an empty message. The Raft node proposes that message on an unbuffered channel, which is handled in the same select block as received messages (raft.Step()), a configuration state change, and ready processing (where we fetch committed entries from the log). Maybe instrumenting `Replica.unquiesceAndWakeLeaderLocked` as well as `raft/node.run` would eliminate this path as a culprit.
|
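To illustrate the blocking path described two comments up, here is a toy Go sketch (the channel names, the 700ms delay, and the structure are invented; it only mimics the shape of raft's single select loop):

    package main

    import (
    	"fmt"
    	"time"
    )

    // Toy model of the path described above: unquiescing proposes an empty
    // entry on an unbuffered channel, and that send blocks until the single
    // select loop (node.run in the real raft code) gets around to it.
    func main() {
    	propc := make(chan string)   // unbuffered, like the proposal channel
    	busyc := make(chan struct{}) // other work handled by the same select

    	go func() {
    		for {
    			select {
    			case <-busyc:
    				time.Sleep(700 * time.Millisecond) // simulate slow ready processing
    			case msg := <-propc:
    				fmt.Println("handled proposal:", msg)
    				return
    			}
    		}
    	}()

    	busyc <- struct{}{} // the loop is now occupied

    	start := time.Now()
    	propc <- "empty entry to unquiesce" // blocks until the loop returns to its select
    	fmt.Printf("proposal send blocked for %s\n", time.Since(start))
    }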
We use |
Good point, so it's not the channels in |
|
Ah, right. |
I recall instrumenting |
Here's another one which doesn't involve Raft:
The stack shows it corresponds to this code:

    r.mu.Lock()
    r.store.raftEntryCache.addEntries(r.RangeID, rd.Entries)
    r.mu.lastIndex = lastIndex
    r.mu.raftLogSize = raftLogSize
    r.mu.leaderID = leaderID
    r.mu.Unlock()

@spencerkimball Care to speculate what could have taken 500ms in that block of code? |
Tried another experiment and reverted |
Well, you're locking in the raft entry cache. Perhaps something is going on there. Why not make that a timed mutex? |
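A timed mutex along the lines suggested here could be a thin wrapper around sync.Mutex that warns when the lock was held too long; a minimal sketch (the type name and 500ms threshold are illustrative, not the actual CockroachDB implementation):

    package main

    import (
    	"log"
    	"sync"
    	"time"
    )

    // timedMutex is a sketch of the suggestion above: a mutex that remembers
    // when it was acquired and complains on Unlock if it was held too long.
    type timedMutex struct {
    	mu         sync.Mutex
    	acquiredAt time.Time
    	threshold  time.Duration
    }

    func (t *timedMutex) Lock() {
    	t.mu.Lock()
    	t.acquiredAt = time.Now()
    }

    func (t *timedMutex) Unlock() {
    	held := time.Since(t.acquiredAt)
    	t.mu.Unlock()
    	if held > t.threshold {
    		log.Printf("mutex held for %s (>%s)", held, t.threshold)
    	}
    }

    func main() {
    	m := timedMutex{threshold: 500 * time.Millisecond}
    	m.Lock()
    	time.Sleep(600 * time.Millisecond) // stand-in for slow work inside the entry cache
    	m.Unlock()
    }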
That entry cache is the common link between all of the stack traces so far. |
And yet I don't see anything in the entry cache that could be causing a problem. Do you? |
I don't, though I don't have time for a thorough review. It is suspicious that the cache is being used in all of these critical sections that otherwise appear to have no chance to block.
|
We cache the entire entry, so one large entry (from GC?) could push out a bunch of smaller ones. Evicting those entries takes time and causes more cache misses later. Since we care most about term numbers (which are small), it might make sense to split the entry cache into two, one for full entries and one for terms. |
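A rough sketch of that two-cache split (purely illustrative: the types, keys, and eviction policy are invented, and a real implementation would use an LRU with byte accounting):

    package main

    import "fmt"

    // entry stands in for raftpb.Entry: an index, a term, and a possibly large payload.
    type entry struct {
    	Index uint64
    	Term  uint64
    	Data  []byte
    }

    type key struct {
    	rangeID uint64
    	index   uint64
    }

    // splitEntryCache keeps terms (tiny, and what we need most often) separately
    // from full entries, so one huge entry can't push many terms out of the cache.
    type splitEntryCache struct {
    	terms         map[key]uint64 // index -> term; cheap to retain (unbounded here for brevity)
    	entries       map[key]entry  // full entries; bounded by maxEntryBytes
    	maxEntryBytes int
    }

    func (c *splitEntryCache) add(rangeID uint64, e entry) {
    	k := key{rangeID, e.Index}
    	c.terms[k] = e.Term
    	// Too large to cache in full: skip it rather than evicting many small
    	// entries. A real implementation would use an LRU instead.
    	if len(e.Data) > c.maxEntryBytes {
    		return
    	}
    	c.entries[k] = e
    }

    func (c *splitEntryCache) term(rangeID, index uint64) (uint64, bool) {
    	t, ok := c.terms[key{rangeID, index}]
    	return t, ok
    }

    func main() {
    	c := &splitEntryCache{
    		terms:         map[key]uint64{},
    		entries:       map[key]entry{},
    		maxEntryBytes: 1 << 20,
    	}
    	c.add(17, entry{Index: 5, Term: 3, Data: make([]byte, 2<<20)}) // oversized entry, e.g. from a GC request
    	fmt.Println(c.term(17, 5))                                     // the term is still cached: 3 true
    }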
Let me add some instrumentation to |
What changed on Jan 4th that would have caused problems with |
This indicates that |
You're computing this delta from `time.Now()`, right?
Also, we should have checked the GCE logs earlier. Looks like we had two nodes migrated to new machines on Jan 4, and two more on Jan 5 (and a few more since then. Jan 4 appears to be as far back as the logs go):
Maybe we've just been sharing hardware with noisier neighbors since then? |
Yes.
Looks like every machine in `delta` was migrated since Jan 4th.
I ran `vmstat 2` for a few minutes and didn't see any `st` (stolen) cpu time or anything else suspicious.
The 1-node `delta` cluster is exhibiting all of the same problems we've seen earlier. For example:

    W170119 00:39:58.670960 127 storage/replica.go:576 ***@***.*** replicaMu: mutex held by github.com/cockroachdb/cockroach/pkg/storage.(*Replica).handleRaftReadyRaftMuLocked for 541.158659ms (>500ms):

The above is from the lock around the `raftEntries.addCache` call. Also:

    I170119 00:24:00.034669 178 storage/entry_cache.go:109 raftEntryCache.addEntries(r17): count=1/32 138.7ms lock=19.0s

|
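One plausible reading of the `count=1/32 138.7ms lock=19.0s` line is that it separates time spent doing the work from time spent waiting on the lock; a hedged sketch of instrumentation that makes that split explicit (the function name and log format are invented):

    package main

    import (
    	"log"
    	"sync"
    	"time"
    )

    var mu sync.Mutex

    // instrumentedAdd separates "how long did we wait to acquire the lock" from
    // "how long did the work under the lock take", which is one plausible reading
    // of a line like "count=1/32 138.7ms lock=19.0s".
    func instrumentedAdd(work func()) {
    	waitStart := time.Now()
    	mu.Lock()
    	lockWait := time.Since(waitStart)

    	workStart := time.Now()
    	work()
    	workTime := time.Since(workStart)
    	mu.Unlock()

    	log.Printf("work=%s lock-wait=%s", workTime, lockWait)
    }

    func main() {
    	done := make(chan struct{})
    	// A competing caller holds the lock for a while...
    	go func() {
    		instrumentedAdd(func() { time.Sleep(300 * time.Millisecond) })
    		close(done)
    	}()
    	time.Sleep(10 * time.Millisecond) // let the first caller grab the lock
    	// ...so this caller reports a small work time but a large lock-wait.
    	instrumentedAdd(func() { time.Sleep(5 * time.Millisecond) })
    	<-done
    }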
Have you tried just the pause detector without other load?
|
I tried running a pause detector process while cockroach was running and experiencing blips. No pauses detected. That would seem to indicate that there is something about the |
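For reference, a pause detector of the kind mentioned here can be as simple as a loop that sleeps for a short interval and reports when the observed wall-clock gap is much larger; a minimal sketch (the 10ms interval and 100ms threshold are arbitrary):

    package main

    import (
    	"log"
    	"time"
    )

    // A minimal pause detector: sleep for a fixed interval and complain when the
    // observed gap is much larger, which would indicate the process (or the whole
    // VM) was not being scheduled for a while.
    func main() {
    	const interval = 10 * time.Millisecond
    	const threshold = 100 * time.Millisecond

    	last := time.Now()
    	for {
    		time.Sleep(interval)
    		now := time.Now()
    		if gap := now.Sub(last); gap > threshold {
    			log.Printf("pause detected: expected ~%s between wakeups, saw %s", interval, gap)
    		}
    		last = now
    	}
    }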
I'm suspecting something funky with the Next I'm going to try creating a new node with the same 8 vCPU / 16 GB RAM configuration as current |
The current delta nodes are 8 vCPU (and 16GB RAM). Do we know anything about the correspondence between vCPUs and real CPUs? (unless it's 1:1, things can get weird). |
I thought 1 vCPU == 1 core. Perhaps @a-robinson knows for sure. Ok, one last verification was to run the exact same config on
And then ran:
Doing this on |
Is it time to ping Google about this problem? Relatedly, have we searched GCE issues for any mention of this sort of problem? |
Yes.
I've done some brief searches on the interwebs and found nothing other than research papers about security concerns with VMs sharing the same machine. |
FYI, a brand new 8 vCPU / 16 GB RAM instance has stable performance as well. I bet if we destroyed all of the |
Closing because we're quite confident that this isn't a cockroach problem. |
About 12 hours after upgrading `delta` to the beta candidate (sha e1ef4ec), and after running happily, delta started seeing regular dips in qps every ~2 minutes:
This matched spikes in elections:
The `TimeoutNow` bumps came mainly from a single node: 104.196.163.204:
There is nothing obviously wrong with the machine itself.
Looking through its logs, we can see periods of calm, followed by some slow mutexes on `handleRaftReady` (~500ms), followed by a bit more calm, and then finally a flood of slow mutex warnings (many times a second, between ~500ms and 2s each).
Sample log: cockroach.stderr.txt