
etcdserver: fix watch metrics #11375

Closed
wants to merge 1 commit

Conversation

hexfusion
Contributor

@hexfusion hexfusion commented Nov 20, 2019

Currently, when a client closes its context during a watch, we pass codes.Unavailable to status.New() via rpctypes.ErrGRPCNoLeader [1][2]. This inadvertently records Unavailable in the Prometheus metrics, which causes an issue because Unavailable indicates that the service itself is currently unavailable [3]. This PR changes how we conclude that the leader is lost: we observe RaftStatusGetter.Leader() [4] for raft.None, and only then do we return Unavailable (no leader); otherwise we return Canceled.

Fixes #10289, #9725, #9576, #9166

[1] https://github.com/etcd-io/etcd/pull/11375/files#diff-8a4ebdea7c0a8a8926fca73c3058b0b9L200
[2]

    ErrGRPCNoLeader = status.New(codes.Unavailable, "etcdserver: no leader").Err()

[3] https://github.com/grpc/grpc-go/blob/master/codes/codes.go#L140
[4]

    Leader() types.ID
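
A minimal sketch of the intended mapping (hypothetical names throughout; the real change is the diff in etcdserver/api/v3rpc/watch.go linked at [1]):

    // Illustrative sketch only; leaderGetter, classifyWatchClose, and the
    // Canceled message are hypothetical stand-ins for the actual diff.
    package v3rpc

    import (
        "go.etcd.io/etcd/etcdserver/api/v3rpc/rpctypes"
        "go.etcd.io/etcd/pkg/types"
        "go.etcd.io/etcd/raft"

        "google.golang.org/grpc/codes"
        "google.golang.org/grpc/status"
    )

    // leaderGetter mirrors the RaftStatusGetter.Leader() method from [4].
    type leaderGetter interface {
        Leader() types.ID
    }

    // classifyWatchClose returns Unavailable (no leader) only when raft
    // reports no leader (raft.None); any other stream closure is treated
    // as a plain cancellation.
    func classifyWatchClose(sg leaderGetter) error {
        if sg.Leader() == types.ID(raft.None) {
            return rpctypes.ErrGRPCNoLeader
        }
        return status.Error(codes.Canceled, "watch canceled")
    }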

@codecov-io

codecov-io commented Nov 22, 2019

Codecov Report

Merging #11375 into master will increase coverage by 0.2%.
The diff coverage is 100%.


@@            Coverage Diff            @@
##           master   #11375     +/-   ##
=========================================
+ Coverage   64.11%   64.31%   +0.2%     
=========================================
  Files         403      403             
  Lines       37973    37996     +23     
=========================================
+ Hits        24348    24439     +91     
+ Misses      11995    11929     -66     
+ Partials     1630     1628      -2
Impacted Files Coverage Δ
etcdserver/api/v3rpc/rpctypes/error.go 90.47% <ø> (ø) ⬆️
etcdserver/api/v3rpc/watch.go 80.06% <100%> (+1.63%) ⬆️
etcdserver/api/v3rpc/lease.go 67.04% <0%> (-7.96%) ⬇️
auth/store.go 44.82% <0%> (-2.56%) ⬇️
lease/leasehttp/http.go 64.23% <0%> (-1.46%) ⬇️
etcdserver/api/v2http/client.go 84.3% <0%> (-1.21%) ⬇️
pkg/proxy/server.go 60.2% <0%> (-1.02%) ⬇️
etcdserver/v3_server.go 72.86% <0%> (-0.86%) ⬇️
mvcc/watchable_store.go 82.51% <0%> (-0.7%) ⬇️
mvcc/metrics_txn.go 100% <0%> (ø) ⬆️
... and 20 more

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ec52217...91042e2.

@hexfusion hexfusion removed the WIP label Nov 22, 2019
@hexfusion hexfusion changed the title from "WIP etcdserver: fix watch metrics" to "etcdserver: fix watch metrics" Nov 22, 2019
@hexfusion
Contributor Author

/cc @brancz

Signed-off-by: Sam Batschelet <sbatsche@redhat.com>
@hexfusion
Contributor Author

/cc @jingyih @gyuho PTAL

@brancz
Contributor

brancz commented Nov 23, 2019

I have very little idea about the code changes; they look fine to me, but I really don't know this code very well. If it does what's promised, then I'm extremely excited to finally turn the alerts back on! :)

@hexfusion
Contributor Author

@gyuho would you mind taking a peek, please? :)

@hexfusion
Contributor Author

@xiang90 would you mind taking a look, please?

    // Review excerpt (diff context) from etcdserver/api/v3rpc/watch.go:
    if err == context.Canceled {
        err = rpctypes.ErrGRPCNoLeader
    }
    md, ok := metadata.FromIncomingContext(stream.Context())
Contributor


can we add a test case for this change?

Contributor Author


sure thing
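
For illustration, a rough shape such a test might take, reusing the hypothetical classifyWatchClose helper sketched in the PR description (not the test that eventually landed):

    // Illustrative sketch only, building on the hypothetical
    // classifyWatchClose helper above.
    package v3rpc

    import (
        "testing"

        "go.etcd.io/etcd/pkg/types"

        "google.golang.org/grpc/codes"
        "google.golang.org/grpc/status"
    )

    // fakeLeaderGetter stubs RaftStatusGetter.Leader() for the test.
    type fakeLeaderGetter struct{ leader types.ID }

    func (f fakeLeaderGetter) Leader() types.ID { return f.leader }

    func TestWatchCloseWithLeaderIsCanceled(t *testing.T) {
        // With a healthy leader, a closed watch stream must map to
        // Canceled, never Unavailable.
        err := classifyWatchClose(fakeLeaderGetter{leader: types.ID(1)})
        if code := status.Code(err); code != codes.Canceled {
            t.Fatalf("expected codes.Canceled, got %v (err: %v)", code, err)
        }
    }

    func TestWatchCloseWithoutLeaderIsUnavailable(t *testing.T) {
        // With no leader (raft.None is 0), the closure maps to the
        // Unavailable "no leader" error.
        err := classifyWatchClose(fakeLeaderGetter{leader: types.ID(0)})
        if code := status.Code(err); code != codes.Unavailable {
            t.Fatalf("expected codes.Unavailable, got %v (err: %v)", code, err)
        }
    }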

    // Review excerpt: errc carries errors from the receive goroutine;
    // the stream context covers both server- and client-side cancels.
    select {
    case err = <-errc:
        close(sws.ctrlStream)
    case <-stream.Context().Done():
        err = stream.Context().Err()
        // the only server-side cancellation is noleader for now.
    }
Contributor


Is there a better way to decide whether the cancellation came from the server side or the client side?

Contributor Author


Yeah, good question. I dug pretty hard, but let me review again.

@metost

metost commented Feb 24, 2020

Hi guys, any update on this? Thanks in advance!

@philips
Contributor

philips commented Apr 29, 2020

bump @hexfusion. There are TODOs on this PR from @xiang90’s feedback.

@hexfusion
Contributor Author

I hope to get back to this soon.

@codecov-commenter

Codecov Report

Merging #11375 into master will decrease coverage by 0.64%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master   #11375      +/-   ##
==========================================
- Coverage   64.96%   64.31%   -0.65%     
==========================================
  Files         403      403              
  Lines       37969    37996      +27     
==========================================
- Hits        24666    24439     -227     
- Misses      11680    11929     +249     
- Partials     1623     1628       +5     
Impacted Files Coverage Δ
etcdserver/api/v3rpc/rpctypes/error.go 90.47% <ø> (ø)
etcdserver/api/v3rpc/watch.go 80.06% <100.00%> (-1.97%) ⬇️
auth/store.go 44.82% <0.00%> (-23.50%) ⬇️
proxy/grpcproxy/register.go 69.44% <0.00%> (-11.12%) ⬇️
etcdserver/api/v3rpc/lease.go 67.04% <0.00%> (-7.96%) ⬇️
pkg/tlsutil/tlsutil.go 86.20% <0.00%> (-6.90%) ⬇️
pkg/fileutil/purge.go 66.00% <0.00%> (-6.00%) ⬇️
proxy/grpcproxy/watcher.go 89.79% <0.00%> (-4.09%) ⬇️
clientv3/maintenance.go 40.81% <0.00%> (-2.05%) ⬇️
... and 16 more

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 63dd73c...91042e2.

ironcladlou added a commit to ironcladlou/etcd that referenced this pull request Aug 3, 2020
Before this patch, a client that cancels the context for a watch causes the
server to generate a `rpctypes.ErrGRPCNoLeader` error, which leads to the
recording of a gRPC `Unavailable` metric in association with the client watch
cancellation. The metric looks like this:

    grpc_server_handled_total{grpc_code="Unavailable",grpc_method="Watch",grpc_service="etcdserverpb.Watch",grpc_type="bidi_stream"}

So the watch server misidentifies the error as a server error and then
propagates the mistake to metrics, producing a false indicator that the leader
has been lost. This false signal then leads to false alerting.

This patch improves the behavior by:

1. Performing a deeper analysis during stream closure to more conclusively
determine that a leader has actually been lost before propagating an
ErrGRPCNoLeader error.

2. Returning an ErrGRPCWatchCanceled error if no conclusion can be drawn
regarding leader loss.

There remains an assumption that the absence of evidence of leader loss
represents a client cancellation, but in practice this seems much less likely
to break down, since client cancellations are frequent and expected.

This is a continuation of the work already done in etcd-io#11375.

Fixes etcd-io#10289, etcd-io#9725, etcd-io#9576, etcd-io#9166
ironcladlou added a commit to ironcladlou/etcd that referenced this pull request Aug 3, 2020
@ironcladlou
Contributor

I'm picking this up in #12196

ironcladlou added a commit to ironcladlou/etcd that referenced this pull request Aug 5, 2020
ironcladlou added a commit to ironcladlou/etcd that referenced this pull request Sep 29, 2020
Before this patch, a client that cancels the context for a watch causes the
server to generate a `rpctypes.ErrGRPCNoLeader` error, which leads to the
recording of a gRPC `Unavailable` metric in association with the client watch
cancellation. The metric looks like this:

    grpc_server_handled_total{grpc_code="Unavailable",grpc_method="Watch",grpc_service="etcdserverpb.Watch",grpc_type="bidi_stream"}

So the watch server misidentifies the error as a server error and then
propagates the mistake to metrics, producing a false indicator that the leader
has been lost. This false signal then leads to false alerting.

The commit 9c103dd introduced an interceptor which wraps
watch streams requiring a leader, causing those streams to be actively canceled
when leader loss is detected.

However, the error handling code assumes that all stream context cancellations
come from the interceptor. This assumption breaks when the context is canceled
because of a client stream cancellation.

The core challenge is the lack of information conveyed via `context.Context`,
which is shared by both the send and receive sides of the stream handling and
is subject to cancellation by all paths (including the gRPC library itself). If
any piece of the system cancels the shared context, there's no way for a
consumer of that context to know who cancelled it or why.

To resolve this ambiguity in the stream interceptor code specifically, this
patch introduces a custom context struct which the interceptor uses to expose a
custom error through the context when it decides to actively cancel a stream.
The consuming side can now safely propagate a generic context cancellation as a
cancellation, while the server-generated leader error is preserved and
propagated normally without any special inference.

When a client cancels the stream, there remains a race in the error handling
code between the send and receive goroutines, whereby the underlying gRPC error
is lost when the send path returns and is handled first. That issue can be
taken up separately: no matter which path wins, we can still detect a generic
cancellation.

This is a replacement of etcd-io#11375.

Fixes etcd-io#10289, etcd-io#9725, etcd-io#9576, etcd-io#9166
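
A minimal sketch of that custom-context idea (hypothetical names throughout; the actual implementation landed in #12196):

    // Illustrative sketch of the approach described above; the names are
    // hypothetical stand-ins, not the #12196 code.
    package v3rpc

    import (
        "context"
        "sync"
    )

    // cancellableContext lets the interceptor record *why* it cancelled a
    // stream, so consumers can tell a server-side leader-loss cancel apart
    // from a generic (client or gRPC library) cancellation.
    type cancellableContext struct {
        context.Context

        cancel context.CancelFunc
        mu     sync.Mutex
        reason error
    }

    func newCancellableContext(parent context.Context) *cancellableContext {
        ctx, cancel := context.WithCancel(parent)
        return &cancellableContext{Context: ctx, cancel: cancel}
    }

    // cancelWith cancels the context and remembers the server-side reason.
    func (c *cancellableContext) cancelWith(reason error) {
        c.mu.Lock()
        c.reason = reason
        c.mu.Unlock()
        c.cancel()
    }

    // Err surfaces the recorded reason when the interceptor cancelled the
    // stream; otherwise it falls through to the generic context error
    // (e.g. a client cancellation).
    func (c *cancellableContext) Err() error {
        c.mu.Lock()
        defer c.mu.Unlock()
        if c.reason != nil {
            return c.reason
        }
        return c.Context.Err()
    }

With something like this in place, the watch server can check the stream context's Err() for the interceptor's leader-loss error and otherwise report a plain cancellation.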
ironcladlou added a commit to ironcladlou/etcd that referenced this pull request Sep 29, 2020
ironcladlou added a commit to ironcladlou/etcd that referenced this pull request Sep 30, 2020
ironcladlou added a commit to ironcladlou/etcd that referenced this pull request Sep 30, 2020
@stale

stale bot commented Nov 1, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Nov 1, 2020
ironcladlou added a commit to ironcladlou/etcd that referenced this pull request Nov 18, 2020
ironcladlou added a commit to ironcladlou/etcd that referenced this pull request Nov 18, 2020
@adampl

adampl commented Nov 20, 2020

I guess it shouldn't go stale, let alone be closed.

@stale stale bot removed the stale label Nov 20, 2020
@abhi1693

@xiang90 @hexfusion @ironcladlou Any news on merging this PR?

@ptabor
Contributor

ptabor commented Feb 16, 2021

Closing, as this was replaced by #12196.

@ptabor ptabor closed this Feb 16, 2021
Development

Successfully merging this pull request may close these issues.

gRPC code Unavailable instead Canceled