upstream: add failure percentage-based outlier detection (#8130)
Description: Add a new outlier detection mode which compares each host's rate of request failure to a configured fixed threshold.

Risk Level: Low
Testing: 2 new unit tests added.
Docs Changes: New mode and config options described.
Release Notes: white_check_mark
Fixes #8105

Signed-off-by: James Forcier <jforcier@grubhub.com>
csssuf authored and alyssawilk committed Sep 12, 2019
1 parent 5551315 commit 36cf26b
Showing 9 changed files with 508 additions and 21 deletions.
30 changes: 30 additions & 0 deletions api/envoy/api/v2/cluster/outlier_detection.proto
@@ -111,4 +111,34 @@ message OutlierDetection {
// is set to true.
google.protobuf.UInt32Value enforcing_local_origin_success_rate = 15
[(validate.rules).uint32.lte = 100];

// The failure percentage to use when determining failure percentage-based outlier detection. If
// the failure percentage of a given host is greater than or equal to this value, it will be
// ejected. Defaults to 85.
google.protobuf.UInt32Value failure_percentage_threshold = 16 [(validate.rules).uint32.lte = 100];

// The % chance that a host will be actually ejected when an outlier status is detected through
// failure percentage statistics. This setting can be used to disable ejection or to ramp it up
// slowly. Defaults to 0.
//
// [#next-major-version: setting this without setting failure_percentage_threshold should be
// invalid in v4.]
google.protobuf.UInt32Value enforcing_failure_percentage = 17 [(validate.rules).uint32.lte = 100];

// The % chance that a host will be actually ejected when an outlier status is detected through
// local-origin failure percentage statistics. This setting can be used to disable ejection or to
// ramp it up slowly. Defaults to 0.
google.protobuf.UInt32Value enforcing_failure_percentage_local_origin = 18
[(validate.rules).uint32.lte = 100];

// The minimum number of hosts in a cluster in order to perform failure percentage-based ejection.
// If the total number of hosts in the cluster is less than this value, failure percentage-based
// ejection will not be performed. Defaults to 5.
google.protobuf.UInt32Value failure_percentage_minimum_hosts = 19;

// The minimum number of total requests that must be collected in one interval (as defined by the
// interval duration above) to perform failure percentage-based ejection for this host. If the
// volume is lower than this setting, failure percentage-based ejection will not be performed for
// this host. Defaults to 50.
google.protobuf.UInt32Value failure_percentage_request_volume = 20;
}
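The defaults and range limits stated in the field comments above can be summarized in a small sketch. This is an illustration only, not Envoy's implementation: the `resolve_failure_percentage_config` helper and the two module-level tables are hypothetical names, and the sketch assumes that an unset wrapper field falls back to the default quoted in its comment and that the `lte = 100` rule applies only to the three percentage fields.

```python
from typing import Dict, Optional

# Defaults as quoted in the proto comments (hypothetical table name).
DEFAULTS: Dict[str, int] = {
    "failure_percentage_threshold": 85,
    "enforcing_failure_percentage": 0,
    "enforcing_failure_percentage_local_origin": 0,
    "failure_percentage_minimum_hosts": 5,
    "failure_percentage_request_volume": 50,
}

# Fields that carry the (validate.rules).uint32.lte = 100 annotation.
PERCENT_FIELDS = {
    "failure_percentage_threshold",
    "enforcing_failure_percentage",
    "enforcing_failure_percentage_local_origin",
}


def resolve_failure_percentage_config(
    overrides: Optional[Dict[str, int]] = None,
) -> Dict[str, int]:
    """Merge user overrides onto the documented defaults and validate them."""
    resolved = dict(DEFAULTS)
    for name, value in (overrides or {}).items():
        if name not in DEFAULTS:
            raise KeyError(f"unknown field: {name}")
        if value < 0:
            raise ValueError(f"{name} must be an unsigned integer")
        if name in PERCENT_FIELDS and value > 100:
            raise ValueError(f"{name} must be <= 100")
        resolved[name] = value
    return resolved


# Enforcement defaults to 0, so ejection is opt-in unless it is raised.
config = resolve_failure_percentage_config({"enforcing_failure_percentage": 100})
print(config["failure_percentage_threshold"])  # 85
```

Because both enforcement fields default to 0, setting only a threshold leaves this mode in a detect-only state under these assumptions.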
12 changes: 12 additions & 0 deletions api/envoy/data/cluster/v2alpha/outlier_detection_event.proto
@@ -39,6 +39,7 @@ message OutlierDetectionEvent {
option (validate.required) = true;
OutlierEjectSuccessRate eject_success_rate_event = 9;
OutlierEjectConsecutive eject_consecutive_event = 10;
OutlierEjectFailurePercentage eject_failure_percentage_event = 11;
}
}

@@ -75,6 +76,12 @@ enum OutlierEjectionType {
// is set to *true*.
// See :ref:`Cluster outlier detection <arch_overview_outlier_detection>` documentation for
SUCCESS_RATE_LOCAL_ORIGIN = 4;
// Runs over aggregated success rate statistics from every host in cluster and selects hosts for
// which ratio of failed replies is above configured value.
FAILURE_PERCENTAGE = 5;
// Runs over aggregated success rate statistics for local origin failures from every host in
// cluster and selects hosts for which ratio of failed replies is above configured value.
FAILURE_PERCENTAGE_LOCAL_ORIGIN = 6;
}

// Represents possible action applied to upstream host
@@ -97,3 +104,8 @@ message OutlierEjectSuccessRate {

message OutlierEjectConsecutive {
}

message OutlierEjectFailurePercentage {
// Host's success rate at the time of the ejection event on a 0-100 range.
uint32 host_success_rate = 1 [(validate.rules).uint32.lte = 100];
}
@@ -102,6 +102,31 @@ outlier_detection.success_rate_stdev_factor
<envoy_api_field_cluster.OutlierDetection.success_rate_stdev_factor>`
setting in outlier detection

outlier_detection.enforcing_failure_percentage
:ref:`enforcing_failure_percentage
<envoy_api_field_cluster.OutlierDetection.enforcing_failure_percentage>`
setting in outlier detection

outlier_detection.enforcing_failure_percentage_local_origin
:ref:`enforcing_failure_percentage_local_origin
<envoy_api_field_cluster.OutlierDetection.enforcing_failure_percentage_local_origin>`
setting in outlier detection

outlier_detection.failure_percentage_request_volume
:ref:`failure_percentage_request_volume
<envoy_api_field_cluster.OutlierDetection.failure_percentage_request_volume>`
setting in outlier detection

outlier_detection.failure_percentage_minimum_hosts
:ref:`failure_percentage_minimum_hosts
<envoy_api_field_cluster.OutlierDetection.failure_percentage_minimum_hosts>`
setting in outlier detection

outlier_detection.failure_percentage_threshold
:ref:`failure_percentage_threshold
<envoy_api_field_cluster.OutlierDetection.failure_percentage_threshold>`
setting in outlier detection

Core
----

@@ -141,6 +141,10 @@ statistics will be rooted at *cluster.<name>.outlier_detection.* and contain the
ejections_detected_consecutive_local_origin_failure, Counter, Number of detected consecutive local origin failure ejections (even if unenforced)
ejections_enforced_local_origin_success_rate, Counter, Number of enforced success rate outlier ejections for locally originated failures
ejections_detected_local_origin_success_rate, Counter, Number of detected success rate outlier ejections for locally originated failures (even if unenforced)
ejections_enforced_failure_percentage, Counter, Number of enforced failure percentage outlier ejections. Exact meaning of this counter depends on :ref:`outlier_detection.split_external_local_origin_errors<envoy_api_field_cluster.OutlierDetection.split_external_local_origin_errors>` config item. Refer to :ref:`Outlier Detection documentation<arch_overview_outlier_detection>` for details.
ejections_detected_failure_percentage, Counter, Number of detected failure percentage outlier ejections (even if unenforced). Exact meaning of this counter depends on :ref:`outlier_detection.split_external_local_origin_errors<envoy_api_field_cluster.OutlierDetection.split_external_local_origin_errors>` config item. Refer to :ref:`Outlier Detection documentation<arch_overview_outlier_detection>` for details.
ejections_enforced_failure_percentage_local_origin, Counter, Number of enforced failure percentage outlier ejections for locally originated failures
ejections_detected_failure_percentage_local_origin, Counter, Number of detected failure percentage outlier ejections for locally originated failures (even if unenforced)
ejections_total, Counter, Deprecated. Number of ejections due to any outlier type (even if unenforced)
ejections_consecutive_5xx, Counter, Deprecated. Number of consecutive 5xx ejections (even if unenforced)

27 changes: 27 additions & 0 deletions docs/root/intro/arch_overview/upstream/outlier.rst
@@ -145,6 +145,33 @@ Most configuration items, namely
types of errors, but :ref:`outlier_detection.enforcing_success_rate<envoy_api_field_cluster.OutlierDetection.enforcing_success_rate>` applies
to externally originated errors only and :ref:`outlier_detection.enforcing_local_origin_success_rate<envoy_api_field_cluster.OutlierDetection.enforcing_local_origin_success_rate>` applies to locally originated errors only.

.. _arch_overview_outlier_detection_failure_percentage:

Failure Percentage
^^^^^^^^^^^^^^^^^^

Failure percentage-based outlier ejection functions similarly to the success rate detection type, in
that it relies on success rate data from each host in a cluster. However, rather than compare those
values to the mean success rate of the cluster as a whole, they are compared to a flat
user-configured threshold. This threshold is configured via the
:ref:`outlier_detection.failure_percentage_threshold<envoy_api_field_cluster.OutlierDetection.failure_percentage_threshold>`
field.

The other configuration fields for failure percentage based ejection are similar to the fields for
success rate ejection. Failure percentage based ejection also obeys
:ref:`outlier_detection.split_external_local_origin_errors<envoy_api_field_cluster.OutlierDetection.split_external_local_origin_errors>`;
the enforcement percentages for externally- and locally-originated errors are controlled by
:ref:`outlier_detection.enforcing_failure_percentage<envoy_api_field_cluster.OutlierDetection.enforcing_failure_percentage>`
and
:ref:`outlier_detection.enforcing_failure_percentage_local_origin<envoy_api_field_cluster.OutlierDetection.enforcing_failure_percentage_local_origin>`,
respectively. As with success rate detection, detection will not be performed for a host if its
request volume over the aggregation interval is less than the
:ref:`outlier_detection.failure_percentage_request_volume<envoy_api_field_cluster.OutlierDetection.failure_percentage_request_volume>`
value. Detection also will not be performed for a cluster if the number of hosts with the minimum
required request volume in an interval is less than the
:ref:`outlier_detection.failure_percentage_minimum_hosts<envoy_api_field_cluster.OutlierDetection.failure_percentage_minimum_hosts>`
value.
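
The checks described in this section can be sketched in Python for one aggregation interval. This is a minimal illustration under stated assumptions, not Envoy's code: the function names, the `stats` mapping of host to `(successes, total_requests)`, and the `roll` parameter are all hypothetical. The sketch follows the wording above, counting only hosts that meet the request volume toward the minimum-hosts check, and it treats a failure percentage greater than or equal to the threshold as an outlier, as the proto comment for `failure_percentage_threshold` states.

```python
import random
from typing import Dict, List, Optional, Tuple


def failure_percentage_outliers(
    stats: Dict[str, Tuple[int, int]],
    threshold: int = 85,
    minimum_hosts: int = 5,
    request_volume: int = 50,
) -> List[str]:
    """Return hosts whose failure percentage meets or exceeds the threshold.

    stats maps a host name to (successes, total_requests) for one interval.
    """
    # Hosts below the request volume are skipped for this interval.
    eligible = {
        host: (successes, total)
        for host, (successes, total) in stats.items()
        if total > 0 and total >= request_volume
    }
    # Too few hosts with enough traffic: skip detection for the whole cluster.
    if len(eligible) < minimum_hosts:
        return []
    outliers = []
    for host, (successes, total) in eligible.items():
        failure_percentage = 100.0 * (total - successes) / total
        if failure_percentage >= threshold:
            outliers.append(host)
    return outliers


def is_enforced(enforcing_percentage: int, roll: Optional[float] = None) -> bool:
    """Decide whether a detected outlier is actually ejected.

    roll is a value in [0, 100); it is drawn at random when omitted.
    """
    if roll is None:
        roll = random.random() * 100.0
    return roll < enforcing_percentage


stats = {
    "a": (10, 100),   # 90% failures
    "b": (15, 100),   # 85% failures, exactly at the default threshold
    "c": (99, 100),
    "d": (100, 100),
    "e": (95, 100),
    "f": (0, 10),     # below the request volume, ignored
}
print(failure_percentage_outliers(stats))  # ['a', 'b']
```

Splitting detection from enforcement mirrors the separate detected and enforced counters: a host can be counted as detected even when the enforcement percentage keeps it from being ejected.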

.. _arch_overview_outlier_detection_grpc:

gRPC
1 change: 1 addition & 0 deletions docs/root/intro/version_history.rst
@@ -61,6 +61,7 @@ Version history
* tracing: added :ref:`max_path_tag_length <envoy_api_field_config.filter.network.http_connection_manager.v2.HttpConnectionManager.tracing>` to support customizing the length of the request path included in the extracted `http.url <https://github.com/opentracing/specification/blob/master/semantic_conventions.md#standard-span-tags-and-log-fields>` tag.
* upstream: added :ref:`an option <envoy_api_field_Cluster.CommonLbConfig.close_connections_on_host_set_change>` that allows draining HTTP, TCP connection pools on cluster membership change.
* upstream: added network filter chains to upstream connections, see :ref:`filters<envoy_api_field_Cluster.filters>`.
* upstream: added new :ref:`failure-percentage based outlier detection<arch_overview_outlier_detection_failure_percentage>` mode.
* upstream: use p2c to select hosts for least-requests load balancers if all host weights are the same, even in cases where weights are not equal to 1.
* upstream: added :ref:`fail_traffic_on_panic <envoy_api_field_Cluster.CommonLbConfig.ZoneAwareLbConfig.fail_traffic_on_panic>` to allow failing all requests to a cluster during panic state.
* zookeeper: parse responses and emit latency stats.