Adding circuit breakers on ingester server side for write path (grafana#8180)

* Adding circuit breakers on ingester server side for write path

Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>

* Fixing review findings

Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>

* Fixing review findings

Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>

* Implementing the gauge as NewGaugeFunc

* Fixing review findings

Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>

* Fixing lint issues

Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>

* Adding test for hitting deadline when ingester.Push is used

Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>

* Fixing review findings

Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>

* Fix additional review findings

Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>

* Get rid of finishPushRequest

Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>

* Add unit test for startPushRequest

Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>

* Activate circuit breaker on a successful completion of ingester.starting

Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>

* Rename the output error of ingester.PushWithCleanup()

Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>

* Fixing review findings

Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>

* Get rid of test-delay key from context

Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>

* Fixing review findings

Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>

---------

Signed-off-by: Yuri Nikolic <durica.nikolic@grafana.com>
duricanikolic authored and narqo committed Jun 6, 2024
1 parent 5329990 commit 905cc6c
Showing 12 changed files with 1,509 additions and 49 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -23,6 +23,7 @@
* [FEATURE] mimirtool: Add `runtime-config verify` sub-command, for verifying Mimir runtime config files. #8123
* [FEATURE] Query-frontend, querier: new experimental `/cardinality/active_native_histogram_metrics` API to get active native histogram metric names with statistics about active native histogram buckets. #7982 #7986 #8008
* [FEATURE] Alertmanager: Added `-alertmanager.max-silences-count` and `-alertmanager.max-silence-size-bytes` to set limits on per tenant silences. Disabled by default. #6898
* [FEATURE] Ingester: add experimental support for the server-side circuit breakers when writing to ingesters. This can be enabled using `-ingester.circuit-breaker.enabled` option. Further `-ingester.circuit-breaker.*` options for configuring circuit-breaker are available. Added metrics `cortex_ingester_circuit_breaker_results_total`, `cortex_ingester_circuit_breaker_transitions_total` and `cortex_ingester_circuit_breaker_current_state`. #8180
* [ENHANCEMENT] Reduced memory allocations in functions used to propagate contextual information between gRPC calls. #7529
* [ENHANCEMENT] Distributor: add experimental limit for exemplars per series per request, enabled with `-distributor.max-exemplars-per-series-per-request`, the number of discarded exemplars are tracked with `cortex_discarded_exemplars_total{reason="too_many_exemplars_per_series_per_request"}` #7989 #8010
* [ENHANCEMENT] Store-gateway: merge series from different blocks concurrently. #7456
87 changes: 87 additions & 0 deletions cmd/mimir/config-descriptor.json
@@ -3140,6 +3140,93 @@
"fieldFlag": "ingester.owned-series-update-interval",
"fieldType": "duration",
"fieldCategory": "experimental"
},
{
"kind": "block",
"name": "circuit_breaker",
"required": false,
"desc": "",
"blockEntries": [
{
"kind": "field",
"name": "enabled",
"required": false,
"desc": "Enable circuit breaking when making requests to ingesters",
"fieldValue": null,
"fieldDefaultValue": false,
"fieldFlag": "ingester.circuit-breaker.enabled",
"fieldType": "boolean",
"fieldCategory": "experimental"
},
{
"kind": "field",
"name": "failure_threshold_percentage",
"required": false,
"desc": "Max percentage of requests that can fail over period before the circuit breaker opens",
"fieldValue": null,
"fieldDefaultValue": 10,
"fieldFlag": "ingester.circuit-breaker.failure-threshold-percentage",
"fieldType": "int",
"fieldCategory": "experimental"
},
{
"kind": "field",
"name": "failure_execution_threshold",
"required": false,
"desc": "How many requests must have been executed in period for the circuit breaker to be eligible to open for the rate of failures",
"fieldValue": null,
"fieldDefaultValue": 100,
"fieldFlag": "ingester.circuit-breaker.failure-execution-threshold",
"fieldType": "int",
"fieldCategory": "experimental"
},
{
"kind": "field",
"name": "thresholding_period",
"required": false,
"desc": "Moving window of time that the percentage of failed requests is computed over",
"fieldValue": null,
"fieldDefaultValue": 60000000000,
"fieldFlag": "ingester.circuit-breaker.thresholding-period",
"fieldType": "duration",
"fieldCategory": "experimental"
},
{
"kind": "field",
"name": "cooldown_period",
"required": false,
"desc": "How long the circuit breaker will stay in the open state before allowing some requests",
"fieldValue": null,
"fieldDefaultValue": 10000000000,
"fieldFlag": "ingester.circuit-breaker.cooldown-period",
"fieldType": "duration",
"fieldCategory": "experimental"
},
{
"kind": "field",
"name": "initial_delay",
"required": false,
"desc": "How long the circuit breaker should wait between an activation request and becoming effectively active. During that time both failures and successes will not be counted.",
"fieldValue": null,
"fieldDefaultValue": 0,
"fieldFlag": "ingester.circuit-breaker.initial-delay",
"fieldType": "duration",
"fieldCategory": "experimental"
},
{
"kind": "field",
"name": "push_timeout",
"required": false,
"desc": "How long is execution of ingester's Push supposed to last before it is reported as timeout in a circuit breaker. This configuration is used for circuit breakers only, and timeout expirations are not reported as errors",
"fieldValue": null,
"fieldDefaultValue": 2000000000,
"fieldFlag": "ingester.circuit-breaker.push-timeout",
"fieldType": "duration",
"fieldCategory": "experimental"
}
],
"fieldValue": null,
"fieldDefaultValue": null
}
],
"fieldValue": null,
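The descriptions of `failure_threshold_percentage` and `failure_execution_threshold` above suggest a two-part check: enough requests must have run in the period, and a large enough share of them must have failed. A minimal sketch of that check follows. The type `breakerThresholds` and the method `shouldOpen` are hypothetical names for illustration, not taken from the Mimir code base, and the sketch assumes that reaching the percentage exactly is enough to open the breaker.

```go
package main

import "fmt"

// breakerThresholds holds two of the settings described above.
// Hypothetical type for illustration; not Mimir's actual configuration type.
type breakerThresholds struct {
	FailureThresholdPercentage uint // -ingester.circuit-breaker.failure-threshold-percentage
	FailureExecutionThreshold  uint // -ingester.circuit-breaker.failure-execution-threshold
}

// shouldOpen reports whether a closed breaker would open, given the failed and
// total executions observed within one thresholding period.
// Assumption: a failure rate equal to the threshold is enough to open.
func (t breakerThresholds) shouldOpen(failures, executions uint) bool {
	if executions == 0 || executions < t.FailureExecutionThreshold {
		return false // too few requests in the period to judge the failure rate
	}
	return failures*100 >= executions*t.FailureThresholdPercentage
}

func main() {
	// Documented defaults: 10 percent and 100 executions.
	th := breakerThresholds{FailureThresholdPercentage: 10, FailureExecutionThreshold: 100}
	fmt.Println(th.shouldOpen(5, 50))   // false: only 50 executions, below the minimum of 100
	fmt.Println(th.shouldOpen(9, 100))  // false: 9% of requests failed
	fmt.Println(th.shouldOpen(10, 100)) // true: 10% of requests failed
}
```

The execution threshold keeps a handful of failures on a quiet ingester from opening the breaker, which is why it is checked before the rate.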
14 changes: 14 additions & 0 deletions cmd/mimir/help-all.txt.tmpl
@@ -1307,6 +1307,20 @@ Usage of ./cmd/mimir/mimir:
After what time a series is considered to be inactive. (default 10m0s)
-ingester.active-series-metrics-update-period duration
How often to update active series metrics. (default 1m0s)
-ingester.circuit-breaker.cooldown-period duration
[experimental] How long the circuit breaker will stay in the open state before allowing some requests (default 10s)
-ingester.circuit-breaker.enabled
[experimental] Enable circuit breaking when making requests to ingesters
-ingester.circuit-breaker.failure-execution-threshold uint
[experimental] How many requests must have been executed in period for the circuit breaker to be eligible to open for the rate of failures (default 100)
-ingester.circuit-breaker.failure-threshold-percentage uint
[experimental] Max percentage of requests that can fail over period before the circuit breaker opens (default 10)
-ingester.circuit-breaker.initial-delay duration
[experimental] How long the circuit breaker should wait between an activation request and becoming effectively active. During that time both failures and successes will not be counted.
-ingester.circuit-breaker.push-timeout duration
How long is execution of ingester's Push supposed to last before it is reported as timeout in a circuit breaker. This configuration is used for circuit breakers only, and timeout expirations are not reported as errors (default 2s)
-ingester.circuit-breaker.thresholding-period duration
[experimental] Moving window of time that the percentage of failed requests is computed over (default 1m0s)
-ingester.client.backoff-max-period duration
Maximum delay when backing off. (default 10s)
-ingester.client.backoff-min-period duration
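As a rough illustration of how the flags listed above map to typed settings with their documented defaults, the following sketch registers them with Go's standard `flag` package. The `circuitBreakerConfig` struct, `registerFlags`, and `parseCircuitBreakerFlags` are hypothetical names chosen for this example; the actual registration code in Mimir may look different.

```go
package main

import (
	"flag"
	"fmt"
	"time"
)

// circuitBreakerConfig is a hypothetical container for the flags shown above.
type circuitBreakerConfig struct {
	Enabled                    bool
	FailureThresholdPercentage uint
	FailureExecutionThreshold  uint
	ThresholdingPeriod         time.Duration
	CooldownPeriod             time.Duration
	InitialDelay               time.Duration
	PushTimeout                time.Duration
}

// registerFlags binds each flag name to a field, using the defaults from the help text.
func (c *circuitBreakerConfig) registerFlags(fs *flag.FlagSet) {
	const p = "ingester.circuit-breaker."
	fs.BoolVar(&c.Enabled, p+"enabled", false, "Enable circuit breaking when making requests to ingesters")
	fs.UintVar(&c.FailureThresholdPercentage, p+"failure-threshold-percentage", 10, "Max percentage of failed requests")
	fs.UintVar(&c.FailureExecutionThreshold, p+"failure-execution-threshold", 100, "Min executions before the breaker may open")
	fs.DurationVar(&c.ThresholdingPeriod, p+"thresholding-period", time.Minute, "Moving window for the failure percentage")
	fs.DurationVar(&c.CooldownPeriod, p+"cooldown-period", 10*time.Second, "Time spent in the open state")
	fs.DurationVar(&c.InitialDelay, p+"initial-delay", 0, "Delay between activation request and becoming active")
	fs.DurationVar(&c.PushTimeout, p+"push-timeout", 2*time.Second, "Push duration reported as a timeout")
}

// parseCircuitBreakerFlags parses args into a config populated with defaults.
func parseCircuitBreakerFlags(args []string) (circuitBreakerConfig, error) {
	var cfg circuitBreakerConfig
	fs := flag.NewFlagSet("mimir-sketch", flag.ContinueOnError)
	cfg.registerFlags(fs)
	err := fs.Parse(args)
	return cfg, err
}

func main() {
	cfg, err := parseCircuitBreakerFlags([]string{
		"-ingester.circuit-breaker.enabled",
		"-ingester.circuit-breaker.cooldown-period=30s",
	})
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(cfg.Enabled, cfg.CooldownPeriod, cfg.PushTimeout)
}
```

Flags that are not passed keep their defaults, so enabling the breaker only requires `-ingester.circuit-breaker.enabled`.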
2 changes: 2 additions & 0 deletions cmd/mimir/help.txt.tmpl
@@ -389,6 +389,8 @@ Usage of ./cmd/mimir/mimir:
Print basic help.
-help-all
Print help, also including advanced and experimental parameters.
-ingester.circuit-breaker.push-timeout duration
How long is execution of ingester's Push supposed to last before it is reported as timeout in a circuit breaker. This configuration is used for circuit breakers only, and timeout expirations are not reported as errors (default 2s)
-ingester.max-global-metadata-per-metric int
The maximum number of metadata per metric, across the cluster. 0 to disable.
-ingester.max-global-metadata-per-user int
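The `push-timeout` description above says that a slow push is reported to the circuit breaker as a timeout but is not turned into an error. One way to picture this is to keep what the breaker records separate from what the caller receives. The sketch below does that under stated assumptions: `classifyPush`, `pushOutcome`, and both sentinel errors are hypothetical, and the choice to count per-instance limit errors as failures follows the feature description in `about-versioning.md` rather than the actual implementation.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// defaultPushTimeout matches the documented default of 2s.
const defaultPushTimeout = 2 * time.Second

// Hypothetical sentinel errors, for illustration only.
var (
	errInstanceLimit = errors.New("per-instance limit reached")
	errBadSample     = errors.New("bad sample")
)

// pushOutcome separates what the breaker records from what the caller sees.
type pushOutcome struct {
	countedAsFailure bool  // what the circuit breaker records
	err              error // what the caller receives; a timeout never changes it
}

// classifyPush decides how one finished push is recorded by the breaker.
func classifyPush(err error, elapsed, pushTimeout time.Duration) pushOutcome {
	timedOut := elapsed > pushTimeout
	limitHit := errors.Is(err, errInstanceLimit)
	return pushOutcome{countedAsFailure: timedOut || limitHit, err: err}
}

func main() {
	slow := classifyPush(nil, 3*time.Second, defaultPushTimeout)
	fmt.Println(slow.countedAsFailure, slow.err) // true <nil>

	limited := classifyPush(errInstanceLimit, time.Second, defaultPushTimeout)
	fmt.Println(limited.countedAsFailure) // true

	rejected := classifyPush(errBadSample, time.Second, defaultPushTimeout)
	fmt.Println(rejected.countedAsFailure) // false
}
```

In the first case the push succeeded from the caller's point of view, yet the breaker still counts it toward the failure rate.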
10 changes: 9 additions & 1 deletion docs/sources/mimir/configure/about-versioning.md
@@ -117,12 +117,20 @@ The following features are currently experimental:
- `-ingester.track-ingester-owned-series`
- `-ingester.use-ingester-owned-series-for-limits`
- `-ingester.owned-series-update-interval`
- Per-ingester circuit breaking based on requests timing out or hitting per-instance limits
- `-ingester.circuit-breaker.enabled`
- `-ingester.circuit-breaker.failure-threshold-percentage`
- `-ingester.circuit-breaker.failure-execution-threshold`
- `-ingester.circuit-breaker.thresholding-period`
- `-ingester.circuit-breaker.cooldown-period`
- `-ingester.circuit-breaker.initial-delay`
- `-ingester.circuit-breaker.push-timeout`
- Ingester client
- Per-ingester circuit breaking based on requests timing out or hitting per-instance limits
- `-ingester.client.circuit-breaker.enabled`
- `-ingester.client.circuit-breaker.failure-threshold`
- `-ingester.client.circuit-breaker.failure-execution-threshold`
- `-ingester.client.circuit-breaker.period`
- `-ingester.client.circuit-breaker.thresholding-period`
- `-ingester.client.circuit-breaker.cooldown-period`
- Querier
- Use of Redis cache backend (`-blocks-storage.bucket-store.metadata-cache.backend=redis`)
38 changes: 38 additions & 0 deletions docs/sources/mimir/configure/configuration-parameters/index.md
@@ -1217,6 +1217,44 @@ instance_limits:
# owned series as a result of detected change.
# CLI flag: -ingester.owned-series-update-interval
[owned_series_update_interval: <duration> | default = 15s]

circuit_breaker:
  # (experimental) Enable circuit breaking when making requests to ingesters
  # CLI flag: -ingester.circuit-breaker.enabled
  [enabled: <boolean> | default = false]

  # (experimental) Max percentage of requests that can fail over period before
  # the circuit breaker opens
  # CLI flag: -ingester.circuit-breaker.failure-threshold-percentage
  [failure_threshold_percentage: <int> | default = 10]

  # (experimental) How many requests must have been executed in period for the
  # circuit breaker to be eligible to open for the rate of failures
  # CLI flag: -ingester.circuit-breaker.failure-execution-threshold
  [failure_execution_threshold: <int> | default = 100]

  # (experimental) Moving window of time that the percentage of failed requests
  # is computed over
  # CLI flag: -ingester.circuit-breaker.thresholding-period
  [thresholding_period: <duration> | default = 1m]

  # (experimental) How long the circuit breaker will stay in the open state
  # before allowing some requests
  # CLI flag: -ingester.circuit-breaker.cooldown-period
  [cooldown_period: <duration> | default = 10s]

  # (experimental) How long the circuit breaker should wait between an
  # activation request and becoming effectively active. During that time both
  # failures and successes will not be counted.
  # CLI flag: -ingester.circuit-breaker.initial-delay
  [initial_delay: <duration> | default = 0s]

  # (experimental) How long is execution of ingester's Push supposed to last
  # before it is reported as timeout in a circuit breaker. This configuration is
  # used for circuit breakers only, and timeout expirations are not reported as
  # errors
  # CLI flag: -ingester.circuit-breaker.push-timeout
  [push_timeout: <duration> | default = 2s]
```
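The `cooldown_period` description above ("stay in the open state before allowing some requests") matches the usual circuit breaker pattern of closed, open, and half-open states. The sketch below walks through those transitions with an explicit clock so that it is deterministic. Everything in it is an assumption for illustration: the `breaker` type, its methods, the `at` helper, and in particular the half-open rule that one success closes the breaker and one failure reopens it are not taken from the Mimir source.

```go
package main

import (
	"fmt"
	"time"
)

// state is a hypothetical breaker state for this sketch.
type state int

const (
	closed state = iota
	open
	halfOpen
)

func (s state) String() string {
	return [...]string{"closed", "open", "half-open"}[s]
}

// defaultCooldown matches the documented default of cooldown_period.
const defaultCooldown = 10 * time.Second

// at returns a fixed instant n seconds after the Unix epoch (a fake clock).
func at(n int64) time.Time { return time.Unix(n, 0) }

// breaker is a hypothetical, single-goroutine circuit breaker.
type breaker struct {
	cooldown time.Duration
	state    state
	openedAt time.Time
}

// trip moves the breaker to the open state and starts the cooldown.
func (b *breaker) trip(now time.Time) {
	b.state = open
	b.openedAt = now
}

// allow reports whether a request may proceed at time now. Once the cooldown
// has elapsed, the breaker becomes half-open and lets requests through again.
func (b *breaker) allow(now time.Time) bool {
	if b.state == open {
		if now.Sub(b.openedAt) < b.cooldown {
			return false
		}
		b.state = halfOpen
	}
	return true
}

// record applies the result of a trial request made while half-open.
func (b *breaker) record(success bool, now time.Time) {
	if b.state != halfOpen {
		return
	}
	if success {
		b.state = closed
	} else {
		b.trip(now)
	}
}

func main() {
	b := &breaker{cooldown: defaultCooldown}
	b.trip(at(0))

	ok := b.allow(at(5))
	fmt.Println(ok, b.state) // false open

	ok = b.allow(at(10))
	fmt.Println(ok, b.state) // true half-open

	b.record(true, at(11))
	fmt.Println(b.state) // closed
}
```

Passing the time in as an argument, instead of calling `time.Now()` inside the breaker, is a design choice made here so the transitions can be checked without sleeping.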

### querier