Adding circuit breakers on ingester server side for write path #8180

Merged on Jun 4, 2024 (16 commits)
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -23,6 +23,7 @@
* [FEATURE] mimirtool: Add `runtime-config verify` sub-command, for verifying Mimir runtime config files. #8123
* [FEATURE] Query-frontend, querier: new experimental `/cardinality/active_native_histogram_metrics` API to get active native histogram metric names with statistics about active native histogram buckets. #7982 #7986 #8008
* [FEATURE] Alertmanager: Added `-alertmanager.max-silences-count` and `-alertmanager.max-silence-size-bytes` to set limits on per tenant silences. Disabled by default. #6898
* [FEATURE] Ingester: add experimental support for server-side circuit breakers when writing to ingesters. This can be enabled using the `-ingester.circuit-breaker.enabled` option. Further `-ingester.circuit-breaker.*` options for configuring the circuit breaker are available. Added metrics `cortex_ingester_circuit_breaker_results_total`, `cortex_ingester_circuit_breaker_transitions_total` and `cortex_ingester_circuit_breaker_current_state`. #8180
* [ENHANCEMENT] Reduced memory allocations in functions used to propagate contextual information between gRPC calls. #7529
* [ENHANCEMENT] Distributor: add experimental limit for exemplars per series per request, enabled with `-distributor.max-exemplars-per-series-per-request`, the number of discarded exemplars are tracked with `cortex_discarded_exemplars_total{reason="too_many_exemplars_per_series_per_request"}` #7989 #8010
* [ENHANCEMENT] Store-gateway: merge series from different blocks concurrently. #7456
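
The new metrics in the circuit-breaker entry above track a current state and transitions between states. As a rough illustration of what such a state machine can look like, here is a minimal sketch in Go. Every name in it (`breaker`, `tryAcquire`, `trip`, `recordTrial`, the state names) is hypothetical and is not taken from the Mimir code base; it follows the generic closed / open / half-open pattern and assumes a single trial request decides the outcome of the half-open state.

```go
package main

import (
	"fmt"
	"time"
)

// state is a hypothetical name for the breaker's position in the generic
// closed / open / half-open pattern.
type state string

const (
	closed   state = "closed"
	open     state = "open"
	halfOpen state = "half-open"
)

// cooldown mirrors the documented 10s default of the cooldown period.
const cooldown = 10 * time.Second

// epoch is an arbitrary fixed reference time used to keep the example deterministic.
var epoch = time.Unix(0, 0)

// breaker is an illustrative, single-goroutine circuit breaker.
type breaker struct {
	current     state
	openedAt    time.Time
	cooldown    time.Duration
	transitions map[state]int // number of transitions into each state
}

func newBreaker(cooldown time.Duration) *breaker {
	return &breaker{current: closed, cooldown: cooldown, transitions: map[state]int{}}
}

// transition moves the breaker to a new state and counts the transition.
func (b *breaker) transition(to state, now time.Time) {
	if b.current == to {
		return
	}
	b.current = to
	b.transitions[to]++
	if to == open {
		b.openedAt = now
	}
}

// tryAcquire reports whether a request may proceed at time now.
// An open breaker rejects requests until the cooldown has elapsed,
// then moves to half-open and lets a trial request through.
func (b *breaker) tryAcquire(now time.Time) bool {
	if b.current == open {
		if now.Sub(b.openedAt) < b.cooldown {
			return false
		}
		b.transition(halfOpen, now)
	}
	return true
}

// trip opens the breaker, for example after the failure threshold is reached.
func (b *breaker) trip(now time.Time) { b.transition(open, now) }

// recordTrial applies the result of a trial request made while half-open.
func (b *breaker) recordTrial(ok bool, now time.Time) {
	if b.current != halfOpen {
		return
	}
	if ok {
		b.transition(closed, now)
	} else {
		b.transition(open, now)
	}
}

func main() {
	b := newBreaker(cooldown)
	b.trip(epoch)
	fmt.Println("admitted during cooldown:", b.tryAcquire(epoch.Add(cooldown/2)))
	fmt.Println("admitted after cooldown:", b.tryAcquire(epoch.Add(cooldown)))
	b.recordTrial(true, epoch.Add(cooldown))
	fmt.Println("state:", b.current, "transitions:", b.transitions)
}
```

The clock is passed in explicitly so the behaviour can be checked without sleeping; a production breaker would also need locking, which this sketch omits.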
87 changes: 87 additions & 0 deletions cmd/mimir/config-descriptor.json
@@ -3140,6 +3140,93 @@
"fieldFlag": "ingester.owned-series-update-interval",
"fieldType": "duration",
"fieldCategory": "experimental"
},
{
"kind": "block",
"name": "circuit_breaker",
"required": false,
"desc": "",
"blockEntries": [
{
"kind": "field",
"name": "enabled",
"required": false,
"desc": "Enable circuit breaking when making requests to ingesters",
"fieldValue": null,
"fieldDefaultValue": false,
"fieldFlag": "ingester.circuit-breaker.enabled",
"fieldType": "boolean",
"fieldCategory": "experimental"
},
{
"kind": "field",
"name": "failure_threshold_percentage",
"required": false,
"desc": "Max percentage of requests that can fail over period before the circuit breaker opens",
"fieldValue": null,
"fieldDefaultValue": 10,
"fieldFlag": "ingester.circuit-breaker.failure-threshold-percentage",
"fieldType": "int",
"fieldCategory": "experimental"
},
{
"kind": "field",
"name": "failure_execution_threshold",
"required": false,
"desc": "How many requests must have been executed in period for the circuit breaker to be eligible to open for the rate of failures",
"fieldValue": null,
"fieldDefaultValue": 100,
"fieldFlag": "ingester.circuit-breaker.failure-execution-threshold",
"fieldType": "int",
"fieldCategory": "experimental"
},
{
"kind": "field",
"name": "thresholding_period",
"required": false,
"desc": "Moving window of time that the percentage of failed requests is computed over",
"fieldValue": null,
"fieldDefaultValue": 60000000000,
"fieldFlag": "ingester.circuit-breaker.thresholding-period",
"fieldType": "duration",
"fieldCategory": "experimental"
},
{
"kind": "field",
"name": "cooldown_period",
"required": false,
"desc": "How long the circuit breaker will stay in the open state before allowing some requests",
"fieldValue": null,
"fieldDefaultValue": 10000000000,
"fieldFlag": "ingester.circuit-breaker.cooldown-period",
"fieldType": "duration",
"fieldCategory": "experimental"
},
{
"kind": "field",
"name": "initial_delay",
"required": false,
"desc": "How long the circuit breaker should wait between an activation request and becoming effectively active. During that time both failures and successes will not be counted.",
"fieldValue": null,
"fieldDefaultValue": 0,
"fieldFlag": "ingester.circuit-breaker.initial-delay",
"fieldType": "duration",
"fieldCategory": "experimental"
},
{
"kind": "field",
"name": "push_timeout",
"required": false,
"desc": "How long is execution of ingester's Push supposed to last before it is reported as timeout in a circuit breaker. This configuration is used for circuit breakers only, and timeout expirations are not reported as errors",
"fieldValue": null,
"fieldDefaultValue": 2000000000,
"fieldFlag": "ingester.circuit-breaker.push-timeout",
"fieldType": "duration",
"fieldCategory": "experimental"
}
],
"fieldValue": null,
"fieldDefaultValue": null
}
],
"fieldValue": null,
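
To make the descriptor above easier to read, the following Go sketch registers the same seven flags with the standard library `flag` package. The struct name `CircuitBreakerConfig`, its field names, and the helper `parseConfig` are assumptions made for this illustration and are not copied from the PR; only the flag names and default values come from the descriptor. Note that the descriptor stores durations in nanoseconds, so `60000000000` is 1m, `10000000000` is 10s, and `2000000000` is 2s.

```go
package main

import (
	"flag"
	"fmt"
	"time"
)

// CircuitBreakerConfig is a hypothetical struct mirroring the
// circuit_breaker block of the descriptor.
type CircuitBreakerConfig struct {
	Enabled                    bool
	FailureThresholdPercentage uint
	FailureExecutionThreshold  uint
	ThresholdingPeriod         time.Duration
	CooldownPeriod             time.Duration
	InitialDelay               time.Duration
	PushTimeout                time.Duration
}

// RegisterFlags registers the flags using the names and defaults listed in the descriptor.
func (cfg *CircuitBreakerConfig) RegisterFlags(f *flag.FlagSet) {
	const prefix = "ingester.circuit-breaker."
	f.BoolVar(&cfg.Enabled, prefix+"enabled", false, "Enable circuit breaking.")
	f.UintVar(&cfg.FailureThresholdPercentage, prefix+"failure-threshold-percentage", 10, "Max percentage of failed requests.")
	f.UintVar(&cfg.FailureExecutionThreshold, prefix+"failure-execution-threshold", 100, "Min executions before the breaker may open.")
	f.DurationVar(&cfg.ThresholdingPeriod, prefix+"thresholding-period", time.Minute, "Moving window for the failure percentage.")
	f.DurationVar(&cfg.CooldownPeriod, prefix+"cooldown-period", 10*time.Second, "Time spent in the open state.")
	f.DurationVar(&cfg.InitialDelay, prefix+"initial-delay", 0, "Delay between activation and becoming active.")
	f.DurationVar(&cfg.PushTimeout, prefix+"push-timeout", 2*time.Second, "Push duration reported as a timeout.")
}

// parseConfig is a hypothetical helper that parses args into a config.
func parseConfig(args []string) (CircuitBreakerConfig, error) {
	var cfg CircuitBreakerConfig
	fs := flag.NewFlagSet("circuit-breaker-sketch", flag.ContinueOnError)
	cfg.RegisterFlags(fs)
	err := fs.Parse(args)
	return cfg, err
}

func main() {
	cfg, err := parseConfig([]string{
		"-ingester.circuit-breaker.enabled",
		"-ingester.circuit-breaker.cooldown-period=30s",
	})
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", cfg)
}
```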
14 changes: 14 additions & 0 deletions cmd/mimir/help-all.txt.tmpl
@@ -1307,6 +1307,20 @@ Usage of ./cmd/mimir/mimir:
After what time a series is considered to be inactive. (default 10m0s)
-ingester.active-series-metrics-update-period duration
How often to update active series metrics. (default 1m0s)
-ingester.circuit-breaker.cooldown-period duration
[experimental] How long the circuit breaker will stay in the open state before allowing some requests (default 10s)
-ingester.circuit-breaker.enabled
[experimental] Enable circuit breaking when making requests to ingesters
-ingester.circuit-breaker.failure-execution-threshold uint
[experimental] How many requests must have been executed in period for the circuit breaker to be eligible to open for the rate of failures (default 100)
-ingester.circuit-breaker.failure-threshold-percentage uint
[experimental] Max percentage of requests that can fail over period before the circuit breaker opens (default 10)
-ingester.circuit-breaker.initial-delay duration
[experimental] How long the circuit breaker should wait between an activation request and becoming effectively active. During that time both failures and successes will not be counted.
-ingester.circuit-breaker.push-timeout duration
[experimental] How long is execution of ingester's Push supposed to last before it is reported as timeout in a circuit breaker. This configuration is used for circuit breakers only, and timeout expirations are not reported as errors (default 2s)
-ingester.circuit-breaker.thresholding-period duration
[experimental] Moving window of time that the percentage of failed requests is computed over (default 1m0s)
-ingester.client.backoff-max-period duration
Maximum delay when backing off. (default 10s)
-ingester.client.backoff-min-period duration
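
The help text above describes two thresholds that work together: a minimum number of executions in the period, and a maximum failure percentage. A minimal Go sketch of that decision follows. The function name `shouldOpen` and its parameters are hypothetical, and treating a failure rate exactly equal to the threshold as enough to open the breaker is an assumption of this sketch, not something the help text states.

```go
package main

import "fmt"

// shouldOpen is a hypothetical helper that decides whether a breaker should
// open, given the failures and executions counted over the thresholding period.
//
// The breaker is only eligible to open once at least executionThreshold
// requests have been executed. It then opens when the failure rate reaches
// thresholdPercentage (the "reaches" boundary is an assumption).
func shouldOpen(failures, executions, thresholdPercentage, executionThreshold uint) bool {
	if executions < executionThreshold {
		return false // too few samples to judge the failure rate
	}
	// Compare failures/executions with thresholdPercentage/100 using
	// integer arithmetic to avoid floating-point rounding.
	return failures*100 >= thresholdPercentage*executions
}

func main() {
	// Documented defaults: 10 percent failures, at least 100 executions.
	fmt.Println(shouldOpen(50, 99, 10, 100))  // below the execution threshold
	fmt.Println(shouldOpen(9, 100, 10, 100))  // 9% failures
	fmt.Println(shouldOpen(10, 100, 10, 100)) // 10% failures
}
```

With the defaults, a burst of failures among the first 99 requests in a window cannot open the breaker on its own, which guards against reacting to a handful of requests on a mostly idle ingester.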
2 changes: 2 additions & 0 deletions cmd/mimir/help.txt.tmpl
@@ -389,6 +389,8 @@ Usage of ./cmd/mimir/mimir:
Print basic help.
-help-all
Print help, also including advanced and experimental parameters.
-ingester.circuit-breaker.push-timeout duration
How long is execution of ingester's Push supposed to last before it is reported as timeout in a circuit breaker. This configuration is used for circuit breakers only, and timeout expirations are not reported as errors (default 2s)
-ingester.max-global-metadata-per-metric int
The maximum number of metadata per metric, across the cluster. 0 to disable.
-ingester.max-global-metadata-per-user int
10 changes: 9 additions & 1 deletion docs/sources/mimir/configure/about-versioning.md
@@ -117,12 +117,20 @@ The following features are currently experimental:
- `-ingester.track-ingester-owned-series`
- `-ingester.use-ingester-owned-series-for-limits`
- `-ingester.owned-series-update-interval`
- Per-ingester circuit breaking based on requests timing out or hitting per-instance limits
- `-ingester.circuit-breaker.enabled`
- `-ingester.circuit-breaker.failure-threshold-percentage`
- `-ingester.circuit-breaker.failure-execution-threshold`
- `-ingester.circuit-breaker.thresholding-period`
- `-ingester.circuit-breaker.cooldown-period`
- `-ingester.circuit-breaker.initial-delay`
- `-ingester.circuit-breaker.push-timeout`
- Ingester client
- Per-ingester circuit breaking based on requests timing out or hitting per-instance limits
- `-ingester.client.circuit-breaker.enabled`
- `-ingester.client.circuit-breaker.failure-threshold`
- `-ingester.client.circuit-breaker.failure-execution-threshold`
- `-ingester.client.circuit-breaker.period`
- `-ingester.client.circuit-breaker.thresholding-period`
- `-ingester.client.circuit-breaker.cooldown-period`
- Querier
- Use of Redis cache backend (`-blocks-storage.bucket-store.metadata-cache.backend=redis`)
38 changes: 38 additions & 0 deletions docs/sources/mimir/configure/configuration-parameters/index.md
@@ -1217,6 +1217,44 @@ instance_limits:
# owned series as a result of detected change.
# CLI flag: -ingester.owned-series-update-interval
[owned_series_update_interval: <duration> | default = 15s]

circuit_breaker:
# (experimental) Enable circuit breaking when making requests to ingesters
# CLI flag: -ingester.circuit-breaker.enabled
[enabled: <boolean> | default = false]

# (experimental) Max percentage of requests that can fail over period before
# the circuit breaker opens
# CLI flag: -ingester.circuit-breaker.failure-threshold-percentage
[failure_threshold_percentage: <int> | default = 10]

# (experimental) How many requests must have been executed in period for the
# circuit breaker to be eligible to open for the rate of failures
# CLI flag: -ingester.circuit-breaker.failure-execution-threshold
[failure_execution_threshold: <int> | default = 100]

# (experimental) Moving window of time that the percentage of failed requests
# is computed over
# CLI flag: -ingester.circuit-breaker.thresholding-period
[thresholding_period: <duration> | default = 1m]

# (experimental) How long the circuit breaker will stay in the open state
# before allowing some requests
# CLI flag: -ingester.circuit-breaker.cooldown-period
[cooldown_period: <duration> | default = 10s]

# (experimental) How long the circuit breaker should wait between an
# activation request and becoming effectively active. During that time both
# failures and successes will not be counted.
# CLI flag: -ingester.circuit-breaker.initial-delay
[initial_delay: <duration> | default = 0s]

# (experimental) How long is execution of ingester's Push supposed to last
# before it is reported as timeout in a circuit breaker. This configuration is
# used for circuit breakers only, and timeout expirations are not reported as
# errors
# CLI flag: -ingester.circuit-breaker.push-timeout
[push_timeout: <duration> | default = 2s]
```
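
The `push_timeout` description says the timeout is used for circuit breakers only and that expirations are not reported as errors. One way to read this is that a slow but successful push counts against the breaker while the caller still sees success. The Go sketch below illustrates that reading. The names `observePush`, `result`, and `errInstanceLimit` are hypothetical, and counting every non-nil error as a breaker failure is a simplification of this sketch rather than a statement about which errors the ingester actually counts.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// defaultPushTimeout mirrors the documented 2s default of push_timeout.
const defaultPushTimeout = 2 * time.Second

// errInstanceLimit is a hypothetical error standing in for a per-instance limit.
var errInstanceLimit = errors.New("instance limit reached")

// result separates what the breaker records from what the caller receives.
type result struct {
	breakerFailure bool  // recorded by the circuit breaker
	callerErr      error // returned to the caller of Push
}

// observePush classifies one finished push.
// A push that exceeded pushTimeout is recorded as a breaker failure,
// but the timeout itself never becomes an error for the caller.
func observePush(elapsed, pushTimeout time.Duration, pushErr error) result {
	timedOut := elapsed > pushTimeout
	return result{
		breakerFailure: timedOut || pushErr != nil,
		callerErr:      pushErr,
	}
}

func main() {
	slow := observePush(3*time.Second, defaultPushTimeout, nil)
	fmt.Printf("slow push: breakerFailure=%v callerErr=%v\n", slow.breakerFailure, slow.callerErr)

	limited := observePush(100*time.Millisecond, defaultPushTimeout, errInstanceLimit)
	fmt.Printf("limited push: breakerFailure=%v callerErr=%v\n", limited.breakerFailure, limited.callerErr)
}
```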

### querier