feat(helm): Adding KEDA autoscaling support #7282

Merged: 62 commits, Feb 13, 2024
Changes from 1 commit
Commits (62)
9ff0e92
feat(helm): Adding KEDA autoscaling support
beatkind Feb 3, 2024
f7d7afd
fix: update changelog
beatkind Feb 3, 2024
ce04869
feat(helm): Porting the changes from #6971 into helm chart
beatkind Feb 3, 2024
9ad9f67
feat: Adding better changelog & values documentation outlining the ex…
beatkind Feb 13, 2024
c0f29e7
fix: Remove duplicate query field, add base url in CHANGELOG.md
beatkind Feb 13, 2024
38702d2
helm: align grpc server connection lifetime settings with jsonnet (#7…
narqo Feb 5, 2024
c59d35e
querymiddleware: Fix race condition in shardActiveSeriesMiddleware (#…
narqo Feb 5, 2024
21bbd22
version: add UserAgent() (#7264)
narqo Feb 5, 2024
916c340
helm: remove -server.grpc.keepalive.max-connection-idle from common c…
narqo Feb 5, 2024
05b2dce
Compactor: export estimated number of compaction jobs based on bucket…
pstibrany Feb 5, 2024
16a59ad
Add KubePersistentVolumeFillingUp runbook (#7297)
pracucci Feb 5, 2024
3a778ae
Internal: remove unnecessary parameter to NoCompactionMarkFilter (#7301)
pstibrany Feb 5, 2024
04b8224
Name query metrics for easier discovery (#7302)
56quarters Feb 5, 2024
1b3708e
fix(deps): update module github.com/aws/aws-sdk-go to v1.50.11 (#7288)
renovate[bot] Feb 6, 2024
6798bed
fix(deps): update module github.com/klauspost/compress to v1.17.6 (#7…
renovate[bot] Feb 6, 2024
213a453
chore(deps): update anchore/sbom-action action to v0.15.8 (#7286)
renovate[bot] Feb 6, 2024
105d82f
chore(deps): update grafana/agent docker tag to v0.39.2 (#7287)
renovate[bot] Feb 6, 2024
7f1c9fe
chore(deps): update grafana/grafana docker tag to v10.3.1 (#7292)
renovate[bot] Feb 6, 2024
28be80c
fix(deps): update module github.com/failsafe-go/failsafe-go to v0.4.4…
renovate[bot] Feb 6, 2024
6e3ff83
Chore: removed unused parameter from GenerateBlockFromSpec() (#7303)
pracucci Feb 6, 2024
14d241a
Update mimir-prometheus (#7293)
pracucci Feb 6, 2024
55c978e
Release mimir-distributed Helm chart 5.3.0-weekly.276 (#7294)
grafanabot Feb 6, 2024
0c6d6db
Open circuit breakers on timeouts and per-instance limit errors only …
duricanikolic Feb 7, 2024
f1c8e71
Get rid of iterators.chunkIterator and iterators.chunkMergeIterator (…
duricanikolic Feb 7, 2024
185c2fe
Compactor: Language fixes (#7315)
aknuds1 Feb 7, 2024
1627df3
Do not register compat metrics in mimirtool (#7314)
grobinson-grafana Feb 7, 2024
6f57c5c
Compactor: Un-export symbols that don't need to be exported (#7317)
aknuds1 Feb 7, 2024
28e09c5
Circuit breakers: add client.ErrCircuitBreakerOpen type (#7324)
duricanikolic Feb 8, 2024
bbcb640
Add mimirpb.CIRCUIT_BREAKER_OPEN error cause (#7330)
duricanikolic Feb 8, 2024
1d2d2a7
store-gateway: remove cortex_bucket_store_blocks_loaded_by_duration (…
dimitarvdimitrov Feb 8, 2024
c9c074b
ruler: don't retry on non-retriable error (#7216)
narqo Feb 8, 2024
3624447
Update Alertmanager to f69a508 (#7332)
grobinson-grafana Feb 8, 2024
eaae699
Helm: add ruler specific service account (#7132)
QuantumEnigmaa Feb 8, 2024
84a2add
frontend/transport: log non-2xx replies from downstream as non-succes…
narqo Feb 8, 2024
dffd834
querymiddleware: Pool snappy writer in shard activity series (#7308)
narqo Feb 8, 2024
c1e523d
Helm: make PSP configurable (#7190)
QuantumEnigmaa Feb 8, 2024
b22fed6
Helm - Templatable host for gateway ingress/route (#7218)
Itaykal Feb 8, 2024
33b6a8a
[Docs] Update migrate-from-single-zone-with-helm.md (#7327)
eamonryan Feb 8, 2024
d3797d6
Always sort labels in distributors (#7326)
Logiraptor Feb 8, 2024
0c8a166
Do not check for ingester ring state before creating TSDB, or compact…
pracucci Feb 9, 2024
7952c2e
Compactor: String format compaction plan as comma separated blocks (#…
aknuds1 Feb 9, 2024
262ae64
Add a lifetime manager for Vault authentication tokens (#7337)
fayzal-g Feb 9, 2024
2dba521
fix(deps): update github.com/grafana/dskit digest to f245b48 (#7283)
renovate[bot] Feb 9, 2024
c8e62c8
Packaging: remove reload from systemd file as mimir does not take int…
wilfriedroset Feb 9, 2024
1745d88
Docs: No longer mark OTLP endpoint as experimental (#7348)
aknuds1 Feb 10, 2024
aa3813c
Update golang.org/x/exp digest to 2c58cdc (#7352)
renovate[bot] Feb 12, 2024
f7c3cb7
Update module github.com/aws/aws-sdk-go to v1.50.15 (#7353)
renovate[bot] Feb 12, 2024
831f9e2
Update module github.com/minio/minio-go/v7 to v7.0.67 (#7354)
renovate[bot] Feb 12, 2024
f720020
Update dependency puppeteer to v21.11.0 (#7355)
renovate[bot] Feb 12, 2024
c5e9dfe
Update helm/kind-action action to v1.9.0 (#7357)
renovate[bot] Feb 12, 2024
271a805
Update module cloud.google.com/go/storage to v1.37.0 (#7358)
renovate[bot] Feb 12, 2024
22b163e
Jsonnet / Helm: improve distributors graceful shutdown (#7361)
pracucci Feb 12, 2024
66a893a
Release mimir-distributed Helm chart 5.3.0-weekly.277 (#7362)
grafanabot Feb 12, 2024
8ba0cad
Distributor: Make `-distributor.enable-otlp-metadata-storage` flag de…
aknuds1 Feb 12, 2024
f95dc9d
Mark -ingester.limit-inflight-requests-using-grpc-method-limiter and …
pracucci Feb 12, 2024
f9e9d6f
Do not consider out-of-order blocks when filtering compactable jobs (…
jhalterman Feb 12, 2024
f10561c
mimir: Inject span profiler into tracer (#7363)
narqo Feb 13, 2024
3a7e509
Add experimental partitions ring lifecycler support (#7349)
pracucci Feb 13, 2024
188d181
feat(helm): Adding KEDA autoscaling support
beatkind Feb 13, 2024
87add14
chore: rebase branch with main
beatkind Feb 13, 2024
cf2e13e
Merge branch 'grafana:main' into add-helm-keda
beatkind Feb 13, 2024
c96d4c6
chore: make build-helm-tests
beatkind Feb 13, 2024
operations/helm/charts/mimir-distributed/CHANGELOG.md (1 addition, 1 deletion)
@@ -28,7 +28,7 @@ Entries should include a reference to the Pull Request that introduced the chang

## main / unreleased

* [FEATURE] Added experimental feature for deploying keda autoscaling objects as part of the helm chart for the components: distributor, querier, query-frontend and ruler. Requires metamonitoring, for more details on metamonitoring see the Helm chart documentation. #7282
* [FEATURE] Added an experimental feature for deploying [KEDA](https://keda.sh) ScaledObjects as part of the Helm chart for the distributor, querier, query-frontend and ruler components. Autoscaling can be enabled via `distributor.kedaAutoscaling`, `ruler.kedaAutoscaling`, `query_frontend.kedaAutoscaling`, and `querier.kedaAutoscaling`. Requires metamonitoring; for more details on metamonitoring, see [Monitor the health of your system](https://grafana.com/docs/helm-charts/mimir-distributed/latest/run-production-environment-with-helm/monitor-system-health/). See https://github.com/grafana/mimir/issues/7367 for a migration procedure. #7282
* [CHANGE] Rollout-operator: remove default CPU limit. #7125
* [CHANGE] Ring: relaxed the hash ring heartbeat period and timeout for distributor, ingester, store-gateway and compactor: #6860
* `-distributor.ring.heartbeat-period` set to `1m`
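To make the changelog entry above concrete, here is a minimal values sketch for enabling the new autoscaling on the distributor (illustrative only: the key names are taken from the values.yaml changes later in this diff, and the tenant header is only needed when metamonitoring writes to a multi-tenant Mimir):

```yaml
# Minimal sketch - assumes KEDA and the chart's metamonitoring are already set up.
distributor:
  # Leave replicas null so the KEDA-managed HPA owns the replica count.
  replicas: null
  kedaAutoscaling:
    enabled: true
    minReplicaCount: 2
    maxReplicaCount: 10
    targetCPUUtilizationPercentage: 100
    targetMemoryUtilizationPercentage: 100
    customHeaders:
      X-Scope-OrgID: "tenant-1"  # optional; sent by KEDA when querying the metamonitoring endpoint
```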
Distributor ScaledObject template:
@@ -4,7 +4,7 @@ kind: ScaledObject
metadata:
name: {{ include "mimir.resourceName" (dict "ctx" . "component" "distributor") }}
labels:
{{- include "mimir.labels" (dict "ctx" . "component" "distributor" "memberlist" true) | nindent 4 }}
{{- include "mimir.labels" (dict "ctx" . "component" "distributor") | nindent 4 }}
annotations:
{{- toYaml .Values.distributor.annotations | nindent 4 }}
namespace: {{ .Release.Namespace | quote }}
@@ -24,8 +24,7 @@ spec:
kind: Deployment
triggers:
- metadata:
metricName: cortex_distributor_cpu_hpa_default
query: max_over_time(sum(rate(container_cpu_usage_seconds_total{container="distributor",namespace="{{ .Release.Namespace }}"}[5m]))[15m:]) * 1000
query: max_over_time(sum(sum by (pod) (rate(container_cpu_usage_seconds_total{container="distributor",namespace="{{ .Release.Namespace }}"}[5m])) and max by (pod) (up{container="distributor",namespace="{{ .Release.Namespace }}"}) > 0)[15m:]) * 1000
serverAddress: {{ include "mimir.metaMonitoring.metrics.remoteReadUrl" (dict "ctx" $) }}
{{- $cpu_request := dig "requests" "cpu" nil .Values.distributor.resources }}
threshold: {{ mulf (include "mimir.parseCPU" (dict "value" $cpu_request)) (divf .Values.distributor.kedaAutoscaling.targetCPUUtilizationPercentage 100) | floor | int64 | quote }}
@@ -34,8 +33,7 @@ spec:
{{- end }}
type: prometheus
- metadata:
metricName: cortex_distributor_memory_hpa_default
query: max_over_time(sum(container_memory_working_set_bytes{container="distributor",namespace="{{ .Release.Namespace }}"})[15m:])
query: max_over_time(sum((sum by (pod) (container_memory_working_set_bytes{container="distributor",namespace="{{ .Release.Namespace }}"}) and max by (pod) (up{container="distributor",namespace="{{ .Release.Namespace }}"}) > 0) or vector(0))[15m:]) + sum(sum by (pod) (max_over_time(kube_pod_container_resource_requests{container="distributor",namespace="{{ .Release.Namespace }}", resource="memory"}[15m])) and max by (pod) (changes(kube_pod_container_status_restarts_total{container="distributor",namespace="{{ .Release.Namespace }}"}[15m]) > 0) and max by (pod) (kube_pod_container_status_last_terminated_reason{container="distributor",namespace="{{ .Release.Namespace }}", reason="OOMKilled"}) or vector(0))
serverAddress: {{ include "mimir.metaMonitoring.metrics.remoteReadUrl" (dict "ctx" $) }}
{{- $mem_request := dig "requests" "memory" nil .Values.distributor.resources }}
threshold: {{ mulf (include "mimir.siToBytes" (dict "value" $mem_request)) (divf .Values.distributor.kedaAutoscaling.targetMemoryUtilizationPercentage 100) | floor | int64 | quote }}
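The reworked distributor queries above are dense, so the memory query is repeated here reformatted purely for readability (whitespace only, with `{{ .Release.Namespace }}` abbreviated to `<ns>`). The first term sums working-set memory only over pods that report an `up` series, and the second term adds the memory request of pods that restarted within the last 15 minutes with an `OOMKilled` termination reason, so OOM-killed pods keep contributing to the scaling metric:

```yaml
# Readability-only restatement of the distributor memory trigger query.
query: |-
  max_over_time(
    sum(
      (
        sum by (pod) (container_memory_working_set_bytes{container="distributor",namespace="<ns>"})
        and
        max by (pod) (up{container="distributor",namespace="<ns>"}) > 0
      ) or vector(0)
    )[15m:]
  )
  +
  sum(
    sum by (pod) (max_over_time(kube_pod_container_resource_requests{container="distributor",namespace="<ns>", resource="memory"}[15m]))
    and
    max by (pod) (changes(kube_pod_container_status_restarts_total{container="distributor",namespace="<ns>"}[15m]) > 0)
    and
    max by (pod) (kube_pod_container_status_last_terminated_reason{container="distributor",namespace="<ns>", reason="OOMKilled"})
    or vector(0)
  )
```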
Querier ScaledObject template:
@@ -7,7 +7,7 @@ kind: ScaledObject
metadata:
name: {{ include "mimir.resourceName" (dict "ctx" . "component" "querier") }}
labels:
{{- include "mimir.labels" (dict "ctx" . "component" "querier" "memberlist" true) | nindent 4 }}
{{- include "mimir.labels" (dict "ctx" . "component" "querier") | nindent 4 }}
annotations:
{{- toYaml .Values.querier.annotations | nindent 4 }}
namespace: {{ .Release.Namespace | quote }}
@@ -27,7 +27,6 @@ spec:
kind: Deployment
triggers:
- metadata:
metricName: cortex_querier_hpa_default
query: sum(max_over_time(cortex_query_scheduler_inflight_requests{container="query-scheduler",namespace="{{ .Release.Namespace }}",quantile="0.5"}[1m]))
serverAddress: {{ include "mimir.metaMonitoring.metrics.remoteReadUrl" (dict "ctx" $) }}
threshold: {{ .Values.querier.kedaAutoscaling.querySchedulerInflightRequestsThreshold | quote }}
@@ -37,7 +36,6 @@ spec:
name: cortex_querier_hpa_default
type: prometheus
- metadata:
metricName: cortex_querier_hpa_default_requests_duration
query: sum(rate(cortex_querier_request_duration_seconds_sum{container="querier",namespace="{{ .Release.Namespace }}"}[1m]))
serverAddress: {{ include "mimir.metaMonitoring.metrics.remoteReadUrl" (dict "ctx" $) }}
threshold: {{ .Values.querier.kedaAutoscaling.querySchedulerInflightRequestsThreshold | quote }}
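Unlike the resource-based triggers above, the querier scales on query-scheduler inflight requests. As a rough illustration of how the threshold behaves (assuming KEDA's default AverageValue semantics for the prometheus trigger, which is an assumption rather than something stated in this diff):

```yaml
querier:
  kedaAutoscaling:
    enabled: true
    minReplicaCount: 2
    maxReplicaCount: 10
    # Illustration: with a threshold of 6 (the value rendered in the chart tests
    # further down) and roughly 30 inflight requests reported by the
    # query-scheduler, the resulting HPA aims for ceil(30 / 6) = 5 querier
    # replicas, clamped to the min/max above.
    querySchedulerInflightRequestsThreshold: 6
```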
Query-frontend ScaledObject template:
@@ -24,14 +24,17 @@ spec:
kind: Deployment
triggers:
- metadata:
metricName: query_frontend_cpu_hpa_default
query: max_over_time(sum(sum by (pod) (rate(container_cpu_usage_seconds_total{container="query-frontend",namespace="{{ .Release.Namespace }}"}[5m])) and max by (pod) (up{container="query-frontend",namespace="{{ .Release.Namespace }}"}) > 0)[15m:]) * 1000
query: max_over_time(sum(rate(container_cpu_usage_seconds_total{container="query-frontend",namespace="{{ .Release.Namespace }}"}[5m]))[15m:]) * 1000
serverAddress: {{ include "mimir.metaMonitoring.metrics.remoteReadUrl" (dict "ctx" $) }}
{{- $cpu_request := dig "requests" "cpu" nil .Values.query_frontend.resources }}
threshold: {{ mulf (include "mimir.parseCPU" (dict "value" $cpu_request)) (divf .Values.query_frontend.kedaAutoscaling.targetCPUUtilizationPercentage 100) | floor | int64 | quote }}
{{- if .Values.query_frontend.kedaAutoscaling.customHeaders }}
customHeaders: {{ (include "mimir.lib.mapToCSVString" (dict "map" .Values.query_frontend.kedaAutoscaling.customHeaders)) | quote }}
{{- end }}
type: prometheus
- metadata:
metricName: query_frontend_memory_hpa_default
query: max_over_time(sum((sum by (pod) (container_memory_working_set_bytes{container="query-frontend",namespace="{{ .Release.Namespace }}"}) and max by (pod) (up{container="query-frontend",namespace="{{ .Release.Namespace }}"}) > 0) or vector(0))[15m:]) + sum(sum by (pod) (max_over_time(kube_pod_container_resource_requests{container="query-frontend",namespace="{{ .Release.Namespace }}", resource="memory"}[15m])) and max by (pod) (changes(kube_pod_container_status_restarts_total{container="query-frontend",namespace="{{ .Release.Namespace }}"}[15m]) > 0) and max by (pod) (kube_pod_container_status_last_terminated_reason{container="query-frontend",namespace="{{ .Release.Namespace }}", reason="OOMKilled"}) or vector(0))
query: max_over_time(sum(container_memory_working_set_bytes{container="query-frontend",namespace="{{ .Release.Namespace }}"})[15m:])
serverAddress: {{ include "mimir.metaMonitoring.metrics.remoteReadUrl" (dict "ctx" $) }}
{{- $mem_request := dig "requests" "memory" nil .Values.query_frontend.resources }}
Ruler ScaledObject template:
@@ -4,7 +4,7 @@ kind: ScaledObject
metadata:
name: {{ include "mimir.resourceName" (dict "ctx" . "component" "ruler") }}
labels:
{{- include "mimir.labels" (dict "ctx" . "component" "ruler" "memberlist" true) | nindent 4 }}
{{- include "mimir.labels" (dict "ctx" . "component" "ruler") | nindent 4 }}
annotations:
{{- toYaml .Values.ruler.annotations | nindent 4 }}
namespace: {{ .Release.Namespace | quote }}
@@ -24,7 +24,7 @@ spec:
kind: Deployment
triggers:
- metadata:
metricName: ruler_cpu_hpa_default
query: max_over_time(sum(sum by (pod) (rate(container_cpu_usage_seconds_total{container="ruler",namespace="{{ .Release.Namespace }}"}[5m])) and max by (pod) (up{container="ruler",namespace="{{ .Release.Namespace }}"}) > 0)[15m:]) * 1000
query: max_over_time(sum(rate(container_cpu_usage_seconds_total{container="ruler",namespace="{{ .Release.Namespace }}"}[5m]))[15m:]) * 1000
serverAddress: {{ include "mimir.metaMonitoring.metrics.remoteReadUrl" (dict "ctx" $) }}
{{- $cpu_request := dig "requests" "cpu" nil .Values.ruler.resources }}
@@ -34,7 +34,7 @@ spec:
{{- end }}
type: prometheus
- metadata:
metricName: ruler_memory_hpa_default
query: max_over_time(sum((sum by (pod) (container_memory_working_set_bytes{container="ruler",namespace="{{ .Release.Namespace }}"}) and max by (pod) (up{container="ruler",namespace="{{ .Release.Namespace }}"}) > 0) or vector(0))[15m:]) + sum(sum by (pod) (max_over_time(kube_pod_container_resource_requests{container="ruler",namespace="{{ .Release.Namespace }}", resource="memory"}[15m])) and max by (pod) (changes(kube_pod_container_status_restarts_total{container="ruler",namespace="{{ .Release.Namespace }}"}[15m]) > 0) and max by (pod) (kube_pod_container_status_last_terminated_reason{container="ruler",namespace="{{ .Release.Namespace }}", reason="OOMKilled"}) or vector(0))
query: max_over_time(sum(container_memory_working_set_bytes{container="ruler",namespace="{{ .Release.Namespace }}"})[15m:])
serverAddress: {{ include "mimir.metaMonitoring.metrics.remoteReadUrl" (dict "ctx" $) }}
{{- $mem_request := dig "requests" "memory" nil .Values.ruler.resources }}
operations/helm/charts/mimir-distributed/values.yaml (40 additions, 16 deletions)
@@ -739,20 +739,26 @@ distributor:
# Setting it to null will produce a deployment without replicas set, allowing you to use autoscaling with the deployment
replicas: 1

# -- [Experimental] Configure autoscaling via KEDA (https://keda.sh). This requires having
# KEDA already installed in the Kubernetes cluster. The metrics for scaling are read
# from the metamonitoring setup (metamonitoring.grafanaAgent.metrics.remote).
# Basic auth and extra HTTP headers from metamonitoring are ignored; use customHeaders instead.
# The remote URL is used even if metamonitoring is disabled.
# See https://github.com/grafana/mimir/issues/7367 for more details on how to migrate to autoscaled resources without disruptions.
kedaAutoscaling:
enabled: false
minReplicaCount: 1
maxReplicaCount: 10
pollingInterval: 10
targetCPUUtilizationPercentage: 80
targetMemoryUtilizationPercentage: 80
targetCPUUtilizationPercentage: 100
targetMemoryUtilizationPercentage: 100
customHeaders:
{}
# X-Scope-OrgID: ""
behavior:
scaleDown:
policies:
- periodSeconds: 60
- periodSeconds: 600
type: Percent
value: 10
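The comment block above points at the metamonitoring remote as the source of the scaling metrics. A rough sketch of the corresponding metamonitoring values follows; the exact sub-keys under `metamonitoring.grafanaAgent.metrics.remote` are an assumption based on that comment, so check the chart's metamonitoring section for the authoritative schema:

```yaml
# Sketch only: provides the remote from which the ScaledObjects' Prometheus
# serverAddress is derived; the URL is a placeholder.
metamonitoring:
  grafanaAgent:
    enabled: true
    metrics:
      remote:
        url: https://mimir.example.com/api/v1/push
```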

@@ -1132,13 +1138,19 @@ ruler:
enabled: true
replicas: 1

# -- [Experimental] Configure autoscaling via KEDA (https://keda.sh). This requires having
# KEDA already installed in the Kubernetes cluster. The metrics for scaling are read
# from the metamonitoring setup (metamonitoring.grafanaAgent.metrics.remote).
# Basic auth and extra HTTP headers from metamonitoring are ignored; use customHeaders instead.
# The remote URL is used even if metamonitoring is disabled.
# See https://github.com/grafana/mimir/issues/7367 for more details on how to migrate to autoscaled resources without disruptions.
kedaAutoscaling:
enabled: false
minReplicaCount: 1
maxReplicaCount: 10
pollingInterval: 10
targetCPUUtilizationPercentage: 80
targetMemoryUtilizationPercentage: 80
targetCPUUtilizationPercentage: 100
targetMemoryUtilizationPercentage: 100
customHeaders:
{}
# X-Scope-OrgID: ""
@@ -1226,6 +1238,12 @@ ruler:
querier:
replicas: 2

# -- [Experimental] Configure autoscaling via KEDA (https://keda.sh). This requires having
# KEDA already installed in the Kubernetes cluster. The metrics for scaling are read
# from the metamonitoring setup (metamonitoring.grafanaAgent.metrics.remote).
# Basic auth and extra HTTP headers from metamonitoring are ignored; use customHeaders instead.
# The remote URL is used even if metamonitoring is disabled.
# See https://github.com/grafana/mimir/issues/7367 for more details on how to migrate to autoscaled resources without disruptions.
kedaAutoscaling:
enabled: false
minReplicaCount: 1
@@ -1242,15 +1260,15 @@ querier:
type: Percent
value: 10
stabilizationWindowSeconds: 600
scaleUp:
policies:
- periodSeconds: 120
type: Percent
value: 50
- periodSeconds: 120
type: Pods
value: 15
stabilizationWindowSeconds: 60
scaleUp:
policies:
- periodSeconds: 120
type: Percent
value: 50
- periodSeconds: 120
type: Pods
value: 15
stabilizationWindowSeconds: 60

service:
annotations: {}
@@ -1331,13 +1349,19 @@ query_frontend:
# Setting it to null will produce a deployment without replicas set, allowing you to use autoscaling with the deployment
replicas: 1

# -- [Experimental] Configure autoscaling via KEDA (https://keda.sh). This requires having
# KEDA already installed in the Kubernetes cluster. The metrics for scaling are read
# from the metamonitoring setup (metamonitoring.grafanaAgent.metrics.remote).
# Basic auth and extra HTTP headers from metamonitoring are ignored; use customHeaders instead.
# The remote URL is used even if metamonitoring is disabled.
# See https://github.com/grafana/mimir/issues/7367 for more details on how to migrate to autoscaled resources without disruptions.
kedaAutoscaling:
enabled: false
minReplicaCount: 1
maxReplicaCount: 10
pollingInterval: 10
targetCPUUtilizationPercentage: 80
targetMemoryUtilizationPercentage: 80
targetCPUUtilizationPercentage: 75
targetMemoryUtilizationPercentage: 100
customHeaders:
{}
# X-Scope-OrgID: ""
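The values.yaml hunks above also retune several autoscaling defaults: the scale-down policy period (60 versus 600 seconds) and the CPU/memory utilization targets (80 versus 100 percent for the distributor and ruler, with the query-frontend CPU target at 75). The same keys can be overridden per component; a rough sketch for the distributor, with illustrative numbers only:

```yaml
# Illustrative override - the numbers are examples, not recommendations.
distributor:
  kedaAutoscaling:
    enabled: true
    targetCPUUtilizationPercentage: 80
    behavior:
      scaleDown:
        policies:
          - periodSeconds: 300
            type: Percent
            value: 25
```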
Rendered output from the keda-autoscaling-metamonitoring-values Helm test: distributor ScaledObject:
@@ -8,7 +8,6 @@ metadata:
app.kubernetes.io/name: mimir
app.kubernetes.io/instance: keda-autoscaling-metamonitoring-values
app.kubernetes.io/component: distributor
app.kubernetes.io/part-of: memberlist
app.kubernetes.io/managed-by: Helm
annotations:
{}
@@ -19,7 +18,7 @@ spec:
behavior:
scaleDown:
policies:
- periodSeconds: 60
- periodSeconds: 600
type: Percent
value: 10
maxReplicaCount: 10
@@ -31,15 +30,13 @@ spec:
kind: Deployment
triggers:
- metadata:
metricName: cortex_distributor_cpu_hpa_default
query: max_over_time(sum(rate(container_cpu_usage_seconds_total{container="distributor",namespace="citestns"}[5m]))[15m:]) * 1000
query: max_over_time(sum(sum by (pod) (rate(container_cpu_usage_seconds_total{container="distributor",namespace="citestns"}[5m])) and max by (pod) (up{container="distributor",namespace="citestns"}) > 0)[15m:]) * 1000
serverAddress: https://mimir.example.com/prometheus
threshold: "0"
customHeaders: "X-Scope-OrgID=tenant-1"
type: prometheus
- metadata:
metricName: cortex_distributor_memory_hpa_default
query: max_over_time(sum(container_memory_working_set_bytes{container="distributor",namespace="citestns"})[15m:])
query: max_over_time(sum((sum by (pod) (container_memory_working_set_bytes{container="distributor",namespace="citestns"}) and max by (pod) (up{container="distributor",namespace="citestns"}) > 0) or vector(0))[15m:]) + sum(sum by (pod) (max_over_time(kube_pod_container_resource_requests{container="distributor",namespace="citestns", resource="memory"}[15m])) and max by (pod) (changes(kube_pod_container_status_restarts_total{container="distributor",namespace="citestns"}[15m]) > 0) and max by (pod) (kube_pod_container_status_last_terminated_reason{container="distributor",namespace="citestns", reason="OOMKilled"}) or vector(0))
serverAddress: https://mimir.example.com/prometheus
threshold: "429496729"
customHeaders: "X-Scope-OrgID=tenant-1"
Rendered output from the keda-autoscaling-metamonitoring-values Helm test: querier ScaledObject:
@@ -8,7 +8,6 @@ metadata:
app.kubernetes.io/name: mimir
app.kubernetes.io/instance: keda-autoscaling-metamonitoring-values
app.kubernetes.io/component: querier
app.kubernetes.io/part-of: memberlist
app.kubernetes.io/managed-by: Helm
annotations:
{}
@@ -22,16 +21,16 @@ spec:
- periodSeconds: 120
type: Percent
value: 10
scaleUp:
policies:
- periodSeconds: 120
type: Percent
value: 50
- periodSeconds: 120
type: Pods
value: 15
stabilizationWindowSeconds: 60
stabilizationWindowSeconds: 600
scaleUp:
policies:
- periodSeconds: 120
type: Percent
value: 50
- periodSeconds: 120
type: Pods
value: 15
stabilizationWindowSeconds: 60
maxReplicaCount: 10
minReplicaCount: 2
pollingInterval: 10
@@ -41,15 +40,13 @@ spec:
kind: Deployment
triggers:
- metadata:
metricName: cortex_querier_hpa_default
query: sum(max_over_time(cortex_query_scheduler_inflight_requests{container="query-scheduler",namespace="citestns",quantile="0.5"}[1m]))
serverAddress: https://mimir.example.com/prometheus
threshold: "6"
customHeaders: "X-Scope-OrgID=tenant-1"
name: cortex_querier_hpa_default
type: prometheus
- metadata:
metricName: cortex_querier_hpa_default_requests_duration
query: sum(rate(cortex_querier_request_duration_seconds_sum{container="querier",namespace="citestns"}[1m]))
serverAddress: https://mimir.example.com/prometheus
threshold: "6"
Rendered output from the keda-autoscaling-metamonitoring-values Helm test: query-frontend ScaledObject:
@@ -30,13 +30,14 @@ spec:
kind: Deployment
triggers:
- metadata:
metricName: query_frontend_cpu_hpa_default
query: max_over_time(sum(sum by (pod) (rate(container_cpu_usage_seconds_total{container="query-frontend",namespace="citestns"}[5m])) and max by (pod) (up{container="query-frontend",namespace="citestns"}) > 0)[15m:]) * 1000
query: max_over_time(sum(rate(container_cpu_usage_seconds_total{container="query-frontend",namespace="citestns"}[5m]))[15m:]) * 1000
serverAddress: https://mimir.example.com/prometheus
threshold: "0"
customHeaders: "X-Scope-OrgID=tenant-1"
type: prometheus
- metadata:
metricName: query_frontend_memory_hpa_default
query: max_over_time(sum((sum by (pod) (container_memory_working_set_bytes{container="query-frontend",namespace="citestns"}) and max by (pod) (up{container="query-frontend",namespace="citestns"}) > 0) or vector(0))[15m:]) + sum(sum by (pod) (max_over_time(kube_pod_container_resource_requests{container="query-frontend",namespace="citestns", resource="memory"}[15m])) and max by (pod) (changes(kube_pod_container_status_restarts_total{container="query-frontend",namespace="citestns"}[15m]) > 0) and max by (pod) (kube_pod_container_status_last_terminated_reason{container="query-frontend",namespace="citestns", reason="OOMKilled"}) or vector(0))
query: max_over_time(sum(container_memory_working_set_bytes{container="query-frontend",namespace="citestns"})[15m:])
serverAddress: https://mimir.example.com/prometheus
threshold: "107374182"