Merge remote-tracking branch 'origin/main' into yuri/native-hist-remote-ruler-reads-dashboard
grafana · Jul 23, 2024 · 632cda5
2 parents 588ceff + 4410382
commit 632cda5
Showing 90 changed files with 761 additions and 1,344 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -14,6 +14,10 @@
 * [CHANGE] Query-frontend: Remove deprecated `frontend.align_queries_with_step` YAML configuration. The configuration option has been moved to per-tenant and default `limits` since Mimir 2.12. #8733 #8735
 * [CHANGE] Store-gateway: Change default of `-blocks-storage.bucket-store.max-concurrent` to 200. #8768
 * [CHANGE] Added new metric `cortex_compactor_disk_out_of_space_errors_total` which counts how many times a compaction failed due to the compactor being out of disk, alert if there is a single increase. #8237 #8278
+* [CHANGE] Store-gateway: Remove experimental parameter `-blocks-storage.bucket-store.series-selection-strategy`. The default strategy is now `worst-case`. #8702
+* [CHANGE] Store-gateway: Rename `-blocks-storage.bucket-store.series-selection-strategies.worst-case-series-preference` to `-blocks-storage.bucket-store.series-fetch-preference` and promote to stable. #8702
+* [CHANGE] Querier, store-gateway: remove deprecated `-querier.prefer-streaming-chunks-from-store-gateways=true`. Streaming from store-gateways is now always enabled. #8696
+* [CHANGE] Ingester: remove deprecated `-ingester.return-only-grpc-errors`. #8699
 * [FEATURE] Querier: add experimental streaming PromQL engine, enabled with `-querier.query-engine=mimir`. #8422 #8430 #8454 #8455 #8360 #8490 #8508 #8577 #8671
 * [FEATURE] Experimental Kafka-based ingest storage. #6888 #6894 #6929 #6940 #6951 #6974 #6982 #7029 #7030 #7091 #7142 #7147 #7148 #7153 #7160 #7193 #7349 #7376 #7388 #7391 #7393 #7394 #7402 #7404 #7423 #7424 #7437 #7486 #7503 #7508 #7540 #7621 #7682 #7685 #7694 #7695 #7696 #7697 #7701 #7733 #7734 #7741 #7752 #7838 #7851 #7871 #7877 #7880 #7882 #7887 #7891 #7925 #7955 #7967 #8031 #8063 #8077 #8088 #8135 #8176 #8184 #8194 #8216 #8217 #8222 #8233 #8503 #8542 #8579 #8657 #8686 #8688 #8703 #8706 #8708 #8738 #8750
   * What it is:
@@ -67,14 +71,19 @@
   * Overview dashboard: status, read/write latency and queries/ingestion per sec panels, `cortex_request_duration_seconds` metric. #7674 #8502 #8791
   * Writes dashboard: `cortex_request_duration_seconds` metric. #8757 #8791
   * Reads dashboard: `cortex_request_duration_seconds` metric. #8752
-  * Rollout progress dashboard. #8779
+  * Rollout progress dashboard: `cortex_request_duration_seconds` metric. #8779
+  * Alertmanager dashboard: `cortex_request_duration_seconds` metric. #8792
+  * Ruler dashboard: `cortex_request_duration_seconds` metric. #8795
+  * Queries dashboard: `cortex_request_duration_seconds` metric. #8800
   * Remote ruler reads dashboard: `cortex_request_duration_seconds` metric.
 * [ENHANCEMENT] Alerts: `MimirRunningIngesterReceiveDelayTooHigh` alert has been tuned to be more reactive to high receive delay. #8538
 * [ENHANCEMENT] Dashboards: improve end-to-end latency and strong read consistency panels when experimental ingest storage is enabled. #8543
 * [ENHANCEMENT] Dashboards: Add panels for monitoring ingester autoscaling when not using ingest-storage. These panels are disabled by default, but can be enabled using the `autoscaling.ingester.enabled: true` config option. #8484
 * [ENHANCEMENT] Dashboards: add panels to show writes to experimental ingest storage backend in the "Mimir / Ruler" dashboard, when `_config.show_ingest_storage_panels` is enabled. #8732
 * [ENHANCEMENT] Dashboards: show all series in tooltips on time series dashboard panels. #8748
 * [ENHANCEMENT] Dashboards: add compactor autoscaling panels to "Mimir / Compactor" dashboard. The panels are disabled by default, but can be enabled setting `_config.autoscaling.compactor.enabled` to `true`. #8777
+* [ENHANCEMENT] Alerts: added `MimirKafkaClientBufferedProduceBytesTooHigh` alert. #8763
+* [ENHANCEMENT] Dashboards: added "Kafka produced records / sec" panel to "Mimir / Writes" dashboard. #8763
 * [BUGFIX] Dashboards: fix "current replicas" in autoscaling panels when HPA is not active. #8566
 * [BUGFIX] Alerts: do not fire `MimirRingMembersMismatch` during the migration to experimental ingest storage. #8727
 
@@ -94,6 +103,12 @@
 ### Mimirtool
 
 * [CHANGE] Analyze Rules: Count recording rules used in rules group as used. #6133
+* [CHANGE] Remove deprecated `--rule-files` flag in favor of CLI arguments for the following commands: #8701
+  * `mimirtool rules load`
+  * `mimirtool rules sync`
+  * `mimirtool rules diff`
+  * `mimirtool rules check`
+  * `mimirtool rules prepare`
 
 ### Mimir Continuous Test
 

diff --git a/cmd/mimir/config-descriptor.json b/cmd/mimir/config-descriptor.json
@@ -1865,17 +1865,6 @@
           "fieldType": "boolean",
           "fieldCategory": "advanced"
         },
-        {
-          "kind": "field",
-          "name": "prefer_streaming_chunks_from_store_gateways",
-          "required": false,
-          "desc": "Request store-gateways stream chunks. Store-gateways will only respond with a stream of chunks if the target store-gateway supports this, and this preference will be ignored by store-gateways that do not support this.",
-          "fieldValue": null,
-          "fieldDefaultValue": true,
-          "fieldFlag": "querier.prefer-streaming-chunks-from-store-gateways",
-          "fieldType": "boolean",
-          "fieldCategory": "experimental"
-        },
         {
           "kind": "field",
           "name": "streaming_chunks_per_ingester_series_buffer_size",
@@ -3469,17 +3458,6 @@
           "fieldType": "int",
           "fieldCategory": "advanced"
         },
-        {
-          "kind": "field",
-          "name": "return_only_grpc_errors",
-          "required": false,
-          "desc": "When enabled only gRPC errors will be returned by the ingester.",
-          "fieldValue": null,
-          "fieldDefaultValue": true,
-          "fieldFlag": "ingester.return-only-grpc-errors",
-          "fieldType": "boolean",
-          "fieldCategory": "deprecated"
-        },
         {
           "kind": "field",
           "name": "use_ingester_owned_series_for_limits",
@@ -9492,35 +9470,14 @@
             },
             {
               "kind": "field",
-              "name": "series_selection_strategy",
-              "required": false,
-              "desc": "This option controls the strategy to selection of series and deferring application of matchers. A more aggressive strategy will fetch less posting lists at the cost of more series. This is useful when querying large blocks in which many series share the same label name and value. Supported values (most aggressive to least aggressive): speculative, worst-case, worst-case-small-posting-lists, all.",
-              "fieldValue": null,
-              "fieldDefaultValue": "worst-case",
-              "fieldFlag": "blocks-storage.bucket-store.series-selection-strategy",
-              "fieldType": "string",
-              "fieldCategory": "experimental"
-            },
-            {
-              "kind": "block",
-              "name": "series_selection_strategies",
+              "name": "series_fetch_preference",
               "required": false,
-              "desc": "",
-              "blockEntries": [
-                {
-                  "kind": "field",
-                  "name": "worst_case_series_preference",
-                  "required": false,
-                  "desc": "This option is only used when blocks-storage.bucket-store.series-selection-strategy=worst-case. Increasing the series preference results in fetching more series than postings. Must be a positive floating point number.",
-                  "fieldValue": null,
-                  "fieldDefaultValue": 0.75,
-                  "fieldFlag": "blocks-storage.bucket-store.series-selection-strategies.worst-case-series-preference",
-                  "fieldType": "float",
-                  "fieldCategory": "experimental"
-                }
-              ],
+              "desc": "This parameter controls the trade-off in fetching series versus fetching postings to fulfill a series request. Increasing the series preference results in fetching more series and reducing the volume of postings fetched. Reducing the series preference results in the opposite. Increase this parameter to reduce the rate of fetched series bytes (see \"Mimir / Queries\" dashboard) or API calls to the object store. Must be a positive floating point number.",
               "fieldValue": null,
-              "fieldDefaultValue": null
+              "fieldDefaultValue": 0.75,
+              "fieldFlag": "blocks-storage.bucket-store.series-fetch-preference",
+              "fieldType": "float",
+              "fieldCategory": "advanced"
             }
           ],
           "fieldValue": null,

diff --git a/cmd/mimir/help-all.txt.tmpl b/cmd/mimir/help-all.txt.tmpl
@@ -671,12 +671,10 @@ Usage of ./cmd/mimir/mimir:
     	Max size - in bytes - of a gap for which the partitioner aggregates together two bucket GET object requests. (default 524288)
   -blocks-storage.bucket-store.posting-offsets-in-mem-sampling int
     	Controls what is the ratio of postings offsets that the store will hold in memory. (default 32)
+  -blocks-storage.bucket-store.series-fetch-preference float
+    	This parameter controls the trade-off in fetching series versus fetching postings to fulfill a series request. Increasing the series preference results in fetching more series and reducing the volume of postings fetched. Reducing the series preference results in the opposite. Increase this parameter to reduce the rate of fetched series bytes (see "Mimir / Queries" dashboard) or API calls to the object store. Must be a positive floating point number. (default 0.75)
   -blocks-storage.bucket-store.series-hash-cache-max-size-bytes uint
     	Max size - in bytes - of the in-memory series hash cache. The cache is shared across all tenants and it's used only when query sharding is enabled. (default 1073741824)
-  -blocks-storage.bucket-store.series-selection-strategies.worst-case-series-preference float
-    	[experimental] This option is only used when blocks-storage.bucket-store.series-selection-strategy=worst-case. Increasing the series preference results in fetching more series than postings. Must be a positive floating point number. (default 0.75)
-  -blocks-storage.bucket-store.series-selection-strategy string
-    	[experimental] This option controls the strategy to selection of series and deferring application of matchers. A more aggressive strategy will fetch less posting lists at the cost of more series. This is useful when querying large blocks in which many series share the same label name and value. Supported values (most aggressive to least aggressive): speculative, worst-case, worst-case-small-posting-lists, all. (default "worst-case")
   -blocks-storage.bucket-store.sync-dir string
     	Directory to store synchronized TSDB index headers. This directory is not required to be persisted between restarts, but it's highly recommended in order to improve the store-gateway startup time. (default "./tsdb-sync/")
   -blocks-storage.bucket-store.sync-interval duration
@@ -1553,8 +1551,6 @@ Usage of ./cmd/mimir/mimir:
     	[experimental] CPU utilization limit, as CPU cores, for CPU/memory utilization based read request limiting. Use 0 to disable it.
   -ingester.read-path-memory-utilization-limit uint
     	[experimental] Memory limit, in bytes, for CPU/memory utilization based read request limiting. Use 0 to disable it.
-  -ingester.return-only-grpc-errors
-    	[deprecated] When enabled only gRPC errors will be returned by the ingester. (default true)
   -ingester.ring.consul.acl-token string
     	ACL Token used to interact with Consul.
   -ingester.ring.consul.cas-retry-delay duration
@@ -1913,8 +1909,6 @@ Usage of ./cmd/mimir/mimir:
     	If true, when querying ingesters, only the minimum required ingesters required to reach quorum will be queried initially, with other ingesters queried only if needed due to failures from the initial set of ingesters. Enabling this option reduces resource consumption for the happy path at the cost of increased latency for the unhappy path. (default true)
   -querier.minimize-ingester-requests-hedging-delay duration
     	Delay before initiating requests to further ingesters when request minimization is enabled and the initially selected set of ingesters have not all responded. Ignored if -querier.minimize-ingester-requests is not enabled. (default 3s)
-  -querier.prefer-streaming-chunks-from-store-gateways
-    	[experimental] Request store-gateways stream chunks. Store-gateways will only respond with a stream of chunks if the target store-gateway supports this, and this preference will be ignored by store-gateways that do not support this. (default true)
   -querier.promql-experimental-functions-enabled
     	[experimental] Enable experimental PromQL functions. This config option should be set on query-frontend too when query sharding is enabled.
   -querier.query-engine string

diff --git a/docs/sources/mimir/configure/about-versioning.md b/docs/sources/mimir/configure/about-versioning.md
@@ -146,7 +146,6 @@ The following features are currently experimental:
     - `-ingester.client.circuit-breaker.cooldown-period`
 - Querier
   - Use of Redis cache backend (`-blocks-storage.bucket-store.metadata-cache.backend=redis`)
-  - Streaming chunks from store-gateway to querier (`-querier.prefer-streaming-chunks-from-store-gateways`)
   - Limiting queries based on the estimated number of chunks that will be used (`-querier.max-estimated-fetched-chunks-per-query-multiplier`)
   - Max concurrency for tenant federated queries (`-tenant-federation.max-concurrent`)
   - Maximum response size for active series queries (`-querier.active-series-results-max-size-bytes`)
@@ -167,7 +166,6 @@ The following features are currently experimental:
   - `-query-scheduler.querier-forget-delay`
 - Store-gateway
   - Use of Redis cache backend (`-blocks-storage.bucket-store.chunks-cache.backend=redis`, `-blocks-storage.bucket-store.index-cache.backend=redis`, `-blocks-storage.bucket-store.metadata-cache.backend=redis`)
-  - `-blocks-storage.bucket-store.series-selection-strategy`
   - Eagerly loading some blocks on startup even when lazy loading is enabled `-blocks-storage.bucket-store.index-header.eager-loading-startup-enabled`
 - Read-write deployment mode
 - API endpoints:
@@ -212,14 +210,8 @@ For details about what _deprecated_ means, see [Parameter lifecycle]({{< relref
 
 The following features or configuration parameters are currently deprecated and will be **removed in Mimir 2.14**:
 
-- Ingester
-  - `-ingester.return-only-grpc-errors`
 - Ingester client
   - `-ingester.client.report-grpc-codes-in-instrumentation-label-enabled`
-- Mimirtool
-  - the flag `--rule-files`
-- Querier
-  - the flag `-querier.prefer-streaming-chunks-from-store-gateways`
 
 The following features or configuration parameters are currently deprecated and will be **removed in a future release (to be announced)**:
 

diff --git a/docs/sources/mimir/configure/configuration-parameters/index.md b/docs/sources/mimir/configure/configuration-parameters/index.md
@@ -1265,10 +1265,6 @@ instance_limits:
 # CLI flag: -ingester.error-sample-rate
 [error_sample_rate: <int> | default = 10]
 
-# (deprecated) When enabled only gRPC errors will be returned by the ingester.
-# CLI flag: -ingester.return-only-grpc-errors
-[return_only_grpc_errors: <boolean> | default = true]
-
 # (experimental) When enabled, only series currently owned by ingester according
 # to the ring are used when checking user per-tenant series limit.
 # CLI flag: -ingester.use-ingester-owned-series-for-limits
@@ -1452,12 +1448,6 @@ store_gateway_client:
 # CLI flag: -querier.shuffle-sharding-ingesters-enabled
 [shuffle_sharding_ingesters_enabled: <boolean> | default = true]
 
-# (experimental) Request store-gateways stream chunks. Store-gateways will only
-# respond with a stream of chunks if the target store-gateway supports this, and
-# this preference will be ignored by store-gateways that do not support this.
-# CLI flag: -querier.prefer-streaming-chunks-from-store-gateways
-[prefer_streaming_chunks_from_store_gateways: <boolean> | default = true]
-
 # (advanced) Number of series to buffer per ingester when streaming chunks from
 # ingesters.
 # CLI flag: -querier.streaming-chunks-per-ingester-buffer-size
@@ -4151,22 +4141,15 @@ bucket_store:
   # CLI flag: -blocks-storage.bucket-store.batch-series-size
   [streaming_series_batch_size: <int> | default = 5000]
 
-  # (experimental) This option controls the strategy to selection of series and
-  # deferring application of matchers. A more aggressive strategy will fetch
-  # less posting lists at the cost of more series. This is useful when querying
-  # large blocks in which many series share the same label name and value.
-  # Supported values (most aggressive to least aggressive): speculative,
-  # worst-case, worst-case-small-posting-lists, all.
-  # CLI flag: -blocks-storage.bucket-store.series-selection-strategy
-  [series_selection_strategy: <string> | default = "worst-case"]
-
-  series_selection_strategies:
-    # (experimental) This option is only used when
-    # blocks-storage.bucket-store.series-selection-strategy=worst-case.
-    # Increasing the series preference results in fetching more series than
-    # postings. Must be a positive floating point number.
-    # CLI flag: -blocks-storage.bucket-store.series-selection-strategies.worst-case-series-preference
-    [worst_case_series_preference: <float> | default = 0.75]
+  # (advanced) This parameter controls the trade-off in fetching series versus
+  # fetching postings to fulfill a series request. Increasing the series
+  # preference results in fetching more series and reducing the volume of
+  # postings fetched. Reducing the series preference results in the opposite.
+  # Increase this parameter to reduce the rate of fetched series bytes (see
+  # "Mimir / Queries" dashboard) or API calls to the object store. Must be a
+  # positive floating point number.
+  # CLI flag: -blocks-storage.bucket-store.series-fetch-preference
+  [series_fetch_preference: <float> | default = 0.75]
 
 tsdb:
   # Directory to store TSDBs (including WAL) in the ingesters. This directory is

diff --git a/docs/sources/mimir/manage/mimir-runbooks/_index.md b/docs/sources/mimir/manage/mimir-runbooks/_index.md
@@ -1485,6 +1485,24 @@ How to **investigate**:
 - Check if ingesters are processing too many records, and they need to be scaled up (vertically or horizontally).
 - Check actual error in logs to see whether the `-ingest-storage.kafka.wait-strong-read-consistency-timeout` or the request timeout has been hit first.
 
+### MimirKafkaClientBufferedProduceBytesTooHigh
+
+This alert fires when the Kafka client buffer, used to write incoming write requests to Kafka, is getting full.
+
+How it **works**:
+
+- Distributor and ruler encapsulate write requests into Kafka records and send them to Kafka.
+- The Kafka client has a limit on the total byte size of buffered records either sent to Kafka or sent to Kafka but not acknowledged yet.
+- When the limit is reached, the Kafka client stops producing more records and fast fails.
+- The limit is configured via `-ingest-storage.kafka.producer-max-buffered-bytes`.
+- The default limit is configured intentionally high, so that when the buffer utilization gets close to the limit, this indicates that there's probably an issue.
+
+How to **investigate**:
+
+- Query `cortex_ingest_storage_writer_buffered_produce_bytes{quantile="1.0"}` metrics to see the actual buffer utilization peaks.
+  - If the high buffer utilization is isolated to a small set of pods, then there might be an issue in the client pods.
+  - If the high buffer utilization is spread across all or most pods, then there might be an issue in Kafka.
+
 ### Ingester is overloaded when consuming from Kafka
 
 This runbook covers the case an ingester is overloaded when ingesting metrics data (consuming) from Kafka.