Thanos Querier/store gateway is failing to return just 2 months of data #5450

Closed
anoop2503 opened this issue Jun 27, 2022 · 6 comments · Fixed by #5480
@anoop2503

Thanos, version used: v0.23.1

Object Storage Provider: S3

What happened:
Deployed only thanos-querier and thanos-store-gateway, using an S3 bucket, to display historical data. However, it fails to return even 2 months of data.

What you expected to happen: Thanos should return data without failure

How to reproduce it (as minimally and precisely as possible):

Deploy thanos-store-gateway pointing to an S3 bucket. Deploy thanos-querier to use only thanos-store-gateway. Try to pull 2 months of historical data.

Full logs to relevant components:

Getting the following error in the Thanos Querier log:
level=warn ts=2022-06-27T15:25:49.79866573Z caller=endpointset.go:525 component=endpointset msg="update of node failed" err="getting metadata: fallback fetching info from xx.xx.xx.xxx:10901: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address= xx.xx.xx.xxx:10901

Getting the following error in the Store Gateway logs:
level=debug ts=2022-06-27T15:26:32.614591931Z caller=bucket.go:1087 msg="stats query processed" stats="&{blocksQueried:0 postingsTouched:0 postingsTouchedSizeSum:0 postingsToFetch:0 postingsFetched:0 postingsFetchedSizeSum:0 postingsFetchCount:0 postingsFetchDurationSum:0 cachedPostingsCompressions:0 cachedPostingsCompressionErrors:0 cachedPostingsOriginalSizeSum:0 cachedPostingsCompressedSizeSum:0 cachedPostingsCompressionTimeSum:0 cachedPostingsDecompressions:0 cachedPostingsDecompressionErrors:0 cachedPostingsDecompressionTimeSum:0 seriesTouched:0 seriesTouchedSizeSum:0 seriesFetched:0 seriesFetchedSizeSum:0 seriesFetchCount:0 seriesFetchDurationSum:0 chunksTouched:0 chunksTouchedSizeSum:0 chunksFetched:0 chunksFetchedSizeSum:0 chunksFetchCount:0 chunksFetchDurationSum:0 getAllDuration:0 mergedSeriesCount:0 mergedChunksCount:0 mergeDuration:0}" err="rpc error: code = Aborted desc = fetch series for block 01FQEABSWZBJVT8Z0E0F4Q8KY6: expanded matching posting: get postings: read postings range: read range: context canceled"

Anything else we need to know:

Querier deployment configurations:

containers:
      - args:
        - query
        - --http-address=0.0.0.0:10902
        - --query.replica-label=replica
        - --query.auto-downsampling
        - --store=dnssrv+thanos-store-gateway.$(NAMESPACE):10901
        - --log.level=debug
        - --query.timeout=5m
        - --store.response-timeout=5m
        resources:
          limits:
            cpu: "2"
            memory: 12Gi
          requests:
            cpu: "1"
            memory: 12Gi

Store-Gateway deployment configurations:

containers:
      - args:
        - store
        - --log.level=debug
        - --data-dir=/data
        - --index-cache-size=2GB
        - --chunk-pool-size=20GB
        - --objstore.config-file=/thanos-config/thanos-s3-config.yml
        - --http-address=0.0.0.0:10902
        - --max-time=-24h
        - --min-time=-52w
        resources:
          limits:
            cpu: "12"
            memory: 120Gi
          requests:
            cpu: "12"
            memory: 120Gi
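
The actual thanos-s3-config.yml referenced by --objstore.config-file is not included here; for context, it follows the standard Thanos objstore format, roughly as below (bucket, endpoint, and credentials are placeholders):

type: S3
config:
  bucket: "example-thanos-bucket"        # placeholder bucket name
  endpoint: "s3.us-east-1.amazonaws.com" # placeholder endpoint
  access_key: "<redacted>"
  secret_key: "<redacted>"
  insecure: false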

Actual CPU/Memory usage for thanos querier during the failure: CPU - 0.05%, Memory - 3 %
Actual CPU/Memory usage for thanos store gateway during the failure: CPU - 3.36%, Memory - 14.2 %

Environment:

  • OS (e.g. from /etc/os-release): Deployed the applications as Docker containers in Kubernetes.

italux commented Jul 3, 2022

It looks like I'm facing the same behavior, but using Azure Blob instead of S3, and I'm struggling to understand whether the error reported by the store gateway is really related to the querier's DeadlineExceeded. It looks like the querier is simply ignoring the --store.response-timeout param.

  • Querier error log:
    level=warn ts=2022-07-03T21:35:36.820616161Z caller=endpointset.go:512 component=endpointset msg="update of node failed" err="getting metadata: fallback fetching info from 10.0.227.110:10901 after err: rpc error: code = DeadlineExceeded desc = context deadline exceeded: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=10.0.227.110:10901

  • Store Gateway debug log:
    level=debug ts=2022-07-03T21:35:40.726735313Z caller=bucket.go:1087 msg="stats query processed" stats="&{blocksQueried:0 postingsTouched:140 postingsTouchedSizeSum:84966 postingsToFetch:0 postingsFetched:0 postingsFetchedSizeSum:0 postingsFetchCount:0 postingsFetchDurationSum:0 cachedPostingsCompressions:0 cachedPostingsCompressionErrors:0 cachedPostingsOriginalSizeSum:0 cachedPostingsCompressedSizeSum:0 cachedPostingsCompressionTimeSum:0 cachedPostingsDecompressions:140 cachedPostingsDecompressionErrors:0 cachedPostingsDecompressionTimeSum:594401 seriesTouched:130 seriesTouchedSizeSum:4685 seriesFetched:130 seriesFetchedSizeSum:667488 seriesFetchCount:10 seriesFetchDurationSum:115254535346 chunksTouched:260 chunksTouchedSizeSum:180615 chunksFetched:260 chunksFetchedSizeSum:479484 chunksFetchCount:10 chunksFetchDurationSum:124778300778 getAllDuration:0 mergedSeriesCount:0 mergedChunksCount:0 mergeDuration:0}" err="rpc error: code = Aborted desc = fetch series for block 01G6H3WJEFB23R2K5XJYPEVP5M: expanded matching posting: get postings: read postings range: get range reader: cannot get blob reader: 01G6H3WJEFB23R2K5XJYPEVP5M/index: cannot get properties for Azure blob, address: 01G6H3WJEFB23R2K5XJYPEVP5M/index: Head \"https://thanosxxxxxxxxx.blob.core.windows.net/xxxxxxxxx/01G6H3WJEFB23R2K5XJYPEVP5M/index?timeout=61\": context canceled"


italux commented Jul 4, 2022

Looking into the code, the err="getting metadata: fallback fetching info from ..." path uses a context with gRPCInfoCallTimeout, which seems to have a fixed value (5s).

I don't know much about the code, but it seems to me that there is no specific configuration parameter to change this timeout, unlike unhealthyEndpointTimeout, which I think is changed by --store.unhealthy-timeout. I'll keep digging...
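
To illustrate the pattern (a simplified sketch, not the actual Thanos source; the infoClient interface here is hypothetical): the info/metadata call gets its own context with the hard-coded timeout, so the query-level timeouts never apply to it:

package sketch

import (
	"context"
	"time"
)

// Fixed value described above; in the real code it is a constant, not a flag.
const grpcInfoCallTimeout = 5 * time.Second

// infoClient is a hypothetical stand-in for the generated gRPC Info client.
type infoClient interface {
	Info(ctx context.Context) (metadata string, err error)
}

// getMetadata wraps the Info RPC in its own short-lived context, so it fails
// with DeadlineExceeded after 5s regardless of --store.response-timeout.
func getMetadata(parent context.Context, c infoClient) (string, error) {
	ctx, cancel := context.WithTimeout(parent, grpcInfoCallTimeout)
	defer cancel()
	return c.Info(ctx)
}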


marevers commented Jul 7, 2022

Running into the same issue with Azure Blob storage, already with just 7 days of data.

Here are my relevant settings:

--query.timeout=5m
--store.unhealthy-timeout=5m
--store.response-timeout=5m

I also changed --query.default-step to 5m, but that did not help enough. Thanos 0.27 also has the fixed 5-second timeout.

Looking at the Thanos Store debug log:
level=debug ts=2022-07-07T07:55:16.408977238Z caller=bucket.go:1115 msg="stats query processed" request="min_time:1656575100000 max_time:1657180200000 matchers:<name:\"__name__\" value:\"sql_mssql_memory_usage\" > max_resolution_window:120000 aggregates:COUNT aggregates:SUM " stats="&{blocksQueried:0 postingsTouched:0 PostingsTouchedSizeSum:0B postingsToFetch:0 postingsFetched:0 PostingsFetchedSizeSum:0B postingsFetchCount:0 PostingsFetchDurationSum:0s cachedPostingsCompressions:0 cachedPostingsCompressionErrors:0 CachedPostingsOriginalSizeSum:0B CachedPostingsCompressedSizeSum:0B CachedPostingsCompressionTimeSum:0s cachedPostingsDecompressions:0 cachedPostingsDecompressionErrors:0 CachedPostingsDecompressionTimeSum:0s seriesTouched:0 SeriesTouchedSizeSum:0B seriesFetched:0 SeriesFetchedSizeSum:0B seriesFetchCount:0 SeriesFetchDurationSum:0s chunksTouched:0 ChunksTouchedSizeSum:0B chunksFetched:0 ChunksFetchedSizeSum:0B chunksFetchCount:0 ChunksFetchDurationSum:0s GetAllDuration:0s mergedSeriesCount:0 mergedChunksCount:0 MergeDuration:0s}" err="rpc error: code = Aborted desc = fetch series for block 01G75Q48ED36JVADQMRZYMVWN7: expanded matching posting: get postings: read postings range: get range reader: cannot get blob reader: 01G75Q48ED36JVADQMRZYMVWN7/index: cannot get properties for Azure blob, address: 01G75Q48ED36JVADQMRZYMVWN7/index: Head \"https://snappg0dev.privatelink.blob.core.windows.net/snappg0dev/01G75Q48ED36JVADQMRZYMVWN7/index?timeout=61\": context canceled"

It does look like Thanos Query is actively cancelling the connection after a certain amount of time, and that the timeout is not on the Store side or on the Azure side, judging by the rpc error: code = Aborted.


marevers commented Jul 7, 2022

I created a test image from Thanos v0.27.0 in which I changed gRPCInfoCallTimeout to 30 seconds, and I'm no longer running into the issue. @italux and @anoop2503, maybe you can test with my image gkmevers/thanos:grpc and check whether it also works for you? If so, I guess it would be possible to make this timeout configurable via a new flag.

EDIT: To be a bit more helpful: if you use gkmevers/thanos:grpc-flag, you can use the flag --store.grpcinfo-timeout I added to test different values. The default is 5 seconds, as it normally is.
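
Roughly the shape of the change (a simplified sketch using plain kingpin rather than Thanos' internal flag helpers; everything except the --store.grpcinfo-timeout flag name is illustrative):

package main

import (
	"fmt"

	kingpin "gopkg.in/alecthomas/kingpin.v2"
)

// Sketch: expose the previously hard-coded 5s gRPC info call timeout as a flag.
var grpcInfoTimeout = kingpin.Flag(
	"store.grpcinfo-timeout",
	"Timeout for the gRPC Info call used to fetch endpoint metadata.",
).Default("5s").Duration()

func main() {
	kingpin.Parse()
	// The parsed value would replace the fixed grpcInfoCallTimeout constant
	// wherever the metadata context is created.
	fmt.Printf("using gRPC info timeout: %s\n", *grpcInfoTimeout)
}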

yeya24 (Contributor) commented Jul 8, 2022

Actually, I think we can definitely make this timeout a configurable flag. I have seen this in our Thanos setup too; since ours is a central-edge setup, this error is common. I will open a PR to update the timeout.


snowmansora commented Aug 4, 2022

Our environment is also encountering this problem when running a query that gets 7 days of data. Logs are similar to what @italux described.

Our environment setup:
Thanos v0.27.0.
Query and Store are running on the same AKS node in West US 2.
Store is using Azure Blob Storage in West US 2.

Interestingly, with the same setup but with Query and Store running on the same AKS node in East US 2, the chance of encountering this problem is much lower for some reason...

It would be nice if this issue could be addressed in the next Thanos release.

[Update on Aug 4]
I increased the timeout to 30 seconds as suggested by @gk-mevers, and I no longer see the problem in West US 2.
I wonder what the timeout actually does. The reason I ask is that it should presumably take longer for the East US 2 cluster to get data from Azure Blob Storage in West US 2, yet the East US 2 cluster seems to work fine without increasing the timeout.
