Thanos Querier/store gateway is failing to return just 2 months of data #5450

Closed
anoop2503 opened this issue Jun 27, 2022 · 6 comments · Fixed by #5480
@anoop2503

Thanos, version used: v0.23.1

Object Storage Provider: S3

What happened:
Deployed only thanos-querier and thanos-store-gateway, using an S3 bucket, to display historical data. However, it fails to return even 2 months of data.

What you expected to happen: Thanos should return data without failure

How to reproduce it (as minimally and precisely as possible):

Deploy thanos-store-gateway pointing to an S3 bucket. Deploy thanos-querier to use only thanos-store-gateway. Try to pull 2 months of historical data.

Full logs to relevant components:

Getting the following error in the Thanos Querier log:
level=warn ts=2022-06-27T15:25:49.79866573Z caller=endpointset.go:525 component=endpointset msg="update of node failed" err="getting metadata: fallback fetching info from xx.xx.xx.xxx:10901: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address= xx.xx.xx.xxx:10901

Getting the following error in the Store Gateway logs:
level=debug ts=2022-06-27T15:26:32.614591931Z caller=bucket.go:1087 msg="stats query processed" stats="&{blocksQueried:0 postingsTouched:0 postingsTouchedSizeSum:0 postingsToFetch:0 postingsFetched:0 postingsFetchedSizeSum:0 postingsFetchCount:0 postingsFetchDurationSum:0 cachedPostingsCompressions:0 cachedPostingsCompressionErrors:0 cachedPostingsOriginalSizeSum:0 cachedPostingsCompressedSizeSum:0 cachedPostingsCompressionTimeSum:0 cachedPostingsDecompressions:0 cachedPostingsDecompressionErrors:0 cachedPostingsDecompressionTimeSum:0 seriesTouched:0 seriesTouchedSizeSum:0 seriesFetched:0 seriesFetchedSizeSum:0 seriesFetchCount:0 seriesFetchDurationSum:0 chunksTouched:0 chunksTouchedSizeSum:0 chunksFetched:0 chunksFetchedSizeSum:0 chunksFetchCount:0 chunksFetchDurationSum:0 getAllDuration:0 mergedSeriesCount:0 mergedChunksCount:0 mergeDuration:0}" err="rpc error: code = Aborted desc = fetch series for block 01FQEABSWZBJVT8Z0E0F4Q8KY6: expanded matching posting: get postings: read postings range: read range: context canceled"

Anything else we need to know:

Querier deployment configurations:

containers:
      - args:
        - query
        - --http-address=0.0.0.0:10902
        - --query.replica-label=replica
        - --query.auto-downsampling
        - --store=dnssrv+thanos-store-gateway.$(NAMESPACE):10901
        - --log.level=debug
        - --query.timeout=5m
        - --store.response-timeout=5m
        resources:
          limits:
            cpu: "2"
            memory: 12Gi
          requests:
            cpu: "1"
            memory: 12Gi

Store-Gateway deployment configurations:

containers:
      - args:
        - store
        - --log.level=debug
        - --data-dir=/data
        - --index-cache-size=2GB
        - --chunk-pool-size=20GB
        - --objstore.config-file=/thanos-config/thanos-s3-config.yml
        - --http-address=0.0.0.0:10902
        - --max-time=-24h
        - --min-time=-52w
        resources:
          limits:
            cpu: "12"
            memory: 120Gi
          requests:
            cpu: "12"
            memory: 120Gi
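
The actual thanos-s3-config.yml referenced by --objstore.config-file is not included here; for context, it follows the standard Thanos objstore format, roughly as below (bucket, endpoint, and credentials are placeholders):

type: S3
config:
  bucket: "example-thanos-bucket"        # placeholder bucket name
  endpoint: "s3.us-east-1.amazonaws.com" # placeholder endpoint
  access_key: "<redacted>"
  secret_key: "<redacted>"
  insecure: false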

Actual CPU/Memory usage for thanos querier during the failure: CPU - 0.05%, Memory - 3 %
Actual CPU/Memory usage for thanos store gateway during the failure: CPU - 3.36%, Memory - 14.2 %

Environment:

  • OS (e.g. from /etc/os-release): Deployed the applications as Docker containers in Kubernetes.

italux commented Jul 3, 2022

It looks like I'm facing the same behavior, but using Azure Blob instead of S3, and I'm struggling to understand whether the error reported by the store gateway is really related to the querier's DeadlineExceeded. It looks like the querier is simply ignoring the --store.response-timeout param.

  • Querier error log:
    level=warn ts=2022-07-03T21:35:36.820616161Z caller=endpointset.go:512 component=endpointset msg="update of node failed" err="getting metadata: fallback fetching info from 10.0.227.110:10901 after err: rpc error: code = DeadlineExceeded desc = context deadline exceeded: rpc error: code = DeadlineExceeded desc = context deadline exceeded" address=10.0.227.110:10901

  • Store Gateway debug log:
    level=debug ts=2022-07-03T21:35:40.726735313Z caller=bucket.go:1087 msg="stats query processed" stats="&{blocksQueried:0 postingsTouched:140 postingsTouchedSizeSum:84966 postingsToFetch:0 postingsFetched:0 postingsFetchedSizeSum:0 postingsFetchCount:0 postingsFetchDurationSum:0 cachedPostingsCompressions:0 cachedPostingsCompressionErrors:0 cachedPostingsOriginalSizeSum:0 cachedPostingsCompressedSizeSum:0 cachedPostingsCompressionTimeSum:0 cachedPostingsDecompressions:140 cachedPostingsDecompressionErrors:0 cachedPostingsDecompressionTimeSum:594401 seriesTouched:130 seriesTouchedSizeSum:4685 seriesFetched:130 seriesFetchedSizeSum:667488 seriesFetchCount:10 seriesFetchDurationSum:115254535346 chunksTouched:260 chunksTouchedSizeSum:180615 chunksFetched:260 chunksFetchedSizeSum:479484 chunksFetchCount:10 chunksFetchDurationSum:124778300778 getAllDuration:0 mergedSeriesCount:0 mergedChunksCount:0 mergeDuration:0}" err="rpc error: code = Aborted desc = fetch series for block 01G6H3WJEFB23R2K5XJYPEVP5M: expanded matching posting: get postings: read postings range: get range reader: cannot get blob reader: 01G6H3WJEFB23R2K5XJYPEVP5M/index: cannot get properties for Azure blob, address: 01G6H3WJEFB23R2K5XJYPEVP5M/index: Head \"https://thanosxxxxxxxxx.blob.core.windows.net/xxxxxxxxx/01G6H3WJEFB23R2K5XJYPEVP5M/index?timeout=61\": context canceled"


italux commented Jul 4, 2022

Looking into the code, the err="getting metadata: fallback fetching info from ..." path uses a context with gRPCInfoCallTimeout, which seems to have a fixed value (5s).

I don't know much about the code, but it seems to me that there is no specific configuration parameter to change this timeout, unlike unhealthyEndpointTimeout, which I think is changed by --store.unhealthy-timeout. I'll keep digging...
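
To illustrate the pattern (a simplified sketch, not the actual Thanos source; the infoClient interface here is hypothetical): the info/metadata call gets its own context with the hard-coded timeout, so the query-level timeouts never apply to it:

package sketch

import (
	"context"
	"time"
)

// Fixed value described above; in the real code it is a constant, not a flag.
const grpcInfoCallTimeout = 5 * time.Second

// infoClient is a hypothetical stand-in for the generated gRPC Info client.
type infoClient interface {
	Info(ctx context.Context) (metadata string, err error)
}

// getMetadata wraps the Info RPC in its own short-lived context, so it fails
// with DeadlineExceeded after 5s regardless of --store.response-timeout.
func getMetadata(parent context.Context, c infoClient) (string, error) {
	ctx, cancel := context.WithTimeout(parent, grpcInfoCallTimeout)
	defer cancel()
	return c.Info(ctx)
}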


marevers commented Jul 7, 2022

Running into the same issue with Azure Blob storage, already with just 7 days of data.

Here are my relevant settings:

--query.timeout=5m
--store.unhealthy-timeout=5m
--store.response-timeout=5m

I also changed --query.default-step to 5m, but that did not help enough. Thanos 0.27 also has the fixed 5-second timeout.

Looking at the Thanos Store debug log:
level=debug ts=2022-07-07T07:55:16.408977238Z caller=bucket.go:1115 msg="stats query processed" request="min_time:1656575100000 max_time:1657180200000 matchers:<name:\"__name__\" value:\"sql_mssql_memory_usage\" > max_resolution_window:120000 aggregates:COUNT aggregates:SUM " stats="&{blocksQueried:0 postingsTouched:0 PostingsTouchedSizeSum:0B postingsToFetch:0 postingsFetched:0 PostingsFetchedSizeSum:0B postingsFetchCount:0 PostingsFetchDurationSum:0s cachedPostingsCompressions:0 cachedPostingsCompressionErrors:0 CachedPostingsOriginalSizeSum:0B CachedPostingsCompressedSizeSum:0B CachedPostingsCompressionTimeSum:0s cachedPostingsDecompressions:0 cachedPostingsDecompressionErrors:0 CachedPostingsDecompressionTimeSum:0s seriesTouched:0 SeriesTouchedSizeSum:0B seriesFetched:0 SeriesFetchedSizeSum:0B seriesFetchCount:0 SeriesFetchDurationSum:0s chunksTouched:0 ChunksTouchedSizeSum:0B chunksFetched:0 ChunksFetchedSizeSum:0B chunksFetchCount:0 ChunksFetchDurationSum:0s GetAllDuration:0s mergedSeriesCount:0 mergedChunksCount:0 MergeDuration:0s}" err="rpc error: code = Aborted desc = fetch series for block 01G75Q48ED36JVADQMRZYMVWN7: expanded matching posting: get postings: read postings range: get range reader: cannot get blob reader: 01G75Q48ED36JVADQMRZYMVWN7/index: cannot get properties for Azure blob, address: 01G75Q48ED36JVADQMRZYMVWN7/index: Head \"https://snappg0dev.privatelink.blob.core.windows.net/snappg0dev/01G75Q48ED36JVADQMRZYMVWN7/index?timeout=61\": context canceled"

It does look like Thanos Query is actively cancelling the connection after a certain amount of time, and that the timeout is not on the Store side or on the Azure side, judging by the rpc error: code = Aborted.


marevers commented Jul 7, 2022

I created a test image from Thanos v0.27.0 in which I changed gRPCInfoCallTimeout to 30 seconds, and I'm no longer running into the issue. @italux and @anoop2503, maybe you can test with my image gkmevers/thanos:grpc and check whether it also works for you? If so, I guess it would be possible to make this timeout configurable via a new flag.

EDIT: To be a bit more helpful: if you use gkmevers/thanos:grpc-flag, you can use the flag --store.grpcinfo-timeout I added to test different values. The default is 5 seconds, as it normally is.
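
Roughly the shape of the change (a simplified sketch using plain kingpin rather than Thanos' internal flag helpers; everything except the --store.grpcinfo-timeout flag name is illustrative):

package main

import (
	"fmt"

	kingpin "gopkg.in/alecthomas/kingpin.v2"
)

// Sketch: expose the previously hard-coded 5s gRPC info call timeout as a flag.
var grpcInfoTimeout = kingpin.Flag(
	"store.grpcinfo-timeout",
	"Timeout for the gRPC Info call used to fetch endpoint metadata.",
).Default("5s").Duration()

func main() {
	kingpin.Parse()
	// The parsed value would replace the fixed grpcInfoCallTimeout constant
	// wherever the metadata context is created.
	fmt.Printf("using gRPC info timeout: %s\n", *grpcInfoTimeout)
}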

yeya24 (Contributor) commented Jul 8, 2022

Actually, I think we can definitely make this timeout a configurable flag. I have seen this in our Thanos setup too; since ours is a central-edge setup, this error is common. I will open a PR to update the timeout.


snowmansora commented Aug 4, 2022

Our environment is also encountering this problem when running a query that gets 7 days of data. Logs are similar to what @italux described.

Our environment setup:
Thanos v0.27.0.
Query and Store are running on the same AKS node in West US 2.
Store is using Azure Blob Storage in West US 2.

Interestingly, with the same setup but with Query and Store running on the same AKS node in East US 2, the chance of encountering this problem is much lower for some reason...

It would be nice if this issue could be addressed in the next Thanos release.

[Update on Aug 4]
I increased the timeout to 30 seconds as suggested by @gk-mevers, and I no longer see the problem in West US 2.
I wonder what the timeout actually does. The reason I ask is that it should presumably take longer for the East US 2 cluster to get data from Azure Blob Storage in West US 2, yet the East US 2 cluster seems to work fine without increasing the timeout.
