Cortex store gateway keeps going into crash/failure during startup #4993

ajcts · 2022-11-24T08:58:31Z

Issue -

On Cortex hosted on AKS in a distributed env, during startup/new deployment, the store gateway keeps going into CrashLoopBackOff and never really comes up. The store gateway has the PVC mount and its associated blob storage and is currently running with a replication factor of 3. We tuned the readiness/liveness probe timeouts to give the ring more time to become healthy, but it's not really helping.

All 3 instances are going into crashloop eventually.
When deployed fresh with blob and PVC deleted, store gateway comes up normally without any issues. But in a shared cluster env, this is not really a permanent option.

K8s events don't really help narrow down what makes the SG fail, nor do the SG logs.

Infra - K8S istio environment
Arch - Microservices

alanprot · 2022-11-24T09:07:42Z

Is it going OOM or terminating for another reason? If it is not OOM, can you try to fetch the log from the dead container (-p option on kubectl logs)?

ajcts · 2022-11-25T05:39:11Z

Yes, checked the dead container logs and it was not due to OOM, as we recently increased memory by a fair bit. The errors were mostly memberlist failures (relatively few, though), with not much info on the cause of termination.

caller=tcp_transport.go:428 component="memberlist TCPTransport" msg="WriteTo failed"

alanprot · 2022-11-25T06:16:12Z

Can you try to set the lazy load config to true?

  # If enabled, store-gateway will lazily memory-map an index-header only once
  # required by a query.
  # CLI flag: -blocks-storage.bucket-store.index-header-lazy-loading-enabled
  [index_header_lazy_loading_enabled: <boolean> | default = false]

And also bucket index?

  bucket_index:
    # True to enable querier and store-gateway to discover blocks in the storage
    # via bucket index instead of bucket scanning.
    # CLI flag: -blocks-storage.bucket-store.bucket-index.enabled
    [enabled: <boolean> | default = false]

ajcts · 2022-11-25T06:28:50Z

Yes, we have these enabled too. Also, there is no definite pattern to these failures/crashes: they occur intermittently, but more often than not.

bucket_index:
  enabled: true
  idle_timeout: 30m
  max_stale_period: 1h
index_header_lazy_loading_enabled: true
index_header_lazy_loading_idle_timeout: 20m

yeya24 · 2022-11-28T02:40:05Z

If there aren't enough logs from the pods, could you please increase the log level to debug and try again?
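
For readers following along, a minimal sketch of what raising the log level could look like, assuming the standard Cortex `server` config block and its `-log.level` flag. How the config file is mounted into the store-gateway pods depends on the deployment and is not shown in this thread.

```yaml
# Cortex config file fragment (assumed layout): raise log verbosity.
# Equivalent CLI flag: -log.level=debug
server:
  log_level: debug
```

Debug logging is verbose, so it is probably worth reverting to the previous level once the failing startup has been captured.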

stale · 2023-06-18T09:38:51Z

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.

yeya24 · 2023-11-10T05:21:45Z

After taking another look at this issue, I believe it is related to thanos-io/thanos#6509.

This bug caused the SG initial sync to take too much memory, which is totally unnecessary. The fix was included in the latest release RC, so I will close this issue. Feel free to try it out and let us know whether it works. https://github.com/cortexproject/cortex/releases/tag/v1.16.0-rc.0

friedrichg added the type/troubleshoot label Mar 3, 2023

stale bot added the stale label Jun 18, 2023

yeya24 closed this as completed Nov 10, 2023

yeya24 added the component/store-gateway label Nov 10, 2023
