Thanos store OOM #1750

Closed
salapat11 opened this issue Nov 16, 2019 · 11 comments · Fixed by #1952

Comments

@salapat11

Thanos, Prometheus and Golang version used:
Thanos: 0.6.0
Prometheus: 2.10.0

Object Storage Provider: S3

What happened:
Thanos Store won't start. It runs for about 2 minutes and then crashes with OOM. Increasing the memory to 64 GB did not help; it still fails. The compactor is running and generating index.cache.json files.

Bucket size: 202.4 GiB. Total Objects: 4952
Biggest index.cache.json: 3.2 GiB

  • --index-cache-size=20GB
  • --chunk-pool-size=40GB
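
As a quick sanity check on the figures in this report, the sketch below adds up the two configured limits and compares the total with the 64 GB of RAM mentioned above. It assumes all three figures use the same unit, and it makes no claim about how Thanos Store actually accounts for memory; the variable names are illustrative only.

```python
# Figures copied from the report above; unit assumed to be the same "GB" throughout.
index_cache_gb = 20   # --index-cache-size=20GB
chunk_pool_gb = 40    # --chunk-pool-size=40GB
host_ram_gb = 64      # memory given to the machine

# Sum of the two configured limits, and what is left for everything else.
configured_gb = index_cache_gb + chunk_pool_gb
headroom_gb = host_ram_gb - configured_gb

print(f"configured limits: {configured_gb} GB, headroom: {headroom_gb} GB")
```

Under these assumptions the two limits alone account for most of the available RAM, so any memory used outside them has only a few GB to fit in.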
@GiedriusS
Member

Hi, is there any reason you are running such an old version? Please try master and/or 0.8.1 and see if it is still reproducible.

@salapat11
Author

Same issue with v0.8.1.

@hbokh

hbokh commented Nov 25, 2019

Using the same version here.

# thanos --version
thanos, version 0.8.1 (branch: HEAD, revision: bd8278859b2321aaaa7514edde764816cc039d34)
  build user:       root@2227d9a2fdb1
  build date:       20191014-12:03:55
  go version:       go1.13.1

Total objects is a little over 8000.

Running on a VM, I had to go from 2GB --> 4GB --> 8GB --> 16GB of memory before the OOM-killer was not an issue anymore!
Now thanos store is using 12.3GB of RAM.

@MarcMielke

Same issue with 0.9.0. I wonder how bucket size, total objects, --index-cache-size and --chunk-pool-size relate to each other, and whether there is a formula for the proper memory requirements, even if it's only an estimate.

@draeron

draeron commented Dec 23, 2019

We're having the same issue. It would be really useful if the documentation contained some information about ballpark values to set.

@hbokh

hbokh commented Dec 24, 2019

Things seem to have gotten a bit worse with v0.9.0.
I upgraded at 14:13 CET and this is what Grafana shows:

[Image: Screenshot 2019-12-24 at 15 32 58]

Thanos Store behaves a bit like "The Very Hungry Caterpillar" when it comes to memory usage...

On the positive side, I see this is being worked on: #1471 👍

@hbokh

hbokh commented Jan 21, 2020

Just started testing thanos store, thanos-0.10.0.linux-amd64
@bwplotka Can you please explain this?

Instead of OOM it is now restarting pretty often with fatal error: runtime: out of memory (20GB RAM).

Jan 21 15:35:28 thanos0-grq thanos[10904]: created by net.(*netFD).connect
Jan 21 15:35:28 thanos0-grq thanos[10904]:         /usr/local/go/src/net/fd_unix.go:128 +0x275
Jan 21 15:35:29 thanos0-grq systemd[1]: thanos-store.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Jan 21 15:35:29 thanos0-grq systemd[1]: thanos-store.service: Failed with result 'exit-code'.
Jan 21 15:35:29 thanos0-grq systemd[1]: thanos-store.service: Service RestartSec=100ms expired, scheduling restart.
Jan 21 15:35:29 thanos0-grq systemd[1]: thanos-store.service: Scheduled restart job, restart counter is at 3.
Jan 21 15:35:29 thanos0-grq systemd[1]: Stopped Thanos Store Gateway.
Jan 21 15:35:29 thanos0-grq systemd[1]: Started Thanos Store Gateway.
Jan 21 15:35:29 thanos0-grq thanos[11516]: level=info ts=2020-01-21T14:35:29.707034623Z caller=main.go:149 msg="Tracing will be disabled"
Jan 21 15:35:29 thanos0-grq thanos[11516]: level=info ts=2020-01-21T14:35:29.711552838Z caller=factory.go:43 msg="loading bucket configuration"
Jan 21 15:35:29 thanos0-grq thanos[11516]: level=info ts=2020-01-21T14:35:29.817545669Z caller=inmemory.go:167 msg="created in-memory index cache" maxItemSizeBytes=131072000 maxSizeBytes=262144000 maxItems=math.MaxInt64
Jan 21 15:35:29 thanos0-grq thanos[11516]: level=info ts=2020-01-21T14:35:29.818622141Z caller=options.go:20 protocol=gRPC msg="disabled TLS, key and cert must be set to enable"
Jan 21 15:35:29 thanos0-grq thanos[11516]: level=info ts=2020-01-21T14:35:29.818923211Z caller=store.go:288 msg="starting store node"
Jan 21 15:35:29 thanos0-grq thanos[11516]: level=info ts=2020-01-21T14:35:29.819046301Z caller=store.go:243 msg="initializing bucket store"
Jan 21 15:35:29 thanos0-grq thanos[11516]: level=info ts=2020-01-21T14:35:29.819459371Z caller=prober.go:127 msg="changing probe status" status=healthy
Jan 21 15:35:29 thanos0-grq thanos[11516]: level=info ts=2020-01-21T14:35:29.819557861Z caller=http.go:53 service=http/server component=store msg="listening for requests and metrics" address=0.0.0.0:19191
Jan 21 15:35:39 thanos0-grq thanos[11516]: level=info ts=2020-01-21T14:35:39.306737083Z caller=fetcher.go:361 component=block.MetaFetcher msg="successfully fetched block metadata" duration=9.487646873s cached=11563 returned=11563 partial=0
Jan 21 15:39:11 thanos0-grq thanos[11516]: fatal error: runtime: out of memory

[Image: Thanos-store]
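
The `created in-memory index cache` line in the log above reports its limits in raw bytes, which are easy to misread. A minimal sketch for pulling the fields out of that logfmt line and converting them to MiB follows; `parse_logfmt` is a hypothetical helper written for this example, not part of Thanos, and it only covers the simple `key=value` / `key="quoted value"` shapes seen in this log.

```python
import re

# Log line copied from the journal output above (systemd prefix stripped).
LOG_LINE = (
    'level=info ts=2020-01-21T14:35:29.817545669Z caller=inmemory.go:167 '
    'msg="created in-memory index cache" maxItemSizeBytes=131072000 '
    'maxSizeBytes=262144000 maxItems=math.MaxInt64'
)

def parse_logfmt(line):
    """Split a logfmt line into a dict of key -> value (surrounding quotes removed)."""
    pairs = re.findall(r'(\w+)=("(?:[^"\\]|\\.)*"|\S+)', line)
    return {key: value.strip('"') for key, value in pairs}

fields = parse_logfmt(LOG_LINE)

MIB = 1024 * 1024
max_size_mib = int(fields["maxSizeBytes"]) / MIB
max_item_mib = int(fields["maxItemSizeBytes"]) / MIB

print(f"index cache limit: {max_size_mib:.0f} MiB, per-item limit: {max_item_mib:.0f} MiB")
```

For this particular run the reported index cache limit works out to 250 MiB, so the in-memory index cache limit on its own would not account for 20GB of RAM being exhausted.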

@GiedriusS
Member

The PR that closed this is not in 0.10.0. Please try out the master version with the experimental flags turned on.

@bwplotka
Member

It's on master indeed. It's still experimental, but you can enable it via https://github.com/thanos-io/thanos/blob/master/cmd/thanos/store.go#L78 (--experimental.enable-index-header).

We are still working on various benchmarks, especially around query resource usage, but functionally it should work! (: Please try it out on dev/testing/staging environments and give us feedback! ❤️

@Jasstkn

Jasstkn commented Mar 25, 2020

@bwplotka the result is amazing after applying this feature flag. Are there any side effects on performance?

@bwplotka
Member

bwplotka commented Mar 25, 2020 via email
