store: sometimes pkg/store.(*indexCache).ensureFits spins forever, blocking other goroutines #651

GiedriusS · 2018-11-30T08:14:53Z

Thanos components: improbable/thanos:v0.1.0
Prometheus: prom/prometheus:v2.4.2

What happened

Number of goroutines in Thanos Store instances started growing inexorably. It never came back down and Thanos Store was essentially unresponsive to all queries during this.

What you expected to happen
Goroutine number to stay around the same number i.e. have a low standard deviation, and Thanos Store to be responsive.

How to reproduce it (as minimally and precisely as possible):
To be done. since this is hard to reproduce and depends on the data.

Full logs to relevant components
Thanos Query started timing out:

level=error ts=2018-11-29T13:20:41.368751553Z caller=proxy.go:117 err="fetch series for [{monitor prtoss} {replica a}]: rpc error: code = Canceled desc = context canceled"
level=error ts=2018-11-29T13:20:41.368755808Z caller=proxy.go:117 err="fetch series for [{monitor prtoss} {replica b}]: rpc error: code = Canceled desc = context canceled"
level=error ts=2018-11-29T13:20:41.368808725Z caller=proxy.go:117 err="fetch series for []: rpc error: code = Canceled desc = context canceled"
level=warn ts=2018-11-29T13:20:41.368849077Z caller=proxy.go:126 err="No store matched for this query"
level=error ts=2018-11-29T13:20:41.368848275Z caller=proxy.go:117 err="fetch series for []: rpc error: code = Canceled desc = context canceled"
level=warn ts=2018-11-29T13:20:41.368909531Z caller=proxy.go:126 err="No store matched for this query"
level=error ts=2018-11-29T13:20:41.369833973Z caller=engine.go:499 msg="error expanding series set" err="context canceled"

No more series query processed messages in Thanos Store whatsoever, only Blocks source resolution. Only this one got through somehow in a period of about an hour:

Lap 29 14:07:42 foobar docker[21891]: level=debug ts=2018-11-29T13:07:42.784986336Z caller=bucket.go:802 msg="series query processed" stats="&{blocksQueried:0 postingsTouched:0 postingsTouchedSizeSum:0 postingsFetched:0 postingsFetchedSizeSum:0 postingsFetchCount:0 postingsFetchDurationSum:0 seriesTouched:0 seriesTouchedSizeSum:0 seriesFetched:0 seriesFetchedSizeSum:0 seriesFetchCount:0 seriesFetchDurationSum:0 chunksTouched:0 chunksTouchedSizeSum:0 chunksFetched:0 chunksFetchedSizeSum:0 chunksFetchCount:0 chunksFetchDurationSum:0 getAllDuration:877 mergedSeriesCount:0 mergedChunksCount:0 mergeDuration:1273}"

When something happened but exactly zero data was pulled down so maybe this is a separate issue?

The goroutine dump of Thanos Store at the moment of this freeze:
goroutine_dmp.zip

Notice that exactly one goroutine is stuck in ensureFits and it locks the lock, thus preventing other goroutines from performing their work.

Workaround: increased the index cache significantly and the problem went away. Perhaps this has to do with something that some index cache files are really large:

$ du -t 250M -hs /var/lib/thanos/*/index.cache.json
279M    /var/lib/thanos/01CXD021G0JHNRQWJKVKCHD3QR/index.cache.json
1,6G    /var/lib/thanos/01CXF4WQE8H27NZK0W4VEQB3XP/index.cache.json
1,5G    /var/lib/thanos/01CXFJT3R8J59MD4YC9AC0TPE5/index.cache.json
...

Anything else we need to know
Environment:

OS: CentOS Linux 7 (Core)
Kernel: 3.10.0-862.11.6.el7.x86_64
Docker: 18.03.1-ce, build 9ee9f40
Memory: 256 GB

The text was updated successfully, but these errors were encountered:

bwplotka · 2018-11-30T12:22:44Z

Are you sure it's spinning forever? When index cache in memory is full (as limited by index cache limit option) it just spinning for every setPosting, setPosting, so it's slow and go routines accumulates.

So it's all about the case of full index cache and how to deal with it. The implementations of LRU is too naive. If it's near full it blocks a lot and removes not aggresively enough. We need to fix this.

Second acceptance criteria to improve this is to also have clear query/way to tell you that you have a traffic that requires bigger cache.

GiedriusS · 2019-01-30T11:38:02Z

It never got unblocked - I think we have waited for an hour or so. Those two potential solutions sound logical to me however I also think that something like this GiedriusS@621d87d should work in the mean time. @bwplotka what is your opinion about this diff?

GiedriusS · 2019-02-01T12:03:26Z

Got some time to look at this again and IMHO what had happened is that a huge query came and the results were too big for the index cache to handle - setSeries() was invoked with a slice which exceeds the maximum set amount and thus we couldn't ever ensure that it fitted into the LRU cache. I will cook up some extra unit tests for the index cache to test this hypothesis and I will open up a PR if I am correct.

bwplotka · 2019-02-01T12:30:21Z

Thanks for looking into it. I think what matters the most is the extension of currnet implementation to handle gracefully or at least be aware of what will happen when cache is:

too small to fit things needed for single query
too small to fit things from different queries done in the same time.

It is not easy thing to fix and do unit tests against, but would be super valuable, thanks for helping here!

bwplotka · 2019-02-01T16:21:12Z

Fix merged.

bwplotka · 2019-04-23T11:30:35Z

We can repro this again on the version with fix cc @mjd95

bwplotka · 2019-04-23T11:31:14Z

First thing to test against: size equal to maxSize

devnev · 2019-04-23T11:52:52Z

Our thanos image is improbable/thanos:master-2019-04-16-f3f0518

Goroutine dump: storegw-0-goroutine.txt

The culprit looks like a single goroutine spinning in ensureFits:


1 @ 0xcfb371 0xe6b48c 0xe6b5ec 0xe77231 0xba6007 0x45cc41
#	0xcfb370	github.com/hashicorp/golang-lru/simplelru.(*LRU).RemoveOldest+0xe0				/go/pkg/mod/github.com/hashicorp/golang-lru@v0.5.0/simplelru/lru.go:109
#	0xe6b48b	github.com/improbable-eng/thanos/pkg/store.(*indexCache).ensureFits+0x3b			/go/src/github.com/improbable-eng/thanos/pkg/store/cache.go:131
#	0xe6b5eb	github.com/improbable-eng/thanos/pkg/store.(*indexCache).setPostings+0x11b			/go/src/github.com/improbable-eng/thanos/pkg/store/cache.go:142
#	0xe77230	github.com/improbable-eng/thanos/pkg/store.(*bucketIndexReader).fetchPostings.func3+0x440	/go/src/github.com/improbable-eng/thanos/pkg/store/bucket.go:1460
#	0xba6006	github.com/oklog/run.(*Group).Run.func1+0x26							/go/pkg/mod/github.com/oklog/run@v1.0.0/group.go:38

GiedriusS · 2019-04-23T12:07:40Z

Weird, I'm on master and I've added some more tests: GiedriusS@42449a8, and they pass without an issue. Could you think of another edge case that we are missing? 😞

bwplotka · 2019-04-23T12:11:44Z

Interesting thing is that we have 2 replicas, and one was fine, second was deadlocking. This indicates that certain order of requests matters?

devnev · 2019-04-23T12:19:01Z

Also looks like the thanos_store_index_cache_items_overflowed_total metric isn't being registered.

…cases. Fixes #651 Signed-off-by: Bartek Plotka <bwplotka@gmail.com>

…cases. Fixes #651 Current size also includes slice header. Signed-off-by: Bartek Plotka <bwplotka@gmail.com>

bwplotka · 2019-04-25T14:03:27Z

So this still happens (!) However does not affect us much, thanks for our safeguard code we just log error and overflow request if ensureFits loops more than 500 times.

Reopening as this still happens, but is workarounded. Is it simplelru package bug?

cc @mjd95 @devnev

bwplotka · 2019-04-29T13:04:14Z

We got those logs a lot:

level=error name=simstoregw ts=2019-04-29T12:24:43.387316778Z caller=cache.go:226 msg="LRU has nothing more to evict, but we still cannot allocate the item. Ignoring." maxItemSizeBytes=536870912 maxSizeBytes=1073741824 curSize=1073741770 itemSize=1996 cacheType=Postings

bwplotka · 2019-04-29T13:05:03Z

It's definitely in near max situation

bwplotka · 2019-04-30T10:15:24Z

Lol so curSize=1073741770 is clearly lying?

bwplotka · 2019-05-01T01:04:24Z

Deleted my comments regarding wrong assumption. Was trolled by bucket_e2e & swapping index cache. Fixed that part: #1098

Definitely cannot repro locally / with unit tests yet.

* query: cleanup store statuses as they come and go (thanos-io#910) Signed-off-by: Adrien Fillon <adrien.fillon@cdiscount.com> * [docs] Example of using official prometheus helm chart to deploy server with sidecar (thanos-io#1003) * update documentation with an example of using official prometheus helm chart Signed-off-by: Ivan Kiselev <ivan@messagebird.com> * a little formatting to values Signed-off-by: Ivan Kiselev <ivan@messagebird.com> * satisfy PR comments Signed-off-by: Ivan Kiselev <ivan@messagebird.com> * Compact: group concurrency (thanos-io#1010) * compact: add concurrency to group compact * add flag to controll the number of goroutines to use when compacting group * update compact.md for group-compact-concurrency * fixed: miss wg.Add() * address CR * regenerate docs * use err group * fix typo in flag description * handle context * set up workers in main loop * move var initialisation * remove debug log * validate concurrency * move comment * warn -> error * remove extra newline * fix typo * dns: Added miekgdns resolver as a hidden option to query and ruler. (thanos-io#1016) Fixes: thanos-io#1015 Signed-off-by: Bartek Plotka <bwplotka@gmail.com> * query: set default evaluation interval (thanos-io#1028) Subqueries allows request with no [specified resolution](https://prometheus.io/blog/2019/01/28/subquery-support/). Set it up and allow to configure default evaluation interval. * store+compactor: pre-compute index cache during compaction (thanos-io#986) Fixes first part of thanos-io#942 This changes allow to safe some startup & sync time in store gateway as it is no longer is needed to compute index-cache from block index on its own. For compatibility store GW still can do it, but it first checks bucket if there is index-cached uploaded already. In the same time, compactor precomputes the index cache file on every compaction. To allow quicker addition of index cache files we added `--index.generate-missing-cache-file` flag, that if enabled precompute missing files on compactor startup. Note that it will take time and it's only one-off step per bucket. Signed-off-by: Aleksei Semiglazov <xjewer@gmail.com> * Added website for Thanos' docs using Hugo. (thanos-io#807) Hosted in github pages. Signed-off-by: adrien-f <adrien.fillon@gmail.com> Signed-off-by: Bartek Plotka <bwplotka@gmail.com> * gcs: Fixed scopes for inline ServiceAccount option. (thanos-io#1033) Without this that option was unusable. Signed-off-by: Bartek Plotka <bwplotka@gmail.com> * Fixed root docs and liche is now checking root dir as well. (thanos-io#1040) Signed-off-by: Bartek Plotka <bwplotka@gmail.com> * storage docs: add detail about GCS policies and testing (thanos-io#1037) * add more details about GCS policies and testing * remove fixed names from exec command * Prometheus library updated to v2.8.1 (thanos-io#1009) * compact: group concurrency improvements (thanos-io#1029) * group concurrency improvements * remove unnecessary error check * add to wg in main goroutine * receive: Add block shipping (thanos-io#1011) * receive: Add retention flag for local tsdb storage (thanos-io#1046) * querier: Add /api/v1/labels support (thanos-io#905) * Feature: add /api/v1/labels support Signed-off-by: jojohappy <sarahdj0917@gmail.com> * Disabled gossip by default, marked all flags as deprecated. (thanos-io#1055) + changed small label. Signed-off-by: Bartek Plotka <bwplotka@gmail.com> * ruler: Fixed Chunk going out or Max Uint16. (thanos-io#1041) Fixes thanos-io#1038 Signed-off-by: Bartek Plotka <bwplotka@gmail.com> * store: azure: allow passing an endpoint parameter for specific regions (thanos-io#980) Fix thanos-io#968 Signed-off-by: Adrien Fillon <adrien.fillon@cdiscount.com> * feature: support POST method for series endpoint (thanos-io#1021) Signed-off-by: Joseph Lee <joseph.t.lee@outlook.com> * bucket verify: repair out of order labels (thanos-io#964) * bucket verify: repair out of order labels * verify repair: correctly order series in the index on rewrite When we have label sets that are not in the correct order, fixing that changes the order of the series in the index. So the index must be rewritten in that new order. This makes this repair tool take up a bunch more memory, but produces blocks that verify correctly. * Fix the TSDB block safe-delete function The directory name must be the block ID name exactly to verify. A temp directory or random name will not work here. * verify repair: fix duplicate chunk detection Pointer/reference logic error was eliminating all chunks for a series in a given TSDB block that wasn't the first chunk. Chunks are now referenced correctly via pointers. * PR feedback: use errors.Errorf() instead of fmt.Errorf() * Use errors.New() Some linters catch errors.Errorf() as its not really part of the errors package. * Liberally comment this for loop We're comparing items by pointers, using Go's range variables is misleading here and we need not fall into the same trap. * Take advantage of sort.Interface This prevents us from having to re-implement label sorting. * PR Feedback: Comments are full sentences. * Cut release 0.4.0-rc.0 (thanos-io#1017) * Cut release 0.4.0-rc.0 🎉 🎉 NOTE: This is last release that has gossip. Signed-off-by: Bartek Plotka <bwplotka@gmail.com> Co-Authored-By: povilasv <p.versockas@gmail.com> * Fixed crossbuild. Signed-off-by: Bartek Plotka <bwplotka@gmail.com> * ci: Env fixes. (thanos-io#1058) Signed-off-by: Bartek Plotka <bwplotka@gmail.com> * Removed bzr requirement for make crossbuild. Signed-off-by: Bartek Plotka <bwplotka@gmail.com> * Bump github.com/hashicorp/golang-lru from 0.5.0 to 0.5.1 (thanos-io#1051) Bumps [github.com/hashicorp/golang-lru](https://github.com/hashicorp/golang-lru) from 0.5.0 to 0.5.1. - [Release notes](https://github.com/hashicorp/golang-lru/releases) - [Commits](hashicorp/golang-lru@v0.5.0...v0.5.1) Signed-off-by: dependabot[bot] <support@dependabot.com> * Initialze and correctly register all index cache metrics. (thanos-io#1069) * store/cache: add more tests (thanos-io#1071) * Fixed Downsampling process; Fixed `runutil.CloseAndCaptureErr` (thanos-io#1070) * runutil. Simplified CloseWithErrCapture. Signed-off-by: Bartek Plotka <bwplotka@gmail.com> * Fixed Downsampling process; Fixed runutil.CloseAndCaptureErr Fixes thanos-io#1065 Root cause: * runutil defered capture error function was not passing error properly so unit tests were passing, event though there was bug * streamed block write index cache requires index file which was not closed (saved) properly yet. Closers need to be closed to perform this. Signed-off-by: Bartek Plotka <bwplotka@gmail.com> * objstore: Expose S3 region attribute (thanos-io#1060) Minio is able to autodetect the region for cloud providers like AWS but the logic fails with Scaleway Object Storage solution. Related issue on Minio: minio/mc#2570 * Fixed fetching go-bindata failed (thanos-io#1074) * Fixed bug: - fetching go-bindata failed. - change the repo of go-bindata to github.com/go-bindata/go-bindata, because old repo has been archived. - pin the go-bindata as v3.3.1. Signed-off-by: jojohappy <sarahdj0917@gmail.com> * Add CHANGELOG Signed-off-by: jojohappy <sarahdj0917@gmail.com> * Remove CHANGELOG Signed-off-by: jojohappy <sarahdj0917@gmail.com> * add compare flags func to compare flags between prometheus and sidecar (thanos-io#838) Original message: * update documentation for a max/min block duration add compare flags func to compare flags between prom and sidecar * fix some nits Functional change: now we check the configured flags (if possible) and error out if MinTime != MaxTime. We need to check this always since if that is not true then we will get overlapping blocks. Additionally, an error message is printed out if it is not equal to 2h (the recommended value). * Ensured index cache is best effort, refactored tests, validated edge cases. (thanos-io#1073) Fixes thanos-io#651 Current size also includes slice header. Signed-off-by: Bartek Plotka <bwplotka@gmail.com> * website: Moved to netlify. (thanos-io#1078) Signed-off-by: Bartek Plotka <bwplotka@gmail.com> * website: Fixing netlify. (thanos-io#1080) Signed-off-by: Bartek Plotka <bwplotka@gmail.com> * website: Added "founded by" footer. (thanos-io#1081) Signed-off-by: Bartek Plotka <bwplotka@gmail.com> * store/proxy: properly check if context has ended (thanos-io#1082) How the code was before it could happen that we might receive some series from the stream however by the time we'd send them back to the reader, it would not read it anymore since the deadline would have been exceeded. Properly use a `select` here to get out of the goroutine if the deadline has been exceeded. Might potentially fix a problem where we see one goroutine hanging constantly (and thus blocking from work being done): ``` goroutine profile: total 126 25 @ 0x42f62f 0x40502b 0x405001 0x404de5 0xe7435b 0x45cc41 0xe7435a github.com/improbable-eng/thanos/pkg/store.startStreamSeriesSet.func1+0x18a /go/src/github.com/improbable-eng/thanos/pkg/store/proxy.go:318 ``` * Cut release v0.4.0-rc.1 (thanos-io#1088) Signed-off-by: Bartek Plotka <bwplotka@gmail.com> * website: Removed ghpages handling; fixed docs; and status badge. (thanos-io#1084) Signed-off-by: Bartek Plotka <bwplotka@gmail.com> * Fix readme (thanos-io#1090) * store: Compose indexCache properly allowing injection for testing purposes. (thanos-io#1098) Signed-off-by: Bartek Plotka <bwplotka@gmail.com> * website: add sponsor section on homepage (thanos-io#1062) * website: Adjusted logos sizing and responsiveness. (thanos-io#1105) Signed-off-by: Bartek Plotka <bwplotka@gmail.com> * Add Monzo to "Used by" section 🎉 (thanos-io#1106) * Compactor: remove malformed blocks after delay (thanos-io#1053) * compactor removes malformed blocks after delay * compactor removes malformed blocks after delay * include missing file * reuse existing freshness check * fix comment * remove unused var * fix comment * syncDelay -> consistencyDelay * fix comment * update flag description * address cr * fix dupliacte error handling * minimum value for --consistency-delay * update * docs * add test case * move test to inmem bucket * Add Utility Warehouse to "used by" section (thanos-io#1108) * Add Utility Warehouse logo * Make logo smaller * website: add Adform as users (thanos-io#1109) We use Thanos extensively as well so I have added Adform. * Cut release v0.4.0 (thanos-io#1107) Signed-off-by: Bartek Plotka <bwplotka@gmail.com>

bwplotka added the bug label Nov 30, 2018

bwplotka added difficulty: medium help wanted priority: P1 priority: P0 and removed priority: P1 labels Feb 1, 2019

GiedriusS mentioned this issue Feb 1, 2019

LRU cache ensureFits() tweaks WRT deadlocking #796

Merged

bwplotka closed this as completed Feb 1, 2019

bwplotka reopened this Apr 23, 2019

bwplotka added a commit that referenced this issue Apr 23, 2019

Ensured index cache is best effort, refactored tests, validated edge …

25aecef

…cases. Fixes #651 Signed-off-by: Bartek Plotka <bwplotka@gmail.com>

bwplotka mentioned this issue Apr 23, 2019

Ensured index cache is best effort, refactored tests, validated edge cases. #1073

Merged

bwplotka added a commit that referenced this issue Apr 23, 2019

Ensured index cache is best effort, refactored tests, validated edge …

9d89257

…cases. Fixes #651 Signed-off-by: Bartek Plotka <bwplotka@gmail.com>

bwplotka added a commit that referenced this issue Apr 23, 2019

Ensured index cache is best effort, refactored tests, validated edge …

b987b50

…cases. Fixes #651 Signed-off-by: Bartek Plotka <bwplotka@gmail.com>

bwplotka added a commit that referenced this issue Apr 23, 2019

Ensured index cache is best effort, refactored tests, validated edge …

828a81e

…cases. Fixes #651 Signed-off-by: Bartek Plotka <bwplotka@gmail.com>

bwplotka added a commit that referenced this issue Apr 23, 2019

Ensured index cache is best effort, refactored tests, validated edge …

968f83f

…cases. Fixes #651 Current size also includes slice header. Signed-off-by: Bartek Plotka <bwplotka@gmail.com>

bwplotka added a commit that referenced this issue Apr 23, 2019

Ensured index cache is best effort, refactored tests, validated edge …

e4b1541

…cases. Fixes #651 Current size also includes slice header. Signed-off-by: Bartek Plotka <bwplotka@gmail.com>

bwplotka added a commit that referenced this issue Apr 23, 2019

Ensured index cache is best effort, refactored tests, validated edge …

be539e1

…cases. Fixes #651 Current size also includes slice header. Signed-off-by: Bartek Plotka <bwplotka@gmail.com>

bwplotka added a commit that referenced this issue Apr 24, 2019

Ensured index cache is best effort, refactored tests, validated edge …

a8841e5

…cases. Fixes #651 Current size also includes slice header. Signed-off-by: Bartek Plotka <bwplotka@gmail.com>

bwplotka closed this as completed in #1073 Apr 24, 2019

bwplotka reopened this Apr 25, 2019

povilasv closed this as completed May 7, 2019

abursavich mentioned this issue May 14, 2019

store/cache: drop updates to existing items and dedupe get/set methods #1142

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

store: sometimes pkg/store.(*indexCache).ensureFits spins forever, blocking other goroutines #651

store: sometimes pkg/store.(*indexCache).ensureFits spins forever, blocking other goroutines #651

GiedriusS commented Nov 30, 2018 •

edited

Loading

bwplotka commented Nov 30, 2018

GiedriusS commented Jan 30, 2019

GiedriusS commented Feb 1, 2019

bwplotka commented Feb 1, 2019 •

edited

Loading

bwplotka commented Feb 1, 2019

bwplotka commented Apr 23, 2019

bwplotka commented Apr 23, 2019

devnev commented Apr 23, 2019

GiedriusS commented Apr 23, 2019

bwplotka commented Apr 23, 2019

devnev commented Apr 23, 2019

bwplotka commented Apr 25, 2019 •

edited

Loading

bwplotka commented Apr 29, 2019

bwplotka commented Apr 29, 2019

bwplotka commented Apr 30, 2019

bwplotka commented May 1, 2019

store: sometimes pkg/store.(*indexCache).ensureFits spins forever, blocking other goroutines #651

store: sometimes pkg/store.(*indexCache).ensureFits spins forever, blocking other goroutines #651

Comments

GiedriusS commented Nov 30, 2018 • edited Loading

bwplotka commented Nov 30, 2018

GiedriusS commented Jan 30, 2019

GiedriusS commented Feb 1, 2019

bwplotka commented Feb 1, 2019 • edited Loading

bwplotka commented Feb 1, 2019

bwplotka commented Apr 23, 2019

bwplotka commented Apr 23, 2019

devnev commented Apr 23, 2019

GiedriusS commented Apr 23, 2019

bwplotka commented Apr 23, 2019

devnev commented Apr 23, 2019

bwplotka commented Apr 25, 2019 • edited Loading

bwplotka commented Apr 29, 2019

bwplotka commented Apr 29, 2019

bwplotka commented Apr 30, 2019

bwplotka commented May 1, 2019

GiedriusS commented Nov 30, 2018 •

edited

Loading

bwplotka commented Feb 1, 2019 •

edited

Loading

bwplotka commented Apr 25, 2019 •

edited

Loading