
With hierarchical query instances receiving "dropping store, external labels are not unique" messages #1337

Closed
PsychoSid opened this issue Jul 18, 2019 · 6 comments · Fixed by #1338


@PsychoSid

We have a hierarchical Thanos setup which pulls from multiple k8s clusters. With v0.5 this worked OK. Upon upgrading to v0.6 the top-level query nodes no longer return any data.

Messages received:

Jul 18 05:54:52 lpposput50201.example.com thanos_query[12446]: level=warn ts=2019-07-18T12:54:52.14074902Z caller=storeset.go:252 component=storeset msg="dropping store, external labels are not unique" address=lpdospeu50701.example.com:19094 extLset="{cluster=\"v4/customer/paas/e1\",environment=\"e1_blue\",replica=\"lpdosput50627\"},{cluster=\"v4/customer/paas/e1\",environment=\"e1_blue\",replica=\"lpdosput50628\"}" duplicates=2

The daemons are started with `--query.replica-label='replica'`.
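For context, a top-level Querier in a hierarchy like this is typically started roughly as follows. This is only a sketch; the addresses, port numbers, and hostnames below are illustrative and not taken from this issue:

```bash
# Top-level Thanos Querier: its --store targets are the gRPC endpoints of
# lower-level Queriers rather than Prometheus sidecars or store gateways.
# Hostnames and ports here are hypothetical examples.
thanos query \
  --http-address="0.0.0.0:19192" \
  --grpc-address="0.0.0.0:19093" \
  --query.replica-label="replica" \
  --store="leaf-querier-a.example.com:19094" \
  --store="leaf-querier-b.example.com:19094"
```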

Debug messages show:

Jul 18 06:03:21 lpposput50202 thanos_query: level=debug ts=2019-07-18T13:03:21.438543301Z caller=storeset.go:223 component=storeset msg="updating healthy stores" externalLabelOccurrencesInStores="map[string]int{\"{cluster=\\\"v4/customer/paas/e1\\\",environment=\\\"e1_blue\\\",replica=\\\"lpdosput50627\\\"},{cluster=\\\"v4/customer/paas/e1\\\",environment=\\\"e1_blue\\\",replica=\\\"lpdosput50628\\\"}\":2, \"{cluster=\\\"v4/customer/paas/e2\\\",environment=\\\"e2_blue\\\",replica=\\\"lpqosput50217\\\"},{cluster=\\\"v4/customer/paas/e2\\\",environment=\\\"e2_blue\\\",replica=\\\"lpqosput50218\\\"}\":2, \"{cluster=\\\"v4/customer/paas/e3_ipc1\\\",environment=\\\"e3_ipc1_blue\\\",replica=\\\"lppospeu50376\\\"},{cluster=\\\"v4/customer/paas/e3_ipc1\\\",environment=\\\"e3_ipc1_blue\\\",replica=\\\"lppospeu50377\\\"}\":2, \"{cluster=\\\"v4/customer/paas/e3_ipc2\\\",environment=\\\"e3_ipc2_blue\\\",replica=\\\"lgposput60242\\\"},{cluster=\\\"v4/customer/paas/e3_ipc2\\\",environment=\\\"e3_ipc2_blue\\\",replica=\\\"lgposput60243\\\"}\":2, \"{cluster=\\\"v4/customer/paas/e3_ipc2_r2\\\",environment=\\\"e3_ipc2_blue_r2\\\",replica=\\\"lgpospeu60166\\\"},{cluster=\\\"v4/customer/paas/e3_ipc2_r2\\\",environment=\\\"e3_ipc2_blue_r2\\\",replica=\\\"lgpospeu60168\\\"}\":2}"

This appears to be related to change 05f81a5.

After reverting the top-level query nodes to v0.5, data is returned again.

@GiedriusS
Member

More context:

Giedrius Statkevičius [16 minutes ago]
This probably happens now because all of the labels of the stores that one of your lower-level Thanos Query nodes has "bubble up" to the higher level, and at the top you end up with two of the same. Does this sound like the setup you have?

Paul Seymour [16 minutes ago]
Very much so.

Not many ideas pop into my mind, but I think we could keep the first store up and then drop all of the other duplicates. Not sure if we can avoid such a situation in this use case if Thanos Query keeps working like it does in 0.6.0.

Thoughts?

@bwplotka
Member

What's your setup again? I am still missing what is happening here and why you end up with the same external labels.

@bwplotka
Member

Do you maybe have 2 replicas of Querier pointed directly at the same underlying stores?

map[string]int{
	"{cluster=\"v4/customer/paas/e1\",environment=\"e1_blue\",replica=\"lpdosput50627\"},{cluster=\"v4/customer/paas/e1\",environment=\"e1_blue\",replica=\"lpdosput50628\"}": 2,
	"{cluster=\"v4/customer/paas/e2\",environment=\"e2_blue\",replica=\"lpqosput50217\"},{cluster=\"v4/customer/paas/e2\",environment=\"e2_blue\",replica=\"lpqosput50218\"}": 2,
	"{cluster=\"v4/customer/paas/e3_ipc1\",environment=\"e3_ipc1_blue\",replica=\"lppospeu50376\"},{cluster=\"v4/customer/paas/e3_ipc1\",environment=\"e3_ipc1_blue\",replica=\"lppospeu50377\"}": 2,
	"{cluster=\"v4/customer/paas/e3_ipc2\",environment=\"e3_ipc2_blue\",replica=\"lgposput60242\"},{cluster=\"v4/customer/paas/e3_ipc2\",environment=\"e3_ipc2_blue\",replica=\"lgposput60243\"}": 2,
	"{cluster=\"v4/customer/paas/e3_ipc2_r2\",environment=\"e3_ipc2_blue_r2\",replica=\"lgpospeu60166\"},{cluster=\"v4/customer/paas/e3_ipc2_r2\",environment=\"e3_ipc2_blue_r2\",replica=\"lgpospeu60168\"}": 2,
}

It looks like it.

@bwplotka
Member

Replica label is fine - it is never excluded in dedup.

@bwplotka
Member

bwplotka  [7 minutes ago]
I think the problem is different, but @Paul Seymour please confirm - are you maybe connecting the high-level Querier directly to multiple leaf replica Queriers (that have the same data)?

Paul Seymour  [5 minutes ago]
Yes I am; there are duplicate pairs of Prometheus nodes, and Thanos Storage/Query nodes downstream.

bwplotka  [5 minutes ago]
That's the wrong approach then -> you just duplicate queries in your setup. You should just put a load balancer on top of the same Queriers :slightly_smiling_face:

bwplotka  [4 minutes ago]
But in this error case we probably would need to at least not drop all of them

bwplotka  [4 minutes ago]
but just warn

bwplotka  [4 minutes ago]
so I am fixing that

Giedrius Statkevičius  [3 minutes ago]
Yeah, that seems to be the solution to me as well, or you could add different selector labels **but** that would mean 2x the load, which is really unnecessary :slightly_smiling_face:

Paul Seymour  [3 minutes ago]
They are load balanced by nginx but the query nodes reference all the storage nodes.

bwplotka  [2 minutes ago]
We still need to solve round-robin load balancing: https://github.com/improbable-eng/thanos/issues/1083

bwplotka  [2 minutes ago]
> They are load balanced by nginx but the query nodes reference all the storage nodes.

bwplotka  [1 minute ago]
hm

bwplotka  [1 minute ago]
Can you give a diagram of that?

bwplotka  [< 1 minute ago]
Is it a "diamond" Querier problem then? :slightly_smiling_face: So the top Querier points to 2 nginx instances that point to different Queriers, and those access all stores? (edited)


bwplotka  [< 1 minute ago]
That's doubling your traffic indeed

bwplotka  [< 1 minute ago]
(if that's the case)

@bwplotka
Member

Discussed offline and confirmed that is the case.

For example, one domain points to one Querier for `e2_blue`, and there is probably another domain (`lpdospeuXXX`) that points to another Querier in the same `e2_blue` cluster.
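For illustration, the confirmed "diamond" topology looks roughly like this (names are placeholders; because both leaf Queriers fan out to the same stores, the top level sees every external-label set twice):

```
                top-level Querier
                 /             \
          nginx LB (A)     nginx LB (B)
               |                 |
        leaf Querier A    leaf Querier B
                 \             /
          same Sidecars / Store gateways
```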

The plan is to:

  • Warn in this case, as you double/triple/10x the query load depending on the number of Querier replicas in this configuration.
  • Educate/show how to do client LB or server LB. For the client one, essentially you can give a Querier a domain that returns a target for each replica of Querier in the same cluster, then put that domain as the target in Querier (`--store`) and you're done (see the sketch after this list).
  • Keep one store to not break users with this setup.
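A minimal sketch of the client-LB option, assuming a DNS name exists that resolves to every leaf Querier replica in a cluster (the domain below is hypothetical, not taken from this issue):

```bash
# Client-side LB sketch: a plain DNS name (no SD prefix) that resolves to all
# leaf Querier replicas is passed as a single --store target. The top-level
# Querier then sees one StoreAPI endpoint per cluster instead of one per
# replica, and connection balancing is left to gRPC/DNS.
# querier.e2-blue.example.com is a made-up name for illustration.
thanos query \
  --query.replica-label="replica" \
  --store="querier.e2-blue.example.com:19094"
```

Note that, per the linked issue #1083, proper round-robin balancing on the gRPC side is still an open item, so server-side LB (e.g. the nginx setup already in place) remains the alternative.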

bwplotka added a commit that referenced this issue Jul 18, 2019
Fixes #1337

Signed-off-by: Bartek Plotka <bwplotka@gmail.com>
bwplotka added a commit that referenced this issue Jul 18, 2019
* querier: Allows single store in case of duplicates. Drop others.

Fixes #1337

Signed-off-by: Bartek Plotka <bwplotka@gmail.com>

* Updated CHANGELOG.

Signed-off-by: Bartek Plotka <bwplotka@gmail.com>