Receiver: Metrics with out-of-order labels get accepted #5499

Closed

matej-g opened this issue Jul 13, 2022 · 15 comments · Fixed by #5508
Comments

matej-g (Collaborator) commented Jul 13, 2022

Thanos, Prometheus and Golang version used:
Latest Thanos (0.27.0)

What happened:
Receiver happily accepts metrics with out-of-order labels. By itself this seemed acceptable (i.e. TSDB will append them without complaining), but it causes issues in the compactor, which expects labels to be in order and ends up in a crash loop.

This is similar to the situation that happened previously with prometheus/prometheus#5372. Although we're in remote write land and can assume that in most cases we are getting metrics from 'official' clients that adhere to the label ordering, this cannot be guaranteed for all clients (as evidenced by our related issue #5497, which we caused by running some custom tests with an 'unofficial' client from https://github.com/observatorium/up).

What you expected to happen:
I would expect the receiver to handle this, at least in a minimal fashion, to prevent potential issues with compactor down the line. Possible options:

  • Silently fix label ordering
  • Warn about wrong label order
  • Both of the above
  • Refuse the request as a bad request
  • Make this configurable
douglascamata (Contributor) commented:

Can Receive know if the labels are out of order without having to effectively copy, sort the copy, and then compare to the original? 🤔
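For illustration, a minimal sketch (not Thanos code) of how a receiver could answer that question in a single O(n) pass, with no copy and no sort. The `Label` type and `labelsStrictlySorted` name are stand-ins, assuming labels arrive as a slice of name/value pairs as in the remote write protobuf:

```go
package main

import "fmt"

// Label is a minimal stand-in for the remote write label type
// (e.g. prompb.Label); defined here only to keep the sketch self-contained.
type Label struct {
	Name  string
	Value string
}

// labelsStrictlySorted reports whether label names are in strictly increasing
// order, i.e. sorted and free of duplicate names, in one pass and without
// copying or sorting anything.
func labelsStrictlySorted(lset []Label) bool {
	for i := 1; i < len(lset); i++ {
		if lset[i].Name <= lset[i-1].Name {
			return false // out of order, or a duplicate label name
		}
	}
	return true
}

func main() {
	ok := []Label{{"__name__", "up"}, {"instance", "a"}, {"job", "b"}}
	bad := []Label{{"job", "b"}, {"instance", "a"}}
	fmt.Println(labelsStrictlySorted(ok), labelsStrictlySorted(bad)) // true false
}
```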

matej-g (Collaborator, author) commented Jul 15, 2022

After some more thought and taking a look at the remote write spec, I feel we should not be condoning clients' bad behavior, so instead of trying to fix wrong labels I'd suggest dropping series with invalid labels from the request and handling this as a bad request error. I gave it a stab in #5508.
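To make that flow concrete, a hedged sketch of a handler that drops series with unsorted labels and answers with 400. This is not the #5508 implementation; the names and types (handleWrite, TimeSeries, ingest) are stand-ins for illustration only:

```go
package receiveexample

import (
	"fmt"
	"net/http"
)

// Label and TimeSeries are stand-ins for the remote write protobuf types
// (prompb.Label / prompb.TimeSeries); hypothetical, for illustration only.
type Label struct{ Name, Value string }
type TimeSeries struct{ Labels []Label }

func labelsStrictlySorted(lset []Label) bool {
	for i := 1; i < len(lset); i++ {
		if lset[i].Name <= lset[i-1].Name {
			return false
		}
	}
	return true
}

// handleWrite sketches the approach described above: series whose labels are
// out of order (or duplicated) are dropped, the rest are ingested, and if
// anything was dropped the whole request is answered with 400 Bad Request so
// a spec-compliant client treats it as unrecoverable and does not retry.
func handleWrite(w http.ResponseWriter, series []TimeSeries, ingest func([]TimeSeries)) {
	var kept []TimeSeries
	dropped := 0
	for _, ts := range series {
		if !labelsStrictlySorted(ts.Labels) {
			dropped++
			continue
		}
		kept = append(kept, ts)
	}
	ingest(kept)
	if dropped > 0 {
		http.Error(w, fmt.Sprintf("%d series dropped: labels not sorted", dropped), http.StatusBadRequest)
		return
	}
	w.WriteHeader(http.StatusNoContent)
}
```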

douglascamata (Contributor) commented:

@matej-g so we drop the series with bad labels, accept the others with good labels, and return a bad request? How does the Prometheus client react to a bad request? I am afraid it could send the same request again, duplicating the series with good labels.

matej-g (Collaborator, author) commented Jul 15, 2022

Prometheus considers bad requests to be unrecoverable errors (unlike e.g. server errors), so it should not retry on such a response. This is equivalent to how we handle requests with invalid samples or exemplars. I tried to describe it in more depth in the PR description.
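As a rough illustration of that retry rule (a simplification, not Prometheus' actual remote write queue code; real clients add nuances such as optional retries on 429):

```go
package remotewriteexample

// retryable sketches the behaviour described above: 5xx responses are retried
// with backoff, while 4xx responses such as 400 Bad Request are treated as
// unrecoverable, so the batch is dropped rather than resent and the series
// with good labels are not duplicated.
func retryable(statusCode int) bool {
	return statusCode/100 == 5
}
```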

douglascamata (Contributor) commented:

Awesome. It sounds like a good flow. Thanks for the explanation. 🙇

hanem100k commented Apr 9, 2024

Uuh, Google does not like the suffix of this; I found a lot of issues about out-of-order samples, but this one with labels was buried :)

Running 0.34 thanos-receivers, and it does work according to the above: it drops the requests :)

[screenshot omitted]

I'm sending writes from Prometheus instances; however, the remote_write config doesn't seem to have an option to send sorted labels, which is weird because the spec explicitly states that they should be sorted. It might make sense to provide the possibility to sort before egressing, or to sort inside the receiver before ingesting?

Either way, is there a way I can explicitly sort? Or just labelkeep blocks in order? @matej-g @douglascamata

Thanks!
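For reference, a minimal sketch of what the "sort before ingesting" option mentioned above would look like on the receiver side. This is an editorial illustration, not Thanos behaviour (the receiver rejects such series rather than fixing them), and the Label type is a hypothetical stand-in; note that sorting alone would not resolve duplicate label names:

```go
package receiveexample

import "sort"

// Label is a stand-in for the remote write label type; hypothetical.
type Label struct{ Name, Value string }

// sortLabels orders labels by name in place, which is what "sort before
// ingesting" on the receiver side would amount to. It does not deduplicate,
// so duplicate label names (seen later in this thread) would still slip through.
func sortLabels(lset []Label) {
	sort.Slice(lset, func(i, j int) bool { return lset[i].Name < lset[j].Name })
}
```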

MichaHoffmann (Contributor) commented:

Oh, I don't think Prometheus should write out-of-order labels in remote write; is there something else writing?

hanem100k commented:

> Oh, I don't think Prometheus should write out-of-order labels in remote write; is there something else writing?

It's only Prometheuses, hundreds of them. Most of them are on version 2.50, with a couple still on 2.26; let me see if the old ones are causing the issue.

MichaHoffmann (Contributor) commented:

Can you run the receivers with debug logging? I think they should log the offending labels then, IIRC.

hanem100k commented:

Checked the versions:

[screenshot of version counts omitted]

We run quite a few, unfortunately, though it doesn't seem to be version specific; the newest is having the same issues.

> Can you run the receivers with debug logging? I think they should log the offending labels then, IIRC.

Yes, they do log it, though it is very spammy; we're having this issue all over the place, hence wanting to defer manual labelkeep :)

MichaHoffmann (Contributor) commented:

Out of curiosity, how do the rejected label sets look?

hanem100k commented Apr 9, 2024

[screenshot of the rejected label set omitted]

*Hand-edited values, might have syntax errors.

Edit: added screenshot, text was rendered very weird

MichaHoffmann (Contributor) commented:

That label set also has a couple of duplicate labels after rule_group! That's really weird.

hanem100k commented Apr 9, 2024

It is really weird, because checking the values, those are different; it's even a different pod, from a different node.

One is RISC, the other is CISC.

It's like two long label strings were randomly concatenated.

hanem100k commented:

Looking at another one:

[screenshot of another rejected label set omitted]

This is not sorted to begin with, though it duplicates again, even the node.

Will debug tomorrow, but I think the logging might be wrong (we are currently consuming these with another layer of big Prometheus instances and did not notice any label duplication or weirdness like the above; of course, those are scraping).
