Compact: crashes when finding a sample with out of order labels #5497

Closed
douglascamata opened this issue Jul 13, 2022 · 5 comments

Comments

@douglascamata
Contributor

Thanos: v0.27.0-rc.0

What happened:

Thanos Compact found a posting with out-of-order labels and went into a crash loop.

What you expected to happen:

Thanos Compact finds a posting with out of order labels and then:

  • Fixes it before compacting if possible.
  • Otherwise, ignores it.
  • Logs at warning/critical level that this happened.
  • Exports a metric to track this error so that alerts can be created to trigger when this happens a lot.

Most importantly, I would love Compact to not completely crash and stop doing its job.

How to reproduce it (as minimally and precisely as possible):

Have a block with a posting containing out-of-order labels and try to compact it.

Full logs to relevant components:

Logs

level=warn ts=2022-07-13T13:09:33.180528476Z caller=index.go:267 msg="out-of-order label set: known bug in Prometheus 2.8.0 and below" labelset="{_id=\"test\", __name__=\"rhobs_e2e\"}" series=244977
level=warn ts=2022-07-13T13:09:33.180773659Z caller=intrumentation.go:67 msg="changing probe status" status=not-ready reason="error executing compaction: first pass of downsampling failed: downsampling to 5 min: input block index not valid: index contains 1 postings with out of order labels"
level=info ts=2022-07-13T13:09:33.180794468Z caller=http.go:84 service=http/server component=compact msg="internal server is shutting down" err="error executing compaction: first pass of downsampling failed: downsampling to 5 min: input block index not valid: index contains 1 postings with out of order labels"
level=info ts=2022-07-13T13:09:33.182943478Z caller=http.go:103 service=http/server component=compact msg="internal server is shutdown gracefully" err="error executing compaction: first pass of downsampling failed: downsampling to 5 min: input block index not valid: index contains 1 postings with out of order labels"
level=info ts=2022-07-13T13:09:33.18296634Z caller=intrumentation.go:81 msg="changing probe status" status=not-healthy reason="error executing compaction: first pass of downsampling failed: downsampling to 5 min: input block index not valid: index contains 1 postings with out of order labels"
level=error ts=2022-07-13T13:09:33.183061471Z caller=main.go:158 err="downsampling to 5 min: input block index not valid: index contains 1 postings with out of order labels\nfirst pass of downsampling failed\nmain.runCompact.func7\n\t/app/cmd/thanos/compact.go:440\nmain.runCompact.func8.1\n\t/app/cmd/thanos/compact.go:476\ngithub.com/thanos-io/thanos/pkg/runutil.Repeat\n\t/app/pkg/runutil/runutil.go:75\nmain.runCompact.func8\n\t/app/cmd/thanos/compact.go:475\ngithub.com/oklog/run.(*Group).Run.func1\n\t/go/pkg/mod/github.com/oklog/run@v1.1.0/group.go:38\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1581\nerror executing compaction\nmain.runCompact.func8.1\n\t/app/cmd/thanos/compact.go:503\ngithub.com/thanos-io/thanos/pkg/runutil.Repeat\n\t/app/pkg/runutil/runutil.go:75\nmain.runCompact.func8\n\t/app/cmd/thanos/compact.go:475\ngithub.com/oklog/run.(*Group).Run.func1\n\t/go/pkg/mod/github.com/oklog/run@v1.1.0/group.go:38\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1581\ncompact command...

Anything else we need to know:

The metric was written via Receive, so it looks like Receive does not validate label ordering.
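
As a rough idea of what such a validation could check, here is a minimal sketch. `Label` and `labelsSorted` are hypothetical names for this example and are not the Prometheus or Thanos types; the sketch assumes label names are expected in ascending byte-wise order.

```go
package main

import "sort"

// Label is a hypothetical stand-in for a label name/value pair.
type Label struct {
	Name  string
	Value string
}

// labelsSorted reports whether the labels are in ascending byte-wise
// order by name. Equal adjacent names are not flagged by this check.
func labelsSorted(ls []Label) bool {
	return sort.SliceIsSorted(ls, func(i, j int) bool {
		return ls[i].Name < ls[j].Name
	})
}
```

Applied to the label set from the log above, `{_id="test", __name__="rhobs_e2e"}`, the check fails because `__name__` sorts before `_id` (the second byte `_` is lower than `i`).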

@matej-g
Collaborator

matej-g commented Jul 13, 2022

I'm wondering why the compactor does not halt in this case 🤔 Someone with more compact knowledge, @bwplotka @yeya24 @GiedriusS?

Apart from that, fixing manually should be an option with bucket verify (#964)

@yeya24
Contributor

yeya24 commented Jul 13, 2022

Halt only happens at the compaction stage. This one happens during downsampling so no halting if I understand correctly.
I am also wondering why labels are not sorted when TSDB persists the head block to disk. If that's by design, then we are required to sort labels at ingestion time.
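
Sorting at ingestion time could be as simple as the sketch below. `Label` and `sortedCopy` are hypothetical names chosen for this example, not the actual Receive code path, and the ordering assumed is ascending byte-wise by label name.

```go
package main

import "sort"

// Label is a hypothetical stand-in for a label name/value pair.
type Label struct {
	Name  string
	Value string
}

// sortedCopy returns a copy of ls ordered by label name, leaving the
// caller's slice untouched.
func sortedCopy(ls []Label) []Label {
	out := make([]Label, len(ls))
	copy(out, ls)
	sort.Slice(out, func(i, j int) bool {
		return out[i].Name < out[j].Name
	})
	return out
}
```

Copying first avoids mutating a slice the caller may still be using; sorting in place would save the allocation if that is not a concern.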

@matej-g
Collaborator

matej-g commented Jul 14, 2022

Thanks for the pointers @yeya24, I realize now that this actually happens in the downsampling phase and not in compaction. Since we had the debug.accept-malformed-index flag enabled, compaction went through, but now we have an 'inconsistency': downsampling has no option to ignore a malformed index, so it always errors out and crashes the compactor on that error.

Luckily we hit this with some test metrics which we don't really need and we can just delete the offending block. But I'm wondering what would be a better course of action, to not get compactor into crash loop (besides ensuring ordering at ingestion time, which I'll look at in #5499).

@whc9527

whc9527 commented Aug 25, 2022

@matej-g Hello,
I deployed the Thanos Compact component in Kubernetes. After a crash, Kubernetes restarts the pod. Will the restart continue from where compaction last left off, or will it start over and fall into an infinite loop?

@matej-g
Collaborator

matej-g commented Sep 16, 2022

This is now resolved via #5690

matej-g closed this as completed Sep 16, 2022