
Compactor/S3: sometimes segments are not being fully downloaded ("segment doesn't include enough bytes") #2805

Open
JensErat opened this issue Jun 25, 2020 · 16 comments

@JensErat (Contributor) commented Jun 25, 2020

Thanos, Prometheus and Golang version used:

Thanos 0.13 rc2; currently upgrading to 0.13, but the commits do not indicate any relevant change since the tag was cut.

Object Storage Provider:

Internal Cloudian S3 service

What happened:

Compactor halts with the following error message:

level=error ts=2020-06-25T07:57:55.183632304Z caller=compact.go:375 msg="critical error detected; halting" err="compaction: group 0@9955244503410235132: compact blocks [/var/thanos/compact/data/compact/0@9955244503410235132/01EAQGV34VQ8PWVKXJ4V2K8M5B /var/thanos/compact/data/compact/0@9955244503410235132/01EAQQPTCT5ABJSXRBMG53V8ZC /var/thanos/compact/data/compact/0@9955244503410235132/01EAQYJHMZ9R0V8M5Y3GPG0NWD /var/thanos/compact/data/compact/0@9955244503410235132/01EAR5E8WTX7MJP7PG1DQMPY4W]: write compaction: iterate compaction set: chunk 41943009 not found: segment doesn't include enough bytes to read the chunk - required:41943044, available:41943040"

After restarting the compactor, compaction continues successfully (usually hours later; we only get alerted when the compactor has not run for quite some time, so I cannot tell whether this is a very short inconsistency window combined with a lot of bad luck).

What you expected to happen:

If this really is a transient issue caused by eventual consistency, retry instead of halting.

How to reproduce it (as minimally and precisely as possible):

Slightly overload your S3 service and wait. We see an estimated failure rate of about once a week for a single compactor. It mostly happens in installations with some network distance (ping >= 200 ms).

Full logs to relevant components:

Logs

[...]
level=info ts=2020-06-25T07:42:56.091077077Z caller=block.go:157 group="0@{monitor=\"prometheus\", replica=\"10\"}" groupKey=0@7398988455995530398 msg="block has been marked for deletion" block=01EAQYJKPJS17YPE28VZ48JF7Q
level=info ts=2020-06-25T07:42:56.11841702Z caller=compact.go:792 group="0@{monitor=\"prometheus\", replica=\"10\"}" groupKey=0@7398988455995530398 msg="marking compacted block for deletion" old_block=01EAR5EAY80Z79JCDWQBAVVGJ4
level=info ts=2020-06-25T07:42:56.358365126Z caller=block.go:157 group="0@{monitor=\"prometheus\", replica=\"10\"}" groupKey=0@7398988455995530398 msg="block has been marked for deletion" block=01EAR5EAY80Z79JCDWQBAVVGJ4
level=info ts=2020-06-25T07:42:57.695050512Z caller=compact.go:633 group="0@{monitor=\"prometheus-ha\", replica=\"0\"}" groupKey=0@9955244503410235132 msg="compaction available and planned; downloading blocks" plan="[/var/thanos/compact/data/compact/0@9955244503410235132/01EAQGV34VQ8PWVKXJ4V2K8M5B /var/thanos/compact/data/compact/0@9955244503410235132/01EAQQPTCT5ABJSXRBMG53V8ZC /var/thanos/compact/data/compact/0@9955244503410235132/01EAQYJHMZ9R0V8M5Y3GPG0NWD /var/thanos/compact/data/compact/0@9955244503410235132/01EAR5E8WTX7MJP7PG1DQMPY4W]"
level=info ts=2020-06-25T07:55:17.496533315Z caller=compact.go:692 group="0@{monitor=\"prometheus-ha\", replica=\"0\"}" groupKey=0@9955244503410235132 msg="downloaded and verified blocks; compacting blocks" plan="[/var/thanos/compact/data/compact/0@9955244503410235132/01EAQGV34VQ8PWVKXJ4V2K8M5B /var/thanos/compact/data/compact/0@9955244503410235132/01EAQQPTCT5ABJSXRBMG53V8ZC /var/thanos/compact/data/compact/0@9955244503410235132/01EAQYJHMZ9R0V8M5Y3GPG0NWD /var/thanos/compact/data/compact/0@9955244503410235132/01EAR5E8WTX7MJP7PG1DQMPY4W]" duration=12m19.801396575s
level=error ts=2020-06-25T07:57:55.183632304Z caller=compact.go:375 msg="critical error detected; halting" err="compaction: group 0@9955244503410235132: compact blocks [/var/thanos/compact/data/compact/0@9955244503410235132/01EAQGV34VQ8PWVKXJ4V2K8M5B /var/thanos/compact/data/compact/0@9955244503410235132/01EAQQPTCT5ABJSXRBMG53V8ZC /var/thanos/compact/data/compact/0@9955244503410235132/01EAQYJHMZ9R0V8M5Y3GPG0NWD /var/thanos/compact/data/compact/0@9955244503410235132/01EAR5E8WTX7MJP7PG1DQMPY4W]: write compaction: iterate compaction set: chunk 41943009 not found: segment doesn't include enough bytes to read the chunk - required:41943044, available:41943040"

Anything else we need to know:

This is pretty certainly related to #2128; that issue, prometheus/prometheus#6040, and our own observations make me think this might be a transient issue. It is hard to tell, since we cannot really reproduce it. Is #2637 maybe fixing this already?

@GiedriusS (Member)

This is not a retryable error: when it occurs, compaction simply does not have enough data on disk.

We have been running into this for the past month or so, but only in our staging environment. With #2637 we have found that:

  • One chunk file is missing some data during compaction
  • The content of the affected file that is on disk is correct, i.e. there is no corruption going on; the file is only missing some data
  • The calls to S3 succeed, i.e. a 200 is returned to minio-go
  • We have load balancers in front of our S3 endpoints, and changing the configuration to go to the Ceph Object Gateway directly seems to fix the problem; it has been running for a week without issues.

My best theory, for now, is that the transfer gets cut off abruptly, and minio-go eats that error up as io.EOF. It then thinks it got everything, and the caller continues its execution. I think it has something to do with keep-alive connections not being maintained properly.

After that PR, you should be able to inspect the state of files on disk and see if this happens for you as well. Help welcome!
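
To illustrate the failure mode, here is a minimal, standalone sketch (not the actual Thanos/objstore download path) of the kind of check discussed above, using the minio-go v7 API: compare the bytes actually written to disk against the size reported by StatObject instead of trusting a clean io.EOF. Endpoint, credentials, bucket name, and object key are placeholders.

```go
// Hypothetical verification sketch; not Thanos code. It demonstrates why a
// transfer that ends with a clean io.EOF can still leave a truncated file
// behind, and how a length check against StatObject would catch it.
package main

import (
	"context"
	"fmt"
	"io"
	"log"
	"os"

	"github.com/minio/minio-go/v7"
	"github.com/minio/minio-go/v7/pkg/credentials"
)

func downloadAndVerify(ctx context.Context, c *minio.Client, bucket, key, dst string) error {
	// Ask the server how large the object is supposed to be.
	info, err := c.StatObject(ctx, bucket, key, minio.StatObjectOptions{})
	if err != nil {
		return err
	}

	obj, err := c.GetObject(ctx, bucket, key, minio.GetObjectOptions{})
	if err != nil {
		return err
	}
	defer obj.Close()

	f, err := os.Create(dst)
	if err != nil {
		return err
	}
	defer f.Close()

	// io.Copy returns without error on io.EOF, even if the stream was cut
	// short by the remote side or an intermediate load balancer.
	n, err := io.Copy(f, obj)
	if err != nil {
		return err
	}
	if n != info.Size {
		return fmt.Errorf("truncated download of %s/%s: got %d bytes, expected %d", bucket, key, n, info.Size)
	}
	return nil
}

func main() {
	// Placeholder endpoint and credentials.
	c, err := minio.New("s3.example.internal", &minio.Options{
		Creds:  credentials.NewStaticV4("ACCESS_KEY", "SECRET_KEY", ""),
		Secure: true,
	})
	if err != nil {
		log.Fatalln(err)
	}
	err = downloadAndVerify(context.Background(), c,
		"thanos-bucket", "01EAQGV34VQ8PWVKXJ4V2K8M5B/chunks/000001", "/tmp/000001")
	if err != nil {
		log.Fatalln(err)
	}
}
```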

GiedriusS changed the title from 'Compactor halts on what seems a retryable error ("segment doesn't include enough bytes")' to 'Compactor/S3: sometimes segments are not being fully downloaded ("segment doesn't include enough bytes")' on Jun 25, 2020
@JensErat (Contributor, Author)

I further analyzed one of the halts:

  • compact uploaded a block successfully (at least no issues indicated in logs) on 2020-06-25 around noon
  • compact downloaded the block again on 2020-06-27 and ended up with incomplete data; the download took around 5 h(!), an operation usually done within minutes
  • after restarting the compactor, it downloads the block again and successfully continues

thanos-store did not have any issues with this block.

With two days in between, I do not really suspect eventual consistency issues. This looks more like connection problems during the download, but shouldn't the client library recognize such issues?

If there is anything more we can do, we are glad to jump in. For now, we will likely switch to debug.halt=false, as retrying/re-downloading seems to work well.
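
Since re-downloading reliably fixes it, another option is the pattern sketched below: re-run the download-and-verify step with backoff when the only thing wrong is a short file, instead of giving up after the first attempt. This is a generic illustration, not Thanos code; withRetries and errTruncated are made-up names standing in for the real downloader and its truncation error.

```go
// Generic retry sketch for the "just re-download" workaround described above.
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errTruncated stands in for the "segment doesn't include enough bytes" class
// of failures: the transfer finished without a transport error, but the file
// on disk is shorter than expected.
var errTruncated = errors.New("truncated download")

// withRetries re-runs download with exponential backoff as long as the only
// failure mode is a truncated transfer.
func withRetries(ctx context.Context, attempts int, backoff time.Duration, download func(context.Context) error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = download(ctx); err == nil {
			return nil
		}
		// Only retry the failure mode discussed in this issue; bail out on
		// anything else (auth errors, missing objects, ...).
		if !errors.Is(err, errTruncated) {
			return err
		}
		if i == attempts-1 {
			break
		}
		select {
		case <-time.After(backoff):
			backoff *= 2
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", attempts, err)
}

func main() {
	attempt := 0
	err := withRetries(context.Background(), 3, time.Second, func(ctx context.Context) error {
		attempt++
		if attempt < 3 {
			return errTruncated // simulate two short downloads, then success
		}
		return nil
	})
	fmt.Println("result:", err) // prints "result: <nil>"
}
```

Anything other than the truncation case is still treated as fatal, so genuinely bad blocks would still surface and halt as they do today.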

stale bot commented Jul 29, 2020

Hello 👋 It looks like there has been no activity on this issue for the last 30 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this issue or push a commit. Thanks! 🤗
If there is no activity in the next week, this issue will be closed (we can always reopen it if needed!). Alternatively, use the remind command if you wish to be reminded at some point in the future.

stale bot added the stale label Jul 29, 2020
@JensErat (Contributor, Author)

This is still an issue for us; automatically restarting / not halting the compactor makes compaction continue pretty much immediately, without any issues.

stale bot commented Aug 28, 2020

Hello 👋 It looks like there has been no activity on this issue for the last 30 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this issue or push a commit. Thanks! 🤗
If there is no activity in the next week, this issue will be closed (we can always reopen it if needed!). Alternatively, use the remind command if you wish to be reminded at some point in the future.

stale bot added the stale label Aug 28, 2020
GiedriusS removed the stale label Aug 28, 2020
@crsandeep (Contributor)

We are seeing the same issue. We have 200+ compactors and this is happening in all of them. Restarting the pod fixes the issue temporarily.

@GiedriusS (Member)

> We are seeing the same issue. We have 200+ compactors and this is happening in all of them. Restarting the pod fixes the issue temporarily.

Are you using S3 as well? What is implementing this API underneath, i.e. are you using AWS or something else?

@crsandeep (Contributor)

@GiedriusS We are using the Ceph object store, which is S3-compatible.

stale bot commented Nov 3, 2020

Hello 👋 It looks like there has been no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this issue or push a commit. Thanks! 🤗
If there is no activity in the next two weeks, this issue will be closed (we can always reopen it if needed!). Alternatively, use the remind command if you wish to be reminded at some point in the future.

stale bot added the stale label Nov 3, 2020
stale bot commented Nov 17, 2020

Closing for now as promised, let us know if you need this to be reopened! 🤗

stale bot closed this as completed Nov 17, 2020
@goober (Contributor) commented Jan 29, 2021

We are experiencing the same issue with Thanos v0.17.2 and the Ceph Object Gateway.
@JensErat were you able to solve your issue?

@pracucci (Contributor)

> My best theory, for now, is that the transfer gets cut off abruptly, and minio-go eats that error up as io.EOF

From a (very) quick look at the Minio internals, I suspect this is the case as well.

@GiedriusS (Member)

Should be fixed with #3795.

@yeya24 (Contributor) commented Jun 22, 2022

I still hit the same error during compaction. I am wondering whether there is a way for me to overcome this issue, or to delete only the bad series using the bucket rewrite tool.
From the error log alone, what I can do is very limited.

level=error ts=2022-06-16T06:34:26.254782121Z caller=compact.go:486 msg="critical error detected; halting" err="compaction: group 0@17721352815461611655: compact blocks [/var/thanos/compact/compact/0@17721352815461611655/01G5HSCFX18SE036E3C80SWTMG /var/thanos/compact/compact/0@17721352815461611655/01G5J0875CJCBXEBQKN9608VPJ /var/thanos/compact/compact/0@17721352815461611655/01G5J73YD0EYV92KHTEEWFSVV3 /var/thanos/compact/compact/0@17721352815461611655/01G5JDZNP5A9GS6CQ695XGQ9M0]: populate block: chunk iter: cannot populate chunk 8858369979: segment doesn't include enough bytes to read the chunk - required:268435469, available:268435456"

yeya24 reopened this Jun 22, 2022
stale bot commented Sep 21, 2022

Hello 👋 It looks like there has been no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this issue or push a commit. Thanks! 🤗
If there is no activity in the next two weeks, this issue will be closed (we can always reopen it if needed!). Alternatively, use the remind command if you wish to be reminded at some point in the future.

stale bot added the stale label Sep 21, 2022
@nitnatsnocER

Hi,
I also have the same issue with NetApp StorageGRID object storage, running Thanos 0.30.1. Is there any way to check or fix blocks with the tools bucket rewrite tool?

stale bot removed the stale label Jan 11, 2023