
Compactor/S3: sometimes segments are not being fully downloaded ("segment doesn't include enough bytes") #2805

Open
JensErat opened this issue Jun 25, 2020 · 16 comments

@JensErat (Contributor) commented Jun 25, 2020

Thanos, Prometheus and Golang version used:

Thanos 0.13 rc2; currently upgrading to 0.13, but the commits do not indicate any relevant change since the tag was cut.

Object Storage Provider:

Internal Cloudian S3 service

What happened:

Compactor halts with the following error message:

level=error ts=2020-06-25T07:57:55.183632304Z caller=compact.go:375 msg="critical error detected; halting" err="compaction: group 0@9955244503410235132: compact blocks [/var/thanos/compact/data/compact/0@9955244503410235132/01EAQGV34VQ8PWVKXJ4V2K8M5B /var/thanos/compact/data/compact/0@9955244503410235132/01EAQQPTCT5ABJSXRBMG53V8ZC /var/thanos/compact/data/compact/0@9955244503410235132/01EAQYJHMZ9R0V8M5Y3GPG0NWD /var/thanos/compact/data/compact/0@9955244503410235132/01EAR5E8WTX7MJP7PG1DQMPY4W]: write compaction: iterate compaction set: chunk 41943009 not found: segment doesn't include enough bytes to read the chunk - required:41943044, available:41943040"

After restarting the compactor, compaction continues successfully (usually hours later; we only get alerted when the compactor has not run for quite some time, so I cannot tell whether this is a very short inconsistency window combined with a lot of bad luck).

What you expected to happen:

If this really is a transient issue caused by eventual consistency, retry instead of halting.

How to reproduce it (as minimally and precisely as possible):

Slightly overload your S3 service and wait. We see an estimated failure rate of about once a week for a single compactor. It mostly happens in installations with some network distance (ping >= 200 ms).

Full logs to relevant components:

Logs

[...]
level=info ts=2020-06-25T07:42:56.091077077Z caller=block.go:157 group="0@{monitor=\"prometheus\", replica=\"10\"}" groupKey=0@7398988455995530398 msg="block has been marked for deletion" block=01EAQYJKPJS17YPE28VZ48JF7Q
level=info ts=2020-06-25T07:42:56.11841702Z caller=compact.go:792 group="0@{monitor=\"prometheus\", replica=\"10\"}" groupKey=0@7398988455995530398 msg="marking compacted block for deletion" old_block=01EAR5EAY80Z79JCDWQBAVVGJ4
level=info ts=2020-06-25T07:42:56.358365126Z caller=block.go:157 group="0@{monitor=\"prometheus\", replica=\"10\"}" groupKey=0@7398988455995530398 msg="block has been marked for deletion" block=01EAR5EAY80Z79JCDWQBAVVGJ4
level=info ts=2020-06-25T07:42:57.695050512Z caller=compact.go:633 group="0@{monitor=\"prometheus-ha\", replica=\"0\"}" groupKey=0@9955244503410235132 msg="compaction available and planned; downloading blocks" plan="[/var/thanos/compact/data/compact/0@9955244503410235132/01EAQGV34VQ8PWVKXJ4V2K8M5B /var/thanos/compact/data/compact/0@9955244503410235132/01EAQQPTCT5ABJSXRBMG53V8ZC /var/thanos/compact/data/compact/0@9955244503410235132/01EAQYJHMZ9R0V8M5Y3GPG0NWD /var/thanos/compact/data/compact/0@9955244503410235132/01EAR5E8WTX7MJP7PG1DQMPY4W]"
level=info ts=2020-06-25T07:55:17.496533315Z caller=compact.go:692 group="0@{monitor=\"prometheus-ha\", replica=\"0\"}" groupKey=0@9955244503410235132 msg="downloaded and verified blocks; compacting blocks" plan="[/var/thanos/compact/data/compact/0@9955244503410235132/01EAQGV34VQ8PWVKXJ4V2K8M5B /var/thanos/compact/data/compact/0@9955244503410235132/01EAQQPTCT5ABJSXRBMG53V8ZC /var/thanos/compact/data/compact/0@9955244503410235132/01EAQYJHMZ9R0V8M5Y3GPG0NWD /var/thanos/compact/data/compact/0@9955244503410235132/01EAR5E8WTX7MJP7PG1DQMPY4W]" duration=12m19.801396575s
level=error ts=2020-06-25T07:57:55.183632304Z caller=compact.go:375 msg="critical error detected; halting" err="compaction: group 0@9955244503410235132: compact blocks [/var/thanos/compact/data/compact/0@9955244503410235132/01EAQGV34VQ8PWVKXJ4V2K8M5B /var/thanos/compact/data/compact/0@9955244503410235132/01EAQQPTCT5ABJSXRBMG53V8ZC /var/thanos/compact/data/compact/0@9955244503410235132/01EAQYJHMZ9R0V8M5Y3GPG0NWD /var/thanos/compact/data/compact/0@9955244503410235132/01EAR5E8WTX7MJP7PG1DQMPY4W]: write compaction: iterate compaction set: chunk 41943009 not found: segment doesn't include enough bytes to read the chunk - required:41943044, available:41943040"

Anything else we need to know:

This is pretty certainly related to #2128; that issue, prometheus/prometheus#6040, and our own observations make me think this might be a transient issue. It is hard to tell, since we cannot really reproduce it. Is #2637 maybe fixing this already?

@GiedriusS (Member)

This is not a retryable error: when it occurs, compaction simply does not have enough data on disk.

We have been running into this for the past month or so, but only in our staging environment. With #2637 we have found that:

  • One chunk file is missing some data during compaction
  • The content of the affected file that is on disk is correct, i.e. there is no corruption going on; the file is only missing some data
  • The calls to S3 succeed, i.e. a 200 is returned to minio-go
  • We have load balancers in front of our S3 endpoints, and changing the configuration to go to the Ceph Object Gateway directly seems to fix the problem; it has been running for a week without issues.

My best theory, for now, is that the transfer gets cut off abruptly, and minio-go eats that error up as io.EOF. It then thinks it got everything, and the caller continues its execution. I think it has something to do with keep-alive connections not being maintained properly.

After that PR, you should be able to inspect the state of files on disk and see if this happens for you as well. Help welcome!
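
To illustrate the failure mode, here is a minimal, standalone sketch (not the actual Thanos/objstore download path) of the kind of check discussed above, using the minio-go v7 API: compare the bytes actually written to disk against the size reported by StatObject instead of trusting a clean io.EOF. Endpoint, credentials, bucket name, and object key are placeholders.

```go
// Hypothetical verification sketch; not Thanos code. It demonstrates why a
// transfer that ends with a clean io.EOF can still leave a truncated file
// behind, and how a length check against StatObject would catch it.
package main

import (
	"context"
	"fmt"
	"io"
	"log"
	"os"

	"github.com/minio/minio-go/v7"
	"github.com/minio/minio-go/v7/pkg/credentials"
)

func downloadAndVerify(ctx context.Context, c *minio.Client, bucket, key, dst string) error {
	// Ask the server how large the object is supposed to be.
	info, err := c.StatObject(ctx, bucket, key, minio.StatObjectOptions{})
	if err != nil {
		return err
	}

	obj, err := c.GetObject(ctx, bucket, key, minio.GetObjectOptions{})
	if err != nil {
		return err
	}
	defer obj.Close()

	f, err := os.Create(dst)
	if err != nil {
		return err
	}
	defer f.Close()

	// io.Copy returns without error on io.EOF, even if the stream was cut
	// short by the remote side or an intermediate load balancer.
	n, err := io.Copy(f, obj)
	if err != nil {
		return err
	}
	if n != info.Size {
		return fmt.Errorf("truncated download of %s/%s: got %d bytes, expected %d", bucket, key, n, info.Size)
	}
	return nil
}

func main() {
	// Placeholder endpoint and credentials.
	c, err := minio.New("s3.example.internal", &minio.Options{
		Creds:  credentials.NewStaticV4("ACCESS_KEY", "SECRET_KEY", ""),
		Secure: true,
	})
	if err != nil {
		log.Fatalln(err)
	}
	err = downloadAndVerify(context.Background(), c,
		"thanos-bucket", "01EAQGV34VQ8PWVKXJ4V2K8M5B/chunks/000001", "/tmp/000001")
	if err != nil {
		log.Fatalln(err)
	}
}
```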

GiedriusS changed the title from 'Compactor halts on what seems a retryable error ("segment doesn't include enough bytes")' to 'Compactor/S3: sometimes segments are not being fully downloaded ("segment doesn't include enough bytes")' on Jun 25, 2020
@JensErat (Contributor, Author)

I further analyzed one of the halts:

  • compact uploaded a block successfully (at least no issues indicated in logs) on 2020-06-25 around noon
  • compact downloaded the block again on 2020-06-27 and ended up with incomplete data; the download took around 5 h(!), an operation usually done within minutes
  • after restarting the compactor, it downloads the block again and successfully continues

thanos-store did not have any issues with this block.

With two days in between, I do not really suspect eventual consistency issues. This looks more like connection problems during the download, but shouldn't the client library recognize such issues?

If there is anything more we can do, we are glad to jump in. For now, we will likely switch to debug.halt=false, as retrying/re-downloading seems to work well.
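
Since re-downloading reliably fixes it, another option is the pattern sketched below: re-run the download-and-verify step with backoff when the only thing wrong is a short file, instead of giving up after the first attempt. This is a generic illustration, not Thanos code; withRetries and errTruncated are made-up names standing in for the real downloader and its truncation error.

```go
// Generic retry sketch for the "just re-download" workaround described above.
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errTruncated stands in for the "segment doesn't include enough bytes" class
// of failures: the transfer finished without a transport error, but the file
// on disk is shorter than expected.
var errTruncated = errors.New("truncated download")

// withRetries re-runs download with exponential backoff as long as the only
// failure mode is a truncated transfer.
func withRetries(ctx context.Context, attempts int, backoff time.Duration, download func(context.Context) error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = download(ctx); err == nil {
			return nil
		}
		// Only retry the failure mode discussed in this issue; bail out on
		// anything else (auth errors, missing objects, ...).
		if !errors.Is(err, errTruncated) {
			return err
		}
		if i == attempts-1 {
			break
		}
		select {
		case <-time.After(backoff):
			backoff *= 2
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", attempts, err)
}

func main() {
	attempt := 0
	err := withRetries(context.Background(), 3, time.Second, func(ctx context.Context) error {
		attempt++
		if attempt < 3 {
			return errTruncated // simulate two short downloads, then success
		}
		return nil
	})
	fmt.Println("result:", err) // prints "result: <nil>"
}
```

Anything other than the truncation case is still treated as fatal, so genuinely bad blocks would still surface and halt as they do today.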

stale bot commented Jul 29, 2020

Hello 👋 It looks like there has been no activity on this issue for the last 30 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this issue or push a commit. Thanks! 🤗
If there is no activity in the next week, this issue will be closed (we can always reopen it if needed!). Alternatively, use the remind command if you wish to be reminded at some point in the future.

stale bot added the stale label Jul 29, 2020
@JensErat (Contributor, Author)

This is still an issue for us; automatically restarting / not halting the compactor makes compaction continue pretty much immediately, without any issues.

stale bot commented Aug 28, 2020

Hello 👋 It looks like there has been no activity on this issue for the last 30 days.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this issue or push a commit. Thanks! 🤗
If there is no activity in the next week, this issue will be closed (we can always reopen it if needed!). Alternatively, use the remind command if you wish to be reminded at some point in the future.

stale bot added the stale label Aug 28, 2020
GiedriusS removed the stale label Aug 28, 2020
@crsandeep (Contributor)

We are seeing the same issue. We have 200+ compactors and this is happening in all of them. Restarting the pod fixes the issue temporarily.

@GiedriusS (Member)

> We are seeing the same issue. We have 200+ compactors and this is happening in all of them. Restarting the pod fixes the issue temporarily.

Are you using S3 as well? What is implementing this API underneath, i.e. are you using AWS or something else?

@crsandeep (Contributor)

@GiedriusS We are using the Ceph object store, which is S3-compatible.

stale bot commented Nov 3, 2020

Hello 👋 It looks like there has been no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this issue or push a commit. Thanks! 🤗
If there is no activity in the next two weeks, this issue will be closed (we can always reopen it if needed!). Alternatively, use the remind command if you wish to be reminded at some point in the future.

stale bot added the stale label Nov 3, 2020
stale bot commented Nov 17, 2020

Closing for now as promised, let us know if you need this to be reopened! 🤗

stale bot closed this as completed Nov 17, 2020
@goober (Contributor) commented Jan 29, 2021

We are experiencing the same issue with Thanos v0.17.2 and the Ceph Object Gateway.
@JensErat were you able to solve your issue?

@pracucci (Contributor)

> My best theory, for now, is that the transfer gets cut off abruptly, and minio-go eats that error up as io.EOF

From a (very) quick look at the Minio internals, I suspect this is the case as well.

@GiedriusS (Member)

Should be fixed with #3795.

@yeya24 (Contributor) commented Jun 22, 2022

I still hit the same error during compaction. I am wondering whether there is a way for me to overcome this issue, or to delete only the bad series using the bucket rewrite tool.
From the error log alone, what I can do is very limited.

level=error ts=2022-06-16T06:34:26.254782121Z caller=compact.go:486 msg="critical error detected; halting" err="compaction: group 0@17721352815461611655: compact blocks [/var/thanos/compact/compact/0@17721352815461611655/01G5HSCFX18SE036E3C80SWTMG /var/thanos/compact/compact/0@17721352815461611655/01G5J0875CJCBXEBQKN9608VPJ /var/thanos/compact/compact/0@17721352815461611655/01G5J73YD0EYV92KHTEEWFSVV3 /var/thanos/compact/compact/0@17721352815461611655/01G5JDZNP5A9GS6CQ695XGQ9M0]: populate block: chunk iter: cannot populate chunk 8858369979: segment doesn't include enough bytes to read the chunk - required:268435469, available:268435456"

yeya24 reopened this Jun 22, 2022
stale bot commented Sep 21, 2022

Hello 👋 It looks like there has been no activity on this issue for the last two months.
Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this issue or push a commit. Thanks! 🤗
If there is no activity in the next two weeks, this issue will be closed (we can always reopen it if needed!). Alternatively, use the remind command if you wish to be reminded at some point in the future.

stale bot added the stale label Sep 21, 2022
@nitnatsnocER

Hi,
I also have the same issue with NetApp StorageGRID object storage, running Thanos 0.30.1. Is there any way to check or fix blocks with the tools bucket rewrite tool?

stale bot removed the stale label Jan 11, 2023