Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[compact] Increase default consistency delay to 24h #1901

Closed
wants to merge 1 commit into from
Closed

[compact] Increase default consistency delay to 24h #1901

wants to merge 1 commit into from

Conversation

mattrco
Copy link
Contributor

@mattrco mattrco commented Dec 17, 2019

Firstly, thanks for adding sharding to compact/store using relabeling config - I took part in a previous discussion on this and it's great to have the ability to do it now 🎉

We've run into an issue which caused some data loss. We have very large blocks that can take hours to upload to S3. When running multiple shards, we found that the malformed block check caused erroneous deletions:

• shard A compacts blocks X and Y, creating block Z on disk
• it starts to upload Z, starting with chunk 1 and going sequentially
• shard B scans the bucket, finds partially uploaded block Z
• it thinks Z is malformed because meta.json isn’t there (it gets uploaded last, after all chunks)
• shard B deletes block Z
• meanwhile, shard A is still uploading, so you end up with some of the latter blocks uploaded, and meta.json

As a quick fix, upping the consistency deadline makes things safer. But I wonder if we should also disable the malformed block deletion by default (and log a warning) if a relabeling config exists, so that the user can manually inspect and delete those blocks.

If we're happy to go ahead I'll add a changelog entry.

Firstly, thanks for adding sharding to compact/store using relabeling config - I took part in a previous discussion on this and it's great to have the ability to do it now 🎉 

We've run into an issue which caused some data loss. We have very large blocks that can take hours to upload to S3. When running multiple shards, we found that the malformed block check caused erroneous deletions:

```
• shard A compacts blocks X and Y, creating block Z on disk
• it starts to upload Z, starting with chunk 1 and going sequentially
• shard B scans the bucket, finds partially uploaded block Z
• it thinks Z is malformed because meta.json isn’t there (it gets uploaded last, after all chunks)
• shard B deletes block Z
• meanwhile, shard A is still uploading, so you end up with some of the latter blocks uploaded, and meta.json
```

As a quick fix, upping the consistency deadline makes things safer. But I wonder if we should also disable the malformed block deletion by default (and log a warning) if a relabeling config exists, so that the user can manually inspect and delete those blocks.
@mattrco
Copy link
Contributor Author

mattrco commented Dec 17, 2019

(Another solution would be to use different buckets, but we'd need to add relabeling config support to sidecar.)

@bwplotka
Copy link
Member

bwplotka commented Dec 17, 2019

Thanks for this!

ConsistencyDelay I think should stay shorter. Otherwise, you don't compact quick enough the small 2h blocks.
We are touching the idea of upload timeout, but it looks like it cannot be hardcoded/configured with single value, especially on compactor with bigger blocks: It can vary a lot.

Unfortunately, we shard on meta.json which is missing due to pending upload, so putting blocks in different buckets would work but I think it's quite inconvenient. We need to make it work with what we have now.

Ideas:

  • For this particular problem, I think increasing just minimumAge of removal on partial upload will do the work.
  • Make upload more resistant and check if all files are there afterward.

@squat
Copy link
Member

squat commented Dec 17, 2019

We are discussing this issue, and many related ones, and would like to come up with something that prevents these racing deletions in general. Both of the changes mentioned by Bartek are pretty low cost and likely high impact changes that will fix a good proportion of the races

@mattrco
Copy link
Contributor Author

mattrco commented Dec 17, 2019

For this particular problem, I think increasing just minimumAge of removal on partial upload will do the work.

Thanks, I realise this doesn't behave precisely how I intended. I'll add a flag for minimumAge so it can be configured 👍

@bwplotka
Copy link
Member

@mattrco are you on CNCF slack maybe? See this working document we are working on before you jump into implementation: https://docs.google.com/document/d/1QvHt9NXRvmdzy51s4_G21UW00QwfL9BfqBDegcqNvg0/edit

@mattrco
Copy link
Contributor Author

mattrco commented Dec 17, 2019

Great, let me jump on slack 👀

@bwplotka
Copy link
Member

bwplotka commented Jan 6, 2020

#1937 fixes the issue.

@bwplotka bwplotka closed this Jan 6, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants