Move retries into DataSegmentPusher implementations. #15938

Merged · 6 commits merged into apache:master · Mar 4, 2024

Conversation

@gianm (Contributor) commented on Feb 21, 2024:

The individual implementations know better when they should and should not retry. They can also generate better error messages.

Most network-based deep storage implementations already have logic for retrying, except HDFS, which I added in this patch, and Azure. It looks like the Azure client itself may have some built-in retry stuff, but I'm not totally sure. That one might also need retry wrapping. If anyone has experience with Azure and knows the answer to this, please let me know. For now, I've left it without retry wrapping.

The inspiration for this patch was a situation where EntityTooLarge was generated by the S3DataSegmentPusher, and retried uselessly by the retry harness in PartialSegmentMergeTask. Other related changes in this patch include:

  • Adds an error message that spells out the problem, and a potential fix, when an S3 upload attempt for a segment fails because it exceeds 5GB.
  • Stops unconditionally retrying on non-retryable S3 errors.
  • Updates the documentation to clarify that, in some cases, there is a cap on the maximum segment size.
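
To illustrate the idea, here is a minimal sketch of a pusher-level retry loop that distinguishes retryable transient failures from non-retryable ones such as an oversize upload. This is not the actual Druid code; the class names, exception types, and `pushWithRetries` helper below are hypothetical stand-ins for whatever each DataSegmentPusher implementation uses.

```java
// Illustrative sketch only -- not the actual Druid implementation.
import java.io.IOException;
import java.util.concurrent.Callable;

public class RetryingPushExample
{
  // Hypothetical marker exceptions for the two error classes discussed above.
  static class TooLargeException extends IOException {}          // e.g. S3 EntityTooLarge: never retry
  static class TransientStorageException extends IOException {}  // e.g. network hiccup: retry

  static <T> T pushWithRetries(Callable<T> push, int maxTries) throws Exception
  {
    for (int attempt = 1; ; attempt++) {
      try {
        return push.call();
      }
      catch (TooLargeException e) {
        // Non-retryable: fail fast with an actionable message instead of retrying uselessly.
        throw new IOException(
            "Segment exceeds the storage upload limit; reduce segment size "
            + "(e.g. by creating more, smaller partitions) and rerun the task.",
            e
        );
      }
      catch (TransientStorageException e) {
        if (attempt >= maxTries) {
          throw e;
        }
        // Retryable: back off briefly and try again.
        Thread.sleep(1000L * attempt);
      }
    }
  }
}
```

Because the loop lives inside the pusher, each deep storage implementation decides for itself which errors are worth retrying and what message to surface when they are not.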

@kfaraz (Contributor) left a comment:
LGTM 🚀

@georgew5656 (Contributor) commented:

The Azure client handles retries of transient exceptions. As far as I know, there's no size limit for uploads using the BlobClient.upload operation that we use to upload segments, so this wouldn't apply; but if it did, the client should fail fast as expected rather than retrying.
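
For reference, a small sketch of how retry behavior can be configured on the Azure SDK client itself via RequestRetryOptions. This is not necessarily how Druid's Azure pusher builds its client; the connection string, container, and blob names are placeholders.

```java
import com.azure.storage.blob.BlobClient;
import com.azure.storage.blob.BlobServiceClient;
import com.azure.storage.blob.BlobServiceClientBuilder;
import com.azure.storage.common.policy.RequestRetryOptions;
import com.azure.storage.common.policy.RetryPolicyType;

import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

public class AzureRetryExample
{
  public static void main(String[] args)
  {
    // The SDK retries transient failures internally according to these options;
    // non-retryable failures surface to the caller immediately.
    RequestRetryOptions retryOptions = new RequestRetryOptions(
        RetryPolicyType.EXPONENTIAL,
        3,        // max tries
        60,       // per-try timeout, seconds
        null,     // default base delay
        null,     // default max delay
        null      // no secondary host
    );

    BlobServiceClient serviceClient = new BlobServiceClientBuilder()
        .connectionString("<connection-string>")   // placeholder
        .retryOptions(retryOptions)
        .buildClient();

    byte[] data = "example segment bytes".getBytes(StandardCharsets.UTF_8);
    BlobClient blob = serviceClient
        .getBlobContainerClient("segments")        // placeholder container
        .getBlobClient("example/segment.zip");     // placeholder blob name

    // Same upload call the pusher uses; retries of transient errors happen inside the client.
    blob.upload(new ByteArrayInputStream(data), data.length, true);
  }
}
```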

@gianm (Contributor, Author) commented on Mar 4, 2024:

The only error is Invalid docusaurus-theme-mermaid version 2.4.3 in docs as part of Static Checks CI / web-checks (pull_request). I tried restarting the job, but it failed again. Possibly something is wrong with the build cache?

Anyway, I don't think it's related to this patch, so I'll merge it anyway. Thanks for the reviews.

@gianm merged commit 930655f into apache:master on Mar 4, 2024
82 of 83 checks passed
@gianm deleted the push-retries branch on March 4, 2024 at 18:36
@clintropolis (Member) commented:

> The only error is Invalid docusaurus-theme-mermaid version 2.4.3 in docs as part of Static Checks CI / web-checks (pull_request). I tried restarting the job, but it failed again. Possibly something is wrong with the build cache?

Ah, I saw this same error in a different PR and thought it was nothing, but looking closer, the real error here is a spellcheck failure:

> spellcheck
> mdspell --en-us --ignore-numbers --report '../docs/**/*.md' || (./script/notify-spellcheck-issues && false)

    ../docs/operations/segment-optimization.md
       54 | age imposes an upper limit of 5GB. 

>> 1 spelling error found in 251 files

I guess it's complaining about 5GB, which I hear it would prefer as 5 GB.

@gianm (Contributor, Author) commented on Mar 5, 2024:

> I guess it's complaining about 5GB, which I hear it would prefer as 5 GB.

Oops, fixed in #16040.

@adarshsanjeev added this to the 30.0.0 milestone on May 6, 2024.