Speed up kill tasks by deleting segments in batch #14131
Conversation
Force-pushed b8a8285 to 6e5040e
+1 on the API design. I think there are some improvements that can be made to the S3 implementation; let me know what you think. I also think that this behavior should be enabled by default in KillUnusedSegmentsTask, since it is the more efficient way to delete segments. Is there a reason not to enable it by default?
Please also describe the manual testing that was done with this change. Specifically, I would like to know:
- What happens when you try to delete segment files that have already been removed from S3?
- What happens when deleting one segment or descriptor in the batch fails for some reason?
Can you also share what the log messages look like in the failure cases? Since we are batching the requests, do we need to be more clever about listing all the keys that failed to delete in the logs?
Overall, this is a very exciting change! Thank you for implementing this!
@@ -129,8 +129,12 @@ public TaskStatus runTask(TaskToolbox toolbox) throws Exception

// Kill segments
toolbox.getTaskActionClient().submit(new SegmentNukeAction(new HashSet<>(unusedSegments)));
for (DataSegment segment : unusedSegments) {
  toolbox.getDataSegmentKiller().kill(segment);
if (getContextValue("batchDelete", false)) {
If we keep this feature flag, it should be enabled by default, as this is the better method. Since there is no change in behavior with the flag enabled vs. disabled, I don't see a reason to disable this behavior by default when batch deleting segments is the more optimal way to delete segments. We should also document this in the task context documentation and mention that this feature flag will be going away in the very near future.
Is there a way to configure the auto-kill to specify this context so that Druid operators who have auto-kill enabled can enable / disable this behavior for kill tasks generated from the auto-kill functionality in Druid?
Ah, I was thinking the other way around: have it be opt-in at first, then opt-out, just to give the more safety-conscious users a chance to migrate to it. I never got around to imagining removing it entirely. I'm just hung up on the whole feature flag and how no other extension can take advantage of it, so it's more of a best-effort feature flag.
I would need to look into auto-kill and how that works.
IMO enabling this new feature by default is best, as it's better and the risk isn't super high.
I'm not even sure we need a feature flag at all. We should be able to get the risk low enough with appropriate testing that a flag isn't needed.
If we're gonna have one, then yeah, it needs to be settable for people that use autokill.
As to testing: one thing to think about is whether this uses any new S3 APIs and therefore may require new IAM permissions. If it would require new IAM permissions then we definitely need a feature flag and we need to call it out in the docs and release notes. Or, better, adjust the design to not use new APIs and not need new IAM permissions.
If we decide to go the "not require new APIs" route, which IMO is best, then we should have a note in the interface that says that the multi-arg form of kill must not require additional permissions on the deep storage beyond what the single-arg form of kill requires. Otherwise, when a multi-arg impl is contributed for another deep storage, it could break people in production that don't have the right set of permissions.
@gianm I decided to enable this by default and not have a feature flag for it at all. The API that is used to delete multiple keys from a bucket is already used elsewhere in the extension.
I also added the verbiage about kill not requiring additional permissions.
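For context on the S3 side: the multi-object delete API (DeleteObjects) that the extension already uses accepts at most 1,000 keys per request, so any batch implementation has to partition the key list before issuing requests. A minimal, self-contained sketch of that partitioning (hypothetical helper, not the PR's actual code):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class KeyBatcher {
    // S3's DeleteObjects call rejects requests with more than 1000 keys,
    // so batch deletion must chunk the full key list first.
    static final int MAX_KEYS_PER_REQUEST = 1000;

    static List<List<String>> partition(List<String> keys) {
        List<List<String>> batches = new ArrayList<>();
        for (int i = 0; i < keys.size(); i += MAX_KEYS_PER_REQUEST) {
            batches.add(keys.subList(i, Math.min(i + MAX_KEYS_PER_REQUEST, keys.size())));
        }
        return batches;
    }

    public static void main(String[] args) {
        // 2500 keys -> three requests: 1000 + 1000 + 500.
        List<String> keys = Collections.nCopies(2500, "some/segment/index.zip");
        List<List<String>> batches = partition(keys);
        System.out.println(batches.size() + " batches, last has " + batches.get(2).size());
        // prints "3 batches, last has 500"
    }
}
```

Each chunk would then be handed to a single DeleteObjects request, which uses the same s3:DeleteObject permission as single-key deletes.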
Commits:
- …one by one. Create new batchDelete method in DataSegmentKiller that has default functionality of iterating through all segments and calling delete on them. This will enable a slow rollout of other deep storage implementations to move to a batched delete on their own time.
- Cleaned up code; just need to add tests and docs for this functionality.
- Add unit tests.
- …e. Fix checkstyle.
Force-pushed 18d5f45 to 6b5c887
The new method signature looks good to me. I had some notes about the feature flag and testing and the required set of permissions; see the comments in thread https://github.com/apache/druid/pull/14131/files#r1178437209.
…couldn't be deleted if there was a client error or server error
Reading through the logs, I noticed that the log lines that report the files to be deleted and the files that failed to be deleted are very long and hard to read. However, I do not have a suggestion to make it better, as an operator will likely want to see all the files that will be deleted in the logs. Perhaps a newline separator between the file names in the list, instead of a comma, would make it easier to read. ¯\_(ツ)_/¯
I also noticed that to generate these messages we have to make copies of lists and create new objects. This can introduce additional memory pressure for the kill task that didn't exist before. I think this is acceptable, as it is likely there is enough additional heap for the peons to deal with this.
LGTM after CI passes the IntelliJ inspections. The code coverage failures reported in the processing module's tests are not concerning to me, as this is a very simple code path that doesn't seem like it will benefit from unit tests.
Error: extensions-core/s3-extensions/src/test/java/org/apache/druid/storage/s3/S3DataSegmentKillerTest.java:355 -- The declared exception <code>SegmentLoadingException</code> is never thrown
Error: extensions-core/s3-extensions/src/test/java/org/apache/druid/storage/s3/S3DataSegmentKillerTest.java:392 -- The declared exception <code>SegmentLoadingException</code> is never thrown
Error: extensions-core/s3-extensions/src/test/java/org/apache/druid/storage/s3/S3DataSegmentKillerTest.java:331 -- The declared exception <code>SegmentLoadingException</code> is never thrown
Error: Process completed with exit code 1.
@gianm / @rohangarg any other comments on this patch?
// this was a shortcut to handle the many different ways there could potentially be failures and handle them
// reasonably
throw new SegmentLoadingException(
    "Couldn't delete segments from s3 see logs for more details"
Please improve the grammar (missing punctuation) and capitalization (S3).
Also, can we interpolate some useful information into this error message? Such as the number of segments, the first segment ID (or first N IDs) that couldn't be deleted, the nature of the reason why it couldn't be deleted, etc. This info could come back from deleteKeysForBucket rather than a boolean.
makes sense, will do.
@gianm The reason I didn't do that is that all the information is truncated in the status for the delete task, so I figured it wasn't worth the complication to corral that state and present it in a meaningful way if it gets truncated.
This is how it looks in reality in the status object:
"errorMsg": "org.apache.druid.segment.loading.SegmentLoadingException: Couldn't delete from bucket: [tomfenwick-t..."
That being said, I can still make those changes.
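One way to surface that information, sketched with hypothetical names (the real deleteKeysForBucket signature may differ), is to return a small result object instead of a boolean and interpolate a bounded summary into the exception message:

```java
import java.util.Arrays;
import java.util.List;

public class DeleteResultSketch {
    // Hypothetical result type: instead of a boolean, the delete helper
    // reports which keys failed so the error message can name them.
    static class DeleteResult {
        final List<String> failedKeys;

        DeleteResult(List<String> failedKeys) {
            this.failedKeys = failedKeys;
        }

        boolean hasFailures() {
            return !failedKeys.isEmpty();
        }

        // Summarize at most `limit` keys so the message stays readable even
        // when thousands of deletes fail; full detail still goes to the logs.
        String summarize(int limit) {
            int shown = Math.min(limit, failedKeys.size());
            return failedKeys.size() + " key(s) could not be deleted, first " + shown
                   + ": " + failedKeys.subList(0, shown);
        }
    }

    public static void main(String[] args) {
        DeleteResult result = new DeleteResult(
            Arrays.asList("a/index.zip", "b/index.zip", "c/index.zip"));
        if (result.hasFailures()) {
            System.out.println("Couldn't delete segments from S3. " + result.summarize(2));
        }
    }
}
```

Bounding the summary sidesteps the truncation concern raised above: the status message stays short while the logs keep the full list.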
@TSFenwick we'd spotted the same in #14634; please @ me if you'd like any help to close this out, extra testing, etc., as this would save me a lot of effort too (~100 million segments pending to delete). Great work.
FYI, a related change to the SQL layer: #14639
LGTM
Added a few notes
This PR now has three +1s from three committers. @TSFenwick Can you check the failing tests? @rohangarg @gianm Do you have any other comments, or feel that this PR should not be merged?
LGTM once the CI passes, would recommend rebasing it with master too while fixing the failed tests
@rohangarg @maytasm I fixed the one broken test. I was told by @suneet-s and @abhishekagarwal87 not to worry about the failing code coverage on the default method, as seen in the past test runs: https://github.com/apache/druid/actions/runs/5650168274/job/15306385499#step:11:2167
Description
Create a new kill method in DataSegmentKiller that takes a list of DataSegments to kill. This will enable implementers of DataSegmentKiller to leverage a batch delete that the underlying storage system may have. DataSegmentKiller
has a default method of iterating through all segments and calling delete on them, to enable backwards compatibility. Nothing new is needed to leverage batch delete for S3; we currently use the same S3 API call for other S3 deletions, so I have full confidence in this change and have tested it using S3.
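The backwards-compatible default method described above can be sketched as follows (hypothetical, simplified interface; the real DataSegmentKiller operates on DataSegment objects and can throw SegmentLoadingException):

```java
import java.util.Arrays;
import java.util.List;

public class BatchKillSketch {
    // Hypothetical stand-in for Druid's DataSegmentKiller interface.
    interface Killer {
        void kill(String segment);

        // Default implementation: iterate and delete one by one. Existing
        // implementations inherit this, so the new batch entry point is
        // backwards compatible; S3 can override it with a true batch delete.
        default void kill(List<String> segments) {
            for (String segment : segments) {
                kill(segment);
            }
        }
    }

    public static void main(String[] args) {
        int[] deleted = {0};
        Killer oneByOne = segment -> deleted[0]++;   // only the single-segment method is implemented
        oneByOne.kill(Arrays.asList("a", "b", "c")); // falls back to the default loop
        System.out.println(deleted[0]);              // prints 3
    }
}
```

An implementation with a native batch API simply overrides the list overload, while every other deep storage keeps working unchanged.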
Manual test case scenarios
Testing results
Incorrect Access Key
log_invalid_access_key.txt
Successful no errors - single bucket
log_delete_no_issues.txt
unable to delete indices
log_delete_no_perms_on_keys.txt
Successful delete across multiple buckets
log_delete_across_multiple_buckets.txt
Rough perf test results
~1025 segments deleted in 13 seconds with batch delete; normal delete took 5 minutes 40 seconds.
Release note
Deletion of multiple segments stored in S3 is now much faster, as it leverages batch deletion.
Key changed/added classes in this PR
S3DataSegmentKiller.java
OmniDataSegmentKiller.java
DataSegmentKiller.java
KillUnusedSegmentsTask.java