Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow for task limit on kill tasks spawned by auto kill coordinator duty #14769

Merged
merged 6 commits into from
Aug 8, 2023

Conversation

zachjsh
Copy link
Contributor

@zachjsh zachjsh commented Aug 7, 2023

Description

Previously, the KillUnusedSegments coordinator duty, in charge of periodically deleting unused segments, could spawn an unlimited number of kill tasks for unused segments. This change adds 2 new coordinator dynamic configs that can be used to control the limit of tasks spawned by this coordinator duty

killTaskSlotRatio: Ratio of total available task slots, including autoscaling if applicable that will be allowed for kill tasks. This limit only applies for kill tasks that are spawned automatically by the coordinator's auto kill duty. Default is 1, which allows all available tasks to be used, which is the existing behavior

maxKillTaskSlots: Maximum number of tasks that will be allowed for kill tasks. This limit only applies for kill tasks that are spawned automatically by the coordinator's auto kill duty. Default is INT.MAX, which essentially allows for unbounded number of tasks, which is the existing behavior.

Realize that we can effectively get away with just the one killTaskSlotRatio, but following similarly to the compaction config, which has similar properties; I thought it was good to have some control of the upper limit regardless of ratio provided.

Release note

NEW: killTaskSlotRatio and maxKillTaskSlots coordinator dynamic config properties added that allow control of task resource usage spawned by KillUnusedSegments coordinator task (auto kill)

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

Previously, the `KillUnusedSegments` coordinator duty, in charge of
periodically deleting unused segments, could spawn an unlimited number
of kill tasks for unused segments. This change adds 2 new coordinator
dynamic configs that can be used to control the limit of tasks spawned
by this coordinator duty
Copy link
Contributor

@jasonk000 jasonk000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally LGTM! Some tweaks to consider.

@@ -149,7 +173,7 @@ private void killUnusedSegments(Collection<String> dataSourcesToKill)
}
}

log.debug("Submitted kill tasks for [%d] datasources.", submittedTasks);
log.debug("Submitted [%d] kill tasks for [%d] datasources.", submittedTasks, dataSourcesToKill.size());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if any datasources were skipped due to not enough slots, we could consider logging them here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will get this in a follow up patch

while (activeTasks.hasNext()) {
final TaskStatusPlus status = activeTasks.next();

// taskType can be null if middleManagers are running with an older version. Here, we consevatively regard
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is important enough to capture as a method-level javadoc hint. ie: this might overestimate in some circumstances.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

taskType can be null if middleManagers are running with an older version.

Do we know in which versio the middle managers would report the task type as null? I saw the same comment in the compact segments duty that was from mid 2018. So I wonder if this null check is that important

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thats exactly where I got it from 😆 . Just wanted to play it safe.

return numActiveKillTasks;
}

private int getKillTaskCapacity(double killTaskSlotRatio, int maxKillTaskSlots)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally, I'd split this to a getTotalWorkerCapacity, to keep the kill calculation separate. I think testing would be cleaner this way too

It would be great to see a unit test that explicitly could test getKillTaskCapacity based only on numeric inputs, so that you can test 0s, infinities, etc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Fixed, and added tests. Let me know if better now.

public void testNoAvailableTaskCapacityForKill()
{
mockNoAvailableTaskSlotsForKill();
runAndVerifyNoKill();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Contributor

@suneet-s suneet-s left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice improvement!

@@ -495,6 +533,8 @@ private static class Defaults
static final int MAX_NON_PRIMARY_REPLICANTS_TO_LOAD = Integer.MAX_VALUE;
static final boolean USE_ROUND_ROBIN_ASSIGNMENT = true;
static final boolean SMART_SEGMENT_LOADING = true;
static final double KILL_TASK_SLOT_RATIO = 1.0;
static final int MAX_KILL_TASK_SLOTS = Integer.MAX_VALUE;
Copy link
Contributor

@suneet-s suneet-s Aug 8, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A comment here that these defaults are to preserve the behavior before Druid 0.28 and a future version may want to consider better defaults so that kill tasks can not eat up all the capacity in the cluster would be nice

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added, thanks

Comment on lines 120 to 121
if (0 == availableKillTaskSlots) {
log.warn("Not killing any unused segments because there are no available kill task slots at this time.");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to be a warning? It seems like it could be quite verbose if there are long running kill tasks and the KillUnusedSegments duty is scheduled to run frequently.

Perhaps, a better way of surfacing this information is emitting the number of kill tasks and calculated kill capacity as metrics so that operators can look at these metrics to decide if they need more kill capacity.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed to debug. 💯 about adding metrics, will do that in a follow up patch.

while (activeTasks.hasNext()) {
final TaskStatusPlus status = activeTasks.next();

// taskType can be null if middleManagers are running with an older version. Here, we consevatively regard
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

taskType can be null if middleManagers are running with an older version.

Do we know in which versio the middle managers would report the task type as null? I saw the same comment in the compact segments duty that was from mid 2018. So I wonder if this null check is that important

return numActiveKillTasks;
}

private int getTotalWorkerCapacity()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like a lot of this code is very similar to the code in CompactSegments Can we refactor the code so that the code can be re-used in both duties?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes absolutely correct. Will refactor this in a follow up patch.

@zachjsh zachjsh requested a review from suneet-s August 8, 2023 04:58
Copy link
Contributor

@suneet-s suneet-s left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀 Looking forward to the follow up patches!

@zachjsh zachjsh merged commit 660e6cf into apache:master Aug 8, 2023
74 checks passed
@zachjsh zachjsh deleted the auto-kill-resource-control branch August 8, 2023 12:40
@vogievetsky vogievetsky added the Needs web console change Backend API changes that would benefit from frontend support in the web console label Aug 8, 2023
@vogievetsky vogievetsky removed the Needs web console change Backend API changes that would benefit from frontend support in the web console label Aug 17, 2023
@LakshSingla LakshSingla added this to the 28.0 milestone Oct 12, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants