Improve Auto scaler pendingTaskBased provisioning strategy to handle when there are no currently running worker node better #11440
Conversation
@@ -29,6 +29,8 @@
   @JsonProperty
   private int maxScalingStep = 10;

+  @JsonProperty
+  private Integer workerCapacityFallback = null;
Why did you choose to use an `Integer` object instead of an `int` with a default value? I suspect a nullable `Integer` may be more error prone, as callers may not think to check for null and handle that case. At the very least, this field and `getWorkerCapacityFallback` should be annotated with `@Nullable` so static analysis can catch places where users forget to add null checks.
I wanted to use null to mean the value is not set. I guess I can also use -1 to do the same thing (treat it as value not set and keep the current behavior).
Changed to `int` and used `-1` as the default value to avoid a potential NPE.
@@ -280,7 +283,8 @@ private int getWorkersNeededToAssignTasks(
       final WorkerTaskRunnerConfig workerTaskRunnerConfig,
       final DefaultWorkerBehaviorConfig workerConfig,
       final Collection<Task> pendingTasks,
-      final Collection<ImmutableWorkerInfo> workers
+      final Collection<ImmutableWorkerInfo> workers,
+      final Integer workerCapacityFallback
Suggested change:
-      final Integer workerCapacityFallback
+      final int workerCapacityFallback
Done
@@ -441,12 +445,18 @@ private int getCurrValidWorkers(Collection<ImmutableWorkerInfo> workers)
     return currValidWorkers;
   }

-  private static int getExpectedWorkerCapacity(final Collection<ImmutableWorkerInfo> workers)
+  private static int getExpectedWorkerCapacity(final Collection<ImmutableWorkerInfo> workers, final Integer workerCapacityFallback)
Suggested change:
-  private static int getExpectedWorkerCapacity(final Collection<ImmutableWorkerInfo> workers, final Integer workerCapacityFallback)
+  private static int getExpectedWorkerCapacity(final Collection<ImmutableWorkerInfo> workers, final int workerCapacityFallback)
done
    } else {
      // Assume capacity per worker as 1
      return 1;
    }
Based on this function, I don't think you need to detect the case where `workerCapacityFallback` is unset. You can simply set the default value to 1.
Having the user set the config to 1 and the default value being 1 should not be the same, due to https://github.com/apache/druid/pull/11440/files#diff-86276fbfe645ed104be05c850062fdb8e89fce43c656e41725c1e6d6ed671c25R253

If you have no workers and the new config is NOT set, then you get the current behavior, which is to scale to `minNumWorkers` (this basically disregards the capacity and pending tasks you have).

If you have no workers and the new config IS set (e.g. set to 1), then that value is used to determine how many workers of that capacity are needed to process the pending tasks you have, which may not be the same as the configured `minNumWorkers`.
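The distinction can be sketched as follows. This is a hypothetical simplification of the decision described above, not the actual Druid code; the method name and signature are illustrative only:

```java
// Sketch only: a <= 0 fallback value means "not set".
public class CapacityFallbackSketch
{
  static int workersNeeded(
      int currValidWorkers,
      int workerCapacityFallback,
      int minNumWorkers,
      int pendingTaskCount
  )
  {
    if (currValidWorkers == 0 && workerCapacityFallback <= 0) {
      // Config not set: current behavior, scale straight to minNumWorkers,
      // disregarding the pending task load.
      return minNumWorkers;
    }
    // Config set: size the provision request from the pending tasks,
    // assuming each new worker has workerCapacityFallback task slots.
    return (int) Math.ceil((double) pendingTaskCount / workerCapacityFallback);
  }
}
```

So with 10 pending tasks, `minNumWorkers = 3`, and no workers running, the unset case provisions 3 workers regardless of load, while a fallback of 2 provisions 5.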
Ah I see. Sorry I missed that earlier.
Thanks @maytasm for fixing this issue. I left a couple of comments.
docs/configuration/index.md
@@ -1015,6 +1015,7 @@ There are additional configs for autoscaling (if it is enabled):
 |`druid.indexer.autoscale.pendingTaskTimeout`|How long a task can be in "pending" state before the Overlord tries to scale up.|PT30S|
 |`druid.indexer.autoscale.workerVersion`|If set, will only create nodes of set version during autoscaling. Overrides dynamic configuration.|null|
 |`druid.indexer.autoscale.workerPort`|The port that MiddleManagers will run on.|8080|
+|`druid.indexer.autoscale.workerCapacityFallback`|Worker capacity for determining the number of workers needed for auto scaling when there is currently no worker running. If unset or set to a value of 0 or less, the auto scaler will scale to `minNumWorkers` in the autoScaler config instead. Note: this config is only applicable to the `pendingTaskBased` provisioning strategy.|-1|
"Fallback" in the name does not seem intuitive to me. How about `workerCapacityHint`? Also, it would be nice to document how to set this value. For example, the doc can say "this value should typically be equal to `druid.worker.capacity` when you have a homogeneous cluster". For heterogeneous clusters, does it make sense to set this to the average capacity?
Changed the config name to `workerCapacityHint` and updated the docs. For heterogeneous clusters, I think it does make sense to set this to the average capacity. I also think the auto scaler should use the average capacity when there are running worker nodes, but that can be a separate change/discussion.
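As a usage sketch for a homogeneous cluster, the hint would simply mirror the per-worker capacity (the values below are illustrative, not recommendations):

```properties
# Every MiddleManager in this hypothetical cluster is started with
# druid.worker.capacity=4, so the hint mirrors that value. For a
# heterogeneous cluster, the average capacity across workers could be used.
druid.indexer.autoscale.workerCapacityHint=4
```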
      // as we cannot determine the exact capacity here to fulfill the need.
      // However, if there are no workers but the workerCapacityFallback config is set (>0), then we can
      // determine the number of workers needed using the workerCapacityFallback config as the expected worker capacity
      int moreWorkersNeeded = currValidWorkers == 0 && config.getWorkerCapacityFallback() <= 0 ? minWorkerCount : getWorkersNeededToAssignTasks(
nit: this line is too long. Please break the line. One example would be
int moreWorkersNeeded = currValidWorkers == 0 && config.getWorkerCapacityFallback() <= 0
? minWorkerCount
: getWorkersNeededToAssignTasks(
remoteTaskRunnerConfig,
workerConfig,
pendingTasks,
workers,
config.getWorkerCapacityFallback()
);
Done
@@ -80,8 +80,8 @@ public GceAutoScaler(
       @JsonProperty("envConfig") GceEnvironmentConfig envConfig
   )
   {
-    Preconditions.checkArgument(minNumWorkers > 0,
-        "minNumWorkers must be greater than 0");
+    Preconditions.checkArgument(minNumWorkers >= 0,
I think the Overlord should fail to start if minNumWorkers = 0 and the new config is not set to a positive value, because the auto scaler will not work anyway. It would be better to fail early.
That is only the case if you are using PendingTaskBasedWorkerProvisioningStrategy. You can have minNumWorkers = 0 with the new config not set if you use SimpleWorkerProvisioningStrategy. I think it makes more sense to do the check in PendingTaskBasedWorkerProvisioningStrategy, instead of having every AutoScaler (EC2, GCE, etc.) check whether the ProvisioningStrategy is PendingTaskBasedWorkerProvisioningStrategy, minNumWorkers = 0, and the new config is not set. However, this means that the failure would only show when the auto scaler cycle runs. I think this is ok, as PendingTaskBasedWorkerProvisioningStrategy#getDefaultWorkerBehaviorConfig already has some checks that make sure the configured AutoScaler is valid.
Ah sure, I didn't mean that the check should be in each AutoScaler; that's not possible because the AutoScaler doesn't know about the provisioning strategy config. The check should be in PendingTaskBasedWorkerProvisioningStrategy instead. I just didn't find a good place to leave my comment earlier 😅

> However, this means that the failure would only show when the auto scaler cycle runs. I think this is ok, as PendingTaskBasedWorkerProvisioningStrategy#getDefaultWorkerBehaviorConfig already has some checks that make sure the configured AutoScaler is valid.

Hmm, the check can throw an exception in the constructor of PendingTaskBasedWorkerProvisioningStrategy, can't it? The constructor accepts both WorkerBehaviorConfig and PendingTaskBasedWorkerProvisioningConfig as its parameters.
The PendingTaskBasedWorkerProvisioningStrategy gets created once at startup, at which point an auto scaler may not be configured yet, and even if one is, minWorkerCount can change later. This is because the auto scaler, along with its minWorkerCount, can be configured dynamically after the cluster is already up and running. Hence, if the check is in the constructor of PendingTaskBasedWorkerProvisioningStrategy, we can miss the cases mentioned above. I think it is better to have the check in PendingTaskBasedWorkerProvisioningStrategy#getDefaultWorkerBehaviorConfig, which runs every time we terminate/provision.
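A per-cycle validation along those lines could look like the following. This is a hedged sketch, not the actual PR code; the class and method names are illustrative:

```java
// Sketch: validate the dynamic config on every provisioning cycle, so
// changes to minWorkerCount made after startup are still caught.
public class AutoScalerConfigCheckSketch
{
  static void validate(int minWorkerCount, int workerCapacityHint)
  {
    if (minWorkerCount == 0 && workerCapacityHint <= 0) {
      // With zero minimum workers and no capacity hint, the auto scaler
      // can never scale up from zero workers, so fail loudly.
      throw new IllegalArgumentException(
          "workerCapacityHint must be a positive value when minWorkerCount is 0"
      );
    }
  }
}
```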
Ah yeah, you are right. Thanks for the explanation.
+1 but please update the doc before this PR is merged.
@@ -1015,6 +1015,7 @@ There are additional configs for autoscaling (if it is enabled):
 |`druid.indexer.autoscale.pendingTaskTimeout`|How long a task can be in "pending" state before the Overlord tries to scale up.|PT30S|
 |`druid.indexer.autoscale.workerVersion`|If set, will only create nodes of set version during autoscaling. Overrides dynamic configuration.|null|
 |`druid.indexer.autoscale.workerPort`|The port that MiddleManagers will run on.|8080|
+|`druid.indexer.autoscale.workerCapacityHint`|Worker capacity for determining the number of workers needed for auto scaling when there is currently no worker running. If unset or set to a value of 0 or less, the auto scaler will scale to `minNumWorkers` in the autoScaler config instead. This value should typically be equal to `druid.worker.capacity` when you have a homogeneous cluster, or the average of `druid.worker.capacity` across the workers when you have a heterogeneous cluster. Note: this config is only applicable to the `pendingTaskBased` provisioning strategy.|-1|
Probably better to clarify that we count only MiddleManagers and Indexers here for homogeneous and heterogeneous clusters.
Let me make a follow-up PR to do this. Since doc-only PRs skip all the Travis tests, it will be super fast.
Improve Auto scaler pendingTaskBased provisioning strategy to handle when there are no currently running worker node better
Description
As described in #10918, the PendingTaskBasedWorkerProvisioningStrategy of the auto scaler does not work well when there are 0 worker nodes running. The problems are the following:

The reason the auto scaler scales to minWorkerCount first is that, without any running worker node, it cannot determine the capacity per worker. (Note that even when there are running worker nodes, the auto scaler assumes all worker nodes have the same capacity and uses the capacity of the first running node.)
To fix this problem, I introduce a new config in the PendingTaskBasedWorkerProvisioningConfig, `druid.indexer.autoscale.workerCapacityFallback`. This config tells the auto scaler the worker capacity to use when determining the number of workers needed while there are currently no workers running. If unset or null, the auto scaler will scale to `minNumWorkers` in the autoScaler config instead. Note: this config is only applicable to the `pendingTaskBased` provisioning strategy. Even if this config value is not accurate (i.e. if your worker node capacity changed over time), it is still useful for solving problem #2 above, as the auto scaler will at least provision some nodes, and in the next provisioning cycle it will be able to determine the correct number of workers needed (rather than being stuck at 0 workers forever).

This PR has: