Improve Auto scaler pendingTaskBased provisioning strategy to handle when there are no currently running worker node better #11440

Merged: 9 commits into apache:master on Jul 14, 2021

Conversation

@maytasm (Contributor) commented Jul 13, 2021

Description

As described in #10918, the PendingTaskBasedWorkerProvisioningStrategy of the auto scaler does not work well when there are 0 worker nodes running. The problems are the following:

  1. When there are 0 worker nodes running, the auto scaler currently first scales up to minWorkerCount, and only in the next provisioning cycle can it determine the correct number of workers needed to run all pending tasks. This is inefficient, as we have to go through two provisioning cycles, plus the time it takes for the worker nodes from the first provisioning cycle to be up and running, before we can scale to the correct number (basically it takes twice as long as needed).
  2. When minWorkerCount is set to 0 and there are 0 worker nodes running, the auto scaler never attempts to add more instances. This is because the auto scaler tries to scale to minWorkerCount (which is 0). Hence, pending tasks will not be able to run.

The reason the auto scaler scales to minWorkerCount first is that without any running worker node, it cannot determine the capacity per worker. (Note that even when there are running worker nodes, the auto scaler assumes all worker nodes have the same capacity and uses the capacity of the first running node.)

To fix this problem, I introduce a new config in the PendingTaskBasedWorkerProvisioningConfig: druid.indexer.autoscale.workerCapacityFallback. This config tells the auto scaler the worker capacity to use for determining the number of workers needed when there are currently no workers running. If unset or null, the auto scaler will scale to minNumWorkers in the autoScaler config instead. Note: this config is only applicable to the pendingTaskBased provisioning strategy. Even if this config value is not accurate (e.g. if your worker node capacity has changed over time), it is still useful for solving problem #2 above, as the auto scaler will at least provision some nodes and, in the next provisioning cycle, will be able to determine the correct number of workers needed (rather than being stuck at 0 workers forever).
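For illustration, a hedged example of what this could look like for a homogeneous cluster whose workers each run with druid.worker.capacity=4 (values are illustrative; druid.worker.capacity is the existing worker config, and the new property name follows this PR's description, renamed to workerCapacityHint later in the review):

    # MiddleManager runtime.properties (existing config)
    druid.worker.capacity=4

    # Overlord runtime.properties -- new config in this PR; set it to the capacity
    # of the workers the auto scaler will launch
    druid.indexer.autoscale.workerCapacityFallback=4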

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@suneet-s (Contributor):
adding Design Review since this introduces a new config

@@ -29,6 +29,8 @@
   @JsonProperty
   private int maxScalingStep = 10;

+  @JsonProperty
+  private Integer workerCapacityFallback = null;
Contributor:

why did you choose to use an Integer object instead of an int with a default value?

I suspect a Nullable Integer may be more error prone as callers may not think to check for null and handle that case. At the very least, this field and getWorkerCapacityFallback should be annotated with Nullable so static analysis can catch places where users forget to add null checks.
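For illustration, a minimal sketch of the annotated field and getter being suggested, assuming javax.annotation.Nullable (names match the diff above):

    @JsonProperty
    @Nullable
    private Integer workerCapacityFallback = null;

    @Nullable
    public Integer getWorkerCapacityFallback()
    {
      return workerCapacityFallback;
    }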

Contributor Author (maytasm):

I wanted to use null to mean "value not set". I guess I can also use -1 to do the same thing (value not set, keep the current behavior).

Contributor Author (maytasm):

Changed to int and used -1 as the default value to avoid a potential NPE.

@@ -280,7 +283,8 @@ private int getWorkersNeededToAssignTasks(
       final WorkerTaskRunnerConfig workerTaskRunnerConfig,
       final DefaultWorkerBehaviorConfig workerConfig,
       final Collection<Task> pendingTasks,
-      final Collection<ImmutableWorkerInfo> workers
+      final Collection<ImmutableWorkerInfo> workers,
+      final Integer workerCapacityFallback
Contributor:

Suggested change
-      final Integer workerCapacityFallback
+      final int workerCapacityFallback

Contributor Author (maytasm):

Done

@@ -441,12 +445,18 @@ private int getCurrValidWorkers(Collection<ImmutableWorkerInfo> workers)
     return currValidWorkers;
   }

-  private static int getExpectedWorkerCapacity(final Collection<ImmutableWorkerInfo> workers)
+  private static int getExpectedWorkerCapacity(final Collection<ImmutableWorkerInfo> workers, final Integer workerCapacityFallback)
Contributor:

Suggested change
-  private static int getExpectedWorkerCapacity(final Collection<ImmutableWorkerInfo> workers, final Integer workerCapacityFallback)
+  private static int getExpectedWorkerCapacity(final Collection<ImmutableWorkerInfo> workers, final int workerCapacityFallback)

Contributor Author (maytasm):

done

    } else {
      // Assume capacity per worker as 1
      return 1;
    }
Contributor:

Based on this function I don't think you need to detect the case where workerCapacityFallback is unset. You can simply set the default value to 1.

Contributor Author (maytasm):

Having the user set the config to 1 is not the same as the default value being 1, because of https://github.com/apache/druid/pull/11440/files#diff-86276fbfe645ed104be05c850062fdb8e89fce43c656e41725c1e6d6ed671c25R253

If you have no workers and the new config is NOT SET, then you keep the current behavior, which is to scale to minNumWorkers (this basically disregards the capacity and the pending tasks you have).
If you have no workers and the new config is SET (e.g. set to 1), then you use that value to determine how many workers of that capacity are needed to process the pending tasks you have... which may not be the same as the configured minNumWorkers.
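A simplified sketch of that distinction (method and variable names are illustrative, not the actual Druid code, and each pending task is assumed to need one worker slot):

    // When there are currently no running workers:
    static int workersNeededWithNoRunningWorkers(
        int minWorkerCount,
        int workerCapacityFallback,   // -1 when the new config is unset
        int pendingTaskCount
    )
    {
      if (workerCapacityFallback <= 0) {
        // Config unset: per-worker capacity is unknown, so scale straight to
        // minWorkerCount, ignoring how many tasks are actually pending.
        return minWorkerCount;
      }
      // Config set (e.g. to 1): treat it as the capacity of each new worker and
      // provision enough workers to cover every pending task.
      return (int) Math.ceil((double) pendingTaskCount / workerCapacityFallback);
    }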

Contributor:

ah I see. Sorry I missed that earlier

@jihoonson (Contributor) left a comment:

Thanks @maytasm for fixing this issue. I left a couple of comments.

@@ -1015,6 +1015,7 @@ There are additional configs for autoscaling (if it is enabled):
 |`druid.indexer.autoscale.pendingTaskTimeout`|How long a task can be in "pending" state before the Overlord tries to scale up.|PT30S|
 |`druid.indexer.autoscale.workerVersion`|If set, will only create nodes of set version during autoscaling. Overrides dynamic configuration. |null|
 |`druid.indexer.autoscale.workerPort`|The port that MiddleManagers will run on.|8080|
+|`druid.indexer.autoscale.workerCapacityFallback`| Worker capacity for determining the number of workers needed for auto scaling when there is currently no worker running. If unset or set to value of 0 or less, auto scaler will scale to `minNumWorkers` in autoScaler config instead. Note: this config is only applicable to `pendingTaskBased` provisioning strategy|-1|
Contributor:

"fallback" in the name does not seem intuitive to me. How about workerCapacityHint? Also, it would be nice to document how to set this value. For example, the doc can say "this value should be typically equal to druid.worker.capacity when you have a homogeneous cluster". For heterogeneous clusters, does it make sense to set this to the average capacity?

@maytasm (Contributor Author) commented Jul 14, 2021:

Changed the config name to workerCapacityHint. Updated the docs. I think for heterogeneous clusters, it does make sense to set this to the average capacity. I think the auto scaler should use the average capacity when there are running worker nodes too, but that can be a separate change/discussion.
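For instance (hypothetical numbers), in a heterogeneous cluster mixing worker types configured with druid.worker.capacity of 4 and 8, the average-capacity guidance would suggest:

    druid.indexer.autoscale.workerCapacityHint=6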

// as we cannot determine the exact capacity here to fulfill the need.
// However, if there are no worker but workerCapacityFallback config is set (>0), then we can
// determine the number of workers needed using workerCapacityFallback config as expected worker capacity
int moreWorkersNeeded = currValidWorkers == 0 && config.getWorkerCapacityFallback() <= 0 ? minWorkerCount : getWorkersNeededToAssignTasks(
Contributor:

nit: this line is too long. Please break the line. One example would be

      int moreWorkersNeeded = currValidWorkers == 0 && config.getWorkerCapacityFallback() <= 0
                              ? minWorkerCount
                              : getWorkersNeededToAssignTasks(
                                  remoteTaskRunnerConfig,
                                  workerConfig,
                                  pendingTasks,
                                  workers,
                                  config.getWorkerCapacityFallback()
                              );

Contributor Author (maytasm):

Done

@@ -80,8 +80,8 @@ public GceAutoScaler(
       @JsonProperty("envConfig") GceEnvironmentConfig envConfig
   )
   {
-    Preconditions.checkArgument(minNumWorkers > 0,
-        "minNumWorkers must be greater than 0");
+    Preconditions.checkArgument(minNumWorkers >= 0,
Contributor:

I think the Overlord should fail to start if minNumWorkers = 0 and the new config is not set to a positive value, because the auto scaler will not work anyway. It would be better to fail early.

@maytasm (Contributor Author) commented Jul 14, 2021:

That is only the case if you are using PendingTaskBasedWorkerProvisioningStrategy. You can have minNumWorkers = 0 and the new config not set if you use SimpleWorkerProvisioningStrategy. I think it makes more sense to have the check in the PendingTaskBasedWorkerProvisioningStrategy, instead of having a check in every AutoScaler (EC2, GCE, etc.) that verifies whether the ProvisioningStrategy == PendingTaskBasedWorkerProvisioningStrategy and minNumWorkers = 0 and the new config is not set. However, this means that the failure would only show up when the auto scaler cycle runs. I think this is OK, as PendingTaskBasedWorkerProvisioningStrategy#getDefaultWorkerBehaviorConfig already has some checks that make sure the config/AutoScaler set is valid.

Contributor:

Ah sure, I didn't mean that the check code should be in each autoScaler because it's not possible as autoScaler doesn't know about provisioningStrategyConfig. The check code should be in PendingTaskBasedWorkerProvisioningStrategy instead. I just didn't find a good place to leave my comment earlier 😅

> However, this means that the failure would only show up when the auto scaler cycle runs. I think this is OK, as PendingTaskBasedWorkerProvisioningStrategy#getDefaultWorkerBehaviorConfig already has some checks that make sure the config/AutoScaler set is valid.

Hmm, the check can throw an exception in the constructor of PendingTaskBasedWorkerProvisioningStrategy, can't it? Because the constructor accepts both WorkerBehaviorConfig and PendingTaskBasedWorkerProvisioningConfig as parameters.

Contributor Author (maytasm):

The PendingTaskBasedWorkerProvisioningStrategy gets created once at startup, at which time an auto scaler may not be configured yet, and even if one is configured, the minWorkerCount can change later. This is because the auto scaler, along with the minWorkerCount, can be configured dynamically after the cluster is already up and running. Hence, if the check is in the constructor of PendingTaskBasedWorkerProvisioningStrategy, we can miss the cases I mentioned above. I think it is better to have the check in PendingTaskBasedWorkerProvisioningStrategy#getDefaultWorkerBehaviorConfig, which runs every time we terminate/provision, etc.
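For illustration, a rough sketch of the kind of per-cycle check being discussed (names are illustrative and this is not the exact code added by this PR), run each provisioning cycle rather than at construction time:

    // Inside the per-cycle config validation (illustrative):
    if (autoScaler.getMinNumWorkers() == 0 && config.getWorkerCapacityHint() <= 0) {
      // With zero minimum workers and no capacity hint, the pendingTaskBased
      // strategy can never scale up from zero running workers.
      log.error("autoScaler minNumWorkers is 0 and workerCapacityHint is not set; skipping provisioning");
      return null;
    }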

Contributor:

Ah yeah you are right. Thanks for the explanation.

@jihoonson (Contributor) left a comment:

+1 but please update the doc before this PR is merged.

@@ -1015,6 +1015,7 @@ There are additional configs for autoscaling (if it is enabled):
 |`druid.indexer.autoscale.pendingTaskTimeout`|How long a task can be in "pending" state before the Overlord tries to scale up.|PT30S|
 |`druid.indexer.autoscale.workerVersion`|If set, will only create nodes of set version during autoscaling. Overrides dynamic configuration. |null|
 |`druid.indexer.autoscale.workerPort`|The port that MiddleManagers will run on.|8080|
+|`druid.indexer.autoscale.workerCapacityHint`| Worker capacity for determining the number of workers needed for auto scaling when there is currently no worker running. If unset or set to value of 0 or less, auto scaler will scale to `minNumWorkers` in autoScaler config instead. This value should typically be equal to `druid.worker.capacity` when you have a homogeneous cluster and the average of `druid.worker.capacity` across the workers when you have a heterogeneous cluster. Note: this config is only applicable to `pendingTaskBased` provisioning strategy|-1|
Contributor:

Probably better to clarify that we count only middleManagers and indexers here for homogeneous and heterogeneous clusters.

Contributor Author (maytasm):

Let me make a follow-up PR to do this; since doc-only PRs skip all the Travis tests, it will be super fast.

@maytasm maytasm merged commit 8d7d60d into apache:master Jul 14, 2021
@maytasm maytasm deleted the IMPLY-8399 branch July 14, 2021 23:52
@clintropolis clintropolis added this to the 0.22.0 milestone Aug 12, 2021