Improve Auto scaler pendingTaskBased provisioning strategy to handle when there are no currently running worker node better #11440

Merged: 9 commits into apache:master on Jul 14, 2021

Conversation

@maytasm (Contributor) commented Jul 13, 2021

Description

As described in #10918, the PendingTaskBasedWorkerProvisioningStrategy of the auto scaler does not work well when there are 0 worker nodes running. The problems are the following:

  1. When there are 0 worker nodes running, the auto scaler currently first scales up to minWorkerCount, and only in the next provisioning cycle can it determine the correct number of workers needed to run all pending tasks. This is inefficient, as we have to go through two provisioning cycles, plus the time it takes for the worker nodes from the first provisioning cycle to be up and running, before we can scale to the correct number (basically it takes twice as long as needed).
  2. When minWorkerCount is set to 0 and there are 0 worker nodes running, the auto scaler never attempts to add more instances. This is because the auto scaler tries to scale to minWorkerCount (which is 0). Hence, pending tasks will not be able to run.

The reason the auto scaler scales to minWorkerCount first is that without any running worker node, it cannot determine the capacity per worker. (Note that even when there are running worker nodes, the auto scaler assumes all worker nodes have the same capacity and uses the capacity of the first running node.)

To fix this problem, I introduce a new config in the PendingTaskBasedWorkerProvisioningConfig: druid.indexer.autoscale.workerCapacityFallback. This config tells the auto scaler the worker capacity to use for determining the number of workers needed when there are currently no workers running. If unset or null, the auto scaler will scale to minNumWorkers in the autoScaler config instead. Note: this config is only applicable to the pendingTaskBased provisioning strategy. Even if this config value is not accurate (e.g. if your worker node capacity has changed over time), it is still useful for solving problem #2 above, as the auto scaler will at least provision some nodes and, in the next provisioning cycle, will be able to determine the correct number of workers needed (rather than being stuck at 0 workers forever).
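For illustration, a hedged example of what this could look like for a homogeneous cluster whose workers each run with druid.worker.capacity=4 (values are illustrative; druid.worker.capacity is the existing worker config, and the new property name follows this PR's description, renamed to workerCapacityHint later in the review):

    # MiddleManager runtime.properties (existing config)
    druid.worker.capacity=4

    # Overlord runtime.properties -- new config in this PR; set it to the capacity
    # of the workers the auto scaler will launch
    druid.indexer.autoscale.workerCapacityFallback=4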

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@suneet-s (Contributor):
adding Design Review since this introduces a new config

@@ -29,6 +29,8 @@
   @JsonProperty
   private int maxScalingStep = 10;

+  @JsonProperty
+  private Integer workerCapacityFallback = null;
Contributor:

why did you choose to use an Integer object instead of an int with a default value?

I suspect a Nullable Integer may be more error prone as callers may not think to check for null and handle that case. At the very least, this field and getWorkerCapacityFallback should be annotated with Nullable so static analysis can catch places where users forget to add null checks.
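For illustration, a minimal sketch of the annotated field and getter being suggested, assuming javax.annotation.Nullable (names match the diff above):

    @JsonProperty
    @Nullable
    private Integer workerCapacityFallback = null;

    @Nullable
    public Integer getWorkerCapacityFallback()
    {
      return workerCapacityFallback;
    }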

Contributor Author (maytasm):

I wanted to use null to mean "value not set". I guess I can also use -1 to do the same thing (value not set, keep the current behavior).

Contributor Author (maytasm):

Changed to int and used -1 as the default value to avoid a potential NPE.

@@ -280,7 +283,8 @@ private int getWorkersNeededToAssignTasks(
       final WorkerTaskRunnerConfig workerTaskRunnerConfig,
       final DefaultWorkerBehaviorConfig workerConfig,
       final Collection<Task> pendingTasks,
-      final Collection<ImmutableWorkerInfo> workers
+      final Collection<ImmutableWorkerInfo> workers,
+      final Integer workerCapacityFallback
Contributor:

Suggested change
-      final Integer workerCapacityFallback
+      final int workerCapacityFallback

Contributor Author (maytasm):

Done

@@ -441,12 +445,18 @@ private int getCurrValidWorkers(Collection<ImmutableWorkerInfo> workers)
     return currValidWorkers;
   }

-  private static int getExpectedWorkerCapacity(final Collection<ImmutableWorkerInfo> workers)
+  private static int getExpectedWorkerCapacity(final Collection<ImmutableWorkerInfo> workers, final Integer workerCapacityFallback)
Contributor:

Suggested change
-  private static int getExpectedWorkerCapacity(final Collection<ImmutableWorkerInfo> workers, final Integer workerCapacityFallback)
+  private static int getExpectedWorkerCapacity(final Collection<ImmutableWorkerInfo> workers, final int workerCapacityFallback)

Contributor Author (maytasm):

done

    } else {
      // Assume capacity per worker as 1
      return 1;
    }
Contributor:

Based on this function I don't think you need to detect the case where workerCapacityFallback is unset. You can simply set the default value to 1.

Contributor Author (maytasm):

Having the user set the config to 1 is not the same as the default value being 1, because of https://github.com/apache/druid/pull/11440/files#diff-86276fbfe645ed104be05c850062fdb8e89fce43c656e41725c1e6d6ed671c25R253

If you have no workers and the new config is NOT SET, then you keep the current behavior, which is to scale to minNumWorkers (this basically disregards the capacity and the pending tasks you have).
If you have no workers and the new config is SET (e.g. set to 1), then you use that value to determine how many workers of that capacity are needed to process the pending tasks you have... which may not be the same as the configured minNumWorkers.
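A simplified sketch of that distinction (method and variable names are illustrative, not the actual Druid code, and each pending task is assumed to need one worker slot):

    // When there are currently no running workers:
    static int workersNeededWithNoRunningWorkers(
        int minWorkerCount,
        int workerCapacityFallback,   // -1 when the new config is unset
        int pendingTaskCount
    )
    {
      if (workerCapacityFallback <= 0) {
        // Config unset: per-worker capacity is unknown, so scale straight to
        // minWorkerCount, ignoring how many tasks are actually pending.
        return minWorkerCount;
      }
      // Config set (e.g. to 1): treat it as the capacity of each new worker and
      // provision enough workers to cover every pending task.
      return (int) Math.ceil((double) pendingTaskCount / workerCapacityFallback);
    }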

Contributor:

ah I see. Sorry I missed that earlier

@jihoonson (Contributor) left a comment:

Thanks @maytasm for fixing this issue. I left a couple of comments.

@@ -1015,6 +1015,7 @@ There are additional configs for autoscaling (if it is enabled):
 |`druid.indexer.autoscale.pendingTaskTimeout`|How long a task can be in "pending" state before the Overlord tries to scale up.|PT30S|
 |`druid.indexer.autoscale.workerVersion`|If set, will only create nodes of set version during autoscaling. Overrides dynamic configuration. |null|
 |`druid.indexer.autoscale.workerPort`|The port that MiddleManagers will run on.|8080|
+|`druid.indexer.autoscale.workerCapacityFallback`| Worker capacity for determining the number of workers needed for auto scaling when there is currently no worker running. If unset or set to value of 0 or less, auto scaler will scale to `minNumWorkers` in autoScaler config instead. Note: this config is only applicable to `pendingTaskBased` provisioning strategy|-1|
Contributor:

"fallback" in the name does not seem intuitive to me. How about workerCapacityHint? Also, it would be nice to document how to set this value. For example, the doc can say "this value should be typically equal to druid.worker.capacity when you have a homogeneous cluster". For heterogeneous clusters, does it make sense to set this to the average capacity?

@maytasm (Contributor Author) commented Jul 14, 2021:

Changed the config name to workerCapacityHint. Updated the docs. I think for heterogeneous clusters, it does make sense to set this to the average capacity. I think the auto scaler should use the average capacity when there are running worker nodes too, but that can be a separate change/discussion.
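For instance (hypothetical numbers), in a heterogeneous cluster mixing worker types configured with druid.worker.capacity of 4 and 8, the average-capacity guidance would suggest:

    druid.indexer.autoscale.workerCapacityHint=6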

// as we cannot determine the exact capacity here to fulfill the need.
// However, if there are no worker but workerCapacityFallback config is set (>0), then we can
// determine the number of workers needed using workerCapacityFallback config as expected worker capacity
int moreWorkersNeeded = currValidWorkers == 0 && config.getWorkerCapacityFallback() <= 0 ? minWorkerCount : getWorkersNeededToAssignTasks(
Contributor:

nit: this line is too long. Please break the line. One example would be

      int moreWorkersNeeded = currValidWorkers == 0 && config.getWorkerCapacityFallback() <= 0
                              ? minWorkerCount
                              : getWorkersNeededToAssignTasks(
                                  remoteTaskRunnerConfig,
                                  workerConfig,
                                  pendingTasks,
                                  workers,
                                  config.getWorkerCapacityFallback()
                              );

Contributor Author (maytasm):

Done

@@ -80,8 +80,8 @@ public GceAutoScaler(
       @JsonProperty("envConfig") GceEnvironmentConfig envConfig
   )
   {
-    Preconditions.checkArgument(minNumWorkers > 0,
-        "minNumWorkers must be greater than 0");
+    Preconditions.checkArgument(minNumWorkers >= 0,
Contributor:

I think the Overlord should fail to start if minNumWorkers = 0 and the new config is not set to a positive value, because the auto scaler will not work anyway. It would be better to fail early.

@maytasm (Contributor Author) commented Jul 14, 2021:

That is only the case if you are using PendingTaskBasedWorkerProvisioningStrategy. You can have minNumWorkers = 0 and the new config not set if you use SimpleWorkerProvisioningStrategy. I think it makes more sense to have the check in the PendingTaskBasedWorkerProvisioningStrategy, instead of having a check in every AutoScaler (EC2, GCE, etc.) that verifies whether the ProvisioningStrategy == PendingTaskBasedWorkerProvisioningStrategy and minNumWorkers = 0 and the new config is not set. However, this means that the failure would only show up when the auto scaler cycle runs. I think this is OK, as PendingTaskBasedWorkerProvisioningStrategy#getDefaultWorkerBehaviorConfig already has some checks that make sure the config/AutoScaler set is valid.

Contributor:

Ah sure, I didn't mean that the check code should be in each autoScaler because it's not possible as autoScaler doesn't know about provisioningStrategyConfig. The check code should be in PendingTaskBasedWorkerProvisioningStrategy instead. I just didn't find a good place to leave my comment earlier 😅

> However, this means that the failure would only show up when the auto scaler cycle runs. I think this is OK, as PendingTaskBasedWorkerProvisioningStrategy#getDefaultWorkerBehaviorConfig already has some checks that make sure the config/AutoScaler set is valid.

Hmm, the check can throw an exception in the constructor of PendingTaskBasedWorkerProvisioningStrategy, can't it? Because the constructor accepts both WorkerBehaviorConfig and PendingTaskBasedWorkerProvisioningConfig as parameters.

Contributor Author (maytasm):

The PendingTaskBasedWorkerProvisioningStrategy gets created once at startup, at which time an auto scaler may not be configured yet, and even if one is configured, the minWorkerCount can change later. This is because the auto scaler, along with the minWorkerCount, can be configured dynamically after the cluster is already up and running. Hence, if the check is in the constructor of PendingTaskBasedWorkerProvisioningStrategy, we can miss the cases I mentioned above. I think it is better to have the check in PendingTaskBasedWorkerProvisioningStrategy#getDefaultWorkerBehaviorConfig, which runs every time we terminate/provision, etc.
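For illustration, a rough sketch of the kind of per-cycle check being discussed (names are illustrative and this is not the exact code added by this PR), run each provisioning cycle rather than at construction time:

    // Inside the per-cycle config validation (illustrative):
    if (autoScaler.getMinNumWorkers() == 0 && config.getWorkerCapacityHint() <= 0) {
      // With zero minimum workers and no capacity hint, the pendingTaskBased
      // strategy can never scale up from zero running workers.
      log.error("autoScaler minNumWorkers is 0 and workerCapacityHint is not set; skipping provisioning");
      return null;
    }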

Contributor:

Ah yeah you are right. Thanks for the explanation.

@jihoonson (Contributor) left a comment:

+1 but please update the doc before this PR is merged.

@@ -1015,6 +1015,7 @@ There are additional configs for autoscaling (if it is enabled):
 |`druid.indexer.autoscale.pendingTaskTimeout`|How long a task can be in "pending" state before the Overlord tries to scale up.|PT30S|
 |`druid.indexer.autoscale.workerVersion`|If set, will only create nodes of set version during autoscaling. Overrides dynamic configuration. |null|
 |`druid.indexer.autoscale.workerPort`|The port that MiddleManagers will run on.|8080|
+|`druid.indexer.autoscale.workerCapacityHint`| Worker capacity for determining the number of workers needed for auto scaling when there is currently no worker running. If unset or set to value of 0 or less, auto scaler will scale to `minNumWorkers` in autoScaler config instead. This value should typically be equal to `druid.worker.capacity` when you have a homogeneous cluster and the average of `druid.worker.capacity` across the workers when you have a heterogeneous cluster. Note: this config is only applicable to `pendingTaskBased` provisioning strategy|-1|
Contributor:

Probably better to clarify that we count only middleManagers and indexers here for homogeneous and heterogeneous clusters.

Contributor Author (maytasm):

Let me make a follow-up PR to do this; since doc-only PRs skip all the Travis tests, it will be super fast.

@maytasm maytasm merged commit 8d7d60d into apache:master Jul 14, 2021
@maytasm maytasm deleted the IMPLY-8399 branch July 14, 2021 23:52
@clintropolis clintropolis added this to the 0.22.0 milestone Aug 12, 2021