Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rebase to upstream/cluster-autoscaler-release-1.16 #119

Closed
wants to merge 686 commits into from

Conversation

danielmellado
Copy link

@danielmellado danielmellado commented Oct 24, 2019

The PR was created by first taking upstream/cluster-autoscaler-release-1.16 as the base then applying UPSTREAM: patches on top. The set of picks applied was derived from:

PICKED/CONFLICT 470bd635e UPSTREAM: <carry>: openshift: revert https://github.com/openshift/kubernetes-autoscaler/pull/113
PICKED/OK 0b4a9d514 UPSTREAM: <carry>: openshift: compare_nodegroups: Tolerate small differences in memory capacity
PICKED/OK e0cbdb403 UPSTREAM: <carry>: openshift: Run 'make goimports' over cluster-autoscaler/cloudprovider/openshiftmachineapi
PICKED/OK 9847539cc UPSTREAM: <carry>: openshift: Extend makefile with 'make goimports' target
PICKED/CONFLICT b8ec486e1 UPSTREAM: <carry>: openshift: add custom nodeset comparator
PICKED/OK d65c26fb2 UPSTREAM: <carry>: openshift: report MaxNodesTotal count
PICKED/OK 1b746e405 UPSTREAM: <carry>: openshift: reference k8s.io/api/core/v1 with corev1 alias
PICKED/OK 35cc7965d UPSTREAM: <carry>: openshift: Rework logic in DeleteNodes()
PICKED/OK abbb314e0 UPSTREAM: <carry>: openshift: Switch builds to use Go 1.12
PICKED/CONFLICT 6efd972c0 UPSTREAM: <carry>: openshift: Bump deps for cloudprovider/openshiftmachineapi
PICKED/OK c688fa6ac UPSTREAM: <carry>: openshift: add test/openshift/Makefile
PICKED/OK b81de6560 UPSTREAM: <carry>: openshift: fix build post 1.14 rebase
PICKED/OK 86002a4a3 UPSTREAM: <carry>: openshift: simplify test setup for nodes/replicas
PICKED/OK baa22d62f UPSTREAM: <carry>: openshift: update criteria for returning a nodegroup
PICKED/OK 7e8be6416 UPSTREAM: <carry>: openshift: add test/openshift/hack scripts
PICKED/OK f3aa284a2 UPSTREAM: <carry>: openshift: move MaxNodesTotalReached event again
PICKED/OK f5d5d669e UPSTREAM: <carry>: openshift: Revert "Merge pull request #90 from frobware/add-cluster-size-reached-event"
PICKED/OK 0947e6480 UPSTREAM: <carry>: openshift: move maxnodes total event
PICKED/OK eb75e6d7d UPSTREAM: <carry>: openshift: Add joelsmith and sjenning to VPA OWNERS file
PICKED/OK beeedef12 UPSTREAM: <carry>: openshift: Add OpenShift VPA image builds
PICKED/OK 1dade0d72 UPSTREAM: <carry>: openshift: prioritize providerID in calls to Nodes()
PICKED/OK f25a6aa17 UPSTREAM: <carry>: openshift: add test to search using annotation
PICKED/OK 4ec5b6887 UPSTREAM: <carry>: openshift: Prioritize machine search using provider ID
PICKED/OK bf7bf2354 UPSTREAM: <carry>: openshift: add findMachineByNodeProviderID
PICKED/OK e030dd3d3 UPSTREAM: <carry>: openshift: add index to machine informer
PICKED/OK 3412cfc98 UPSTREAM: <carry>: openshift: simplify config creation
PICKED/OK a63568c3c UPSTREAM: <carry>: openshift: record max-nodes-total event
PICKED/OK 68c4c33e3 UPSTREAM: <carry>: openshift: disable MachineDeployment in normal operation
PICKED/OK a83320a7b UPSTREAM: <carry>: openshift: shorten test helper names
PICKED/OK 909396c97 UPSTREAM: <carry>: openshift: add debugFormat const
PICKED/OK dfb77c5a4 UPSTREAM: <carry>: openshift: move machineAnnotationKey constant
PICKED/OK ce3e9a721 UPSTREAM: <carry>: openshift: rework controller test setup
PICKED/OK 2b97154d1 UPSTREAM: <carry>: openshift: rework nodegroup test setup
PICKED/OK bdc7ed988 UPSTREAM: <carry>: openshift: rework provider test setup
PICKED/OK bc4db8435 UPSTREAM: <carry>: openshift: add new test helper
PICKED/OK a0ffce84f UPSTREAM: <carry>: openshift: expose nodegroup type for testing
PICKED/OK d816c806e UPSTREAM: <carry>: openshift: simplify controller node lookup tests
PICKED/OK d7b961851 UPSTREAM: <carry>: openshift: simplify TestControllerNodeGroupForNodeWithMissingMachineOwner
PICKED/OK ff81185b6 UPSTREAM: <carry>: openshift: remove TestControllerNodeGroupsSizes
PICKED/OK f4ad36cdf UPSTREAM: <carry>: openshift: simplify TestControllerMachinesInMachineSet
PICKED/OK 219d52859 UPSTREAM: <carry>: openshift: simplify TestControllerFindNodeByNodeName
PICKED/OK a4563cb4e UPSTREAM: <carry>: openshift: simplify TestControllerFindMachineByNodeProviderID
PICKED/OK 56ab19993 UPSTREAM: <carry>: openshift: simplify TestControllerFindMachineOwner
PICKED/OK efcb4e0e1 UPSTREAM: <carry>: openshift: fix lint issue machineapi_provider
PICKED/OK 5a0d21103 UPSTREAM: <carry>: openshift: Add fmt, lint, vet scripts/Makefile
PICKED/CONFLICT 35b37a052 UPSTREAM: <carry>: openshift: expose KubeConfigPath in CA options
PICKED/CONFLICT 6a251a307 UPSTREAM: <carry>: openshift: cloudprovider builder updates for 1.13
PICKED/OK 4ecc1d233 UPSTREAM: <carry>: openshift: cloudprovider updates for 1.13
PICKED/OK afa3ff275 UPSTREAM: <carry>: openshift: add unit test TestNodeGroupDecreaseTargetSize
PICKED/OK 2f11664a8 UPSTREAM: <carry>: openshift: add unit test TestNodeGroupIncreaseSize
PICKED/OK 820cea0ad UPSTREAM: <carry>: openshift: create git history verification script
PICKED/OK 587a0a5bc UPSTREAM: <carry>: openshift: check for explicit errors in resize tests
PICKED/OK be823b0bf UPSTREAM: <carry>: openshift: move all test utility functions
PICKED/OK 817bd97bd UPSTREAM: <carry>: openshift: remove parallel tests
PICKED/OK a11169ddf UPSTREAM: <carry>: openshift: simplify TestProviderConstructorProperties
PICKED/OK 2a43ac621 UPSTREAM: <carry>: openshift: add spec to clusterTestConfig
PICKED/OK acaebb962 UPSTREAM: <carry>: openshift: force test namespace ToLower()
PICKED/OK 8e96d1ac1 UPSTREAM: <carry>: openshift: remove unused makeMachineOwner()
PICKED/OK f4bdec1ab UPSTREAM: <carry>: openshift: remove t.Helper() in test helpers
PICKED/OK 14f6d94d9 UPSTREAM: <carry>: openshift: Rework TestControllerNodeGroupForNodeLookup
PICKED/OK 025c4e64a UPSTREAM: <carry>: openshift: Rework utils test funcs
PICKED/OK 548aef244 UPSTREAM: <carry>: openshift: Rework TestControllerFindMachineByID
PICKED/OK e7499baee UPSTREAM: <carry>: openshift: Rework TestControllerNodeGroups
PICKED/OK 368ba1b25 UPSTREAM: <carry>: openshift: Rework TestNodeGroupDeleteNodes
PICKED/OK 0368ecb33 UPSTREAM: <carry>: openshift: Rework TestNodeGroupResize
PICKED/OK 216538a54 UPSTREAM: <carry>: openshift: Rework TestNodeGroupNewNodeGroup
PICKED/OK 59c5273c8 UPSTREAM: <carry>: openshift: remove TODO
PICKED/OK e7184e4b9 UPSTREAM: <carry>: openshift: openshiftmachineapi: remove unused fields
PICKED/OK 346070a04 UPSTREAM: <carry>: openshift: Remove old test functions
PICKED/OK 6c294f2d5 UPSTREAM: <carry>: openshift: return no nodegroup when scaling bounds == 0
PICKED/OK 53626b2ad UPSTREAM: <carry>: openshift: create utility functions
PICKED/OK b2811c128 UPSTREAM: <carry>: openshift: validate node membership in DeleteNodes()
PICKED/OK 22c8f9347 UPSTREAM: <carry>: openshift: log nodegroup discovery at level 4
PICKED/OK 9a17ba0c1 UPSTREAM: <carry>: openshift: address review comments
PICKED/OK 13ef18dd2 UPSTREAM: <carry>: openshift: openshiftmachineapi: add feature gate for MachineDeployment support
PICKED/OK 54908968f UPSTREAM: <carry>: openshift: add unit test for nodegroup
PICKED/OK 751691069 UPSTREAM: <carry>: openshift: add unit test for provider
PICKED/OK 1c3006351 UPSTREAM: <carry>: openshift: add unit test for controller
PICKED/OK 0ebf27bd8 UPSTREAM: <carry>: openshift: add unit test for utils
PICKED/OK a3c490977 UPSTREAM: <carry>: openshift: MachineDeployment implementation of ScalableResource
PICKED/OK e7f222689 UPSTREAM: <carry>: openshift: MachineSet implementation of ScalableResource
PICKED/OK 468764cbc UPSTREAM: <carry>: openshift: Add interface ScalableResource
PICKED/OK 22cc1b4e2 UPSTREAM: <carry>: openshift: Revert to whitebox testing
PICKED/OK cf756c02d UPSTREAM: <carry>: openshift: Add MachineDeployment informers to controller
PICKED/OK c8752bd63 UPSTREAM: <carry>: openshift: cloudprovider: pivot to machine.openshift.io/v1beta1
PICKED/OK d71287be3 UPSTREAM: <carry>: openshift: assign ownership to cloud team
PICKED/OK bf71a0d67 UPSTREAM: <carry>: openshift: Remove machineController.MachineSets() as it is not called
PICKED/OK 2baa1b0ea UPSTREAM: <carry>: openshift: Add unit test for MachinesInMachineSet()
PICKED/OK e8bb3a916 UPSTREAM: <carry>: openshift: Add unit test for findMachineByNodeProviderID()
PICKED/OK 19c6a640f UPSTREAM: <carry>: openshift: Add unit test for findMachineOwner()
PICKED/OK 51f48149e UPSTREAM: <carry>: openshift: Add unit test for findNodeByNodeName()
PICKED/OK f7c9db1e5 UPSTREAM: <carry>: openshift: Add unit test for findMachine()
PICKED/OK 342b52a52 UPSTREAM: <carry>: openshift: Remove obsolete usage of "machine" annotation
PICKED/OK 6260a467c UPSTREAM: <carry>: openshift: utils: add unit tests for clusterapi_utils.go
PICKED/OK 13be13650 UPSTREAM: <carry>: openshift: Use node.Spec.ProviderID instead of node.Name
PICKED/OK 27b6ee75b UPSTREAM: <carry>: openshift: Decouple nodegroup via dependency injection
PICKED/OK 26dcf0944 UPSTREAM: <carry>: openshift: Use 'machine' for machine parameter names
PICKED/OK 21c6b0f0a UPSTREAM: <carry>: openshift: Rename field provider.clusterapi to clusterapiClient
PICKED/OK d56b05e9b UPSTREAM: <carry>: openshift: Use NamespaceAll in lieu of ""
PICKED/OK 2f796423b UPSTREAM: <carry>: openshift: Rename clusterController to machineController
PICKED/OK c91d57a1a UPSTREAM: <carry>: openshift: Additional logging in provider.NodeGroups()
PICKED/OK 1b8f4aac4 UPSTREAM: <carry>: openshift: Move min/max constants to clusterapi_utils.go
PICKED/OK 92ac3ce3c UPSTREAM: <carry>: openshift: Switch to informers for observing machines and machinesets
PICKED/OK a0b82709a UPSTREAM: <carry>: openshift: cluster-api switch to annotations
PICKED/OK be3b8e840 UPSTREAM: <carry>: openshift: Add a RHEL7 dockerfile and standarize format
PICKED/OK ca73a66de UPSTREAM: <carry>: openshift: initial cluster-api provider implementation
PICKED/OK 83db04b86 UPSTREAM: <carry>: openshift: cluster-autoscaler.spec: set golang_version to 1.10
PICKED/OK a4903d792 UPSTREAM: <carry>: openshift: cluster-autoscaler.spec: bump golang_version to 1.10.4
PICKED/OK 532d2e88e UPSTREAM: <carry>: openshift: Fix spec file to be consistent
PICKED/OK 3d619d08b UPSTREAM: <carry>: openshift: Bump embedded tools
PICKED/OK cc885725d UPSTREAM: <carry>: openshift: Fix the spec and hack scripts so the package can be built in both build systems
PICKED/CONFLICT 647555606 UPSTREAM: <carry>: openshift: Add openshift/release Makefile and hack scripts
PICKED/OK abd595981 UPSTREAM: <carry>: openshift: Add dockerfile for cluster autoscaler.
PICKED/OK 302c7ab11 UPSTREAM: <carry>: openshift: Add spec file for cluster-autoscaler.

k8s-ci-robot and others added 30 commits July 16, 2019 01:37
Use print() function in both Python 2 and Python 3
Run VPA e2e tests for both apis even if one fails
add test case for controller_fetcher
correct comments
typo
add test case for controller_fetcher and cluster_feeder
Currently we read certificate files into a buffer of 5000 bytes. This
sets a max size for our certificates; above this max, we will get
invalid certs, most likely, leading to runtime bad certificate errors.
Use ioutil for certificate file reading.
- SubscriptionID placeholder does not match documentation or surrounding placeholders
Add support for cronjobs in VPA
Update c5,i3en,m5,r5 instance types
xrange() was removed in Python 3 in favor of range()
Fix typo in autoscaler azure example
…rash

nodeGroup judy IsNil to avoid crashed
…erver-over-heapster

Use Metrics Server instead of Heapster in the FAQ
Update VPA dependencies to k8s 1.14.3
Switch VPA examples to use apps/v1 API for Deployment
…milarity_rules

fix: ignore agentpool label when looking for similar node groups with Azure provider
…onfig_readme

Add multinode config in readme.
"delete node" has a specific meaning in the context of the Kubernetes
API: deleting the node object. However, the cluster-autoscaler never
does that; it terminates the underlying instance and expects the
[cloud-node-controller](https://kubernetes.io/docs/concepts/architecture/cloud-controller/#node-controller)
to remove the corresponding node from Kubernetes.

Replace all mentions of "deleting" a node with "terminating" to
disambiguate this.

Signed-off-by: Matthias Rampke <mr@soundcloud.com>
This had me stumped for the better part of a day. While the
cluster-autoscaler "deletes" nodes, it does not actually delete the Node
object from the Kubernetes API.

In normal operations, with a well-configured cluster, this is a minor
point; however, when debugging why nodes do not get deleted, the
inconsistent terminology can be a major headache. This FAQ entry should
clarify the difference for anyone who needs to know.

Signed-off-by: Matthias Rampke <mr@soundcloud.com>
…examples

Add required selector to VPA deployment examples for apps/v1
up when there was a multizonal pool with number of nodes exceeding limit for one zone.
frobware and others added 17 commits October 24, 2019 13:41
…from frobware/add-cluster-size-reached-event"

This reverts commit 9b2cf49, reversing
changes made to 291cdcd.
Be less specific as to when the event is generated. The qualifying
criteria is when the adding a node would reach the --max-nodes-total
specified.
We previously only considered a MachineSet or MachineDeployment
suitable for scaling if it had positive scaling bounds.

For example:

    annotations:
      machine.openshift.io/cluster-api-autoscaler-node-group-min-size: "1"
      machine.openshift.io/cluster-api-autoscaler-node-group-max-size: "6"

On us-west-2 we have:

    $ kubectl get machinesets
    NAME                                  DESIRED   CURRENT   READY   AVAILABLE   AGE
    amcdermo-ca-5ws2b-worker-us-west-2a   1         1         1       1           4h23m
    amcdermo-ca-5ws2b-worker-us-west-2b   1         1         1       1           4h23m
    amcdermo-ca-5ws2b-worker-us-west-2c   1         1         1       1           4h23m
    amcdermo-ca-5ws2b-worker-us-west-2d   0         0                             4h23m

but the autoscaler will fail to scale up any of these because the last
machineset has a replica count of 0. This change modifies the criteria
for where we return a nodegroup (i.e., machine set) to must have:
- positive scaling bounds, and
- the underlying resource must have a replica count that is > 0.
When creating a test config I had previously allowed for the number of
nodes to be different to the number of replicas. We don't use this
distinction so dropping that entirely from all the unit tests.
This invokes the e2e tests and is mainly invoked by CI.
This reworks the logic in DeleteNodes([]nodes) to:

- annotate a node for deletion
- drop the replica count by 1 immediately
- fail fast on any error during those steps

This is different from the current behaviour which first annotated all
the nodes then dropped the replica count by len(nodes). The goal here
is to not leave machines and the respective machine set replica count
in an indeterminate state should any error occur.
…1 alias

This is largely to be consistent with other usages (in the community)
but really to be at parity with the upstream PR [1] that uses this
import alias already. This also makes it easier to backport changes
made from openshift/autoscaler into upstream.

[1] kubernetes#1866
The original intent of the event was to convey that the ceiling had
been reached. This now logs that configured maximum.
Fixes: https://bugzilla.redhat.com/show_bug.cgi?id=1731011

In OpenShift clusters we see very small memory differences between
instances, typically 8Ki. The default comparator which is used when
--balance-similar-node-groups is enabled only considers nodes in a
nodegroup to be equal when the memory capacity is identical. This new
comparator is identical to the default but will tolerate a 128Ki
difference in memory capacity.

With this change, and with --balance-similar-node-groups enabled, I
now see the following:

```console
$ oc get machinesets
NAME                               DESIRED   CURRENT   READY   AVAILABLE   AGE
amcdermo-w6m6z-worker-us-east-2a   3         3         3       3           24h
amcdermo-w6m6z-worker-us-east-2b   4         4         4       4           24h
amcdermo-w6m6z-worker-us-east-2c   4         4         4       4           24h
e2e-59443-w-0                      3         3         1       1           20m
e2e-59443-w-1                      3         3         1       1           20m
e2e-59443-w-2                      3         3         1       1           20m
e2e-59443-w-3                      3         3         1       1           20m
e2e-59443-w-4                      3         3         1       1           20m
e2e-59443-w-5                      3         3         1       1           20m
e2e-59443-w-6                      3         3         1       1           20m
e2e-59443-w-7                      3         3         1       1           20m
e2e-59443-w-8                      3         3         1       1           20m
```
…arget

So the target can be run in CI and fail in case files are not properly goimport formated.
…erences in memory capacity

The current comparator expects memory capacity values to be identical.
However across AWS, Azure and GCP I quite often see very small
differences in capacity, typically 8-16Ki. When this occurs the
nodegroups are considered not equal when balancing is in effect which
is unfortunate because, in reality, they are identical.

This change will now tolerate a 128Ki difference before memory
capacity values are considered unequal.
@danielmellado
Copy link
Author

Pull-request updated, HEAD is now 6ddbc80

1 similar comment
@danielmellado
Copy link
Author

Pull-request updated, HEAD is now 6ddbc80

@openshift-ci-robot openshift-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Oct 24, 2019
@openshift-ci-robot openshift-ci-robot added the do-not-merge/invalid-owners-file Indicates that a PR should not merge because it has an invalid OWNERS file in it. label Oct 24, 2019
@openshift-ci-robot
Copy link

The following users are mentioned in OWNERS file(s) but are not members of the openshift org.

Once all users have been added as members of the org, you can trigger verification by writing /verify-owners in a comment.

  • krzysztof-jastrzebski
    • cluster-autoscaler/OWNERS
  • Jeffwan
    • cluster-autoscaler/OWNERS
    • cluster-autoscaler/cloudprovider/aws/OWNERS
  • ringtail
    • cluster-autoscaler/cloudprovider/alicloud/OWNERS
  • hello2mao
    • cluster-autoscaler/cloudprovider/baiducloud/OWNERS
  • andrewsykim
    • cluster-autoscaler/cloudprovider/digitalocean/OWNERS
  • tghartland
    • cluster-autoscaler/cloudprovider/magnum/OWNERS

@openshift-ci-robot
Copy link

@danielmellado: The following tests failed, say /retest to rerun them all:

Test name Commit Details Rerun command
ci/prow/unit 6ddbc80 link /test unit
ci/prow/e2e-aws-operator 6ddbc80 link /test e2e-aws-operator
ci/prow/images 6ddbc80 link /test images
ci/prow/govet 6ddbc80 link /test govet

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@danielmellado danielmellado changed the title Pull request for rebase_1.16 Rebase to upstream/cluster-autoscaler-release-1.16 Oct 24, 2019
@enxebre
Copy link
Member

enxebre commented Oct 30, 2019

thanks @danielmellado! closing in favour of #120

@enxebre
Copy link
Member

enxebre commented Oct 30, 2019

/close

@openshift-ci-robot
Copy link

@enxebre: Closed this PR.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
do-not-merge/invalid-owners-file Indicates that a PR should not merge because it has an invalid OWNERS file in it. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.