
Health Probes do not gracefully shutdown traffic for externalTrafficPolicy Cluster services #3499

Closed
JoelSpeed opened this issue Mar 6, 2023 · 12 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@JoelSpeed
Contributor

JoelSpeed commented Mar 6, 2023

What happened:

I have been investigating disruption of services behind load balancers that use externalTrafficPolicy: Cluster (herein ETP Cluster). Currently, for an ETP Cluster service, the Azure health probes are configured to check the node port of the application itself using a blind TCP check. They can be configured to use an HTTP(S) based check, but this only allows the path of the check to be specified; it does not allow you to specify a custom port.

When draining an application behind an ETP Local service, assuming you have configured graceful shutdown correctly, you can return a negative health probe during the graceful termination period while still accepting new connections. This lets you signal to the load balancer that you are going away before you actually do, allowing it to route new connections to other nodes.
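
To make that pattern concrete, here is a minimal sketch of a backend configured for graceful shutdown; the image name, port, probe path and durations are hypothetical placeholders, and the application is assumed to start failing its readiness endpoint on SIGTERM while continuing to serve connections for the remainder of the grace period:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-app                           # hypothetical name
spec:
  replicas: 2
  selector:
    matchLabels:
      k8s-app: test
  template:
    metadata:
      labels:
        k8s-app: test
    spec:
      terminationGracePeriodSeconds: 60
      containers:
      - name: app
        image: example.com/test-app:latest # placeholder image
        ports:
        - containerPort: 8080
        readinessProbe:
          httpGet:
            path: /readyz                  # app flips this to failing on SIGTERM...
            port: 8080
          periodSeconds: 5
          failureThreshold: 2
        # ...while continuing to accept and serve connections until the grace
        # period expires, giving the load balancer time to observe the negative
        # probe and drain the node before connections are actually refused.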

In an ETP Cluster service, this doesn't work. With the same backend, during the graceful shutdown period, the application will start reporting a negative readiness probe within the cluster and will be removed from the service endpoints. To the node port, however, this is completely opaque: it has no idea that the pod has been removed from the node.

As soon as the application goes into an unready state, kube-proxy (or equivalent) on the node will start routing traffic to other endpoints within the service. The load balancer still believes that the instance is healthy, no matter how you configure the health checks. Only once the node itself can no longer route traffic to these other endpoints do the health probes start to fail.
Because we are, effectively, health checking another load balancer, this can only happen once the node port itself can no longer accept new connections (typically, once kube-proxy has shut down).

What you expected to happen:

I should be able to configure a health probe that tells Azure that my instance will shortly no longer be able to serve traffic, allowing it to remove the instance from service before the instance stops accepting new connections.

How to reproduce it (as minimally and precisely as possible):

I've created a test that you can run which measures how much a service is disrupted when we upgrade a cluster (we run this within CI as part of OpenShift's release process).

To run it, you will need:

  • A copy of oc on your PATH (download here)
  • A checkout of the disruption-test branch from github.com/JoelSpeed/origin
  • An Azure cluster running, with your KUBECONFIG var pointing to the cluster
  • To execute go test ./test/e2e/upgrade/service/disruption-test -v -timeout=60m from the checked-out repo above

Note: this test takes approximately 30 minutes to run and will fail if there is any disruption to the service. We are typically seeing between 3 and 6 seconds of disruption, whereas we see 0 seconds with the equivalent ETP Local service.

The test runs through each node in the cluster, drains the instance, reboots it, then uncordons it before moving on to the next instance.

Anything else we need to know?:

I don't think you necessarily need to reproduce this, as we can discuss it logically.

Since the service and kube-proxy handle the load balancing within the cluster, health checking the application itself does not help Azure decide whether to route traffic to the instance. Instead, we need to ascertain whether the instance is able to route traffic to the backend. That depends on whatever is implementing the node ports, which in most cases is kube-proxy.

Based on that, we should logically be checking kube-proxy's health, and not the node port itself.

In fact, GCP came to the same conclusion: for an ETP Cluster service (internal and external), they check port 10256 with path /healthz by default, which is exposed by kube-proxy.
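
For reference, that endpoint is controlled by kube-proxy's healthzBindAddress setting; a minimal KubeProxyConfiguration excerpt showing the upstream default (nothing here is specific to this cluster):

apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
# Serves /healthz on every node; this is the port and path the GCP default
# probe (and the proposal below) targets. 0.0.0.0:10256 is the upstream default.
healthzBindAddress: "0.0.0.0:10256"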

I've been testing a patch (very much WIP right now) that moves Azure's default ETP Cluster check to match GCP's, and in some testing with the above tests I have achieved 0s of disruption on Azure as well. Because the Azure LB can now see when the instance is about to go unhealthy, it can remove the instance from service earlier, meaning fewer dropped connections.

I would like to propose that we make that patch a feature of the Azure CCM and make sure that, by default, when no other health probe configuration is specified, Azure ETP Cluster services health check port 10256 with path /healthz instead of their current behaviour.

Some other notes:

  • I've been testing with ovn-kubernetes rather than kube-proxy. I'm working with the team there, who have made ovn-k behave in the same way as kube-proxy for my testing, so the results should be the same.
  • We also tried this with a patch that prevents Nodes from being removed from the load balancer when they are unready according to their status. We believe this is causing separate issues and I will raise a separate issue for it. Without the improved health checks, we were seeing 7-15s of disruption with this patch; with the improved health checks, we saw 0s.
  • @damdo and @ricky-rav have been helping me to put together the various patches we needed to get to this point.
  • I was using the kubelet's graceful node shutdown feature to test this (see the configuration sketch after this list). Without it, disruption is around 4s while not removing instances from the load balancer, which is still better than the 7-15 seconds we were seeing.
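
For reference, a minimal sketch of the kubelet configuration fields that enable the graceful node shutdown feature mentioned above; the durations shown are illustrative, not the values used in these tests:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Gives pods time to terminate gracefully when the node itself shuts down.
shutdownGracePeriod: 60s               # illustrative value
shutdownGracePeriodCriticalPods: 20s   # illustrative value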

Environment:

  • Kubernetes version (use kubectl version): 1.26.2
  • Cloud provider or hardware configuration: Azure
  • OS (e.g: cat /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools: OpenShift 4.13 pre-release
  • Network plugin and version (if this is a network-related bug): ovn-kubernetes (OpenShift 4.13 pre-release)
  • Others:
@JoelSpeed
Contributor Author

I think this KEP, which talks about graceful draining of connections, might be related. I wonder if the Azure maintainers have seen it?

@MartinForReal
Contributor

The following annotation has been added to change the probe port number for a service. FYI:
service.beta.kubernetes.io/port_{port}_health-probe_port
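
For illustration, assuming a Service exposing port 80, that annotation would be applied like this (whether an arbitrary target port is accepted is discussed further down the thread):

metadata:
  annotations:
    # Redirects the Azure LB health probe for service port 80 to port 10256.
    service.beta.kubernetes.io/port_80_health-probe_port: "10256"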

@JoelSpeed
Contributor Author

Hi @MartinForReal, thanks for the suggestion. Have you confirmed that this actually works with the Azure cloud provider?

I'm on PTO right now so don't have a laptop handy, but I did go through the code pretty thoroughly before submitting this issue and came to the conclusion that there was no way to override the port. I saw other annotations of a similar format, but this one didn't appear to be supported in Azure.

@MartinForReal
Contributor

I think it is added.

probePort, err := consts.GetHealthProbeConfigOfPortFromK8sSvcAnnotation(serviceManifest.Annotations, port.Port, consts.HealthProbeParamsPort, func(s *string) error {

@JoelSpeed
Contributor Author

Reviewing that, it looks like we need to have the port in the service for it to be used as a health check, right?

return fmt.Errorf("port %s not found in service", *s)

So in that case, you'd expect to add a dummy port 10256 to the service, and then set service.beta.kubernetes.io/port_10256_health-probe_only, have I understood correctly?

@JoelSpeed
Contributor Author

JoelSpeed commented Mar 29, 2023

I've tried the above today with no luck. If I specify a port annotation to redirect the health check port on my service to port 10256, it errors because there is no port 10256 within the service. If I were to create such a port, that doesn't actually help me because I can't force its node port to be 10256.

The service I was trying is as below:

apiVersion: v1
kind: Service
metadata:
  name: test-service
  annotations:
    service.beta.kubernetes.io/port_80_health-probe_port: "10256"
    service.beta.kubernetes.io/port_10256_no_probe_rule: "true"
    service.beta.kubernetes.io/port_10256_no_lb_rule: "true"
spec:
  selector:
    k8s-app: test
  ports:
  - name: app
    protocol: TCP
    port: 80
  - name: healthz
    protocol: TCP
    port: 10256
  type: LoadBalancer
  externalTrafficPolicy: Cluster

So how can I get the load balancer health probe port to be an arbitrary port that isn't within the bounds of the service?

@JoelSpeed
Contributor Author

Cc @alexanderConstantinescu. I know you've been working on improving disruption to services; could you check out this issue and see if you agree, or if there's anything more you can add?

bertinatto pushed a commit to bertinatto/kubernetes that referenced this issue May 6, 2023
…rs from lb when unready

workaround to mitigate issue: kubernetes-sigs/cloud-provider-azure#3500
bug: https://issues.redhat.com/browse/OCPBUGS-7359

UPSTREAM: <carry>: legacy-cloud-providers: azure: use kube-proxy based health probes by default

See
issue: kubernetes-sigs/cloud-provider-azure#3499
bug: https://issues.redhat.com/browse/OCPBUGS-7359
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 23, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Feb 22, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot closed this as not planned Mar 23, 2024