Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

VM Instance Initialization: Driver.InitializeMachine #898

Merged
merged 10 commits into from
Apr 3, 2024
2 changes: 1 addition & 1 deletion docs/development/cp_support_new.md
Original file line number Diff line number Diff line change
Expand Up @@ -66,7 +66,7 @@ The contract between the Machine Controller Manager (MCM) and the Machine Contro
1. Fill in the methods described at `pkg/provider/core.go` to manage VMs on your cloud provider. Comments are provided above each method to help you fill them up with desired `REQUEST` and `RESPONSE` parameters.
- A sample provider implementation for these methods can be found [here](https://github.com/gardener/machine-controller-manager-provider-aws/blob/master/pkg/aws/core.go).
- Fill in the required methods `CreateMachine()`, and `DeleteMachine()` methods.
- Optionally fill in methods like `GetMachineStatus()`, `ListMachines()`, and `GetVolumeIDs()`. You may choose to fill these once the working of the required methods seems to be working.
- Optionally fill in methods like `GetMachineStatus()`, `InitializeMachine`, `ListMachines()`, and `GetVolumeIDs()`. You may choose to fill these once the working of the required methods seems to be working.
- `GetVolumeIDs()` expects VolumeIDs to be decoded from the volumeSpec based on the cloud provider.
- There is also an OPTIONAL method `GenerateMachineClassForMigration()` that helps in migration of `{ProviderSpecific}MachineClass` to `MachineClass` CR (custom resource). This only makes sense if you have an existing implementation (in-tree) acting on different CRD types. You would like to migrate this. If not, you MUST return an error (machine error UNIMPLEMENTED) to avoid processing this step.
1. Perform validation of APIs that you have described and make it a part of your methods as required at each request.
Expand Down
59 changes: 59 additions & 0 deletions docs/development/machine_error_codes.md
Original file line number Diff line number Diff line change
Expand Up @@ -145,6 +145,64 @@ The MCM MUST implement the specified error recovery behavior when it encounters
The status `message` MUST contain a human readable description of error, if the status `code` is not `OK`.
This string MAY be surfaced by MCM to end users.

#### `InitializeMachine`

Provider can OPTIONALLY implement this driver call. Else should return a `UNIMPLEMENTED` status in error.
This interface method will be called by the MCM to initialize a new VM just after creation. This can be used to configure network configuration etc.

- This call requests the provider to initialize a newly created VM backing the machine-object.
- The `InitializeMachineResponse` returned by this method is expected to return
- `ProviderID` that uniquely identifys the VM at the provider. This is expected to match with the `node.Spec.ProviderID` on the node object.
- `NodeName` that is the expected name of the machine when it joins the cluster. It must match with the node name.

```protobuf
unmarshall marked this conversation as resolved.
Show resolved Hide resolved
// InitializeMachine call is responsible for VM initialization on the provider.
InitializeMachine(context.Context, *InitializeMachineRequest) (*InitializeMachineResponse, error)

// InitializeMachineRequest encapsulates params for the VM Initialization operation (Driver.InitializeMachine).
type InitializeMachineRequest struct {
// Machine object representing VM that must be initialized
Machine *v1alpha1.Machine

// MachineClass backing the machine object
MachineClass *v1alpha1.MachineClass

// Secret backing the machineClass object
Secret *corev1.Secret
}

// InitializeMachineResponse is the response for VM instance initialization (Driver.InitializeMachine).
type InitializeMachineResponse struct {
// ProviderID is the unique identification of the VM at the cloud provider.
// ProviderID typically matches with the node.Spec.ProviderID on the node object.
// Eg: gce://project-name/region/vm-ID
ProviderID string

// NodeName is the name of the node-object registered to kubernetes.
NodeName string
}

```

##### InitializeMachine Errors

If the provider is unable to complete the `InitializeMachine` call successfully, it MUST return a non-ok machine code in the machine status.

If the conditions defined below are encountered, the provider MUST return the specified machine error code.
The MCM MUST implement the specified error recovery behavior when it encounters the machine error code.

| machine Code | Condition | Description | Recovery Behavior | Auto Retry Required |
|-----------|-----------|-------------|-------------------|------------|
| 0 OK | Successful | The call was successful in initializing a VM that matches supplied initialization request. The `InitializeMachineResponse` is returned with desired values | | N |
| 5 NOT_FOUND | Timeout | VM Instance for Machine isn't found at provider | Skip Initialization and Continue | N |
| 12 UNIMPLEMENTED | Not implemented | Unimplemented indicates operation is not implemented or not supported/enabled in this service. | Skip Initialization and continue | N |
| 13 INTERNAL | Major error | Means some invariants expected by underlying system has been broken. | Needs investigation and possible intervention to fix this | Y |
| 17 UNINITIALIZED | Failed Initialization| VM Instance could not be initializaed | Initialization is reattempted in next reconcile cycle | Y |
unmarshall marked this conversation as resolved.
Show resolved Hide resolved
Comment on lines +199 to +200
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

| 13 INTERNAL | Major error | Means some invariants expected by underlying system has been broken. | Needs investigation and possible intervention to fix this | Y |

In practice not many will make a distinction between uninitialized and internal - why would they since both errors are retryable. If there is a difference in the requeue time maybe but. Saying Needs investigation and possible intervention to fix this and still allowing retries does not make sense. I think we could live without the "uninitialized" error per the spec.

Copy link
Contributor Author

@elankath elankath Apr 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Konstantinos, We need the Uninitialized error code to disambiguate the specific case of an un-initialized machine. The Internal error code is more like a generic error code in the MCM presently.

Maybe you are right and we can remove codes.Internal from the response error codes for InitializeMachine to be more strict and not lenient.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But how should I, as a extension maintainer differentiate ? What is an example use-case ? Maybe it can be added in the description


The status `message` MUST contain a human readable description of error, if the status `code` is not `OK`.
This string MAY be surfaced by MCM to end users.


#### `DeleteMachine`

A Provider is REQUIRED to implement this driver call.
Expand Down Expand Up @@ -267,6 +325,7 @@ If the conditions defined below are encountered, the provider MUST return the sp
| 13 INTERNAL | Major error | Means some invariants expected by underlying system has been broken. If you see one of these errors, something is very broken. | Needs manual intervension to fix this | N |
| 14 UNAVAILABLE | Not Available | Unavailable indicates the service is currently unavailable. | Retry operation after sometime | Y |
| 16 UNAUTHENTICATED | Missing provider credentials | Request does not have valid authentication credentials for the operation | Fix the provider credentials | N |
| 17 UNINITIALIZED | Failed Initialization| VM Instance could not be initializaed | Initialization is reattempted in next reconcile cycle | N |
Copy link
Contributor

@kon-angelo kon-angelo Apr 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

initializaed

initialized


The status `message` MUST contain a human readable description of error, if the status `code` is not `OK`.
This string MAY be surfaced by MCM to end users.
Expand Down
34 changes: 33 additions & 1 deletion pkg/util/provider/driver/driver.go
Original file line number Diff line number Diff line change
Expand Up @@ -20,14 +20,23 @@ package driver
import (
"context"

"github.com/gardener/machine-controller-manager/pkg/apis/machine/v1alpha1"
corev1 "k8s.io/api/core/v1"

"github.com/gardener/machine-controller-manager/pkg/apis/machine/v1alpha1"
)

// Driver is the common interface for creation/deletion of the VMs over different cloud-providers.
type Driver interface {
// CreateMachine call is responsible for VM creation on the provider
CreateMachine(context.Context, *CreateMachineRequest) (*CreateMachineResponse, error)
// InitializeMachine call is responsible for VM initialization on the provider.
// This method should only be invoked as a post VM creation initialization to configure network configuration etc.
//
// In case of an error, this operation should return an error with one of the following status codes
// - codes.Unimplemented if the provider does not support VM instance initialization.
// - codes.Uninitialized initialization of VM instance failed due to errors
// - codes.NotFound if VM instance was not found.
InitializeMachine(context.Context, *InitializeMachineRequest) (*InitializeMachineResponse, error)
// DeleteMachine call is responsible for VM deletion/termination on the provider
DeleteMachine(context.Context, *DeleteMachineRequest) (*DeleteMachineResponse, error)
// GetMachineStatus call get's the status of the VM backing the machine object on the provider
Expand Down Expand Up @@ -64,6 +73,29 @@ type CreateMachineResponse struct {
LastKnownState string
}

// InitializeMachineRequest encapsulates params for the VM Initialization operation (Driver.InitializeMachine).
type InitializeMachineRequest struct {
// Machine object representing VM that must be initialized
Machine *v1alpha1.Machine

// MachineClass backing the machine object
MachineClass *v1alpha1.MachineClass

// Secret backing the machineClass object
Secret *corev1.Secret
}

// InitializeMachineResponse is the response for VM instance initialization (Driver.InitializeMachine).
type InitializeMachineResponse struct {
// ProviderID is the unique identification of the VM at the cloud provider.
// ProviderID typically matches with the node.Spec.ProviderID on the node object.
// Eg: gce://project-name/region/vm-ID
ProviderID string

// NodeName is the name of the node-object registered to kubernetes.
NodeName string
}

// DeleteMachineRequest is the delete request for VM deletion
type DeleteMachineRequest struct {
// Machine object from whom VM is to be deleted
Expand Down
22 changes: 21 additions & 1 deletion pkg/util/provider/driver/fake.go
Original file line number Diff line number Diff line change
Expand Up @@ -74,6 +74,27 @@ func (d *FakeDriver) CreateMachine(ctx context.Context, createMachineRequest *Cr
return nil, d.Err
}

// InitializeMachine makes a call to the driver to initialize the VM instance of machine.
func (d *FakeDriver) InitializeMachine(ctx context.Context, initMachineRequest *InitializeMachineRequest) (*InitializeMachineResponse, error) {
sErr, ok := status.FromError(d.Err)
if ok && sErr != nil {
switch sErr.Code() {
case codes.NotFound:
d.VMExists = false
return nil, d.Err
case codes.Unimplemented:
break
default:
return nil, d.Err
}
}
d.VMExists = true
return &InitializeMachineResponse{
ProviderID: d.ProviderID,
NodeName: d.NodeName,
}, d.Err
}

// DeleteMachine make a call to the driver to delete the machine.
func (d *FakeDriver) DeleteMachine(ctx context.Context, deleteMachineRequest *DeleteMachineRequest) (*DeleteMachineResponse, error) {
d.VMExists = false
Expand All @@ -92,7 +113,6 @@ func (d *FakeDriver) GetMachineStatus(ctx context.Context, getMachineStatusReque
case d.Err != nil:
return nil, d.Err
}

return &GetMachineStatusResponse{
ProviderID: d.ProviderID,
NodeName: d.NodeName,
Expand Down
2 changes: 2 additions & 0 deletions pkg/util/provider/machinecodes/codes/code_string.go
Original file line number Diff line number Diff line change
Expand Up @@ -62,6 +62,8 @@ func (c Code) String() string {
return "DataLoss"
case Unauthenticated:
return "Unauthenticated"
case Uninitialized:
unmarshall marked this conversation as resolved.
Show resolved Hide resolved
return "Uninitialized"
default:
return "Code(" + strconv.FormatInt(int64(c), 10) + ")"
}
Expand Down
6 changes: 6 additions & 0 deletions pkg/util/provider/machinecodes/codes/codes.go
Original file line number Diff line number Diff line change
Expand Up @@ -144,6 +144,11 @@ const (
// Unauthenticated indicates the request does not have valid
// authentication credentials for the operation.
Unauthenticated Code = 16

// Uninitialized indicates that the VM instance initialization was not performed.
// This is meant to be used by providers in implementation of
// [github.com/gardener/machine-controller-manager/pkg/util/provider/driver.Driver.GetMachineStatus]
Uninitialized Code = 17
)

var strToCode = map[string]Code{
Expand All @@ -164,6 +169,7 @@ var strToCode = map[string]Code{
"Unavailable": Unavailable,
"DataLoss": DataLoss,
"Unauthenticated": Unauthenticated,
"Uninitialized": Uninitialized,
}

// StringToCode coverts string into the Code.
Expand Down
71 changes: 68 additions & 3 deletions pkg/util/provider/machinecontroller/machine.go
Original file line number Diff line number Diff line change
Expand Up @@ -360,8 +360,9 @@ func (c *controller) triggerCreationFlow(ctx context.Context, createMachineReque
nodeName, providerID string

// Initializations
machine = createMachineRequest.Machine
machineName = createMachineRequest.Machine.Name
machine = createMachineRequest.Machine
machineName = createMachineRequest.Machine.Name
uninitializedMachine = false
)

// we should avoid mutating Secret, since it goes all the way into the Informer's store
Expand Down Expand Up @@ -390,6 +391,7 @@ func (c *controller) triggerCreationFlow(ctx context.Context, createMachineReque
klog.Errorf("Error occurred while decoding machine error for machine %q: %s", machine.Name, err)
return machineutils.MediumRetry, err
}
klog.Warningf("For machine %q, obtained VM error status as: %s", machineErr)

// Decoding machine error code
switch machineErr.Code() {
Expand All @@ -403,7 +405,7 @@ func (c *controller) triggerCreationFlow(ctx context.Context, createMachineReque
klog.V(2).Infof("The machine creation is triggered with timeout of %s", c.getEffectiveCreationTimeout(createMachineRequest.Machine).Duration)
createMachineResponse, err := c.driver.CreateMachine(ctx, createMachineRequest)
if err != nil {
// Create call returned an error.
// Create call returned an error
klog.Errorf("Error while creating machine %s: %s", machine.Name, err.Error())
return c.machineCreateErrorHandler(ctx, machine, createMachineResponse, err)
}
Expand Down Expand Up @@ -467,6 +469,7 @@ func (c *controller) triggerCreationFlow(ctx context.Context, createMachineReque
klog.V(2).Infof("Machine %q marked Failed as VM was referring to a stale node object", machine.Name)
return machineutils.ShortRetry, err
}
uninitializedMachine = true
} else {
// if node label present that means there must be a backing VM ,without need of GetMachineStatus() call
nodeName = machine.Labels[v1alpha1.NodeLabelKey]
Expand Down Expand Up @@ -496,6 +499,10 @@ func (c *controller) triggerCreationFlow(ctx context.Context, createMachineReque

return machineutils.ShortRetry, err

case codes.Uninitialized:
uninitializedMachine = true
klog.Infof("VM instance associated with machine %s was created but not initialized.", machine.Name)

default:
updateRetryPeriod, updateErr := c.machineStatusUpdate(
ctx,
Expand Down Expand Up @@ -526,6 +533,13 @@ func (c *controller) triggerCreationFlow(ctx context.Context, createMachineReque
providerID = getMachineStatusResponse.ProviderID
}

if uninitializedMachine {
retryPeriod, err := c.initializeMachine(ctx, createMachineRequest.Machine, createMachineRequest.MachineClass, createMachineRequest.Secret)
if err != nil {
return retryPeriod, err
}
}

_, machineNodeLabelPresent := createMachineRequest.Machine.Labels[v1alpha1.NodeLabelKey]
_, machinePriorityAnnotationPresent := createMachineRequest.Machine.Annotations[machineutils.MachinePriority]

Expand Down Expand Up @@ -585,6 +599,57 @@ func (c *controller) triggerCreationFlow(ctx context.Context, createMachineReque
return machineutils.LongRetry, nil
}

func (c *controller) initializeMachine(ctx context.Context, machine *v1alpha1.Machine, machineClass *v1alpha1.MachineClass, secret *corev1.Secret) (machineutils.RetryPeriod, error) {
req := &driver.InitializeMachineRequest{
Machine: machine,
MachineClass: machineClass,
Secret: secret,
}
klog.V(3).Infof("Initializing VM instance for Machine %q", machine.Name)
resp, err := c.driver.InitializeMachine(ctx, req)
if err != nil {
errStatus, ok := status.FromError(err)
if !ok {
klog.Errorf("Cannot decode Driver error for machine %q: %s. Unexpected behaviour as Driver errors are expected to be of type status.Status", machine.Name, err)
return machineutils.LongRetry, err
}
klog.Errorf("Error occurred while initializing VM instance for machine %q: %s", machine.Name, err)
switch errStatus.Code() {
case codes.Unimplemented:
klog.V(3).Infof("Provider does not support Driver.InitializeMachine - skipping VM instance initialization for %q.", machine.Name)
return 0, nil
case codes.NotFound:
rishabh-11 marked this conversation as resolved.
Show resolved Hide resolved
klog.Warningf("No VM instance found for machine %q. Skipping VM instance initialization.", machine.Name)
return 0, nil
}
updateRetryPeriod, updateErr := c.machineStatusUpdate(
ctx,
machine,
v1alpha1.LastOperation{
Description: fmt.Sprintf("Provider error: %s. %s", err.Error(), machineutils.InstanceInitialization),
ErrorCode: errStatus.Code().String(),
State: v1alpha1.MachineStateFailed,
unmarshall marked this conversation as resolved.
Show resolved Hide resolved
Type: v1alpha1.MachineOperationCreate,
LastUpdateTime: metav1.Now(),
},
v1alpha1.CurrentStatus{
Phase: c.getCreateFailurePhase(machine),
LastUpdateTime: metav1.Now(),
},
machine.Status.LastKnownState,
)

if updateErr != nil {
return updateRetryPeriod, updateErr
}

return machineutils.ShortRetry, err
}

klog.V(3).Infof("VM instance %q for machine %q was initialized", resp.ProviderID, machine.Name)
return 0, nil
}

func (c *controller) triggerDeletionFlow(ctx context.Context, deleteMachineRequest *driver.DeleteMachineRequest) (machineutils.RetryPeriod, error) {
var (
machine = deleteMachineRequest.Machine
Expand Down
Loading