
ws-manager: ensure it can reconcile orphaned PVCs without VolumeSnapshot is created #10531

Closed
Tracked by #7901
jenting opened this issue Jun 8, 2022 · 35 comments
Assignees
Labels
team: workspace Issue belongs to the Workspace team

Comments

@jenting
Contributor

jenting commented Jun 8, 2022

Is your feature request related to a problem? Please describe

Find a way to make an orphaned PVC happen without the corresponding VolumeSnapshot object.
For example, would any of the following scenarios leave an orphaned PVC behind without the corresponding VolumeSnapshot object?

  • node crashes
  • workspace pod dies
  • ws-manager pod dies
  • etc

Describe the behaviour you'd like

  • Find a way to make an orphaned PVC happen without the corresponding VolumeSnapshot object.
  • If that is possible, make sure the ws-manager can reconcile the orphaned PVC: create the VolumeSnapshot object, make sure the VolumeSnapshot object is ready to use, and remove the PVC after that.

Describe alternatives you've considered

If we can't find a way to make an orphaned PVC happen, this issue is not needed.

Additional context

The workspace Pod has the finalizer to protect the pod itself. The finalizer is added when the workspace pod enters the Running state.

err = m.modifyFinalizer(ctx, workspaceID, gitpodFinalizerName, true)

The finalizer is removed once the backup/disposal of the workspace completes.

The PVC has a finalizer to protect it; the PVC would be removed only when the corresponding Pod is gone.
Therefore, in general, the Pod is removed first and then the PVC.

So an orphaned PVC can only happen when the workspace Pod is removed with its finalizer stripped before the VolumeSnapshot object is created.
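To make the finalizer mechanics concrete, here is a minimal client-go sketch of adding/removing a finalizer on the workspace pod. The helper name ensureFinalizer and the plain Update call are illustrative assumptions; the actual ws-manager modifyFinalizer may patch differently and handle conflicts.

```go
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

const gitpodFinalizerName = "gitpod.io/finalizer"

// ensureFinalizer adds or removes the gitpod finalizer on a workspace pod.
// Hypothetical helper; the real ws-manager modifyFinalizer may differ.
func ensureFinalizer(ctx context.Context, client kubernetes.Interface, namespace, podName string, add bool) error {
	pod, err := client.CoreV1().Pods(namespace).Get(ctx, podName, metav1.GetOptions{})
	if err != nil {
		return err
	}

	finalizers := make([]string, 0, len(pod.Finalizers)+1)
	found := false
	for _, f := range pod.Finalizers {
		if f == gitpodFinalizerName {
			found = true
			if !add {
				continue // drop the finalizer when removing
			}
		}
		finalizers = append(finalizers, f)
	}
	if add && !found {
		finalizers = append(finalizers, gitpodFinalizerName)
	}
	pod.Finalizers = finalizers

	// A plain Update can hit conflicts if another writer races us; this is
	// one reason only a single component should manage the finalizer.
	_, err = client.CoreV1().Pods(namespace).Update(ctx, pod, metav1.UpdateOptions{})
	return err
}
```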

@jenting jenting added the team: workspace Issue belongs to the Workspace team label Jun 8, 2022
@jenting jenting self-assigned this Jun 8, 2022
@sagor999
Contributor

sagor999 commented Jun 8, 2022

@jenting we use the workspace pod to keep track of the PVC.
If the node crashes: the pod will be in a terminating state and wsman should recover the PVC just fine.
If the pod gets OOM-killed => it will be restarted and the same PVC will be used.

Basically, we use the workspace pod (which has a finalizer) to keep track of the PVC and ensure cleanup is done (convert the PVC into a volume snapshot).

I suspect we need to ensure that if wsman crashes, it can recover the work if it was in the process of converting a PVC into a volume snapshot.

Another edge case: if we find a PVC that is unmounted, meaning the pod was somehow deleted without the finalizer running. But that is a tricky case. What if the user restarted the workspace and already has a newer backup? We cannot just convert that orphaned PVC into a volume snapshot, as it might overwrite a newer snapshot. 🤔

Maybe there are other edge cases we need to handle.

@jenting
Contributor Author

jenting commented Jun 9, 2022

I suspect we need to ensure that if wsman crashes, it can recover the work if it was in the process of converting a PVC into a volume snapshot.

Some ideas come to mind that we need to tackle. Correct me if I'm wrong.

The node where ws-manager runs crashes/reboots, or the ws-manager Pod crashes: while ws-manager is restarting, the user stops the workspace. Then there is no ws-manager available to handle the request from the server to create the VolumeSnapshot resource.

For the above case, I can only think of running ws-manager in HA, or having the server retry once ws-manager is unreachable.

WDYT?

@jenting
Contributor Author

jenting commented Jun 9, 2022

@kylos101 could you please describe some scenarios in which the PVC could be orphaned? Thank you.

@kylos101
Contributor

kylos101 commented Jun 9, 2022

Hi @jenting, I defer to @sagor999 on helping build a list of scenarios (and which ones are deemed edge cases).

He shared some thoughts, and you did too, perhaps you two can sync up? Another option might be for you to make a brainstorm/picture to 'see' the use cases, or a table of use cases (ranking most likely to least likely).

Some considerations for this exercise:

  1. failures can happen with regular workspaces and prebuild workspaces
  2. different things can fail (nodes, pods, pvc, snapshots)
  3. they can fail at different times (before restore is done, after restore is done, before snapshot is done, after snapshot is done)
  4. for each scenario, how would you test manually, how expensive would it be to write an integration test?

@sagor999
Contributor

sagor999 commented Jun 9, 2022

The node where ws-manager runs crashes/reboots, or the ws-manager Pod crashes: while ws-manager is restarting, the user stops the workspace. Then there is no ws-manager available to handle the request from the server to create the VolumeSnapshot resource.

That is fine I think. The WS pod will be waiting in a terminating state until wsman is available to process it. (at least that is how it should work)

@kylos101
Contributor

@sagor999 what do you think for when the "cluster" is gone (yikes), but persistent disks remain for the prior PVCs (whose disks are now orphaned, and where there is no terminating pod)?

@sagor999
Contributor

that should not happen. It would mean we removed the finalizer from the WS pod without letting it go through the finalizer flow.
To make sure we do not delete such a cluster, we can add a check to ensure there are no unbound PVCs in the cluster prior to deleting it (same as we check for active workspaces in the cluster prior to deleting it). A sketch of such a check follows.
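A minimal sketch of that pre-deletion check, assuming plain client-go access and that "unbound" means a PVC not mounted by any pod; the function and its use are illustrative only, not existing ws-manager code.

```go
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// orphanedPVCs returns the names of PVCs in the namespace that are not
// referenced by any pod. Hypothetical pre-deletion check, not existing code.
func orphanedPVCs(ctx context.Context, client kubernetes.Interface, namespace string) ([]string, error) {
	pvcs, err := client.CoreV1().PersistentVolumeClaims(namespace).List(ctx, metav1.ListOptions{})
	if err != nil {
		return nil, err
	}
	pods, err := client.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{})
	if err != nil {
		return nil, err
	}

	// Collect every PVC name mounted by a pod.
	inUse := map[string]bool{}
	for _, pod := range pods.Items {
		for _, vol := range pod.Spec.Volumes {
			if vol.PersistentVolumeClaim != nil {
				inUse[vol.PersistentVolumeClaim.ClaimName] = true
			}
		}
	}

	// Any PVC not mounted by a pod is a candidate to block cluster deletion.
	var orphaned []string
	for _, pvc := range pvcs.Items {
		if !inUse[pvc.Name] {
			orphaned = append(orphaned, pvc.Name)
		}
	}
	return orphaned, nil
}
```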

@kylos101
Contributor

@sagor999 wonderful idea, I added it as a task in this issue, under Ops, in a new jobs section

@jenting
Contributor Author

jenting commented Jun 14, 2022

Here are the scenarios I have tested so far.

Case 1. During the workspace stop, restart the ws-manager Deployment ❌

  • VolumeSnapshot/VolumeSnapshotContent created ✅
  • PVC/Pod is still there ❌

Case 2. During the ws-manager restart, stop the workspace ❌

  • VolumeSnapshot/VolumeSnapshotContent created ✅
  • PVC/Pod is still there ❌

After that, I disabled the PVC feature flag and re-tested cases 1 & 2, and the workspace Pod stays in a Terminating state. Therefore, it's a bug/behavior even without PVC. (The workspace Pod is cleaned up after 1 hour.)

@jenting
Contributor Author

jenting commented Jun 20, 2022

Some ideas and thoughts on how to fix it.

While the workspace Pod is in the finalizer loop, we call APIs to perform the backup. The APIs are:

  • For the object-store backup: the ws-manager calls the ws-daemon's APIs.
  • For the PVC backup: the ws-manager calls the Kubernetes APIs to create the VolumeSnapshot object.

We need to make sure of the following:

  • The APIs are idempotent.
  • Only one component (the ws-manager) adds/removes the finalizer, to prevent update conflicts.
  • There is a way to know from the APIs whether the backup is in progress or finished, so the ws-manager knows when to remove the finalizer.
  • The ws-manager keeps retrying the finalizer loop if an error occurs within it, and we could consider having (see the sketch after this list)
    • A queue to store the items for retry.
    • A backoff retry mechanism so the retry loop won't hit rate limits (either the Kubernetes API or ws-daemon or other components).
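As a sketch of that queue-plus-backoff idea, client-go's workqueue rate limiters could be used roughly like this; the disposeWorkspace handler and the backoff bounds are placeholders, not the actual ws-manager implementation.

```go
package main

import (
	"time"

	"k8s.io/client-go/util/workqueue"
)

// runDisposalQueue retries workspace disposal with per-item exponential
// backoff. disposeWorkspace is a placeholder for the real backup logic and
// must be idempotent, since items can be processed more than once.
func runDisposalQueue(disposeWorkspace func(workspaceID string) error) {
	queue := workqueue.NewRateLimitingQueue(
		// Backoff from 1s up to 5m so retries cannot hammer the
		// Kubernetes API, ws-daemon, or other components.
		workqueue.NewItemExponentialFailureRateLimiter(time.Second, 5*time.Minute),
	)
	defer queue.ShutDown()

	for {
		item, shutdown := queue.Get()
		if shutdown {
			return
		}
		workspaceID := item.(string)

		if err := disposeWorkspace(workspaceID); err != nil {
			// Re-queue with backoff on any error in the finalizer loop.
			queue.AddRateLimited(workspaceID)
		} else {
			queue.Forget(workspaceID)
		}
		queue.Done(item)
	}
}
```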

@sagor999
Contributor

We should also handle the case where GCP is having an outage affecting snapshots. For example, what if we cannot create a snapshot for an extended amount of time? Should we keep accumulating workspaces in a terminating state while retrying snapshot creation?

@jenting
Contributor Author

jenting commented Jun 24, 2022

Related to #5846 (comment)

I feel this issue is now similar to #5846 if we only talk about backing up and restoring the PVC.

@kylos101
Contributor

@jenting I agree it is similar, but it is not the same. We're talking about the same action being done under two different scenarios:

  1. In a healthy cluster, ws-manager should be able to reconcile an orphaned PVC (one which lacks a pod) and take the snapshot without human operator intervention; this is the purpose of ws-manager: ensure it can reconcile orphaned PVCs without VolumeSnapshot is created #10531, which should be done as part of this epic, I think.
  2. In an unhealthy cluster, ws-manager cannot reconcile the orphaned PVC; as human operators, we need the ability to snapshot the PVC, which is the purpose of Add 'out-of-band' workspace backup #5846. This can be done (I think) after the epic.

@jenting
Contributor Author

jenting commented Jul 6, 2022

I simulated the node being removed in workspace-preview; the corresponding workspace Pod was removed as well, and the orphaned PVC appeared.

@jenting
Contributor Author

jenting commented Jul 7, 2022

Since the orphaned PVC appears if the node is evicted by the autoscaler, I'd like to propose the Kubernetes controller pattern to reconcile the PVC, which means having a persistent_volume_claim_controller that reconciles the PVC. The steps would be:

  • check whether the PVC is an orphaned PVC
  • If not, do nothing. If yes,
    • create the corresponding VolumeSnapshot object to take a snapshot of the PVC
    • wait until the VolumeSnapshot becomes Ready
    • delete the orphaned PVC

@sagor999 @aledbf, are there any other approaches to take a snapshot of an orphaned PVC other than the Kubernetes controller pattern? (A rough sketch of the controller approach follows below.)
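To make the controller proposal concrete, here is a rough controller-runtime sketch. The assumption that the workspace pod, PVC, and VolumeSnapshot share the same name, the omitted VolumeSnapshotClass, and the missing notification to the webapp server are all simplifications; this is not the actual persistent_volume_claim_controller.

```go
package main

import (
	"context"

	volumesnapshotv1 "github.com/kubernetes-csi/external-snapshotter/client/v4/apis/volumesnapshot/v1"
	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// PVCReconciler is a hypothetical controller that snapshots and removes
// orphaned workspace PVCs; it is a sketch, not the actual ws-manager code.
type PVCReconciler struct {
	client.Client
}

func (r *PVCReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var pvc corev1.PersistentVolumeClaim
	if err := r.Get(ctx, req.NamespacedName, &pvc); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// Step 1: is the PVC orphaned? Here the workspace pod is assumed to
	// share the PVC's name; the real check would be more involved.
	var pod corev1.Pod
	err := r.Get(ctx, req.NamespacedName, &pod)
	if err == nil {
		return ctrl.Result{}, nil // pod still exists, nothing to do
	}
	if !apierrors.IsNotFound(err) {
		return ctrl.Result{}, err
	}

	// Step 2: create the corresponding VolumeSnapshot (idempotent).
	snapshot := volumesnapshotv1.VolumeSnapshot{
		ObjectMeta: metav1.ObjectMeta{Name: pvc.Name, Namespace: pvc.Namespace},
		Spec: volumesnapshotv1.VolumeSnapshotSpec{
			Source: volumesnapshotv1.VolumeSnapshotSource{
				PersistentVolumeClaimName: &pvc.Name,
			},
		},
	}
	if err := r.Create(ctx, &snapshot); err != nil && !apierrors.IsAlreadyExists(err) {
		return ctrl.Result{}, err
	}

	// Step 3: wait (via requeue) until the VolumeSnapshot is ready to use.
	var current volumesnapshotv1.VolumeSnapshot
	if err := r.Get(ctx, req.NamespacedName, &current); err != nil {
		return ctrl.Result{}, err
	}
	if current.Status == nil || current.Status.ReadyToUse == nil || !*current.Status.ReadyToUse {
		return ctrl.Result{Requeue: true}, nil
	}

	// Step 4: the snapshot is ready, delete the orphaned PVC.
	return ctrl.Result{}, r.Delete(ctx, &pvc)
}
```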

@sagor999
Contributor

sagor999 commented Jul 7, 2022

hm. if we take a volume snapshot, we also need to tell the webapp server about it. and there is also the question of potentially overwriting a newer volume snapshot for that workspace (for example: the workspace failed, the user opened the workspace with the previous snapshot, made some changes and closed it; if we then create a volume snapshot out of band here, we will overwrite the newer data).

@sagor999
Contributor

sagor999 commented Jul 7, 2022

you said you removed the node and pvc became orphaned.
was it because ws manager was also on that node? if yes, can you try this again with ws manager being on a different node.

@jenting
Contributor Author

jenting commented Jul 7, 2022

hm. if we take a volume snapshot, we also need to tell the webapp server about it. and there is also the question of potentially overwriting a newer volume snapshot for that workspace (for example: the workspace failed, the user opened the workspace with the previous snapshot, made some changes and closed it; if we then create a volume snapshot out of band here, we will overwrite the newer data).

Let me check the webapp server part, thanks.

I think the workspace-failed case != the orphaned-PVC case. So basically, we would not trigger taking a volume snapshot.

was it because ws manager was also on that node? if yes, can you try this again with ws manager being on a different node.

No, the ws-manager is on a different node.

@kylos101
Contributor

kylos101 commented Jul 8, 2022

Thanks for sharing your thoughts, @jenting and asking @sagor999 and @aledbf to provide input. 🤝 Have a nice weekend! 👋

@sagor999
Contributor

sagor999 commented Jul 8, 2022

No, the ws-manager is on a different node.

I am curious then how and why we lost the pod and did not finish the PVC backup process.
The workspace pod should be stuck in terminating until ws-manager finishes the backup, and only then would it remove the finalizer.
Maybe we overlooked something in the cleanup logic that allows the pod to be deleted without finishing the backup? 🤔

@jenting
Contributor Author

jenting commented Jul 11, 2022

Maybe we overlooked something in the cleanup logic that allows the pod to be deleted without finishing the backup? 🤔

Makes sense. I will check the autoscaler scale-down code logic to see if it matches the way I tested.

My testing is a bit hacky: I remove the worker node from workspace-preview with gcloud beta compute instances delete <vm-id> or systemctl stop k3s-agent, and then kubectl delete node <worker-node>.

@jenting
Contributor Author

jenting commented Jul 11, 2022

According to the autoscaler doc and the PR, we use the annotation
"cluster-autoscaler.kubernetes.io/safe-to-evict": "false" to prevent the autoscaler scaling down the node.

Therefore, the node shouldn't be removed by the autoscaler if there are still workspace Pods running on top of it. Extra doc on how the autoscaler scales down a node.
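For reference, a minimal sketch of what that looks like on the pod object; only the annotation key/value comes from the autoscaler docs, the surrounding pod definition is illustrative.

```go
package main

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Illustrative only: a workspace pod marked as not safe to evict, so the
// cluster autoscaler will not scale down the node it is running on.
var workspacePod = corev1.Pod{
	ObjectMeta: metav1.ObjectMeta{
		Name: "ws-example",
		Annotations: map[string]string{
			"cluster-autoscaler.kubernetes.io/safe-to-evict": "false",
		},
	},
}
```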

@sagor999
Contributor

My testing is a bit hacky: I remove the worker node from workspace-preview

that is a good test, as I think in this case we still should be able to back up the PVC correctly.
The Pod should be stuck in a terminating state (because of the finalizer) even if the node was deleted. Or am I wrong here?

@jenting
Contributor Author

jenting commented Jul 12, 2022

The workspace pod is in the Terminating state, but the status.containerStatuses.state is running.

  ...
  deletionTimestamp: "2022-07-12T06:07:13Z"
  finalizers:
  - gitpod.io/finalizer
  ...
  containerStatuses:
  - containerID: containerd://f9a34830e71beac227d7eb38d527ead553a3bf39a847c29b62d7363dbf4aefde
    image: reg.g37a8ca0daf64f13cc37955.workspace-preview.gitpod-io-dev.com:20000/remote/a89526c6-0444-4ea7-a24f-9ff1dba23d98:latest
    imageID: reg.g37a8ca0daf64f13cc37955.workspace-preview.gitpod-io-dev.com:20000/remote/a89526c6-0444-4ea7-a24f-9ff1dba23d98@sha256:81d3aa67984a54e1c7b96b3212270096a79ae1b388cde193be7f5e52f1f669bf
    lastState: {}
    name: workspace
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2022-07-12T06:04:28Z"

So, the code is not executed. No VolumeSnapshot object is created even with the PVC feature flag enabled.

Manually creating the VolumeSnapshot object (while the Pod is in the Terminating state) works: the VolumeSnapshot + VolumeSnapshotContent are created successfully.

@sagor999
Contributor

oh. that is quite an interesting state.
can you check the ws-daemon and ws-manager logs to see why it did not send the terminate state to the pod? 🤔

@jenting
Contributor Author

jenting commented Jul 13, 2022

the code is not executed

From my observation, the code is not executed. I created a new issue for this case (it is not the orphaned PVC case).

@jenting
Contributor Author

jenting commented Jul 13, 2022

Okay, so my next thing is to find a way to make the orphaned PVC happen.

@jenting
Contributor Author

jenting commented Jul 18, 2022

👋 Hey @kylos101 as we discussed, we tried to simulate a scenario where the admin or on-call engineer accidentally deletes the workspace Pod and removes the finalizer, to see if we can make the orphaned PVC happen.

I have tried the below three cases:

  • Directly delete the workspace Pod: the VolumeSnapshot is created and the Pod/PVC are removed after the VolumeSnapshot is ready to use.
  • Directly delete the workspace Pod and remove the finalizer before the VolumeSnapshot is ready: the VolumeSnapshot is created, and the Pod is removed as well. The orphaned PVC appears; however, the VolumeSnapshot is ready to use. The good news is the volume snapshot is ready to use; the problem is that the workspace status in the DB stays in stopping.
  • Make the underlying node NotReady (SSH to the worker node and stop the k3s-agent): the Pod goes into a Terminating state after 5 minutes. After that, the VolumeSnapshot is created. But the workspace Pod will be removed after the underlying node is removed from the k8s cluster with kubectl delete node <node-name>.

For case 2, the orphaned PVC is created; however, the VolumeSnapshot is created as well (and is ready to use).

I can't think of other cases where the orphaned PVC is created but the VolumeSnapshot is not.

@kylos101
Contributor

Regarding scenario 2 (delete pod and remove pod finalizer), it sounds like:

  1. The PVC exists, but the pod is gone
  2. The snapshot is created
  3. The workspace is stuck in "stopping" state in Gitpod

Is that right?

@jenting
Contributor Author

jenting commented Jul 18, 2022

Regarding scenario 2 (delete pod and remove pod finalizer), it sounds like:

  1. The PVC exists, but the pod is gone

  2. The snapshot is created

  3. The workspace is stuck in "stopping" state in Gitpod

Is that right?

🎯 Correct! The snapshot is ready to use.

@kylos101
Contributor

Interesting. So we have a path where we do the right thing...except tell server to set the workspace status to Stopped. 🤔

Does this seem like something that would be doable to intercept/catch/respond to? I ask because, it would be ideal if the Gitpod workspace were in a final state, so that the user could access and use the data.

@jenting
Contributor Author

jenting commented Jul 22, 2022

Interesting. So we have a path where we do the right thing...except tell server to set the workspace status to Stopped. 🤔

Does this seem like something that would be doable to intercept/catch/respond to? I ask because, it would be ideal if the Gitpod workspace were in a final state, so that the user could access and use the data.

Okay, the problem is that we depend on the workspace Pod annotation gitpod.io/disposalStatus to determine when the workspace state changes from STOPPING to STOPPED.

result.Phase = api.WorkspacePhase_STOPPED

In this scenario, the workspace pod's finalizer is forcibly removed before the volume snapshot is ready to use, so the ws-manager has no way to add/update the gitpod.io/disposalStatus annotation on the workspace pod since the pod is already gone.

We need to think of a way to update the workspace status even if the workspace pod is gone; ref: code of the ws-manager subscriber. (A rough sketch of the annotation dependency follows.)
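A rough sketch of that dependency: only the annotation key gitpod.io/disposalStatus and the resulting api.WorkspacePhase_STOPPED come from the code above; the struct and field names here are assumptions, not the actual ws-manager types.

```go
package main

import (
	"encoding/json"

	corev1 "k8s.io/api/core/v1"
)

// disposalStatus is a hypothetical stand-in for the structure stored in the
// gitpod.io/disposalStatus annotation; the field names are assumptions.
type disposalStatus struct {
	BackupComplete bool `json:"backupComplete"`
}

// backupComplete illustrates why the transition to STOPPED cannot happen once
// the pod is gone: the annotation lives on the pod object itself.
func backupComplete(pod *corev1.Pod) bool {
	raw, ok := pod.Annotations["gitpod.io/disposalStatus"]
	if !ok {
		return false
	}
	var ds disposalStatus
	if err := json.Unmarshal([]byte(raw), &ds); err != nil {
		return false
	}
	// Only when this is true does ws-manager report api.WorkspacePhase_STOPPED.
	return ds.BackupComplete
}
```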

@atduarte
Contributor

atduarte commented Jul 22, 2022

@jenting given this issue pertains to many different scenarios, and is being discussed for almost 2 months, could you please keep the description of the issue up to date with a summary of the scenarios and conclusions? It will help others not involved in the conversation from the start to contribute 🎈

@jenting jenting changed the title ws-manager: ensure it can reconcile orphaned PVCs (if node crashes, pod dies, etc) ws-manager: ensure it can reconcile orphaned PVCs without VolumeSnapshot is created Jul 22, 2022
@jenting
Contributor Author

jenting commented Jul 27, 2022

Definition: An orphaned PVC means the workspace pod is gone and the PVC still has the user's content, but the corresponding VolumeSnapshot object was not created by the ws-manager.

Context: The PVC has a finalizer to protect it, and the PVC is removed only when the corresponding Pod is gone.
Therefore, the Pod is removed first, and then the PVC is removed.

Scenarios we tested: We have tested lots of different scenarios so far to try to make an orphaned PVC happen. Sadly, we can't reproduce it. Luckily, we found other issues while testing the different scenarios.

  • Directly delete the workspace Pod: No orphaned PVC happens ✅
  • Make the underlying node NotReady: No orphaned PVC happens ✅
  • Directly delete the workspace Pod and remove the finalizer before the VolumeSnapshot is ready: No orphaned PVC happens ✅
    • But the workspace keeps Stopping. Note that it's a rare case because, in general, the workspace pod finalizer should not be removed before the VolumeSnapshot is ready to use. Also, it would take a lot of time to fix due to the current code design (we rely on the gitpod.io/disposalStatus annotation on the workspace pod to determine that the workspace backup is complete, move the workspace state to Stopped, and then onPodEvent notifies the WebApp through the ws-manager-bridge). Therefore, I recommend we create another issue to tackle it.

Therefore, since so far we can't find a way to make an orphaned PVC happen (no workspace pod and no VolumeSnapshot object), I think we are okay to close this issue.

@kylos101 @atduarte @sagor999 Is that okay with you?

@kylos101
Contributor

@jenting thank you for the detailed breakdown! Given this, I agree, it makes sense to stop on this effort. As such, I will close this issue.
