
ws-manager: ensure it can reconcile orphaned PVCs without VolumeSnapshot is created #10531

Closed
Tracked by #7901
jenting opened this issue Jun 8, 2022 · 35 comments
Assignees
Labels
team: workspace Issue belongs to the Workspace team

Comments

@jenting
Contributor

jenting commented Jun 8, 2022

Is your feature request related to a problem? Please describe

Find a way to make an orphaned PVC happen without the corresponding VolumeSnapshot object.
For example, would any of the following scenarios leave an orphaned PVC behind without the corresponding VolumeSnapshot object?

  • node crashes
  • workspace pod dies
  • ws-manager pod dies
  • etc

Describe the behaviour you'd like

  • Find a way to make an orphaned PVC happen without the corresponding VolumeSnapshot object.
  • If that is possible, make sure the ws-manager can reconcile the orphaned PVC: create the VolumeSnapshot object, make sure the VolumeSnapshot object is ready to use, and remove the PVC after that.

Describe alternatives you've considered

If we can't find a way to make an orphaned PVC happen, this issue is not needed.

Additional context

The workspace Pod has the finalizer to protect the pod itself. The finalizer is added when the workspace pod enters the Running state.

err = m.modifyFinalizer(ctx, workspaceID, gitpodFinalizerName, true)

The finalizer is removed once the backup/disposal of the workspace completes.

The PVC has a finalizer to protect it; the PVC would be removed only when the corresponding Pod is gone.
Therefore, in general, the Pod is removed first and then the PVC.

So an orphaned PVC can only happen when the workspace Pod is removed with its finalizer stripped before the VolumeSnapshot object is created.
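To make the finalizer mechanics concrete, here is a minimal client-go sketch of adding/removing a finalizer on the workspace pod. The helper name ensureFinalizer and the plain Update call are illustrative assumptions; the actual ws-manager modifyFinalizer may patch differently and handle conflicts.

```go
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

const gitpodFinalizerName = "gitpod.io/finalizer"

// ensureFinalizer adds or removes the gitpod finalizer on a workspace pod.
// Hypothetical helper; the real ws-manager modifyFinalizer may differ.
func ensureFinalizer(ctx context.Context, client kubernetes.Interface, namespace, podName string, add bool) error {
	pod, err := client.CoreV1().Pods(namespace).Get(ctx, podName, metav1.GetOptions{})
	if err != nil {
		return err
	}

	finalizers := make([]string, 0, len(pod.Finalizers)+1)
	found := false
	for _, f := range pod.Finalizers {
		if f == gitpodFinalizerName {
			found = true
			if !add {
				continue // drop the finalizer when removing
			}
		}
		finalizers = append(finalizers, f)
	}
	if add && !found {
		finalizers = append(finalizers, gitpodFinalizerName)
	}
	pod.Finalizers = finalizers

	// A plain Update can hit conflicts if another writer races us; this is
	// one reason only a single component should manage the finalizer.
	_, err = client.CoreV1().Pods(namespace).Update(ctx, pod, metav1.UpdateOptions{})
	return err
}
```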

@jenting jenting added the team: workspace Issue belongs to the Workspace team label Jun 8, 2022
@jenting jenting self-assigned this Jun 8, 2022
@sagor999
Contributor

sagor999 commented Jun 8, 2022

@jenting we use the workspace pod to keep track of the PVC.
If the node crashes: the pod will be in a terminating state and wsman should recover the PVC just fine.
If the pod gets OOM-killed => it will be restarted and the same PVC will be used.

Basically, we use the workspace pod (which has a finalizer) to keep track of the PVC and ensure cleanup is done (convert the PVC into a volume snapshot).

I suspect we need to ensure that if wsman crashes, it can recover the work if it was in the process of converting a PVC into a volume snapshot.

Another edge case: if we find a PVC that is unmounted, meaning the pod was somehow deleted without the finalizer running. But that is a tricky case. What if the user restarted the workspace and already has a newer backup? We cannot just convert that orphaned PVC into a volume snapshot, as it might overwrite a newer snapshot. 🤔

Maybe there are other edge cases we need to handle.

@jenting
Contributor Author

jenting commented Jun 9, 2022

I suspect we need to ensure that if wsman crashes, it can recover the work if it was in the process of converting a PVC into a volume snapshot.

Some ideas come to mind that we need to tackle. Correct me if I'm wrong.

The node where ws-manager runs crashes/reboots, or the ws-manager Pod crashes: while ws-manager is restarting, the user stops the workspace. Then there is no ws-manager available to handle the request from the server to create the VolumeSnapshot resource.

For the above case, I can only think of running ws-manager in HA, or having the server retry once ws-manager is unreachable.

WDYT?

@jenting
Contributor Author

jenting commented Jun 9, 2022

@kylos101 could you please describe some scenarios in which the PVC could be orphaned? Thank you.

@kylos101
Contributor

kylos101 commented Jun 9, 2022

Hi @jenting, I defer to @sagor999 on helping build a list of scenarios (and which ones are deemed edge cases).

He shared some thoughts, and you did too, perhaps you two can sync up? Another option might be for you to make a brainstorm/picture to 'see' the use cases, or a table of use cases (ranking most likely to least likely).

Some considerations for this exercise:

  1. failures can happen with regular workspaces and prebuild workspaces
  2. different things can fail (nodes, pods, pvc, snapshots)
  3. they can fail at different times (before restore is done, after restore is done, before snapshot is done, after snapshot is done)
  4. for each scenario, how would you test manually, how expensive would it be to write an integration test?

@sagor999
Contributor

sagor999 commented Jun 9, 2022

The node where ws-manager runs crashes/reboots, or the ws-manager Pod crashes: while ws-manager is restarting, the user stops the workspace. Then there is no ws-manager available to handle the request from the server to create the VolumeSnapshot resource.

That is fine I think. The WS pod will be waiting in a terminating state until wsman is available to process it. (at least that is how it should work)

@kylos101
Contributor

@sagor999 what do you think for when the "cluster" is gone (yikes), but persistent disks remain for the prior PVCs (whose disks are now orphaned, and where there is no terminating pod)?

@sagor999
Contributor

that should not happen. It would mean we removed the finalizer from the WS pod without letting it go through the finalizer flow.
To make sure we do not delete such a cluster, we can add a check to ensure there are no unbound PVCs in the cluster prior to deleting it (same as we check for active workspaces in the cluster prior to deleting it). A sketch of such a check follows.
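A minimal sketch of that pre-deletion check, assuming plain client-go access and that "unbound" means a PVC not mounted by any pod; the function and its use are illustrative only, not existing ws-manager code.

```go
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// orphanedPVCs returns the names of PVCs in the namespace that are not
// referenced by any pod. Hypothetical pre-deletion check, not existing code.
func orphanedPVCs(ctx context.Context, client kubernetes.Interface, namespace string) ([]string, error) {
	pvcs, err := client.CoreV1().PersistentVolumeClaims(namespace).List(ctx, metav1.ListOptions{})
	if err != nil {
		return nil, err
	}
	pods, err := client.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{})
	if err != nil {
		return nil, err
	}

	// Collect every PVC name mounted by a pod.
	inUse := map[string]bool{}
	for _, pod := range pods.Items {
		for _, vol := range pod.Spec.Volumes {
			if vol.PersistentVolumeClaim != nil {
				inUse[vol.PersistentVolumeClaim.ClaimName] = true
			}
		}
	}

	// Any PVC not mounted by a pod is a candidate to block cluster deletion.
	var orphaned []string
	for _, pvc := range pvcs.Items {
		if !inUse[pvc.Name] {
			orphaned = append(orphaned, pvc.Name)
		}
	}
	return orphaned, nil
}
```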

@kylos101
Contributor

@sagor999 wonderful idea, I added it as a task in this issue, under Ops, in a new jobs section

@jenting
Contributor Author

jenting commented Jun 14, 2022

Here are the scenarios I have tested so far.

Case 1. During the workspace stop, restart the ws-manager Deployment ❌

  • VolumeSnapshot/VolumeSnapshotContent created ✅
  • PVC/Pod is still there ❌

Case 2. During the ws-manager restart, stop the workspace ❌

  • VolumeSnapshot/VolumeSnapshotContent created ✅
  • PVC/Pod is still there ❌

After that, I disabled the PVC feature flag and re-tested cases 1 & 2, and the workspace Pod stays in a Terminating state. Therefore, it's a bug/behavior even without PVC. (The workspace Pod is cleaned up after 1 hour.)

@jenting
Contributor Author

jenting commented Jun 20, 2022

Some ideas and thoughts on how to fix it.

While the workspace Pod is in the finalizer loop, we call APIs to perform the backup. The APIs are:

  • For the object-store backup: the ws-manager calls the ws-daemon's APIs.
  • For the PVC backup: the ws-manager calls the Kubernetes APIs to create the VolumeSnapshot object.

We need to make sure of the following:

  • The APIs are idempotent.
  • Only one component (the ws-manager) adds/removes the finalizer, to prevent update conflicts.
  • There is a way to know from the APIs whether the backup is in progress or finished, so the ws-manager knows when to remove the finalizer.
  • The ws-manager keeps retrying the finalizer loop if an error occurs within it, and we could consider having (see the sketch after this list)
    • A queue to store the items for retry.
    • A backoff retry mechanism so the retry loop won't hit rate limits (either the Kubernetes API or ws-daemon or other components).
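As a sketch of that queue-plus-backoff idea, client-go's workqueue rate limiters could be used roughly like this; the disposeWorkspace handler and the backoff bounds are placeholders, not the actual ws-manager implementation.

```go
package main

import (
	"time"

	"k8s.io/client-go/util/workqueue"
)

// runDisposalQueue retries workspace disposal with per-item exponential
// backoff. disposeWorkspace is a placeholder for the real backup logic and
// must be idempotent, since items can be processed more than once.
func runDisposalQueue(disposeWorkspace func(workspaceID string) error) {
	queue := workqueue.NewRateLimitingQueue(
		// Backoff from 1s up to 5m so retries cannot hammer the
		// Kubernetes API, ws-daemon, or other components.
		workqueue.NewItemExponentialFailureRateLimiter(time.Second, 5*time.Minute),
	)
	defer queue.ShutDown()

	for {
		item, shutdown := queue.Get()
		if shutdown {
			return
		}
		workspaceID := item.(string)

		if err := disposeWorkspace(workspaceID); err != nil {
			// Re-queue with backoff on any error in the finalizer loop.
			queue.AddRateLimited(workspaceID)
		} else {
			queue.Forget(workspaceID)
		}
		queue.Done(item)
	}
}
```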

@sagor999
Contributor

We should also handle the case where GCP is having an outage affecting snapshots. For example, what if we cannot create a snapshot for an extended amount of time? Should we keep accumulating workspaces in a terminating state while retrying snapshot creation?

@jenting
Contributor Author

jenting commented Jun 24, 2022

Related to #5846 (comment)

I feel this issue is now similar to #5846 if we only talk about backing up and restoring the PVC.

@kylos101
Contributor

@jenting I agree it is similar, but it is not the same. We're talking about the same action being done under two different scenarios:

  1. In a healthy cluster, ws-manager should be able to reconcile an orphaned PVC (one which lacks a pod) and take the snapshot without human operator intervention; this is the purpose of ws-manager: ensure it can reconcile orphaned PVCs without VolumeSnapshot is created #10531, which should be done as part of this epic, I think.
  2. In an unhealthy cluster, ws-manager cannot reconcile the orphaned PVC; as human operators, we need the ability to snapshot the PVC, which is the purpose of Add 'out-of-band' workspace backup #5846. This can be done (I think) after the epic.

@jenting
Contributor Author

jenting commented Jul 6, 2022

I simulated the node being removed in workspace-preview; the corresponding workspace Pod was removed as well, and the orphaned PVC appeared.

@jenting
Contributor Author

jenting commented Jul 7, 2022

Since the orphaned PVC appears if the node is evicted by the autoscaler, I'd like to propose the Kubernetes controller pattern to reconcile the PVC, which means having a persistent_volume_claim_controller that reconciles the PVC. The steps would be:

  • check whether the PVC is an orphaned PVC
  • If not, do nothing. If yes,
    • create the corresponding VolumeSnapshot object to take a snapshot of the PVC
    • wait until the VolumeSnapshot becomes Ready
    • delete the orphaned PVC

@sagor999 @aledbf, are there any other approaches to take a snapshot of an orphaned PVC other than the Kubernetes controller pattern? (A rough sketch of the controller approach follows below.)
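To make the controller proposal concrete, here is a rough controller-runtime sketch. The assumption that the workspace pod, PVC, and VolumeSnapshot share the same name, the omitted VolumeSnapshotClass, and the missing notification to the webapp server are all simplifications; this is not the actual persistent_volume_claim_controller.

```go
package main

import (
	"context"

	volumesnapshotv1 "github.com/kubernetes-csi/external-snapshotter/client/v4/apis/volumesnapshot/v1"
	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// PVCReconciler is a hypothetical controller that snapshots and removes
// orphaned workspace PVCs; it is a sketch, not the actual ws-manager code.
type PVCReconciler struct {
	client.Client
}

func (r *PVCReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var pvc corev1.PersistentVolumeClaim
	if err := r.Get(ctx, req.NamespacedName, &pvc); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// Step 1: is the PVC orphaned? Here the workspace pod is assumed to
	// share the PVC's name; the real check would be more involved.
	var pod corev1.Pod
	err := r.Get(ctx, req.NamespacedName, &pod)
	if err == nil {
		return ctrl.Result{}, nil // pod still exists, nothing to do
	}
	if !apierrors.IsNotFound(err) {
		return ctrl.Result{}, err
	}

	// Step 2: create the corresponding VolumeSnapshot (idempotent).
	snapshot := volumesnapshotv1.VolumeSnapshot{
		ObjectMeta: metav1.ObjectMeta{Name: pvc.Name, Namespace: pvc.Namespace},
		Spec: volumesnapshotv1.VolumeSnapshotSpec{
			Source: volumesnapshotv1.VolumeSnapshotSource{
				PersistentVolumeClaimName: &pvc.Name,
			},
		},
	}
	if err := r.Create(ctx, &snapshot); err != nil && !apierrors.IsAlreadyExists(err) {
		return ctrl.Result{}, err
	}

	// Step 3: wait (via requeue) until the VolumeSnapshot is ready to use.
	var current volumesnapshotv1.VolumeSnapshot
	if err := r.Get(ctx, req.NamespacedName, &current); err != nil {
		return ctrl.Result{}, err
	}
	if current.Status == nil || current.Status.ReadyToUse == nil || !*current.Status.ReadyToUse {
		return ctrl.Result{Requeue: true}, nil
	}

	// Step 4: the snapshot is ready, delete the orphaned PVC.
	return ctrl.Result{}, r.Delete(ctx, &pvc)
}
```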

@sagor999
Contributor

sagor999 commented Jul 7, 2022

hm. if we take a volume snapshot, we also need to tell the webapp server about it. and there is also the question of potentially overwriting a newer volume snapshot for that workspace (for example: the workspace failed, the user opened the workspace with the previous snapshot, made some changes and closed it; if we then create a volume snapshot out of band here, we will overwrite the newer data).

@sagor999
Contributor

sagor999 commented Jul 7, 2022

you said you removed the node and pvc became orphaned.
was it because ws manager was also on that node? if yes, can you try this again with ws manager being on a different node.

@jenting
Contributor Author

jenting commented Jul 7, 2022

hm. if we take a volume snapshot, we also need to tell the webapp server about it. and there is also the question of potentially overwriting a newer volume snapshot for that workspace (for example: the workspace failed, the user opened the workspace with the previous snapshot, made some changes and closed it; if we then create a volume snapshot out of band here, we will overwrite the newer data).

Let me check the webapp server part, thanks.

I think the workspace-failed case != the orphaned-PVC case. So basically, we would not trigger taking a volume snapshot.

was it because ws manager was also on that node? if yes, can you try this again with ws manager being on a different node.

No, the ws-manager is on a different node.

@kylos101
Contributor

kylos101 commented Jul 8, 2022

Thanks for sharing your thoughts, @jenting and asking @sagor999 and @aledbf to provide input. 🤝 Have a nice weekend! 👋

@sagor999
Contributor

sagor999 commented Jul 8, 2022

No, the ws-manager is on a different node.

I am curious then how and why we lost the pod and did not finish the PVC backup process.
The workspace pod should be stuck in terminating until ws-manager finishes the backup, and only then would it remove the finalizer.
Maybe we overlooked something in the cleanup logic that allows the pod to be deleted without finishing the backup? 🤔

@jenting
Contributor Author

jenting commented Jul 11, 2022

Maybe we overlooked something in the cleanup logic that allows the pod to be deleted without finishing the backup? 🤔

Makes sense. I will check the autoscaler scale-down code logic to see if it matches the way I tested.

My testing is a bit hacky: I remove the worker node from workspace-preview with gcloud beta compute instances delete <vm-id> or systemctl stop k3s-agent, and then kubectl delete node <worker-node>.

@jenting
Contributor Author

jenting commented Jul 11, 2022

According to the autoscaler doc and the PR, we use the annotation
"cluster-autoscaler.kubernetes.io/safe-to-evict": "false" to prevent the autoscaler scaling down the node.

Therefore, the node shouldn't be removed by the autoscaler if there are still workspace Pods running on top of it. Extra doc on how the autoscaler scales down a node.
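For reference, a minimal sketch of what that looks like on the pod object; only the annotation key/value comes from the autoscaler docs, the surrounding pod definition is illustrative.

```go
package main

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// Illustrative only: a workspace pod marked as not safe to evict, so the
// cluster autoscaler will not scale down the node it is running on.
var workspacePod = corev1.Pod{
	ObjectMeta: metav1.ObjectMeta{
		Name: "ws-example",
		Annotations: map[string]string{
			"cluster-autoscaler.kubernetes.io/safe-to-evict": "false",
		},
	},
}
```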

@sagor999
Contributor

My testing is a bit hacky: I remove the worker node from workspace-preview

that is a good test, as I think in this case we still should be able to back up the PVC correctly.
The Pod should be stuck in a terminating state (because of the finalizer) even if the node was deleted. Or am I wrong here?

@jenting
Contributor Author

jenting commented Jul 12, 2022

The workspace pod is in the Terminating state, but the status.containerStatuses.state is running.

  ...
  deletionTimestamp: "2022-07-12T06:07:13Z"
  finalizers:
  - gitpod.io/finalizer
  ...
  containerStatuses:
  - containerID: containerd://f9a34830e71beac227d7eb38d527ead553a3bf39a847c29b62d7363dbf4aefde
    image: reg.g37a8ca0daf64f13cc37955.workspace-preview.gitpod-io-dev.com:20000/remote/a89526c6-0444-4ea7-a24f-9ff1dba23d98:latest
    imageID: reg.g37a8ca0daf64f13cc37955.workspace-preview.gitpod-io-dev.com:20000/remote/a89526c6-0444-4ea7-a24f-9ff1dba23d98@sha256:81d3aa67984a54e1c7b96b3212270096a79ae1b388cde193be7f5e52f1f669bf
    lastState: {}
    name: workspace
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2022-07-12T06:04:28Z"

So, the code is not executed. No VolumeSnapshot object is created even with the PVC feature flag enabled.

Manually creating the VolumeSnapshot object (while the Pod is in the Terminating state) works: the VolumeSnapshot + VolumeSnapshotContent are created successfully.

@sagor999
Contributor

oh. that is quite an interesting state.
can you check the ws-daemon and ws-manager logs to see why it did not send the terminate state to the pod? 🤔

@jenting
Contributor Author

jenting commented Jul 13, 2022

the code is not executed

From my observation, the code is not executed. I created a new issue for this case (it is not the orphaned PVC case).

@jenting
Contributor Author

jenting commented Jul 13, 2022

Okay, so my next thing is to find a way to make the orphaned PVC happen.

@jenting
Contributor Author

jenting commented Jul 18, 2022

👋 Hey @kylos101 as we discussed, we tried to simulate a scenario where the admin or on-call engineer accidentally deletes the workspace Pod and removes the finalizer, to see if we can make the orphaned PVC happen.

I have tried the below three cases:

  • Directly delete the workspace Pod: the VolumeSnapshot is created and the Pod/PVC are removed after the VolumeSnapshot is ready to use.
  • Directly delete the workspace Pod and remove the finalizer before the VolumeSnapshot is ready: the VolumeSnapshot is created, and the Pod is removed as well. The orphaned PVC appears; however, the VolumeSnapshot is ready to use. The good news is the volume snapshot is ready to use; the problem is that the workspace status in the DB stays in stopping.
  • Make the underlying node NotReady (SSH to the worker node and stop the k3s-agent): the Pod goes into a Terminating state after 5 minutes. After that, the VolumeSnapshot is created. But the workspace Pod will be removed after the underlying node is removed from the k8s cluster with kubectl delete node <node-name>.

For case 2, the orphaned PVC is created; however, the VolumeSnapshot is created as well (and is ready to use).

I can't think of other cases where the orphaned PVC is created but the VolumeSnapshot is not.

@kylos101
Contributor

Regarding scenario 2 (delete pod and remove pod finalizer), it sounds like:

  1. The PVC exists, but the pod is gone
  2. The snapshot is created
  3. The workspace is stuck in "stopping" state in Gitpod

Is that right?

@jenting
Contributor Author

jenting commented Jul 18, 2022

Regarding scenario 2 (delete pod and remove pod finalizer), it sounds like:

  1. The PVC exists, but the pod is gone

  2. The snapshot is created

  3. The workspace is stuck in "stopping" state in Gitpod

Is that right?

🎯 Correct! The snapshot is ready to use.

@kylos101
Contributor

Interesting. So we have a path where we do the right thing...except tell server to set the workspace status to Stopped. 🤔

Does this seem like something that would be doable to intercept/catch/respond to? I ask because, it would be ideal if the Gitpod workspace were in a final state, so that the user could access and use the data.

@jenting
Contributor Author

jenting commented Jul 22, 2022

Interesting. So we have a path where we do the right thing...except tell server to set the workspace status to Stopped. 🤔

Does this seem like something that would be doable to intercept/catch/respond to? I ask because, it would be ideal if the Gitpod workspace were in a final state, so that the user could access and use the data.

Okay, the problem is that we depend on the workspace Pod annotation gitpod.io/disposalStatus to determine when the workspace state changes from STOPPING to STOPPED.

result.Phase = api.WorkspacePhase_STOPPED

In this scenario, the workspace pod's finalizer is forcibly removed before the volume snapshot is ready to use, so the ws-manager has no way to add/update the gitpod.io/disposalStatus annotation on the workspace pod since the pod is already gone.

We need to think of a way to update the workspace status even if the workspace pod is gone; ref: code of the ws-manager subscriber. (A rough sketch of the annotation dependency follows.)
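A rough sketch of that dependency: only the annotation key gitpod.io/disposalStatus and the resulting api.WorkspacePhase_STOPPED come from the code above; the struct and field names here are assumptions, not the actual ws-manager types.

```go
package main

import (
	"encoding/json"

	corev1 "k8s.io/api/core/v1"
)

// disposalStatus is a hypothetical stand-in for the structure stored in the
// gitpod.io/disposalStatus annotation; the field names are assumptions.
type disposalStatus struct {
	BackupComplete bool `json:"backupComplete"`
}

// backupComplete illustrates why the transition to STOPPED cannot happen once
// the pod is gone: the annotation lives on the pod object itself.
func backupComplete(pod *corev1.Pod) bool {
	raw, ok := pod.Annotations["gitpod.io/disposalStatus"]
	if !ok {
		return false
	}
	var ds disposalStatus
	if err := json.Unmarshal([]byte(raw), &ds); err != nil {
		return false
	}
	// Only when this is true does ws-manager report api.WorkspacePhase_STOPPED.
	return ds.BackupComplete
}
```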

@atduarte
Contributor

atduarte commented Jul 22, 2022

@jenting given this issue pertains to many different scenarios, and is being discussed for almost 2 months, could you please keep the description of the issue up to date with a summary of the scenarios and conclusions? It will help others not involved in the conversation from the start to contribute 🎈

@jenting jenting changed the title ws-manager: ensure it can reconcile orphaned PVCs (if node crashes, pod dies, etc) ws-manager: ensure it can reconcile orphaned PVCs without VolumeSnapshot is created Jul 22, 2022
@jenting
Contributor Author

jenting commented Jul 27, 2022

Definition: An orphaned PVC means the workspace pod is gone and the PVC still has the user's content, but the corresponding VolumeSnapshot object was not created by the ws-manager.

Context: The PVC has a finalizer to protect it, and the PVC is removed only when the corresponding Pod is gone.
Therefore, the Pod is removed first, and then the PVC is removed.

Scenarios we tested: We have tested lots of different scenarios so far to try to make an orphaned PVC happen. Sadly, we can't reproduce it. Luckily, we found other issues while testing the different scenarios.

  • Directly delete the workspace Pod: No orphaned PVC happens ✅
  • Make the underlying node NotReady: No orphaned PVC happens ✅
  • Directly delete the workspace Pod and remove the finalizer before the VolumeSnapshot is ready: No orphaned PVC happens ✅
    • But the workspace keeps Stopping. Note that it's a rare case because, in general, the workspace pod finalizer should not be removed before the VolumeSnapshot is ready to use. Also, it would take a lot of time to fix due to the current code design (we rely on the gitpod.io/disposalStatus annotation on the workspace pod to determine that the workspace backup is complete, move the workspace state to Stopped, and then onPodEvent notifies the WebApp through the ws-manager-bridge). Therefore, I recommend we create another issue to tackle it.

Therefore, since so far we can't find a way to make an orphaned PVC happen (no workspace pod and no VolumeSnapshot object), I think we are okay to close this issue.

@kylos101 @atduarte @sagor999 Is that okay with you?

@kylos101
Contributor

@jenting thank you for the detailed breakdown! Given this, I agree, it makes sense to stop on this effort. As such, I will close this issue.
