This repository has been archived by the owner on Jun 23, 2020. It is now read-only.
This looks similar to #140, but with different repro steps.
We deploy a StatefulSet with 9 replicas, each attached to an OCI Block Storage persistent volume, in OCI PHX. Recently we found that 2 of the 9 replicas' persistent volumes were suddenly mounted as a read-only file system, causing those replicas to fail, and we couldn't fix the issue with a re-deployment. The mount script we use is as follows. Searching online, people suggest that "when kubelet is restarted, a volume will be detached while it is still mounted, causing file system corruption." We don't know exactly how this happened, but we suspect it occurred when we re-deployed the StatefulSet.
On the host that repro'd this issue, kubelet says something like:
Jun 24 04:17:51 <OKE-HOST-NAME> kubelet[24925]: E0624 04:17:51.541571 24925 kubelet_volumes.go:128] Orphaned pod "<POD-GUID>" found, but volume paths are still present on disk : There were a total of 1 errors similar to this. Turn up verbosity to see them.
Examining the mount info on the host shows the Block Storage volume mounted at the following three locations with RO rather than the expected RW mode:
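To confirm which locations are mounted read-only, one can scan the mount table for `ro` in the options field. A minimal sketch, assuming a POSIX shell on the host (the helper name `check_ro_mounts` is hypothetical; it reads /proc/mounts by default):

```shell
#!/bin/sh
# Print every mount point whose mount options include "ro" (read-only).
# An alternate mount-table file can be passed as the first argument.
check_ro_mounts() {
  mtab="${1:-/proc/mounts}"
  # In /proc/mounts, field 2 is the mount point and field 4 the
  # comma-separated option list; match "ro" as a whole option.
  awk '$4 ~ /(^|,)ro(,|$)/ { print $2 }' "$mtab"
}
```

Running `check_ro_mounts` on the affected host should list the three flexvolume paths below if they are indeed RO.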
Last week, two Pods of the same StatefulSet reproduced this issue. Here are the two kinds of mitigation I tried:
Cold recycle the problematic host
After I deleted the host from Kubernetes, Kubernetes re-deployed the Pod replica to a different host, and the new Pod successfully mounted the Block Storage volume in RW mode. However, in a live system this mitigation is not our first option due to HA requirements.
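The cold-recycle mitigation can be sketched as follows. The function only prints the kubectl commands so an operator can review them first; the node name is a placeholder, and the drain flags are those of kubectl from this era:

```shell
#!/bin/sh
# Sketch of the "cold recycle" mitigation. Prints the commands instead of
# executing them, so they can be reviewed before being run by hand.
cold_recycle_node() {
  node="$1"
  echo "kubectl drain $node --ignore-daemonsets --delete-local-data"
  echo "kubectl delete node $node"
  # After the node is deleted, the StatefulSet controller reschedules the
  # Pod elsewhere, and the new Pod should attach the volume in RW mode.
}
```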
Delete the problematic mount
On the problematic host, do the following steps:
1) Stop kubelet and kube-proxy
2) Unmount
/var/lib/kubelet/plugins/kubernetes.io/flexvolume/oracle/oci/mounts/<BL-OCID-1>
/var/lib/kubelet/pods/<OLD-POD-GUID>/volumes/oracle~oci/<BL-OCID-1>
/var/lib/kubelet/pods/<NEW-POD-GUID>/volumes/oracle~oci/<BL-OCID-1>
3) Delete
/var/lib/kubelet/plugins/kubernetes.io/flexvolume/oracle/oci/mounts/<BL-OCID-1>
/var/lib/kubelet/pods/<OLD-POD-GUID>/volumes/oracle~oci/<BL-OCID-1>
/var/lib/kubelet/pods/<NEW-POD-GUID>/volumes/oracle~oci/<BL-OCID-1>
4) Start kubelet and kube-proxy
5) Bounce the new Pod
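The five steps above can be sketched as a dry-run script. It echoes each command for review rather than executing anything, and the OCID/GUID/Pod-name values are placeholders exactly as in the paths above:

```shell
#!/bin/sh
# Dry-run of the manual cleanup: stop kubelet/kube-proxy, unmount and delete
# the three stale flexvolume paths, restart the services, then bounce the Pod.
manual_cleanup_plan() {
  ocid="$1"; old_pod="$2"; new_pod="$3"
  base=/var/lib/kubelet
  # The three locations where the same block volume appears mounted.
  paths="$base/plugins/kubernetes.io/flexvolume/oracle/oci/mounts/$ocid
$base/pods/$old_pod/volumes/oracle~oci/$ocid
$base/pods/$new_pod/volumes/oracle~oci/$ocid"
  echo "systemctl stop kubelet kube-proxy"
  for p in $paths; do echo "umount $p"; done
  for p in $paths; do echo "rm -rf $p"; done
  echo "systemctl start kubelet kube-proxy"
  echo "kubectl delete pod <POD-NAME>  # bounce the new Pod"
}
```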
After that, the new Pod mounts the Block Storage volume properly in the expected RW mode; however, we start to see the following Block Storage errors (journalctl -f):
Jun 24 04:17:29 <OKE-HOST> iscsid[1672]: Kernel reported iSCSI connection 7:0 error (1020 - ISCSI_ERR_TCP_CONN_CLOSE: TCP connection closed) state (2)
Jun 24 04:17:31 <OKE-HOST> kernel: connection7:0: detected conn error (1020)
It seems like the Block Storage device naming changed, but somehow kubelet failed to clean up the old entry.
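To gauge how often the iSCSI connection is dropping, the journal lines above can be counted. A minimal sketch, assuming the two message formats shown above (the helper name `count_conn_errors` is hypothetical; it reads a saved journal dump or stdin):

```shell
#!/bin/sh
# Count iSCSI connection-error lines of the two forms seen above:
# ISCSI_ERR_TCP_CONN_CLOSE from iscsid, "detected conn error" from the kernel.
count_conn_errors() {
  grep -cE 'ISCSI_ERR_TCP_CONN_CLOSE|detected conn error' "${1:-/dev/stdin}"
}
```

For example, `journalctl -k --no-pager | count_conn_errors` would report how many such errors the kernel has logged since boot.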
rjtsdl pushed a commit to rjtsdl/oci-volume-provisioner that referenced this issue on Dec 20, 2018.
It seems the oci-volume-provisioner doesn't clean up the Block Storage volume for the previous Pod before mounting it to the new Pod.