Failback failed on a gateway after failover/failback using the ceph orch daemon stop/start commands on the new HA build (4-gateway configuration) #542
Comments
Seeing this log on mon2.
Seeing this error on GW2.
The same issue is seen when performing failover and failback by powering the node off and on. Failback fails, and in both cases nvme list does not show the devices of the failed gateway after failback. IOs get stuck.
Fixed in 1.2.0.
While performing failover/failback with the ceph orch daemon stop/start commands, failover of a gateway completed successfully. After failback, however, the restored gateway did not pick up the optimised path for its load balancing group ID, so IOs running on the corresponding namespaces got stuck indefinitely.
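To make the expected behaviour concrete, here is a toy Python model of load balancing group ownership in a 4-gateway setup. It is an illustrative sketch only, assuming that gateway GWn is the home (optimised) owner of group n and that failover hands a group to the next available gateway in order; the class and method names are hypothetical, and this is not the Ceph NVMe-oF monitor's actual algorithm.

```python
from dataclasses import dataclass, field


@dataclass
class LoadBalancingGroups:
    """Toy model (hypothetical): which gateway owns each load balancing group.

    Assumption: gateway N (1-based) is the home owner of group N, and a
    stopped gateway's groups move to the next available gateway in order.
    """

    gateways: list[str]
    available: set[str] = field(default_factory=set)
    owner: dict[int, str] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Initially every gateway is up and owns its home group.
        for group_id, gw in enumerate(self.gateways, start=1):
            self.owner[group_id] = gw
            self.available.add(gw)

    def home(self, group_id: int) -> str:
        return self.gateways[group_id - 1]

    def stop(self, gw: str) -> None:
        """Failover: hand every group currently owned by gw to another gateway."""
        self.available.discard(gw)
        for group_id, current in self.owner.items():
            if current == gw:
                self.owner[group_id] = self._next_available(gw)

    def start(self, gw: str) -> None:
        """Failback: gw rejoins and reclaims the group(s) it is home for."""
        self.available.add(gw)
        for group_id in self.owner:
            if self.home(group_id) == gw:
                self.owner[group_id] = gw

    def _next_available(self, gw: str) -> str:
        # Walk the gateway list in order, wrapping around and skipping gw itself.
        start = self.gateways.index(gw)
        n = len(self.gateways)
        for step in range(1, n):
            candidate = self.gateways[(start + step) % n]
            if candidate in self.available:
                return candidate
        raise RuntimeError("no gateway available to take over")


groups = LoadBalancingGroups(["GW1", "GW2", "GW3", "GW4"])
groups.stop("GW1")
assert groups.owner[1] == "GW2"  # failover: GW2 takes over GW1's group
groups.start("GW1")
assert groups.owner[1] == "GW1"  # expected failback: GW1 reclaims its group
```

In this model the last assertion is the step that did not happen in this report: after GW1 is started again, it should become the owner of group 1 again, and the initiator's optimised path should move back to it.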
Steps performed.
quay.io/roysahar-ibm/ceph:bf9505fb569e9b95a78f9700ed8c4bd20508ef55
Logs
Before failover.
GW1
GW2
GW3
GW4
At the initiator
Mount the disk /dev/nvme13n1 and run IOs on it.
Failover performed on GW1 using the ceph orch daemon stop command.
GW1 is down.
GW2 takes over the load balancing group ID of GW1.
GW2
At the initiator
IOs run on the disk as expected, and GW2 now picks them up.
Failback performed on GW1 using the ceph orch daemon start command.
However, failback is not successful.
At GW1, after it is restored:
GW2
At the initiator, IOs get stuck.
Also
However, the ceph nvme-gw show command always gives the expected output.
We still need to check why failback failed and IOs got stuck.
Also, ceph -s reports slow ops on the mon node (note: this node is not the leader).
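For re-running the failover/failback sequence described above, a small Python helper can build the same ceph orch daemon stop/start commands, with a ceph -s health check after each step. This is a sketch under stated assumptions: the daemon name is a placeholder (the real one is listed by ceph orch ps), the settle delay is an arbitrary value, and the initiator-side checks (nvme list, running IO) still have to be done separately on the host.

```python
import subprocess
import time


def orch_daemon(action: str, daemon: str) -> list[str]:
    """Build a `ceph orch daemon <action> <daemon>` command line."""
    if action not in ("stop", "start"):
        raise ValueError(f"unsupported action: {action}")
    return ["ceph", "orch", "daemon", action, daemon]


def failover_failback(daemon: str, settle_seconds: float = 60.0,
                      run: bool = False) -> list[list[str]]:
    """Return (and optionally execute) the stop/start sequence from the report.

    With run=False (the default) nothing is executed, so the sequence can be
    reviewed first. settle_seconds is an assumed pause after each command.
    """
    steps = [
        orch_daemon("stop", daemon),   # failover: take the gateway daemon down
        ["ceph", "-s"],                # cluster health, e.g. slow ops on a mon
        orch_daemon("start", daemon),  # failback: bring the gateway daemon back
        ["ceph", "-s"],
    ]
    if run:
        for cmd in steps:
            subprocess.run(cmd, check=True)
            time.sleep(settle_seconds)
    return steps


if __name__ == "__main__":
    # Hypothetical placeholder name; substitute the daemon name from `ceph orch ps`.
    for cmd in failover_failback("nvmeof.POOL.HOST.ID"):
        print(" ".join(cmd))
```

Keeping execution behind run=False means the helper only prints the planned commands by default, which is safer when the target is a shared test cluster.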