Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Snapshot stuck on DELETING #32

Open
elonen opened this issue Nov 2, 2018 · 17 comments
Open

Snapshot stuck on DELETING #32

elonen opened this issue Nov 2, 2018 · 17 comments

Comments

@elonen
Copy link

elonen commented Nov 2, 2018

Similar to #19 (which is about a Resource), I have a snapshot that's stuck on DELETING with a NullPointerException:

╭───────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName  ┊ SnapshotName               ┊ NodeNames ┊ Volumes   ┊ State    ┊
╞┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄╡
┊ vm-105-disk-1 ┊ snap_vm-105-disk-1_vanilla ┊ mox-a     ┊ 0: 32 GiB ┊ DELETING ┊
╰───────────────────────────────────────────────────────────────────────────────╯

Satellite error report:

ERROR REPORT 5BDC4670-2E931-000001

============================================================

Application:                        LINBIT® LINSTOR
Module:                             Satellite
Version:                            0.7.1
Build ID:                           b4dab7399d24dc10917d221bd25ffcf0ed8f9712
Build time:                         2018-10-31T12:18:38+00:00
Error time:                         2018-11-02 14:43:52
Node:                               mox-a

============================================================

Reported error:
===============

Category:                           RuntimeException
Class name:                         NullPointerException
Class canonical name:               java.lang.NullPointerException
Generated at:                       Method 'computeVlmName', Source file 'DrbdDeviceHandler.java', Line #252


Error context:
    NullPointerException

Call backtrace:

    Method                                   Native Class:Line number
    computeVlmName                           N      com.linbit.linstor.core.DrbdDeviceHandler:252
    initializeResourceState                  N      com.linbit.linstor.core.DrbdDeviceHandler:177
    dispatchResource                         N      com.linbit.linstor.core.DrbdDeviceHandler:340
    run                                      N      com.linbit.linstor.core.DeviceManagerImpl$DeviceHandlerInvocation:1262
    run                                      N      com.linbit.WorkerPool$WorkerThread:179


END OF ERROR REPORT.
@elonen
Copy link
Author

elonen commented Nov 2, 2018

(Not really a fatal error but it there's a workaround to get the snapshot and then the resource deleted, I'd appreciate it.)

@rp-
Copy link
Contributor

rp- commented Nov 2, 2018

what is your output of linstor volume-definition list and linstor resource list?

@ghernadi
Copy link
Contributor

ghernadi commented Nov 5, 2018

Can you also provide the linstor command history you did? In case you used linstor in interactive mode, such a history is stored in ~/.config/linstor/client.history.

@elonen
Copy link
Author

elonen commented Nov 6, 2018

Here you go. No interactive history, sorry. The stuck resource definition is vm-105-disk-1, and it has no volume definitions or active resources, only the snapshot that won't get deleted:

root@mox-a:~# linstor r list | grep 105
| vm-105-disk-2         | mox-a        | 7013 | InUse  |     UpToDate |
| vm-105-disk-2         | mox-b        | 7013 | Unused |     UpToDate |
| vm-105-disk-2         | mox-c        | 7013 | Unused |     UpToDate |

root@mox-a:~# linstor vd list | grep 105
| vm-105-disk-2         | 0        | 1012        | 16.00 GiB  | ok    |

# linstor rd list | grep 105
| vm-105-disk-1         | 7010 | ok    |
| vm-105-disk-2         | 7013 | ok    |

root@mox-a:~# linstor snapshot list
╭───────────────────────────────────────────────────────────────────────────────╮
┊ ResourceName  ┊ SnapshotName               ┊ NodeNames ┊ Volumes   ┊ State    ┊
╞┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄┄╡
┊ vm-105-disk-1 ┊ snap_vm-105-disk-1_vanilla ┊ mox-a     ┊ 0: 32 GiB ┊ DELETING ┊
╰───────────────────────────────────────────────────────────────────────────────╯

# linstor rd delete vm-105-disk-1
ERROR:
Description:
    Cannot delete resource definition 'vm-105-disk-1' because it has snapshots.
Details:
    Resource definition: vm-105-disk-1
Show reports:
    linstor error-reports show 5BE1E08E-00000-000004

@rp-
Copy link
Contributor

rp- commented Nov 8, 2018

Ok thanks for the update.
Definitely a bug that, we allow to remove volume definition while there are still snapshots that use that volume definition.
As a workaround it should work to create a "fake" volume definition:
linstor vd c vm-105-disk-1 10M --vlmnr 0
and than delete the snapshot and afterwards the resource definition.

@kvaps
Copy link

kvaps commented May 24, 2019

I have same problem but with different reason:

# linstor s l | grep DELETING         
| one-vm-140-disk-2 | snapshot-0        | m1c4, m1c6 | 0: 3 GiB | DELETING   |

Satellite logs:

14:19:09.107 [DeviceManager] ERROR LINSTOR/Satellite - Failed to rollback to snapshot data/one-vm-140-disk-2_00000_snapshot-0 [Report number 5CE7FCEC-16A3B-000005]
ERROR REPORT 5CE7FCEC-16A3B-000005

============================================================

Application:                        LINBIT? LINSTOR
Module:                             Satellite
Version:                            0.9.9
Build ID:                           64b60e05c91b50097963c60f88830ce504dd4fd7
Build time:                         2019-05-23T07:18:37+00:00
Error time:                         2019-05-24 14:19:09
Node:                               m1c4

============================================================

Reported error:
===============

Description:
    Failed to rollback to snapshot data/one-vm-140-disk-2_00000_snapshot-0
Additional information:
    Command 'lvconvert --merge data/one-vm-140-disk-2_00000_snapshot-0' returned with exitcode 5. 

    Standard out: 


    Error message: 
      Failed to find logical volume "data/one-vm-140-disk-2_00000_snapshot-0"


Category:                           LinStorException
Class name:                         StorageException
Class canonical name:               com.linbit.linstor.storage.StorageException
Generated at:                       Method 'checkExitCode', Source file 'ExtCmdUtils.java', Line #66

Error message:                      Failed to rollback to snapshot data/one-vm-140-disk-2_00000_snapshot-0
Error context:
    An error occurred while processing resource 'Node: 'm1c4', Rsc: 'one-vm-140-disk-2''

Call backtrace:

    Method                                   Native Class:Line number
    checkExitCode                            N      com.linbit.extproc.ExtCmdUtils:66
    checkExitCode                            N      com.linbit.extproc.ExtCmdUtils:31
    genericExecutor                          N      com.linbit.linstor.storage.layer.provider.utils.Commands:71
    genericExecutor                          N      com.linbit.linstor.storage.layer.provider.utils.Commands:34
    rollbackToSnapshot                       N      com.linbit.linstor.storage.utils.lvm.LvmCommands:238
    rollbackImpl                             N      com.linbit.linstor.storage.layer.provider.lvm.LvmThinProvider:213
    rollbackImpl                             N      com.linbit.linstor.storage.layer.provider.lvm.LvmThinProvider:35
    handleRollbacks                          N      com.linbit.linstor.storage.layer.provider.AbsStorageProvider:501
    process                                  N      com.linbit.linstor.storage.layer.provider.AbsStorageProvider:264
    process                                  N      com.linbit.linstor.storage.layer.provider.StorageLayer:224
    process                                  N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:655
    processChild                             N      com.linbit.linstor.storage.layer.adapter.drbd.DrbdLayer:285
    process                                  N      com.linbit.linstor.storage.layer.adapter.drbd.DrbdLayer:232
    process                                  N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:655
    processResourcesAndTheirSnapshots        N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:280
    dispatchResources                        N      com.linbit.linstor.core.devmgr.DeviceHandlerImpl:133
    dispatchResources                        N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:250
    phaseDispatchDeviceHandlers              N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:868
    devMgrLoop                               N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:609
    run                                      N      com.linbit.linstor.core.devmgr.DeviceManagerImpl:526
    run                                      N      java.lang.Thread:834


END OF ERROR REPORT.

lv one-vm-140-disk-2_00000_snapshot-0 is not existing anymore

Probably I was called rollback before deletion, and now it is somehow stuck

UPD:: If I create snapshot manually via lvcreate --snapshot -n one-vm-140-disk-2_00000_snapshot-0 data/one-vm-140-disk-2_00000 it is disappears after satellite restart, and I see the same error:

Failed to rollback to snapshot data/one-vm-140-disk-2_00000_snapshot-0 [Report number 5CE8036A-16A3B-000000]

@kvaps
Copy link

kvaps commented May 27, 2019

Can I remove this snapshot forcefully somehow? - It is not existing anymore in backing lvm datastore.

@ghernadi
Copy link
Contributor

Have you tried to simply remove the snapshot using linstor s d <resource-name> <snapshot-name>?

@kvaps
Copy link

kvaps commented May 27, 2019

Yes, it can't be removed because linstor tries to merge it:

# linstor r l -r one-image-46
+-----------------------------------------------+
| ResourceName | Node | Port | Usage  |   State |
|-----------------------------------------------|
| one-image-46 | m1c4 | 7108 | Unused | Unknown |
| one-image-46 | m1c6 | 7108 | Unused | Unknown |
+-----------------------------------------------+


# linstor s l | grep one-image-46
| one-image-46  | snapshot-0        | m1c4, m1c6 | 0: 512 MiB | DELETING   |


# linstor s d one-image-46 snapshot-0
SUCCESS:
    Snapshot 'snapshot-0' of resource 'one-image-46' marked for deletion.
ERROR:
Description:
    (Node: 'm1c4') Failed to rollback to snapshot data/one-image-46_00000_snapshot-0
Details:
    Command 'lvconvert --merge data/one-image-46_00000_snapshot-0' returned with exitcode 5. 
    
    Standard out: 
    
    
    Error message: 
      Failed to find logical volume "data/one-image-46_00000_snapshot-0"
    
Show reports:
    linstor error-reports show 5CEBAD7B-16A3B-000003
ERROR:
Description:
    (Node: 'm1c6') Failed to rollback to snapshot data/one-image-46_00000_snapshot-0
Details:
    Command 'lvconvert --merge data/one-image-46_00000_snapshot-0' returned with exitcode 5. 
    
    Standard out: 
    
    
    Error message: 
      Failed to find logical volume "data/one-image-46_00000_snapshot-0"
    
Show reports:
    linstor error-reports show 5CE9CD24-85D84-000015

@ghernadi
Copy link
Contributor

Okay, I just wanted to suggest to manually create a snapshot so that linstor can happily merge something, but as I see, you have tried that already.

So, correct me if I am mistaken, but you are in a situation, where you have deleted a resource, linstor knows about a snapshot that is also already gone on LVM-level. Is that correct?

If so, try to re-create a dummy-resource (with the expected name) and manually create the snapshot to make linstor happy (afterwards you can also remove the dummy-resource).

@kvaps
Copy link

kvaps commented May 27, 2019

No, resource is not deleted yet, but snapshot is

# linstor rd l -R one-image-46
+-----------------------------+
| ResourceName | Port | State |
|-----------------------------|
| one-image-46 | 7108 | ok    |
+-----------------------------+
# linstor r l -r one-image-46
+-----------------------------------------------+
| ResourceName | Node | Port | Usage  |   State |
|-----------------------------------------------|
| one-image-46 | m1c4 | 7108 | Unused | Unknown |
| one-image-46 | m1c6 | 7108 | Unused | Unknown |
+-----------------------------------------------+
# linstor s l | grep one-image-46
| one-image-46  | snapshot-0        | m1c4, m1c6 | 0: 512 MiB | DELETING   |

on the node

# lvs data | grep one-image-46
  one-image-46_00000                   data Vwi-a-tz-- 516.00m thindata                    100.00                          
# lvcreate --snapshot -n one-image-46_00000_snapshot-0 data/one-image-46_00000           
  Using default stripesize 64.00 KiB.
  Logical volume "one-image-46_00000_snapshot-0" created.
# lvs data | grep one-image-46
  one-image-46_00000                   data Vwi-a-tz-- 516.00m thindata                    100.00                                 
  one-image-46_00000_snapshot-0        data Vwi---tz-k 516.00m thindata one-image-46_00000     

now try to restart satellite:

...
10:47:29.146 [MainWorkerPool-1] INFO  LINSTOR/Satellite - Snapshot 'snapshot-0' of resource 'one-image-46' registered.
10:47:29.698 [DeviceManager] ERROR LINSTOR/Satellite - Failed to rollback to snapshot data/one-image-46_00000_snapshot-0 [Report number 5CEBC037-16A3B-000000]
10:47:30.377 [MainWorkerPool-1] INFO  LINSTOR/Satellite - Snapshot 'snapshot-0' of resource 'one-image-46' registered.
10:47:30.759 [DeviceManager] ERROR LINSTOR/Satellite - Failed to rollback to snapshot data/one-image-46_00000_snapshot-0 [Report number 5CEBC037-16A3B-000001]
10:47:30.836 [MainWorkerPool-1] INFO  LINSTOR/Satellite - Resource 'one-image-46' updated for node 'm1c4'.
10:47:30.836 [MainWorkerPool-1] INFO  LINSTOR/Satellite - Resource 'one-image-46' updated for node 'm1c6'.
10:47:31.000 [DeviceManager] ERROR LINSTOR/Satellite - Failed to rollback to snapshot data/one-image-46_00000_snapshot-0 [Report number 5CEBC037-16A3B-000002]
# lvs data | grep one-image-46
  one-image-46_00000                   data Vwi---tz-- 516.00m thindata 

looks like it tries to remove same snapshot multiple times

@raltnoeder
Copy link
Member

Not sure where the "rollback to snapshot" came from, but there may be two problems:

  1. any delete operation is supposed to override (skip) any other actions on the resource/volume/snapshot/etc. and just delete it
  2. it does not really matter whether the system utilities think they deleted something, if whatever we deleted is not there afterwards, then the deletion should be successful. If it's not there in the first place, it may be unnecessary to even try to delete it, we might want to skip that.

@ghernadi
Copy link
Contributor

Can you please restart the linstor-satellite but this time with TRACE logging enabled and attach here the full log?

Also, for completeness' sake, please attach all new ErrorReports, just to be sure

@ghernadi
Copy link
Contributor

ghernadi commented Jun 17, 2019

Thanks, that helped.

I was able to somehow reproduce the issue (I am not totally happy with that because the only way I could reproduce this issue was manually removing the snapshot, which I hope you did not do :) )

Anyways, here is what happens (although, as mentioned, I am not entirely sure how you got into this situation):

When the controller connects to the satellite, it sends all data the satellite needs. In your case we have a strange combination of data such that the snapshot data/one-image-46_00000_snapshot-0 is marked for deletion AND the resource one-image-46 is configured to be rolled back from that snapshot that is just marked as deleted.
The storage layer first tries to remove the snapshot (this needs to be done first because of other corner cases) and afterwards the rollback-operation. This explains why the attempt to "create the snapshot to make linstor happy" did not succeed.

I am still thinking about how to add a safety check for preventing this unrecoverable situation from happening again. However, this might take some time, but this is on my todo-list :)

To fix your current setup, I am afraid you will have to manually edit the database (usually you should NOT do that, but right now I cannot think of an alternative) and execute

DELETE FROM LINSTOR.PROPS_CONTAINERS WHERE PROPS_INSTANCE = '/RESOURCES/M1C4/ONE-IMAGE-46' AND PROP_KEY = 'RollbackTarget';
DELETE FROM LINSTOR.PROPS_CONTAINERS WHERE PROPS_INSTANCE = '/RESOURCES/M1C6/ONE-IMAGE-46' AND PROP_KEY = 'RollbackTarget';

These two statements will basically make linstor forget about rolling back those two resources.

Edit: In case one-image-46 only has test-data which can be thrown away, you can also remove that resource instead of modifying the database. You might have to restart the controller twice afterwards, and if the snapshot is still in DELETING state, try to issue another linstor s d ... - at least in my test-vms this got rid of the snapshot.

@kvaps
Copy link

kvaps commented Jun 17, 2019

the only way I could reproduce this issue was manually removing the snapshot, which I hope you did not do

It might be, I was try to delete this snapshot by multiple ways after it went into this situation. Even manually, I don't remember all the actions what I made, sorry.

When the controller connects to the satellite, it sends all data the satellite needs. In your case we have a strange combination of data such that the snapshot data/one-image-46_00000_snapshot-0 is marked for deletion AND the resource one-image-46 is configured to be rolled back from that snapshot that is just marked as deleted.

I don't remember it well, but presume this situation has occurred after the next actions:

  1. I had resource one-image-46 and the snapshots: snapshot-0, snapshot-1, snapshot-2, snapshot-3 created for it one by one
  2. Run the next operations quickly in one batch (flatten one-image-46 with snapshot-0):
linstor s d one-image-46 snapshot-3
linstor s d one-image-46 snapshot-2
linstor s d one-image-46 snapshot-1
linstor s rollback one-image-46 snapshot-0
linstor s d one-image-46 snapshot-0

I'll let you know if I could reproduce this and the problem will occur again.

I am still thinking about how to add a safety check for preventing this unrecoverable situation from happening again. However, this might take some time, but this is on my todo-list :)

OK, good to know, Thank you!

@kvaps
Copy link

kvaps commented Jun 17, 2019

Edit: In case one-image-46 only has test-data which can be thrown away, you can also remove that resource instead of modifying the database. You might have to restart the controller twice afterwards, and if the snapshot is still in DELETING state, try to issue another linstor s d ... - at least in my test-vms this got rid of the snapshot.

This operation wasn't work, linstor strongly restrict to remove resources which have snapshots, even after I removed this volume using lvremove command

DELETE FROM LINSTOR.PROPS_CONTAINERS WHERE PROPS_INSTANCE = '/RESOURCES/M1C4/ONE-IMAGE-46' AND PROP_KEY = 'RollbackTarget';
DELETE FROM LINSTOR.PROPS_CONTAINERS WHERE PROPS_INSTANCE = '/RESOURCES/M1C6/ONE-IMAGE-46' AND PROP_KEY = 'RollbackTarget';

This was helpful. Thanks now snapshot has disappeared.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants