
RFD 148 Snapper: VM snapshots #109

Open · mgerdts opened this issue Jul 18, 2018 · 9 comments

mgerdts (Contributor) commented Jul 18, 2018

This is for discussion of

RFD 148 VM Snapshots

https://github.com/joyent/rfd/blob/master/rfd/0148/README.md

mgerdts (Contributor, Author) commented Jul 18, 2018

@twhiteman asked via chat and I answered:

  1. What happens when Manta is not available (e.g. COAL, some Triton customers)? Is there some fallback for that case?

TBD.

  2. Manta snapshot storage dir - could that be placed in the Manta account of the customer, e.g. ~/toddw/.snapshots/vms/$UUID/? That way they would be charged for the storage of their snapshots without additional billing changes, and they could also manage their snapshots through Manta operations.

There is a concern about receiving snapshots from untrusted sources. Unless we have RFD 14 implemented, we can't put the image somewhere that the customer could tamper with it.

  3. After further reading, I'm wondering whether it makes sense for snapshots to become IMGAPI images. What differentiates this from the existing CreateImageFromMachine API?

For question 3, cf. https://apidocs.joyent.com/cloudapi/#CreateImageFromMachine

IMGAPI images (I think) only cover the boot disk and require that the guest be rebooted to run a prepare-image script. That being said, it may make a lot of sense to extend IMGAPI to cover this use case.
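For context, a rough sketch of the existing CreateImageFromMachine flow via the node-triton CLI (instance and image names here are hypothetical, and exact flags may vary by version):

```sh
# Create an image from an existing instance (CreateImageFromMachine).
# This captures only the boot disk and reboots the guest to run the
# prepare-image script, which is the limitation noted above.
triton image create my-instance my-backup-image 1.0.0
```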

@jussisallinen

@mgerdts The subject has the wrong RFD number; Snapper: VM snapshots is RFD 148.

mgerdts changed the title from "RFD 128 Snapper: VM snapshots" to "RFD 148 Snapper: VM snapshots" on Jul 18, 2018
@papertigers (Contributor)

What will we do about snapshots that are extremely large? Say a customer has a 1 TB or larger instance. Will we be able to reliably send such large snapshots to Manta? Will the customer also be charged for data usage in Manta?

mgerdts (Contributor, Author) commented Jul 18, 2018 via email

marsell commented Jul 23, 2018

One minor concern I have is the Manta paths for snapshots. Overall I like the scheme, but there's one potentially common use case it would fall over on: regular database snapshots.

Regularly snapshotting a database is a good idea, and since those snapshots will hopefully never be used, there's a monetary incentive to stick to incrementals. This will result in a very deep directory structure.

I don't know what Manta's directory path limit is in characters, but in practice HTTP headers over 8K are asking for trouble.
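For a rough sense of scale, assuming each path component is a 36-character UUID plus a separator: an 8 KiB header budget allows a depth of only about 8192 / 37 ≈ 220 components, so a chain of daily incrementals encoded as nested directories would hit that ceiling in well under a year.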

mgerdts (Contributor, Author) commented Jul 23, 2018

@marsell said:

Regularly snapshotting a database is a good idea, and since those snapshots will hopefully never be used, there's a monetary incentive to stick to incrementals. This will result in a very deep directory structure.

I'm not so sure the monetary incentive is to always take incrementals from the latest snapshot, as that means you can never remove any snapshot except the latest. If the source of an incremental can be chosen, it would allow for a scheme like a monthly full, daily incrementals from the monthly full, and hourly incrementals from the previous daily or hourly, as sketched below.
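At the ZFS level, choosing the incremental source is just a matter of which snapshot is passed to zfs send -i. A minimal sketch, with hypothetical snapshot names:

```sh
# Monthly full: a complete stream up to the monthly snapshot.
zfs send zones/$UUID@monthly-2018-07 > full.zfs

# Daily incremental based on the monthly full rather than the previous
# daily, so intermediate dailies can be removed independently.
zfs send -i @monthly-2018-07 zones/$UUID@daily-2018-07-23 > daily.zfs

# Hourly incremental from the most recent daily.
zfs send -i @daily-2018-07-23 zones/$UUID@hourly-2018-07-23T14 > hourly.zfs
```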

I don't know what Manta's directory path limit is in characters, but in practice HTTP headers over 8K are asking for trouble.

Quite a valid point here. Presuming we use IMGAPI, we can probably leverage whatever support it already has for not deleting images that have children. The proposed hierarchy is clearly not the only way to accomplish this.

@rmustacc (Contributor)

Thanks for starting this. I realize that this is an early draft;
however, I think it would be really helpful to get the next round of
details fleshed out here as there are a lot of open problems which will
change things dramatically when we have a better understanding.

I think there are a couple of different classes of issues that are worth
discussing:

User Visible Aspects

First, while I understand the differentiation and practicality of a full versus incremental versus differential snapshot, it's not clear how we're going to clearly articulate this to a customer. It would really help to get a better sense of the UI and the actual API endpoints that are going to be visible. It's not clear if I can take snapshots of individual disks, datasets, everything, or nothing, or how users will really get a sense of the differences between them.

Next, I have a bunch of questions about when snapshots can be taken. Does the instance have to be powered on or powered off? If it's powered on, how do we make sure that the guests have properly quiesced their disk state such that it makes sense to take a snapshot?

One of the main points of the introduction is that this is supposed to take a snapshot of the VM's metadata. How does that work? What metadata are we taking a snapshot of or not? If we're rolling back an instance, are we also creating and destroying datasets on the host? What about things like NICs, CNS names, and other context? The on-disk state probably only makes sense in the context of everything else. For example, servers will have configuration related to the network configuration on disk to drive services. If I roll back to an older image, what happens if that IP address is no longer available? All in all, I think this really deserves a lot more thought in the RFD.

Storage of Snapshots

In most cases writing to Manta will be done over the WAN. I think the RFD is currently way too optimistic about performance over the WAN for long, extended transfers. While MPU may help us with this, if we're realistically talking about 1-2 TB transfers, that's going to take a long time to actually transit the WAN, even if we can, say, get 100 Mbit/s.
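(For scale: 1 TB at a sustained 100 Mbit/s is 8×10¹² bits ÷ 10⁸ bits/s ≈ 80,000 seconds, or roughly 22 hours; 2 TB is roughly 44 hours, before any protocol overhead or retries.)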

Conversely, the local storage discussion isn't as straightforward. A simple web server as discussed is probably not going to cut it (what happens when you run out of space?), and you're going to see pretty quick feature creep where on-prem users ask about things like NFS, CIFS, etc.

The section on ZFS reservations doesn't make clear how long those refreservations will exist for the snapshot. What happens if everyone wants to take a snapshot at the same time on a CN that doesn't have a lot of available capacity? Does something fail? If so, for whom? How does that come back and impact provisioning and DAPI? Will this allocated temporary space be made clear to DAPI?

Intersection with Image Creation

Folks are also probably going to want to ask something like: can I take this snapshot and turn it into a new instance somehow? It might be worth addressing that to some extent, or making clear that it's mostly going to be punted on.

mgerdts (Contributor, Author) commented Jul 30, 2018

First, while I understand the differentiation and practicality of a full versus incremental versus differential snapshot, it's not clear how we're going to clearly articulate this to a customer. It would really help to get a better sense of the UI and the actual API endpoints that are going to be visible.

I'll add specifics as to how I think https://apidocs.joyent.com/cloudapi/#CreateMachineSnapshot and related calls will be used.
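As a sketch of the expected shape using the node-triton CLI (snapshot and instance names are illustrative, and exact flags may vary by version):

```sh
# Take a snapshot (CreateMachineSnapshot).
triton instance snapshot create --name=snap1 my-instance

# Enumerate and inspect snapshots (ListMachineSnapshots, GetMachineSnapshot).
triton instance snapshot list my-instance
triton instance snapshot get my-instance snap1

# Roll back by booting from the snapshot (StartMachineFromSnapshot).
triton instance start --snapshot=snap1 my-instance
```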

It's not clear if I can take snapshots of individual disks, datasets, everything, or nothing, or how users will really get a sense of the differences between them.

I think the introduction and the "Anatomy of a VM snapshot" section made that clear. In particular:

  • This is currently scoped for bhyve only, so we do not have the complication of kvm's dataset layout or the possibility of delegated datasets outside of the zonepath dataset.
  • "... it contains all of the information stored in the zone's dataset (zones/<uuid>) and its descendants".

There are limitations. In particular, the following are not part of the snapshot.

  • Core files
  • Configuration that is not persisted in <zonepath>/config.

These limitations match those of the snapshots currently supported with triton instance snapshot. Being able to snapshot all configuration items and revert to them will likely have a lot of overlap with RFD 126. That is, we will need a PI-independent representation of the entire config.

Until such a time as we are able to roll back all configuration, do we need to block configuration changes while snapshots exist?

Next, I have a bunch of questions about when snapshots can be taken. Does the instance have to be powered on or powered off? If it's powered on, how do we make sure that the guests have properly quiesced their disk state such that it makes sense to take a snapshot?

It is a crash-consistent image. Use snapshots if and only if your file system and consumers of raw disk can withstand an unexpected power outage.
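For guests that need better-than-crash consistency, quiescing has to happen inside the guest before the snapshot is requested. A hypothetical example for a Linux guest (fsfreeze is part of util-linux; the mount point is illustrative):

```sh
# Inside the guest: flush dirty data and block writes to the filesystem.
fsfreeze --freeze /var/lib/mysql

# ...request the snapshot via CloudAPI from outside the guest...

# Resume writes once the snapshot exists.
fsfreeze --unfreeze /var/lib/mysql
```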

One of the main points of the introduction is that this is supposed to take a snapshot of the VM's metadata. How does that work? What metadata are we taking a snapshot of or not? If we're rolling back an instance, are we also creating and destroying datasets on the host? What about things like NICs, CNS names, and other context? The on-disk state probably only makes sense in the context of everything else. For example, servers will have configuration related to the network configuration on disk to drive services. If I roll back to an older image, what happens if that IP address is no longer available? All in all, I think this really deserves a lot more thought in the RFD.

I think I already covered this above. These snapshots will have many of the same issues as the snapshots that we already support.

Storage of Snapshots
In most cases writing to Manta will be done over the WAN. I think the RFD is currently way too optimistic about performance over the WAN for long, extended transfers. While MPU may help us with this, if we're realistically talking about 1-2 TB transfers, that's going to take a long time to actually transit the WAN, even if we can, say, get 100 Mbit/s.

Conversely, the local storage discussion isn't as straightforward. A simple web server as discussed is probably not going to cut it (what happens when you run out of space?)

I had initially proposed having some infrastructure zones with delegated datasets (snapper zones). There would be a set (minimum two, more over time) per data center. We would leverage the migration code (RFD 34) to send the VM's dataset hierarchy to the delegated datasets of two snapper zones. The stream would be received into each snapper zone's delegated dataset.

@twhiteman suggested that things would be much simpler if we relied on Manta to handle replication and maintain redundancy in the face of failures. Further discussion led to the idea that storage in Manta may lead to a lot of overlap with IMGAPI. That would contribute nicely to another customer request: the ability to deploy clones from snapshots.

| | Manta | Snapper |
| --- | --- | --- |
| Size limit of one VM's snapshots | Manta's limit | One snapper's delegated dataset size |
| Snapshot store needs more space | See Manta docs | Resize snappers, or add new snappers and rebalance |
| Rebalance | Built in | Exercise for the developer |
| Snapshot host recovery | Automatic | Exercise for the developer |
| Avoid WAN limits | Deploy Manta in each datacenter (hard) | Deploy more snapper zones in the DC (easy) |
| Maintain redundancy | Built in | Exercise for the developer |
| Remove intermediate snapshots | Not possible | Trivial to support |
| Recover directly to any snapshot without extra data transfer | Not possible | Trivial to support |
| Recover from interrupted transfer | Not possible | Possible |
| Development effort | Minimal | Significant |

If we had some form of elastic storage, Snapper would become much more practical, because the per-snapper limitations become much more flexible and resilience can be delegated to the elastic storage. Elastic storage is not this project.

We need clarity on the requirements to know which path we should be pursuing.

and you're going to see pretty quick feature creep where on-prem users ask about things like NFS, CIFS, etc.

If storing to a file not in Manta, the expectation is that the customer's NFS server could be mounted on each CN, as sketched below. In no way is this project about providing NFS, SMB, etc. If using a CN's NFS client is for some reason problematic, then we may be at the point of requiring temporary space at least as large as the largest VM and the ability to use scp or similar to copy it off host.
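A minimal sketch of that expectation, with a hypothetical NFS export (illumos mount syntax on the CN):

```sh
# On the CN: mount the customer's NFS export.
mount -F nfs backup.example.com:/export/snapshots /mnt/snapshots

# Stream a recursive replication stream of the snapshot to a file there.
zfs send -R zones/$UUID@backup1 > /mnt/snapshots/$UUID-backup1.zfs
```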

The section on ZFS reservations doesn't make clear how long those refreservations will exist for the snapshot. What happens if everyone wants to take a snapshot at the same time on a CN that doesn't have a lot of available capacity? Does something fail? If so, for whom? How does that come back and impact provisioning and DAPI? Will this allocated temporary space be made clear to DAPI?

Will clarify

Intersection with Image Creation
Folks are also probably going to want to ask something like: can I take this snapshot and turn it into a new instance somehow? It might be worth addressing that to some extent, or making clear that it's mostly going to be punted on.

Will clarify

ghost commented Sep 10, 2018

I haven't had a chance to fully read and understand the RFD and all the discussion, as it is quite large and complex. But I'm a massive fan of KISS, and as an end user of Triton and SmartOS in production on our cloud, the core MVP functionality we're after is simply:

Easy

  • Take snapshots of KVM guests (currently missing: "snapshots are not supported for VMs of brand 'kvm'")
  • Rollback to snapshots

This is basic functionality that is missing, which we already have on our non-Triton SmartOS cloud, where it works fine - AFAIK it's trivial to implement (see the existing vmadm interface sketched below). Having this would be exceptionally helpful!
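For reference, the snapshot verbs that vmadm already provides for OS (joyent-brand) zones; the ask is essentially the same interface for kvm (and bhyve) brands:

```sh
# Existing vmadm snapshot support for OS zones; the "kvm" brand is what's
# missing (snapshot name hypothetical).
vmadm create-snapshot $UUID backup1
vmadm rollback-snapshot $UUID backup1
vmadm delete-snapshot $UUID backup1
```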

We don't use delegated datasets, but if we did, I imagine a "recursive" option for SmartOS zones that includes the delegated datasets would be handy; same for rollbacks. Otherwise, it just does the zone root.

Medium

  • Download a snapshot (so end users can pull their own backups)
  • Restore a VM from a snapshot (so end users can restore their own backups). Could be implemented as creating a new VM, passing an image flag. Since CloudAPI is HTTPS, I'm guessing the image could be passed as an argument to the triton CLI tooling and via the API. Might have scaling considerations (multiple CloudAPI instances for scaling).

Again, the above seems fairly straightforward and easily plugs a big gap in functionality.

Hard

  • Store snapshots
  • Create VMs from stored snapshots

It would be nice if Manta weren't needed, as we have no intention of spinning up a Manta instance (and AFAIK Joyent doesn't currently have Manta in eu-ams-1, and support told me there were no plans to in the near term).

The KISS principle suggests to me that for a non-Manta installation, images are pushed for storage on the headnode via a mechanism similar to whatever imgadm uses.
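A rough sketch of what that could look like with existing tooling, with hypothetical paths and names (imgadm create covers only the image-creation half; the copy step here is manual):

```sh
# On the CN: create an image (manifest plus compressed stream) from the VM.
imgadm create -c gzip -o /var/tmp $UUID name=my-backup version=1.0.0

# Push the resulting manifest and file to the headnode by hand.
scp /var/tmp/my-backup-1.0.0.* headnode:/var/images/
```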

Hope the above is helpful.
