
RFD 153 Incremental metadata expansion for Manta buckets #117

Open
kellymclaughlin opened this issue Oct 17, 2018 · 6 comments

@kellymclaughlin

This is for discussion of RFD 153 Incremental metadata expansion for Manta buckets.

@askfongjojo
Contributor

askfongjojo commented Nov 1, 2018

Hi, I've re-read RFD 149 and then this RFD and have several questions around the sharding design and flow:

  • It sounds like we'll be using the same electric-morays (the existing ones for supporting hierarchical objects) to assign the vnode to use for a bucket object. If so, how do we make the hash function/algorithm spread out the objects to different vnodes? Currently the path of the object, up to the (lowest) parent directory name, is the input for hashing. The object name is not included in that (probably because there is some adjacency advantage for putting all objects under the same directory in the same vnode). For bucket objects, what would we use as the input for the hash calculation? Bucket name is probably not the best choice.
  • Do we intend to have separate hash rings for buckets and hierarchical objects? Given the proposed new way of expanding the metadata tier, it'll be good to be able to keep separate hash rings so that any expansion can be limited to a specific set of shards using that hash ring. The main concern is that there may be no more need to expand the existing shards and we don't want the remap logic to consider those shards at all.

Overall it'll be helpful to have a description of how a bucket object is processed after it hits muskie.

@kellymclaughlin
Author

Thanks for reading those over and for the questions, @askfongjojo.

It sounds like we'll be using the same electric-morays (the existing ones for supporting hierarchical objects) to assign the vnode to use for a bucket object. If so, how do we make the hash function/algorithm spread out the objects to different vnodes? Currently the path of the object, up to the (lowest) parent directory name, is the input for hashing. The object name is not included in that (probably because there is some adjacency advantage for putting all objects under the same directory in the same vnode). For bucket objects, what would we use as the input for the hash calculation? Bucket name is probably not the best choice.

I think at this point it'll probably end up being a different set of electric-morays that function similarly, but that use the single source of truth hash ring server. It's possible we could change the existing electric-moray to work with multiple hash rings, but I think it is probably best to use separate instances. The input to the hashing function will include the account owner information, the bucket name, and the object name so that an object in a particular bucket for a particular account always hashes to the same location.
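
To make that concrete, here is a minimal sketch of how the hash input might be composed (purely illustrative; the helper names and the simple modulo mapping are stand-ins, not the actual electric-moray or node-fash behaviour):

// Purely illustrative sketch: compose the hash input from the account owner,
// bucket name, and object name so that a given object always maps to the same
// vnode. This is not the actual electric-moray or node-fash implementation.
const crypto = require('crypto');

function bucketObjectHashKey(ownerUuid, bucketName, objectName) {
    // Join the components with a delimiter; in practice the delimiter and
    // escaping would need care so that distinct tuples cannot collide.
    return [ ownerUuid, bucketName, objectName ].join('/');
}

function vnodeForKey(key, nvnodes) {
    // Hash the key and reduce it to a vnode index (illustrative only; the
    // real ring uses consistent hashing rather than a simple modulo).
    const digest = crypto.createHash('sha256').update(key).digest();
    return digest.readUInt32BE(0) % nvnodes;
}

// Example: the same (owner, bucket, object) tuple always lands on one vnode.
const key = bucketObjectHashKey('<account-uuid>', 'mybucket', 'photos/cat.jpg');
console.log(vnodeForKey(key, 1000000));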

Do we intend to have separate hash rings for buckets and hierarchical objects? Given the proposed new way of expanding the metadata tier, it'll be good to be able to keep separate hash rings so that any expansion can be limited to a specific set of shards using that hash ring. The main concern is that there may be no more need to expand the existing shards and we don't want the remap logic to consider those shards at all.

Definitely agree. There will be separate hash rings. Ideally the deployment of buckets will mostly leave the existing Manta deployment unperturbed, and the new expansion behavior is different enough that I don't want to attempt to blend the two hash rings together.

Overall it'll be helpful to have a description of how a bucket object is processed after it hits muskie.

That's great to know. I think we're close to having enough pieces sorted out to provide this now. That feels a bit beyond the scope of this RFD so it maybe deserves another document, but I would like to put that together very soon.

@davepacheco
Contributor

Thanks for writing this up! It's clearly well thought out.

I think it's good that we're considering the architecture from the point of view of adding shards without downtime, but I'd strongly recommend separating the implementation into two phases: one which requires (hopefully minimal) downtime followed by one which supports expansion without downtime. Today, in practice, we take about 30 minutes of read downtime per shard, but there's good reason to think that even with what we've got today, we could shrink this to much less time.

I'm imagining a process based on what you describe under "Add a new pnode", except that we roll out two phases of ring updates. In step 2, we distribute a ring update that marks a vnode read-only (without moving it). In step 6, since it's read only, we can just restore the pg_dump directly. Afterwards, we distribute a ring update that marks the vnode writable and on a different pnode. I think the total read downtime for the shard would be dominated by the pg_dump and pg_restore. There's also the advantage that if anything goes wrong at any point (i.e., a failure or partition of some electric-morays or the orchestration process), it's trivial to roll everything back. (We just drop the new data from the new pnode and reactivate the original ring version.)
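
A rough sketch of that sequence (all helper names here are hypothetical placeholders for resharder-style primitives, not existing manta-reshard APIs):

// Hypothetical orchestration sketch of the phase-one flow described above:
// mark the vnode read-only in place, copy its rows, then flip it writable on
// the new pnode. None of these helpers exist today.
async function migrateVnodeWithReadOnlyWindow(vnode, srcPnode, dstPnode) {
    // Step A: distribute a ring update marking the vnode read-only (still on
    // srcPnode) and wait until every electric-moray acknowledges it.
    await distributeRingUpdate({ vnode: vnode, pnode: srcPnode, readOnly: true });
    await waitForAllElectricMorays();

    // Step B: with writes blocked, a plain pg_dump/pg_restore of the vnode's
    // rows is sufficient; the vnode is unavailable for writes during this copy.
    const dumpPath = await pgDumpVnode(srcPnode, vnode);
    await pgRestoreVnode(dstPnode, dumpPath);

    // Step C: distribute a ring update that makes the vnode writable on the
    // new pnode. A failure before this point is rolled back by dropping the
    // copied rows from dstPnode and reactivating the original ring version.
    await distributeRingUpdate({ vnode: vnode, pnode: dstPnode, readOnly: false });
    await waitForAllElectricMorays();
}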

I imagine phase 2 could implement the process you describe. There are a lot more pieces to implement here, they seem quite tricky (I'm still unclear on how conflict resolution works if writes have been issued to the new pnode before dependent data has been moved to it), and there are a bunch more failure modes to consider (including how we might rollback if something went wrong). I still think this is a good approach -- I'd just suggest doing these more complex pieces after we've delivered and gained operational confidence with the more basic pieces. What do you think?


I think we should more strongly emphasize the automation. We've discussed a number of principles (much more broadly than this RFD) that we haven't written down, so I figured I'd take this opportunity to write them down. You've implicitly covered a lot of these already.

  • Operators should interact with the system by expressing a holistic intent (i.e., a new topology) rather than piecemeal or non-idempotent requests (like "move this vnode to that pnode").
  • The system should carefully validate any request to change the topology.
  • The system should continue trying to make reality match the intended state and not come to rest until it's either done that or encountered a condition that it's unsure how to handle.
  • When the system encounters a condition that it's unsure how to handle, it should stop and clearly report the condition. It should provide primitives to pause execution at a particular point for validation and resume execution.
  • The system should proactively verify all the state that it expects (e.g., that it knows about all of the electric-morays; that they're running the expected ring versions; that all Manatee shards are healthy for the duration of the process; etc.) and stop with a clear message if it encounters unexpected state that it can't handle.
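
A minimal sketch of the "keep driving toward intent, stop on anything unexpected" loop implied by the principles above (all names hypothetical):

// Hypothetical reconciliation loop: repeatedly compare observed state with
// the operator's declared intent, converge in small idempotent steps, and
// stop with a clear report on any condition it doesn't know how to handle.
async function reconcile(intendedTopology) {
    for (;;) {
        const observed = await observeCurrentState();

        if (topologiesMatch(observed, intendedTopology)) {
            return; // done: reality matches the intended state
        }

        const step = planNextStep(observed, intendedTopology);
        if (step === null) {
            // Unsure how to proceed: stop and report rather than barrel on.
            throw new Error('unexpected state: ' + describeState(observed));
        }

        await executeStep(step);
    }
}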

Like I said, you've got most of this implicitly in the RFD, but I'd suggest eliminating the suggestion of manual execution and maybe emphasizing the autonomy of the automation. (I'll plan to write these down somewhere more useful than this issue comment, but I wanted to include them here so readers are on the same page about what we mean by automation here.)

While not spelled out, the current resharder is built around these principles, and I think we may want to leverage a lot of it (specifically, the state machine execution engine that it defines, along with pause/pause-at-step/resume, status reporting, etc.). (It's the ideas that I think are important rather than the specific code.) I think this approach has been critical for our ability to expand to hundreds of shards. There have been quite a number of unexpected cases where the system came to rest, whereas previous types of automation we have built would often have barreled on and created cascading failures.

Anyway, thanks again for the great write-up!

@jclulow
Contributor

jclulow commented Nov 16, 2018

One option is to introduce some means of consensus or coordination by which all electric-moray instances can agree on the correct version of the ring to be used and vnode transfers may only proceed once the electric-moray instances have all agreed to use the new version.

I would argue that we already have a consensus mechanism as part of the
reshard process. The sequence of manipulations of the hash ring, and the
process by which we shift each electric-moray through various states including
receiving the update, is a coordinated process that happens as part of the
resharding engine today.

The downside to this is that it introduces more complexity into the routing layer of the system for something that still does not happen with extreme frequency. Also, we only avoid problems if all electric-moray instances agree in concert, but this could prove difficult in the case of network partitions or downed compute nodes.

So, I think using LevelDB in the way that we do here is unfortunate -- it's not
a fantastic database, and represents a source of technical debt at this stage.
There are, though, other local/embedded databases that have much better
properties; e.g., SQLite.

I don't agree that coordinated distribution of the hash ring archive, as we do
today, is inherently worse or more complicated than storing it in an online
service. Indeed, we already store the ring in a service today: IMGAPI, with a
pointer to the correct version in SAPI.

Electric Moray presently collects the current version of the hash ring at
first-boot:

https://github.com/joyent/electric-moray/blob/380d68bb2eeafd4b7b562a6645f9a29a8da1c719/boot/setup.sh#L41-L140

As with so much of Manta, I think the actual specific lines of code we've written
to achieve that goal aren't very good -- but I don't think the architecture
is unsound. Once Electric Moray has the correct hash ring database, it can
continue to operate, even after a reboot, in the face of network partitions and
many intermittent faults in other parts of the stack.

I think the onus of coordination has to be placed on the operation that isn't
critical to availability; i.e., the barrier requirement, that every Electric
Moray process must agree on the update in order to move forward, has to be paid
by the reshard/rebalance process rather than by regular operation. That is,
it's true that coordination of this sort may not be possible in the face of a
partition, but if a partition is blocking resharding (still an exceptional, if
more frequent in future, event) we just need to fix that first.

We can even expose an API in electric-moray to query the current ring version in order to be certain rather than solely relying on time.

Though it's not an HTTP API, we do have an API for this today. The new hash
ring database includes a stamp which we can interrogate (remotely) to know if a
particular electric-moray instance is up-to-date:

https://github.com/joyent/manta-reshard/blob/2f2e4952909c145b547a35577d974964ce81d302/lib/phase_update_electric_moray_ring.js#L320-L382

I fully agree that this should be an HTTP API, rather than requiring the resharder to
run commands directly in the Electric Moray zones. We started this process with the "status API" we added to make it easier for the reshard system to determine whether Electric Moray had correctly received an update to the read-only flag for a particular pnode:

[root@00b74968 (electric-moray) ~]$ curl -sSf http://localhost:4021/status | json
{
  "smf_fmri": "svc:/smartdc/application/electric-moray:electric-moray-2021",
  "pid": 842,
  "start_time": "2018-09-12T19:57:20.821Z",
  "client_list": [
    "tcp://10.moray.us-east.scloud.host:2020",
    "tcp://11.moray.us-east.scloud.host:2020",
    "tcp://12.moray.us-east.scloud.host:2020",
    "tcp://13.moray.us-east.scloud.host:2020",
    "tcp://14.moray.us-east.scloud.host:2020",
...
    "tcp://74.moray.us-east.scloud.host:2020"
  ],
  "index_shards": [
    {
      "host": "2.moray.us-east.scloud.host"
    },
    {
      "host": "3.moray.us-east.scloud.host"
    },
...
    {
      "host": "74.moray.us-east.scloud.host"
    }
  ]
}

The two options above rely on being able to easily discern one ring version from another. Part of this proposal is to add versioning to the ring so that such a comparison is possible. Any mutation to the ring should result in the change of a version indicator as well as an update to a timestamp indicating the time of last mutation. The version update would need to be done atomically with the ring mutations to avoid any issues.

We actually have this today as well:

[root@00b74968 (electric-moray) ~]$ cat /electric-moray/chash/leveldb-2021/manta_hash_ring_stamp.json 
{
        "image_uuid": "c26087bf-4a8d-468b-92fb-53575da68b2c",
        "date": "2018-02-01T02:39:18Z",
        "source_zone": "2eb9353b-a8a5-425d-90b5-fe344ac8c399"
}

The stamp file includes the zone in which it was created, and the time at which
it was created. This stamp is generated here:

https://github.com/joyent/manta-reshard/blob/2f2e4952909c145b547a35577d974964ce81d302/templates/hashring-create-archive.sh#L26-L44

The update process downloads the current ring version (a tar file, stored in
IMGAPI), modifies it with fash, assigns the stamp, then uploads it again. The
pointer to the current ring version (stored in SAPI) is then atomically
flipped to the new Image UUID. Any interruption to this process will not
result in a partial update, and the entire process can be retried until it
succeeds.
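
As a sketch, the shape of that sequence is roughly the following (helper names are hypothetical, not the actual manta-reshard functions):

// Hypothetical sketch of the ring publish sequence described above. Nothing
// a reader sees changes until the final SAPI update, so an interruption at
// any earlier point leaves the old ring in place and the procedure can
// simply be retried from the start.
async function publishNewRing(modifyRing) {
    const currentUuid = await sapiGetCurrentRingImage();       // pointer stored in SAPI
    const archive = await imgapiDownloadArchive(currentUuid);  // current ring tar file

    const newArchive = await modifyRing(archive);              // e.g. fash manipulations
    stampArchive(newArchive, {
        date: new Date().toISOString(),
        source_zone: localZoneUuid()                           // placeholder helper
    });

    const newUuid = await imgapiUploadArchive(newArchive);     // new image, new UUID

    // The single commit point: atomically flip the SAPI pointer to the new image.
    await sapiSetCurrentRingImage(newUuid);
}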

The status API in Electric Moray today could definitely be extended to expose
the hash ring version now included in the stamp. It could also be extended
with operations like "please check SAPI for an updated hash ring version,
download it with IMGAPI, and install it". Electric Moray is in a position to
download the file and make an atomic switch by closing the old ring database
and opening the new one. The resharder, and any future remapping system, would
then merely need to
make requests to each Electric Moray process in turn to either see that they
have already updated, or to ask them to update.

Today, we provide only a simple "this pnode is read only" intermediate state,
in order to enable the current coarse-grained resharding system. I think in
future we could extend this mechanism with the other details you'd need to
reflect that a particular vnode is currently being migrated, to where, etc.
This would enable us, as you describe in the subsequent sections on
organisation and transfer, to arrange for outage-free operation during
migration.
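
For example (purely illustrative, not a proposed schema), the per-vnode state the ring carries might grow from today's read-only flag into something like:

// Illustrative only: a richer per-vnode record the ring could carry to
// describe an in-flight migration, extending today's boolean read-only flag.
const vnodeState = {
    vnode: 123456,
    pnode: 'tcp://5.moray.example.com:2020',        // current owner (placeholder host)
    state: 'migrating',                             // e.g. 'active' | 'read_only' | 'migrating'
    target_pnode: 'tcp://9.moray.example.com:2020', // where the vnode is moving
    since: '2018-11-16T00:00:00Z'                   // when the transition began
};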

To summarise, a lot of this RFD sounds extremely promising; it's just that I
don't think we need to introduce another service into the stack to provide the
kind of coordinated reconfiguration of Electric Moray that you're looking for.
Indeed, I think the way we do this today -- where each Electric Moray zone gets
an on-disk file that will survive up to and including reboot of the host system
-- is a robust way to avoid the kinds of issues that network partitions and
online consensus systems can lead to.

@kellymclaughlin
Author

Thanks for the comments, Dave! Sorry for the delayed response.

I imagine phase 2 could implement the process you describe. There are a lot
more pieces to implement here, they seem quite tricky (I'm still unclear on
how conflict resolution works if writes have been issued to the new pnode
before dependent data has been moved to it), and there are a bunch more
failure modes to consider (including how we might rollback if something went
wrong). I still think this is a good approach -- I'd just suggest doing these
more complex pieces after we've delivered and gained operational confidence
with the more basic pieces. What do you think?

I think a phased approach should be fine. As you say, the correctness and
robustness of this process is critical, and it likely represents the most
challenging part of this project to implement. For me, this RFD describes where
I want to end up, and if we have to get there in stages, that is fine.

I think we should more strongly emphasize the automation.

Like I said, you've got most of this implicitly in the RFD, but I'd suggest
eliminating the suggestion of manual execution and maybe emphasizing the
autonomy of the automation.

This is good to hear. I also feel like it is a very important part of this. I'll
try to re-frame the document more in terms of the automation and make it seem
less optional.

I'm still unclear on how conflict resolution works if writes have been issued
to the new pnode before dependent data has been moved to it

I can try to explain this a bit better or perhaps draw a diagram that
illustrates it to help.

@kellymclaughlin
Author

Thanks for the feedback, Josh! Again, sorry for the delay in responding.

I would argue that we already have a consensus mechanism as part of the
reshard process. The sequence of manipulations of the hash ring, and the
process by which we shift each electric-moray through various states including
receiving the update, is a coordinated process that happens as part of the
resharding engine today.

I am not extremely familiar with the details of how the resharder handles
coordination. My superficial understanding is that the resharder uses database
locks to enforce serialization of certain parts of the process and that it is
the resharder that makes the determination of when it is safe to proceed. If
that is so, then I think this sort of coordination will still be needed. I was
thinking more of a case where each electric-moray instance worked to
collectively decide on a particular ring version to use when a topology change
is made, and of any additional supporting software or service that might be
required for that.

I don't agree that coordinated distribution of the hash ring archive, as we do
today, is inherently worse or more complicated than storing it in an online
service. Indeed, we already store the ring in a service today: IMGAPI, with a
pointer to the correct version in SAPI.

As with so much Manta, I think the actual specific lines of code we've written
to achieve that goal aren't very good -- but I don't think the architecture is
unsound. Once Electric Moray has the correct hash ring database, it can
continue to operate, even after a reboot, in the face of network partitions
and many intermittent faults in other parts of the stack.

I don't think anything is unsound with the current process and I don't want to
imply that one way of doing this process is necessarily better or worse. I am
thinking of it in terms of different sets of trade-offs, and how a different set
might help achieve the goal I stated of changing the topology without incurring
downtime of any components, and perhaps provide some other benefits.

One of the benefits I see of adding a new service to persist and manage the
ring would be that the changes could flow to electric-moray instances
automatically and without any reprovisioning or process restarts needed. This
could of course also be done without this new ring service, but another
consideration is that, for the information needed (i.e. the pnode-to-vnode
mapping), requiring each electric-moray instance to carry the infrastructure and
overhead of a database seems very much like overkill when a small amount of
memory would suffice. Plus, the process of upgrading the database or making
schema changes would only have to happen in a single place.

It's also an excellent opportunity to hide the internals of consistent
hashing from electric-moray and abstract the idea of data placement, but I think
this is a good idea regardless of this proposal. We may never use anything other
than consistent hashing for data placement, but there is no reason the internals
of our choice need to be littered throughout the code. Some of the work I did for
the recent demo shows how this might look, except that I would imagine this
module extracted as a client library for the ring service. There are
still a lot of references to a ring service and the ring itself in the RFD, but
that is more for familiarity. I'm using it as a stand-in for a more general idea
and I would like to frame things in the more general terms of data placement.
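
To sketch what I mean by that abstraction (a hypothetical interface, not the demo code itself):

// Hypothetical data-placement client: electric-moray asks "where does this
// key live?" without knowing that consistent hashing (or anything else) is
// used underneath. A ring-service client library would implement this.
function DataPlacementClient(opts) {
    this.ringServiceUrl = opts.ringServiceUrl;
    this.placement = null;   // small in-memory mapping plus a version stamp
}

// Fetch the current placement data once at startup, then refresh it with
// periodic polling. httpGetJson is a placeholder helper.
DataPlacementClient.prototype.refresh = async function refresh() {
    this.placement = await httpGetJson(this.ringServiceUrl + '/placement');
};

// Resolve an (owner, bucket, object) tuple to the pnode that should hold its
// metadata; how the lookup works internally is deliberately hidden.
DataPlacementClient.prototype.locate = function locate(owner, bucket, object) {
    return lookupPnode(this.placement, owner, bucket, object);
};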

As far as just the effort to write this new service goes, I really just envision
it as an HTTP server bolted onto manatee, so much of the most challenging work is
already done.

The point about electric-moray currently being able to continue operation in the
face of reboots or network issues is a very good one. This would be a trade-off
we would have to accept, but I think it's not as severe as it might seem
considering other factors. The ring service need only be queried when
electric-moray starts. It should periodically poll for ring changes as it runs,
but I do not think that a failure of this poll should cause an electric-moray to
cease functioning. It's already the case that electric-moray cannot begin
functioning if it cannot successfully contact a moray for every metadata shard
at start time, but we accept that with the reasoning that a moray for each shard
should normally be reachable and I think the same reasoning can apply to the
ring service as I described it. Specifically the implementation that uses
manatee with postgres and the remote_apply replication setting. This is
essentially chain replication (in theory at least, I'd want to do some testing
to verify it holds up) so we can read from any member of the chain and therefore
reduce the chances that a booting electric-moray cannot get the required
information to start handling requests.
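
A small sketch of that startup read, assuming electric-moray is allowed to read the placement data from any member of the chain (peer names and helpers are placeholders):

// Hypothetical startup fetch: try each ring-service peer in turn so that a
// booting electric-moray can still obtain placement data even when some
// members of the replication chain are unreachable.
async function fetchPlacementAtStartup(peers) {
    for (const peer of peers) {
        try {
            return await httpGetJson('http://' + peer + '/placement');
        } catch (err) {
            // Log and fall through to the next member of the chain.
            console.warn('placement fetch from %s failed: %s', peer, err.message);
        }
    }
    throw new Error('no ring service peer reachable; cannot start');
}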

Now we could do a hybrid approach whereby we have a ring service, but also
persist the data to disk locally and use the locally stored data in the case of
a restart where the ring service is unavailable. This is fine except that it
introduces the possibility of conflicting rings and incorrect behavior during a
topology change. For example, if an electric-moray process starts up while a new
ring is being disseminated and begins using a local and outdated copy of the
ring data, that could be a very bad situation. A workaround for this scenario
might be to add a configuration option, used during topology changes, that
disables the disk fallback. This also might be reserved as an enhancement to be
done after the initial release.

Yet another way is to stay with something more similar to the current approach
with each electric-moray keeping a disk record of the placement data. We would
need to add a way for changes to the data to be made without restarting the
process and would need to also have the ability to disable the use of the disk
data in the case of an unexpected restart during a topology change. With those
two changes we ought to be able to proceed with topology changes in a similar
manner as with the ring server scenario. I suppose then that the canonical copy
of the data is stored in IMGAPI as is done now with the ring tar files. Then
we're back to the question of whether or not we really need to persist the data
to disk on each electric-moray or if IMGAPI is the ring server and we're
actually back to the first scenario. I confess to not knowing if IMGAPI
is truly up for that task or not, but it seems like a misuse of the intended
purpose of IMGAPI at least.

There seem to be multiple paths to achieve the end goals. I really do like the
idea of splitting what electric-moray does into more focused components, but I
need to ponder the options a bit more, and it might even be that this ends up
being a phased process too, where we take a step, evaluate it, and decide what
else, if anything, should be done.

I think the onus of coordination has to be placed on the operation that isn't
critical to availability; i.e., the barrier requirement, that every Electric
Moray process must agree on the update in order to move forward, has to be
paid by the reshard/rebalance process rather than by regular operation. That
is, it's true that coordination of this sort may not be possible in the face
of a partition, but if a partition is blocking resharding (still an
exceptional, if more frequent in future, event) we just need to fix that
first.

I wouldn't say that the onus of coordination would be shifted to the
critical-path service, but rather that the detection of changes would happen
automatically and continuously. The burden of coordination would still lie with
the service managing the topology change. The advantage is not having to lose
any processing capacity while the changes are effected. Once the topology change
manager decided it was safe to proceed, the changes to the ring to take it to
the transitioning state would be made; each electric-moray would detect the
changes through its periodic polling, and then it would be up to the topology
change manager to ensure (through status API calls) that each electric-moray was
using the proper ring version before proceeding.
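
Concretely, that verification step might look something like the following sketch (the ring_version field on the status response is assumed, as are the helpers):

// Hypothetical sketch: after publishing a new ring version, the topology
// change manager polls each electric-moray's status API until every instance
// reports the expected version, and only then moves to the next step.
async function waitForRingVersion(electricMorayHosts, expectedVersion) {
    for (;;) {
        const statuses = await Promise.all(electricMorayHosts.map(function (host) {
            return httpGetJson('http://' + host + ':4021/status');
        }));

        const behind = statuses.filter(function (s) {
            return s.ring_version !== expectedVersion;   // assumed status field
        });

        if (behind.length === 0) {
            return;   // every electric-moray is on the expected ring version
        }

        await sleep(5000);   // poll again shortly; sleep is a placeholder helper
    }
}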

We actually have this today as well:

Thanks for highlighting a lot of the existing features that might be used or
further enhanced and exposed. I even came across a few of those on my own as I
was hacking up electric-moray for the buckets demo.
