-
Notifications
You must be signed in to change notification settings - Fork 9
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
1 parent
3ef20f3
commit 06c1588
Showing
1 changed file
with
166 additions
and
0 deletions.
There are no files selected for viewing
166 changes: 166 additions & 0 deletions
166
...on-a-new-primary-copy-method-to-support-highly-available-distributed-systems.md
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,166 @@ | ||
--- | ||
geometry: margin=3cm | ||
title: Notes for ``Viewstamped Replication - A New Primary Copy Method to Support Highly-Available Distributed Systems'' | ||
--- | ||
|
||
# Introduction | ||
|
||
- Assumptions | ||
- Failures are not byzantine | ||
- Nodes will eventually recover from crashes | ||
- Partitions are eventually repaired | ||
- Computations run as atomic transactions | ||
- Model of computation where a distributed program consists of _modules_, each of which resides at | ||
a single node of the network | ||
- Modules can recover with some of their state intact | ||
- Modules cannot access the data objects of another module directly and must use _remote | ||
procedure calls_ | ||
- Based on primary copy technique | ||
- One replica is designated as the primary and others as backups | ||
- Primary is responsible for processsing transactions and notifying backups | ||
- When a replica crashes/recovers, replicas are reorganized and a new primary is selected if | ||
necessary | ||
- Reorganization is called a _view_ | ||
- Primary copy techinque only works if node failures are distinguishable from network failures, | ||
but it is not true in the general case | ||
|
||
# Overview | ||
|
||
- Replicate individual modules to obtain _module groups_ which consists of several copies of the | ||
module called _cohorts_ | ||
- The set of cohorts is the group's _configuration_ | ||
- Each cohort has a unique name called an _mid_ | ||
- The group has a unique `groupid` | ||
- One cohort is designed the _primary_ and remaining cohorts are _backups_ | ||
- Each cohort runs in a _view_ which is a set of cohorts that are capable of communicating with each | ||
other | ||
- It is a a subset of the configuration and must contain a majority of the group members | ||
- A group switches to a new view by executing a _view change protocol_ | ||
- Each view is identified by a unique `viewid` (totally ordered) | ||
- If a majority of cohorts accept the new view, cohorts switch to the new active view, else they | ||
remain in their old views | ||
- Transactions are only processed in active views | ||
- Primary generates a timestamp (unique within a view and totally ordered) every time it needs to | ||
communicate information to its backups referred to as an _event_ | ||
- Primary maintains a communication buffer where it writes _event records_ | ||
- An event record identifies the type of event and other relevant information about the event | ||
- Information in the buffer is sent to backups in timestamp order | ||
- Timestamps are used as an inexpensive way of determining what information survives a view change | ||
- _Viewstamps_ are timestamps concatenated with the `viewid` of the view in which the timestamp is | ||
generated | ||
- Each cohort maintains a _history_ of viewstamps | ||
|
||
# Running Transactions | ||
|
||
- Viewstamps are used to determine if a transaction can commit | ||
- State of cohort: | ||
- `status`: A cohort is _active_ if it can participate in transaction processing, otherwise it is | ||
_inactive_ and involved in a view change | ||
- `gstate`: Set of objects that consistute the group state | ||
- `groupid`: Name of the module group | ||
- `cur_viewid`: Current `viewid` | ||
- `cur_view`: Current view | ||
- `history`: Sequence of events known to cohort | ||
- `timestamp`: Timestamp generator | ||
- `buffer`: Sequence of event records (communication buffer) | ||
- Transactions are synchronized by strict 2-phase locking | ||
- Transactions modify _tentative_ version which is discarded if transaction is aborted and becomes | ||
base version if it is committed | ||
- Communication buffer actions: | ||
- `add`: Takes an event record, atomically assigns it a timestamp and adds the event record to the | ||
buffer | ||
- `force-to`: Takes viewstamp $v$ has an argument. Returns immediately if $v$ is not for the | ||
current view and otherwise waits until a sub-majority of backups know about all events in the | ||
current view with timestamps less than or equal to $v.ts$ | ||
- Clients create transactions, make remote calls, and act as coordinators of two-phase commit | ||
- Servers process remote calls and participate in two-phase commit | ||
- Assumes that a highly-available location server that maps `groupids` to configurations exists | ||
|
||
## Active Primaries of Clients | ||
|
||
- When a server finishes processing a remote call on behalf of transaction, it assigns the call a | ||
viewstamp | ||
- When transaction is created, it receives a unique transaction id `aid` and an empty `pset` that | ||
keeps track of remote calls done in the transaction | ||
- To make a remote call, system looks up primary and `viewid` for the group and then sends a call | ||
message which contains a unique call id to the primary | ||
- Reply returns a `pset` of `<groupid, viewstamps>` pairs | ||
- If there is no reply, then abort transaction and send abort messages to all participants | ||
- If reply indicates that view has changed, update cache, retry call and if most recent view | ||
cannot be discovered, then abort the transaction | ||
- Client's primary acts as coordinate of two-phase commit protocol | ||
- Determines who the participants are from the `pset` | ||
- Sends prepare messages to participants of `pset` | ||
- Adds committing record to buffer if all participants agree to ensure that commit will be known | ||
across view change of coordinator | ||
- Sends commit messages and when all are acknowledged, adds done record to buffer | ||
- If transaction aborts, then coordinator sends abort messages to participants and adds an aborted | ||
record to buffer | ||
|
||
## Active Primaries of Servers | ||
|
||
- When remote call completes, server's primary assigns it a viewstamp and returns this information | ||
in the reply message | ||
- Primary can agree to prepare only if it knows about all remote calls its group has done on behalf | ||
of preparing transaction | ||
- Rejects call if call's `viewid` does not equal to `cur_viewid` | ||
- Otherwise, it creates empty `pset` and runs the call, possibly making nested calls | ||
- Writes a completed call record to buffer that identifies each atomic object that was read or | ||
written in processing the call with the type of lock obtained and the tentative version if any | ||
- When primary receives a prepare message, it must check that all calls made by transaction to its | ||
group is known and compatible with its history | ||
- If prepare is successful, then it takes the viewstamp of most recent completed call record and | ||
forces the buffer | ||
- If it receives a commit message, the primary forces a committed record and then sends | ||
acknowledgement to the coordinator | ||
- If it receives an abort message, it adds an aborted record to the buffer | ||
|
||
## Other Processing at Cohorts | ||
|
||
- Cohorts that are not active primaries reject messages sent to them by other module group (except | ||
for queries described in next section) and the response contains `cur_viewid` and the id of the | ||
primary if the cohort knows it. | ||
- Active backups process event records in timestamp order and update the state accordingly | ||
|
||
## Queries | ||
|
||
- Any cohort can response to a query whenever it knows the answer (E.G. a cohort that is not a | ||
primary may know about the abort of a transaction because it received the aborted event record | ||
from the primary) | ||
|
||
## Replicated Clients | ||
|
||
- It is desirable for coordinator to be highly available | ||
- If client is not replicated, we can still use a replicated coordinator-server | ||
- Client takes to such a server to start a transaction and the coordinator-server carries out | ||
two-phase commit on client's behalf | ||
|
||
## Nested Transactions | ||
|
||
- Use nested transactions as a check-pointing mechanism | ||
- Minimizes effects of view change | ||
- If there is no reply, then just abort the subaction and do the call again | ||
|
||
# Changing Views | ||
|
||
- If every view has at least a majority of cohorts, then it contains at least one cohort that knows | ||
about any event that was forced to a majority of cohorts | ||
- Must ensure that state of new view includes what that cohort knows | ||
- Can take state of cohort with highest viewstamp for the previous view | ||
- Event records are sent to backups in timestamp order | ||
- Cohorts send periodic keep-alive messages to other cohorts in the configuration | ||
- Cohort initiates view change if it notices that it is not communicating with some other cohort or | ||
if it notices that it is communicating with a cohort that it could not communicate with previously | ||
- Cohorts generate unique new `viewid` using their `mid` and a number greater than `max_viewid` and | ||
then it sends invitations to other cohorts | ||
- Cohorts will accept invitations from `viewids` that are greater than its largest recorded one | ||
- Cohorts send normal acceptance containing its current viewstamp and an indication of whether or | ||
not it is the primary in the current view if it is up to date, else it sends a crash acceptance | ||
- Rule for view formation: A majority of cohorts have accepted and | ||
1. A majority of cohorts accepted normally | ||
2. `crash_viewid` < `normal_viewid` (Ignore crashed acceptances from old views) | ||
3. `crash_viewid` = `normal_viewid` and the primary of view `normal_viewid` has done a normal | ||
acceptance of the invitation (Ignore crashed acceptance if we havve information from the | ||
primary of its view) | ||
- Cohort with the largest viewstamp is selected as the new primary | ||
- There can be several active primaries, but old primary cannot force effects to backups |