Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove Snapshot INIT Step #55918

Merged

Conversation

original-brownbear
Copy link
Member

@original-brownbear original-brownbear commented Apr 29, 2020

With #55773 the snapshot INIT state step has become obsolete. We can set up the snapshot directly in one single step to simplify the state machine.

This is a big help for building concurrent snapshots because it allows us to establish a deterministic order of operations between snapshot create and delete operations since all of their entries now contain a repository generation. With this change simple queuing up of snapshot operations can and will be added in a follow-up.

@original-brownbear original-brownbear added >non-issue WIP :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs v8.0.0 v7.8.0 labels Apr 29, 2020
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)

@elasticmachine elasticmachine added the Team:Distributed Meta label for distributed team label Apr 29, 2020
@original-brownbear
Copy link
Member Author

Jenkins run elasticsearch-ci/bwc

logger.warn("Failed to load snapshot metadata, assuming repository is in old format", e);
return OLD_SNAPSHOT_FORMAT;
}
return OLD_SNAPSHOT_FORMAT;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

7.6 always adds the versions to the repository data. If we don't find one here then there's no point in doing any IO because we know that the repository data was written by a version older than 7.6. We can simply go ahead and assume an old version and be done with things. I had to make this change here because I needed this method to work on the CS thread, but we can make it safely in all versions in fact. Loading the individual SnapshotInfo here was never necessary.

@@ -840,7 +838,11 @@ public void run() {
scheduleNow(() -> testClusterNodes.stopNode(masterNode));
}
testClusterNodes.randomDataNodeSafe().client.admin().cluster().prepareCreateSnapshot(repoName, snapshotName)
.execute(snapshotStartedListener);
.execute(ActionListener.wrap(() -> {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is basically a revert of 93c6d77

Master fail-over would not lead to a failing snapshot response here with any observable frequency when we had the INIT state.
If the client (connected to a data node) lost the connection to the master node (because we disconnect the master temporarily in some runs), it would retry and the master meanwhile would have at the most made it to an INIT state in the CS (which the retry would simply ignore once a new master is elected). Now the first CS update will always be a STARTED state snapshot and a retry can again run into an in-progress snapshot.
We should fix clean retrying some other way (could do what we did for force merge UUIDs and generate the unique SnapshotId in the transport request already so that retries can know they're a retry). Since we haven't released 7.8 and this is a really minor UX win to revert here, I think this is an acceptable "regression".

@@ -74,84 +74,6 @@ protected Settings nodeSettings(int nodeOrdinal) {
.build();
}

public void testDisruptionOnSnapshotInitialization() throws Exception {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Obsolete :)

null, userMeta, version);
}
return ClusterState.builder(currentState).putCustom(SnapshotsInProgress.TYPE,
new SnapshotsInProgress(List.of(newEntry))).build();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simplified this code a bit since we don't have to find the INIT entry in the existing state any longer and I felt like there was no point in pretending we would have more than one snapshot here yet when we don't have any such thing. Concurrent snapshot operations will need some bigger changes to this step anyway so the pretend loop is of no use yet.

}
failureMessage.append("Indices are closed ");
}
// TODO: We should just throw here instead of creating a FAILED and hence useless snapshot in the repository
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't want to make this change here yet, but I think it's a valid change. It's entirely pointless to keep writing these snapshots to the repo. It's against the spirit of what partial == false means IMO and it adds nothing but an unused cluster state and a bunch of unused index metadata to the repo for no good reason.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed this morning we may want to validate this point with Cloud as it is useful to know which snapshot have failed. Besides this I can't remember any valid reason to write FAILED snapshot into the repository if the snapshot did not even start.

@@ -1111,12 +950,7 @@ public void clusterStateProcessed(String source, ClusterState oldState, ClusterS
result.v1().getGenId(), null, Priority.IMMEDIATE, listener));
},
e -> {
if (abortedDuringInit) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to wait here anymore. If the snapshot was aborted during INIT, then that means it was already simply removed from the CS so no point in waiting here. This only made sense when we had the INIT state and would move INIT -> ABORT -> remove via https://github.com/elastic/elasticsearch/pull/55918/files#diff-a0853be4492c052f24917b5c1464003dL424 (or the apply CS method) which isn't a thing any longer without the INIT state.

});
}

private static class CleanupAfterErrorListener {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was only used to clean up INIT state snapshots

@original-brownbear original-brownbear marked this pull request as ready for review April 29, 2020 12:52
Copy link
Member

@tlrx tlrx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Thanks for the extra comments which helped a lot. I think you nailed the corner cases, at least I can't find one which was already handled in some way.

}
failureMessage.append("Indices are closed ");
}
// TODO: We should just throw here instead of creating a FAILED and hence useless snapshot in the repository
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed this morning we may want to validate this point with Cloud as it is useful to know which snapshot have failed. Besides this I can't remember any valid reason to write FAILED snapshot into the repository if the snapshot did not even start.

@original-brownbear
Copy link
Member Author

Jenkins run elasticsearch-ci/2

@original-brownbear
Copy link
Member Author

Thanks Tanguy!

@original-brownbear original-brownbear merged commit f4022c0 into elastic:master May 5, 2020
@original-brownbear original-brownbear deleted the no-more-snapshot-init branch May 5, 2020 16:06
@mfussenegger mfussenegger mentioned this pull request May 13, 2020
37 tasks
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request May 18, 2020
With elastic#55773 the snapshot INIT state step has become obsolete. We can set up the snapshot directly in one single step to simplify the state machine.

This is a big help for building concurrent snapshots because it allows us to establish a deterministic order of operations between snapshot create and delete operations since all of their entries now contain a repository generation. With this change simple queuing up of snapshot operations can and will be added in a follow-up.
@original-brownbear
Copy link
Member Author

Backporting this to 7.x is extremely involved (because 7.x must still be able to go through the INIT step as long as there's a pre-7.5 version node in the cluster and correctly deal with all the resulting corner cases throughout the codebase). <= Just making a note here that this backport (and other snapshot PRs that depend on it) hasn't been forgotten. But I'd like to raise this in the snapshot resiliency meeting first to discuss strategy for this kind of BwC problem first before deciding on a technical solution for the backport.

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Jul 12, 2020
With elastic#55773 the snapshot INIT state step has become obsolete. We can set up the snapshot directly in one single step to simplify the state machine.

This is a big help for building concurrent snapshots because it allows us to establish a deterministic order of operations between snapshot create and delete operations since all of their entries now contain a repository generation. With this change simple queuing up of snapshot operations can and will be added in a follow-up.
original-brownbear added a commit that referenced this pull request Jul 13, 2020
With #55773 the snapshot INIT state step has become obsolete. We can set up the snapshot directly in one single step to simplify the state machine.

This is a big help for building concurrent snapshots because it allows us to establish a deterministic order of operations between snapshot create and delete operations since all of their entries now contain a repository generation. With this change simple queuing up of snapshot operations can and will be added in a follow-up.
@original-brownbear original-brownbear restored the no-more-snapshot-init branch August 6, 2020 19:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >non-issue Team:Distributed Meta label for distributed team v7.9.0 v8.0.0-alpha1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants