Remove Snapshot INIT Step #55918
Conversation
Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)
Jenkins run elasticsearch-ci/bwc
    logger.warn("Failed to load snapshot metadata, assuming repository is in old format", e);
    return OLD_SNAPSHOT_FORMAT;
}
return OLD_SNAPSHOT_FORMAT;
7.6 always adds the versions to the repository data. If we don't find one here, then there's no point in doing any IO, because we know the repository data was written by a version older than 7.6. We can simply assume an old version and be done with things. I had to make this change here because I needed this method to work on the CS thread, but in fact it is safe to make in all versions: loading the individual SnapshotInfo here was never necessary.
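The fallback described above can be sketched as follows; the class and method names here are illustrative stand-ins for the real Elasticsearch internals, not the actual API:

```java
public class VersionFallbackSketch {
    // Stand-in for the real "old snapshot format" sentinel version.
    static final String OLD_SNAPSHOT_FORMAT = "pre-7.6";

    // recordedVersion is the per-snapshot version from the repository data,
    // or null when the repository data predates 7.6 (which always writes it).
    static String snapshotVersion(String recordedVersion) {
        // No version recorded => written before 7.6, so assume the old
        // format without any further IO (no SnapshotInfo loading needed).
        return recordedVersion != null ? recordedVersion : OLD_SNAPSHOT_FORMAT;
    }
}
```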
@@ -840,7 +838,11 @@ public void run() {
        scheduleNow(() -> testClusterNodes.stopNode(masterNode));
    }
    testClusterNodes.randomDataNodeSafe().client.admin().cluster().prepareCreateSnapshot(repoName, snapshotName)
        .execute(snapshotStartedListener);
        .execute(ActionListener.wrap(() -> {
This is basically a revert of 93c6d77. Master fail-over would not lead to a failing snapshot response here with any observable frequency when we had the INIT state: if the client (connected to a data node) lost the connection to the master node (because we disconnect the master temporarily in some runs), it would retry, and the master would meanwhile have made it at most to an INIT state in the CS (which the retry would simply ignore once a new master is elected). Now the first CS update is always a STARTED state snapshot, and a retry can again run into an in-progress snapshot. We should fix clean retrying some other way (we could do what we did for force-merge UUIDs and generate the unique SnapshotId in the transport request already, so that retries can know they're a retry). Since we haven't released 7.8 and this is a really minor UX win to revert here, I think this is an acceptable "regression".
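The retry idea mentioned above (a client-generated snapshot UUID, as was done for force-merge UUIDs) could look roughly like this; the class and field names are hypothetical, not the actual Elasticsearch request classes:

```java
import java.util.UUID;

// Hypothetical sketch: the unique snapshot UUID is generated when the
// transport request is created on the client, so a retried request carries
// the same UUID and the master can recognise it as a retry of a request it
// already accepted, instead of failing on the in-progress snapshot.
public class CreateSnapshotRequestSketch {
    final String snapshotName;
    final String snapshotUuid; // fixed at request creation, survives retries

    CreateSnapshotRequestSketch(String snapshotName) {
        this.snapshotName = snapshotName;
        this.snapshotUuid = UUID.randomUUID().toString();
    }

    // Master-side check (simplified): same UUID => this is a retry.
    static boolean isRetry(String inProgressUuid, CreateSnapshotRequestSketch request) {
        return request.snapshotUuid.equals(inProgressUuid);
    }
}
```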
@@ -74,84 +74,6 @@ protected Settings nodeSettings(int nodeOrdinal) {
        .build();
}

public void testDisruptionOnSnapshotInitialization() throws Exception {
Obsolete :)
        null, userMeta, version);
}
return ClusterState.builder(currentState).putCustom(SnapshotsInProgress.TYPE,
    new SnapshotsInProgress(List.of(newEntry))).build();
Simplified this code a bit, since we no longer have to find the INIT entry in the existing state. Also, there seemed to be no point in pretending we could have more than one snapshot here when we don't have any such thing yet. Concurrent snapshot operations will need bigger changes to this step anyway, so the pretend loop is of no use for now.
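The simplification can be sketched as below; Entry and State are simplified stand-ins for the real SnapshotsInProgress types, not the actual classes:

```java
import java.util.Collections;
import java.util.List;

// Sketch of the single-entry setup described above: with INIT gone, the
// updated cluster state custom is built with exactly one freshly created
// STARTED entry instead of scanning the existing entries for an INIT one.
public class SnapshotsInProgressSketch {
    enum State { STARTED, SUCCESS, FAILED } // no INIT state any longer

    static final class Entry {
        final String snapshot;
        final State state;
        final long repositoryGeneration; // enables deterministic op ordering

        Entry(String snapshot, State state, long repositoryGeneration) {
            this.snapshot = snapshot;
            this.state = state;
            this.repositoryGeneration = repositoryGeneration;
        }
    }

    static List<Entry> startSnapshot(String snapshot, long repoGen) {
        // Single step: the first cluster-state entry is already STARTED.
        return Collections.singletonList(new Entry(snapshot, State.STARTED, repoGen));
    }
}
```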
    }
    failureMessage.append("Indices are closed ");
}
// TODO: We should just throw here instead of creating a FAILED and hence useless snapshot in the repository
I didn't want to make this change here yet, but I think it's a valid change. It's entirely pointless to keep writing these snapshots to the repo. It's against the spirit of what partial == false means IMO, and it adds nothing but an unused cluster state and a bunch of unused index metadata to the repo for no good reason.
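The TODO could be resolved with an up-front check along these lines; the method name and exception choice are assumptions for illustration, not the actual implementation:

```java
// Hypothetical validation sketch: with partial == false, closed indices
// should fail the request immediately instead of producing a FAILED
// snapshot (and its unused cluster state and index metadata) in the repo.
public class SnapshotValidationSketch {
    static void ensureSnapshotable(boolean partial, java.util.List<String> closedIndices) {
        if (!partial && !closedIndices.isEmpty()) {
            // Fail fast: nothing has been written to the repository yet.
            throw new IllegalStateException("Indices are closed " + closedIndices);
        }
    }
}
```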
As discussed this morning, we may want to validate this point with Cloud, as it is useful to know which snapshots have failed. Besides that, I can't remember any valid reason to write a FAILED snapshot into the repository if the snapshot did not even start.
@@ -1111,12 +950,7 @@ public void clusterStateProcessed(String source, ClusterState oldState, ClusterS
        result.v1().getGenId(), null, Priority.IMMEDIATE, listener));
    },
    e -> {
        if (abortedDuringInit) {
No need to wait here anymore. If the snapshot was aborted during INIT, that means it was already simply removed from the CS, so there is no point in waiting. This only made sense when we had the INIT state and would move INIT -> ABORT -> remove via https://github.com/elastic/elasticsearch/pull/55918/files#diff-a0853be4492c052f24917b5c1464003dL424 (or the apply-CS method), which isn't a thing any longer without the INIT state.
    });
}

private static class CleanupAfterErrorListener {
This was only used to clean up INIT state snapshots.
LGTM. Thanks for the extra comments, which helped a lot. I think you nailed the corner cases; at least I can't find one that isn't already handled in some way.
Jenkins run elasticsearch-ci/2
Thanks Tanguy!
With elastic#55773 the snapshot INIT state step has become obsolete. We can set up the snapshot directly in one single step to simplify the state machine. This is a big help for building concurrent snapshots because it allows us to establish a deterministic order of operations between snapshot create and delete operations since all of their entries now contain a repository generation. With this change simple queuing up of snapshot operations can and will be added in a follow-up.
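The deterministic ordering enabled by per-entry repository generations can be sketched as follows; the Op class is illustrative only, not an actual Elasticsearch type:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch: once every snapshot create and delete entry carries the
// repository generation it was based on, pending operations can be put
// into a deterministic order by comparing those generations.
public class RepoOpOrderingSketch {
    static final class Op {
        final String description;
        final long repositoryGeneration;

        Op(String description, long repositoryGeneration) {
            this.description = description;
            this.repositoryGeneration = repositoryGeneration;
        }
    }

    static List<Op> inRepositoryOrder(List<Op> ops) {
        List<Op> sorted = new ArrayList<>(ops);
        sorted.sort(Comparator.comparingLong(op -> op.repositoryGeneration));
        return sorted;
    }
}
```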
Backporting this to