Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

server: Snapshot after cluster version downgrade #13686

Merged
merged 1 commit into from
Feb 21, 2022

Conversation

serathius
Copy link
Member

@serathius serathius commented Feb 11, 2022

Part of #13168

Etcd downgrades rely on both cluster version and storage version being lowered lowered. After cluster version is lowered peer-to-peer communication should become backward compatible with older etcd. This is prerequisite to downgrade storage version as there will not be new incompatible entries added to WAL. However WAL can still include older incompatible entries that make downgrading storage version impossible. To remove those entries we need to wait for snapshot. Waiting for automatic snapshot would take us too long (every 10000 proposals), so I propose to snapshot imminently after cluster version is lowered.

There is no internal Raft request for asking every cluster member to snapshot, so we do it in raft apply step.
Not the cleanest implementation, however I didn't find a reasonable way to send a signal from applier to apply function.

cc @ptabor @ahrtr @spzala

@codecov-commenter
Copy link

codecov-commenter commented Feb 11, 2022

Codecov Report

Merging #13686 (e46aa98) into main (bcadd03) will increase coverage by 0.06%.
The diff coverage is 52.94%.

❗ Current head e46aa98 differs from pull request most recent head 863ed5a. Consider uploading reports for the commit 863ed5a to get more accurate results

Impacted file tree graph

@@            Coverage Diff             @@
##             main   #13686      +/-   ##
==========================================
+ Coverage   72.51%   72.57%   +0.06%     
==========================================
  Files         465      465              
  Lines       37865    37877      +12     
==========================================
+ Hits        27457    27491      +34     
+ Misses       8620     8588      -32     
- Partials     1788     1798      +10     
Flag Coverage Δ
all 72.57% <52.94%> (+0.06%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
server/storage/mvcc/index.go 90.25% <33.33%> (-2.41%) ⬇️
server/auth/simple_token.go 88.46% <50.00%> (+7.35%) ⬆️
server/etcdserver/apply.go 89.10% <60.00%> (-0.22%) ⬇️
server/etcdserver/server.go 84.28% <100.00%> (-0.23%) ⬇️
client/pkg/v3/tlsutil/tlsutil.go 83.33% <0.00%> (-8.34%) ⬇️
server/storage/schema/cindex.go 95.34% <0.00%> (-4.66%) ⬇️
client/v3/leasing/cache.go 87.22% <0.00%> (-4.45%) ⬇️
server/etcdserver/api/v3rpc/member.go 93.54% <0.00%> (-3.23%) ⬇️
client/v3/experimental/recipes/key.go 75.34% <0.00%> (-2.74%) ⬇️
server/etcdserver/api/v3rpc/watch.go 83.55% <0.00%> (-2.35%) ⬇️
... and 18 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bcadd03...863ed5a. Read the comment docs.

@@ -291,6 +291,9 @@ type EtcdServer struct {
clusterVersionChanged *notify.Notifier

*AccessController
// shouldSnapshot informs if snapshot should be triggered after apply.
// Should only be set within apply.
shouldSnapshot bool
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably it's better to use the versionChanged Notification, see cluster.go#L569.

You can trigger the snapshot in monitorStorageVersion.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We cannot to this without major rewrite.

Problem is that running snapshot requires access to etcdProgress which is passed through apply code. This means signal needs to be passed by checking variable and not receiving notification. Receiving notification requires reader to block on read channel. Apply cannot block on read as doesnt a lot of CPU computation as compared to monitorStorageVersion that just reads from channels in for loop. Not being able to block on read means that there is a chance that we will miss a signal in channel.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed on this from implementation perspective. Please see my next comment.

@@ -291,6 +291,9 @@ type EtcdServer struct {
clusterVersionChanged *notify.Notifier

*AccessController
// shouldSnapshot informs if snapshot should be triggered after apply.
// Should only be set within apply.
shouldSnapshot bool
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed on this from implementation perspective. Please see my next comment.

@@ -1081,9 +1084,10 @@ func (s *EtcdServer) applyEntries(ep *etcdProgress, apply *apply) {
}

func (s *EtcdServer) triggerSnapshot(ep *etcdProgress) {
if ep.appliedi-ep.snapi <= s.Cfg.SnapshotCount {
if !s.shouldSnapshot && ep.appliedi-ep.snapi <= s.Cfg.SnapshotCount {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change may introduce additional readability complex. When other developers read this source code in future, they may need sometime to figure out when will the shouldSnapshot be true or false.

I suggest to keep all all the reasons to create a snapshot in one place, and change the if to

if !s.shouldSnapshot {
    return
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree that shouldSnapshot could be confusing, however I don't agree that we should unify it under variable. I think checking both variable and etcdProgress makes more sense. Renamed to forceSnapshot.

Copy link
Member

@ahrtr ahrtr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

server/etcdserver/server.go Outdated Show resolved Hide resolved
Copy link
Contributor

@ptabor ptabor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

server/etcdserver/server.go Outdated Show resolved Hide resolved
@serathius serathius merged commit bf0b1a7 into etcd-io:main Feb 21, 2022
@serathius serathius deleted the snapshot branch June 15, 2023 20:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

4 participants