Migrate peer recovery from translog to retention lease #49448

Merged
merged 21 commits into from
Dec 13, 2019

Member

@dnhatn dnhatn commented Nov 21, 2019

Since 7.4, we have switched from the translog to Lucene as the source of history for peer recoveries. However, this reduces the likelihood of operation-based recoveries when performing a full cluster restart from a pre-7.4 version, because existing copies do not have peer recovery retention leases (PRRLs).

To remedy this issue, we fall back to the translog in peer recoveries if the recovering replica does not have a peer recovery retention lease and the replication group has not fully migrated to PRRLs.

Relates #45136
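
For readers skimming the thread, the fallback rule described above can be sketched roughly as follows. This is an illustrative sketch only, not the code in this PR: the class `HistorySourceSketch`, the enum `HistorySource`, and the method `chooseHistorySource` are hypothetical names, and the two boolean inputs stand in for whatever state the real recovery code consults.

```java
/** Hypothetical illustration of the fallback rule; not the actual Elasticsearch recovery code. */
public class HistorySourceSketch {

    /** Where a peer recovery reads its operation history from (assumed names). */
    enum HistorySource { LUCENE, TRANSLOG }

    /**
     * Fall back to the translog only when both conditions from the description hold:
     * the recovering replica has no peer recovery retention lease, and the
     * replication group has not fully migrated to retention leases.
     */
    static HistorySource chooseHistorySource(boolean replicaHasLease, boolean groupFullyMigrated) {
        if (!replicaHasLease && !groupFullyMigrated) {
            return HistorySource.TRANSLOG;
        }
        return HistorySource.LUCENE;
    }

    public static void main(String[] args) {
        // Copy restarted from a pre-7.4 cluster: no lease yet, group not migrated.
        System.out.println(chooseHistorySource(false, false));
        // Replica already holds a lease: Lucene history can be used.
        System.out.println(chooseHistorySource(true, false));
        // Group fully migrated: the translog is no longer consulted.
        System.out.println(chooseHistorySource(false, true));
    }
}
```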

@dnhatn dnhatn added the >bug, :Distributed/Recovery (Anything around constructing a new shard, either from a local or a remote source.), v8.0.0, v7.6.0, v7.4.3, and v7.5.1 labels Nov 21, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (:Distributed/Recovery)

@dnhatn
Member Author

dnhatn commented Nov 21, 2019

Hmm, a new test is failing. I am looking at it.

@dnhatn
Member Author

dnhatn commented Nov 27, 2019

I have an implementation that falls back to the translog if an index was created before 7.4 and the recovering replica does not have a PRRL. I think we should disable translog retention once every copy has established its PRRL. However, this would require coordination. Another option is to make this decision locally. We also need to persist this decision so that we don't re-enable translog retention after a full cluster restart. WDYT?

@ywelsch
Contributor

ywelsch commented Nov 27, 2019

ReplicationTracker already has a field for this, hasAllPeerRecoveryRetentionLeases. Maybe we can use that to make this decision locally?
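
To make the "decide locally and persist" idea concrete, here is a minimal sketch under stated assumptions. Only the field name `hasAllPeerRecoveryRetentionLeases` comes from the comment above; the class `LeaseMigrationSketch`, its methods, and the way the flag is updated are hypothetical and do not reflect how ReplicationTracker is actually implemented. The point shown is that the flag is sticky: once every assigned copy has a lease it flips to true and never flips back, and it is restored from persisted state after a restart.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch of a sticky, locally computed migration flag. */
public class LeaseMigrationSketch {

    // Allocation ids of copies that currently hold a peer recovery retention lease.
    private final Set<String> copiesWithLease = new HashSet<>();

    // Field name taken from the discussion; the semantics here are an assumption.
    private boolean hasAllPeerRecoveryRetentionLeases;

    /** The flag is restored from persisted state so a restart cannot reset it. */
    LeaseMigrationSketch(boolean persistedFlag) {
        this.hasAllPeerRecoveryRetentionLeases = persistedFlag;
    }

    void addLease(String allocationId) {
        copiesWithLease.add(allocationId);
    }

    /** Flip the flag once every assigned copy has a lease; never flip it back. */
    void updateFlag(Set<String> assignedCopies) {
        if (!hasAllPeerRecoveryRetentionLeases && copiesWithLease.containsAll(assignedCopies)) {
            hasAllPeerRecoveryRetentionLeases = true;
        }
    }

    /** Translog retention is only needed until the migration has completed. */
    boolean shouldRetainTranslog() {
        return !hasAllPeerRecoveryRetentionLeases;
    }

    public static void main(String[] args) {
        LeaseMigrationSketch tracker = new LeaseMigrationSketch(false);
        Set<String> assigned = Set.of("primary", "replica-1");
        tracker.addLease("primary");
        tracker.updateFlag(assigned);
        System.out.println("retain translog: " + tracker.shouldRetainTranslog());
        tracker.addLease("replica-1");
        tracker.updateFlag(assigned);
        System.out.println("retain translog: " + tracker.shouldRetainTranslog());
    }
}
```

Because the decision depends only on state the local copy already tracks, no extra cluster-level coordination is needed in this sketch.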

@dnhatn dnhatn changed the title from "Allow ops-based recovery without existing retention lease" to "Migrate peer recovery from translog to retention lease" Dec 1, 2019
@dnhatn
Member Author

dnhatn commented Dec 2, 2019

Please hold off on the review, as the test failure is related to this change. I will ping you once I have resolved it.

@dnhatn
Member Author

dnhatn commented Dec 2, 2019

run elasticsearch-ci/packaging-sample-matrix

@dnhatn dnhatn requested a review from ywelsch December 2, 2019 16:25
Contributor

@ywelsch ywelsch left a comment

Great work, Nhat! Overall looking very good already. I've left some minor comments.

@dnhatn dnhatn requested a review from ywelsch December 13, 2019 06:07
Contributor

@ywelsch ywelsch left a comment

LGTM

@dnhatn
Member Author

dnhatn commented Dec 13, 2019

@ywelsch Thanks for reviewing.

@dnhatn dnhatn merged commit b9fbc8d into elastic:master Dec 13, 2019
@dnhatn dnhatn deleted the migrate-to-prrl branch December 13, 2019 18:56
dnhatn added a commit that referenced this pull request Dec 15, 2019
We turn off the translog retention policy asynchronously using
the generic threadpool; hence, we need to assert busily here.

Relates #49448
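
As a rough illustration of what "assert busily" means in the commit message above: the test cannot check the state once, because the change happens on another thread, so it retries the assertion until it passes or a timeout expires. The helper below is a standalone stand-in written for this note; `AssertBusySketch` and its `assertBusy` signature are assumptions and not the helper used in the actual test.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical stand-in for a retrying assertion helper. */
public class AssertBusySketch {

    /** Re-run the assertion with a growing sleep until it passes or the timeout expires. */
    static void assertBusy(Runnable assertion, long timeoutMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        long sleepMillis = 1;
        while (true) {
            try {
                assertion.run();
                return; // the assertion finally held
            } catch (AssertionError e) {
                if (System.nanoTime() - deadline >= 0) {
                    throw e; // out of time: surface the last failure
                }
            }
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", ie);
            }
            sleepMillis = Math.min(sleepMillis * 2, 100);
        }
    }

    public static void main(String[] args) {
        // Stand-in for "translog retention was turned off on another thread".
        AtomicBoolean retentionDisabled = new AtomicBoolean(false);
        CompletableFuture.runAsync(() -> retentionDisabled.set(true));

        assertBusy(() -> {
            if (!retentionDisabled.get()) {
                throw new AssertionError("translog retention is still enabled");
            }
        }, 5_000);
        System.out.println("assertion eventually held");
    }
}
```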
dnhatn added a commit that referenced this pull request Dec 15, 2019
Since 7.4, we have switched from the translog to Lucene as the source of
history for peer recoveries. However, this reduces the likelihood of
operation-based recoveries when performing a full cluster restart from
a pre-7.4 version, because existing copies do not have PRRLs.

To remedy this issue, we fall back to the translog in peer recoveries if
the recovering replica does not have a peer recovery retention lease
and the replication group has not fully migrated to PRRLs.

Relates #45136
dnhatn added a commit to dnhatn/elasticsearch that referenced this pull request Dec 16, 2019
dnhatn added a commit that referenced this pull request Dec 16, 2019
@jasontedor jasontedor added v7.5.1 and removed v7.5.2 labels Dec 16, 2019
dnhatn added a commit that referenced this pull request Dec 24, 2019
We need to make sure that the global checkpoints and peer recovery
retention leases have been advanced to the max_seq_no and synced;
otherwise, we risk expiring some peer recovery retention leases because
of the file-based recovery threshold.

Relates #49448
dnhatn added a commit that referenced this pull request Dec 24, 2019
SivagurunathanV pushed a commit to SivagurunathanV/elasticsearch that referenced this pull request Jan 23, 2020
SivagurunathanV pushed a commit to SivagurunathanV/elasticsearch that referenced this pull request Jan 23, 2020
Labels
>bug, :Distributed/Recovery (Anything around constructing a new shard, either from a local or a remote source.), v7.5.1, v7.6.0, v8.0.0-alpha1
5 participants