-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ZooKeeper to KRaft migration #9480
Conversation
44a5906
to
d361220
Compare
@ppatierno Should this be a draft? |
I can set as "draft" if you want but this PR is meant to be reviewed because it "closes" the migration implementation while waiting for implementing the rest (i.e. rollback) with a different PR (on top of this). Of course, this cannot be merged because of the Kafka 3.7 missing (until February). But this way should avoid overwhelming contributors, interested to review, with a bigger review at the end. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to use some separate packages for the migration related parts? It makes it easier to delete things later.
api/src/main/java/io/strimzi/api/kafka/model/status/KafkaStatus.java
Outdated
Show resolved
Hide resolved
cluster-operator/src/main/java/io/strimzi/operator/cluster/model/KafkaCluster.java
Outdated
Show resolved
Hide resolved
cluster-operator/src/main/java/io/strimzi/operator/cluster/model/KafkaCluster.java
Outdated
Show resolved
Hide resolved
...er-operator/src/main/java/io/strimzi/operator/cluster/operator/assembly/KafkaReconciler.java
Outdated
Show resolved
Hide resolved
...er-operator/src/main/java/io/strimzi/operator/cluster/operator/assembly/KafkaReconciler.java
Outdated
Show resolved
Hide resolved
...perator/src/main/java/io/strimzi/operator/cluster/operator/resource/KRaftMigrationState.java
Show resolved
Hide resolved
605c3b2
to
5927e51
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left some comments, but it is working fine for me.
I also tried to kill the CO while moving from KRaftDualWriting
to KRaftPostMigration
, and the new instance was able to recover and complete the migration.
This PR doesn't cover migration rollback and ZooKeeper removal as well which have different issues and will have different PRs for them.
Are we describing these issues somewhere?
api/src/main/java/io/strimzi/api/kafka/model/status/KafkaStatus.java
Outdated
Show resolved
Hide resolved
...er-operator/src/main/java/io/strimzi/operator/cluster/operator/assembly/KafkaReconciler.java
Outdated
Show resolved
Hide resolved
...er-operator/src/main/java/io/strimzi/operator/cluster/operator/assembly/KafkaReconciler.java
Outdated
Show resolved
Hide resolved
f4c35b1
to
d7eed02
Compare
9969872
to
e590474
Compare
cluster-operator/src/main/java/io/strimzi/operator/cluster/model/KafkaCluster.java
Show resolved
Hide resolved
7c6805d
to
3e134f3
Compare
75d1298
to
a118d91
Compare
a118d91
to
d5af439
Compare
/azp run regression |
Azure Pipelines successfully started running 1 pipeline(s). |
386da52
to
c0eaee9
Compare
/azp run kraft-regression |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run upgrade |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run feature-gates-regression |
Azure Pipelines successfully started running 1 pipeline(s). |
Signed-off-by: Paolo Patierno <ppatierno@live.com> Moved KafkaMetadataStateManager into the resource package Signed-off-by: Paolo Patierno <ppatierno@live.com> Added KafkaAgent endpoint to retrive ZooKeeper migration state metric Updated KafkaAgent client to retrieve ZooKeeper migration state Added logic to KafkaReconciler to check when migration ends Signed-off-by: Paolo Patierno <ppatierno@live.com> Fixed spotbugs issue Signed-off-by: Paolo Patierno <ppatierno@live.com> Moved KafkaMetadataConfigurationState in its own file Exposed metadata configuration state evaluation in corresponding KafkaMetadataConfigurationState instance methods Using loop on controllers to get the ZooKeeper migration metrics Other cleaning and refactor changes as per JS comments Signed-off-by: Paolo Patierno <ppatierno@live.com> Removed unused Random class Signed-off-by: Paolo Patierno <ppatierno@live.com> Moved the KafkaMetadataStateManager creation to the ReonciliationState because needed in the assembly operator as well Signed-off-by: Paolo Patierno <ppatierno@live.com> Fixed not exposing control plane port 9090 on KRaft broker-only nodes Fixed still exposing replication port 9091 on KRaft controllers during migration Signed-off-by: Paolo Patierno <ppatierno@live.com> Fixed duplicate KRaft configuration section when mixed-node Signed-off-by: Paolo Patierno <ppatierno@live.com> Fixed tests Signed-off-by: Paolo Patierno <ppatierno@live.com> Fixed bug about keeping replication listener on controllers until post-migration Signed-off-by: Paolo Patierno <ppatierno@live.com> Some refactoring (also based on FV comments) Signed-off-by: Paolo Patierno <ppatierno@live.com> Fixed validation for nodepools and features already in pre-migration Signed-off-by: Paolo Patierno <ppatierno@live.com> Addition on the configuration migration table Removed not used method Signed-off-by: Paolo Patierno <ppatierno@live.com> Fixed checkstyle issue Signed-off-by: Paolo Patierno <ppatierno@live.com> Running spec checker validation only when migration happens to KRaft not before Signed-off-by: Paolo Patierno <ppatierno@live.com> Fixed test Signed-off-by: Paolo Patierno <ppatierno@live.com> Updated liveness, readiness and run scripts to take process.roles into account Signed-off-by: Paolo Patierno <ppatierno@live.com> Added warning conditions when strimzi.io/kraft annotation is not used in the right way Signed-off-by: Paolo Patierno <ppatierno@live.com> Refactored the KafkaAgent and client in relation to the KRaft migration endpoint Signed-off-by: Paolo Patierno <ppatierno@live.com> Added versions validation before starting a KRaft migration Signed-off-by: Paolo Patierno <ppatierno@live.com> Removed the useKRaft flag related to the feature gate flags Making reconciliation failing fast when useKRaft feature flag is not enabled Updated related tests Signed-off-by: Paolo Patierno <ppatierno@live.com> Addressing JS comments Signed-off-by: Paolo Patierno <ppatierno@live.com> Fixed FV feedback Signed-off-by: Paolo Patierno <ppatierno@live.com> Added rollback migration to the state machine Signed-off-by: Paolo Patierno <ppatierno@live.com> Fixed spotbugs check Signed-off-by: Paolo Patierno <ppatierno@live.com> Fixed check on ZooKeeper removal only in full KRaft mode Signed-off-by: Paolo Patierno <ppatierno@live.com> Refactoring the migration path making KRaftDualWriting transient and KRaftPostMigration stable Refactored rollback to work from KRaftPostMigration Signed-off-by: Paolo Patierno <ppatierno@live.com> Removed leftover commented code Signed-off-by: Paolo Patierno <ppatierno@live.com> Moved KafkaMetadataState to the api module Signed-off-by: Paolo Patierno <ppatierno@live.com> Fixed KRaft test Signed-off-by: Paolo Patierno <ppatierno@live.com> Added ZooKeeper removal at the end of the migration Signed-off-by: Paolo Patierno <ppatierno@live.com> [ST] ZK to KRaft migration - migration process (#9580) Signed-off-by: Lukas Kral <lukywill16@gmail.com> [ST] ZK to KRaft migration - rollback process (#9589) Signed-off-by: Lukas Kral <lukywill16@gmail.com> Added migration related tests for configuration builder Signed-off-by: Paolo Patierno <ppatierno@live.com> [ST] Add AZP for migration and few fixups to the migration tests (#9615) Signed-off-by: Lukas Kral <lukywill16@gmail.com> Added unit tests for State machine of `KafkaMetadataStateManager` class (#9591) Signed-off-by: ShubhamRwt <shubhamrwt02@gmail.com> Fixed feedback Signed-off-by: Paolo Patierno <ppatierno@live.com> Added migration related tests for KafkaCluster Signed-off-by: Paolo Patierno <ppatierno@live.com> Fixed comments by JS Fixed issue with wrong authorizer during migration Signed-off-by: Paolo Patierno <ppatierno@live.com> [ST] Add ACLs to migration STs (#9644) Signed-off-by: Lukas Kral <lukywill16@gmail.com> Refactoring by passing metadata config state across the cluster creator Signed-off-by: Paolo Patierno <ppatierno@live.com> Refactored /controller znode deletion process Factored out a KRaftMigrationUtils class Signed-off-by: Paolo Patierno <ppatierno@live.com> Moved KafkaMetadataStateManager to assembly package Signed-off-by: Paolo Patierno <ppatierno@live.com> Fixed MigrationST new label selector signature Signed-off-by: Paolo Patierno <ppatierno@live.com> Made Kafka agent client local to the metadata state manager Signed-off-by: Paolo Patierno <ppatierno@live.com> Fixed a wrongly reported good migration state as invalid Signed-off-by: Paolo Patierno <ppatierno@live.com> Fixed duplicated comment Signed-off-by: Paolo Patierno <ppatierno@live.com> Added tests on Kafka Agent and related client about migration Signed-off-by: Paolo Patierno <ppatierno@live.com> Add Mock tests for `KafkaMetadataStateManager` class (#9624) Signed-off-by: ShubhamRwt <shubhamrwt02@gmail.com> Fixed OAuth with KRaft test Signed-off-by: Paolo Patierno <ppatierno@live.com> Fixed batch of comments from JS Signed-off-by: Paolo Patierno <ppatierno@live.com> Refactoring versions validation to run a KRaft migration Added unit tests on the version validation on KRaft migration Signed-off-by: Paolo Patierno <ppatierno@live.com> Fixed migration tests with correct version change instance Signed-off-by: Paolo Patierno <ppatierno@live.com> Fixed version change in the state manager test Signed-off-by: Paolo Patierno <ppatierno@live.com> Removing check on migration status from the state manager to utils class Refactored migration state manager tests Signed-off-by: Paolo Patierno <ppatierno@live.com> Moved KafkaAgent client creation for migration state to KafkaReconciler Added class for testing migration through KafkaReconciler Signed-off-by: Paolo Patierno <ppatierno@live.com> Refactored method to configure ZooKeeper, migration enabled and KRaft configuration on nodes Signed-off-by: Paolo Patierno <ppatierno@live.com> [ST] Migration - tests for migration with CO deletion during process, full path with migration and rollback (#9700) Signed-off-by: Paolo Patierno <ppatierno@live.com> Signed-off-by: Lukas Kral <lukywill16@gmail.com> Co-authored-by: Paolo Patierno <ppatierno@live.com> Added unit tests for `ZookeeperEraser` class (#9710) Signed-off-by: ShubhamRwt <shubhamrwt02@gmail.com> Added test on the broker configuration ZooKeeper migration flag Signed-off-by: Paolo Patierno <ppatierno@live.com> Added ZooKeeper migration flag as forbidden in node configuration Signed-off-by: Paolo Patierno <ppatierno@live.com> Added KafkaAssemblyOperator migration-related tests (#9720) Signed-off-by: Paolo Patierno <ppatierno@live.com> Removed logging related leftovers Signed-off-by: Paolo Patierno <ppatierno@live.com> Added unit tests for `ZookeeperReconciler` class (#9733) Signed-off-by: ShubhamRwt <shubhamrwt02@gmail.com> Fixed usage of MockKube3 in migration tests (#9744) Signed-off-by: Paolo Patierno <ppatierno@live.com> Fixed comments by JS Signed-off-by: Paolo Patierno <ppatierno@live.com> Refactoring around logging at right level Signed-off-by: Paolo Patierno <ppatierno@live.com> Fixed last JS comments Updated CHANGELOG Signed-off-by: Paolo Patierno <ppatierno@live.com> [ST] Bump resources for controller Pods in MigrationST and use AZP profile for migration tests (#9753) Signed-off-by: Lukas Kral <lukywill16@gmail.com>
Signed-off-by: Paolo Patierno <ppatierno@live.com>
0928265
to
58ce66c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Great work. Thanks.
/azp run regression |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run kraft-regression |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run feature-gates-regression |
Azure Pipelines successfully started running 1 pipeline(s). |
/azp run upgrade |
Azure Pipelines successfully started running 1 pipeline(s). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Paolo.
cluster-operator/src/main/java/io/strimzi/operator/cluster/model/KRaftUtils.java
Show resolved
Hide resolved
cluster-operator/src/main/java/io/strimzi/operator/cluster/model/KRaftUtils.java
Show resolved
Hide resolved
/** | ||
* @return if the Kafka metadata configuration state is from ZooKeeper-based up to KRaft migration (and dual-write) going on | ||
*/ | ||
public boolean isZooKeeperToMigration() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found these predictes testing for a range to be really confusing when I was reading the code. I didn't guess it was testing a range. I wonder if it would be clearer to just have a isBeforeOrEqual(KMCS other)
and isAfterOrEqual(KMCS other)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well it would be more complicated than that, because it's a range ... so it should be something like isAfterOrEqualeZooKeeperAndBeforeOrEqualeMigration
which looks worse imho. I think that Javadoc is there for helping on its meaning.
...perator/src/main/java/io/strimzi/operator/cluster/operator/assembly/KRaftMigrationUtils.java
Show resolved
Hide resolved
...r/src/main/java/io/strimzi/operator/cluster/operator/assembly/KafkaMetadataStateManager.java
Show resolved
Hide resolved
* Class used for deleting the ZooKeeper ensemble when the ZooKeeper to KRaft migration is completed | ||
* and the Kafka cluster is full KRaft, not using ZooKeeper anymore for storing metadata | ||
*/ | ||
public class ZooKeeperEraser { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Erase doesn't really mean the same thing as delete or destroy. If I "erase a data base' that suggests the data is gone but the server/infra remains. I think destroy would be clearer here. Delete is OK, but even then you really need to add 'resources' in there to make it unambiguous
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How would you call it? ZooKeeperDestroyer
? Anything better?
public Future<Void> reconcile() { | ||
LOGGER.infoCr(reconciliation, "Deleting all the ZooKeeper related resources"); | ||
return jmxSecret() | ||
.compose(i -> networkPolicy()) | ||
.compose(i -> serviceAccount()) | ||
.compose(i -> service()) | ||
.compose(i -> headlessService()) | ||
.compose(i -> certificateSecret()) | ||
.compose(i -> loggingAndMetricsConfigMap()) | ||
.compose(i -> podDisruptionBudget()) | ||
.compose(i -> podSet()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It mystifies me why none of these methods have the word delete in their names, but somehow PVCs are special.
Any why call this method reconcile()? It's not like you're overriding a reconcile method here. Why not call it deleteZooKeeperResources()
or something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left reconcile
to leave the overall pattern of reconciliation when it's called from the KafakAssemblyOperator.
Regarding the methods' names, I agree. I am going to call them all deleteXXXX
.
* The main reconciliation method which triggers the whole reconciliation pipeline. This is the method which is | ||
* expected to be called from the outside to trigger the reconciliation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"This is the method which is expected to be called from the outside" can't we infer that from the method visibility?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just copied from what we have in the ZooKeeperReconciler :-P
...perator/src/main/java/io/strimzi/operator/cluster/operator/assembly/ZooKeeperReconciler.java
Show resolved
Hide resolved
@tombentley this PR has been here for weeks/months and now it's finally ready to be merged. All pipelines are green (except one well known issue which Lukas will fix later). I am going to merge it and evaluate your suggestions for a different PR. Sorry but it's taking too long. |
@tombentley I opened #9770 to address your feedback. Can you review it please? Thanks! |
Type of change
Description
This PR fixes #9433 , #9447 and #9448 about implementing the ZooKeeper to KRaft migration process.
The current status is about starting from a ZooKeeper-based cluster, creating a node pool with controllers, applying the
strimzi.io/kraft: migration
annotation on theKafka
CR and allowing the operator FSM to start the migration up to the "dual-write" (so creating controllers, rolling brokers, start migration). Then the user can apply thestrimzi.io/kraft: enabled
annotation in order to finilize the process by having the full KRaft-based cluster,The PR won't address documentation. There will be a different PR for it. See #9633
This PR doesn't cover migration rollback and ZooKeeper removal as well which have different issues and will have different PRs for them.This PR covers migration rollback
but taking into account this Kafka issue https://issues.apache.org/jira/browse/KAFKA-16101 and based on deleting theusing/controller
znode manually as last step.rollback
annotation value from theKRaftPostMigration
state, thedisabled
from theKRaftDualWriting
.This PR also covers the ZooKeeper related resources removal when the migration ends.
Checklist
Please go through this checklist and make sure all applicable tasks have been done