Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ZooKeeper to KRaft migration #9480

Merged
merged 2 commits into from
Mar 2, 2024
Merged

ZooKeeper to KRaft migration #9480

merged 2 commits into from
Mar 2, 2024

Conversation

ppatierno
Copy link
Member

@ppatierno ppatierno commented Dec 21, 2023

Type of change

  • Enhancement / new feature

Description

This PR fixes #9433 , #9447 and #9448 about implementing the ZooKeeper to KRaft migration process.

The current status is about starting from a ZooKeeper-based cluster, creating a node pool with controllers, applying the strimzi.io/kraft: migration annotation on the Kafka CR and allowing the operator FSM to start the migration up to the "dual-write" (so creating controllers, rolling brokers, start migration). Then the user can apply the strimzi.io/kraft: enabled annotation in order to finilize the process by having the full KRaft-based cluster,

The PR won't address documentation. There will be a different PR for it. See #9633
This PR doesn't cover migration rollback and ZooKeeper removal as well which have different issues and will have different PRs for them.
This PR covers migration rollback but taking into account this Kafka issue https://issues.apache.org/jira/browse/KAFKA-16101 and based on deleting the /controller znode manually as last step. using rollback annotation value from the KRaftPostMigration state, the disabled from the KRaftDualWriting.
This PR also covers the ZooKeeper related resources removal when the migration ends.

Checklist

Please go through this checklist and make sure all applicable tasks have been done

  • Write tests
  • Make sure all tests pass
  • Try your changes from Pod inside your Kubernetes and OpenShift cluster, not just locally
  • Reference relevant issue(s) and close them after merging
  • Update CHANGELOG.md

@scholzj
Copy link
Member

scholzj commented Dec 21, 2023

@ppatierno Should this be a draft?

@ppatierno
Copy link
Member Author

@ppatierno Should this be a draft?

I can set as "draft" if you want but this PR is meant to be reviewed because it "closes" the migration implementation while waiting for implementing the rest (i.e. rollback) with a different PR (on top of this). Of course, this cannot be merged because of the Kafka 3.7 missing (until February). But this way should avoid overwhelming contributors, interested to review, with a bigger review at the end.

@ppatierno ppatierno marked this pull request as draft December 22, 2023 08:44
Copy link
Member

@scholzj scholzj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to use some separate packages for the migration related parts? It makes it easier to delete things later.

@ppatierno ppatierno force-pushed the zk-kraft-migration branch 2 times, most recently from 605c3b2 to 5927e51 Compare January 3, 2024 14:36
Copy link
Contributor

@fvaleri fvaleri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some comments, but it is working fine for me.

I also tried to kill the CO while moving from KRaftDualWriting to KRaftPostMigration, and the new instance was able to recover and complete the migration.

This PR doesn't cover migration rollback and ZooKeeper removal as well which have different issues and will have different PRs for them.

Are we describing these issues somewhere?

@ppatierno ppatierno force-pushed the zk-kraft-migration branch 7 times, most recently from f4c35b1 to d7eed02 Compare January 12, 2024 10:31
@ppatierno ppatierno force-pushed the zk-kraft-migration branch 3 times, most recently from 9969872 to e590474 Compare January 18, 2024 14:38
@ppatierno
Copy link
Member Author

/azp run regression

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@ppatierno ppatierno force-pushed the zk-kraft-migration branch 2 times, most recently from 386da52 to c0eaee9 Compare February 29, 2024 14:52
@ppatierno
Copy link
Member Author

/azp run kraft-regression

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@ppatierno
Copy link
Member Author

/azp run upgrade

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@scholzj
Copy link
Member

scholzj commented Feb 29, 2024

/azp run feature-gates-regression

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Signed-off-by: Paolo Patierno <ppatierno@live.com>

Moved KafkaMetadataStateManager into the resource package

Signed-off-by: Paolo Patierno <ppatierno@live.com>

Added KafkaAgent endpoint to retrive ZooKeeper migration state metric
Updated KafkaAgent client to retrieve ZooKeeper migration state
Added logic to KafkaReconciler to check when migration ends

Signed-off-by: Paolo Patierno <ppatierno@live.com>

Fixed spotbugs issue

Signed-off-by: Paolo Patierno <ppatierno@live.com>

Moved KafkaMetadataConfigurationState in its own file
Exposed metadata configuration state evaluation in corresponding
KafkaMetadataConfigurationState instance methods
Using loop on controllers to get the ZooKeeper migration metrics
Other cleaning and refactor changes as per JS comments

Signed-off-by: Paolo Patierno <ppatierno@live.com>

Removed unused Random class

Signed-off-by: Paolo Patierno <ppatierno@live.com>

Moved the KafkaMetadataStateManager creation to the ReonciliationState
because needed in the assembly operator as well

Signed-off-by: Paolo Patierno <ppatierno@live.com>

Fixed not exposing control plane port 9090 on KRaft broker-only nodes
Fixed still exposing replication port 9091 on KRaft controllers during
migration

Signed-off-by: Paolo Patierno <ppatierno@live.com>

Fixed duplicate KRaft configuration section when mixed-node

Signed-off-by: Paolo Patierno <ppatierno@live.com>

Fixed tests

Signed-off-by: Paolo Patierno <ppatierno@live.com>

Fixed bug about keeping replication listener on controllers until
post-migration

Signed-off-by: Paolo Patierno <ppatierno@live.com>

Some refactoring (also based on FV comments)

Signed-off-by: Paolo Patierno <ppatierno@live.com>

Fixed validation for nodepools and features already in pre-migration

Signed-off-by: Paolo Patierno <ppatierno@live.com>

Addition on the configuration migration table
Removed not used method

Signed-off-by: Paolo Patierno <ppatierno@live.com>

Fixed checkstyle issue

Signed-off-by: Paolo Patierno <ppatierno@live.com>

Running spec checker validation only when migration happens to KRaft not
before

Signed-off-by: Paolo Patierno <ppatierno@live.com>

Fixed test

Signed-off-by: Paolo Patierno <ppatierno@live.com>

Updated liveness, readiness and run scripts to take process.roles into
account

Signed-off-by: Paolo Patierno <ppatierno@live.com>

Added warning conditions when strimzi.io/kraft annotation is not used in
the right way

Signed-off-by: Paolo Patierno <ppatierno@live.com>

Refactored the KafkaAgent and client in relation to the KRaft migration
endpoint

Signed-off-by: Paolo Patierno <ppatierno@live.com>

Added versions validation before starting a KRaft migration

Signed-off-by: Paolo Patierno <ppatierno@live.com>

Removed the useKRaft flag related to the feature gate flags
Making reconciliation failing fast when useKRaft feature flag is not
enabled
Updated related tests

Signed-off-by: Paolo Patierno <ppatierno@live.com>

Addressing JS comments

Signed-off-by: Paolo Patierno <ppatierno@live.com>

Fixed FV feedback

Signed-off-by: Paolo Patierno <ppatierno@live.com>

Added rollback migration to the state machine

Signed-off-by: Paolo Patierno <ppatierno@live.com>

Fixed spotbugs check

Signed-off-by: Paolo Patierno <ppatierno@live.com>

Fixed check on ZooKeeper removal only in full KRaft mode

Signed-off-by: Paolo Patierno <ppatierno@live.com>

Refactoring the migration path making KRaftDualWriting transient and
KRaftPostMigration stable
Refactored rollback to work from KRaftPostMigration

Signed-off-by: Paolo Patierno <ppatierno@live.com>

Removed leftover commented code

Signed-off-by: Paolo Patierno <ppatierno@live.com>

Moved KafkaMetadataState to the api module

Signed-off-by: Paolo Patierno <ppatierno@live.com>

Fixed KRaft test

Signed-off-by: Paolo Patierno <ppatierno@live.com>

Added ZooKeeper removal at the end of the migration

Signed-off-by: Paolo Patierno <ppatierno@live.com>

[ST] ZK to KRaft migration - migration process (#9580)

Signed-off-by: Lukas Kral <lukywill16@gmail.com>

[ST] ZK to KRaft migration - rollback process (#9589)

Signed-off-by: Lukas Kral <lukywill16@gmail.com>

Added migration related tests for configuration builder

Signed-off-by: Paolo Patierno <ppatierno@live.com>

[ST] Add AZP for migration and few fixups to the migration tests (#9615)

Signed-off-by: Lukas Kral <lukywill16@gmail.com>

Added unit tests for State machine of `KafkaMetadataStateManager` class  (#9591)

Signed-off-by: ShubhamRwt <shubhamrwt02@gmail.com>

Fixed feedback

Signed-off-by: Paolo Patierno <ppatierno@live.com>

Added migration related tests for KafkaCluster

Signed-off-by: Paolo Patierno <ppatierno@live.com>

Fixed comments by JS
Fixed issue with wrong authorizer during migration

Signed-off-by: Paolo Patierno <ppatierno@live.com>

[ST] Add ACLs to migration STs (#9644)

Signed-off-by: Lukas Kral <lukywill16@gmail.com>

Refactoring by passing metadata config state across the cluster creator

Signed-off-by: Paolo Patierno <ppatierno@live.com>

Refactored /controller znode deletion process
Factored out a KRaftMigrationUtils class

Signed-off-by: Paolo Patierno <ppatierno@live.com>

Moved KafkaMetadataStateManager to assembly package

Signed-off-by: Paolo Patierno <ppatierno@live.com>

Fixed MigrationST new label selector signature

Signed-off-by: Paolo Patierno <ppatierno@live.com>

Made Kafka agent client local to the metadata state manager

Signed-off-by: Paolo Patierno <ppatierno@live.com>

Fixed a wrongly reported good migration state as invalid

Signed-off-by: Paolo Patierno <ppatierno@live.com>

Fixed duplicated comment

Signed-off-by: Paolo Patierno <ppatierno@live.com>

Added tests on Kafka Agent and related client about migration

Signed-off-by: Paolo Patierno <ppatierno@live.com>

Add Mock tests for `KafkaMetadataStateManager` class (#9624)

Signed-off-by: ShubhamRwt <shubhamrwt02@gmail.com>

Fixed OAuth with KRaft test

Signed-off-by: Paolo Patierno <ppatierno@live.com>

Fixed batch of comments from JS

Signed-off-by: Paolo Patierno <ppatierno@live.com>

Refactoring versions validation to run a KRaft migration
Added unit tests on the version validation on KRaft migration

Signed-off-by: Paolo Patierno <ppatierno@live.com>

Fixed migration tests with correct version change instance

Signed-off-by: Paolo Patierno <ppatierno@live.com>

Fixed version change in the state manager test

Signed-off-by: Paolo Patierno <ppatierno@live.com>

Removing check on migration status from the state manager to utils class
Refactored migration state manager tests

Signed-off-by: Paolo Patierno <ppatierno@live.com>

Moved KafkaAgent client creation for migration state to KafkaReconciler
Added class for testing migration through KafkaReconciler

Signed-off-by: Paolo Patierno <ppatierno@live.com>

Refactored method to configure ZooKeeper, migration enabled and KRaft
configuration on nodes

Signed-off-by: Paolo Patierno <ppatierno@live.com>

[ST] Migration - tests for migration with CO deletion during process, full path with migration and rollback (#9700)

Signed-off-by: Paolo Patierno <ppatierno@live.com>
Signed-off-by: Lukas Kral <lukywill16@gmail.com>
Co-authored-by: Paolo Patierno <ppatierno@live.com>

Added unit tests for `ZookeeperEraser` class (#9710)

Signed-off-by: ShubhamRwt <shubhamrwt02@gmail.com>

Added test on the broker configuration ZooKeeper migration flag

Signed-off-by: Paolo Patierno <ppatierno@live.com>

Added ZooKeeper migration flag as forbidden in node configuration

Signed-off-by: Paolo Patierno <ppatierno@live.com>

Added KafkaAssemblyOperator migration-related tests (#9720)

Signed-off-by: Paolo Patierno <ppatierno@live.com>

Removed logging related leftovers

Signed-off-by: Paolo Patierno <ppatierno@live.com>

Added unit tests for `ZookeeperReconciler` class (#9733)

Signed-off-by: ShubhamRwt <shubhamrwt02@gmail.com>

Fixed usage of MockKube3 in migration tests (#9744)

Signed-off-by: Paolo Patierno <ppatierno@live.com>

Fixed comments by JS

Signed-off-by: Paolo Patierno <ppatierno@live.com>

Refactoring around logging at right level

Signed-off-by: Paolo Patierno <ppatierno@live.com>

Fixed last JS comments
Updated CHANGELOG

Signed-off-by: Paolo Patierno <ppatierno@live.com>

[ST] Bump resources for controller Pods in MigrationST and use AZP profile for migration tests (#9753)

Signed-off-by: Lukas Kral <lukywill16@gmail.com>
Signed-off-by: Paolo Patierno <ppatierno@live.com>
Copy link
Contributor

@fvaleri fvaleri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Great work. Thanks.

@ppatierno
Copy link
Member Author

/azp run regression

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@scholzj
Copy link
Member

scholzj commented Mar 1, 2024

/azp run kraft-regression

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@scholzj
Copy link
Member

scholzj commented Mar 1, 2024

/azp run feature-gates-regression

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@scholzj
Copy link
Member

scholzj commented Mar 1, 2024

/azp run upgrade

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Member

@tombentley tombentley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Paolo.

/**
* @return if the Kafka metadata configuration state is from ZooKeeper-based up to KRaft migration (and dual-write) going on
*/
public boolean isZooKeeperToMigration() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found these predictes testing for a range to be really confusing when I was reading the code. I didn't guess it was testing a range. I wonder if it would be clearer to just have a isBeforeOrEqual(KMCS other) and isAfterOrEqual(KMCS other)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well it would be more complicated than that, because it's a range ... so it should be something like isAfterOrEqualeZooKeeperAndBeforeOrEqualeMigration which looks worse imho. I think that Javadoc is there for helping on its meaning.

* Class used for deleting the ZooKeeper ensemble when the ZooKeeper to KRaft migration is completed
* and the Kafka cluster is full KRaft, not using ZooKeeper anymore for storing metadata
*/
public class ZooKeeperEraser {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Erase doesn't really mean the same thing as delete or destroy. If I "erase a data base' that suggests the data is gone but the server/infra remains. I think destroy would be clearer here. Delete is OK, but even then you really need to add 'resources' in there to make it unambiguous

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would you call it? ZooKeeperDestroyer ? Anything better?

Comment on lines +73 to +83
public Future<Void> reconcile() {
LOGGER.infoCr(reconciliation, "Deleting all the ZooKeeper related resources");
return jmxSecret()
.compose(i -> networkPolicy())
.compose(i -> serviceAccount())
.compose(i -> service())
.compose(i -> headlessService())
.compose(i -> certificateSecret())
.compose(i -> loggingAndMetricsConfigMap())
.compose(i -> podDisruptionBudget())
.compose(i -> podSet())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It mystifies me why none of these methods have the word delete in their names, but somehow PVCs are special.

Any why call this method reconcile()? It's not like you're overriding a reconcile method here. Why not call it deleteZooKeeperResources() or something?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left reconcile to leave the overall pattern of reconciliation when it's called from the KafakAssemblyOperator.
Regarding the methods' names, I agree. I am going to call them all deleteXXXX.

Comment on lines +68 to +69
* The main reconciliation method which triggers the whole reconciliation pipeline. This is the method which is
* expected to be called from the outside to trigger the reconciliation.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"This is the method which is expected to be called from the outside" can't we infer that from the method visibility?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just copied from what we have in the ZooKeeperReconciler :-P

@ppatierno
Copy link
Member Author

@tombentley this PR has been here for weeks/months and now it's finally ready to be merged. All pipelines are green (except one well known issue which Lukas will fix later). I am going to merge it and evaluate your suggestions for a different PR. Sorry but it's taking too long.

@ppatierno ppatierno merged commit e1bcc24 into main Mar 2, 2024
41 of 43 checks passed
@ppatierno ppatierno deleted the zk-kraft-migration branch March 2, 2024 09:59
@ppatierno
Copy link
Member Author

@tombentley I opened #9770 to address your feedback. Can you review it please? Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
7 participants