[PROPOSAL] Automatic rolling update test #7179

Open · jihoonson opened this issue Mar 1, 2019 · 0 comments

jihoonson (Contributor) commented Mar 1, 2019

Motivation

Supporting rolling updates is one of the most important capabilities of Druid, but it's also an area where it's easy to make mistakes. Such mistakes can block users from updating their clusters to the latest version of Druid, so we should be able to catch them before they reach a release. (#6051)

Proposed changes

I propose adding an integration test that checks various aspects of system status during rolling updates. Like the existing integration tests, it would run against a Druid cluster running in Docker containers. It would also be able to run in Travis, so that we can find incompatible changes as soon as possible.

The rolling update test program accepts the following arguments:

| Argument | Mandatory | Default |
| --- | --- | --- |
| Hash of commit for old version | false | HEAD^ |
| Hash of commit for new version | false | HEAD |
| Path to configuration files for old version | true | |
| Path to configuration files for new version | false | old configuration files |
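To make the argument handling concrete, here is a minimal sketch of how the test driver could parse these arguments. The script name, option names, and argparse-based implementation are all assumptions for illustration, not an existing Druid interface.

```python
# rolling_update_test.py -- hypothetical driver, sketched for illustration only.
import argparse


def parse_args():
    parser = argparse.ArgumentParser(description="Druid rolling update test (sketch)")
    # Commit hashes are optional; by default the previous commit is rolled to the current one.
    parser.add_argument("--old-commit", default="HEAD^",
                        help="hash of commit for the old version (default: HEAD^)")
    parser.add_argument("--new-commit", default="HEAD",
                        help="hash of commit for the new version (default: HEAD)")
    # The old-version configuration path is the only mandatory argument.
    parser.add_argument("--old-config", required=True,
                        help="path to configuration files for the old version")
    parser.add_argument("--new-config", default=None,
                        help="path to configuration files for the new version "
                             "(defaults to the old configuration files)")
    args = parser.parse_args()
    if args.new_config is None:
        args.new_config = args.old_config
    return args


if __name__ == "__main__":
    print(parse_args())
```

With these defaults, the common case of testing HEAD^ → HEAD with a single set of configuration files would only require the old-configuration argument.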

The test Druid cluster consists of 1 overlord, 1 broker, 1 coordinator, 2 middleManagers, and 1 historical.

Since stream ingestion is where unexpected incompatible changes usually happen, I propose testing the scenario below. The steps would be executed sequentially. (A sketch of how the repeated status and query checks might be automated follows the list.)

  1. Cluster initialization
    1. Build both the old and the new versions.
    2. Start a cluster of the old version.
    3. Start a Kafka supervisor.
    4. Produce some events, then check task status & query results.
    5. Checkpoint the supervisor & wait for segments to be loaded on historicals.
  2. Historical test
    1. Update historicals to the new version.
    2. Wait for segments to be loaded.
    3. Check query results.
  3. Overlord test
    1. Produce some events.
    2. Update the overlord to the new version.
    3. Check task status & query results.
  4. MiddleManager test
    1. Checkpoint the supervisor.
    2. Update 1 of the 2 middleManagers.
    3. Produce some events.
    4. Check task status & query results.
    5. Update the other middleManager.
    6. Produce some events.
    7. Check task status & query results.
  5. Broker test
    1. Checkpoint the supervisor to publish segments.
    2. Update the broker to the new version.
    3. Produce some events.
    4. Check query results.
  6. Coordinator test
    1. Checkpoint the supervisor to publish segments.
    2. Update the coordinator to the new version.
    3. Wait for segments to be loaded.
    4. Run a compaction task.
    5. Check the segment version and query results.
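As a rough illustration of the "check task status & query results" and "wait for segments to be loaded" steps above, the sketch below polls the overlord, coordinator, and broker over their HTTP APIs. The endpoints (`/druid/indexer/v1/task/{id}/status`, `/druid/coordinator/v1/loadstatus`, `/druid/v2/sql/`) are standard Druid APIs, but the host names, datasource, expected row count, and surrounding harness are placeholders I'm assuming for the example.

```python
# Hypothetical helper checks for the rolling update test, using standard Druid HTTP APIs.
# Host names, the datasource, and the expected row count are placeholders.
import time
import requests

OVERLORD = "http://overlord:8090"
COORDINATOR = "http://coordinator:8081"
BROKER = "http://broker:8082"


def check_task_status(task_id):
    """Fail if the given ingestion task is neither running nor successfully finished."""
    r = requests.get(f"{OVERLORD}/druid/indexer/v1/task/{task_id}/status")
    r.raise_for_status()
    # The exact field name may differ slightly across Druid versions.
    status = r.json()["status"]["statusCode"]
    assert status in ("RUNNING", "SUCCESS"), f"unexpected task status: {status}"


def wait_for_segments_loaded(datasource, timeout=300):
    """Poll the coordinator until the datasource reports 100% of its segments loaded."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        r = requests.get(f"{COORDINATOR}/druid/coordinator/v1/loadstatus")
        r.raise_for_status()
        if r.json().get(datasource) == 100.0:
            return
        time.sleep(5)
    raise AssertionError(f"segments of {datasource} were not fully loaded in time")


def check_query_results(datasource, expected_rows):
    """Run a count query through the broker and compare with the number of produced events."""
    sql = {"query": f'SELECT COUNT(*) AS cnt FROM "{datasource}"'}
    r = requests.post(f"{BROKER}/druid/v2/sql/", json=sql)
    r.raise_for_status()
    assert r.json()[0]["cnt"] == expected_rows
```

The same three checks could be reused after every update step in the scenario, with only the expected row count changing as more events are produced.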

#6208 should be implemented to support manual checkpointing.

If one of the tests fails, all task logs and system logs would be preserved and the Docker containers wouldn't be stopped, as in the existing integration tests.

Rationale

I think this is the easiest way to automate testing of rolling updates.

Operational impact

There's no operational impact.
