[PROPOSAL] Automatic rolling update test #7179

Open · jihoonson opened this issue Mar 1, 2019 · 0 comments

jihoonson (Contributor) commented Mar 1, 2019

Motivation

Supporting rolling updates is one of the most important capabilities of Druid, but it's also an area where it's easy to make mistakes. Such mistakes can block users from updating their clusters to the latest version of Druid, so we should be able to catch them before they reach a release. (#6051)

Proposed changes

I propose adding an integration test that checks various aspects of system status during rolling updates. Like the existing integration tests, it would run against a Druid cluster running in Docker containers. It would also be able to run in Travis, so that we can find incompatible changes as soon as possible.

The rolling update test program accepts the following arguments:

| Argument | Mandatory | Default |
| --- | --- | --- |
| Hash of commit for old version | false | HEAD^ |
| Hash of commit for new version | false | HEAD |
| Path to configuration files for old version | true | |
| Path to configuration files for new version | false | old configuration files |
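To make the argument handling concrete, here is a minimal sketch of how the test driver could parse these arguments. The script name, option names, and argparse-based implementation are all assumptions for illustration, not an existing Druid interface.

```python
# rolling_update_test.py -- hypothetical driver, sketched for illustration only.
import argparse


def parse_args():
    parser = argparse.ArgumentParser(description="Druid rolling update test (sketch)")
    # Commit hashes are optional; by default the previous commit is rolled to the current one.
    parser.add_argument("--old-commit", default="HEAD^",
                        help="hash of commit for the old version (default: HEAD^)")
    parser.add_argument("--new-commit", default="HEAD",
                        help="hash of commit for the new version (default: HEAD)")
    # The old-version configuration path is the only mandatory argument.
    parser.add_argument("--old-config", required=True,
                        help="path to configuration files for the old version")
    parser.add_argument("--new-config", default=None,
                        help="path to configuration files for the new version "
                             "(defaults to the old configuration files)")
    args = parser.parse_args()
    if args.new_config is None:
        args.new_config = args.old_config
    return args


if __name__ == "__main__":
    print(parse_args())
```

With these defaults, the common case of testing HEAD^ → HEAD with a single set of configuration files would only require the old-configuration argument.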

The test Druid cluster consists of 1 overlord, 1 broker, 1 coordinator, 2 middleManagers, and 1 historical.

Since stream ingestion is where unexpected incompatible changes usually happen, I propose testing the scenario below. The steps would be executed sequentially. (A sketch of how the repeated status and query checks might be automated follows the list.)

  1. Cluster initialization
    1. Build both the old and the new versions.
    2. Start a cluster of the old version.
    3. Start a Kafka supervisor.
    4. Produce some events, then check task status & query results.
    5. Checkpoint the supervisor & wait for segments to be loaded on historicals.
  2. Historical test
    1. Update historicals to the new version.
    2. Wait for segments to be loaded.
    3. Check query results.
  3. Overlord test
    1. Produce some events.
    2. Update the overlord to the new version.
    3. Check task status & query results.
  4. MiddleManager test
    1. Checkpoint the supervisor.
    2. Update 1 of the 2 middleManagers.
    3. Produce some events.
    4. Check task status & query results.
    5. Update the other middleManager.
    6. Produce some events.
    7. Check task status & query results.
  5. Broker test
    1. Checkpoint the supervisor to publish segments.
    2. Update the broker to the new version.
    3. Produce some events.
    4. Check query results.
  6. Coordinator test
    1. Checkpoint the supervisor to publish segments.
    2. Update the coordinator to the new version.
    3. Wait for segments to be loaded.
    4. Run a compaction task.
    5. Check the segment version and query results.
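As a rough illustration of the "check task status & query results" and "wait for segments to be loaded" steps above, the sketch below polls the overlord, coordinator, and broker over their HTTP APIs. The endpoints (`/druid/indexer/v1/task/{id}/status`, `/druid/coordinator/v1/loadstatus`, `/druid/v2/sql/`) are standard Druid APIs, but the host names, datasource, expected row count, and surrounding harness are placeholders I'm assuming for the example.

```python
# Hypothetical helper checks for the rolling update test, using standard Druid HTTP APIs.
# Host names, the datasource, and the expected row count are placeholders.
import time
import requests

OVERLORD = "http://overlord:8090"
COORDINATOR = "http://coordinator:8081"
BROKER = "http://broker:8082"


def check_task_status(task_id):
    """Fail if the given ingestion task is neither running nor successfully finished."""
    r = requests.get(f"{OVERLORD}/druid/indexer/v1/task/{task_id}/status")
    r.raise_for_status()
    # The exact field name may differ slightly across Druid versions.
    status = r.json()["status"]["statusCode"]
    assert status in ("RUNNING", "SUCCESS"), f"unexpected task status: {status}"


def wait_for_segments_loaded(datasource, timeout=300):
    """Poll the coordinator until the datasource reports 100% of its segments loaded."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        r = requests.get(f"{COORDINATOR}/druid/coordinator/v1/loadstatus")
        r.raise_for_status()
        if r.json().get(datasource) == 100.0:
            return
        time.sleep(5)
    raise AssertionError(f"segments of {datasource} were not fully loaded in time")


def check_query_results(datasource, expected_rows):
    """Run a count query through the broker and compare with the number of produced events."""
    sql = {"query": f'SELECT COUNT(*) AS cnt FROM "{datasource}"'}
    r = requests.post(f"{BROKER}/druid/v2/sql/", json=sql)
    r.raise_for_status()
    assert r.json()[0]["cnt"] == expected_rows
```

The same three checks could be reused after every update step in the scenario, with only the expected row count changing as more events are produced.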

#6208 should be implemented to support manual checkpointing.

If one of the tests fails, all task logs and system logs would be preserved and the Docker containers wouldn't be stopped, as in the existing integration tests.

Rationale

I think this is the easiest way to automate testing of rolling updates.

Operational impact

There's no operational impact.
