Test plan for SavedObject Migrations v2 #84141

Closed · 9 tasks done

joshdover opened this issue Nov 23, 2020 · 12 comments
Labels: Feature:Saved Objects, project:ResilientSavedObjectMigrations, Team:Core, test-plan

Comments

joshdover commented Nov 23, 2020

Manual test plan for #75780

End state of successful upgrade

Once an upgrade has successfully completed, this should be the end state of all indices and aliases (8.0.0 should be replaced with the current version of Kibana):

index                            aliases
.kibana_8.0.0_001                .kibana, .kibana_8.0.0
.kibana_task_manager_8.0.0_001   .kibana_task_manager, .kibana_task_manager_8.0.0
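
A quick way to verify this end state (a sketch, assuming a local cluster with the elastic:changeme dev credentials used elsewhere in this plan):

# list all .kibana* aliases and the indices they point to
curl -s 'http://elastic:changeme@localhost:9200/_cat/aliases/.kibana*?v'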

Scenarios

Multiple instances upgrading in parallel

Tester: @rudolf

  • Status: Completed

The goal of this test is to ensure that multiple Kibana instances can successfully attempt the migration in parallel. The sub-goal is to identify any scaling issues with multiple nodes (if any) and determine a threshold where adding more Kibana instances creates performance problems in Elasticsearch. This is important information to know because Cloud's upgrade logic will start all new nodes at once. Based on current customer clusters, we need to be able to scale to at least 10 Kibana instances.

It would be useful to test this against a larger .kibana index.

Use the parallel.sh bash script posted in the comments below to run multiple Kibana instances on different ports.

Procedure

  1. Start 7.10 Kibana nodes with ES 7.11 (or load a snapshot of a large .kibana index)
  2. Shutdown all old nodes
  3. Start 7.11 nodes all at once

Expected Behavior

  • All nodes should have successfully completed the migration process, completed the bootup sequence, and should be serving traffic (see the status-check sketch after this list).
  • The state of the indices and aliases should match the successful end state from above.
  • There should be no error log messages in the log output of any of the nodes.
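
One way to spot-check that each node is serving traffic (a sketch; the ports match the parallel.sh script posted in the comments below, and authentication may or may not be required depending on your config):

# query each instance's status API; every node should report green
for port in 5601 5611 5621; do
  curl -s -u elastic:changeme "http://localhost:$port/api/status" | jq '.status.overall.state'
done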

Kibana instance is killed while migrations are running

Tester: @bhavyarm

  • Status: Completed

The goal of this test is to verify that a Kibana instance can successfully re-attempt and complete a migration that may have failed or been aborted before completing.

Procedure

  1. Start ES 7.12
  2. Start Kibana 7.12
  3. Kill the Kibana instance (Ctrl+C) after the log message "[info][savedobjects-service] [.kibana] INIT ->" appears, but before "[info][savedobjects-service] [.kibana] Migration completed after ...ms" (see the sketch below for one way to watch for this window)
  4. Restart Kibana instance
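
One way to watch for the window between those two messages (a sketch; assumes Kibana's stdout is redirected to kibana.log, and the exact log prefix may vary with your logging config):

# follow the migration state transitions as they are logged
tail -f kibana.log | grep --line-buffered 'savedobjects-service'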

Expected Behavior

  • The second startup of Kibana should successfully re-attempt the migration and leave the cluster in the successful end state from above.
  • There should be a log message on the second startup indicating that the migration is being re-attempted.

Upgrading from Kibana 6.8.x to 7.12

Tester: @bhavyarm

  • Status: Completed

The goal of this test is to verify that we can successfully upgrade from the latest supported 6.8 version to the new migration system in 7.12.

Procedure

  1. Start ES 6.8
  2. Start Kibana 6.8, wait for it to complete initialization
  3. Shutdown ES & Kibana
  4. Upgrade to and start ES 7.12
  5. Upgrade to and start Kibana 7.12

Expected Behavior

  • The startup of Kibana 7.12 should successfully complete the migration and leave the cluster in the successful end state from above.

Upgrading from Kibana 6.0 to 7.12

Tester: @bhavyarm

  • Status: Completed

The goal of this test is to verify that we can successfully upgrade from an older 6.x version, released before we had migrations at all, to the new migration system in 7.12.

Procedure

  1. Start ES 6.0
  2. Start Kibana 6.0, wait for it to complete initialization
  3. Shutdown ES & Kibana
  4. Upgrade to and start ES 7.12
  5. Upgrade to and start Kibana 7.12

Expected Behavior

  • The startup of Kibana 7.12 should successfully complete the migration and leave the cluster in the successful end state from above.

Upgrading from Kibana 5.6 to 6.8.x to 7.12

Tester: @bhavyarm

  • Status: Completed

Similar to above, but this scenario specifically tests that the migration system works with clusters that have gone through the 6.0 major upgrade, in which the Upgrade Assistant reindexed the .kibana index into a .kibana-6 index behind an alias.
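
After the reindex, you can confirm the post-upgrade layout with a quick alias listing (a sketch, assuming a local cluster on port 9200 without security enabled):

# .kibana should be an alias pointing at the reindexed .kibana-6 index
curl -s 'http://localhost:9200/_cat/aliases/.kibana*?v'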

Procedure

  1. Start ES & Kibana 5.6
  2. Go to Upgrade Assistant in UI & reindex the Kibana index.
  3. Shutdown ES & Kibana
  4. Upgrade to and start ES 6.0
  5. Start Kibana 6.0, wait for it to complete initialization
  6. Shutdown ES & Kibana
  7. Upgrade to and start ES 7.12
  8. Upgrade to and start Kibana 7.12

Expected Behavior

  • The startup of Kibana 7.12 should successfully complete the migration and leave the cluster in the successful end state from above.

Upgrading from Kibana 7.3 to 7.12

Tester: @bhavyarm

  • Status: Completed

The goal of this test is to verify that we can successfully upgrade from 7.3, a version from before the .kibana_task_manager index was managed by migrations, to the new migration system in 7.12.

Procedure

  1. Start ES 7.3
  2. Start Kibana 7.3, wait for it to complete initialization
  3. Shutdown ES & Kibana
  4. Upgrade to and start ES 7.12
  5. Upgrade to and start Kibana 7.12

Expected Behavior

  • The startup of Kibana 7.12 should successfully complete the migration and leave the cluster in the successful end state from above.

Large SO index

Tester: @rudolf

  • Status: Completed

The goal of this test is to verify the performance of the new migration system when migrating an index with a large number of objects.

We have seen some deployments with ~200k documents in the .kibana index, caused by telemetry data. This makes for a good worst-case performance scenario.

Procedure

  1. Start ES 7.12 with a large SO dataset: yarn es snapshot --data-archive=src/core/server/saved_objects/migrationsv2/integration_tests/archives/7.7.2_xpack_100k_obj.zip
  2. Start Kibana 7.12

Expected Behavior

  • The startup of Kibana 7.12 should successfully complete the migration and leave the cluster in the successful end state from above.
  • The running time of the migration should be noted and reported back on this issue for discussion. It would be helpful to run this test multiple times to make sure the running time is consistent (see the grep sketch after this list).
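
To capture the running time, you can grep the logs for the completion message quoted earlier (a sketch; assumes Kibana's stdout was redirected to kibana.log):

# the completion message includes the total migration duration in ms
grep 'Migration completed after' kibana.log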

Results

I did some rudimentary testing by running migrations on 100k saved objects (much larger than "normal") and watching heap usage with:

watch curl -s -X GET "elastic:changeme@localhost:9200/_cat/nodes?h=heap*\&v" -H 'Content-Type: application/json'

There isn't a notable impact on the current heap between 1 Kibana node and 10 (the largest Kibana deployment we know of). But the runtime of the migration roughly doubles for 10 nodes, so there is a performance impact.

Migrations runtime with 100k objects:

Nodes  Runtime
1      2.27 minutes
2      2.27 minutes
3      3.49 minutes
10     4.21 minutes

However, even this worst-case scenario stays well within our downtime target of < 10 minutes.

Migrating an index with corrupt Saved Objects

Tester: @bhavyarm

  • Status: Completed

The goal of this test is to verify the failure behavior when a corrupt saved object is in the index before migrations run. Corrupt saved objects are saved objects where the type and namespace in the _id don't match the type and namespace properties, or a document like ..., "type": "index-pattern", "dashboard": {...}, which is of type index-pattern but has no attributes under the "index-pattern" property; its attributes are instead under a "dashboard" property.

Here are some example corrupt saved objects:

# type in _id doesn't match "type" property
curl -X PUT "elastic:changeme@localhost:9200/.kibana/_doc/dashboard:1234" -H 'Content-Type: application/json' -d '{"index-pattern":{"timeFieldName":"@timestamp","title":"logstash-*"},"migrationVersion":{"index-pattern":"7.1.0"},"references":[],"type":"index-pattern","updated_at":"2018-12-21T00:43:07.096Z"}'
# space in _id doesn't match "namespace" property
curl -X PUT "elastic:changeme@localhost:9200/.kibana/_doc/myspace:dashboard:1234" -H 'Content-Type: application/json' -d '{"index-pattern":{"timeFieldName":"@timestamp","title":"logstash-*"},"migrationVersion":{"index-pattern":"7.1.0"},"references":[],"type":"index-pattern","updated_at":"2018-12-21T00:43:07.096Z", "namespace": "another_space"}'
# no dashboard attributes for dashboard type
curl -X PUT "elastic:changeme@localhost:9200/.kibana/_doc/dashboard:1234" -H 'Content-Type: application/json' -d '{"index-pattern":{"timeFieldName":"@timestamp","title":"logstash-*"},"migrationVersion":{"dashboard":"7.1.0"},"references":[],"type":"dashboard","updated_at":"2018-12-21T00:43:07.096Z"}'

Use this command to delete the corrupted saved object:

curl -XDELETE 'http://localhost:9200/.kibana_7.12.0_001/_doc/dashboard:1234'

Procedure

  1. Start ES 7.12
  2. Start Kibana 7.11, wait for it to complete initialization
  3. Stop Kibana 7.11
  4. Load one of the corrupt saved objects
  5. Upgrade to and start Kibana 7.12

Expected Behavior

  • The startup of Kibana 7.12 should fail with a log message like:
    • FATAL Error: Unable to migrate the corrupt saved object document with _id: 'dashboard:1234'. To allow migrations to proceed, please delete this document from the [.kibana_1] index.
  • Removing or fixing the corrupt object should allow Kibana to upgrade successfully on next startup

Disabling plugins

Tester: @bhavyarm

  • Status: Completed

The goal of this test is to verify that saved object documents that were created by a plugin that has since been disabled are copied into the new index and do not fail the migration.

Procedure

  1. Start ES 7.11
  2. Start Kibana 7.11, wait for it to complete initialization
  3. Load sample data set
  4. Stop Kibana 7.11
  5. In the kibana.yml file, disable some plugins that have data in the sample data set. For example:
    xpack.maps.enabled: false
    xpack.canvas.enabled: false
    
  6. Start Kibana 7.12 with this kibana.yml

Expected Behavior

  • The startup of Kibana 7.12 should successfully complete the migration and leave the cluster in the successful end state from above.
  • The new .kibana index should still include some objects of the map and canvas-workpad types (see the query sketch after this list).
  • Restarting Kibana after the successful initial upgrade (using the same config) should not trigger a migration.
  • Restarting Kibana after the successful initial upgrade (with the default settings) should trigger a new migration.
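
One way to check that those objects survived the migration (a sketch, assuming a local cluster with the elastic:changeme dev credentials):

# count migrated objects of the disabled plugins' types; the count should be > 0
curl -s 'http://elastic:changeme@localhost:9200/.kibana/_count' -H 'Content-Type: application/json' -d '{"query":{"terms":{"type":["map","canvas-workpad"]}}}'
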
joshdover added the Team:Core, Feature:Saved Objects, and test-plan labels on Nov 23, 2020
elasticmachine commented:

Pinging @elastic/kibana-core (Team:Core)

bhavyarm commented Nov 24, 2020

@joshdover I went through the test cases and this is great. I will talk to @LeeDr (out this week) and see where our names should go. They all look fairly straightforward once we get test data. Thanks!

rudolf added the project:ResilientSavedObjectMigrations label on Nov 27, 2020
bhavyarm commented Dec 1, 2020

After talking to @LeeDr - a couple of questions. Thanks!

  1. What would be the test plan for OSS migrations?
  2. What would be the test plan for saved object import/export? Is it the same as migrations?

bhavyarm commented Dec 1, 2020

One more question - will license levels impact this process in any way?

joshdover commented:

> What would be the test plan for OSS migrations?
> One more question - will license levels impact this process in any way?

The only difference between OSS and X-Pack is which plugins are enabled, so it should be covered by the "Disabling plugins" case above. License levels do not have any impact on which SO migrations are registered.

> What would be the test plan for saved object import/export? Is it the same as migrations?

We're not making any changes to code that should impact import/export, but it's probably worth validating this by exporting data from a 6.8.x release and importing it into a 7.11 one.
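
For the import half, the saved objects import API can be used instead of the UI (a sketch; assumes a local 7.11 Kibana on port 5601 with the elastic:changeme dev credentials and an export file named export.ndjson):

# import a previously exported saved objects file
curl -s -X POST 'http://elastic:changeme@localhost:5601/api/saved_objects/_import' -H 'kbn-xsrf: true' --form file=@export.ndjson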

LeeDr commented Dec 2, 2020

@joshdover could you explain a bit about migrations vs saved object import? I think saved objects which were exported from a previous Kibana version would be migrated as they are imported into a newer version? Maybe it's not during import but after?


Also, could you please break down any steps within the migration process which we might see in the logging? Maybe an existing design or RFC doc already has this? If we want to abort the migration by killing Kibana it would be good to know how many timing windows there are where different states could exist.

> Kill Kibana instance (Ctrl+C) after log message "TODO", but before "TODO"

For example, it might be like;

  1. Kibana starts up and accesses .kibana which could be the name of an index or an alias
  2. does it check some version in each saved object to decide if it's "current" or "needs migration"? Or does it just let each plugin check each of its saved object types? Any logging here? Or debug logging?
  3. Create new index with _<n+1> name
  4. Write migrated objects (regardless if anything changed in them or not)
  5. Change or create alias to point to new index

I don't know if Kibana sets some flag or version in a doc in Elasticsearch so that other Kibana instances know a migration is in progress.


How can we figure out what migrations will happen to various different saved objects? It would be good to check that the migrations actually ran in these test cases including importing saved objects.

bhavyarm commented Dec 9, 2020

@marius-dr spoke to @LeeDr and we have the test cases assigned between us for when the BC comes out. Depending on how it goes and where we land after the holidays, we can have a status check. Thanks!

rudolf commented Dec 16, 2020

> @joshdover could you explain a bit about migrations vs saved object import? I think saved objects which were exported from a previous Kibana version would be migrated as they are imported into a newer version? Maybe it's not during import but after?

When a user imports saved objects, we first migrate them before adding them to the .kibana index. This feature doesn't change any of the migration code, but focuses on the reliability of the migration process when a kibana install is upgraded.

> Also, could you please break down any steps within the migration process which we might see in the logging? Maybe an existing design or RFC doc already has this? If we want to abort the migration by killing Kibana it would be good to know how many timing windows there are where different states could exist.

We have the algorithm documented here: https://github.com/elastic/kibana/blob/master/rfcs/text/0013_saved_object_migrations.md#4212-migration-algorithm-cloned-index-per-version. However, when we implemented it, some of the steps in this algorithm became two steps, so I will rewrite it to be a better 1-to-1 match with the implementation (this will also be valuable documentation for maintaining the feature). When migrations run, we will log each step, so it's easy to see which steps were executed and exactly at which step the process failed.

> How can we figure out what migrations will happen to various different saved objects? It would be good to check that the migrations actually ran in these test cases including importing saved objects.

I've added a task to #75780 to add debug logging to print out this information.

Bamieh commented Dec 18, 2020

I started testing the "Multiple instances upgrading in parallel" scenario.

I created a bash script to help me run multiple Kibana instances in parallel without much hassle. Full details on how to run the script are in the file header.

parallel.sh
#!/bin/bash

#
# Script to run multiple kibana instances in parallel.
# Make sure to run the script from the kibana root directory.
#
# bash parallel.sh <function> [options]
# functions:
#   start [instances] - start multiple kibanas (3 default)
#   es - run elasticsearch with 7.7.2 snapshot data
#   tail - show logs of all kibanas
#   kill - kills all started kibana processes
#   clean - clean up nohup files
#   kibana_index - search .kibana index against es
#

FN="$1"
NUM="$2"

if [ "${FN}" == "kill" ]; then
  echo "killing main processes"
  for pid in $(cat processes.out); do kill -9 $pid; done
  echo "killing trailing processes"
  for pid in $(pgrep -f scripts/kibana); do kill -9 $pid; done
  exit 0;
fi

if [ "${FN}" == "tail" ]; then
  tail -f nohup_*
  exit 0;
fi

if [ "${FN}" == "clean" ]; then
  rm -f nohup_*.out
  rm -f processes.out
  exit 0;
fi

if [ "${FN}" == "es" ]; then
  yarn es snapshot --data-archive=src/core/server/saved_objects/migrationsv2/integration_tests/archives/7.7.2_xpack_100k_obj.zip
  exit 0;
fi

if [ "${FN}" == "kibana_index" ]; then
  # search the kibana index
  curl -s -XPOST 'http://elastic:changeme@localhost:9200/.kibana/_search' -H 'Content-Type: application/json' -d '{}' | jq
  exit 0;
fi

if [ "${FN}" == "start" ]; then
  if test ! "${NUM-}"; then
    NUM=3
  fi
  node scripts/build_kibana_platform_plugins --no-examples
  rm -f processes.out
  for i in $(seq 0 $(expr $NUM - 1))
  do
    PORT="56${i}1"
    PROXY="56${i}3"
    echo "starting kibana on port $PORT"
    nohup node scripts/kibana.js --dev.basePathProxyTarget=$PROXY --server.port=$PORT --migrations.enableV2=true --dev --no-watch --no-optimizer > nohup_$i.out &
    PROCESS_ID=$!
    echo "${PROCESS_ID}" >> processes.out
  done
  exit 0;
fi

bhavyarm commented Feb 15, 2021

The "Disabling plugins" scenario fails - #91445

bhavyarm commented Mar 4, 2021

All done. Also upgraded on Cloud from 6.8.14 -> 7.3.2 -> 7.12.0 (latest BC).

joshdover commented:

@rudolf Should we close this issue?
