Dashboard migrations fail silently during upgrade, causing unexpected behaviour that's difficult to detect #139684

Closed
rudolf opened this issue Aug 30, 2022 · 2 comments
Labels
  • Feature:Dashboard: Dashboard related features
  • impact:medium: Addressing this issue will have a medium level of impact on the quality/strength of our product.
  • loe:large: Large Level of Effort
  • Team:Presentation: Presentation Team for Dashboard, Input Controls, and Canvas

rudolf commented Aug 30, 2022

Many (most?) of the dashboard migrations follow a pattern where we wrap the migration function in a try block, and if an exception is caught or the dashboard has invalid data, we silently ignore it and return the document unmigrated.

E.g.
https://github.com/elastic/kibana/blob/main/src/plugins/dashboard/server/saved_objects/migrations_730.ts#L17-L21
https://github.com/elastic/kibana/blob/main/src/plugins/dashboard/server/saved_objects/dashboard_migrations.ts#L47-L52
https://github.com/elastic/kibana/blob/main/src/plugins/dashboard/server/saved_objects/dashboard_migrations.ts#L91-L96
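
To make the pattern concrete, here is a minimal sketch of what such a wrapper looks like. It is not copied from the linked files: `Doc`, `MigrationFn`, `addPanelVersion` and `silentlySkipping` are simplified, hypothetical stand-ins, and the `version` value is made up for illustration.

```typescript
// Simplified stand-in for a saved object document; not Kibana's real type.
interface Doc {
  id: string;
  attributes: Record<string, unknown>;
}

type MigrationFn = (doc: Doc) => Doc;

// Hypothetical migration step: parses panelsJSON and tags every panel with a version.
const addPanelVersion: MigrationFn = (doc) => {
  const panels = JSON.parse(doc.attributes.panelsJSON as string) as Array<Record<string, unknown>>;
  const migrated = panels.map((panel) => ({ ...panel, version: '7.3.0' }));
  return { ...doc, attributes: { ...doc.attributes, panelsJSON: JSON.stringify(migrated) } };
};

// The pattern described above: any failure is swallowed and the input is returned unchanged.
const silentlySkipping = (migrate: MigrationFn): MigrationFn => (doc) => {
  try {
    return migrate(doc);
  } catch {
    // No log and no rethrow: the caller cannot tell that the migration was skipped.
    return doc;
  }
};

const good: Doc = { id: 'a', attributes: { panelsJSON: '[{"type":"lens"}]' } };
const corrupt: Doc = { id: 'b', attributes: { panelsJSON: '{not json' } };

const migratedGood = silentlySkipping(addPanelVersion)(good);
const migratedCorrupt = silentlySkipping(addPanelVersion)(corrupt);
```

In this sketch `migratedCorrupt` is the very same object as `corrupt`, so from the outside the upgrade looks like it succeeded for both documents.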

This can cause several problems:

  • if we ever remove a field from the documents and mappings, the whole upgrade will fail because we would be trying to write unmigrated docs, which still have the field present, into an index where the field was removed
  • users upgrade their clusters and the upgrade succeeds, but several weeks later they notice that some of their dashboards are broken. By then they can no longer revert the upgrade, and they have no information about why the dashboard is broken.
  • documents are marked as being up to date with the latest migrationVersion when in fact they aren't, so there's no easy way to identify such documents

For these reasons our documentation states that migration functions should be written defensively but fail on invalid data:
https://github.com/elastic/kibana/blob/main/dev_docs/tutorials/saved_objects.mdx#L251-L253
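
As a rough illustration of "defensive but fail on invalid data", the sketch below tolerates a shape that is assumed to be legitimate (no `panelsJSON` at all) and throws an error naming the document for anything it cannot interpret. `Doc` and `migratePanels` are hypothetical, the `version` value is invented, and treating a missing `panelsJSON` as valid is an assumption of this example rather than something the documentation states.

```typescript
// Simplified stand-in for a saved object document; not Kibana's real type.
interface Doc {
  id: string;
  attributes: Record<string, unknown>;
}

// Hypothetical migration that tags every panel with a version.
const migratePanels = (doc: Doc): Doc => {
  const raw = doc.attributes.panelsJSON;

  // Defensive: assumed-valid case, nothing to migrate.
  if (raw === undefined) {
    return doc;
  }

  // Fail loudly on data we cannot interpret, and say which document it was.
  if (typeof raw !== 'string') {
    throw new Error(`Dashboard ${doc.id}: expected panelsJSON to be a string, got ${typeof raw}`);
  }

  let panels: unknown;
  try {
    panels = JSON.parse(raw);
  } catch (e) {
    const reason = e instanceof Error ? e.message : String(e);
    throw new Error(`Dashboard ${doc.id}: panelsJSON is not valid JSON (${reason})`);
  }

  if (!Array.isArray(panels)) {
    throw new Error(`Dashboard ${doc.id}: expected panelsJSON to contain an array`);
  }

  const migrated = panels.map((panel: unknown) => ({ ...(panel as object), version: '8.5.0' }));
  return { ...doc, attributes: { ...doc.attributes, panelsJSON: JSON.stringify(migrated) } };
};
```

The difference from the silent pattern is that a corrupt document now surfaces as an error that names the dashboard, instead of being marked as migrated.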

Now that we're in this situation we cannot just go and rewrite all existing migration functions. So I would suggest the following course of action:

  1. short term: log an error message each time a dashboard migration catches an error and gets "skipped". New migrations should not ignore errors. Monitor cloud logs for such errors to understand how frequently this happens.
  2. medium term: add validation to dashboard saved object types using saved object schema validation (#118969, "Adds validations for Saved Object types when calling create or bulkCreate") to ensure that all new dashboards that are created or imported have the schema we expect them to have. This at least prevents further problems from being created.
  3. long term: create a dashboard API layer to ensure that updates (and creates), especially by API users, cannot produce corrupt documents. Validate dashboards on read and log errors when a dashboard that was read fails validation.
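
The short-term step could look roughly like the sketch below: the existing skip behaviour is kept, but every skipped document leaves an error in the logs. `Logger` is a hypothetical stand-in for whatever logger the migration has access to, and `skipButLog` and the message format are invented for this example.

```typescript
// Simplified stand-ins; not Kibana's real types.
interface Doc {
  id: string;
  attributes: Record<string, unknown>;
}

interface Logger {
  error(message: string): void;
}

type MigrationFn = (doc: Doc) => Doc;

// Keeps the existing "return the document unmigrated" behaviour, but records each skip.
const skipButLog = (name: string, migrate: MigrationFn, log: Logger): MigrationFn => (doc) => {
  try {
    return migrate(doc);
  } catch (e) {
    const reason = e instanceof Error ? e.message : String(e);
    log.error(`Dashboard migration "${name}" skipped for document ${doc.id}: ${reason}`);
    return doc;
  }
};

// Usage with an in-memory logger and a hypothetical migration that parses panelsJSON.
const messages: string[] = [];
const log: Logger = {
  error: (message) => {
    messages.push(message);
  },
};

const parsePanels: MigrationFn = (doc) => {
  JSON.parse(doc.attributes.panelsJSON as string); // throws on corrupt JSON
  return { ...doc, attributes: { ...doc.attributes, checked: true } };
};

const wrapped = skipButLog('parsePanels', parsePanels, log);
const corruptDoc: Doc = { id: 'b', attributes: { panelsJSON: '{not json' } };
const result = wrapped(corruptDoc);
```

Including the migration name and document id in the message is a design choice of this sketch; it is what would make the log entries countable and traceable when monitoring for skipped migrations.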
botelastic bot added the needs-team label on Aug 30, 2022
rudolf added the Team:Presentation label on Aug 30, 2022
@elasticmachine

Pinging @elastic/kibana-presentation (Team:Presentation)

botelastic bot removed the needs-team label on Aug 30, 2022
ThomThomson added the Feature:Dashboard, loe:large and impact:medium labels on Sep 21, 2022
@ThomThomson

Closing this because the Dashboard content management onboard PR added a schema.
