[Fleet] Improve recoverability and stability of package installation #169147

Open
kpollich opened this issue Oct 17, 2023 · 8 comments

@kpollich
Member

kpollich commented Oct 17, 2023

Meta issue tracking the work for recoverability and stability of package installation

Currently, it's hard to recover a failed package installation, as our only recourse is typically to reinstall the package. There are no granular recovery steps we can take, and we often lack visibility into which particular steps failed. It'd be ideal if we could build a more "state machine"-like implementation for packages, with specific recovery steps for each state transition along the way.
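
To make the "state machine with recovery steps" idea concrete, here is a minimal sketch. It is not Fleet's implementation: the step names, the `recover` hook, and the result shape are all assumptions made for illustration, and the steps are synchronous only to keep the example short.

```typescript
// Hypothetical step names; the real install steps are not listed in this issue.
type StepName =
  | "create_saved_object"
  | "install_kibana_assets"
  | "install_index_templates";

interface InstallContext {
  log: string[];
}

interface InstallStep {
  name: StepName;
  run: (ctx: InstallContext) => void;
  // Optional cleanup invoked when `run` throws, so the step can be retried later.
  recover?: (ctx: InstallContext) => void;
}

interface InstallResult {
  status: "installed" | "failed";
  failedStep?: StepName;
  error?: string;
  completedSteps: StepName[];
}

// Runs the steps in order and stops at the first failure, recording which
// step failed instead of reporting only a generic install failure.
function runInstall(steps: InstallStep[], ctx: InstallContext): InstallResult {
  const completedSteps: StepName[] = [];
  for (const step of steps) {
    try {
      step.run(ctx);
      completedSteps.push(step.name);
    } catch (err) {
      if (step.recover) {
        step.recover(ctx);
      }
      return {
        status: "failed",
        failedStep: step.name,
        error: String(err),
        completedSteps,
      };
    }
  }
  return { status: "installed", completedSteps };
}
```

The point of the sketch is the returned `failedStep`: once the failing transition is known, a targeted recovery or retry becomes possible instead of a full reinstall.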

Ref #166857
Ref #166798

kpollich added the Team:Fleet label (Team label for Observability Data Collection Fleet team) Oct 17, 2023
kpollich self-assigned this Oct 17, 2023
@elasticmachine
Contributor

Pinging @elastic/fleet (Team:Fleet)

kpollich changed the title from "[Fleet] Improve recoverability of failed package installations" to "[Fleet] Improve recoverability and stability of package installation" Oct 17, 2023
@criamico
Contributor

Adding some considerations as discussed with @kpollich

We could start by looking at the specific steps covered by the installation process and documenting them. We have a complex state machine and we go through those steps (and a lot of side effects) every time an integration is installed, but we don't really have it documented anywhere and the whole install process is a little opaque.

This brings me to the second point: whenever an integration ends up in a bad state (like failed_install), we don't really have a way to restart from the failed step, so we have to force the whole installation again. As highlighted in this comment, we could even implement retries on those steps, but currently we don't have any granularity at the step level. It's just a single endpoint, and what we usually ask users to do is this:

# Force uninstall
DELETE kbn:api/fleet/epm/packages/<integration>/<version>
{
  "force": true
}

# Force reinstall
POST kbn:api/fleet/epm/packages/<integration>/<version>
{
  "force": true
}
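
For anyone scripting this workaround outside Dev Tools, the same two calls can be described as plain request objects. This is only a sketch: `FleetRequest` and `buildForceReinstallRequests` are made-up names, and the path and `force` body simply mirror the console snippet above. Sending the requests (authentication, plus the `kbn-xsrf` header that Kibana requires on write requests) is left to the caller.

```typescript
// Hypothetical request descriptor; not a Kibana or Fleet type.
interface FleetRequest {
  method: "DELETE" | "POST";
  path: string;
  body: { force: boolean };
}

// Builds the force-uninstall and force-reinstall requests, in the order
// they should be sent, for one package name and version.
function buildForceReinstallRequests(
  pkgName: string,
  pkgVersion: string
): FleetRequest[] {
  const path =
    "/api/fleet/epm/packages/" +
    encodeURIComponent(pkgName) +
    "/" +
    encodeURIComponent(pkgVersion);
  return [
    { method: "DELETE", path, body: { force: true } },
    { method: "POST", path, body: { force: true } },
  ];
}
```

Keeping the request construction separate from the HTTP client makes the sequence easy to log or dry-run before touching a cluster.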

A third consideration is that we could maybe reuse the new input template endpoint to simplify the installation process. The endpoint only returns the inputs, but we could reuse part of its logic, for example by adding an endpoint under the same namespace that returns the rest of the integration's info and not only the inputs, and use that to simplify the whole install flow.

@criamico
Contributor

criamico commented Jan 2, 2024

Adding some comments per discussion with @nchaulet:

  • We should identify the different components of a package installation and propose some endpoints to reinstall only those parts, for example for Kibana assets something like: POST kbn:/api/fleet/epm/packages/<pkg>/<version>/kibana_assets
  • Another option could be to restart a failed install from where it stopped the previous time instead of restarting from scratch every single time. We already store some information about the failure, so we could expand on that. This could be an option set explicitly with a flag, since in some cases it is better to restart an install from scratch.
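
The second bullet (resume from the failed step, opt-in via a flag) could look roughly like the sketch below. Everything here is an assumption for illustration: the `failedStep` field, the `resume` flag, and the step names are not taken from the Fleet code.

```typescript
// Hypothetical shape of the failure info stored for a package.
interface StoredInstallState {
  failedStep?: string;
}

// Returns the steps to run for this attempt. With `resume` set and a known
// failed step, it starts from that step; otherwise it runs everything.
function selectStepsToRun(
  allSteps: string[],
  stored: StoredInstallState,
  opts: { resume: boolean }
): string[] {
  if (!opts.resume || stored.failedStep === undefined) {
    return allSteps.slice();
  }
  const index = allSteps.indexOf(stored.failedStep);
  // Unknown step name: fall back to a full install rather than guessing.
  return index === -1 ? allSteps.slice() : allSteps.slice(index);
}
```

Falling back to a full install when the stored step is unknown keeps the flag safe to use across upgrades where the list of steps may have changed.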

@criamico
Contributor

@kpollich @nchaulet I converted this ticket to a "meta" one and wrote some more tickets based on our discussion. Feel free to comment/update as needed.

@criamico
Contributor

@kpollich I split the items in phase 1 and added some further details in the descriptions as we discussed recently.

criamico added a commit that referenced this issue Aug 27, 2024
…90986)

Closes #189353

## Summary

Small change that implements a precondition function for package install
state machine. This is needed for the subsequent work planned in
#169147.

Note that this code is added and tested, but it is not currently used; it
will only be used once
#175597 is implemented.


### Checklist
- [ ] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios

Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
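
As a rough illustration of what a precondition check in a state machine can look like (this is not the code from the commit above; `GuardedState`, `runGuarded`, and the state names are hypothetical), each state can declare a predicate that must hold before it runs:

```typescript
// Hypothetical state definition with an optional precondition.
interface GuardedState {
  name: string;
  // Receives the names of the states completed so far; returns true if this
  // state is allowed to run.
  precondition?: (completed: readonly string[]) => boolean;
  run: () => void;
}

interface GuardedResult {
  completed: string[];
  // Name of the state whose precondition failed, if any.
  blockedAt?: string;
}

// Runs states in order and stops before any state whose precondition fails.
function runGuarded(states: GuardedState[]): GuardedResult {
  const completed: string[] = [];
  for (const state of states) {
    if (state.precondition && !state.precondition(completed)) {
      return { completed, blockedAt: state.name };
    }
    state.run();
    completed.push(state.name);
  }
  return { completed };
}
```

A check like this lets a state refuse to start when an earlier step it depends on has not finished, which is what makes restarting from an arbitrary step safe.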
@nimarezainia
Contributor

@kpollich what should we do with the remaining issues here? should we split them into another meta to be dealt with later?

@kpollich
Member Author

> @kpollich what should we do with the remaining issues here? should we split them into another meta to be dealt with later?

I don't think there's any reason to split them out into a new issue. This has sort of just become a tracking issue for tech debt to fill out quality sprints which I think is fine.

@nimarezainia
Contributor

> > @kpollich what should we do with the remaining issues here? should we split them into another meta to be dealt with later?
>
> I don't think there's any reason to split them out into a new issue. This has sort of just become a tracking issue for tech debt to fill out quality sprints which I think is fine.

ok I have moved it out to Q1 for now
