Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Fleet] Implement state machine behavior for package install #175592

Closed
4 of 8 tasks
criamico opened this issue Jan 25, 2024 · 7 comments · Fixed by #178657
Closed
4 of 8 tasks

[Fleet] Implement state machine behavior for package install #175592

criamico opened this issue Jan 25, 2024 · 7 comments · Fixed by #178657
Assignees
Labels
Team:Fleet Team label for Observability Data Collection Fleet team

Comments

@criamico
Copy link
Contributor

criamico commented Jan 25, 2024

Phase 1 of #169147

Implement a state machine like behavior for the package install:

  • Refactor the existing steps as steps that can be executed independently. Main steps:
    • kibana_assets
    • ilm_policies
    • index_templates_and_pipelines
    • transforms
    • ml_models
  • Keep track of the current step and save it, so that the next install can be restarted from there. To do that we need to persist the status in the installation saved object (exact format TBD)
  • For each state transition:
    • Implement and check initial conditions that need to be verified prior to execute the state
    • Implement retries and recovery steps to execute when the state failed
    • Expose the current state and errors, if present, in the install endpoint response (POST api/fleet/epm/packages/<pkgName>/<pkgVersion>)
    • Introduce a flag that allows to do the whole reinstall from scratch (we could reuse force for that)
  • Add tests covering the new functionality
@criamico criamico added the Team:Fleet Team label for Observability Data Collection Fleet team label Jan 25, 2024
@elasticmachine
Copy link
Contributor

Pinging @elastic/fleet (Team:Fleet)

@nchaulet
Copy link
Member

Refactor the existing steps as steps that can be executed independently

@criamico I do not think this is needed until we have independent API no?

@criamico
Copy link
Contributor Author

I do not think this is needed until we have independent API no?

@nchaulet I didn't intend that those functions need to be exposed but just tweaked the strict necessary to make them part of a state machine. Some work will be needed in those functions, although minimal, and I wanted to highlight that. But we can rephrase it to make it clearer.

@criamico
Copy link
Contributor Author

criamico commented Mar 13, 2024

@nchaulet and I had a discussion about how to implement the state machine behavior. In the RFC we didn't get to discuss this aspect.

We considered using a library like XState. It is already used inside Kibana (see https://github.com/elastic/kibana/tree/main/x-pack/plugins/observability_solution/infra/docs/state_machines), but it's actually pretty complex and we agreed that we should strive to to keep the implementation simple. Plus we probably don't need all these functionalities for our case.

From our discussion it came out that we need the following:

  • Provide the states and the functions to execute on each state in a way that's simple to write and maintain
  • Persist the current install state (similar to what we already do with install_failed state)
  • Have a way to restart the steps from any state when the install failed
  • Keep the state machine code decoupled from the actual lifecycle functionality, so it can be fully tested
  • Define states based on the existing functions in

Based on these considerations I created a branch with a small POC. All it does is to create a state machine based on a data structure - note that for now the functions are empty as it's just an example. The unit tests show how it should work with a generic data structure.

@kpollich I'd also like to have your take on this before going too far with it.

@kpollich
Copy link
Member

XState is very good but I worry moving to it would be too large a refactor here, so I agree with going a different direction.

I'll take a look at the draft PR and let you know if I have any other thoughts. Thanks for checking in here.

criamico added a commit that referenced this issue Apr 8, 2024
Closes #175592

## Summary
Implement state machine behavior for package install. It keeps track of
the current step and save it in the SO , then exposes it in
the`installationInfo` property.

- Implemented a generic state machine function that can automatically
handle state transitions based on a simple data structure:
https://github.com/elastic/kibana/pull/178657/files#diff-f350d9630cd1f22cd1b3e70c9e95388d72dc877190bbeb33c739cb0433949e95R1-R88.
In theory, this state machine could be reused for something else, since
is generic enough and it's decoupled from the transition functions that
we pass to it.
- The state transitions passed to the state machine are defined in
[services/epm/packages/install_steps.ts](https://github.com/elastic/kibana/blob/5f09e58ae7a300f459c3d1157fb747cfbb0c11aa/x-pack/plugins/fleet/server/services/epm/packages/install_steps.ts)
and are based off the existing steps in
https://github.com/elastic/kibana/blob/10d5167fa78c1a4c65f8607dad1e6a681e39f4b0/x-pack/plugins/fleet/server/services/epm/packages/_install_package.ts#L61
I simply divided that long function in smaller steps and wrapped them to
accept a common parameter, based off
[InstallContext](https://github.com/elastic/kibana/pull/178657/files#diff-39d1f59e77a329eb06241c220156e5cf2d350649bb548707b0b0f54365ea91bfR49-R72)
- Defined a **feature flag** `enablePackagesStateMachine` and called the
new
[installPackageWitStateMachine](https://github.com/elastic/kibana/pull/178657/files#diff-cf9cec44de2ad0a6a3b74cca05e5308231d57d5c3e180ae4bc5114c2bf9af4ebR466-R483)
only when it's enabled.
- For now this new function is only applied to
`InstallPackageFromRegistry`, so `upload` and `bundled` case don't use
it yet.

### Testing
- Enable `enablePackagesStateMachine` in kibana.dev.yml
- Try to install an integration from registry, either from API or UI.
For instance

```
POST kbn:api/fleet/epm/packages/nginx/1.20.0
```
The installation process should succeed and the installationInfo
property will expose the `latest_executed_state` along with the error.

<details>
  <summary>Screenshots</summary>

### Logging
With `logger.debug` enabled:
![Screenshot 2024-03-27 at 16 12
33](https://github.com/elastic/kibana/assets/16084106/75fb4af8-675e-483e-a51f-eb4adbf9d2aa)

![Screenshot 2024-03-27 at 16 12
48](https://github.com/elastic/kibana/assets/16084106/74092f6d-528c-4e8f-85ee-85e2852487b8)

### InstallationInfo object
Content of `installationInfo` property when install process was
successful:
![Screenshot 2024-03-27 at 16 13
54](https://github.com/elastic/kibana/assets/16084106/c2535c8f-24f7-4b6c-8f58-dadf4c9b4b28)

### Errors during install process

I manually triggered an error inside `stepInstallIndexTemplatePipelines`
and it's reported in the `installationInfo` property along with the
latest executed step (latest successful state) and error message:

![Screenshot 2024-03-27 at 17 26
29](https://github.com/elastic/kibana/assets/16084106/47d77330-bcbb-4608-9e42-c9f46e8831a1)


</details>



### Checklist
- [ ]
[Documentation](https://www.elastic.co/guide/en/kibana/master/development-documentation.html)
was added for features that require explanation or tutorials
- [ ] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios
- [ ] [Flaky Test
Runner](https://ci-stats.kibana.dev/trigger_flaky_test_runner/1) was
used on any tests changed

---------

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
@criamico
Copy link
Contributor Author

criamico commented Apr 8, 2024

PR was merged. I added the tasks not covered by #178657 in #175597

@criamico
Copy link
Contributor Author

criamico commented Apr 9, 2024

PR to add more APM spans: #180355

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Team:Fleet Team label for Observability Data Collection Fleet team
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants