Make local cluster start idempotent #1031

Open
tjtelan opened this issue Apr 28, 2021 · 8 comments
Labels
enhancement New feature or request good first issue Good for newcomers help wanted Good issue for community involvement no-stale Opt-out of closing issue due to no activity technical debt

Comments

@tjtelan
Contributor

tjtelan commented Apr 28, 2021

If we try to start a cluster twice, we don't always get useful error messages.

$ fluvio cluster start --local
Performing pre-flight checks
✅ ok: Supported helm version is installed
✅ ok: Supported kubernetes version is installed
✅ ok: Kubernetes config is loadable
✅ ok: Fluvio system charts are installed
Successfully installed Fluvio!

$ fluvio cluster start --local
Performing pre-flight checks
✅ ok: Supported helm version is installed
✅ ok: Supported kubernetes version is installed
✅ ok: Kubernetes config is loadable
✅ ok: Fluvio system charts are installed
Error:
   0: Fluvio cluster error
   1: Fluvio cluster error
   2: Failed to install Fluvio locally
   3: Kubernetes client error
   4: client error: 409 Conflict

This error comes up often during development. We might forget that we already have a local cluster started, or we used to have one running locally but have since killed the process.

In my opinion, there are two different levels at which to address this.

  1. Behave similarly to fluvio cluster start (Kubernetes).
    If a cluster appears to be running, we just say so. Otherwise, we suggest running fluvio cluster delete --local to clean up metadata.

Example of successful Kubernetes startup idempotency:

$ fluvio cluster start
✅ ok: Kubernetes config is loadable
✅ ok: Supported helm version is installed
✅ ok: Load balancer is up
✅ ok: Fluvio system charts are installed
💙 note: Fluvio is already running
Successfully installed Fluvio!
  2. We attempt to clean up however we can, then try to start a local cluster.

In my mind, this is effectively running fluvio cluster delete --local && fluvio cluster start --local on behalf of the user. We can inform them while we are cleaning up.

We could make option 1 the default behavior and provide option 2 behind a --force flag. The goal is a scriptable path to starting a local cluster without needing to handle errors.
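To make the proposal concrete, here is a minimal sketch of how the installer could branch on an existing cluster. The helper names (`local_cluster_is_running`, `delete_local_cluster`, `install_local_cluster`) are hypothetical stand-ins for the real install/delete paths, not actual Fluvio APIs.

```rust
use std::io;

// Hypothetical helpers; Fluvio's real installer code is structured
// differently, and these names exist only for illustration.
fn local_cluster_is_running() -> io::Result<bool> { Ok(false) } // e.g. probe the SC
fn delete_local_cluster() -> io::Result<()> { Ok(()) }          // `fluvio cluster delete --local`
fn install_local_cluster() -> io::Result<()> { Ok(()) }         // existing install path

fn start_local(force: bool) -> io::Result<()> {
    if local_cluster_is_running()? {
        if force {
            // Option 2: clean up on behalf of the user, then reinstall.
            println!("note: existing local cluster found, cleaning up before reinstalling");
            delete_local_cluster()?;
        } else {
            // Option 1: treat an already-running cluster as success and exit 0.
            println!("💙 note: Fluvio is already running");
            return Ok(());
        }
    }
    install_local_cluster()
}
```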

@tjtelan tjtelan added enhancement New feature or request technical debt labels Apr 28, 2021
@nicholastmosher
Contributor

One possible solution is to have the SC create a pid lockfile when it starts up, something like /usr/local/var/fluvio/sc.pid, containing the process id of the currently-running SC. Before starting up, the cluster installer can then check for the lockfile and for a process with the same pid. If either the lockfile or the process does not exist, we can preemptively clean out any old metadata from the store so that we do not encounter the 409 Conflict.
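A minimal sketch of that lockfile check, assuming the path above and Linux's /proc for the liveness test; none of these helpers exist in Fluvio today.

```rust
use std::{fs, path::Path};

// Path suggested above; the real location would likely vary per platform.
const SC_PID_FILE: &str = "/usr/local/var/fluvio/sc.pid";

/// Returns true if the pid lockfile exists and the recorded process is still alive.
fn sc_appears_running() -> bool {
    let pid = match fs::read_to_string(SC_PID_FILE)
        .ok()
        .and_then(|s| s.trim().parse::<u32>().ok())
    {
        Some(pid) => pid,
        None => return false, // no lockfile, or its contents are not a pid
    };
    // On Linux an existing /proc/<pid> directory means the process is alive;
    // a portable implementation would use a crate such as `sysinfo` instead.
    Path::new(&format!("/proc/{pid}")).exists()
}

fn preflight_cleanup() {
    if !sc_appears_running() {
        // Stale or missing lockfile: safe to clear old metadata so the
        // installer does not hit the 409 Conflict later.
        // clear_stale_metadata();  // hypothetical cleanup hook
        let _ = fs::remove_file(SC_PID_FILE);
    }
}
```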

@sehz
Contributor

sehz commented Apr 28, 2021

This needs to work across different OSes and cluster types (local, k8). We certainly need a fluvio cluster status API/command, similar to https://minikube.sigs.k8s.io/docs/commands/status/.

Key points

  • Users may have multiple clusters, which may or may not be running locally
  • They need to switch between different clusters.

Note that this is admin management of clusters, which is different from the user's perspective, where they are just a consumer/producer.
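For illustration only, here is one possible shape for the data such a fluvio cluster status command could report. These types do not exist in Fluvio; they are merely a sketch of the admin-level view described above.

```rust
/// Which kind of installation a configured cluster (profile) points at.
enum InstallationType {
    Local,
    Kubernetes,
}

/// Coarse state of one cluster from the admin's point of view.
enum ClusterState {
    Running,
    Stopped,
    NotInstalled,
}

/// One entry in a hypothetical `fluvio cluster status` report.
struct ClusterStatus {
    profile: String,                // name used to switch between clusters
    installation: InstallationType, // local process vs. k8 deployment
    sc: ClusterState,               // streaming controller state
    spus_online: usize,             // SPUs currently reporting in
}
```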

@github-actions

Stale issue message

@github-actions

Stale issue message

@tjtelan tjtelan added help wanted Good issue for community involvement good first issue Good for newcomers and removed no-issue-activity labels Nov 23, 2021
@github-actions

Stale issue message

@tjtelan tjtelan added the no-stale Opt-out of closing issue due to no activity label Apr 1, 2022
@BatmanAoD
Contributor

@sehz Since #2044 added a check for a running local cluster, is a status API still necessary to resolve this? Or is it sufficient to simply make the nonrecoverable AlreadyInstalled and ExistingLocalCluster errors (in check/mod.rs) cause the CLI to exit with a 0 exit code (rather than an error code) and a message stating that the cluster is already running?

@sehz
Contributor

sehz commented Aug 28, 2022

Yes, it's OK to return a non-recoverable error.

@BatmanAoD
Contributor

@sehz Sorry, I think I wasn't clear; those are errors that already exist in the code, and are considered "nonrecoverable" (they're part of the UnrecoverableCheckStatus enum). My question is whether changing the behavior when those are encountered, so that the process exits with a "success" status (0), would be sufficient to satisfy option 1 for this issue: "If a cluster appears to be running, we just say so."
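A rough sketch of the change being discussed, with the enum paraphrased from the description of check/mod.rs above: only the two variant names come from this thread, and the real UnrecoverableCheckStatus has more variants and different data.

```rust
use std::process::ExitCode;

// Paraphrased for illustration; the real enum in check/mod.rs differs.
enum UnrecoverableCheckStatus {
    AlreadyInstalled,
    ExistingLocalCluster,
    Other(String),
}

fn exit_code_for(status: UnrecoverableCheckStatus) -> ExitCode {
    match status {
        // Option 1 from this issue: an already-running cluster is not an error.
        UnrecoverableCheckStatus::AlreadyInstalled
        | UnrecoverableCheckStatus::ExistingLocalCluster => {
            println!("💙 note: Fluvio is already running");
            ExitCode::SUCCESS
        }
        UnrecoverableCheckStatus::Other(msg) => {
            eprintln!("Error: {msg}");
            ExitCode::FAILURE
        }
    }
}
```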
