MLflow in blocked stage after restarting the cluster node #226

Open
Barteus opened this issue Feb 19, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@Barteus
Contributor

Barteus commented Feb 19, 2024

Bug Description

The mlflow-server unit is in the Blocked state with the message: "Error with default S3 artifact store - bucket not accessible or cannot be created. Caught error: 'An error occurred ..."

Meanwhile the MLflow server works correctly.

This might impact upgrades and security patches.

To Reproduce

  1. Deploy Kubeflow and integrate it with MLflow.
  2. Restart the node on which the MLflow server is deployed.
  3. Run juju status: mlflow-server is in the Blocked state.
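
Step 3 can also be checked from a script. The following is a minimal sketch, assuming the status has been exported with juju status --format=json and that units appear under applications -> <application> -> units with a workload-status entry holding current and message. That layout is an assumption about the JSON output, and the helper name blocked_units is hypothetical.

```python
from typing import Dict


def blocked_units(status: dict, application: str) -> Dict[str, str]:
    """Map each blocked unit of `application` to its status message.

    `status` is the parsed JSON document. The key layout used here
    (applications/units/workload-status) is an assumption, not a
    documented contract.
    """
    app = status.get("applications", {}).get(application, {})
    blocked = {}
    for unit_name, unit in app.get("units", {}).items():
        workload = unit.get("workload-status", {})
        if workload.get("current") == "blocked":
            blocked[unit_name] = workload.get("message", "")
    return blocked
```

With the parsed status in a variable named status, blocked_units(status, "mlflow-server") would return an empty dict when no unit is blocked.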

Environment

Charmed Kubernetes: (charms 1.29) k8s API 1.28
Charmed Kubeflow 1.8
juju: 3.1.7-genericlinux-amd64

Relevant Log Output

$ kubectl logs mlflow-server-0 -n kubeflow
Defaulted container "charm" out of: charm, mlflow-prometheus-exporter, mlflow-server, charm-init (init)
2024-02-19T15:50:31.125Z [pebble] HTTP API server listening on ":38812".
2024-02-19T15:50:31.125Z [pebble] Started daemon.
2024-02-19T15:50:31.127Z [pebble] POST /v1/services 1.28029ms 202
2024-02-19T15:50:31.127Z [pebble] Started default services with change 1.
2024-02-19T15:50:31.128Z [pebble] Service "container-agent" starting: /charm/bin/containeragent unit --data-dir /var/lib/juju --append-env "PATH=$PATH:/charm/bin" --show-log --charm-modified-version 0
2024-02-19T15:50:31.160Z [container-agent] 2024-02-19 15:50:31 INFO juju.cmd supercommand.go:56 running containerAgent [3.1.7 0cd207d999fef1fc8b965c410e9f58fafe7ee335 gc go1.21.5]
2024-02-19T15:50:31.160Z [container-agent] starting containeragent unit command
2024-02-19T15:50:31.161Z [container-agent] containeragent unit "unit-mlflow-server-0" start (3.1.7 [gc])
2024-02-19T15:50:31.161Z [container-agent] 2024-02-19 15:50:31 INFO juju.cmd.containeragent.unit runner.go:578 start "unit"
2024-02-19T15:50:31.161Z [container-agent] 2024-02-19 15:50:31 INFO juju.worker.upgradesteps worker.go:60 upgrade steps for 3.1.7 have already been run.
2024-02-19T15:50:31.163Z [container-agent] 2024-02-19 15:50:31 INFO juju.worker.probehttpserver server.go:157 starting http server on 127.0.0.1:65301
2024-02-19T15:50:31.186Z [container-agent] 2024-02-19 15:50:31 INFO juju.api apiclient.go:707 connection established to "wss://172.29.200.18:17070/model/6090e281-ad3e-455b-8881-3926203ec9cf/api"
2024-02-19T15:50:31.191Z [container-agent] 2024-02-19 15:50:31 INFO juju.worker.apicaller connect.go:163 [6090e2] "unit-mlflow-server-0" successfully connected to "172.29.200.18:17070"
2024-02-19T15:50:31.214Z [container-agent] 2024-02-19 15:50:31 INFO juju.worker.migrationminion worker.go:142 migration phase is now: NONE
2024-02-19T15:50:31.214Z [container-agent] 2024-02-19 15:50:31 INFO juju.worker.logger logger.go:120 logger worker started
2024-02-19T15:50:31.227Z [container-agent] 2024-02-19 15:50:31 INFO juju.worker.leadership tracker.go:194 mlflow-server/0 promoted to leadership of mlflow-server
2024-02-19T15:50:31.228Z [container-agent] 2024-02-19 15:50:31 WARNING juju.worker.proxyupdater proxyupdater.go:241 unable to set snap core settings [proxy.http=http://172.29.200.6:8000/ proxy.https=http://172.29.200.6:8000/ proxy.store=]: exec: "snap": executable file not found in $PATH, output: ""
2024-02-19T15:50:31.241Z [container-agent] 2024-02-19 15:50:31 INFO juju.agent.tools symlinks.go:20 ensure jujuc symlinks in /var/lib/juju/tools/unit-mlflow-server-0
2024-02-19T15:50:31.242Z [container-agent] 2024-02-19 15:50:31 WARNING juju.worker.proxyupdater proxyupdater.go:241 unable to set snap core settings [proxy.http=http://172.29.200.6:8000/ proxy.https=http://172.29.200.6:8000/]: exec: "snap": executable file not found in $PATH, output: ""
2024-02-19T15:50:31.247Z [container-agent] 2024-02-19 15:50:31 INFO juju.worker.caasupgrader upgrader.go:113 abort check blocked until version event received
2024-02-19T15:50:31.247Z [container-agent] 2024-02-19 15:50:31 INFO juju.worker.caasupgrader upgrader.go:119 unblocking abort check
2024-02-19T15:50:31.555Z [container-agent] 2024-02-19 15:50:31 INFO juju.worker.uniter uniter.go:363 unit "mlflow-server/0" started
2024-02-19T15:50:31.568Z [container-agent] 2024-02-19 15:50:31 INFO juju.worker.uniter uniter.go:389 hooks are retried true
2024-02-19T15:50:31.697Z [container-agent] 2024-02-19 15:50:31 INFO juju.worker.uniter.charm bundles.go:81 downloading ch:amd64/focal/mlflow-server-466 from API server
2024-02-19T15:50:31.697Z [container-agent] 2024-02-19 15:50:31 INFO juju.downloader download.go:109 downloading from ch:amd64/focal/mlflow-server-466
2024-02-19T15:50:32.091Z [container-agent] 2024-02-19 15:50:32 INFO juju.downloader download.go:92 download complete ("ch:amd64/focal/mlflow-server-466")
2024-02-19T15:50:32.128Z [container-agent] 2024-02-19 15:50:32 INFO juju.downloader download.go:172 download verified ("ch:amd64/focal/mlflow-server-466")
2024-02-19T15:50:34.268Z [container-agent] 2024-02-19 15:50:34 INFO juju.worker.uniter resolver.go:165 found queued "upgrade-charm" hook
2024-02-19T15:50:35.085Z [container-agent] 2024-02-19 15:50:35 INFO juju-log Running legacy hooks/upgrade-charm.
2024-02-19T15:50:35.509Z [container-agent] 2024-02-19 15:50:35 WARNING juju-log 2 containers are present in metadata.yaml and refresh_event was not specified. Defaulting to update_status. Metrics IP may not be set in a timely fashion.
2024-02-19T15:50:41.132Z [pebble] Check "readiness" failure 1 (threshold 3): received non-20x status code 418
2024-02-19T15:50:45.225Z [container-agent] 2024-02-19 15:50:45 INFO juju-log Event <UpgradeCharmEvent via MlflowCharm/on/upgrade_charm[1]> stopped early with message: Error with default S3 artifact store - bucket not accessible or cannot be created.  Caught error: 'An error occurred (BucketAlreadyOwnedByYou) when calling the CreateBucket operation: Your previous request to create the named bucket succeeded and you already own it.
2024-02-19T15:50:45.336Z [container-agent] 2024-02-19 15:50:45 INFO juju-log HTTP Request: GET https://172.30.0.1/api/v1/namespaces/kubeflow/services/mlflow-server "HTTP/1.1 200 OK"
2024-02-19T15:50:45.420Z [container-agent] 2024-02-19 15:50:45 INFO juju-log HTTP Request: PATCH https://172.30.0.1/api/v1/namespaces/kubeflow/services/mlflow-server "HTTP/1.1 200 OK"
2024-02-19T15:50:45.457Z [container-agent] 2024-02-19 15:50:45 INFO juju-log Kubernetes service 'mlflow-server' patched successfully
2024-02-19T15:50:45.778Z [container-agent] 2024-02-19 15:50:45 INFO juju.worker.uniter.operation runhook.go:186 ran "upgrade-charm" hook (via hook dispatching script: dispatch)
2024-02-19T15:50:45.796Z [container-agent] 2024-02-19 15:50:45 INFO juju.worker.uniter resolver.go:165 found queued "config-changed" hook
2024-02-19T15:50:46.199Z [container-agent] 2024-02-19 15:50:46 WARNING juju-log 2 containers are present in metadata.yaml and refresh_event was not specified. Defaulting to update_status. Metrics IP may not be set in a timely fashion.
2024-02-19T15:50:51.131Z [pebble] Check "readiness" failure 2 (threshold 3): received non-20x status code 418
2024-02-19T15:50:56.283Z [container-agent] 2024-02-19 15:50:56 INFO juju-log Event <ConfigChangedEvent via MlflowCharm/on/config_changed[6]> stopped early with message: Error with default S3 artifact store - bucket not accessible or cannot be created.  Caught error: 'An error occurred (BucketAlreadyOwnedByYou) when calling the CreateBucket operation: Your previous request to create the named bucket succeeded and you already own it.
2024-02-19T15:50:56.403Z [container-agent] 2024-02-19 15:50:56 INFO juju-log HTTP Request: GET https://172.30.0.1/api/v1/namespaces/kubeflow/services/mlflow-server "HTTP/1.1 200 OK"
2024-02-19T15:50:56.499Z [container-agent] 2024-02-19 15:50:56 INFO juju-log HTTP Request: PATCH https://172.30.0.1/api/v1/namespaces/kubeflow/services/mlflow-server "HTTP/1.1 200 OK"
2024-02-19T15:50:56.538Z [container-agent] 2024-02-19 15:50:56 INFO juju-log Kubernetes service 'mlflow-server' patched successfully
2024-02-19T15:50:56.878Z [container-agent] 2024-02-19 15:50:56 INFO juju.worker.uniter.operation runhook.go:186 ran "config-changed" hook (via hook dispatching script: dispatch)
2024-02-19T15:50:56.898Z [container-agent] 2024-02-19 15:50:56 INFO juju.worker.uniter resolver.go:76 reboot detected; triggering implicit start hook to notify charm
2024-02-19T15:50:57.296Z [container-agent] 2024-02-19 15:50:57 INFO juju-log Running legacy hooks/start.
2024-02-19T15:50:57.707Z [container-agent] 2024-02-19 15:50:57 WARNING juju-log 2 containers are present in metadata.yaml and refresh_event was not specified. Defaulting to update_status. Metrics IP may not be set in a timely fashion.
2024-02-19T15:50:57.991Z [container-agent] 2024-02-19 15:50:57 INFO juju.worker.uniter.operation runhook.go:186 ran "start" hook (via hook dispatching script: dispatch)
2024-02-19T15:50:58.430Z [container-agent] 2024-02-19 15:50:58 WARNING juju-log 2 containers are present in metadata.yaml and refresh_event was not specified. Defaulting to update_status. Metrics IP may not be set in a timely fashion.
2024-02-19T15:51:08.107Z [container-agent] 2024-02-19 15:51:08 INFO juju-log Event <PebbleReadyEvent via MlflowCharm/on/mlflow_server_pebble_ready[16]> stopped early with message: Error with default S3 artifact store - bucket not accessible or cannot be created.  Caught error: 'An error occurred (BucketAlreadyOwnedByYou) when calling the CreateBucket operation: Your previous request to create the named bucket succeeded and you already own it.
2024-02-19T15:51:08.431Z [container-agent] 2024-02-19 15:51:08 INFO juju.worker.uniter.operation runhook.go:186 ran "mlflow-server-pebble-ready" hook (via hook dispatching script: dispatch)
2024-02-19T15:51:08.878Z [container-agent] 2024-02-19 15:51:08 WARNING juju-log 2 containers are present in metadata.yaml and refresh_event was not specified. Defaulting to update_status. Metrics IP may not be set in a timely fashion.
2024-02-19T15:51:09.150Z [container-agent] 2024-02-19 15:51:09 INFO juju.worker.uniter.operation runhook.go:186 ran "mlflow-prometheus-exporter-pebble-ready" hook (via hook dispatching script: dispatch)
2024-02-19T15:55:47.340Z [container-agent] 2024-02-19 15:55:47 WARNING juju-log 2 containers are present in metadata.yaml and refresh_event was not specified. Defaulting to update_status. Metrics IP may not be set in a timely fashion.
2024-02-19T15:55:47.469Z [container-agent] 2024-02-19 15:55:47 INFO juju-log HTTP Request: GET https://172.30.0.1/api/v1/namespaces/kubeflow/services/mlflow-server "HTTP/1.1 200 OK"
2024-02-19T15:55:47.556Z [container-agent] 2024-02-19 15:55:47 INFO juju-log HTTP Request: PATCH https://172.30.0.1/api/v1/namespaces/kubeflow/services/mlflow-server "HTTP/1.1 200 OK"
2024-02-19T15:55:47.596Z [container-agent] 2024-02-19 15:55:47 INFO juju-log Kubernetes service 'mlflow-server' patched successfully
2024-02-19T15:56:00.112Z [container-agent] 2024-02-19 15:56:00 INFO juju-log Event <UpdateStatusEvent via MlflowCharm/on/update_status[26]> stopped early with message: Error with default S3 artifact store - bucket not accessible or cannot be created.  Caught error: 'An error occurred (BucketAlreadyOwnedByYou) when calling the CreateBucket operation: Your previous request to create the named bucket succeeded and you already own it.
2024-02-19T15:56:00.617Z [container-agent] 2024-02-19 15:56:00 INFO juju.worker.uniter.operation runhook.go:186 ran "update-status" hook (via hook dispatching script: dispatch)
2024-02-19T16:00:45.865Z [container-agent] 2024-02-19 16:00:45 WARNING juju-log 2 containers are present in metadata.yaml and refresh_event was not specified. Defaulting to update_status. Metrics IP may not be set in a timely fashion.
2024-02-19T16:00:45.989Z [container-agent] 2024-02-19 16:00:45 INFO juju-log HTTP Request: GET https://172.30.0.1/api/v1/namespaces/kubeflow/services/mlflow-server "HTTP/1.1 200 OK"
2024-02-19T16:00:46.079Z [container-agent] 2024-02-19 16:00:46 INFO juju-log HTTP Request: PATCH https://172.30.0.1/api/v1/namespaces/kubeflow/services/mlflow-server "HTTP/1.1 200 OK"
2024-02-19T16:00:46.117Z [container-agent] 2024-02-19 16:00:46 INFO juju-log Kubernetes service 'mlflow-server' patched successfully
2024-02-19T16:00:55.328Z [container-agent] 2024-02-19 16:00:55 INFO juju-log Event <UpdateStatusEvent via MlflowCharm/on/update_status[31]> stopped early with message: Error with default S3 artifact store - bucket not accessible or cannot be created.  Caught error: 'An error occurred (BucketAlreadyOwnedByYou) when calling the CreateBucket operation: Your previous request to create the named bucket succeeded and you already own it.
2024-02-19T16:00:55.756Z [container-agent] 2024-02-19 16:00:55 INFO juju.worker.uniter.operation runhook.go:186 ran "update-status" hook (via hook dispatching script: dispatch)
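
The log above shows that the caught error is BucketAlreadyOwnedByYou from CreateBucket, which means the bucket already exists and belongs to the caller. One possible way to handle this is to treat that error code as success. The following is a minimal sketch, not the charm's actual code. It assumes a client with a boto3-style create_bucket(Bucket=...) method whose exceptions carry a botocore-style response dict. The names ensure_bucket and error_code are hypothetical.

```python
from typing import Any

# Error codes that mean "the bucket is already there and it is ours".
ALREADY_OWNED_CODES = {"BucketAlreadyOwnedByYou"}


def error_code(exc: Exception) -> str:
    """Return the S3 error code carried by a botocore-style exception, or ""."""
    response = getattr(exc, "response", None) or {}
    return response.get("Error", {}).get("Code", "")


def ensure_bucket(client: Any, bucket_name: str) -> bool:
    """Create `bucket_name` if needed.

    Returns True if this call created the bucket and False if it already
    existed under the same owner. Any other error is re-raised unchanged.
    """
    try:
        client.create_bucket(Bucket=bucket_name)
        return True
    except Exception as exc:
        # Assumption: the exception exposes `.response["Error"]["Code"]`.
        if error_code(exc) in ALREADY_OWNED_CODES:
            return False
        raise
```

Because only the already-owned code is swallowed, errors such as missing credentials or an unreachable endpoint would still surface and could still justify a Blocked status.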

Additional Context

No response

Barteus added the bug label Feb 19, 2024

Thank you for reporting your feedback to us!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-5351.

This message was autogenerated
